Tuesday, October 4, 2016

MySQL 8.0 and TokuDB

A stripped down version of TokuDB is running on MySQL 8.0.


mysql> select @@version\G;
*************************** 1. row ***************************
@@version: 8.0.0-dmr-debug
1 row in set (0.00 sec)

mysql> show create table test.t\G;
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> show engines\G;
*************************** 6. row ***************************
      Engine: TokuDB
     Support: YES
     Comment: Percona TokuDB Storage Engine with Fractal Tree(tm) Technology
Transactions: YES
          XA: YES
  Savepoints: YES

mysql> create table test.t (id int primary key) engine=tokudb;
Query OK, 0 rows affected (0.08 sec)

mysql> show create table test.t\G;
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `id` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=TokuDB DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Look at where the data files are:


$ ls -1 data/
auto.cnf
ib_buffer_pool
ibdata1
ib_logfile0
ib_logfile1
ibtmp1
log000000000001.tokulog29
mysql
mysqld_safe.pid
performance_sche_3.SDI
performance_schema
sebastian.err
sebastian.pid
sys
sys_4.SDI
tc.log
test
test_5.SDI
tokudb.directory
tokudb.environment
__tokudb_lock_dont_delete_me_data
__tokudb_lock_dont_delete_me_environment
__tokudb_lock_dont_delete_me_logs
__tokudb_lock_dont_delete_me_recovery
__tokudb_lock_dont_delete_me_temp
tokudb.rollback

$ ls -1 data/test
t_323.SDI  
t_main_10_2_1d.tokudb  
t_status_10_1_1d.tokudb

Friday, September 30, 2016

MySQL 8.0 and the thread sanitizer

MySQL 8.0 now supports the thread sanitizer.  This is good news, as the thread sanitizer gives MySQL developers another tool to help find bugs in the multi-threaded MySQL server.  What happens when we build MySQL 8.0 with the thread sanitizer enabled and try to run some basic MySQL tests?  Unfortunately, no MySQL tests run, since the MySQL bootstrap fails with lots of data races and other issues raised by the thread sanitizer.  When these issues are suppressed, some of the basic MySQL and InnoDB tests pass.  Some of these issues are real bugs and need to be investigated.

Both gcc 6.1 and clang 3.9 support the thread sanitizer.  Just add the 'WITH_TSAN=ON' cmake option when configuring and building MySQL 8.0, and the MySQL code will be compiled with the thread sanitizer.

The next step after building MySQL is to run some MySQL tests.  Unfortunately, MySQL tests need to bootstrap the MySQL data directory, and there are lots of thread sanitizer issues during the bootstrap that prevent it from succeeding.  This means that no MySQL tests can be run until these issues are fixed or suppressed.  MySQL 8.0 does not include a basic suppression file for the thread sanitizer as it does for the address sanitizer and for valgrind.  So, how many suppressions are needed to get a simple test to run?  The answer is about 4 or 5 suppressions (see below).  The problem is that the InnoDB suppression covers ANY data race in the InnoDB storage engine.  Since InnoDB is the primary storage engine for MySQL, this is not acceptable.  I hope that the InnoDB developers address this.

I used the following thread sanitizer suppressions when running the main and InnoDB MySQL test suites.  Anyone interested in using the thread sanitizer to find bugs in MySQL 8.0 can use these suppressions as a starting point in their investigation.  Good luck!

# ignore races in the pthread barrier implementation
race:pthread_barrier

# ignore possible locking deadlock
deadlock:plugin_thdvar_init

# ignore all races in the innodb storage engine.  this is overkill
race:innobase

# ignore races in the perf schema
race:^pfs
race:^PFS

# ignore race on charsets_dir
race:^charsets_dir$

# ignore race on THR_KEY_mysys_initialized. should be an atomic variable
race:^THR_KEY_mysys_initialized$

Tuesday, September 13, 2016

Races in the Fractal Tree Lock Manager

We recently found two races in the fractal tree lock manager that significantly affect transactional throughput for some applications that use a small number of concurrent transactions.  These races manifest as transactions unnecessarily waiting for an available lock.  Since the stall is resolved by a lock timeout and the lock timeout period can be very long, the throughput of the application can be extremely low.  We discuss these races and the changes to the lock manager that fix them.

The release before wait race involves two transactions that are contending for the same lock.   Here is the sequence of events for transactions A and B, and lock L that demonstrates the release before wait race:

    A acquires L
    B tries to acquire L, conflicts with A
    A releases L
    A retries pending lock requests, B's is not found
    B adds itself to the pending lock request set
    B waits for L even though L is free

The race occurs when a lock is released by transaction A after transaction B tried to acquire it but before transaction B has a chance to register its pending lock request.  There are several ways to fix this problem, but we want to optimize for the common situation of minimal lock conflicts, which is what the lock acquisition algorithm currently does.  Our solution to the release before wait race is for transaction B to retry its lock request after the request has been added to the pending lock set.

The retry while retry race involves three transactions that are contending for the same lock.  Here is the sequence of events for transactions A, B, and C, and lock L that demonstrates the retry while retry race:

    A acquires L
    B tries to acquire L, conflicts with A
    C tries to acquire L, conflicts with A
    A releases L
    A retries pending lock requests, grants L to B, gets suspended by the kernel before it finishes the lock retry loop
    B releases L
    B tries to retry pending lock requests, fails because lock retry is already busy
    A finishes lock retry
    A and B do something else, C is stuck waiting for L even though L is free

The race occurs because the current lock retry algorithm assumes that if some transaction is already running lock retry, then other transactions do not need to run it.  There is a chance that some pending lock requests will be skipped, but these lock requests will eventually time out.  For applications with small numbers of concurrent transactions, timeouts will frequently occur, and the application throughput will be very small.

The solution to the retry race is to use a group retry algorithm.  All threads run through the retry logic.  Sequence numbers are used to group retries so that one transaction can run the retry logic on behalf of several transactions.  This amortizes the lock retry cost over all of the retried transactions.
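The group retry scheme can be sketched as follows.  This is an illustrative Python model with invented names, not the real lock manager code: every transaction that wants a retry bumps a sequence number; a transaction only skips running the retry logic if a pass that started after its request has already completed, so no request can be silently dropped.

```python
import threading

class GroupRetry:
    """Sketch of grouped lock-retry: one pass covers all requests made
    before it started, amortizing the retry cost across transactions."""

    def __init__(self, retry_pending_requests):
        self.cond = threading.Condition()
        self.requested = 0    # sequence number of the latest retry request
        self.completed = 0    # highest sequence number covered by a finished pass
        self.running = False
        self.retry_pending_requests = retry_pending_requests

    def retry(self):
        with self.cond:
            my_seq = self.requested = self.requested + 1
            # Wait while a pass is running that may not cover this request.
            while self.running and self.completed < my_seq:
                self.cond.wait()
            if self.completed >= my_seq:
                return                   # a finished pass covered this request
            self.running = True
            group_seq = self.requested   # this pass covers all requests so far
        self.retry_pending_requests()    # run the retry logic outside the lock
        with self.cond:
            self.completed = group_seq
            self.running = False
            self.cond.notify_all()
```

Unlike the buggy "someone else is already retrying, so skip" scheme, a transaction here either runs a pass itself or waits until a pass that started after its request finishes, so every pending request is eventually retried.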

One of the applications affected by these lock manager races is MariaDB's optimistic parallel replication algorithm.  The number of concurrent transactions is limited by the number of parallel threads and can be relatively small, which makes the optimistic scheduler susceptible to the fractal tree lock manager races.

The optimistic scheduler looks outside of the group commit boundaries in the binary log to assign replication work to replication threads.  Sometimes, the parallel work has locking conflicts.  When this occurs, the conflicting transactions are aborted and the work schedule is changed to avoid the conflicts.  When the optimistic scheduler needs to abort the replication work, it needs a mechanism to kill any thread that may be waiting for a lock inside of a storage engine.  MariaDB calls the handlerton kill query method, which causes TokuDB to find the pending lock request associated with the client thread and kill it.  This allows the client thread to subsequently abort its transaction.



Thursday, May 5, 2016

My top author list for Planet MySQL

Who are the top individual authors of influential recent posts to planet MySQL?  The planet MySQL page includes a list of the top 20 authors as well as a list of the top 10 vendor blogs.  However, since posts to the vendor blogs make up at least 1/4 of all of the posts, and the authors of vendor blog posts are not included in the top author list, I decided to compute my own top author list.  I include the hidden authors from the vendor blogs when computing my top author list.

The first problem is to identify the hidden authors for posts from each vendor blog.  This requires that the author information be extracted from the individual posts, and this requires a specialized parser for each vendor blog to extract the author name from the document.

The second problem is to rank the authors using some criteria such as the number of posts in a given recent time range.  I could run a page rank algorithm if I had a map of the web, but I don't have a web map.  So I include the 'like' and 'dislike' counters maintained for each post by planet MySQL in the criteria.  In addition, I could use the comments on a post to indicate the post's interest level, but I did not process the comments.

I don't know what ranking algorithm planet MySQL uses (is the algorithm obvious?), so I use my own ranking algorithm.  I rank the authors by assigning a score to each post and adding up the scores by author.  I select posts within some recent time range, like the last 3 months.

My 'score' function is:

score(post) = (1 + A(like,dislike))*B(now,post_date)

A(like,dislike) = like_scale_factor*(abs(like) + abs(dislike))

B(now,post_date) = exp(-((now - post_date)*time_scale_factor))

Note that the constant '1' indicates the value of the post itself.  

The 'A' function computes the interest level of the post based on whether or not a reader took the time to poke the 'like' or 'dislike' button.  

The 'B' function is a negative exponential function that assigns precedence to recent posts.  Old posts will have a high score only if there is a lot of 'like' or 'dislike' counts.

It will be interesting to fiddle with the 'score' function, especially the scale factors.
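The 'score' function above can be sketched in a few lines of Python.  The scale factor values below are assumptions chosen for illustration, not the values used for the actual ranking:

```python
import math

# Assumed scale factors -- tune to taste; not the original post's values.
LIKE_SCALE_FACTOR = 0.5
TIME_SCALE_FACTOR = 1.0 / 30.0   # decays a post's score over roughly a month

def A(like, dislike):
    # Interest level: any button press counts, whether 'like' or 'dislike'.
    return LIKE_SCALE_FACTOR * (abs(like) + abs(dislike))

def B(now, post_date):
    # Negative exponential favoring recent posts; dates measured in days.
    return math.exp(-(now - post_date) * TIME_SCALE_FACTOR)

def score(post, now):
    # The constant 1 is the value of the post itself.
    return (1 + A(post["like"], post["dislike"])) * B(now, post["date"])
```

A brand-new post with no votes scores exactly 1; older posts decay toward 0 unless their vote counts are large.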

My top authors list for posts between Feb 5 2016 and May 5 2016 inclusive

Given the 'score' function described above and a time range for the last 3 months, my list of the top 25 individual authors on planet MySQL is:

+-------------------------+-------+
| author                  | score |
+-------------------------+-------+
| Joao Osorio             |    22 |
| Todd Farmer             |    20 |
| Valeriy Kravchuk        |    20 |
| Colin Charles           |    16 |
| Daniel Bartholomew      |    16 |
| Morgan Tocker           |    15 |
| Fahd Mirza              |    15 |
| Hrvoje Matijakovic      |    15 |
| Mark Callaghan          |    13 |
| Dave Avery              |    13 |
| Frederic Descamps       |    12 |
| Peter Zaitsev           |    11 |
| Shahriyar Rzayev        |    11 |
| Dave Stokes             |    10 |
| Valerie Parham-Thompson |    10 |
| Vadim Tkachenko         |    10 |
| Sveta Smirnova          |     9 |
| Giuseppe Maxia          |     8 |
| Marcelo Altmann         |     8 |
| Mario Beck              |     7 |
| Seppo Jaakola           |     7 |
| Dimitri Kravtchuk       |     7 |
| Shlomi Noach            |     6 |
| Rasmus Johansson        |     6 |
| Peter Gulutzan          |     6 |
+-------------------------+-------+

My list is similar to the planet MySQL top author list except that there are no team blogs in my list.  In addition, my list contains several authors that contribute heavily to the team blogs, since I extracted the authors from the posts to several team blogs including the 'MySQL Performance Blog', the 'Pythian Group' blog, the 'MariaDB' blogs, and several 'MySQL' blogs.  Note that since the 'Severalnines' blog posts do not contain individual author names, none of its posts are included in my list.

One can also compute the top authors for the last year or the last 10 years from the planet MySQL meta-data database using this post scoring process.  These results are not included in this post.

Posts by year

Let's change the topic.  One can compute interesting trends from the planet MySQL meta-data, like the number of posts to planet MySQL per year.  Obviously, there is some erroneous timestamp data (1970?) in the planet MySQL database.

+-----------------+----------+
| year(post_date) | count(*) |
+-----------------+----------+
|            1970 |        2 |
|            2001 |        2 |
|            2003 |        5 |
|            2004 |       67 |
|            2005 |     1409 |
|            2006 |     3489 |
|            2007 |     3715 |
|            2008 |     5744 |
|            2009 |     5052 |
|            2010 |     3192 |
|            2011 |     3137 |
|            2012 |     2987 |
|            2013 |     2695 |
|            2014 |     2308 |
|            2015 |     1686 |
|            2016 |      558 |
+-----------------+----------+




Wednesday, February 24, 2016

mysql_real_connect is not thread safe

The "mysql_real_connect" function is used by MySQL clients, including multi-threaded clients, to connect to a MySQL server.  The "mysql_real_connect" function is allowed to be run by concurrent threads without application level serialization.  Unfortunately, the thread sanitizer found a data race in the "mysql_real_connect" function when processing the character set directory option.  This data race means that the "mysql_real_connect" function is not thread safe.  The data race occurs in the MySQL client connection code because the character set directory option is stored in a global variable and is used without serialization by the client side connection code.  I suspect that the character set directory option is not typically used by most MySQL clients.  However, since the option exists, it should work correctly.

Here are some details of the bug.  When MySQL 5.7.11 is built with the thread sanitizer enabled and the mysqlslap test is run, the thread sanitizer reports a data race in the client side connection software.  The data race occurs in the "mysql_set_character_set_with_default_collation" function, which is called when the client is executing the "mysql_real_connect" function. The data race occurs because the character set directory option to the "mysql_real_connect" function is temporarily stored in the "charset_dir" global variable so that it can be used by some other internal functions. Since there is no serialization on this global variable, the thread sanitizer reports a data race. The effect of this bug is that the character set directory for one thread can be used by another thread erroneously.

A possible bug fix could serialize the "mysql_set_character_set_with_default_collation" function since it is not thread safe.  Alternatively, the charset directory could be passed as a function parameter rather than as a global variable to various internal functions.

The data race in the client side connection software can also be demonstrated when running a simplified mysqlslap test with helgrind.

The thread sanitizer and helgrind race detectors found a data race in the client side MySQL connection software.  These race detectors can be used to find data races in multi-threaded software, like the client and server side MySQL software.  Unfortunately, these race detectors are not integrated into the MySQL software.  It would be nice to identify data races in the MySQL software before MySQL is running in production.  I created MySQL feature request #80409 to support the thread sanitizer in the MySQL software.

Versions:
MySQL 5.7.11, MySQL 5.6.29 and probably prior versions
clang 3.9
Ubuntu 15.10