Friday, September 30, 2016

MySQL 8.0 and the thread sanitizer

MySQL 8.0 now supports the thread sanitizer.  This is good news, as the thread sanitizer gives MySQL developers another tool to help find bugs in the multi-threaded MySQL server.  What happens when we build MySQL 8.0 with the thread sanitizer enabled and try to run some basic MySQL tests?  Unfortunately, no MySQL tests run at all, since the MySQL bootstrap fails with lots of data races and other issues raised by the thread sanitizer.  When these issues are suppressed, some of the basic MySQL and InnoDB tests pass.  Some of these issues are real bugs and need to be investigated.

Both gcc 6.1 and clang 3.9 support the thread sanitizer.  Just add the 'WITH_TSAN=ON' cmake option when configuring and building MySQL 8.0, and the MySQL code will be compiled with the thread sanitizer.
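For example, an out-of-source build with the thread sanitizer enabled might look like this (the directory names are illustrative):

```shell
# Configure and build MySQL 8.0 with the thread sanitizer enabled.
# Paths and the -j value are examples; adjust for your machine.
mkdir build && cd build
cmake ../mysql-server -DWITH_TSAN=ON \
      -DCMAKE_C_COMPILER=gcc -DCMAKE_CXX_COMPILER=g++
make -j8
```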

The next step after building MySQL is to run some MySQL tests.  Unfortunately, MySQL tests need to bootstrap the MySQL data directory, and there are so many thread sanitizer issues during the bootstrap that it cannot succeed.  This means that no MySQL tests can be run until these issues are fixed or suppressed.  MySQL 8.0 does not include a basic suppression file for the thread sanitizer as it does for the address sanitizer and for valgrind.  So, how many suppressions are needed to get a simple test to run?  The answer is about 4 or 5 suppressions (see below).  The problem is that the InnoDB suppression covers ANY data race in the InnoDB storage engine.  Since InnoDB is the primary storage engine for MySQL, this is not acceptable.  I hope that the InnoDB developers address this.

I used the following thread sanitizer suppressions when running the main and InnoDB MySQL test suites.  Anyone interested in using the thread sanitizer to find bugs in MySQL 8.0 can use these suppressions as a starting point for their own investigation.  Good luck!

# ignore races in the pthread barrier implementation
race:pthread_barrier

# ignore possible locking deadlock
deadlock:plugin_thdvar_init

# ignore all races in the innodb storage engine.  this is overkill
race:innobase

# ignore races in the perf schema
race:^pfs
race:^PFS

# ignore race on charsets_dir
race:^charsets_dir$

# ignore race on THR_KEY_mysys_initialized. should be an atomic variable
race:^THR_KEY_mysys_initialized$
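To use these suppressions, save them to a file and point the thread sanitizer at it through the TSAN_OPTIONS environment variable before running the test suite (the file name here is just an example):

```shell
# Apply the suppressions and run the main test suite.
# tsan.supp is an example file name containing the suppressions above.
export TSAN_OPTIONS="suppressions=$PWD/tsan.supp"
cd mysql-test
./mysql-test-run.pl --suite=main
```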

Tuesday, September 13, 2016

Races in the Fractal Tree Lock Manager

We recently found two races in the fractal tree lock manager that significantly affect transactional throughput for some applications that use a small number of concurrent transactions.  These races manifest as transactions unnecessarily waiting for an available lock.  Since the stall is resolved by a lock timeout and the lock timeout period can be very long, the throughput of the application can be extremely low.  We discuss these races and the changes to the lock manager that fix them.

The release before wait race involves two transactions that are contending for the same lock.   Here is the sequence of events for transactions A and B, and lock L that demonstrates the release before wait race:

    A acquires L
    B tries to acquire L, conflicts with A
    A releases L
    A retries pending lock requests, B's is not found
    B adds itself to the pending lock request set
    B waits for L even though L is free

The race occurs when a lock is released by transaction A after transaction B tried to acquire it but before transaction B has a chance to register its pending lock request.  There are several ways to fix this problem, but we want to optimize for the common situation of minimal lock conflicts, which is what the lock acquisition algorithm currently does.  Our solution to the release before wait race is for transaction B to retry its lock request after its lock request has been added to the pending lock set.
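As a toy illustration, the ordering of the fix can be sketched in C++.  This single-threaded model and all of its names are illustrative only, not the actual fractal tree lock manager code:

```cpp
#include <optional>
#include <set>

// Toy single-threaded model of one lock L.  `owner` is the transaction
// holding L; `pending` is the set of registered pending lock requests.
struct LockManager {
    std::optional<int> owner;
    std::set<int> pending;

    // Fast-path acquisition: succeeds only if L is free (or already ours).
    bool try_acquire(int txn) {
        if (!owner || *owner == txn) { owner = txn; return true; }
        return false;
    }

    // Release L.  The real lock manager retries the registered pending
    // requests here; a request that has not yet been registered is missed,
    // which is exactly the release before wait race.
    void release(int txn) {
        if (owner == txn) owner.reset();
    }

    // The fix: register in the pending set FIRST, then retry once more,
    // so a release that slipped in between the failed acquire and the
    // registration cannot leave us waiting on a free lock.
    bool acquire_or_wait(int txn) {
        if (try_acquire(txn)) return true;   // common case: no conflict
        pending.insert(txn);
        if (try_acquire(txn)) {              // retry after registering
            pending.erase(txn);
            return true;                     // lock was freed in between
        }
        return false;                        // genuinely must wait now
    }
};
```

In the racy schedule listed above, B's first acquisition attempt conflicts with A, A releases L before B registers, and B's post-registration retry then finds L free instead of stalling until the lock timeout.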

The retry while retry race involves three transactions that are contending for the same lock.  Here is the sequence of events for transactions A, B, and C, and lock L that demonstrates the retry while retry race:

    A acquires L
    B tries to acquire L, conflicts with A
    C tries to acquire L, conflicts with A
    A releases L
    A retries pending lock requests, grants L to B, gets suspended by the kernel before it finishes the lock retry loop
    B releases L
    B tries to retry pending lock requests, fails because lock retry is already busy
    A finishes lock retry
    A and B do something else, C is stuck waiting for L even though L is free

The race occurs because the current lock retry algorithm assumes that if some transaction is already running lock retry, then my transaction does not also need to run it.  As a result, some pending lock requests can be skipped, and these lock requests will eventually time out.  For applications with a small number of concurrent transactions, timeouts occur frequently, and application throughput becomes very small.

The solution to the retry while retry race is to use a group retry algorithm.  All threads run through the retry logic.  Sequence numbers group the retries so that one transaction can run the retry logic on behalf of several transactions.  This amortizes the lock retry cost over all of the retried transactions.
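A toy sketch of the sequence number idea in C++ (the names and structure are illustrative, not TokuDB's actual code): every release bumps a request counter, and a finished retry pass records the counter value it covered, so a request that arrives while a pass is running is picked up by the next pass instead of being dropped.

```cpp
#include <cstdint>

// Toy model of group retry.  In the real, concurrent lock manager a caller
// whose sequence number is not yet covered would wait for the running pass
// to finish and then re-check; this single-threaded model only shows how
// sequence numbers let one pass cover a whole group of requests.
struct GroupRetry {
    uint64_t requested = 0;   // bumped on every lock release
    uint64_t completed = 0;   // highest request covered by a finished pass

    // Called on lock release; returns this release's sequence number.
    uint64_t note_release() { return ++requested; }

    // Run retry passes until this caller's request is covered.
    // Returns the number of passes this caller actually ran.
    int retry_all(uint64_t my_seq) {
        int passes = 0;
        while (completed < my_seq) {
            uint64_t target = requested;  // cover everything queued so far
            // ... retry all pending lock requests here ...
            completed = target;
            ++passes;
        }
        return passes;
    }
};
```

One pass run on behalf of the first release also covers the second release's request, which is how the retry cost is amortized across a group of transactions, and no request is ever silently skipped.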

One of the applications affected by these lock manager races is MariaDB's optimistic parallel replication algorithm.  The number of concurrent transactions is limited by the number of parallel replication threads and can be relatively small, which makes the optimistic scheduler susceptible to the fractal tree lock manager races.

The optimistic scheduler looks outside of the group commit boundaries in the binary log to assign replication work to replication threads.  Sometimes, the parallel work has locking conflicts.  When this occurs, the conflicting transactions are aborted and the work schedule is changed to avoid the conflicts.  When the optimistic scheduler needs to abort the replication work, it needs a mechanism to kill any thread that may be waiting for a lock inside a storage engine.  MariaDB calls the handlerton kill query method, which causes TokuDB to find the pending lock request associated with the client thread and kill it.  This allows the client thread to subsequently abort its transaction.