Tuesday, December 29, 2015

MySQL 5.7 InnoDB Versus the Thread Sanitizer

InnoDB and the Thread Sanitizer do not work well together in MySQL 5.7.  There are hundreds of possible data races reported by the Thread Sanitizer in the InnoDB code for the simplest case of initializing the MySQL server and shutting it down.  Luckily, most of these data races are benign.  The problem of finding an interesting data race in the InnoDB software is similar to finding evidence of a sub-atomic particle in a particle accelerator; many Thread Sanitizer events must be analyzed before something interesting is found.

One example of a benign data race is InnoDB's implementation of fuzzy counters.  A fuzzy counter is a counter whose value does not have to be precise.  Since fuzzy counters are not thread safe, their values are not precise.  However, the fuzzy counter implementation uses the trick of spreading the counter's value over multiple cache lines, so performance scales across multiple threads.  Unfortunately, fuzzy counters do not use C++ atomic operations, even with relaxed memory ordering semantics, so the Thread Sanitizer is not given enough information to ignore them.  Luckily, a single Thread Sanitizer suppression can ignore them.
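As a sketch of the idea (the names here are hypothetical, not InnoDB's actual code), a fuzzy counter can be built from cache-line-aligned slots updated with relaxed atomic operations, which keeps the cost near a plain increment while giving the Thread Sanitizer enough information to stay quiet:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Hypothetical sketch of a sharded ("fuzzy") counter: each slot sits on its
// own cache line, so concurrent threads rarely contend.  Relaxed atomics
// keep the cost close to a plain increment while letting a race detector
// see that the accesses are intentional.
struct alignas(64) CounterSlot {
    std::atomic<long> value{0};
};

class ShardedCounter {
public:
    explicit ShardedCounter(size_t shards) : slots_(shards) {}

    void add(size_t thread_id, long n) {
        // Each thread usually hits its own slot, avoiding cache-line pings.
        slots_[thread_id % slots_.size()].value.fetch_add(n, std::memory_order_relaxed);
    }

    long read() const {
        // The sum is "fuzzy": it is not a consistent snapshot while writers run.
        long sum = 0;
        for (const auto& s : slots_) sum += s.value.load(std::memory_order_relaxed);
        return sum;
    }

private:
    std::vector<CounterSlot> slots_;
};
```

Once all writer threads are joined, the sum of the slots is exact; while they run, it merely lags.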

There are a lot of other counters in the InnoDB software that are not fuzzy counters and are not thread safe.  These counters must also be ignored for now.

Can we find interesting issues in InnoDB with the Thread Sanitizer by identifying and ignoring the benign data races?  Here are two interesting data races.

First, the Thread Sanitizer reports a data race on the 'buf_page_cleaner_is_active' global variable that occurs when InnoDB is initialized.  The initialization thread creates the page cleaner coordinator thread.  These two threads use a notification pattern: the 'buf_page_cleaner_is_active' variable signals progress from the page cleaner coordinator thread back to the initialization thread.  Unfortunately, the racy notification pattern can be broken by aggressive compiler optimization and by memory ordering issues on some multi-core processors.  The MySQL server code should use C++ atomic variables or memory barriers to avoid the data races that occur when using this notification design pattern.  The MySQL server uses this pattern all over the place.  This is a potential bug with aggressive compiler optimizations when running on certain processors.
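Here is a minimal sketch of the notification pattern using a C++ atomic flag (the variable and function names are hypothetical, not MySQL's actual code); the release store in the worker happens before the acquire load that observes it, so neither the compiler nor the processor can break the handshake:

```cpp
#include <atomic>
#include <thread>

// Hypothetical sketch of the notification pattern with a C++ atomic flag.
// The release store in the worker "happens before" the acquire load that
// observes it in the waiter, so the handshake cannot be broken by compiler
// reordering or by weakly ordered processors.
std::atomic<bool> cleaner_is_active{false};

void page_cleaner_thread() {
    // ... thread initialization work would go here ...
    cleaner_is_active.store(true, std::memory_order_release);
}

bool wait_for_cleaner() {
    while (!cleaner_is_active.load(std::memory_order_acquire))
        std::this_thread::yield();   // a condition variable would avoid spinning
    return true;
}
```

A condition variable would be the better choice for a long wait; the spin loop keeps the sketch short.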

Next, the Thread Sanitizer reports a data race on the 'monitor_set_tbl' variable.  This data race is reported because the 'monitor_set_tbl' global variable is initialized by the InnoDB initialization thread AFTER its value is used by some InnoDB background threads that are created BEFORE the variable is initialized.  Since there is no synchronization between the threads WRT the 'monitor_set_tbl' variable,  the Thread Sanitizer reports the data race.  This is an obvious bug.

This blog discussed how the Thread Sanitizer found data races in the InnoDB storage engine.  There are a lot of benign data races reported due to performance counters in the storage engine.  Once these benign data races are filtered out, one can find interesting issues.  Two of these issues are discussed in this blog, but there are lots of additional issues to analyze.  So far, I have only used the Thread Sanitizer with the MySQL server in the simplest scenario of MySQL instance initialization.  It will be interesting to run some more complex workloads like sysbench.

I look forward to the new year.

Software configuration:
MySQL 5.7
Thread Sanitizer in clang 3.8
Ubuntu 14.04
CMake 2.8.12.2

MySQL configuration:
cmake -DDISABLE_SHARED=ON -DWITH_TSAN=ON -DMUTEXTYPE=sys ...


Monday, December 21, 2015

MySQL 5.7 Performance Schema Versus the Thread Sanitizer

This blog continues the investigation into possible data races in the MySQL server reported by the Thread Sanitizer.  The previous blog investigated several possible data races that were detected when the MySQL server is initialized.  This blog investigates three data races in the performance schema that are reported when the MySQL server is shut down.  The first data race is a racy increment of thread statistics.  The second data race is a problem that may leave the state of the performance schema memory allocator incorrect.  The final data race is in the lock free pinbox data structure.

Performance schema statistics aggregation

The Thread Sanitizer reported a data race when a thread exits and performance schema account statistics are aggregated.  The data race occurs when the 'PFS_account::m_disconnected_count' instance variable is incremented by a racy increment statement 'm_disconnected_count++'.  The increment is not thread safe as there is no associated lock or mutex.  Is a fuzzy counter good enough, or do we need an accurate counter?  Unfortunately, there is no comment or annotation regarding this variable, so the intention of the original programmer is unknown to me.  If an accurate counter is needed, then there are several techniques to implement an accurate counter that scales with multiple threads.
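If an accurate count is wanted, the simplest fix is an atomic increment.  This hypothetical rendering of the statistic (not the actual performance schema code) assumes relaxed ordering is acceptable for a pure counter:

```cpp
#include <atomic>

// Hypothetical rendering of the racy statistic: replacing a plain integer
// with std::atomic turns 'm_disconnected_count++' into an accurate,
// thread-safe increment.  fetch_add with relaxed ordering is sufficient
// for a counter that is only ever read as a statistic.
struct PFS_account_stats {
    std::atomic<unsigned long> m_disconnected_count{0};

    void on_disconnect() {
        m_disconnected_count.fetch_add(1, std::memory_order_relaxed);
    }
};
```

Under heavy contention the single shared cache line becomes the bottleneck, which is where the sharded counter techniques come in.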

Performance schema buffer container

The Thread Sanitizer reported a couple of data races in the performance schema's object allocator.  One of the objectives of the performance schema is that it is malloc free, at least in its fast path (see the comment in pfs.cc).  The performance schema allocates various objects from a pool of preallocated memory called a buffer container.  The buffer container's internal organization is an array of 'pages'.  Each 'page' is an array of objects.  The intent of this data structure is to avoid thread contention when objects are allocated and freed.

The data races are on the buffer container's 'page->m_full' and 'm_full' variables, which attempt to capture whether the container has free objects (m_full == false) or all of the objects are allocated (m_full == true).  The Thread Sanitizer reports data races in the deallocate function, which sets 'm_full = false' since it just freed an object.  This conflicts with the allocate function, which sets 'm_full = true' when there are no free objects.  Since the 'm_full' variables are racy, allocation of objects from the pool may fail when in fact there are free objects in the pool.
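A hypothetical sketch of the flag as a C++ atomic (not the actual buffer container code) removes the reported data race, but the flag remains advisory, so callers still need to scan other pages or retry; the per-page slot bookkeeping needs its own synchronization, which is omitted here:

```cpp
#include <atomic>

// Hypothetical sketch of the page-full flag as an atomic.  The data race
// goes away, but the flag is still only advisory: an allocator that sees
// m_full == true may miss an object freed a moment later, so callers must
// be prepared to try other pages or retry.  The free-slot bookkeeping
// below is a stand-in and would need its own synchronization.
struct Page {
    std::atomic<bool> m_full{false};
    int free_objects = 16;   // stand-in for the real per-slot bookkeeping

    bool allocate() {
        if (m_full.load(std::memory_order_relaxed)) return false;
        if (--free_objects == 0)
            m_full.store(true, std::memory_order_relaxed);
        return true;
    }

    void deallocate() {
        ++free_objects;
        m_full.store(false, std::memory_order_relaxed);  // page has room again
    }
};
```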

Performance schema pinbox

The Thread Sanitizer reported a data race in the performance schema's pinbox.  Is this a real race?  We need to understand the pinbox data structure and algorithm.  The pinbox is a non-blocking stack based allocator.  It is non-blocking since the performance schema is intended to be mutex free (see pfs.cc).

The data race is reported in the pinbox's stack push operation.  The push operation reads the current top of the stack, computes a new top of stack, and then uses a compare-and-swap to write the new top of stack.  If the compare-and-swap fails, then some other thread updated the stack top and the push operation in the current thread repeats.  The Thread Sanitizer reports a data race between the racy read of the top of stack and the write of the new top of stack.  A similar data race exists in the pinbox's stack pop operation since it also manipulates the top of stack.

The pinbox's stack push and pop operations look correct to me, so the report of the possible data race should be ignored.  It can be suppressed with a Thread Sanitizer suppression.  Alternatively, the top of stack could be implemented with a C++ atomic variable where the racy read is replaced with a relaxed memory order load.  This results in an equivalent algorithm and gives the Thread Sanitizer sufficient information to stop it from reporting a data race.  In fact, the Boost lock-free stack does this.
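Here is a minimal sketch of that alternative: a lock-free stack whose top is a C++ atomic (names are hypothetical; node reclamation and the ABA problem are out of scope for this sketch):

```cpp
#include <atomic>
#include <cstddef>

// Minimal sketch of a pinbox-style non-blocking stack with an atomic top.
// The relaxed load replaces the racy read the Thread Sanitizer complained
// about; compare_exchange_weak retries if another thread moved the top.
// (Safe node reclamation and the ABA problem are ignored here.)
struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> top{nullptr};

void push(Node* n) {
    n->next = top.load(std::memory_order_relaxed);
    while (!top.compare_exchange_weak(n->next, n,
                                      std::memory_order_release,
                                      std::memory_order_relaxed)) {
        // On failure, n->next was refreshed with the current top; just retry.
    }
}

Node* pop() {
    Node* n = top.load(std::memory_order_acquire);
    while (n && !top.compare_exchange_weak(n, n->next,
                                           std::memory_order_acquire,
                                           std::memory_order_acquire)) {
        // On failure, n was refreshed with the current top; retry.
    }
    return n;
}
```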

Conclusion

Since debugging data races in concurrent programs is hard, I am a big fan of using tools like the Thread Sanitizer to identify possible concurrency problems.   Some changes to the MySQL server code, like the use of C++ atomic variables, will be needed to make effective use of the Thread Sanitizer.

Versions

MySQL 5.7.10
Thread Sanitizer in Clang 3.8
Ubuntu 14.04

Monday, December 7, 2015

MySQL 5.7 Initialization Versus the Thread Sanitizer

The MySQL server is a large multi-threaded program primarily written in the C++ programming language.  Unfortunately, data races and deadlocks in concurrent C++ programs are common.  Fortunately, the Thread Sanitizer can be used to find data races and deadlocks.  MySQL 5.7 does not support the Thread Sanitizer, but it is easy to add it to the MySQL configuration using a simple build procedure.  When MySQL 5.7 is built with the Thread Sanitizer and database initialization is run, the Thread Sanitizer reports hundreds of possible data races and deadlocks in the MySQL server.

Database initialization occurs when the MySQL data directory is generated for a new MySQL instance.  This is done in MySQL 5.7 by running 'mysqld --initialize'.  Since the MySQL server (mysqld) is run to initialize the database and the MySQL server is compiled with the Thread Sanitizer, many possible data races and deadlocks are reported while the MySQL server runs.  This blog analyzes four of the issues: a problem with pthread barrier destruction, a potential lock deadlock, a data race on error reporting, and a data race on the global THD count variable.

The Thread Sanitizer reports a data race inside pthread barriers when a pthread barrier is destroyed.  MySQL 5.7 only has one use of pthread barriers which is to synchronize the main thread with the initialization of a background timer notification thread.  IMO, the Thread Sanitizer reports this race because many pthread barrier implementations have bugs WRT the barrier wait and barrier destroy operations.  MySQL can either choose to not use pthread barriers or suppress this error since the MySQL code using pthread barriers is correct.  The pthread barriers blog discusses this issue in more detail.

The Thread Sanitizer reports a possible deadlock involving three locks.  This problem was also detected by helgrind and reported in MySQL bug 77872.  Variations of this three-way locking deadlock have been detected in the field and reported to MySQL.  The Thread Sanitizer is faster than helgrind, and it reports all of the thread stacks involved in the deadlock while helgrind does not.  These two Thread Sanitizer advantages should make debugging with the Thread Sanitizer feasible.

The Thread Sanitizer reports a data race on the log_error_verbosity global variable between the bootstrap and signal handling threads.  These two threads run concurrently.  The bootstrap thread does a racy write of the log_error_verbosity variable while the signal handling thread does a racy read of the log_error_verbosity variable.  The Thread Sanitizer reports a data race since it can not compute a happens before relationship WRT the writes and reads of the log_error_verbosity variable done by these two threads.  This looks like a bug to me.

The Thread Sanitizer reports a data race on the global thd count when shutting down the MySQL server.  The MySQL server does a racy read of the global_thd_count variable without holding a lock while most of the other reads and writes of the global_thd_count variable are done while holding a lock.  Since the MySQL code states that the global_thd_count variable is racy, this racy read is by design.  If MySQL uses a relaxed memory atomic variable to store the global_thd_count, then the Thread Sanitizer would not consider this a data race.  Otherwise, a Thread Sanitizer suppression can be used to ignore this issue.  It would be nice to annotate racy variables in the code rather than in a comment on the code or in a data race suppression rule.

In conclusion, a configuration option that supports MySQL builds with the Thread Sanitizer was added to MySQL 5.7.  Hundreds of possible data races and deadlocks were reported by the Thread Sanitizer when running MySQL initialization.  This blog only examined four of the issues that illustrate the value of the tool.  The next blogs will look into the Thread Sanitizer issues WRT the performance schema and the InnoDB storage engine.

Tools

MySQL 5.7.9
Thread Sanitizer in Clang 3.8
Ubuntu 14.04

Wednesday, November 11, 2015

Uninitialized data problem in the LZMA compressor used by TokuFT

The LZMA algorithm implemented by the xz compression software package is one of the compression algorithms used by TokuDB for MySQL and TokuMX for MongoDB.  Unfortunately, valgrind's memcheck reports an uninitialized variable problem in the xz software encoder when running applications, including this simple test case.  The problem is benign with respect to correct functionality.  However, the problem is interesting with respect to how the gcc compiler generates code for logical AND operations in C language programs.   The summary of the problem is that valgrind reports a conditional jump using uninitialized data because the xz software encoder did not initialize all of its variables and gcc reordered the evaluation of the operands in a logical AND operator.

The problem in lz_encoder.c line 226 can be reduced to the following code.  The LZMA encoder uses a match finder data structure when compressing data.  When a match finder structure is allocated, the match finder constructor initializes the 'mf->buffer' variable but does not initialize the 'mf->size' variable.  So, these two variables have the following values after construction.

    mf->buffer = NULL
    mf->size is not initialized so its value is undefined

Later, when the LZMA encoder is used, it sometimes needs to allocate a new 'mf->buffer'.  It first reads the size of the buffer.

    old_size = mf->size

After this, 'old_size' is undefined since 'mf->size' is undefined.  A new size for the buffer is computed using some other values which happen to be initialized.

    mf->size = compute new size not using the old_size variable

Valgrind's memcheck reports the conditional jump using uninitialized data error when executing the following if statement:

    if (mf->buffer != NULL && old_size != mf->size) {
        do something
    }

The gcc compiler generates code that evaluates the 'old_size != mf->size' expression before the 'mf->buffer != NULL' expression.  This was verified by examining the instructions generated by the compiler.  Since the 'old_size' variable has an undefined value, the result of the comparison is undefined, and memcheck reports it.  Luckily, since 'mf->buffer != NULL' is false, the if statement condition is false regardless of the undefined 'old_size' value.
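A hypothetical reduction of the pattern, with the fix applied (initializing 'mf->size' in the constructor), shows why the evaluation order no longer matters: both operands of the && now read only initialized memory.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical reduction of the lz_encoder.c pattern.  The committed fix
// amounts to initializing mf->size in the constructor, so the
// 'old_size != mf->size' operand never reads an undefined value, no matter
// which order the compiler evaluates the && operands in.
struct MatchFinder {
    uint8_t* buffer;
    size_t size;
};

void mf_construct(MatchFinder* mf) {
    mf->buffer = nullptr;
    mf->size = 0;   // the fix: this field was previously left uninitialized
}

bool needs_realloc(MatchFinder* mf, size_t new_size) {
    size_t old_size = mf->size;
    mf->size = new_size;
    // Safe in either evaluation order now: both operands are well defined.
    return mf->buffer != nullptr && old_size != mf->size;
}
```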

The C language definition says operands for a logical AND operator are evaluated left to right.  See section 6.5.13.  In addition, if the leftmost operand evaluates to 0, then the operands to its right are not evaluated.  This is known as short circuit evaluation.  The C language definition also says that a sequence point exists between the left and right operands in a logical AND operator.  See section 5.1.2.3.  The sequence point implies that all values and side effects of the left operand are computed before all values and side effects of the right operand.  Is the result of the evaluation of an operand a side effect?

The gcc compiler seems to be violating the left to right execution order.  Is that OK?  Since there are no side effects in the 'mf->buffer != NULL' and 'old_size != mf->size' operands, the evaluation order does not matter.  Since the execution order is only switched when using aggressive compiler optimization, the compiler probably decided that switching the evaluation order could lead to faster execution.

Computing with undefined values is generally dangerous since the program may compute the wrong result.  Luckily, this is not the case in the xz encoder since the undefined value results in the same top level behavior.  However, if one wants to use a memory checker like valgrind's memcheck or the memory sanitizer to verify a program, one should fix bugs related to undefined values.  The fix for the undefined data in the xz encoder has been committed to the xz github repository.  As you can see, it merely initializes an additional variable in the match finder constructor.

Versions

Ubuntu 14.04
gcc 4.9.3
xz util v5.2.2


Monday, October 26, 2015

MySQL 5.7 Versus the Address Sanitizer

MySQL 5.7 supports the Address Sanitizer, which checks for several memory related software errors including memory leaks.  It is really nice to see support for the Address Sanitizer built into MySQL.  Unfortunately, when running the mysql tests included in MySQL 5.7.9, the Address Sanitizer reports several memory leaks, which causes some of the tests to fail.  Here are some of the memory leaks in the MySQL 5.7.9 software found by the Address Sanitizer.

Memory leaks in 'mysqlbinlog'

Several of the 'mysqlbinlog' tests fail due to memory leaks in the 'mysqlbinlog' client program.  Here are the details from the 'main.mysqlbinlog' test.  It appears the binlog event objects are not always deleted after being created by the 'dump_remote_log_entries' function.

See MySQL bugs 78966 and 78223 for status.

Memory leaks in 'mysqlpump'

All of the 'mysqlpump' tests fail due to memory leaks in the 'mysqlpump' client program.  Here are the details from the 'main.mysqlpump_basic' test.  The first memory leak occurs in the 'quote_name' function, which gets an object pointer and then just discards it.  Other memory leaks are caused by a similar disregard for object management.  Unfortunately, C++ requires the programmer to manage memory.  Use of the Address Sanitizer when developing and testing new C++ software like 'mysqlpump' will quickly identify the memory leaks.
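The general pattern and its fix can be illustrated with a hypothetical quote function (this is not the actual 'mysqlpump' code): the leaky version receives a heap object and never frees it, while an owning smart pointer (or a plain stack object) cannot leak:

```cpp
#include <memory>
#include <string>

// Hypothetical illustration of the leak pattern: a function obtains a
// heap-allocated object, uses its contents, and returns without freeing it.
std::string quote_name_leaky(const std::string& name) {
    std::string* quoted = new std::string("`" + name + "`");
    return *quoted;   // 'quoted' is never deleted: this is the leak
}

// The same logic with std::unique_ptr cannot leak: the heap object is
// freed automatically when 'quoted' goes out of scope.
std::string quote_name_fixed(const std::string& name) {
    auto quoted = std::make_unique<std::string>("`" + name + "`");
    return *quoted;
}
```

In practice a plain stack `std::string` would be even simpler; the smart pointer version is shown because it maps directly onto code that must keep heap allocation.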

See MySQL bugs 78965 and 78224 for status.

Memory leaks in InnoDB

Several InnoDB tests fail due to memory leaks in the InnoDB storage engine when running the mysqld server.  Here are the details from the 'log_file_name' test.  The memory leaks occur because InnoDB calls 'exit' from its initialization function rather than passing an error back to the mysqld server code.  Since InnoDB can abruptly terminate mysqld, this sets a precedent that allows any storage engine or plugin to do the same when it encounters a problem.  IMO, a plugin should not be allowed to terminate the mysql server, especially for a 'simple' case of an error during plugin initialization.

Conclusions

Since it is really easy to leak memory in C++ programs, C++ developers should use a memory verification tool like valgrind's memcheck or the Address Sanitizer to find memory leaks.  Since support for the Address Sanitizer is built into MySQL, it should be used when hacking on MySQL software.  Furthermore, since the execution cost of the Address Sanitizer is reasonable (IMO), it could be used in an automated MySQL test system.

I have only run the 'main' and 'innodb' MySQL test suites with the Address Sanitizer.  There are plenty of additional test suites that are part of the MySQL software that should be run with the Address Sanitizer.

Tools

MySQL 5.7.9
Address Sanitizer built into Clang 3.8
Ubuntu 14.04

Friday, October 2, 2015

TokuBackup Versus the Sanitizers

Percona recently open sourced TokuBackup, a library that may be used to take a snapshot of a set of files while these files are being changed by a running application like MySQL or MongoDB.  Making TokuBackup open source allows a larger set of users to improve and extend the software.  This blog describes how the Address Sanitizer and Thread Sanitizer found bugs in the TokuBackup library.

The TokuBackup library is used by Percona Server for MySQL and Percona Server for MongoDB to create a snapshot of the fractal tree data files and log files.  The basic idea of TokuBackup is to intercept the system calls that change the state of a set of directories and their files while a copy is made of these same directories.  The end result is a snapshot of the database files made while the database is still running.  Please see the Hot Backup blog for a nice description of the algorithm.

The TokuBackup repository includes a set of glass box and black box tests that can be used to test TokuBackup independently of MySQL or MongoDB.   Let's apply some of my favorite test tools to TokuBackup and its glass box and black box tests to see if any bugs are found.

TokuBackup Code Coverage:

After compiling TokuBackup with code coverage and running all of the black box and glass box tests, gcov finds that about 89% of the TokuBackup source lines are executed. This code coverage data can be used to identify test weaknesses.  Here is one example.

The gcov tool shows that the truncate function is not executed by the TokuBackup black box and glass box tests.  This is a problem since the fractal tree software uses it.  Since there is no test coverage of the truncate function with the included tests, no claims that truncate actually works correctly with TokuBackup can be made.

Since TokuBackup is open source, we can see which file related system calls are intercepted and which are omitted.  TokuBackup does not intercept the Linux kernel's asynchronous I/O system calls.  Because of this omission, TokuBackup cannot be used to back up running InnoDB databases that use native async I/O, since TokuBackup does not capture the file changes made by the asynchronous I/O calls.  IMO, this may be a useful and interesting feature to add to the TokuBackup library.

TokuBackup versus the Address Sanitizer:

The Address Sanitizer is a tool built into clang that finds memory related bugs at run time.   Here is a comparison of the capabilities of the Address Sanitizer and valgrind's memcheck tool.  One of the distinguishing features of the Address Sanitizer compared to memcheck is the order of magnitude speedup in execution time.

After compiling TokuBackup with the Address Sanitizer enabled and running all of the black box and glass box tests, the Address Sanitizer finds a memory leak when running the many_directories tests.  The bug fix is obvious once the leak is known.

The same memory leak is also found when running the many_directories test with valgrind's memcheck tool which is built into the TokuBackup test infrastructure.  It appears that these tests are not being run, so the bugs are not being found.

TokuBackup versus the Thread Sanitizer:

The Thread Sanitizer is a tool built into clang that finds data races in multi-threaded programs.    Since TokuBackup runs inside of the MySQL server and the MongoDB server, and these servers are heavily multi-threaded programs, it makes sense to use the Thread Sanitizer to find data races in TokuBackup.

The Thread Sanitizer reported over 80 issues when running the TokuBackup glass box and black box tests.  These issues can be divided into four categories.

Data races in the tests cases:
A common design pattern used by the TokuBackup tests is to launch some work on a background thread and synchronize the state of the background thread with the main thread through a global variable.  Unlocked shared global variables can be racy.  One can use C++ atomic variables for this design pattern so that the coupling between threads is explicit and not racy.

Data races in glass box test infrastructure:
The glass box test infrastructure incorporates test code into TokuBackup so that various concurrent execution scenarios can be sequenced.  The TokuBackup glass box test infrastructure is racy.  One can use Thread Sanitizer suppressions for all of the glass box infrastructure to suppress data races caused by the glass box infrastructure.  Similarly, TokuBackup uses helgrind suppressions for the glass box infrastructure so that the glass box races are not reported by helgrind.

Data races on error reporting:
Errors that occur while a TokuBackup is running cause the error state of the backup to be set.  The error state is racy since it is a set of shared heap variables that are unprotected by a mutex.  TokuBackup includes helgrind annotations so that data races on the error state are automatically suppressed.  IMO, this just makes the TokuBackup code more complicated and harder to understand.  A better solution is to use a mutex to ensure that the error state manipulations are not racy.
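A minimal sketch of that solution, with hypothetical names: all reads and writes of the error state go through a mutex, so no suppression annotations are needed and the first error reported wins:

```cpp
#include <mutex>
#include <string>

// Hypothetical sketch of mutex-protected backup error state.  Every access
// takes the lock, so concurrent readers and writers are ordered and no
// race-suppression annotations are required.
class BackupErrorState {
public:
    void set_error(int code, const std::string& msg) {
        std::lock_guard<std::mutex> guard(lock_);
        if (code_ == 0) {       // keep the first error reported
            code_ = code;
            message_ = msg;
        }
    }

    int error_code() {
        std::lock_guard<std::mutex> guard(lock_);
        return code_;
    }

private:
    std::mutex lock_;
    int code_ = 0;
    std::string message_;
};
```

The error path is cold, so the cost of the mutex is irrelevant here.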

Use after free memory violation:
The Thread Sanitizer reported a heap use after free violation in the TokuBackup copier.  A use after free violation is a read of memory after it has been freed.   The best case outcome from this error is that the program crashes.  Unforunately, the base case will only occur if the memory that has been freed is no longer mapped into the address space of the running program.   It is more likely that the address is still mapped, so the read will return random data.  The worst case is that the random data will result in the broken backup.

The use after free bug is due to improper management of the 'todo' work stack by concurrent threads.  The 'todo' work stack is a C++ vector protected by a mutex (since C++ vectors are not thread safe).  The 'todo' work stack contains the set of all files that need to be copied.  The use after free violation implies that the work stack no longer contained valid file name pointers.  This implies that the backup could be corrupted.

The bug occurs because the copier reads the top of the 'todo' stack, copies the files, and then pops the stack.   While the copier is copying the files, a file rename operation could occur on another thread.  The file rename operation pushes a new name onto the 'todo' stack.  You can see that the copier read of the 'todo' stack and the copier pop of the stack assume that the 'todo' stack has not changed in between.  However, the rename operation did in fact change the stack.  The copier's 'todo' data structure and algorithm are defeated.
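One possible fix, sketched with hypothetical names (this is not necessarily what the tsan-2 branch does): make the read and the pop a single critical section, so a concurrent rename cannot slip in between, and have the copier work on its own private copy of the file name:

```cpp
#include <mutex>
#include <string>
#include <vector>

// Hypothetical sketch of one fix: take the entry OFF the 'todo' stack in
// the same critical section that reads it.  A rename pushing a new name
// can then never sit between the copier's read and its pop, and the copier
// owns a private copy of the name while it copies the file.
std::mutex todo_lock;
std::vector<std::string> todo;

bool take_next_file(std::string* out) {
    std::lock_guard<std::mutex> guard(todo_lock);
    if (todo.empty()) return false;
    *out = todo.back();   // copy the name out while holding the lock
    todo.pop_back();      // read and pop are now one atomic step
    return true;
}

void on_rename(const std::string& new_name) {
    std::lock_guard<std::mutex> guard(todo_lock);
    todo.push_back(new_name);
}
```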

The tsan-2 branch of TokuBackup includes some possible fixes for the above issues.

Summary:

The Sanitizers found bugs in TokuBackup, a software library that has been shipping for a few years.  IMO, this shows the value in running existing tests with the Sanitizers to find bugs.

There are a lot of racy variables in TokuBackup that use helgrind annotations to suppress helgrind race reports.  One can use Thread Sanitizer suppressions to suppress benign data races.  In an ideal world, one should use atomic variables, as they let the programmer express shared variables that can be manipulated safely by concurrent threads and allow a race detector like the Thread Sanitizer to work effectively.

Versions:

Ubuntu 14.04
clang 3.8
gcc 4.9
TSAN branch of TokuBackup

Thursday, September 10, 2015

MySQL 5.7 and Barriers

When MySQL 5.7.8 is built and run with the Thread Sanitizer, one of the first issues reported is a data race with MySQL's use of pthread barrier wait and pthread barrier destroy.   The data race report is:

WARNING: ThreadSanitizer: data race (pid=30104)
  Write of size 1 at 0x7ffd33054298 by main thread:
    #0 pthread_barrier_destroy /home/rfp/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1236 (mysqld+0x000000d5a71b)
    #1 start_helper_thread /home/rfp/my578.tsan/mysql-server/mysys/posix_timers.c:124:3 (mysqld+0x0000016f0e82)
    #2 my_timer_initialize /home/rfp/my578.tsan/mysql-server/mysys/posix_timers.c:158:7 (mysqld+0x0000016f0da0)
    #3 init_server_components() /home/rfp/my578.tsan/mysql-server/sql/mysqld.cc:3785:7 (mysqld+0x000000ddba75)
    #4 mysqld_main(int, char**) /home/rfp/my578.tsan/mysql-server/sql/mysqld.cc:4645:7 (mysqld+0x000000dda276)
    #5 main /home/rfp/my578.tsan/mysql-server/sql/main.cc:25:10 (mysqld+0x000000dd4b9e)

  Previous read of size 1 at 0x7ffd33054298 by thread T1:
    #0 pthread_barrier_wait /home/rfp/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1243 (mysqld+0x000000d69b4e)
    #1 timer_notify_thread_func /home/rfp/my578.tsan/mysql-server/mysys/posix_timers.c:73:3 (mysqld+0x0000016f11a9)

  Location is stack of main thread.

  Thread T1 (tid=30106, running) created by main thread at:
    #0 pthread_create /home/rfp/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:848 (mysqld+0x000000d732d3)
    #1 my_thread_create /home/rfp/my578.tsan/mysql-server/mysys/my_thread.c:92:10 (mysqld+0x0000016f0c60)
    #2 spawn_thread_noop /home/rfp/my578.tsan/mysql-server/mysys/psi_noop.c:188:10 (mysqld+0x0000016efda0)
    #3 inline_mysql_thread_create /home/rfp/my578.tsan/mysql-server/include/mysql/psi/mysql_thread.h:1296:11 (mysqld+0x0000016f1112)
    #4 start_helper_thread /home/rfp/my578.tsan/mysql-server/mysys/posix_timers.c:113:7 (mysqld+0x0000016f0e38)
    #5 my_timer_initialize /home/rfp/my578.tsan/mysql-server/mysys/posix_timers.c:158:7 (mysqld+0x0000016f0da0)
    #6 init_server_components() /home/rfp/my578.tsan/mysql-server/sql/mysqld.cc:3785:7 (mysqld+0x000000ddba75)
    #7 mysqld_main(int, char**) /home/rfp/my578.tsan/mysql-server/sql/mysqld.cc:4645:7 (mysqld+0x000000dda276)
    #8 main /home/rfp/my578.tsan/mysql-server/sql/main.cc:25:10 (mysqld+0x000000dd4b9e)

SUMMARY: ThreadSanitizer: data race /home/rfp/my578.tsan/mysql-server/mysys/posix_timers.c:124:3 in start_helper_thread

The MySQL timer initialization code uses a pthread barrier to synchronize the main thread with a timer notification thread that the main thread creates.  The design pattern is:

main:                                      timer_notify_thread_func:
barrier_init
pthread_create timer_notify_thread_func
do stuff                                   do other stuff
barrier_wait                               barrier_wait
barrier_destroy

The Thread Sanitizer reports a potential data race between a barrier wait and the barrier destroy.   IMO, there is no data race because a barrier wait creates a happens before relationship by definition of how barriers work.  No thread can pass the barrier until all threads enter the barrier.  If there is a data race within the barrier, then the implementation is flawed.  

The Thread Sanitizer is being pessimistic about pthread barrier implementations, which have historically been flawed, so it reports a potential error.  One can either (1) not use the above design pattern, or (2) assume that the barrier implementation actually works and suppress this race report.  Given the recent change to the glibc barrier implementation, I would avoid using them.
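For option (1), a mutex and condition variable handshake provides the same one-shot synchronization without the destroy-while-waiting hazard of a barrier; a minimal sketch with hypothetical names:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical replacement for the one-shot pthread barrier: a mutex and
// condition variable handshake.  There is no destroy-while-waiting hazard;
// the main thread simply waits until the helper signals that it is ready.
std::mutex m;
std::condition_variable cv;
bool helper_ready = false;

void helper_thread() {
    // ... helper initialization (e.g. timer setup) would go here ...
    {
        std::lock_guard<std::mutex> guard(m);
        helper_ready = true;
    }
    cv.notify_one();
}

void wait_for_helper() {
    std::unique_lock<std::mutex> guard(m);
    cv.wait(guard, [] { return helper_ready; });
}
```

Unlike a barrier, the flag and condition variable can safely outlive the handshake and be reused.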

Tools

clang 3.8
Ubuntu 14.04
MySQL 5.7.8


Thursday, September 3, 2015

A Monster Test Crashes the Thread Sanitizer

One of the TokuFT (now PerconaFT) tests crashed the Thread Sanitizer that is included in clang 3.5.  The crash occurred at run time, not at compile time.  I instrumented TokuFT to trace all of the mutex initialization, lock, and unlock functions called by the failing test.  This trace was used to create a monster test case with 12,258 mutexes and over 445K mutex operations.  Fortunately, this crash does not occur with the latest clang 3.8 (thanks Dmitry for investigating this), so it is time to upgrade to a version of clang with this bug fixed.

In addition, I ran into another Thread Sanitizer crash in its deadlock detector and a workaround was quickly produced.  Thanks Kostya.

Lessons remembered: check if the problem was reported previously, and always try the latest and greatest version of the software.   The problem could already be fixed.

Monday, August 24, 2015

The Address Sanitizer Finds Bugs in TokuFT

This blog describes how the Address Sanitizer found bugs in the TokuFT (now PerconaFT) storage library.  TokuFT is the storage library used by the TokuDB for MySQL and TokuMX for MongoDB products.  TokuFT is currently tested with valgrind's memcheck tool.  However, memcheck is very slow compared to native execution, so memcheck is not always used for large tests.  This leads to missed bugs.

The Address Sanitizer is a memory error detector which detects memory leaks, out of bounds memory access, and several other types of memory related bugs.  Recent versions of the clang and gcc compilers have integrated the Address Sanitizer into the compiler.  The Address Sanitizer claims an order of magnitude increase in speed compared to valgrind's memcheck.  This is interesting since it opens up execution scenarios that are problematic with valgrind given the execution time expansion.

The Address Sanitizer found bugs in the TokuFT library when running existing tests.  These bugs were not previously found because the valgrind tools were deemed too slow and were not used for these tests.

TokuFT test failures detected with the address sanitizer

TokuFT has been developed and tested for a while.   It is always interesting to find new bugs in existing code by using new tools like the Address Sanitizer.

Memory leak in the bnc-insert-test.  This bug is found when running the basic TokuFT regression tests with the Address Sanitizer compiled into TokuFT.  The 'bnc-insert-test' was not run with valgrind for some reason (probably execution time), so the bug was not found earlier.

Memory leak in the context status initialization.   This bug is found by the TokuFT stress tests with the Address Sanitizer compiled into TokuFT.  The bug was not previously found because valgrind is not used by the stress tests.

Future work

Cmake file hackery.   IMO, an option should be added to the TokuFT cmake files to support the Address Sanitizer in addition to valgrind's memcheck, helgrind, and DRD tools.  Since these tools continue to improve, TokuFT should continue to support all of them.

Stress tests.  The TokuFT stress tests (and perhaps other tests) should be run with the Address Sanitizer enabled since the execution speed is claimed to only be 2x slower than native compiled code execution.

Address Sanitizer and MySQL.  I did not realize until today that MySQL 5.6 (and 5.7) include the 'WITH_ASAN' option that compiles MySQL with the Address Sanitizer enabled.  This makes testing MySQL with memory verification feasible for large tests.  I assume that since this option has been available in MySQL for a while, it must be part of the MySQL test plan.  Is it?

Tool versions

Ubuntu 14.04
Clang 3.5

Build and test

# setup the compiler options to build with the address sanitizer
export CC=clang-3.5
export CXX=clang++-3.5
export CXXFLAGS=-fsanitize=address
export ASAN_OPTIONS=detect_odr_violation=0

# get the FT source
git clone git@github.com:percona/perconaft

# run the basic tests
mkdir perconaft-asan && cd perconaft-asan
cmake -DCMAKE_BUILD_TYPE=Debug ../perconaft
ctest -j8 --verbose

# run the stress tests as described in "Directions on how to run the FT stress tests"

Thursday, August 13, 2015

Races in the TokuFT Race Detector

TokuFT (now called PerconaFT) is the write optimized storage component used by TokuDB for MySQL and TokuMX for MongoDB.  Since TokuFT is its own component, TokuFT can be tested independently of TokuDB and TokuMX.  Some of the TokuFT tests use valgrind's memcheck, helgrind, and DRD tools to identify bugs.

Helgrind and DRD find data races in multi-threaded programs at runtime rather than at compile time.  In my experience with these tools, data races are not predictable and it sometimes takes many test runs of the same program to find a problem.  So, I ran the group commit count test in a loop with helgrind until it failed.

One of the data races that helgrind reports is within TokuFT's lock manager. The purpose of the lock manager is to allow concurrent transactions to execute correctly.   Since a data race in the lock manager could break concurrent transactions,  the problem should be investigated and resolved.

The lock manager data race reported by helgrind implies that two concurrent threads are updating the transaction identifier for the single transaction optimization (the 'sto_txnid').  A review of the code shows that this is impossible since the writes of the transaction identifier ('sto_txnid') in the 'sto_begin' and 'sto_end' functions occur when holding a mutex on the lock tree.  The helgrind report shows that the same mutex is being held when the data race is reported.  So, what is going on?

It turns out that there is a code path in the lock manager that does an unlocked racy read of the 'sto_txnid' that occurs when acquired locks are released.  Helgrind would normally report a read race on an unlocked racy read.  However, the TokuFT developers decided that a racy read is appropriate to presumably improve performance, so they attempted to inform helgrind and DRD to ignore the read race in this function. As we will see, the method used to ignore the read race causes helgrind to report an erroneous write data race, which caused me to waste a lot of time debugging this problem.

Hopefully, the following sequence of tests will demonstrate the problem without any knowledge of or code from TokuFT.

Spooky races

The spooky race program demonstrates how helgrind reports a false write data race on a variable that is clearly protected by a mutex.  The false report results from a misuse of helgrind's disable and enable checking API.  TokuFT uses the API to tell helgrind to ignore a variable during a racy read.  However, helgrind's enable checking call makes helgrind treat the variable as completely new, so the next protected write looks like a race.  In addition, helgrind does not implement the ignore reads annotation, so that method cannot be used to temporarily disable data race checking.

Luckily, the ignore reads annotation works nicely with DRD, so the spooky race test does not fail with DRD.

The false write data race program demonstrates how a misuse of helgrind's disable and enable API can cause helgrind to generate a false write data race.  Helgrind documented the APIs; the TokuFT developers misused them.

The false read data race program demonstrates an additional helgrind problem.  Helgrind's disable and enable checking APIs are not multi-thread safe, so one can generate a read data race when two reader threads are interleaved.  Again, helgrind documented the APIs; the TokuFT developers misused them.

Conclusions

TokuFT has several instances of racy reads.  These racy reads are annotated by the TokuFT developers to be ignored by the race detector.  Unfortunately, helgrind's disable/enable checking API cannot be used because it causes new false data race reports.  In addition, helgrind does not implement the ignore reads API.  The only way (as far as I know) to use helgrind with TokuFT is to start collecting a huge number of helgrind suppressions.  This is problematic.

Luckily, DRD works fine with TokuFT since DRD implements the ignore reads API.  DRD is the preferred way to test TokuFT software.

What about using the Thread Sanitizer, which is now included in clang and gcc?   We need to either remove all of the racy reads or generate TSAN suppressions for them.  Experiments are in progress.

Tools

gcc 4.8
valgrind 3.10.1

Monday, July 20, 2015

Helgrind and MySQL 5.7

Valgrind's helgrind tool identifies data races and lock deadlocks in multi-threaded programs.  Since the MySQL server is a multi-threaded program, can helgrind find any issues in the MySQL server?

A simple test that does nothing except start up the MySQL server and shut it down is run through the MySQL test suite with memcheck and with helgrind.  For this simple test on MySQL 5.7.7, the memcheck run is nice and clean, but the helgrind run reports a lot of issues.  The helgrind issues include a data race in the safe mutex lock function, a potential lock deadlock problem, several performance schema issues, and several InnoDB issues.

Valgrind test

Valgrind's memcheck tool is used by the MySQL test suite to find memory related bugs in the MySQL server.   What is the state of MySQL 5.7.7 with respect to memcheck?

The 'donothing' test is run with valgrind.

    mtr --valgrind-mysqld donothing

    main.donothing [ pass ]
    valgrind_report [ pass ]

The memcheck run is nice and clean with this simple test.

Helgrind test

The MySQL test suite does not support helgrind tests, so I hacked mysql-test-run.pl to support it.

The 'donothing' test is run with helgrind.

    mtr --helgrind --testcase-timeout=3600 donothing

    main.donothing [ fail ] Found warnings/errors in server log file!

There are a lot of issues in the helgrind run with this simple test.  Let's look into the safe mutex data race, the lock deadlock issue, and one of the data races in InnoDB.

Safe mutex data race

The safe mutex is the implementation of a MySQL mutex that includes extra data that is useful for debugging locking problems.  For example, the file name and line number of the code that locks a safe mutex are stored in the safe mutex object.

The safe mutex data race occurs in the 'safe_mutex_lock' function because 'mp->file' is read without holding a mutex but is written later while holding a mutex.

The fix to the safe mutex data race is simple:  just hold the 'mp->global_mutex' when reading 'mp->file'.

See MySQL bug 77862 for progress.

Lock deadlock issue

Helgrind builds a directed graph of locks that are taken within the MySQL server.  A cycle in the graph implies the possibility of deadlock.  Some of these lock deadlocks in the MySQL server are not detected at runtime but result in a server hang.

The helgrind run on the simple 'donothing' test reported a couple of lock order issues.  The lock ordering rules in the MySQL server source are:

Rule #1: LOCK_system_variables_hash (R) before LOCK_global_system_variables (M)
Used in the 'alloc_and_copy_thd_dynamic_variables' function.

Rule #2: LOCK_plugin (M) before LOCK_system_variables_hash (R)
Used in the 'find_sys_var_ex' function.

Helgrind does not report all of the locks involved in the deadlock.  So, we find the following lock order rule in the MySQL source.

Rule #3: LOCK_global_system_variables (M) before LOCK_plugin (M)
Used in the 'THD::init' function.

There is no deadlock yet, given the read locks on the LOCK_system_variables_hash lock.  It appears that helgrind does not distinguish between read and write locks on a reader/writer lock.  We find another set of functions in the MySQL server that can replace lock order rule #2 and lead to a deadlock.

Modified rule #2: LOCK_plugin (M) before LOCK_system_variables_hash (W)
Used in the 'mysql_install_plugin' and 'plugin_del' functions.

Helgrind identified the possibility of a locking deadlock in MySQL 5.7 involving three locks.  One can search the code for SQL operations that call these low level functions and use a tool like the random query generator to try to generate a deadlock.

This locking deadlock is similar to MySQL bug 51591 and MySQL bug 71236.  It is quite easy to create software with low level locking deadlocks.  It appears that there is a long history of deadlocks with similar parts of the MySQL server software.

See MySQL bug 77872 for progress.

InnoDB issues

There are several helgrind issues with InnoDB.  Here is a simple InnoDB data race on the 'os_thread_count' variable. The 'os_thread_create_function' reads the 'os_thread_count' variable without holding the 'thread_mutex'.

See MySQL bug 77866 for progress.

Next steps

Helgrind identified several issues using the simplest test program.   Does anyone have any experience using helgrind or some other race detector on the MySQL or MariaDB server?

Tool versions

MySQL 5.7.7
Ubuntu 14.04
gcc 4.8.4
valgrind 3.10.1

Monday, June 29, 2015

TokuDB and Binlog Group Commit in MySQL 5.7

The MySQL 5.7.6 release notes describe a change to the binary log group commit algorithm that can improve the performance of write intensive applications when the binary log is enabled (noted with Bug #19424075).  This change is probably inspired by the experiment reported in MySQL bug #73202.

There is a long history of tweaking the binary log group commit algorithm and the InnoDB storage engine to scale performance.  Mark Callaghan describes the work done on this problem at Facebook.  Kristian Nielsen describes how this problem was fixed in MariaDB.  Mats Kindahl describes how this problem was fixed in MySQL 5.6.  Now, there is yet another tweak to the algorithm.

Since the binary log group commit algorithm in MySQL 5.7 changes the interaction between the MySQL transaction commit logic and the storage engine, a small change to TokuDB is required to use the new algorithm.  I decided to run the sysbench non-indexed update test with the InnoDB and TokuDB storage engines on MySQL 5.7 to test this new algorithm.

InnoDB experiment

The sysbench non-indexed update test is run on a small table that fits entirely in the InnoDB in-memory buffer pool.   A small table is chosen to focus the test on the group commit algorithm and its disk syncs.  A consumer grade SSD with about 3 millisecond fsync time is used to store the data.  The sysbench update throughput as a function of the number of concurrent threads updating the table is reported.

[figure: sysbench non-indexed update throughput vs. number of concurrent threads, InnoDB on MySQL 5.6 and 5.7]

The new binary log group commit algorithm increases application throughput by about 20%.

TokuDB experiment

The sysbench non-indexed update test is run on a small table that fits entirely in the TokuDB in memory cache. The test labelled 'mysql 5.7.7 tokudb flush' required a change to the TokuDB prepare function to work with the new binary log group commit algorithm.

[figure: sysbench non-indexed update throughput vs. number of concurrent threads, TokuDB on Percona Server 5.6.24 and MySQL 5.7.7, with and without the flush-logs change]

Sysbench throughput on Percona Server 5.6.24 and MySQL 5.7.7 is about the same, as expected.  When TokuDB is changed to use the MySQL 5.7 binary log group commit algorithm,  sysbench non-indexed update throughput increases by about 30%.

What is the new binary log group commit algorithm?

MySQL uses a two-phase commit algorithm to coordinate the state of the MySQL binary log with the state of the storage engine recovery log.

Here is the MySQL 5.6 commit logic and its interaction with the TokuDB storage engine.  The interaction with InnoDB is similar.

MySQL commit logic calls TokuDB prepare:
(1) write a prepare recovery log entry to the in memory TokuDB recovery log.
(2) write and sync the TokuDB recovery log with a group commit algorithm.

MySQL commit logic writes the binary log:
(3) write transactional changes to the binary log.
(4) sync the binary log with a group commit algorithm.

MySQL commit logic calls TokuDB commit:
(5) write a commit recovery log entry to the TokuDB recovery log.  A sync of the TokuDB recovery log is unnecessary here for proper recovery.

Transaction durability occurs when the storage engine recovery log and the binary log are synced to disk in steps (2) and (4).

During crash recovery, the MySQL transaction coordinator gets a list of all of the prepared transactions from the storage engine and will commit those prepared transactions that are present in the binary log.  If the prepared transaction is not in the binary log, as is the case if the crash occurs before the binary log is synced to disk, then the transaction is aborted.

Disk syncs are really expensive compared to memory operations.  A group commit algorithm is used to amortize the sync cost over multiple transactions.   Since the expected sync time is about the same for both steps (2) and (4), we expect about the same number of threads to be group committed together.

Here is the MySQL 5.7 commit logic and its interaction with the TokuDB storage engine.

MySQL commit logic calls TokuDB prepare:
(1) write a prepare recovery log entry to the in memory TokuDB recovery log.

MySQL commit logic calls TokuDB flush logs:
(2) write and sync the TokuDB recovery log.

MySQL commit logic writes the binary log:
(3) write transactional changes to the binary log
(4) sync the binary log with a group commit algorithm

MySQL commit logic calls TokuDB commit:
(5) write a commit recovery log entry to the TokuDB recovery log.  A sync of the TokuDB recovery log is unnecessary for proper recovery.

MySQL 5.7 groups threads together when executing steps (2) through (5).  A leader thread is chosen to execute these actions.  Steps (2) through (4) are executed by the leader thread on behalf of many other threads.  Since there is only one point where threads are collected for group commit (step 4), the sync cost is amortized over more threads than on MySQL 5.6, which has two places where group commit occurs.

TokuDB changes

TokuDB detects that the new two-phase commit algorithm is being used and does not sync its recovery log in the storage engine prepare; the sync happens in the flush logs call (step 2) instead.  TokuDB assumes that MySQL will call its flush logs function to write and sync its recovery log.  See this GitHub commit for details.

Thursday, June 18, 2015

TokuDB and JSON data in MySQL 5.7


MySQL recently added support for JSON data with MySQL 5.7.8.  It would be cool to store JSON data in TokuDB tables.

First, I had to get TokuDB running on MySQL 5.7.8.   TokuDB currently runs on Percona Server 5.6 and various flavors of MariaDB.  The only issues porting TokuDB to MySQL 5.7 were adapting to the changes in various internal APIs that storage engines use.  Since I did not make any patches to the MySQL code, some TokuDB features, including clustered secondary keys and the selection of compression algorithms for TokuDB tables, are missing.  Compression with zlib is always used.

Second, I decided to store JSON data in a TokuDB blob since the JSON field class is derived from the blob field class.  The changes to support JSON as blobs in TokuDB are minimal.

Here is an example from Hatlen's blog of storing JSON data in a TokuDB table.

mysql> create table employees (data JSON) engine=tokudb;
Query OK, 0 rows affected (0.06 sec)

mysql> insert into employees values ('{"id": 1, "name": "Jane"}');
Query OK, 1 row affected (0.02 sec)

mysql> insert into employees values ('{"id": 2, "name": "Joe"}');
Query OK, 1 row affected (0.00 sec)

mysql> select * from employees;
+---------------------------+
| data                      |
+---------------------------+
| {"id": 1, "name": "Jane"} |
| {"id": 2, "name": "Joe"}  |
+---------------------------+
2 rows in set (0.00 sec)

mysql> select json_extract(data, '$.name') from employees;
+------------------------------+
| json_extract(data, '$.name') |
+------------------------------+
| "Jane"                       |
| "Joe"                        |
+------------------------------+
2 rows in set (0.00 sec)

If you want to play with JSON data in TokuDB tables, here is a script to build MySQL 5.7.8 with TokuDB.  Note that all of these components are experimental and only some of the TokuDB tests pass.