Monday, August 24, 2015

The Address Sanitizer Finds Bugs in TokuFT

This blog describes how the Address Sanitizer found bugs in the TokuFT (now PerconaFT) storage library.  TokuFT is the storage library used by the TokuDB for MySQL and TokuMX for MongoDB products.  TokuFT is currently tested with valgrind's memcheck tool.  However, memcheck is very slow compared to native execution, so memcheck is not always used for large tests.  This leads to missed bugs.

The Address Sanitizer is a memory error detector which detects memory leaks, out of bounds memory access, and several other types of memory related bugs.  Recent versions of the clang and gcc compilers have integrated the Address Sanitizer into the compiler.  The Address Sanitizer claims an order of magnitude increase in speed compared to valgrind's memcheck.  This is interesting since it opens up execution scenarios that are problematic with valgrind given the execution time expansion.

The Address Sanitizer found bugs in the TokuFT library when running existing tests.  These bugs were not previously found because the valgrind tools were deemed too slow and were not used for these tests.

TokuFT test failures detected with the address sanitizer

TokuFT has been developed and tested for awhile.   It is always interesting to find new bugs in existing code by using new tools like the Address Sanitizer.

Memory leak in the bnc-insert-test.  This bug is found when running the basic TokuFT regression tests with the Address Sanitizer compiled into TokuFT.  For some reason, the 'bnc-insert-test' was not run with valgrind for some reason (probably execution time), so the bug was not found.

Memory leak in the context status initialization.   This bug is found by the TokuFT stress tests with the Address Sanitizer compiled into TokuFT.  The bug was not previously found because valgrind is not used by the stress tests.

Future work

Cmake file hackery.   IMO, an option should be added to the TokuFT cmake files to support the Address Sanitizer in addition to valgrind's memcheck, helgrind, and DRD tools.  Since these tools continue to improve, TokuFT should continue to support all of them.

Stress tests.  The TokuFT stress tests (and perhaps other tests) should be run with the Address Sanitizer enabled since the execution speed is claimed to only be 2x slower than native compiled code execution.

Address Sanitizer and MySQL.  I did not realize until today that MySQL 5.6 (and 5.7) include the 'WITH_ASAN' option that will compile MySQL with the Address Sanitizer enabled.  This makes testing MySQL with memory verification feasible for large tests.  I assume that since this option has been available in MySQL for awhile, it must be part of the MySQL test plan.  Is it?

Tool versions

Ubuntu 14.04
Clang 3.5

Build and test

# setup the compiler options to build with the address sanitizer
export CC=clang-3.5
export CXX=clang++-3.5
export CXXFLAGS=-fsanitize=address
export ASAN_OPTIONS=detect_odr_violation=0

# get the FT source
git clone git@github.com:percona/perconaft

# run the basic tests
mkdir perconaft-asan && chdir perconaft-asan
cmake -DCMAKE_BUILD_TYPE=Debug ../perconaft
ctest -j8 --verbose

# run the stress tests as described here.
Directions on how to run the FT stress tests







Thursday, August 13, 2015

Races in the TokuFT Race Detector

TokuFT (now called PerconaFT) is the write optimized storage component used by TokuDB for MySQL and TokuMX for MongoDB.  Since TokuFT is its own component, TokuFT can be tested independently of TokuDB and TokuMX.  Some of the TokuFT tests use valgrind's memcheck, helgrind, and DRD tools to identify bugs.

Helgrind and DRD find data races in multi-threaded programs at runtime rather than at compile time.  In my experience with these tools, data races are not predictable and it sometimes takes many test runs of the same program to find a problem.  So, I ran the group commit count test in a loop with helgrind until it failed.

One of the data races that helgrind reports is within TokuFT's lock manager. The purpose of the lock manager is to allow concurrent transactions to execute correctly.   Since a data race in the lock manager could break concurrent transactions,  the problem should be investigated and resolved.

The lock manager data race reported by helgrind implies that two concurrent threads are updating the transaction identifier for the single transaction optimization (the 'sto_txnid').  A review of the code shows that this is impossible since the writes of the transaction identifier ('sto_txnid') in the 'sto_begin' and 'sto_end' functions occur when holding a mutex on the lock tree.  The helgrind report shows that the same mutex is being held when the data race is reported.  So, what is going on?

It turns out that there is a code path in the lock manager that does an unlocked racy read of the 'sto_txnid' that occurs when acquired locks are released.  Helgrind would normally report a read race on an unlocked racy read.  However, the TokuFT developers decided that a racy read is appropriate to presumably improve performance, so they attempted to inform helgrind and DRD to ignore the read race in this function. As we will see, the method used to ignore the read race causes helgrind to report an erroneous write data race, which caused me to waste a lot of time debugging this problem.

Hopefully, the following sequence of tests will demonstrate the problem without any knowledge of or code from TokuFT.

Spooky races

The spooky race program demonstrates how helgrind reports a false write data race on a variable that is clearly protected by a mutex.  The false write data race is reported because of a misuse of the helgrind disable and enable checking API.  TokuFT is trying to inform helgrind to ignore a variable when doing a racy read.  Helgrind's enable checking API causes helgrind to view the variable as completely new, which causes helgrind to think that there is a race.  In addition, helgrind does not implement the ignore reads annotation, so that method can not be used to temporarily disable data race checking.

 Luckily, the ignore reads annotation works nicely with DRD, so the spooky race test does not fail with DRD.

The false write data race program demonstrates how a misuse of helgrind's disable and enable API can cause helgrind to generate a false write data race.  Helgrind documented the APIs; the TokuFT developers misused them.

The false read data race program demonstrates an additional helgrind problem.  Helgrind's disable and enable checking APIs are not multi-thread safe, so one can generate a read data race when two reader threads are interleaved.  Again, helgrind documented the APIs; the TokuFT developers misused them.

Conclusions

TokuFT has several instances of racy reads.  These racy reads are annotated by the TokuFT developers to be ignored by the race detector.  Unfortunately, the helgrind disable enable checking API can not be used because it causes new false data race reports.  In addition, helgrind does not implement the ignore reads API.  The only way (as far as I know) to use helgrind with TokuFT is to start collecting a huge number of helgrind suppressions.  This is problematic.

Luckily, DRD works fine with TokuFT since DRD implements the ignore reads API.  DRD is the preferred way to test TokuFT software.

What about using the thread sanitizer which is now included in clang and gcc?   We need to either remove all of the racy reads, or generate TSAN suppressions for them.  Experiments in progress.

Tools

gcc 4.8
valgrind 3.10.1