Monday, October 26, 2015

MySQL 5.7 Versus the Address Sanitizer

MySQL 5.7 supports the Address Sanitizer, which checks for several memory-related software errors, including memory leaks.  It is really nice to see support for the Address Sanitizer built into MySQL.  Unfortunately, when running the mysql tests included in MySQL 5.7.9, the Address Sanitizer reports several memory leaks, which causes some of the tests to fail.  Here are some of the memory leaks in the MySQL 5.7.9 software found by the Address Sanitizer.

Memory leaks in 'mysqlbinlog'

Several of the 'mysqlbinlog' tests fail due to memory leaks in the 'mysqlbinlog' client program.  Here are the details from the 'main.mysqlbinlog' test.  It appears the binlog event objects are not always deleted after being created by the 'dump_remote_log_entries' function.

See MySQL bugs 78966 and 78223 for status.

Memory leaks in 'mysqlpump'

All of the 'mysqlpump' tests fail due to memory leaks in the 'mysqlpump' client program.  Here are the details from the 'main.mysqlpump_basic' test.  The first memory leak occurs in the 'quote_name' function, which gets an object pointer and then just discards it.  Other memory leaks are caused by a similar disregard for object management.  Unfortunately, C++ requires the programmer to manage memory.  Use of the Address Sanitizer when developing and testing new C++ software like 'mysqlpump' will quickly identify the memory leaks.
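
To make the pattern concrete, here is a minimal sketch of this kind of leak and one way to avoid it.  This is not the actual 'mysqlpump' code: the 'Quoter' type and the 'make_quoter' helpers are hypothetical names made up for illustration.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for an object that a helper hands back to its caller.
struct Quoter {
    std::string quote(const std::string &name) const { return "`" + name + "`"; }
};

// Leaky pattern: the caller receives an owning raw pointer...
Quoter *make_quoter_raw() { return new Quoter(); }

std::string quote_name_leaky(const std::string &name) {
    Quoter *q = make_quoter_raw();
    return q->quote(name);  // ...and discards it without a delete: a leak.
}

// One possible fix: return a unique_ptr so the object is deleted
// automatically when the caller's pointer goes out of scope.
std::unique_ptr<Quoter> make_quoter() {
    return std::unique_ptr<Quoter>(new Quoter());
}

std::string quote_name(const std::string &name) {
    std::unique_ptr<Quoter> q = make_quoter();
    return q->quote(name);  // q is destroyed on return, nothing leaks.
}
```

Both functions return the same string; only 'quote_name_leaky' should show up in a leak report when the program is built with the Address Sanitizer.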

See MySQL bugs 78965 and 78224 for status.

Memory leaks in InnoDB

Several InnoDB tests fail due to memory leaks in the InnoDB storage engine when running the mysqld server.  Here are the details from the 'log_file_name' test.  The memory leaks occur because InnoDB calls 'exit' from its initialization function rather than passing an error back to the mysqld server code.  Since InnoDB can abruptly terminate mysqld, this sets a precedent that allows any storage engine or plugin to do the same when it encounters a problem.  IMO, a plugin should not be allowed to terminate the MySQL server, especially for a 'simple' case of an error during plugin initialization.
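
As a rough illustration of the alternative, here is a sketch of an initialization function that reports failure to its caller instead of terminating the process.  It does not use the real InnoDB or MySQL plugin interfaces: 'EngineState', 'engine_init' and the 'log_files_ok' flag are all hypothetical.

```cpp
#include <cstdio>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical per-engine state allocated during initialization.
struct EngineState {
    std::vector<char> buffer;
};

// Hypothetical init function: returns 0 on success, nonzero on failure.
// On failure it does not call exit(); the partially built state is released
// by the unique_ptr and the caller decides what to do next.
int engine_init(bool log_files_ok, std::unique_ptr<EngineState> &out) {
    std::unique_ptr<EngineState> state(new EngineState());
    state->buffer.resize(4096);
    if (!log_files_ok) {
        std::fprintf(stderr, "engine initialization failed\n");
        return 1;  // state is freed here, so nothing is leaked
    }
    out = std::move(state);
    return 0;
}
```

With this shape, the server can log the error, release its own resources, and shut down in an orderly way, so a leak checker has nothing to report.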

Conclusions

Since it is really easy to leak memory in C++ programs, C++ developers should use a memory verification tool like valgrind's memcheck or the Address Sanitizer to find memory leaks.  Since support for the Address Sanitizer is built into MySQL, it should be used when hacking on MySQL software.  Furthermore, since the execution cost of the Address Sanitizer is reasonable (IMO), it could be used in an automated MySQL test system.

I have only run the 'main' and 'innodb' mysql test suites with the Address Sanitizer.  There are plenty of additional test suites that are part of the MySQL software that should be run with the Address Sanitizer.

Tools

MySQL 5.7.9
Address Sanitizer built into Clang 3.8
Ubuntu 14.04

Friday, October 2, 2015

TokuBackup Versus the Sanitizers

Percona recently open sourced TokuBackup, a library that may be used to take a snapshot of a set of files while these files are being changed by a running application like MySQL or MongoDB.  Making TokuBackup open source allows a larger set of users to improve and extend the software.  This blog describes how the Address Sanitizer and Thread Sanitizer found bugs in the TokuBackup library.

The TokuBackup library is used by Percona Server for MySQL and Percona Server for MongoDB to create a snapshot of the fractal tree data files and log files.  The basic idea of TokuBackup is to intercept the system calls that change the state of a set of directories and their files while a copy is made of these same directories.  The end result is a snapshot of the database files made while the database is still running.  Please see the Hot Backup blog for a nice description of the algorithm.

The TokuBackup repository includes a set of glass box and black box tests that can be used to test TokuBackup independently of MySQL or MongoDB.   Let's apply some of my favorite test tools to TokuBackup and its glass box and black box tests to see if any bugs are found.

TokuBackup Code Coverage:

After compiling TokuBackup with code coverage and running all of the black box and glass box tests, gcov finds that about 89% of the TokuBackup source lines are executed. This code coverage data can be used to identify test weaknesses.  Here is one example.

The gcov tool shows that the truncate function is not executed by the TokuBackup black box and glass box tests.  This is a problem since the fractal tree software uses it.  Since there is no test coverage of the truncate function with the included tests, no claims that truncate actually works correctly with TokuBackup can be made.

Since TokuBackup is open source, we can see which file related system calls are intercepted and which are omitted.  TokuBackup does not intercept the Linux kernel's asynchronous I/O system calls.  Because of this omission, TokuBackup cannot be used to back up running InnoDB databases that use native async I/O, since TokuBackup does not capture the file changes made by the asynchronous I/O calls.  IMO, this may be a useful and interesting feature to add to the TokuBackup library.

TokuBackup versus the Address Sanitizer:

The Address Sanitizer is a tool built into clang that finds memory related bugs at run time.   Here is a comparison of the capabilities of the Address Sanitizer and valgrind's memcheck tool.  One of the distinguishing features of the Address Sanitizer compared to memcheck is the order of magnitude speedup in execution time.

After compiling TokuBackup with the Address Sanitizer enabled and running all of the black box and glass box tests, the Address Sanitizer finds a memory leak when running the many_directories test.  The bug fix is obvious once the leak is known.

The same memory leak is also found when running the many_directories test with valgrind's memcheck tool, which is built into the TokuBackup test infrastructure.  It appears that these tests are not being run, so the bugs are not being found.

TokuBackup versus the Thread Sanitizer:

The Thread Sanitizer is a tool built into clang that finds data races in multi-threaded programs.    Since TokuBackup runs inside of the MySQL server and the MongoDB server, and these servers are heavily multi-threaded programs, it makes sense to use the Thread Sanitizer to find data races in TokuBackup.

The Thread Sanitizer reported over 80 issues when running the TokuBackup glass box and black box tests.  These issues can be divided into four categories.

Data races in the test cases:
A common design pattern used by the TokuBackup tests is to launch some work on a background thread and synchronize the state of the background thread with the main thread through a global variable.  Unlocked shared global variables can be racy.  One can use C++ atomic variables for this design pattern so that coupling between threads is explicit and not racy.
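
Here is a minimal sketch of that pattern using std::atomic.  It is not taken from the TokuBackup tests; the variable and function names are hypothetical.

```cpp
#include <atomic>
#include <thread>

// Shared between the main thread and the background thread.  Because these
// are std::atomic, concurrent loads and stores on them are not data races.
std::atomic<bool> background_ready(false);
std::atomic<bool> stop_requested(false);
std::atomic<int> iterations(0);

// Hypothetical background job: announce that it is running, then do some
// work until the main thread asks it to stop.
void background_job() {
    background_ready.store(true);
    while (!stop_requested.load()) {
        iterations.fetch_add(1);
        std::this_thread::yield();
    }
}

// Main thread side: start the job, wait until it is running, stop it, and
// return how many iterations it completed (possibly zero).
int run_background_job() {
    background_ready.store(false);
    stop_requested.store(false);
    iterations.store(0);
    std::thread worker(background_job);
    while (!background_ready.load()) {
        std::this_thread::yield();
    }
    stop_requested.store(true);
    worker.join();
    return iterations.load();
}
```

If the three shared variables were plain 'bool' and 'int' globals instead, the same code would be a data race and I would expect the Thread Sanitizer to report it.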

Data races in glass box test infrastructure:
The glass box test infrastructure incorporates test code into TokuBackup so that various concurrent execution scenarios can be sequenced.  The TokuBackup glass box test infrastructure is racy.  One can use Thread Sanitizer suppressions for all of the glass box infrastructure to suppress data races caused by the glass box infrastructure.  Similarly, TokuBackup uses helgrind suppressions for the glass box infrastructure so that the glass box races are not reported by helgrind.

Data races on error reporting:
Errors that occur while a TokuBackup is running cause the error state of the backup to be set.  The error state is racy since the error state is a set of shared heap variables that are unprotected by a mutex.  TokuBackup included helgrind annotations so that data races on the error state are automatically suppressed.  IMO, this just makes the TokuBackup code more complicated and harder to understand.  A better solution is to use a mutex to ensure that the error state manipulations are not data racy.
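
A minimal sketch of a mutex protected error state might look like the following.  The class name, its members, and the "first error wins" policy are assumptions for illustration, not the TokuBackup implementation.

```cpp
#include <mutex>
#include <string>

// Hypothetical error state shared by all threads taking part in a backup.
// Every read and write goes through the mutex, so there is nothing for a
// race detector to suppress.
class BackupErrorState {
public:
    // Record an error.  Assumption: only the first error is kept.
    void set_error(int code, const std::string &message) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (code_ == 0) {
            code_ = code;
            message_ = message;
        }
    }

    bool failed() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return code_ != 0;
    }

    int code() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return code_;
    }

    // Returns a copy so the caller never holds a reference into shared state.
    std::string message() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return message_;
    }

private:
    mutable std::mutex mutex_;
    int code_ = 0;
    std::string message_;
};
```

Errors should be rare during a backup, so the cost of taking the mutex here is unlikely to matter.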

Use after free memory violation:
The Thread Sanitizer reported a heap use after free violation in the TokuBackup copier.  A use after free violation is a read of memory after it has been freed.  The best case outcome from this error is that the program crashes.  Unfortunately, the best case will only occur if the memory that has been freed is no longer mapped into the address space of the running program.  It is more likely that the address is still mapped, so the read will return random data.  The worst case is that the random data will result in a broken backup.

The use after free bug is due to improper management of the 'todo' work stack by concurrent threads.  The 'todo' work stack is a C++ vector protected by a mutex (since C++ vectors are not multi-thread safe).  The 'todo' work stack contains the set of all files that need to be copied.  The use after free violation implies that the work stack no longer contained valid file name pointers.  This implies that the backup could be corrupted.

The bug occurs because the copier reads the top of the 'todo' stack, copies the files, and then pops the stack.   While the copier is copying the files, a file rename operation could occur on another thread.  The file rename operation pushes a new name onto the 'todo' stack.  You can see that the copier read of the 'todo' stack and the copier pop of the stack assume that the 'todo' stack has not changed in between.  However, the rename operation did in fact change the stack.  The copier's 'todo' data structure and algorithm are defeated.
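
One way to close this window is to remove the top entry from the stack under the lock before copying, so that the copier owns the name it is working on.  The sketch below illustrates that idea only; it is not necessarily the fix used in TokuBackup, and 'TodoStack' and 'copy_all' are hypothetical names.

```cpp
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical work stack of file names that still need to be copied.
class TodoStack {
public:
    void push(const std::string &path) {
        std::lock_guard<std::mutex> guard(mutex_);
        todo_.push_back(path);
    }

    // Remove the top entry and hand it to the caller in one locked step.
    // Returns false when the stack is empty.
    bool pop(std::string &path) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (todo_.empty()) {
            return false;
        }
        path = std::move(todo_.back());
        todo_.pop_back();
        return true;
    }

private:
    std::mutex mutex_;
    std::vector<std::string> todo_;
};

// Copier loop: each name is taken off the stack before it is copied, so a
// push from another thread (for example a rename) cannot change which entry
// the copier removes.  Returns the number of entries processed.
template <typename CopyFn>
int copy_all(TodoStack &todo, CopyFn copy_one) {
    int copied = 0;
    std::string path;
    while (todo.pop(path)) {
        copy_one(path);  // runs without holding the lock
        ++copied;
    }
    return copied;
}
```

Since 'pop' returns its own copy of the name, a rename that pushes a new entry while 'copy_one' is running just adds more work for a later iteration.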

The tsan-2 branch of TokuBackup includes some possible fixes for the above issues.

Summary:

The Sanitizers found bugs in TokuBackup, a software library that has been shipping for a few years.  IMO, this shows the value in running existing tests with the Sanitizers to find bugs.

There are a lot of racy variables in TokuBackup that use helgrind annotations to suppress helgrind race reports.  One can use Thread Sanitizer suppressions to suppress benign data races. In an ideal world, one should use atomic variables as it allows the programmer to express shared variables that can be manipulated safely by concurrent threads and it allows a race detector like the Thread Sanitizer to work effectively.

Versions:

Ubuntu 14.04
clang 3.8
gcc 4.9
TSAN branch of TokuBackup