Monday, July 20, 2015

Helgrind and MySQL 5.7

Valgrind's helgrind tool identifies data races and lock deadlocks in multi-threaded programs.  Since the MySQL server is a multi-threaded program, can helgrind find any issues in the MySQL server?

A simple test that does nothing except start up the MySQL server and shut it down is run on the MySQL test suite with memcheck and with helgrind.  For this simple test on MySQL 5.7.7, the memcheck run is nice and clean but the helgrind run reports a lot of issues.  The helgrind issues include a data race in the safe mutex lock function, a potential lock deadlock problem, several performance schema issues, and several InnoDB issues.

Valgrind test

Valgrind's memcheck tool is used by the MySQL test suite to find memory related bugs in the MySQL server.   What is the state of MySQL 5.7.7 with respect to memcheck?

The 'donothing' test is run with valgrind.

    mtr --valgrind-mysqld donothing

    main.donothing [ pass ]
    valgrind_report [ pass ]

The memcheck run is nice and clean with this simple test.

Helgrind test

The MySQL test suite does not support helgrind tests, so I hacked mysql-test-run.pl to support it.

The 'donothing' test is run with helgrind.

    mtr --helgrind --testcase-timeout=3600 donothing

    main.donothing [ fail ] Found warnings/errors in server log file!

There are a lot of issues in the helgrind run with this simple test.  Let's look into the safe mutex data race, the lock deadlock issue, and one of the data races in InnoDB.

Safe mutex data race

The safe mutex is the implementation of a MySQL mutex that includes extra data that is useful for debugging locking problems.  For example, the file name and line number of the code that locks a safe mutex is stored in the safe mutex object.

The safemutex data race occurs in 'safe_mutex_lock' function because 'mp->file' is read without holding a mutex but is written later while holding a mutex.

The fix to the safe mutex data race is simple:  just hold the 'mp->global_mutex' when reading mp->file'.

See MySQL bug 77862 for progress.

Lock deadlock issue

Helgrind builds a directed graph of locks that are taken within the MySQL server.  A cycle in the graph implies the possibility of deadlock.  Some of these lock deadlocks in the MySQL server are not detected at runtime but result in a server hang.

The helgrind run on the simple 'donothing' test reported a couple of  lock order issues.   The lock ordering rules in the MySQL server source are:

Rule #1: LOCK_system_variables_hash (R) before LOCK_global_system_variables (M)
Used in the 'alloc_and_copy_thd_dynamic_variables' function.

Rule #2: LOCK_plugin (M) before LOCK_system_variables_hash (R)
Used in the 'find_sys_var_ex function'.

Helgrind does not report all of the locks involved in the deadlock.   So,  we find the following lock order rule in the MySQL source.

Rule #3: LOCK_global_system_variables (M) before LOCK_plugin (M)
Used in the 'THD::init' function.

There no deadlock situation yet given the read locks on LOCK_system_variables_hash lock.  It appears that helgrind does not distinguish between read and write locks on a reader/writer lock.  We find another set of functions in the MySQL server than can replace lock order rule #2 and lead to a deadlock.

Modified rule #2: LOCK_plugin (M) before LOCK_system_variables_hash (W)
Used in the 'mysql_install_plugin' and 'plugin_del functions'.

Helgrind identified the possibility of a locking deadlock in MySQL 5.7 involving three locks.  One can search the code for SQL operations that call these low level functions and use a tool like the random query generator to try to generate a deadlock.

This locking deadlock is similar to MySQL bug 51591 and MySQL bug 71236.  It is quite easy to create software with low level locking deadlocks.  It appears that there is a long history of deadlocks with similar parts of the MySQL server software.

See MySQL bug 77872 for progress.

InnoDB issues

There are several helgrind issues with InnoDB.  Here is a simple InnoDB data race on the 'os_thread_count' variable. The 'os_thread_create_function' reads the 'os_thread_count' variable without holding the 'thread_mutex'.

See MySQL bug 77866 for progress.

Next steps

Helgrind identified several issues using the simplest test program.   Does anyone have any experience using helgrind or some other race detector on the MySQL or MariaDB server?

Tool versions

MySQL 5.7.7
Ubuntu 14.04
gcc 4.8.4
valgrind 3.10.1