Thursday, September 10, 2015

MySQL 5.7 and Barriers

When MySQL 5.7.8 is built and run with the Thread Sanitizer, one of the first issues reported is a data race with MySQL's use of pthread barrier wait and pthread barrier destroy.   The data race report is:

WARNING: ThreadSanitizer: data race (pid=30104)
  Write of size 1 at 0x7ffd33054298 by main thread:
    #0 pthread_barrier_destroy /home/rfp/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1236 (mysqld+0x000000d5a71b)
    #1 start_helper_thread /home/rfp/my578.tsan/mysql-server/mysys/posix_timers.c:124:3 (mysqld+0x0000016f0e82)
    #2 my_timer_initialize /home/rfp/my578.tsan/mysql-server/mysys/posix_timers.c:158:7 (mysqld+0x0000016f0da0)
    #3 init_server_components() /home/rfp/my578.tsan/mysql-server/sql/mysqld.cc:3785:7 (mysqld+0x000000ddba75)
    #4 mysqld_main(int, char**) /home/rfp/my578.tsan/mysql-server/sql/mysqld.cc:4645:7 (mysqld+0x000000dda276)
    #5 main /home/rfp/my578.tsan/mysql-server/sql/main.cc:25:10 (mysqld+0x000000dd4b9e)

  Previous read of size 1 at 0x7ffd33054298 by thread T1:
    #0 pthread_barrier_wait /home/rfp/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:1243 (mysqld+0x000000d69b4e)
    #1 timer_notify_thread_func /home/rfp/my578.tsan/mysql-server/mysys/posix_timers.c:73:3 (mysqld+0x0000016f11a9)

  Location is stack of main thread.

  Thread T1 (tid=30106, running) created by main thread at:
    #0 pthread_create /home/rfp/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:848 (mysqld+0x000000d732d3)
    #1 my_thread_create /home/rfp/my578.tsan/mysql-server/mysys/my_thread.c:92:10 (mysqld+0x0000016f0c60)
    #2 spawn_thread_noop /home/rfp/my578.tsan/mysql-server/mysys/psi_noop.c:188:10 (mysqld+0x0000016efda0)
    #3 inline_mysql_thread_create /home/rfp/my578.tsan/mysql-server/include/mysql/psi/mysql_thread.h:1296:11 (mysqld+0x0000016f1112)
    #4 start_helper_thread /home/rfp/my578.tsan/mysql-server/mysys/posix_timers.c:113:7 (mysqld+0x0000016f0e38)
    #5 my_timer_initialize /home/rfp/my578.tsan/mysql-server/mysys/posix_timers.c:158:7 (mysqld+0x0000016f0da0)
    #6 init_server_components() /home/rfp/my578.tsan/mysql-server/sql/mysqld.cc:3785:7 (mysqld+0x000000ddba75)
    #7 mysqld_main(int, char**) /home/rfp/my578.tsan/mysql-server/sql/mysqld.cc:4645:7 (mysqld+0x000000dda276)
    #8 main /home/rfp/my578.tsan/mysql-server/sql/main.cc:25:10 (mysqld+0x000000dd4b9e)

SUMMARY: ThreadSanitizer: data race /home/rfp/my578.tsan/mysql-server/mysys/posix_timers.c:124:3 in start_helper_thread

The MySQL timer initialization code uses a pthread barrier to synchronize the main thread with a timer notification thread that the main thread creates.  The design pattern is:

main:                                                                        timer_notify_thread_func:
barrier_init
pthread_create timer_notify_thread_func
do stuff                                                                     do other stuff
barrier_wait                                                              barrier_wait
barrier_destroy

The Thread Sanitizer reports a potential data race between a barrier wait and the barrier destroy.   IMO, there is no data race because a barrier wait creates a happens before relationship by definition of how barriers work.  No thread can pass the barrier until all threads enter the barrier.  If there is a data race within the barrier, then the implementation is flawed.  

The Thread Sanitizer is being pessimistic about pthread barrier implementations which have been flawed, so it reports a potential error.  One can either (1) not use the above design pattern, or (2) assume that the barrier implementation actually works and suppress this race report.  Given the recent change to the glibc barrier implementation, I would avoid using them.

Tools

clang 3.8
Ubuntu 14.04
MySQL 5.7.8


Thursday, September 3, 2015

A Monster Test Crashes the Thread Sanitizer

One of the TokuFT (now PerconaFT) tests crashed the Thread Sanitizer that is included in clang 3.5.   The crash occurred at run time after the test was compiled.   I instrumented TokuFT to trace all of the mutex initialization, lock, and unlock functions called by the failing test.  This trace was used to create a monster test case with 12,258 mutexes and over 445K mutex operations.  Unfortunately, this crash does not occur with the latest clang 3.8 (thanks Dmitry for investigating this), so it is time to upgrade to a version of clang with this bug fixed.

In addition, I ran into another Thread Sanitizer crash in its deadlock detector and a workaround was quickly produced.  Thanks Kostya.

Lessons remembered: check if the problem was reported previously, and always try the latest and greatest version of the software.   The problem could already be fixed.