Monday, December 17, 2018
Hangs when using trylock reader writer lock functions
The pthreads reader writer locks have a couple of problems that can lead to application hangs. The trylock functions can cause other threads to hang indefinitely in the rdlock or wrlock functions even after the lock is no longer held. If these hanging threads hold other locks, the application can deadlock and grind to a halt. This bug was reported in glibc bug 23844 and has existed since glibc 2.25 was released. The 2.25 version includes a new reader writer lock implementation which replaces large critical sections with compare and exchange atomic instructions. The code paths for an uncontended lock should be very fast. Unfortunately, a couple of bugs that can hang the application need to be fixed.
The trywrlock function is a non-blocking function that tries to grab a write lock on a reader writer lock. Unfortunately, this function is missing logic that the wrlock function has when the phase of the reader writer lock transitions from read phase to write phase. This missing logic is intended to synchronize the state of the lock with other reader threads so that when the write lock is released, blocked reader threads are awakened. Since the logic is missing, the reader threads are not awakened.
The tryrdlock function is a non-blocking function that tries to grab a read lock on a reader writer lock. Unfortunately, this function is missing logic that the rdlock function has when the phase of the reader writer lock transitions from write phase to read phase. This missing logic awakens reader threads that are waiting for the read phase; the thread executing tryrdlock just happened to be the one that completed the phase transition first. Since the logic is missing, those waiting reader threads are not awakened.
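To make the failure mode concrete, here is a minimal sketch of the trywrlock scenario described above. It is my own illustration rather than the test case from the bug report; the thread structure and timing are assumptions, and whether it actually hangs depends on the glibc build.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

static void *reader(void *arg)
{
    (void) arg;
    /* Blocks waiting for the read phase.  With the missing wake-up logic
       described above, this thread may never be awakened, even after the
       write lock is released.  */
    pthread_rwlock_rdlock(&lock);
    puts("reader acquired the lock");
    pthread_rwlock_unlock(&lock);
    return NULL;
}

int main(void)
{
    /* Grab the write lock with the non-blocking trywrlock.  */
    if (pthread_rwlock_trywrlock(&lock) != 0)
        return 1;

    pthread_t t;
    pthread_create(&t, NULL, reader, NULL);
    sleep(1);                      /* let the reader block on the lock */

    pthread_rwlock_unlock(&lock);  /* the reader should be awakened here */
    pthread_join(t, NULL);         /* may hang on an affected glibc */
    return 0;
}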
There are patches to the trywrlock and tryrdlock functions in the glibc bug report that fix these bugs. Hopefully, these patches will be included in a new glibc release soon.
The MySQL server uses reader writer locks for various purposes. I wonder if it is affected by these reader writer lock bugs.
Wednesday, November 7, 2018
Spins and hangs with one kind of reader writer lock
We discovered problems with one kind of pthread reader writer lock (rwlock) on Linux. Sometimes, a thread requesting a read lock on a rwlock that is already locked will spin on the lock state rather than block on the lock's futex. This spin continues until the current lock holders complete their work and release their locks, which wastes resources. The second problem is worse than the first: sometimes a thread requesting a read lock can be blocked indefinitely on the lock's futex, even after the lock is free. If the blocked thread is holding other locks, the application can deadlock and grind to a halt.
There are three kinds of pthread reader writer locks. A "prefer reader" kind of lock (the default kind of lock) will allow additional read locks to be granted while there is a waiting write lock request. Read lock requests are given preference over write lock requests. A "prefer writer" kind of lock gives write lock requests precedence over read lock requests. Finally, a "prefer writer nonrecursive" kind of lock is like a "prefer writer" kind that throttles new read locks when there is a write request pending. The two problems occur because of a bug in the glibc pthreads implementation of the "prefer writer nonrecursive" kind of lock. The other kinds of rwlocks do not exhibit these problems.
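For reference, the kind of a lock is selected with the GNU-specific pthread_rwlockattr_setkind_np function. Here is a minimal sketch of creating a "prefer writer nonrecursive" lock (the helper name is mine; building requires _GNU_SOURCE):

#define _GNU_SOURCE
#include <pthread.h>

static pthread_rwlock_t rwlock;

static int init_writer_nonrecursive(void)
{
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    /* The other kinds are PTHREAD_RWLOCK_PREFER_READER_NP (the default)
       and PTHREAD_RWLOCK_PREFER_WRITER_NP.  */
    pthread_rwlockattr_setkind_np(&attr, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    int rc = pthread_rwlock_init(&rwlock, &attr);
    pthread_rwlockattr_destroy(&attr);
    return rc;
}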
The rwlock code manipulates user-level lock state using atomic operations and is optimized for the uncontended cases. When there is lock contention, the calling thread is blocked on a futex. The state of a rwlock, including a count of the number of readers and whether the rwlock is in read phase or write phase, is stored in rwlock->readers. There is other state stored in each rwlock; however, it is not necessary to include it in this discussion.
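For orientation, here is how I read the layout of rwlock->readers in glibc/nptl/pthread_rwlock_common.c; the flag names come from that file, but the bit positions should be checked against your glibc version:

/* rwlock->readers, as used by the pseudo-code below:
     bit 0    PTHREAD_RWLOCK_WRPHASE   - the lock is in a write phase
     bit 1    PTHREAD_RWLOCK_WRLOCKED  - a writer holds or has requested the lock
     bit 2    PTHREAD_RWLOCK_RWAITING  - a reader is waiting (the RWAITING flag below)
     bits 3+  the reader count (shifted by PTHREAD_RWLOCK_READER_SHIFT)  */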
Here is the pseudo-code for the acquire read lock function (pthread_rwlock_rdlock) that supports the "prefer writer nonrecursive" kind of lock. It may be useful to also compare the pseudo-code with the C code in glibc/nptl/pthread_rwlock_common.c.
1. Load a copy of the lock state into a local variable "r".
r = atomic load(rwlock->readers)
2. Check if the lock is not in write phase, and there is a pending writer, and there are readers. All of this is encoded in rwlock->readers.
3. Set the reader waiting flag.
atomic compare and exchange(rwlock->readers, r, r | RWAITING)
The atomic compare and exchange does this atomically when it succeeds:
if rwlock->readers == r then rwlock->readers = r | RWAITING
Wait until some other thread clears the RWAITING flag when manipulating this rwlock.
4. while (atomic load(rwlock->readers) & RWAITING)
5. futex wait(rwlock->readers, r)
The futex wait does this atomically:
if rwlock->readers == r then sleep on some kernel object else return an error.
6. Increment the reader count and, if in read phase, return to the caller. Otherwise wait for the lock to transition from write phase to read phase.
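Here are the same steps rendered as simplified C, using GCC atomic builtins and a raw futex system call. This is an illustration of the pseudo-code, not glibc's actual code; the flag values follow the layout sketched earlier, and the CAS retry loop that glibc uses is omitted.

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define WRPHASE  1u   /* the lock is in a write phase */
#define WRLOCKED 2u   /* a writer holds or has requested the lock */
#define RWAITING 4u   /* a reader is waiting */

static void wait_until_rwaiting_cleared(unsigned int *readers)
{
    unsigned int r = __atomic_load_n(readers, __ATOMIC_ACQUIRE);   /* step 1 */
    /* Step 2: not in a write phase, a writer is pending, and there are readers.  */
    if ((r & WRPHASE) == 0 && (r & WRLOCKED) != 0 && (r >> 3) != 0)
    {
        /* Step 3: set the reader waiting flag.  Note that the local copy r is
           NOT updated to include RWAITING when the compare and exchange succeeds.  */
        __atomic_compare_exchange_n(readers, &r, r | RWAITING, 0,
                                    __ATOMIC_RELAXED, __ATOMIC_RELAXED);
        /* Steps 4 and 5: the futex wait compares against the stale value r,
           which lacks RWAITING, so it can return immediately (the spin) or
           block on a briefly restored value (the hang).  */
        while (__atomic_load_n(readers, __ATOMIC_ACQUIRE) & RWAITING)
            syscall(SYS_futex, readers, FUTEX_WAIT_PRIVATE, r, NULL, NULL, 0);
    }
}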
The spin problem occurs in steps (4) and (5), where rwlock->readers != r, because rwlock->readers has the RWAITING flag set (step 3) but the local copy r may not (step 1). This causes the futex wait in step (5) to return immediately rather than block, because that is how a futex wait works. So, the code continually executes (4) and (5) until another thread releases a read lock (changing rwlock->readers) or acquires and releases a write lock. Both of these operations change rwlock->readers, turn off the RWAITING flag, and wake up the futex if the RWAITING flag was originally set.
The calling thread may end up blocked in the futex wait (5) indefinitely. This occurs when the state of the rwlock (rwlock->readers) is changed between steps (4) and (5) back to the original value read in step (1), and some other thread's futex wake-up occurs before this thread calls futex wait. This thread then blocks in the futex wait (5) since rwlock->readers == r, and may never be awakened again, depending on the application, because the RWAITING flag is now clear.
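For completeness, this is the kind of workload under which the spin and the hang show up: many reader and writer threads contending on one "prefer writer nonrecursive" lock. The sketch below is my own, not taken from an actual failing test, and the thread and iteration counts are arbitrary; whether it misbehaves depends on timing and on the glibc version.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t lock;

static void *reader(void *arg)
{
    (void) arg;
    for (int i = 0; i < 100000; i++)
    {
        pthread_rwlock_rdlock(&lock);   /* may spin or hang on an affected glibc */
        pthread_rwlock_unlock(&lock);
    }
    return NULL;
}

static void *writer(void *arg)
{
    (void) arg;
    for (int i = 0; i < 100000; i++)
    {
        pthread_rwlock_wrlock(&lock);
        pthread_rwlock_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setkind_np(&attr, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    pthread_rwlock_init(&lock, &attr);

    pthread_t threads[8];
    for (int i = 0; i < 8; i++)
        pthread_create(&threads[i], NULL, (i % 2) ? writer : reader, NULL);
    for (int i = 0; i < 8; i++)
        pthread_join(threads[i], NULL);
    puts("done");
    return 0;
}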
The rwlock spins and hangs were originally discovered through sporadic failures of one of the Tokutek fractal tree test cases running on Ubuntu 18.10 with glibc 2.28. It is likely that the bug has some effect on TokuDB as well. The rwlock bug was probably introduced in glibc 2.25 with the new rwlock implementation. A fix for glibc bug 23861 already exists, so it is likely to be included in the glibc 2.29 release.