We discovered two problems with one kind of pthread reader writer lock (rwlock) on Linux. First, a thread requesting a read lock on an already-locked rwlock sometimes spins on the lock state rather than blocking on the lock's futex. The spin continues until the current lock holders complete their work and release their locks, wasting CPU. The second problem is worse than the first: sometimes a thread requesting a read lock blocks indefinitely on the lock's futex, even after the lock becomes free. If the blocked thread is holding other locks, the application can deadlock and grind to a halt.
There are three kinds of pthread reader writer locks. A "prefer reader" lock (the default) allows additional read locks to be granted while a write lock request is waiting; read lock requests are given preference over write lock requests. A "prefer writer" lock gives write lock requests precedence over read lock requests. Finally, a "prefer writer nonrecursive" lock is like a "prefer writer" lock that also throttles new read locks while a write request is pending. The two problems occur because of a bug in the glibc pthreads implementation of the "prefer writer nonrecursive" kind of lock. The other kinds of rwlocks do not exhibit these problems.
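For reference, an application opts into the affected kind of lock through the GNU extension pthread_rwlockattr_setkind_np. A minimal setup looks like this (the helper name is ours):

#define _GNU_SOURCE
#include <pthread.h>

/* Initialize a "prefer writer nonrecursive" rwlock. The other kinds use
   PTHREAD_RWLOCK_PREFER_READER_NP (the default) and
   PTHREAD_RWLOCK_PREFER_WRITER_NP. */
int init_wrnonrec_rwlock(pthread_rwlock_t *lock)
{
    pthread_rwlockattr_t attr;
    int r = pthread_rwlockattr_init(&attr);
    if (r != 0)
        return r;
    r = pthread_rwlockattr_setkind_np(&attr,
            PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    if (r == 0)
        r = pthread_rwlock_init(lock, &attr);
    pthread_rwlockattr_destroy(&attr);
    return r;
}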
The rwlock code manipulates user level lock state with atomic operations and is optimized for the uncontended case. When there is lock contention, the calling thread is blocked on a futex. The state of a rwlock, including a count of the number of readers and whether the rwlock is in read phase or write phase, is stored in rwlock->readers. Each rwlock carries other state as well, but it is not necessary for this discussion.
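For orientation, the flags and the reader count are packed into that single word. This sketch of the layout uses the constant names from glibc 2.28's nptl/pthread_rwlock_common.c; the comments are our summary, not the authoritative definitions:

/* Layout of rwlock->readers: low bits are flags, high bits the count. */
#define PTHREAD_RWLOCK_WRPHASE      1  /* lock is in write phase */
#define PTHREAD_RWLOCK_WRLOCKED     2  /* a writer holds or has claimed the lock */
#define PTHREAD_RWLOCK_RWAITING     4  /* a reader is waiting (RWAITING below) */
#define PTHREAD_RWLOCK_READER_SHIFT 3  /* reader count occupies the bits above */

static inline unsigned int reader_count(unsigned int readers)
{
    return readers >> PTHREAD_RWLOCK_READER_SHIFT;
}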
Here is the pseudo-code for the acquire read lock function (pthread_rwlock_rdlock) as it applies to the "prefer writer nonrecursive" kind of lock. It may be useful to compare this pseudo-code with the C code in glibc/nptl/pthread_rwlock_common.c. A simplified C rendering follows the list.
1. Load a copy of the lock state into a local variable "r".
r = atomic load(rwlock->readers)
2. Check that the lock is not in write phase, that there is a pending writer, and that there are readers. All of this is encoded in rwlock->readers.
3. Set the reader waiting flag.
atomic compare and exchange(rwlock->readers, r, r | RWAITING)
The atomic compare and exchange does this atomically when it succeeds:
if rwlock->readers == r then rwlock->readers = r | RWAITING
Then wait until some other thread clears the RWAITING flag while manipulating this rwlock.
4. while (atomic load(rwlock->readers) & RWAITING)
5. futex wait(rwlock->readers, r)
The futex wait does this atomically:
if rwlock->readers == r then sleep on some kernel object else return an error.
6. Increment the reader count, and if the lock is in read phase, return to the caller. Otherwise, wait for the lock to transition from write phase to read phase.
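Here is the simplified C rendering promised above, covering steps (1) through (5) with GCC __atomic builtins and a raw futex system call. It is a sketch with our own helper names, not the actual glibc code, and it deliberately reproduces the flaw discussed next; the comments flag where.

#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define RWAITING 4  /* PTHREAD_RWLOCK_RWAITING in glibc */

/* futex wait: sleep only if *addr still equals expected, else return. */
static long futex_wait(unsigned int *addr, unsigned int expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT_PRIVATE, expected,
                   NULL, NULL, 0);
}

static void wait_until_rwaiting_cleared(unsigned int *readers)
{
    /* (1) Load a copy of the lock state into r. */
    unsigned int r = __atomic_load_n(readers, __ATOMIC_ACQUIRE);

    /* (2) Write-phase, pending-writer, and reader-count checks omitted. */

    /* (3) Set the reader waiting flag; the real code retries if the CAS
       fails. Note that r itself is NOT updated to include RWAITING. */
    unsigned int expected = r;
    __atomic_compare_exchange_n(readers, &expected, r | RWAITING,
                                0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);

    /* (4)+(5) Wait for RWAITING to be cleared. Since *readers now has
       RWAITING set but r may not, the kernel sees *readers != r and
       returns instead of sleeping, so this loop burns CPU until another
       thread changes *readers and clears the flag. */
    while (__atomic_load_n(readers, __ATOMIC_ACQUIRE) & RWAITING)
        futex_wait(readers, r);
}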
The spin problem occurs in steps (4) and (5), where rwlock->readers != r because rwlock->readers has the RWAITING flag set (step 3) but the local copy r may not (step 1). This causes the futex wait in step (5) to return immediately rather than block, because that is how a futex wait works. So, the code continually executes steps (4) and (5) until some other thread either releases a read lock (changing rwlock->readers) or acquires and releases a write lock. Both of these operations change rwlock->readers, turn off the RWAITING flag, and wake the futex waiters if the RWAITING flag was originally set.
The calling thread may instead end up stalled indefinitely in the futex wait of step (5). This occurs when the state of the rwlock (rwlock->readers) is changed between steps (4) and (5) back to the original value read in step (1), and the futex is awakened by some other thread before this thread calls futex wait. This thread then blocks in the futex wait of step (5), since rwlock->readers == r, and may never be awakened again, depending on the application, because the RWAITING flag is now clear.
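Both problems can be provoked with a stress test along the following lines: many readers and one writer hammering a "prefer writer nonrecursive" rwlock. On an affected glibc the spin shows up as bursts of high CPU usage and the hang as a stuck reader thread. The thread and iteration counts below are arbitrary choices, not taken from the original test case.

/* Build with: gcc -pthread stress.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

#define NREADERS 8
#define NITERS 1000000

static pthread_rwlock_t lock;

static void *reader(void *arg)
{
    (void) arg;
    for (long i = 0; i < NITERS; i++) {
        pthread_rwlock_rdlock(&lock);
        pthread_rwlock_unlock(&lock);
    }
    return NULL;
}

static void *writer(void *arg)
{
    (void) arg;
    for (long i = 0; i < NITERS; i++) {
        pthread_rwlock_wrlock(&lock);
        pthread_rwlock_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_rwlockattr_t attr;
    pthread_rwlockattr_init(&attr);
    pthread_rwlockattr_setkind_np(&attr,
        PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    pthread_rwlock_init(&lock, &attr);

    pthread_t readers[NREADERS], w;
    for (int i = 0; i < NREADERS; i++)
        pthread_create(&readers[i], NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);

    for (int i = 0; i < NREADERS; i++)
        pthread_join(readers[i], NULL);
    pthread_join(w, NULL);
    printf("done\n");
    return 0;
}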
The rwlock spins and hangs were originally discovered through sporadic failures of one of the Tokutek fractal tree test cases running on Ubuntu 18.10 with glibc 2.28. It is likely that the bug affects TokuDB as well. The rwlock bug was probably introduced in glibc 2.25 with a new rwlock implementation. A fix for glibc bug 23861 already exists and is likely to be included in the glibc 2.29 release.