We recently found two races in the fractal tree lock manager that significantly affect transactional throughput for some applications that use a small number of concurrent transactions. These races manifest as transactions unnecessarily waiting for an available lock. Since the stall is resolved by a lock timeout and the lock timeout period can be very long, the throughput of the application can be extremely low. We discuss these races and the changes to the lock manager that fix them.
The release before wait race involves two transactions that are contending for the same lock. Here is the sequence of events for transactions A and B, and lock L that demonstrates the release before wait race:
A acquires L
B tries to acquire L, conflicts with A
A releases L
A retries pending lock requests, B's is not found
B adds itself to the pending lock request set
B waits for L even though L is free
The race occurs when transaction A releases the lock after transaction B tried to acquire it but before transaction B has a chance to register its pending lock request. There are several ways to fix this problem, but we want to optimize for the common situation of minimal lock conflicts, which is what the lock acquisition algorithm currently does. Our solution to the release before wait race is for transaction B to retry its lock request after it has been added to the pending lock request set.
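To make the fix concrete, here is a minimal sketch of the idea in C++. The names (single_lock, try_grant, m_pending, and so on) are ours for illustration, not the actual TokuDB code, and the sketch models a single exclusive lock. The conflict check runs outside the mutex that protects the pending request set, which is exactly the window the race exploits; retrying once after the request has been registered closes that window.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <set>

struct txn;  // opaque transaction handle

struct lock_request {
    txn *requester;
    bool granted = false;
};

// Models a single lock L.  The owner is checked without taking the mutex
// that protects the pending request set, which is where the release
// before wait window comes from.
class single_lock {
    std::atomic<txn *> m_owner{nullptr};
    std::mutex m_request_mutex;           // protects m_pending
    std::set<lock_request *> m_pending;   // pending lock request set
    std::condition_variable m_wakeup;

    // Grant the lock if it is free (or already owned by the requester).
    bool try_grant(lock_request *req) {
        txn *expected = nullptr;
        if (m_owner.compare_exchange_strong(expected, req->requester) ||
            expected == req->requester) {
            req->granted = true;
            return true;
        }
        return false;
    }

public:
    void acquire(lock_request *req) {
        if (try_grant(req))
            return;                        // common case: no conflict

        std::unique_lock<std::mutex> lk(m_request_mutex);
        m_pending.insert(req);             // register the pending request

        // The conflicting transaction may have released the lock between the
        // failed try_grant() above and the insert(); retry once now that the
        // request is visible to releasers, instead of waiting for a timeout.
        if (try_grant(req)) {
            m_pending.erase(req);
            return;
        }
        m_wakeup.wait(lk, [req] { return req->granted; });
        m_pending.erase(req);
    }

    void release(txn *t) {
        txn *expected = t;
        m_owner.compare_exchange_strong(expected, nullptr);

        // Retry pending lock requests on behalf of waiters.
        std::lock_guard<std::mutex> lk(m_request_mutex);
        for (lock_request *req : m_pending) {
            if (try_grant(req))
                break;                     // exclusive lock: grant one waiter
        }
        m_wakeup.notify_all();
    }
};
```

Note that the fast path still grants the lock without touching the pending request set at all, which preserves the low overhead of the common, conflict-free case.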
The retry while retry race involves three transactions that are contending for the same lock. Here is the sequence of events for transactions A, B, and C, and lock L that demonstrates the retry while retry race:
A acquires L
B tries to acquire L, conflicts with A
C tries to acquire L, conflicts with A
A releases L
A retries pending lock requests, grants L to B, gets suspended by the kernel before it finishes the lock retry loop
B releases L
B tries to retry pending lock requests, fails because lock retry is already busy
A finishes lock retry
A and B do something else, C is stuck waiting for L even though L is free
The race occurs because the current lock retry algorithm assumes that if some transaction is already running lock retry, then the current transaction does not also need to run it. As a result, some pending lock requests can be skipped, and those lock requests will eventually time out. For applications with a small number of concurrent transactions, these timeouts occur frequently and the application throughput becomes very low.
The solution to the retry while retry race is a group retry algorithm. All threads run through the retry logic. Sequence numbers are used to batch retries into groups so that one transaction can run the retry logic on behalf of several transactions, which amortizes the lock retry cost over all of the retried transactions.
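Below is a hedged sketch of how a sequence-number group retry can work. It is our illustration of the idea rather than the actual TokuDB implementation, and the names (group_retry, m_retry_wanted, m_retry_done, retry_pending_requests) are made up. Each caller takes a sequence number, one thread runs the retry pass on behalf of everyone who asked before the pass started, and later callers wait for a covering pass instead of skipping.

```cpp
#include <condition_variable>
#include <cstdint>
#include <functional>
#include <mutex>

// Group retry: every releasing transaction asks for a retry pass over the
// pending lock requests.  One thread runs the pass on behalf of the whole
// group; the others wait until a pass that covers their request completes,
// instead of silently skipping the retry.
class group_retry {
    std::mutex m_mutex;
    std::condition_variable m_cv;
    std::uint64_t m_retry_wanted = 0;  // sequence number of the latest retry request
    std::uint64_t m_retry_done = 0;    // sequence number covered by completed passes
    bool m_retry_running = false;      // is some thread currently running a pass?

public:
    // retry_pending_requests walks the pending lock request set and tries to
    // grant each request (supplied by the caller; details omitted here).
    void retry_all(const std::function<void()> &retry_pending_requests) {
        std::unique_lock<std::mutex> lk(m_mutex);
        std::uint64_t my_pass = ++m_retry_wanted;   // join the current retry group

        while (m_retry_done < my_pass) {
            if (!m_retry_running) {
                // Become the retrier.  This pass covers every retry requested
                // so far, so it runs on behalf of the whole group.
                std::uint64_t covers = m_retry_wanted;
                m_retry_running = true;
                lk.unlock();
                retry_pending_requests();
                lk.lock();
                m_retry_running = false;
                m_retry_done = covers;
                m_cv.notify_all();
            } else {
                // Another thread is running a pass.  Do not skip (that is the
                // retry while retry race); wait for a covering pass instead.
                m_cv.wait(lk);
            }
        }
    }
};
```

With this scheme a thread never silently drops its retry obligation: either it runs a pass itself or it waits until a pass that started after its release has completed, so no pending lock request is left behind to time out unnecessarily.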
One of the applications affected by these lock manager races is MariaDB's optimistic parallel replication algorithm. The number of concurrent transactions is limited by the number of parallel threads and can be relatively small, which makes the optimistic scheduler susceptible to the fractal tree lock manager races.
The optimistic scheduler looks outside of the group commit boundaries in the binary log to assign replication work to replication threads. Sometimes, the parallel work has locking conflicts. When this occurs, the conflicting transactions are aborted and the work schedule is changed to avoid the conflicts.
When the optimistic scheduler needs to abort the replication work, it needs a mechanism to kill any thread that may be waiting for a lock inside of a storage engine. MariaDB calls the handlerton kill query method, which causes TokuDB to find the pending lock request associated with the client thread and kill it. This allows the client thread to subsequently abort its transaction.
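As a rough sketch of that mechanism (again with made-up names, not TokuDB's or MariaDB's actual API), the waiting client thread and the kill path can coordinate through the pending lock request state like this:

```cpp
#include <condition_variable>
#include <map>
#include <mutex>

struct txn;  // opaque transaction handle

enum class lock_wait_result { granted, killed };

// A pending lock request can be completed either by a grant or by a kill.
// The kill path (driven by the server's kill-query callback) marks the
// request and wakes the waiting client thread, which then returns an error
// so the transaction can be aborted.
class pending_lock_requests {
    struct request {
        bool granted = false;
        bool killed = false;
    };
    std::mutex m_mutex;
    std::condition_variable m_wakeup;
    std::map<txn *, request> m_requests;   // one pending request per transaction

public:
    // Called by the client thread after its lock request hit a conflict.
    lock_wait_result wait_for_lock(txn *t) {
        std::unique_lock<std::mutex> lk(m_mutex);
        request &req = m_requests[t];      // register the pending request
        m_wakeup.wait(lk, [&req] { return req.granted || req.killed; });
        lock_wait_result result = req.granted ? lock_wait_result::granted
                                              : lock_wait_result::killed;
        m_requests.erase(t);
        return result;
    }

    // Called from the kill path: find the pending lock request associated
    // with the transaction and complete it.
    void kill(txn *t) {
        std::lock_guard<std::mutex> lk(m_mutex);
        auto it = m_requests.find(t);
        if (it != m_requests.end()) {
            it->second.killed = true;
            m_wakeup.notify_all();
        }
    }

    // Called by the lock manager when the lock becomes available.
    void grant(txn *t) {
        std::lock_guard<std::mutex> lk(m_mutex);
        auto it = m_requests.find(t);
        if (it != m_requests.end()) {
            it->second.granted = true;
            m_wakeup.notify_all();
        }
    }
};
```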