There is a long history of tweaking the binary log group commit algorithm and the InnoDB storage engine to scale performance. Mark Callaghan describes the work done on this problem at Facebook. Kristian Nielsen describes how this problem was fixed in MariaDB. Mats Kindahl describes how this problem was fixed in MySQL 5.6. Now, there is yet another tweak to the algorithm.
Since the binary log group commit algorithm in MySQL 5.7 changes the interaction between the MySQL transaction commit logic and the storage engine, a small change to TokuDB is required to use the new algorithm. I decided to run the sysbench non-indexed update test with the InnoDB and TokuDB storage engines on MySQL 5.7 to test this new algorithm.
InnoDB experiment
The sysbench non-indexed update test is run on a small table that fits entirely in the InnoDB in-memory buffer pool. A small table is chosen to focus the test on the group commit algorithm and its disk syncs. A consumer-grade SSD with an fsync time of about 3 milliseconds is used to store the data. The sysbench update throughput is reported as a function of the number of concurrent threads updating the table. The new binary log group commit algorithm increases application throughput by about 20%.
TokuDB experiment
The sysbench non-indexed update test is run on a small table that fits entirely in the TokuDB in-memory cache. The test labelled 'mysql 5.7.7 tokudb flush' required a change to the TokuDB prepare function to work with the new binary log group commit algorithm.
Sysbench throughput on Percona Server 5.6.24 and MySQL 5.7.7 is about the same, as expected. When TokuDB is changed to use the MySQL 5.7 binary log group commit algorithm, sysbench non-indexed update throughput increases by about 30%.
What is the new binary log group commit algorithm?
MySQL uses a two-phase commit algorithm to coordinate the state of the MySQL binary log with the state of the storage engine recovery log.
Here is the MySQL 5.6 commit logic and its interaction with the TokuDB storage engine. The interaction with InnoDB is similar.
MySQL commit logic calls TokuDB prepare:
(1) write a prepare recovery log entry to the in-memory TokuDB recovery log.
(2) write and sync the TokuDB recovery log with a group commit algorithm.
MySQL commit logic writes the binary log:
(3) write transactional changes to the binary log.
(4) sync the binary log with a group commit algorithm.
MySQL commit logic calls TokuDB commit:
(5) write a commit recovery log entry to the TokuDB recovery log. A sync of the TokuDB recovery log is unnecessary here for proper recovery.
Transaction durability occurs when the storage engine recovery log and the binary log are synced to disk in steps (2) and (4).
During crash recovery, the MySQL transaction coordinator gets a list of all of the prepared transactions from the storage engine and will commit those prepared transactions that are present in the binary log. If the prepared transaction is not in the binary log, as is the case if the crash occurs before the binary log is synced to disk, then the transaction is aborted.
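The recovery rule above can be sketched in a few lines. This is a minimal illustration and not MySQL's actual code: the function name `resolve_prepared` and the integer transaction ids are assumptions made for the example.

```python
def resolve_prepared(prepared_xids, binlog_xids):
    """Decide the fate of each transaction the storage engine reports as prepared.

    A prepared transaction is committed only if it also appears in the
    binary log; otherwise it is aborted.
    """
    in_binlog = set(binlog_xids)
    return {
        xid: ("commit" if xid in in_binlog else "abort")
        for xid in prepared_xids
    }


# Hypothetical crash: transaction 103 was prepared in the storage engine,
# but the crash happened before the binary log containing it was synced.
decisions = resolve_prepared(prepared_xids=[101, 102, 103],
                             binlog_xids=[100, 101, 102])
print(decisions)
```

Transaction 100 does not appear in the result because the engine no longer reports it as prepared; only transactions still in the prepared state need a decision.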
Disk syncs are really expensive compared to memory operations. A group commit algorithm is used to amortize the sync cost over multiple transactions. Since the expected sync time is about the same for both steps (2) and (4), we expect about the same number of threads to be group committed together.
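To make the amortization concrete, here is a rough back-of-the-envelope model. It assumes syncs at one sync point run back to back and that nothing else limits throughput, so it is only an upper bound and not a prediction of the measured results. The function name is made up for this example, and the 3 millisecond figure is the fsync time of the SSD used in the experiments.

```python
def commits_per_second_bound(sync_seconds, group_size):
    """Upper bound on commit rate at one sync point.

    Each sync takes sync_seconds and makes group_size transactions durable.
    """
    if sync_seconds <= 0 or group_size < 1:
        raise ValueError("sync_seconds must be > 0 and group_size must be >= 1")
    return group_size / sync_seconds


# With a 3 ms fsync, one transaction per sync is limited to roughly
# 333 commits per second; larger groups raise the bound proportionally.
for group_size in (1, 4, 16):
    bound = commits_per_second_bound(0.003, group_size)
    print(group_size, round(bound))
```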
Here is the MySQL 5.7 commit logic and its interaction with the TokuDB storage engine.
MySQL commit logic calls TokuDB prepare:
(1) write a prepare recovery log entry to the in-memory TokuDB recovery log.
MySQL commit logic calls TokuDB flush logs:
(2) write and sync the TokuDB recovery log.
MySQL commit logic writes the binary log:
(3) write transactional changes to the binary log.
(4) sync the binary log with a group commit algorithm.
MySQL commit logic calls TokuDB commit:
(5) write a commit recovery log entry to the TokuDB recovery log. A sync of the TokuDB recovery log is unnecessary for proper recovery.
MySQL 5.7 groups threads together when executing steps (2) through (5). A leader thread is chosen to execute these actions. Steps (2) through (4) are executed by the leader thread on behalf of many other threads. Since there is only one point where threads are collected for group commit (step 4), the sync cost is amortized over more threads than on MySQL 5.6, which has two places where group commit occurs.
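The MySQL 5.7 sequence can be modelled with a small single-threaded sketch. It is a toy model under stated assumptions, not the server's implementation: the `FakeLog` class and the `commit_group_57` function are invented for illustration, and the leader election and thread queueing are reduced to one function call over a list of transaction ids.

```python
class FakeLog:
    """Toy log with an in-memory buffer and a counter of disk syncs."""

    def __init__(self):
        self.buffer = []    # written but not yet durable
        self.durable = []   # entries covered by a sync
        self.syncs = 0

    def write(self, entry):
        self.buffer.append(entry)

    def sync(self):
        self.durable.extend(self.buffer)
        self.buffer.clear()
        self.syncs += 1


def commit_group_57(xids, engine_log, binlog):
    """Run steps (1) through (5) for a group of transactions."""
    # (1) each transaction writes its prepare entry in memory, with no sync
    for xid in xids:
        engine_log.write(("prepare", xid))
    # (2) flush logs: one engine log sync covers every prepared transaction
    engine_log.sync()
    # (3) write the group's changes to the binary log
    for xid in xids:
        binlog.write(xid)
    # (4) one binary log sync for the whole group
    binlog.sync()
    # (5) commit entries are written without another engine log sync
    for xid in xids:
        engine_log.write(("commit", xid))


engine_log, binlog = FakeLog(), FakeLog()
commit_group_57([1, 2, 3], engine_log, binlog)
print(engine_log.syncs, binlog.syncs)
```

In this model, three transactions cost one sync of each log. The commit entries from step (5) remain in the engine log buffer, which matches the observation that a sync there is unnecessary for proper recovery.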
TokuDB changes
TokuDB detects that the new two-phase commit algorithm is being used and does not sync its recovery log in the storage engine prepare (step 2). TokuDB assumes that MySQL will call its flush logs function to write and sync its recovery log. See this GitHub commit for details.
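The shape of that change can be sketched as follows. This is an illustration only and not the code in the commit: the `RecoveryLog` class, the `prepare` and `flush_logs` functions, and the `server_flushes_logs` flag are all hypothetical names standing in for however the engine detects the new algorithm.

```python
class RecoveryLog:
    """Toy recovery log that counts syncs."""

    def __init__(self):
        self.buffer = []
        self.durable = []
        self.syncs = 0

    def write(self, entry):
        self.buffer.append(entry)

    def sync(self):
        self.durable.extend(self.buffer)
        self.buffer.clear()
        self.syncs += 1


def prepare(log, xid, server_flushes_logs):
    """Write the prepare entry; sync only if the server will not flush later."""
    log.write(("prepare", xid))
    if not server_flushes_logs:
        # Old behaviour (MySQL 5.6 style): prepare must make the log durable.
        log.sync()


def flush_logs(log):
    """Called by the server's group commit leader under the new algorithm."""
    log.sync()


log = RecoveryLog()
prepare(log, xid=1, server_flushes_logs=True)
prepare(log, xid=2, server_flushes_logs=True)
flush_logs(log)  # one sync makes both prepare entries durable
print(log.syncs, log.durable)
```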