2016-02-21 19:37 UTC

The Cost Of Consistency: When 9ms Is 1s Too Much

The CAP theorem remains one of the most important tools in my software developer's toolbelt. Used correctly, it can help create services and products that can offer an excellent user experience and protect revenue during partial failures. Used incorrectly, it can lead to a poor user experience and loss of revenue. Throughout this article, I look to explain the unexpected latency of specific transactions in MySQL, a CP system.

This is a follow-up to my last article, and, as you may recall, I referred to observations of unexpected log entries in a MySQL slow query log. And, while the proposed changes completely eliminated those specific entries, new ones that were initially even more difficult to reason about arose. Using the same example database created in the last article, we will look to find the root cause of a new, resultant slow query log entry.

Testing Notes

The data presented in this article were reproduced and collected under fairly controlled circumstances. While attempting to diagnose the issues, I tried to minimise noise while keeping variables reasonably controlled. For example, I ran MySQL entirely on a RAM disk and used local loopback interface to reduce the variability of external hardware. While this led to some fairly sterilized data, I believe these controls were necessary. All slow query logs below consider queries taking longer than 100 microseconds as "slow".

Next, if you question why I am using optimistic concurrency control in this example, you're absolutely right. Normally, OCC is used when conflicts are unlikely, but in this article's example, I am purposefully creating conflicts as quickly as possible. I am doing this for one specific reason: to highlight the behaviour during conflict, which, while maybe unlikely throughout the entire life of some data, may happen frequently during specific phases of its life.

Mystery Child

Before going any further, we should take a look at the aforementioned mysterious slow query log digest that was not previously provided. There are numerous other websites on the internet that walk you through the slow query log format and many others that walk you through Percona's pt-query-digest. But, because we're not diagnosing a slow query per se, I will just point out the important parts:

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         64  236087
# Exec time     94   2794s   113us    76ms    12ms    26ms     7ms    10ms
# Lock time     95   1361s     7us    51ms     6ms    18ms     7ms     2ms
# rec lock wai  97   2470s       0    73ms    10ms    23ms     7ms     9ms
INSERT INTO child(id, parent_id, reference) VALUES(1455575750201, 1, 1455575750201)\G

In this digest that was compiled from a reproduced data set, we can see 236,087 INSERT INTO child statements that took longer than 100 microseconds. Of these 236,087, the 95th percentile insert time was 26ms with a maximum of 76ms. Because this is a fairly straight-forward INSERT into a fairly lightly-indexed table, these numbers seemed remarkably high. Looking a bit closer, you may see that the majority of the query times are actually on rec lock wai[t].

But what is rec lock wait? I'm glad you asked. This refers to the time waited to acquire a lock on a record. This may either be a read lock or a write lock on a specific row in a table. Reflecting back to the previous article, you may recall that during an INSERT INTO child, numerous rows were locked, even unrelated ones. This caused the high amount of contention as you can see in the digest.

As noted, however, reordering the UPDATE and INSERT queries completely eliminated this specific entry from showing up in the slow query log. However, in doing so, another mole, waiting for a firm whack, popped up its ugly head.

Annoying Parent

As a result of the query reordering, the parent table became the most contended table, as shown below in the slow query digest of a reproduction, using the controlled system discussed above, of the production data.

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         92    3927
# Exec time     99   4391s   174us      1s      1s      1s    43ms      1s
# Lock time     99   4391s    59us      1s      1s      1s    43ms      1s
# rec lock wai 100   4391s       0      1s      1s      1s    43ms      1s
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 2844953\G

If, like me, you're a bit distressed as to why a simple UPDATE by primary key is showing up in the slow query log, we should consider a number of possible reasons:

Swapping. Typically, when a system does not have enough memory, or your system is misconfigured to use an excessively large InnoDB Buffer Pool, for example, the operating system may begin using a reserved portion of disk as additional memory. Given that disk is several orders of magnitude slower than memory, this will naturally decrease the throughput of a system. In this scenario, however, swap was not a factor.
Disk speed. As a part of the Durability guarantee of ACID, MySQL needs to write to disk. If a system is equipped with a disk that is unable to handle all of the reads and writes on the system, each statement will have to wait its turn. In this example, however, disk speed was not a factor, as MySQL ran off a RAM disk on an under-utilised system.
System load. While swapping and IO waits will cause a high system load, so too will compute-bound operations on a CPU. In the case of over-utilisation, each statement will need wait additional time to get its quantum. In this scenario, system load was negligible.
Bad hardware. I've encountered a number of exotic failures that can simply be chalked-up to subtle (and not-so-subtle) hardware failures. For example, flaky RAID controllers, flaky disks, bad memory, old ethernet cables, flaky bonding, fluctuating power, etc. These can be hard to find, but after normal fail-over testing and analysis, this problem can be ruled out.
Kernel mutex. Prior to MySQL 5.6, InnoDB used a global mutex, the kernel_mutex, upon which a significant number of operations waited, including the commit or rollback of a transaction. Under load, the kernel mutex can cause bottlenecks. In this scenario, however, InnoDB showed no sign whatsoever of mutex waits.
Table locking. In MySQL, both DDL operations and explicit requests to lock a table will lock a table. While InnoDB does not necessarily lock tables for each operation, it does place an "intention" lock on a table prior to a read or write. Intention locks are essentially shared locks that indicate a transaction is currently reading from or writing to a table. In this case, the table locks were not an issue.
Row locking. Depending on a transaction's isolation level, InnoDB may lock an individual row in read or write mode for a given operation. Before saying any more, I would like to point out that, because we're using READ COMMITTED, gap locks and next-key locks are effectively out of scope. During an UPDATE, MySQL will lock each affected row to uphold the necessary isolation guarantees. This, I hypothesized, was the issue. But why?

Row Locking

What could cause a row lock? In a READ COMMITTED isolation, according to ACID, a DBMS would acquire a write lock before a write until after the transaction is committed and would acquire a read lock before a read until immediately after the read. The granularity of this lock is ambiguous, but in InnoDB, locking can be, and is configured so in this example, granular to the row.

One possibility is that SELECT statements are taking too long or occur too frequently. Each select statement may acquire a read lock during the read. Taking a quick look at the slow query log shows that neither are likely to be the problem.

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          2      86
# Exec time      0    20ms   140us     3ms   228us   301us   288us   176us
# rec lock wai   0       0       0       0       0       0       0       0
SELECT version FROM parent WHERE id = 1\G

Compared to the UPDATE statement which exhibited 3927 slow queries, we captured only 86 slow SELECT queries during the same period of time. Further, the 95th percentile is merely 301 microseconds, meaning the slowest 5% were all under 3ms. Because the SELECT to UPDATE ratio is 1:1 (and performed against a separate replica database), the read locks during a SELECT do not seem to be the problem.

Other mechanisms, such as SELECT ... FOR UPDATE, or, in InnoDB, SELECT ... LOCK IN SHARE MODE, may also be used to acquire granular locks; however, this example uses neither. As such, we may hypothesise that another transaction is performing some write-locking operation on this row, perhaps the UPDATE.

Discovering The Problem

At this point, our hypothesis is simply that the UPDATE statements are conflicting with themselves. As usual, the first step in diagnosing this issue is reproducing it. Given the controlled system, the following data were collected:

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         91  536434
# Exec time     99   5990s   134us   147ms    11ms    12ms     2ms    11ms
# Lock time     99   5920s    27us   147ms    11ms    12ms     2ms    11ms
# rec lock wai 100   5908s       0   147ms    11ms    12ms     2ms    11ms
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 2856388\G

I would first like to point out the total count of 536,434 slow UPDATE statements. This is the first observation that suggests something may be odd about the production data set: the number of slow queries here is more than 135 times higher than the reproduced production data set for the same time range, but the query times are substantially lower. Looking at the runtime characteristics, the overall throughput of this execution was just over 100 times higher than the reproduced data set, so it may be safe to temporarily disregard this observation: the numbers are relatively close.

Next, observing the query execution times of the statements taking longer than 100 microseconds, the slowest UPDATE statements took 174ms, while the fastest 95% took between 134 microseconds and 12 milliseconds. This also corresponds to the row lock wait.

To remove other variables, we can analyse a scenario of this example without contention.

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         10       4
# Exec time      8   606us   142us   170us   151us   167us    12us   156us
# Lock time     22    65us    16us    17us    16us    16us       0    15us
# rec lock wai   0       0       0       0       0       0       0       0
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 2976567\G

This demonstrates that without contention, the 95th percentile execution of the only 4 UPDATE statements over 100 microseconds is 167 microseconds and no lock wait time whatsoever. Runtime characteristics also suggest throughput is nearly 5000 times higher than the last execution. The lack of record lock waits, combined with the previous results, lets us currently rule-out the storage layer or other InnoDB internals as potential problems unless other evidence points us back.

If we have effectively ruled-out MySQL as the potential problem, one thing remains: the code executing the statements. The only clear correlation we now have is that the applications are somehow holding row locks for longer than necessary. Given that our code is essentially a sequence of UPDATE, INSERT, COMMIT, there is little room for programmatic misbehaviour. However, any kind of delay after invoking the UPDATE will increase the duration of the row's write lock. Let's evaluate some possibilities:

Sleeps or waits. Using sleeps or waits during a database transaction will increase the duration of a transaction causing row locks to beheld for longer than necessary.
- Sleeps. I have often seen developers use sleeps before a "retry", but this kind of behaviour usually just causes the system to a grind under even moderate contention: I strongly urge you to reconsider putting threads to sleep in any user-facing application, leave alone during an active transaction.
- Waits. Generally, waits are implicit and happen when you attempt to acquire a lock or other system resource that will "notify" a waiting thread when some condition is met. Generally, you do not want to put your thread into a waiting state once you have started a transaction. Instead, do this type of work before or after if you can.
This example exhibits neither sleeps nor waits during the life of the transaction.
Network calls. While making a network call will effectively put your thread into a waiting state, you are also increasing the duration of the transaction by the amount of time it takes the external system to return. Ideally, you will never communicate with other external systems during a transaction.
Garbage collection. In an incorrectly tuned runtime with garbage collection, a single garbage collection can freeze up an application for hundreds of milliseconds, or even seconds, causing transactions to last longer than necessary. In this case, running jstat excludes garbage collection from being an issue, as garbage collection never happens during the runtime, and a dump of safepoint statistics shows no other pauses of consequence.
OS stalls and hardware issues. Depending on configuration and condition, both the operating system and hardware can cause noticeable stalls in an application. While these stalls are typically measured in nanoseconds, I've seen applications hang for tens or hundreds of seconds. In this case, we can rule out this issue.
Network latency. The network latency between an application and a database can cause transactions to stay open longer than necessary. The time taken between MySQL's acknowledgement of an UPDATE and the subsequent receipt of a COMMIT can be critical.

The Missing 991 Milliseconds

In this case, of the above causes, the only variable that cannot be ruled out is network latency. As such, understanding the topology of the system in question now comes into scope: a single MySQL master database sits in one of multiple datacenters and has client connections over a WAN. Given a typical RTT is 9ms with a 1-2ms jitter, we have our first variable.

When InnoDB attempts to acquire a row lock, it will first evaluate if the record's page of the clustered index has any locks. If not, it will attempt to immediately acquire the lock; otherwise, InnoDB will enqueue the write lock request into a write lock queue. Using a queue ensures fairness, such that the request waiting longest will receive the lock next.

Given the following algorithm

BEGIN;
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0;
IF ROW_COUNT() == 1
    INSERT INTO child(id, parent_id, reference) VALUES(1, 1, 1);
    COMMIT;
ELSE
    ROLLBACK;
END

consider 3 transactions with a network latency of 4.5ms to transmit and 4.5ms to receive.

Transaction #1, #2, and #3 attempt the UPDATE.
4.5ms elapses
Transaction #1 is granted the write lock. Transactions #2 and #3 are enqueued.
4.5ms elapses
Transaction #1 issues an INSERT.
9ms elapses
Transaction #1 issues COMMIT.
4.5ms elapses
Transaction #2 is granted the write lock.
4.5ms elapses
Transaction #2 issues a ROLLBACK because the UPDATE returned 0 changes.
4.5ms elapses
Transaction #3 is granted the write lock.
4.5ms elapses
Transaction #3a issues a ROLLBACK because the UPDATE returned 0 changes.
4.5ms elapses
The row parent(id=1) is unlocked.

By step 11, transaction #2 has waited at least 24ms before it receives the response for its UPDATE. By step 15, transaction #3 has waited at least 32ms before it receives the response for its UPDATE. Given a higher rate of concurrency, you can easily see where the time discrepancy arises. Or, under lower rates of concurrency, even as something as trivial as a 3 second spike in latency to 100ms can cause cascading failures. In this example, we used only READ COMMITTED, however, stronger isolation levels, such as REPEATABLE READ, MySQL's default level, are even more prone to widespread issues of contention because of its gap and next-key locks.

At this point, running simulations with an added 9ms of latency generate nearly identical results to that seen in the original data set. The 9ms overhead, combined with a concurrency of 128 results in over 1 second of actual record lock wait time wherein MySQL is simply doing nothing.

Some Perspective

MySQL is a CP system, which is to say that its data remains consistent in the event of a network partition. Given the definition of an arbitrary system, a 9ms latency may even be considered an on-going network partition. As demonstrated in this article, the cost of consistency can be remarkably high, potentially causing poor user experience and unpredictable system degradation. Indeed, many systems require very strong consistency guarantees, and I in no way intend to suggest those systems are wrong. However, for systems wherein consistency may not be essential, such as in domains wherein writes are isolated to single users, relaxing strong consistency may be of great benefit in the long run.

So, what can be done about all this? A few things.

Avoid transactions in favour of auto-commit. This will ensure that locks are held only for the duration of the statement, independent of the issuing client. This approach, however, does not work well when attempting to atomically manipulate a set of dependent relationships and requires a carefully designed total ordering.
Do not maintain more than 1 active data center. This is a sledge-hammer solution, but for many systems, this is a viable option. While simpler in some regards, this adds on additional risk and complexity in the event of fail-over.
Dedicate a specific data center for writes to which all write-bound traffic would be routed. This will ensure writes are as fast as possible, likely in the hundreds of microseconds range. Probabilistically, a client will likely not make another request until after all writes have been replicated to all other read-bound data centres, but in the case this assumption does not hold true, the client must either be written to expect this eventual consistency or must be "pinned" to the write-bound data center for a longer period of time.
Amortise the cost of writes with local affinity using queues, topics, journals or other forms of asynchronous replication. These systems may become eventually synchronized with the remote master database. This technique works best when writes will fail very, very infrequently and pushes much of the responsibility of handing eventual consistency to the client.
Migrate toward an AP system that enables background synchronization and local affinity of writes. Such a set-up may be configured to ensure that clients always see a consistent view of their data provided they are (even temporarily) pinned to a single data center. Of course, this may be an expensive solution, so you should weigh the consequences carefully.

This list is non-exhaustive and every situation is different. Ultimately, I encourage you and your team to approach each situation with a open, critical mind and make informed, empirical decisions.