The Cost Of Consistency: When 9ms Is 1s Too Much

The CAP theorem remains one of the most important tools in my software developer's toolbelt. Used correctly, it can help create services and products that can offer an excellent user experience and protect revenue during partial failures. Used incorrectly, it can lead to a poor user experience and loss of revenue. Throughout this article, I look to explain the unexpected latency of specific transactions in MySQL, a CP system.

This is a follow-up to my last article, and, as you may recall, I referred to observations of unexpected log entries in a MySQL slow query log. And, while the proposed changes completely eliminated those specific entries, new ones that were initially even more difficult to reason about arose. Using the same example database created in the last article, we will look to find the root cause of a new, resultant slow query log entry.

Testing Notes

The data presented in this article were reproduced and collected under fairly controlled circumstances. While attempting to diagnose the issues, I tried to minimise noise while keeping variables reasonably controlled. For example, I ran MySQL entirely on a RAM disk and used local loopback interface to reduce the variability of external hardware. While this led to some fairly sterilized data, I believe these controls were necessary. All slow query logs below consider queries taking longer than 100 microseconds as "slow".

Next, if you question why I am using optimistic concurrency control in this example, you're absolutely right. Normally, OCC is used when conflicts are unlikely, but in this article's example, I am purposefully creating conflicts as quickly as possible. I am doing this for one specific reason: to highlight the behaviour during conflict, which, while maybe unlikely throughout the entire life of some data, may happen frequently during specific phases of its life.

Mystery Child

Before going any further, we should take a look at the aforementioned mysterious slow query log digest that was not previously provided. There are numerous other websites on the internet that walk you through the slow query log format and many others that walk you through Percona's pt-query-digest. But, because we're not diagnosing a slow query per se, I will just point out the important parts:

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         64  236087
# Exec time     94   2794s   113us    76ms    12ms    26ms     7ms    10ms
# Lock time     95   1361s     7us    51ms     6ms    18ms     7ms     2ms
# rec lock wai  97   2470s       0    73ms    10ms    23ms     7ms     9ms
INSERT INTO child(id, parent_id, reference) VALUES(1455575750201, 1, 1455575750201)\G

In this digest that was compiled from a reproduced data set, we can see 236,087 INSERT INTO child statements that took longer than 100 microseconds. Of these 236,087, the 95th percentile insert time was 26ms with a maximum of 76ms. Because this is a fairly straight-forward INSERT into a fairly lightly-indexed table, these numbers seemed remarkably high. Looking a bit closer, you may see that the majority of the query times are actually on rec lock wai[t].

But what is rec lock wait? I'm glad you asked. This refers to the time waited to acquire a lock on a record. This may either be a read lock or a write lock on a specific row in a table. Reflecting back to the previous article, you may recall that during an INSERT INTO child, numerous rows were locked, even unrelated ones. This caused the high amount of contention as you can see in the digest.

As noted, however, reordering the UPDATE and INSERT queries completely eliminated this specific entry from showing up in the slow query log. However, in doing so, another mole, waiting for a firm whack, popped up its ugly head.

Annoying Parent

As a result of the query reordering, the parent table became the most contended table, as shown below in the slow query digest of a reproduction, using the controlled system discussed above, of the production data.

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         92    3927
# Exec time     99   4391s   174us      1s      1s      1s    43ms      1s
# Lock time     99   4391s    59us      1s      1s      1s    43ms      1s
# rec lock wai 100   4391s       0      1s      1s      1s    43ms      1s
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 2844953\G

If, like me, you're a bit distressed as to why a simple UPDATE by primary key is showing up in the slow query log, we should consider a number of possible reasons:

Row Locking

What could cause a row lock? In a READ COMMITTED isolation, according to ACID, a DBMS would acquire a write lock before a write until after the transaction is committed and would acquire a read lock before a read until immediately after the read. The granularity of this lock is ambiguous, but in InnoDB, locking can be, and is configured so in this example, granular to the row.

One possibility is that SELECT statements are taking too long or occur too frequently. Each select statement may acquire a read lock during the read. Taking a quick look at the slow query log shows that neither are likely to be the problem.

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          2      86
# Exec time      0    20ms   140us     3ms   228us   301us   288us   176us
# rec lock wai   0       0       0       0       0       0       0       0
SELECT version FROM parent WHERE id = 1\G

Compared to the UPDATE statement which exhibited 3927 slow queries, we captured only 86 slow SELECT queries during the same period of time. Further, the 95th percentile is merely 301 microseconds, meaning the slowest 5% were all under 3ms. Because the SELECT to UPDATE ratio is 1:1 (and performed against a separate replica database), the read locks during a SELECT do not seem to be the problem.

Other mechanisms, such as SELECT ... FOR UPDATE, or, in InnoDB, SELECT ... LOCK IN SHARE MODE, may also be used to acquire granular locks; however, this example uses neither. As such, we may hypothesise that another transaction is performing some write-locking operation on this row, perhaps the UPDATE.

Discovering The Problem

At this point, our hypothesis is simply that the UPDATE statements are conflicting with themselves. As usual, the first step in diagnosing this issue is reproducing it. Given the controlled system, the following data were collected:

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         91  536434
# Exec time     99   5990s   134us   147ms    11ms    12ms     2ms    11ms
# Lock time     99   5920s    27us   147ms    11ms    12ms     2ms    11ms
# rec lock wai 100   5908s       0   147ms    11ms    12ms     2ms    11ms
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 2856388\G

I would first like to point out the total count of 536,434 slow UPDATE statements. This is the first observation that suggests something may be odd about the production data set: the number of slow queries here is more than 135 times higher than the reproduced production data set for the same time range, but the query times are substantially lower. Looking at the runtime characteristics, the overall throughput of this execution was just over 100 times higher than the reproduced data set, so it may be safe to temporarily disregard this observation: the numbers are relatively close.

Next, observing the query execution times of the statements taking longer than 100 microseconds, the slowest UPDATE statements took 174ms, while the fastest 95% took between 134 microseconds and 12 milliseconds. This also corresponds to the row lock wait.

To remove other variables, we can analyse a scenario of this example without contention.

# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         10       4
# Exec time      8   606us   142us   170us   151us   167us    12us   156us
# Lock time     22    65us    16us    17us    16us    16us       0    15us
# rec lock wai   0       0       0       0       0       0       0       0
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 2976567\G

This demonstrates that without contention, the 95th percentile execution of the only 4 UPDATE statements over 100 microseconds is 167 microseconds and no lock wait time whatsoever. Runtime characteristics also suggest throughput is nearly 5000 times higher than the last execution. The lack of record lock waits, combined with the previous results, lets us currently rule-out the storage layer or other InnoDB internals as potential problems unless other evidence points us back.

If we have effectively ruled-out MySQL as the potential problem, one thing remains: the code executing the statements. The only clear correlation we now have is that the applications are somehow holding row locks for longer than necessary. Given that our code is essentially a sequence of UPDATE, INSERT, COMMIT, there is little room for programmatic misbehaviour. However, any kind of delay after invoking the UPDATE will increase the duration of the row's write lock. Let's evaluate some possibilities:

The Missing 991 Milliseconds

In this case, of the above causes, the only variable that cannot be ruled out is network latency. As such, understanding the topology of the system in question now comes into scope: a single MySQL master database sits in one of multiple datacenters and has client connections over a WAN. Given a typical RTT is 9ms with a 1-2ms jitter, we have our first variable.

When InnoDB attempts to acquire a row lock, it will first evaluate if the record's page of the clustered index has any locks. If not, it will attempt to immediately acquire the lock; otherwise, InnoDB will enqueue the write lock request into a write lock queue. Using a queue ensures fairness, such that the request waiting longest will receive the lock next.

Given the following algorithm

UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0;
    INSERT INTO child(id, parent_id, reference) VALUES(1, 1, 1);

consider 3 transactions with a network latency of 4.5ms to transmit and 4.5ms to receive.

  1. Transaction #1, #2, and #3 attempt the UPDATE.
  2. 4.5ms elapses
  3. Transaction #1 is granted the write lock. Transactions #2 and #3 are enqueued.
  4. 4.5ms elapses
  5. Transaction #1 issues an INSERT.
  6. 9ms elapses
  7. Transaction #1 issues COMMIT.
  8. 4.5ms elapses
  9. Transaction #2 is granted the write lock.
  10. 4.5ms elapses
  11. Transaction #2 issues a ROLLBACK because the UPDATE returned 0 changes.
  12. 4.5ms elapses
  13. Transaction #3 is granted the write lock.
  14. 4.5ms elapses
  15. Transaction #3a issues a ROLLBACK because the UPDATE returned 0 changes.
  16. 4.5ms elapses
  17. The row parent(id=1) is unlocked.

By step 11, transaction #2 has waited at least 24ms before it receives the response for its UPDATE. By step 15, transaction #3 has waited at least 32ms before it receives the response for its UPDATE. Given a higher rate of concurrency, you can easily see where the time discrepancy arises. Or, under lower rates of concurrency, even as something as trivial as a 3 second spike in latency to 100ms can cause cascading failures. In this example, we used only READ COMMITTED, however, stronger isolation levels, such as REPEATABLE READ, MySQL's default level, are even more prone to widespread issues of contention because of its gap and next-key locks.

At this point, running simulations with an added 9ms of latency generate nearly identical results to that seen in the original data set. The 9ms overhead, combined with a concurrency of 128 results in over 1 second of actual record lock wait time wherein MySQL is simply doing nothing.

Some Perspective

MySQL is a CP system, which is to say that its data remains consistent in the event of a network partition. Given the definition of an arbitrary system, a 9ms latency may even be considered an on-going network partition. As demonstrated in this article, the cost of consistency can be remarkably high, potentially causing poor user experience and unpredictable system degradation. Indeed, many systems require very strong consistency guarantees, and I in no way intend to suggest those systems are wrong. However, for systems wherein consistency may not be essential, such as in domains wherein writes are isolated to single users, relaxing strong consistency may be of great benefit in the long run.

So, what can be done about all this? A few things.

This list is non-exhaustive and every situation is different. Ultimately, I encourage you and your team to approach each situation with a open, critical mind and make informed, empirical decisions.