2016-02-07 23:24 UTC

Pessimistic Optimism: The Case Of Unexpected Deadlocks

Despite the fact I've been developing software in one way or another for almost 20 years, I'm constantly surprised how things work, and, once the frustration wears off, I'm intrigued when things break. Lucky for me, over the last few weeks, I've been surprised, frustrated, and intrigued by what, on the surface, seems like a very trivial problem: MySQL reporting deadlocks in unexpected places and unexplainable slow query log entries when using optimistic concurrency control.

The Example

Before we start, let's get on the same page so that, if you wish, you can follow along. I assume you have a basic understanding of SQL and ACID. I also assume you have a basic understanding of FOREIGN KEYs and UNIQUE constraints. This article will use READ COMMITTED, which is the minimum necessary level for optimistic concurrency control.

Throughout this article, we will use the following setup in a MySQL database:

CREATE DATABASE test;
USE test;

CREATE TABLE parent (
    id BIGINT PRIMARY KEY,
    version INT NOT NULL
);

CREATE TABLE child (
    id BIGINT PRIMARY KEY,
    parent_id BIGINT NOT NULL,
    reference BIGINT NOT NULL,
    CONSTRAINT parentid_reference_uk UNIQUE(parent_id, reference),
    CONSTRAINT parentid_fk FOREIGN KEY(parent_id) REFERENCES parent(id)
);

INSERT INTO parent(id, version) VALUES(1, 0);

To summarise, we've created a one-to-many relationship between parent and child: each child has one parent, and each parent has many child references. I'd like to call out the reference column and parentid_reference_uk constraint in the child table. These two things make the frustrating magic happen, but if you're wondering, reference is simply an arbitrary value.

Optimistic Concurrency Control

If you've followed any of my other articles, or if you've been in the field for a while, you've probably seen the term 'optimistic locking' a number of times. OCC is much like optimistic locking: you are able to modify an entity without locking it under the optimistic assumption that a competing write will be rare.

If you're already familiar with the OCC algorithm, feel free to skip this section. Otherwise, it's pretty simple and I will use the parent and child tables to demonstrate. In this scenario, the parent-child relationship is composition, meaning the child entities are owned exclusively by a parent. Further, to uphold invariants of the relationship, this means child entities should not be modified outside of the parent graph.

Record the version - Before modifying a parent entity, you first need to record its current version.
```
BEGIN;
SELECT version FROM parent WHERE id = 1;
COMMIT;
```
A read like this could be performed off a replica database or performed in time long prior to the subsequent steps. For this example, let's assume the version=0 where id=1. One important point is that the transaction is being committed after the read.
Modify the entity - Next, in a new transaction, we'll modify the parent entry by inserting a reference child row.
```
BEGIN;
INSERT INTO child(id, parent_id, reference) VALUES(1, 1, 1);
```
Notice that we have not yet committed or rolled back the transaction. It needs to remain open for the next step.
Validate the changes - Once the parent graph is in the desirable state, we have to validate that our new changes were not conflicted. We do this by bumping up the parent entity's version while asserting the current version is still 0
```
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0;
```
This update will return one of two results: an update count of 0 or 1. If the update count is 0, the parent record was modified by another transaction some time between step 1 and now. In this case, we should ROLLBACK; otherwise, an update count of 1 indicates that the record was not changed by another transaction. We may then COMMIT. After committing, the new version is 1, which means any other prior reads of parent(version=0) are invalid.

Deadlocks

Before moving along, let's be clear about the term, "deadlock". A deadlock is, essentially, a state when one thread, thread 1, is preventing another thread, thread 2, from making meaningful progress. Further, thread 2 is also preventing thread 1 from making meaningful progress. For example,

Thread 1 acquires lock A
Thread 2 acquires lock B
Thread 1 waits for lock B
Thread 2 waits for lock A

Many people much smarter than me have found ways to describe, detect, and prevent deadlocks, so I'll leave the Internet at your disposal.

Reproducing The Surprise

Given my understanding of optimistic concurrency control, I certainly never expected to observe deadlocks in the way demonstrated below. Yet, as usual, I was humbled by the unusually large logs in production informing me I was wrong. Step one in diagnosing an issue is reproducing it, so I set-up two connections to a local database and started testing things out.

First, in both connections, disable autocommit and set a READ COMMITTED transaction isolation level

SET AUTOCOMMIT = 0;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
BEGIN;

Both transactions, 2511 and 2510, are ready

---TRANSACTION 2511, not started
MySQL thread id 68, OS thread handle 0x7fce0e270700, query id 563 127.0.0.1 someuser cleaning up
---TRANSACTION 2510, not started
MySQL thread id 67, OS thread handle 0x7fce0e23f700, query id 562 127.0.0.1 someuser cleaning up

In the first transaction, we'll insert a new child row that references parent(id=1). As noted, we'll arbitrarily use reference=1.

INSERT INTO child(id, parent_id, reference) VALUES(1, 1, 1);
Query OK, 1 row affected (0.00 sec)

Note that this transaction has not yet been committed. Taking a look at the InnoDB engine status, we can see that transaction 2510 now has a single insert with 1 row lock.

---TRANSACTION 2511, not started
MySQL thread id 68, OS thread handle 0x7fce0e270700, query id 563 127.0.0.1 someuser cleaning up
---TRANSACTION 2510, ACTIVE 3 sec
3 lock struct(s), heap size 360, 1 row lock(s), undo log entries 1
MySQL thread id 67, OS thread handle 0x7fce0e23f700, query id 565 127.0.0.1 someuser cleaning up

Hopefully, you're not as confused as I was, uncertain about which 1 row was locked. If you are, we will clear it up later.

In the second transaction, we'll insert a new child row that purposefully duplicates the parentid_reference_uk constraint.

INSERT INTO child(id, parent_id, reference) VALUES(2, 1, 1);
<hanging>

Again, this transaction has not yet been committed. And, in fact, the INSERT call is blocking. Taking a look at the InnoDB engine status, we have a bit more information to help understand what's going on.

---TRANSACTION 2511, ACTIVE 2 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 68, OS thread handle 0x7fce0e270700, query id 567 127.0.0.1 someuser update
INSERT INTO child(id, parent_id, reference) VALUES(2, 1, 1)
------- TRX HAS BEEN WAITING 2 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 14 page no 4 n bits 72 index `parentid_reference_uk` of table `test`.`child` trx id 2511 lock mode S waiting
------------------
---TRANSACTION 2510, ACTIVE 22 sec
4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 67, OS thread handle 0x7fce0e23f700, query id 565 127.0.0.1 someuser cleaning up

We can see that transaction 2551 is waiting to lock a row in parentid_reference_uk. You may also notice that transaction 2510 now has 2 row locks. This is the first sign that something weird is about to happen, but let's proceed. Also notice that undo log entries 1 in both transactions indicates that something has happened in both.

In the first transaction, we'll bump the version of the parent(id=1) record to ensure that no other transactions have changed it.
```
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0
```
Note, again, that we have not yet called COMMIT. The changes made in this transaction are uncommitted and should not yet be visible to any other transactions.

The second transaction, which, as you may recall, was waiting for a lock, has now failed. The failure is reported as a deadlock

INSERT INTO child(id, parent_id, reference) VALUES(2, 1, 1);
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

Taking a look in the InnoDB engine status, we can hopefully start to diagnose the deadlock

------------------------
LATEST DETECTED DEADLOCK
------------------------
*** (1) TRANSACTION:
TRANSACTION 2511, ACTIVE 23 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1184, 2 row lock(s), undo log entries 1
MySQL thread id 68, OS thread handle 0x7fce0e270700, query id 567 127.0.0.1 someuser update
INSERT INTO child(id, parent_id, reference) VALUES(2, 1, 1)
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 14 page no 4 n bits 72 index `parentid_reference_uk` of table `test`.`child` trx id 2511 lock mode S waiting

*** (2) TRANSACTION:
TRANSACTION 2510, ACTIVE 43 sec starting index read
mysql tables in use 1, locked 1
6 lock struct(s), heap size 1184, 3 row lock(s), undo log entries 1
MySQL thread id 67, OS thread handle 0x7fce0e23f700, query id 569 127.0.0.1 someuser updating
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 14 page no 4 n bits 72 index `parentid_reference_uk` of table `test`.`child` trx id 2510 lock_mode X locks rec but not gap
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 8 page no 3 n bits 72 index `PRIMARY` of table `test`.`parent` trx id 2510 lock_mode X locks rec but not gap waiting

*** WE ROLL BACK TRANSACTION (1)

If, like me, you don't initially understand how this could possibly be a deadlock, be comforted that it took me quite some time to understand it. If the deadlock reported a conflict on PRIMARY on both threads, I would have understood, but some information is missing here, and it took a deep-dive into the MySQL source code before I could fill in the gap.

A Deep Dive

In the following sections, I reference the MySQL source code to help explain this behaviour. Because I am neither a MySQL developer nor an expert of its inner workings, however, I encourage any readers to correct my understanding wherever necessary. Further, I encourage you to follow along in the source code.

At a very, very high level, MySQL is a database API that is backed by numerous storage engines such as InnoDB, blackhole, NDB, MyISAM, etc. To simplify in an extreme way, consider that the storage engines are responsible for enforcing (or not enforcing) all the aspects of ACID, while MySQL handles the remaining functionality. In this article, as in most deployments of MySQL, I am using the InnoDB engine.

If you're following along with the source code, spend a few minutes exlporing example/ha_example.cc, which is a documented example of a custom storage engine. Then, when you're ready, open up innobase/handler/ha_innodb.cc, the InnoDB storage engine entry point, and come back.

InnoDB Inserts

Before proceeding, you'll need to know a few InnoDB terms. First, within InnoDB, everything is a B+tree. The data in a table are stored in a tree called the "clustered index" and is ordered by the primary key while the data stored in table indices are called "secondary indices". Within each tree, "rows" (or records) are stored on "pages" (of a predictable size).

Indices

When InnoDB performs an insert, a number of things happen. Using the parent and child tables demonstrated in this article, I will explain the pertinent steps. Recall that the first statement was INSERT INTO child. So, starting in innobase/handler/ha_innodb.cc, navigate to ha_innobase::write_row(uchar*), which is the handler for an INSERT statement. Give this function a quick read and then come back at Step-5:

/* Step-5: Execute insert graph that will result in actual insert. */
error = row_insert_for_mysql((byte*) record, m_prebuilt);

This call to row_insert_for_mysql in innobase/row/row0mysql.cc evaluates the insert and determines whether it is for an intrinsic table or a regular table. Intrinsic tables are in-memory tables used to store intermediary results; such tables are optimised to avoid many of the expensive guarantees of ACID, since an intrinsic table insert is only visible to the current thread (which cannot contend with itself).

if (dict_table_is_intrinsic(prebuilt->table)) {
    return(row_insert_for_mysql_using_cursor(mysql_rec, prebuilt));
} else {
    return(row_insert_for_mysql_using_ins_graph(mysql_rec, prebuilt));
}

In this case, child is not an intrinsic table, so we proceed to row_insert_for_mysql_using_ins_graph. To save you the leg work of groking the next couple steps, the next calls are in innobase/row/row0ins.cc: row_ins_step(que_thr_t*), which then calls row_ins(ins_node_t*, que_thr_t*). An insert essentially consists of inserting the new record into all non-fulltext indices of the table, starting with the clustered index (i.e. first_index):

node->index = dict_table_get_first_index(node->table);
...
while (node->index != NULL) {
    if (node->index->type != DICT_FTS) {
        err = row_ins_index_entry_step(node, thr);
...
    node->index = dict_table_get_next_index(node->index);

Stepping through row_ins_index_entry_step, we end up in row_ins_index_entry, which branches between a clustered index or secondary index:

if (dict_index_is_clust(index)) {
    return(row_ins_clust_index_entry(index, entry, thr, 0, false));
} else {
    return(row_ins_sec_index_entry(index, entry, thr, false));
}

As noted, the first index is always the clustered index. In this example, this is a straight-forward insert, so we'll skip it and move on to the more interesting row_ins_sec_index_entry, which, among other, things eventually calls the following two functions (in this order):

row_ins_check_foreign_constraint for each foreign key index
row_ins_scan_sec_index_for_duplicate to look for duplicates in each index

The child table has 2 secondary indices: parentid_reference_uk, a unique index, and parentid_fk, a foreign key index.

Foreign Keys

Starting with row_ins_check_foreign_constraint, an index scan is performed to find a matching record in the foreign index. If the record is found, the matched record in the foreign index is locked with a read lock (shared) through row_ins_set_shared_rec_lock.

cmp = cmp_dtuple_rec(entry, rec, offsets);

if (cmp == 0) {
    ...
    err = row_ins_set_shared_rec_lock(LOCK_REC_NOT_GAP, block, rec, check_index, offsets, thr);

In this case, the clustered index of parent(id=1) is locked with a read lock to ensure that the foreign key constraint is upheld until the transaction is either committed or rolled-back.

Unique Keys

With the parent row locked, insertion may continue. InnoDB next finds the appropriate insertion point in the index

btr_cur_search_to_nth_level(index, 0, entry, PAGE_CUR_LE, search_mode, &cursor, 0, __FILE__, __LINE__, &mtr);

Upon return, cursor points to the necessary insertion point for the new record, but before insertion happens, InnoDB needs to ensure that the record does not violate any UNIQUE constraints.

n_unique = dict_index_get_n_unique(index);

if (dict_index_is_unique(index) && (cursor.low_match >= n_unique || cursor.up_match >= n_unique)) {
    ...

    err = row_ins_scan_sec_index_for_duplicate(flags, index, entry, thr, check, &mtr, offsets_heap);

First, n_unique is set to the number of fields that are unique to the index. In the parentid_reference_uk secondary index, n_unique is 2, as there are 2 fields. From the call to btr_cur_search_to_nth_level, both low_match and up_match are set. These fields report the number of fields in the record before and after the insertion point that match the new entry. So, in the case of the first insert to child, low_match=0 and up_match=0, whereas in the second insert to child, low_match=2 and up_match=0.

As such, during the second transaction's INSERT into child, a call is made to row_ins_scan_sec_index_for_duplicate

err = row_ins_set_shared_rec_lock(lock_type, block, rec, index, offsets, thr);
...

cmp = cmp_dtuple_rec(entry, rec, offsets);

if (cmp == 0 && !index->allow_duplicates) {
    if (row_ins_dupl_error_with_rec(rec, entry, index, offsets)) {
        err = DB_DUPLICATE_KEY;

A shared lock is placed on each potential match prior to performing a full comparison. This ensures the consistency guarantees within the transaction. For this article, the important part of this method is the attempted call to row_ins_set_shared_rec_lock, which, as mentioned, places a read lock on each potential match.

Assuming all things succeed without failure, the record is inserted in row_ins_sec_index_entry_by_modify and another customer is satisfied.

Putting It All Together

To summarise, when InnoDB performs an INSERT, it first inserts into the clustered index, then into each secondary index. For each index, the following two steps happen in order

If this is a foreign key, lock the parent record
If this is a unique key, lock the potential matching records

Then, let's briefly review the the OCC pattern again

Record the row's version
Perform database changes
Validate that the changes in #2 reflect the state recorded in #1
Commit or roll-back

Putting these together, the deadlock becomes apparent

Transaction 1 inserts the data into the clustered index of child
Transaction 1 locks parent(id=1) with a read lock due to the foreign key
Transaction 1 inserts the data into parentid_reference_uk and locks the record with a write lock
Transaction 2 inserts the data into the clustered index of child
Transaction 2 locks parent(id=1) with a read lock due to the foreign key
Transaction 2 attempts to put a read lock on parentid_reference_uk(parent_id=1,reference=1)
Transaction 1 attempts to promote its read lock to a write lock on parent(id=1) to perform the UPDATE

At this point, there's a deadlock: neither transaction 1 nor 2 can make meaningful progress until they each release their locks for the other. Something has to give, so InnoDB kills one of the transactions and lets the other proceed. While I will not say that this is incorrect behaviour, I will say this is extremely surprising: I do not expect uncommitted changes to deadlock another transaction. To be fair, however, InnoDB does describe their UPDATE reads as "semi-consistent", so this point is, at least, documented.

Indeed, a deadlock would arise without the unique index. However, in such a case, the deadlock would happen at a time we can reason about: when both transactions attempt to UPDATE parent. And, in such a case, MySQL actually reports the deadlock in a useful way:

UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 32 page no 3 n bits 72 index PRIMARY of table `test`.`parent` trx id 2511 lock_mode X locks rec but not gap waiting

*** (2) TRANSACTION:
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 32 page no 3 n bits 72 index PRIMARY of table `test`.`parent` trx id 2510 lock mode S locks rec but not gap
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 32 page no 3 n bits 72 index PRIMARY of table `test`.`parent` trx id 2510 lock_mode X locks rec but not gap waiting

*** WE ROLL BACK TRANSACTION (2)

The Workaround

Unless somebody can recommend another fix that doesn't compromise data integrity, I believe the only work-around is to be less optimistic. By changing the order of operations, the deadlock can be avoided at the cost of a write lock for the duration of the transaction:

BEGIN;
UPDATE parent SET version = version + 1 WHERE id = 1 AND version = 0;
INSERT INTO child(id, parent_id, reference) VALUES(1, 1, 1);
COMMIT;

By updating the parent first, all subsequent transactions will block waiting to acquire a write lock. This will prevent them from modifying the child collection until after the current transaction is finished. I would argue rather strongly, however, at this point, we no longer have optimistic locking since a write lock is being held on the entire graph for the entire operation.

Further Thoughts

MySQL, and, in particular, InnoDB is an interesting piece of software that powers a significant portion of the Internet. If you can remember its numerous nuances, it will probably take you wherever you want to go. This all said, I will never describe MySQL as my favourite RDBMS; but, rather, an old acquaintance that comes around once or twice a year that you just have to humour and feed.

In closing, while the MySQL Internals are fairly well documented, the knowledge I gained from in-person training from an exceptional DBA and from Percona was rather useful in understanding this issue. I would strongly encourage all readers to participate in similar training no matter who it is from.