Fun MySQL fact of the day: PITR with GTIDs

Yesterday, we looked at MySQL's MASTER_DELAY functionality, which we may use to maintain a delayed replica in our important database clusters. A couple weeks ago, we also looked at a very basic point-in-time recovery technique using mysqlbinlog. Today, we're going to combine the two with a universal twist.

For all versions of MySQL 5.6 and later, MySQL has a useful feature called "Global Transaction IDs", or GTIDs. A GTID is a way of uniquely identifying a transaction all the way through a replication tree in the format of server_uuid:transaction_id. At the start of each transaction, MySQL will set the session variable, gtid_next, to the next natural GTID, like a8f0b8ad-6b72-11e9-9f81-0242ac140002:1404763, which would identify transaction number 1,404,763 on the server, a8f0b8ad-6b72-11e9-9f81-0242ac140002. This session variable gets recorded in the binary log and retained in the relay logs so that when a replica's SQL thread executes the transaction, the replica itself will record and retain the same GTID. As an "added bonus" fun fact, a MySQL replica will not execute the same GTID twice. What does this mean to us? Well, it means that we can identify, isolate, and even "fake" one or more "bad" transactions universally.

So, let's consider the same accidental query as a couple weeks ago: UPDATE users SET full_name = 'Charlie Charlie', email = '[email protected]'. We need to "undo" this query quickly, and, thankfully, this time, we have a delayed replica to save us some time and effort. First of all, before we do anything, let's stop the IO thread on our delayed replica and, if possible, do a quick/safe back-up of the MySQL directory (maybe a cp -rl). Next, we'll need to find the GTID of this accidental transaction using mysqlbinlog. Then, once we have the transaction's GTID, we can CHANGE MASTER TO MASTER_DELAY = 0 and then START SLAVE SQL_THREAD UNTIL SQL_BEFORE_GTIDS = 'a8f0b8ad-6b72-11e9-9f81-0242ac140002:1404763'.

At this point, grab a coffee until the SQL thread has stopped, leaving us in a state from immediately before the accidental transaction. From here, we'll want to "fake" the bad transaction with a quick SET gtid_next='a8f0b8ad-6b72-11e9-9f81-0242ac140002:1404763'; BEGIN; COMMIT; and then resume replication with a START SLAVE SQL_THREAD. At this point, our delayed replica will be a non-delayed replica that has skipped-over the bad transaction. From here, patching the data on the master database is a fun exercise I'll leave you to solve, but if you don't feel like NIHing it, check out pt-table-checksum and pt-table-sync.

If you'll believe it, this isn't even a terrible over-simplification: point-in-time recovery using GTIDs and MASTER_DELAY is pretty much this simple. However, tomorrow, we'll dive a bit deeper into the binary log and figure out how to do it the "hard way" without GTIDs.