2019-06-04 12:37 UTC

Fun MySQL fact of the day: replication lag can be good

We've spent a number of days discussing replication delay, and we've mostly treated it as a bad thing. And, true, in most cases it is pretty bad, but today we're going to consider a case where replication delay is as desirable as it is important.

On May 24, we started looking at point-in-time recovery. In that example, we used binary logs manually downloaded from our corrupt master database to "catch up" an out-of-date back-up. While this is a completely viable and plausible use case, wouldn't it be great if we could shorten the cycle by not needing to restore from back-ups the next time we have a disaster? Enter delayed replicas: replicas configured to lag behind their master by some fixed period of time. For example, consider a database cluster that has two delayed replicas: one that lags behind by 12 hours and one that lags behind by 48 hours. These delayed replicas could be used to perform point-in-time recovery for up to 48 hours without needing to restore the database from back-ups.

In MySQL 5.6, a new option was added to CHANGE MASTER TO to help accomplish just this goal: MASTER_DELAY. With a non-zero MASTER_DELAY, a replica is configured to "lag" behind its master by at least the specified number of seconds. Before MySQL 5.6, the same thing could be accomplished using third-party tools, like pt-slave-delay, but with some amount of additional complexity and indirection. But, no, we'll leave our pre-MySQL 5.6 history as history and focus, instead, on MASTER_DELAY.

Now, let's say we were to CHANGE MASTER TO MASTER_DELAY=172800 (or 48 hours) on a healthy, correctly-configured MySQL replica with (currently) 0 seconds of replication delay. By doing this, we tell the replica to, effectively, "stall" the SQL thread for 48 hours while the I/O thread continues to download binary log events into the relay logs. Then, after the 48th hour, the SQL thread will "wake up" and start processing the relay logs as usual, making sure not to apply any change until at least 48 hours after it was made on the master. This results in a replica that is constantly 48 hours behind its master, giving us 48 hours to "undo" bad/disastrous changes without resorting to our back-ups.
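
As a rough sketch of what that looks like in practice, the statements below configure the 48-hour delay and then check on it. This assumes a replica that already has replication set up and is caught up, and it stops both replication threads first, which is the conservative choice that should work on MySQL 5.6 and later. Treat it as an illustration rather than a runbook for your environment.

```sql
-- Run on the replica. Assumes replication is already configured and healthy.
STOP SLAVE;

-- 172800 seconds = 48 hours * 60 minutes * 60 seconds.
-- Options not named here (host, credentials, positions) keep their current values.
CHANGE MASTER TO MASTER_DELAY = 172800;

START SLAVE;

-- Check the result. In the output:
--   SQL_Delay           is the configured delay in seconds (172800 here).
--   SQL_Remaining_Delay is the number of seconds left before the next event
--                       is applied, or NULL when the SQL thread is not waiting.
SHOW SLAVE STATUS;
```

While the SQL thread is waiting, Slave_SQL_Running_State in the same output should report that it is waiting until MASTER_DELAY seconds after the master executed the event.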

Tomorrow and Thursday, we'll start considering how to use these delayed replicas during our next disaster. Until then, I suggest you go provision and set up a new delayed replica or two with at least 12 hours of delay. I'll be waiting.