2019-06-06 12:43 UTC

Fun MySQL fact of the day: PITR without GTIDs

When we isolated yesterday's bad transaction using GTIDs, life was pretty simple. But in some cases, GTIDs won't be an option for us. So, today, we'll take a slightly closer look at the binary log and figure out an alternative.

Without a GTID, a transaction can be specified to the replication engine using two pieces of information: a log file (MASTER_LOG_FILE or RELAY_LOG_FILE) and the log file position (MASTER_LOG_POS or RELAY_LOG_POS). The position is, essentially, a byte offset allowing MySQL to seek to the correct position in the binary or relay log file. Today, we're going to focus on RELAY_LOG_FILE and RELAY_LOG_POS for simplicity, but this implies that our replication chain continues to be healthy and that we are not required to move the delayed replica under a different master. Now, starting simple, much like yesterday, we can use mysqlbinlog to find our bad query; except, today, because we're using RELAY_LOG_FILE and RELAY_LOG_POS, we'll want to search for it in the relay logs on our delayed host:

$ mysqlbinlog relaylog.0000024
# at 239
#190606 12:21:37 server id 1  end_log_pos 318 CRC32 0xb5aa47b1  Query   thread_id=28    exec_time=0     error_code=0
BEGIN;
# at 318
#190606 12:21:37 server id 1  end_log_pos 445 CRC32 0xd6f803c1  Query   thread_id=28    exec_time=0     error_code=0
UPDATE user SET full_name = 'Mary Mary' WHERE id = 33;
# at 445
#190606 12:21:40 server id 1  end_log_pos 476 CRC32 0xdc463c6f  Xid = 12826
COMMIT;
# at 476
#190606 12:21:46 server id 1  end_log_pos 555 CRC32 0xfb1bdef3  Query   thread_id=28    exec_time=201   error_code=0
BEGIN;
# at 555
#190606 12:21:46 server id 1  end_log_pos 674 CRC32 0x58358895  Query   thread_id=28    exec_time=201   error_code=0
UPDATE user SET full_name = 'Charlie Charlie';
# at 674
#190606 12:29:40 server id 1  end_log_pos 705 CRC32 0x41800509  Xid = 12915
COMMIT;

I have taken the liberty of cleaning-up this output to make it a bit simpler to see, and I'd like to draw your attention to the lines starting with # at, which indicate the log position (i.e. byte offset). These are the values we may pass into RELAY_LOG_POS or MASTER_LOG_POS. Similarly, mysqlbinlog also prints out the end position of an entry, such as end_log_pos. In our case, we know that the change WHERE id = 33 is legitimate, and we can also see that the transaction completes at log position 476, which conveniently happens to be the start of the bad transaction. With this information, we can STOP SLAVE SQL_THREAD; CHANGE MASTER TO MASTER_DELAY = 0; START SLAVE SQ_THREAD UNTIL RELAY_LOG_FILE='relaylog.0000024', RELAY_LOG_POS=476;.

At this point you may, again, go grab a coffee while the SQL thread catches up from its delay until stopping just before the bad transaction. Now unlike yesterday where we could inject a "fake" transaction, because we aren't using GTIDs, we need another way of skipping the bad transaction. This is where the sql_slave_skip_counter counter comes in: it can be used to skip over transactions (or single changes for non-transactional engines). So, if we SET GLOBAL sql_slave_skip_counter = 1 and then START SLAVE SQL_THREAD, MySQL will skip over the 1 transaction (the bad transaction) and then resume running the remaining relay logs normally until the SQL thread has caught up completely.

Now, I feel it's worth pointing out that, so far, all of our binary log Fun MySQL Facts Of The Day have used STATEMENT-based replication solely for purposes of making our examples simpler to understand. However, if you can remember way back to February 28, you may recall we discussed one of several cases where STATEMENT-based replication can be extremely unsafe. So, tomorrow, we'll start to look at why and then move into next week looking at some safer alternatives.