Replication channel failover with Galera Cluster for MySQL
Sometimes it can be desirable to replicate from a Galera Cluster to a single MySQL slave or to another Galera Cluster. Reasons for such a set-up could be:
- An unstable network between two Galera Cluster locations.
- A separation of a reporting slave and the Galera Cluster so that heavy reports on the slave do not affect the Galera Cluster performance.
- Mixing different sources in a slave or a Galera Cluster (fan-in replication).
This article is based on earlier research work (see MySQL Cluster - Cluster circular replication with 2 replication channels) and uses the old MySQL replication style (without MySQL GTID).
Preconditions
- Enable the binary logs on 2 nodes of the Galera Cluster (we call them channel masters) with the log_bin variable.
- Set log_slave_updates = 1 on ALL Galera nodes.
- It is recommended to have small binary logs and relay logs in such a situation to reduce the overhead of scanning the files (max_binlog_size = 100M).
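These settings can be verified on each node, for example like this (just a sketch that lists only the variables mentioned above):
mysql> SHOW GLOBAL VARIABLES
       WHERE Variable_name IN ('log_bin', 'log_slave_updates', 'max_binlog_size');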
Scenarios
Let us assume that for some reason the current channel master of channel 1 breaks. As a consequence the slave of channel 1 does not receive any replication events any more. But we have to keep the replication stream up and running, so we have to switch the replication channel over to channel master 2.
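Whether the channel is really broken can be seen on the slave, for example with the following check (a sketch using the same pager trick as in the steps below). Slave_IO_Running: No together with a connection error in Last_IO_Error points to a broken channel master:
mysql> pager grep -E 'Slave_IO_Running|Last_IO_Err'
mysql> SHOW SLAVE STATUS\G
mysql> nopager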
Switching replication channel
First, to be on the safe side, we stop the slave of replication channel 1:
mysql> STOP SLAVE;
Then we have to find the actual relay log on the slave:
mysql> pager grep Relay_Log_File
mysql> SHOW SLAVE STATUS\G
mysql> nopager
Relay_Log_File: slave-relay-bin.000019
Next we have to find the last applied transaction on the slave:
mysql> SHOW RELAYLOG EVENTS IN 'slave-relay-bin.000019';
| slave-relay-bin.000019 | 3386717 | Query       |      5201 |    53745015 | BEGIN                          |
| slave-relay-bin.000019 | 3386794 | Table_map   |      5201 |    53745067 | table_id: 72 (test.test)       |
| slave-relay-bin.000019 | 3386846 | Write_rows  |      5201 |    53745142 | table_id: 72 flags: STMT_END_F |
| slave-relay-bin.000019 | 3386921 | Xid         |      5201 |    53745173 | COMMIT /* xid=1457451 */       |
+------------------------+---------+-------------+-----------+-------------+--------------------------------+
This is transaction 1457451, which is the same on all Galera nodes.
On the new channel master of channel 2 we now have to find the matching binary log. This is best done by comparing the timestamps of the relay log on the slave with those of the binary logs on the master of channel 2 (consider different time zones and make sure the server clocks are synchronized, e.g. with ntpd).
On the slave:
shell> ll *relay-bin*
-rw-rw---- 1 mysql mysql     336 Mai 22 20:32 slave-relay-bin.000018
-rw-rw---- 1 mysql mysql 3387029 Mai 22 20:37 slave-relay-bin.000019
On the master of channel 2:
shell> ll *bin-log*
-rw-rw---- 1 mysql mysql  2518737 Mai 22 19:57 bin-log.000072
-rw-rw---- 1 mysql mysql      143 Mai 22 19:57 bin-log.000073
-rw-rw---- 1 mysql mysql      165 Mai 22 20:01 bin-log.000074
-rw-rw---- 1 mysql mysql 62953648 Mai 22 20:40 bin-log.000075
It looks like binary log 75 of master 2 matches the relay log of our slave.
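If the file timestamps alone are not conclusive, the event timestamps inside the files can be compared as well, for example with mysqlbinlog (just a sketch, using the file names from the listings above; the last events of each file carry the relevant timestamps):
On the slave:
shell> mysqlbinlog slave-relay-bin.000019 | tail -n 20
On the master of channel 2:
shell> mysqlbinlog bin-log.000075 | tail -n 20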
Now we have to find the same transaction on the master of channel 2:
mysql> pager grep -B 6 1457451
mysql> SHOW BINLOG EVENTS IN 'bin-log.000075';
mysql> nopager
| bin-log.000075 | 53744832 | Write_rows  |      5201 |    53744907 | table_id: 72 flags: STMT_END_F        |
| bin-log.000075 | 53744907 | Xid         |      5201 |    53744938 | COMMIT /* xid=1457450 */              |
| bin-log.000075 | 53744938 | Query       |      5201 |    53745015 | BEGIN                                 |
| bin-log.000075 | 53745015 | Table_map   |      5201 |    53745067 | table_id: 72 (test.test)              |
| bin-log.000075 | 53745067 | Write_rows  |      5201 |    53745142 | table_id: 72 flags: STMT_END_F        |
| bin-log.000075 | 53745142 | Xid         |      5201 |    53745173 | COMMIT /* xid=1457451 */              |
+----------------+----------+-------------+-----------+-------------+---------------------------------------+
We successfully found the transaction and take the position of the next transaction, 53745173, which is where we should continue replicating.
As a last step we point the slave to the master of replication channel 2:
mysql> CHANGE MASTER TO master_host='master2', master_port=3306,
       master_log_file='bin-log.000075', master_log_pos=53745173;
mysql> START SLAVE;
After a while the slave has caught up and is ready for the next failover back.
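Whether the slave has really caught up can be checked again with SHOW SLAVE STATUS (a sketch using the pager trick from above). Slave_IO_Running: Yes, Slave_SQL_Running: Yes and Seconds_Behind_Master: 0 indicate that the slave is in sync:
mysql> pager grep -E 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
mysql> SHOW SLAVE STATUS\G
mysql> nopager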
Discussion
We found during our experiments that an IST of a channel master does not lead to a gap or loss of events in the replication stream. So restarting a channel master does not require a channel failover as long as an IST can be used for resyncing the channel master with the Galera Cluster.
The increase of wsrep_cluster_conf_id is NOT an indication that a channel failover is required.
An SST resets the binary logs, so after an SST the slave will not replicate any more (it stops with an error instead of silently missing transactions). So this method should be safe to use. If you find any situation where you experience trouble with channel failover, please let us know.
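One more hint: under the assumption above that an SST resets the binary logs, a quick way to spot this on a restarted channel master is to check whether its binary log numbering has started over again, for example:
mysql> SHOW MASTER STATUS;
If the File column shows a freshly started sequence (e.g. bin-log.000001) right after the restart, an SST has happened and the slave has to be failed over to the other channel master as described above.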
Comments
MySQL 5.5 and 5.6 ?!
Just to confirm, the above channel failover steps are valid in Galera Cluster for both MySQL versions 5.5 and 5.6.
Enjoy!!