You are here

MariaDB's parallel replication to catch up

Due to an application error, our replication stopped for 5 days (over Easter). After the problem was solved, the replication was supposed to catch up, which turned out to be very slow. All the usual tricks (innodb_flush_log_at_trx_commit, sync_binlog, etc.) had already been exhausted. So we tried our hand at parallel replication of the MariaDB server.

Parallel replication is deactivated by default:

SQL> SHOW GLOBAL VARIABLES LIKE '%parallel%';
+-------------------------------+------------+
| Variable_name                 | Value      |
+-------------------------------+------------+
| slave_domain_parallel_threads | 0          |
| slave_parallel_max_queued     | 131072     |
| slave_parallel_mode           | optimistic |
| slave_parallel_threads        | 0          |
| slave_parallel_workers        | 0          |
+-------------------------------+------------+

Parallel replication is activated by setting the server variables slave_parallel_threads:

SQL> SET GLOBAL slave_parallel_threads = 8;
ERROR 1198 (HY000): This operation cannot be performed as you have a running slave ''; run STOP SLAVE '' first

However, this must be done when replication is stopped:

SQL> STOP SLAVE;
SQL> SET GLOBAL slave_parallel_threads = 8;
SQL> START SLAVE;

Replication then caught up a little faster. However, as we were impatient, we tried to make it even faster. With the command:

SQL> SHOW SLAVE STATUS\G
...
Slave_SQL_Running_State: Waiting for room in worker thread event queue
...

we found the following message. You would also see it using the SHOW PROCESSLIST command:

SQL> SHOW PROCESSLIST;
+--------+-------------+- ... -+-----------+------+-----------------------------------------------+- ... -+
| Id     | User        |  ...  | Command   | Time | State                                         |  ...  |
+--------+-------------+- ... -+-----------+------+-----------------------------------------------+- ... -+
...                       ...                                                                        ...
| 212496 | system user |  ...  | Slave_SQL |   16 | Waiting for room in worker thread event queue |  ...  |
+--------+-------------+- ... -+-----------+------+-----------------------------------------------+- ... -+

According to the documentation, it can help in this case to increase the size of the slave_parallel_max_queued variable slightly (attention: Oom!).

SQL> STOP SLAVE;
SQL> SET GLOBAL slave_parallel_max_queued = 1*1024*1024;
SQL> SHOW GLOBAL VARIABLES LIKE '%parallel%';
+-------------------------------+------------+
| Variable_name                 | Value      |
+-------------------------------+------------+
| slave_domain_parallel_threads | 0          |
| slave_parallel_max_queued     | 1048576    |
| slave_parallel_mode           | optimistic |
| slave_parallel_threads        | 8          |
| slave_parallel_workers        | 8          |
+-------------------------------+------------+
SQL> START SLAVE;

We have played around with the values slave_parallel_threads in the range from 4 to 32 (with 8 vCores) and with slave_parallel_max_queued in the range from 128 kbyte to 32 Mbyte.
Caution: Do not exaggerate: 32 threads x 32 Mbyte = 1 Gbyte RAM (Oom)!

To find out which values are the optimum, you would have to test and measure more extensively. In any case, the replication made up the 5-day backlog after about an hour, towards the end a little more than at the beginning, which was hopefully caused by our configuration adjustments.

Depending on what DML statements are currently running, you can see that all threads can be used or that some threads have to wait for other threads:

SQL> SHOW PROCESSLIST;
+--------+-----------------+--------------+--------+---------------------------------------------------------------+------------+
| Id     | User            | Command      | Time   | State                                                         | Info       |
+--------+-----------------+--------------+--------+---------------------------------------------------------------+------------+
|      2 | event_scheduler | Daemon       | 506179 | Waiting for next activation                                   | NULL       |
| 191154 | root            | Query        |      0 | starting                                                      | show pr... |
| 208669 | replication     | Binlog Dump  |    297 | Master has sent all binlog to slave; waiting for more updates | NULL       |
| 212495 | system user     | Slave_IO     |     20 | Waiting for master to send event                              | NULL       |
| 212497 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212498 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212499 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212500 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212501 | system user     | Slave_worker |      0 | Write_rows_log_event::write_row(-1) on table `history_uint`   | insert ... |
| 212502 | system user     | Slave_worker |      0 | Write_rows_log_event::write_row(-1) on table `history_uint`   | insert ... |
| 212503 | system user     | Slave_worker |      0 | Write_rows_log_event::write_row(-1) on table `history_str`    | insert ... |
| 212504 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212505 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212506 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212507 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212510 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212509 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212508 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212511 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212512 | system user     | Slave_worker |      0 | Waiting for prior transaction to commit                       | NULL       |
| 212496 | system user     | Slave_SQL    |     16 | Waiting for room in worker thread event queue                 | NULL       |
+--------+-----------------+--------------+--------+---------------------------------------------------------------+------------+

SQL> SHOW PROCESSLIST;
+--------+-----------------+--------------+--------+---------------------------------------------------------------+------------+
| Id     | User            | Command      | Time   | State                                                         | Info       |
+--------+-----------------+--------------+--------+---------------------------------------------------------------+------------+
|      2 | event_scheduler | Daemon       | 506197 | Waiting for next activation                                   | NULL       |
| 191154 | root            | Query        |      0 | starting                                                      | show pr... |
| 208669 | replication     | Binlog Dump  |    315 | Master has sent all binlog to slave; waiting for more updates | NULL       |
| 212495 | system user     | Slave_IO     |     37 | Waiting for master to send event                              | NULL       |
| 212497 | system user     | Slave_worker |      0 | Delete_rows_log_event::ha_delete_row(-1) on table `history`   | delete ... |
| 212498 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212499 | system user     | Slave_worker |      0 | Delete_rows_log_event::ha_delete_row(-1) on table `history`   | delete ... |
| 212500 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212501 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212502 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212503 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212504 | system user     | Slave_worker |      0 | Delete_rows_log_event::ha_delete_row(-1) on table `history`   | delete ... |
| 212505 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212506 | system user     | Slave_worker |      0 | Delete_rows_log_event::ha_delete_row(-1) on table `history`   | delete ... |
| 212507 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212510 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212509 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212508 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212511 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212512 | system user     | Slave_worker |      0 | Delete_rows_log_event::find_row(-1) on table `history`        | delete ... |
| 212496 | system user     | Slave_SQL    |     11 | Waiting for room in worker thread event queue                 | NULL       |
+--------+-----------------+--------------+--------+---------------------------------------------------------------+------------+

Our monitoring also showed us that the CPU load went up, the I/O system got more to do and more rows were modified...

What was also noticeable is that with parallel replication, Foreign Key errors suddenly occurred, a phenomenon that we had not observed before:

FromDual.maas2.prod2 - Warning: InnoDB Foreign Key error detected

Trigger: InnoDB Foreign Key error detected
Trigger status: PROBLEM
Trigger severity: Warning
Trigger URL: https://fromdual.com/innodb-foreign-key-error-detected

Item values: 1

1. InnoDB new Foreign Key error (FromDual.maas2.prod2:FromDual.MySQL.innodb.ForeignKey_new): 1

With the command SHOW ENGINE INNODB STATUS\G you can inspect these accordingly or view them in the monitoring:

------------------------
LATEST FOREIGN KEY ERROR
------------------------
2024-04-02 10:36:39 0x7f36088ff640 Transaction:
TRANSACTION 7199599266, ACTIVE 0 sec inserting
mysql tables in use 1, locked 1
6 lock struct(s), heap size 1128, 3 row lock(s), undo log entries 1
MariaDB thread id 228555, OS thread handle 139870048613952, query id 28453893 Write_rows_log_event::write_row(-1) on table `alerts`
insert into alerts (alertid,actionid,eventid,userid,clock,mediatypeid,sendto,subject,message,status,error,esc_step,alerttype,acknowledgeid,parameters) values (203687,4,471733,3,1712044003,1,'xxx@fromdual.com','Zabbix server - High: Too many processes on Zabbix server','Trigger: Too many processes on Zabbix server
Trigger status: PROBLEM
Trigger severity: High
Trigger URL:

Item values: 309

1. Number of processes (Zabbix server:proc.num[]): 309',3,'',1,0,null,'{}')
Foreign key constraint fails for table `zabbix`.`alerts`:
,
  CONSTRAINT `c_alerts_2` FOREIGN KEY (`eventid`) REFERENCES `events` (`eventid`) ON DELETE CASCADE in parent table, in index alerts_3 tuple:
DATA TUPLE: 2 fields;
 ...

But in parent table `zabbix`.`events`, in index PRIMARY,
the closest match we can find is record:
PHYSICAL RECORD: n_fields 12; compact format; info bits 0
 ...

Literature/Sources