MariaDB Galera Cluster with Corosync/Pacemaker VIP
Sometimes customers want a very simple Galera Cluster set-up. They do not want to invest in machines and build up the know-how for load balancers in front of the Galera Cluster.
For this type of customer there is the possibility to just run a VIP controlled by Corosync/Pacemaker in front of the Galera Cluster, moving an IP address from one node to the other. But this is just an active/passive/passive set-up, and reads and writes are only possible on one node at a time.
So you lose the read/write scaling and load-balancing functionality and only have the high availability feature left.
Corosync/Pacemaker
A few words upfront about Corosync/Pacemaker:
Pacemaker is a Cluster Resource Manager (CRM) (similar to SysV init or systemd). It "is the thing that starts and stops services (like your database or mail server) and contains logic for ensuring both that they are running, and that they are only running in one location (to avoid data corruption)." [1]
Corosync on the other hand is the thing that provides the messaging layer and talks to instances of itself on the other node(s). Corosync provides reliable communication between nodes, manages cluster membership and determines quorum. Think of Corosync as dbus but between nodes.
The following proof of concept is based on Pacemaker 2.0 and Corosync 3.0. Commands for older versions of Corosync/Pacemaker may vary slightly.
# crmadmin --version
Pacemaker 2.0.1

# corosync -v
Corosync Cluster Engine, version '3.0.1'
Prerequisites
- DNS resolution must work.
- Nodes must be reachable (firewall).
- Nodes must allow traffic between them.
The following steps must be performed on all 3 nodes unless specified otherwise:
DNS resolution
Add the hosts to your /etc/hosts file (or however you do hostname resolution in your set-up):
#
# /etc/hosts
#

192.168.56.103   node1
192.168.56.133   node2
192.168.56.134   node3
Pay special attention to choosing the right IP address if you have different network interfaces: one for inter-cluster communication (192.168.56.*) and one for application traffic (192.168.1.*).
Check all the nodes from all the nodes:
# ping node1
# ping node2
# ping node3
Firewall
Check your firewall settings:
# iptables -L
# systemctl status firewalld
A simple Corosync/Pacemaker Cluster needs the following firewall settings [3] (a sketch for opening these ports follows the list):
- TCP port 2224 for pcsd, Web UI and node-to-node communication.
- TCP port 3121 if the cluster has any Pacemaker Remote nodes.
- TCP port 5403 for quorum device with corosync-qnetd.
- UDP port 5404 for corosync if it is configured for multicast UDP.
- UDP port 5405 for corosync.
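As a sketch of how these ports can be opened (adapt to your own distribution and firewall set-up): with firewalld, the predefined high-availability service should cover the ports listed above; with plain iptables, they have to be opened individually:

# firewall-cmd --permanent --add-service=high-availability
# firewall-cmd --reload

# iptables -A INPUT -p tcp -m multiport --dports 2224,3121,5403 -j ACCEPT
# iptables -A INPUT -p udp -m multiport --dports 5404,5405 -j ACCEPT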
Install Corosync/Pacemaker
Install the Corosync/Pacemaker packages:
# apt-get install pacemaker pcs
The user used by the Corosync/Pacemaker Cluster is the following:
# grep hacluster /etc/passwd
hacluster:x:106:112::/var/lib/pacemaker:/usr/sbin/nologin
Set the password for the Corosync/Pacemaker Cluster user:
# passwd hacluster
New password:
Retype new password:
passwd: password updated successfully
Configuring the Corosync/Pacemaker Cluster
Start the Pacemaker/Corosync Configuration System Daemon (pcsd):
# systemctl enable pcsd
# systemctl start pcsd
# systemctl status pcsd --no-pager
# journalctl -xe -u pcsd --no-pager
Authenticate the nodes in the Cluster (on one node only):
# pcs host auth node1 node2 node3
Username: hacluster
Password:
node1: Authorized
node3: Authorized
node2: Authorized
If something fails, the following command will undo the operation:
# pcs pcsd clear-auth [node]
Create the Corosync/Pacemaker Cluster
To create the Corosync/Pacemaker Cluster run the following command (on one node only):
# pcs cluster setup galera-cluster --start node1 node2 node3 --force
No addresses specified for host 'node1', using 'node1'
No addresses specified for host 'node2', using 'node2'
No addresses specified for host 'node3', using 'node3'
Warning: node1: Cluster configuration files found, the host seems to be in a cluster already
Warning: node3: Cluster configuration files found, the host seems to be in a cluster already
Warning: node2: Cluster configuration files found, the host seems to be in a cluster already
Destroying cluster on hosts: 'node1', 'node2', 'node3'...
node1: Successfully destroyed cluster
node3: Successfully destroyed cluster
node2: Successfully destroyed cluster
Requesting remove 'pcsd settings' from 'node1', 'node2', 'node3'
node3: successful removal of the file 'pcsd settings'
node1: successful removal of the file 'pcsd settings'
node2: successful removal of the file 'pcsd settings'
Sending 'corosync authkey', 'pacemaker authkey' to 'node1', 'node2', 'node3'
node1: successful distribution of the file 'corosync authkey'
node1: successful distribution of the file 'pacemaker authkey'
node3: successful distribution of the file 'corosync authkey'
node3: successful distribution of the file 'pacemaker authkey'
node2: successful distribution of the file 'corosync authkey'
node2: successful distribution of the file 'pacemaker authkey'
Synchronizing pcsd SSL certificates on nodes 'node1', 'node2', 'node3'...
node2: Success
node3: Success
node1: Success
Sending 'corosync.conf' to 'node1', 'node2', 'node3'
node1: successful distribution of the file 'corosync.conf'
node2: successful distribution of the file 'corosync.conf'
node3: successful distribution of the file 'corosync.conf'
Cluster has been successfully set up.
Starting cluster on hosts: 'node1', 'node2', 'node3'...
This command creates the file /etc/corosync/corosync.conf.
The command pcs cluster start will trigger the start of Pacemaker and Corosync in the background:
# systemctl status pacemaker --no-pager
# systemctl status corosync --no-pager
Undo if something fails:
# pcs cluster destroy
Check your Corosync/Pacemaker Cluster:
# pcs status
Cluster name: galera-cluster

WARNINGS:
No stonith devices and stonith-enabled is not false

Stack: corosync
Current DC: node3 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Mon Mar 15 15:45:21 2021
Last change: Mon Mar 15 15:40:45 2021 by hacluster via crmd on node3

3 nodes configured
0 resources configured

Online: [ node1 node2 node3 ]

No resources

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
To start the pacemaker and corosync services at system restart, enable them in systemd (on all 3 nodes again):
# systemctl enable pacemaker
# systemctl enable corosync
Add Corosync/Pacemaker Resources
A resource is a service which is managed by the Cluster, for example a web server, a database instance or a virtual IP address.
Add a Virtual IP (VIP) address resource (aka Floating IP, on one node only):
# pcs resource create VirtualIP ocf:heartbeat:IPaddr2 ip=192.168.1.199 cidr_netmask=32 op monitor interval=5s

# pcs status resources
 VirtualIP      (ocf::heartbeat:IPaddr2):       Stopped

# pcs status cluster
Cluster Status:
 Stack: corosync
 Current DC: node3 (version 2.0.1-9e909a5bdd) - partition with quorum
 Last updated: Mon Mar  8 16:54:03 2021
 Last change: Mon Mar  8 16:52:32 2021 by root via cibadmin on node1
 3 nodes configured
 1 resource configured

PCSD Status:
  node2: Online
  node3: Online
  node1: Online

# pcs status nodes
Pacemaker Nodes:
 Online: node1 node2 node3
 Standby:
 Maintenance:
 Offline:
Pacemaker Remote Nodes:
 Online:
 Standby:
 Maintenance:
 Offline:

# pcs resource enable VirtualIP

# pcs status
Cluster name: galera-cluster

WARNINGS:
No stonith devices and stonith-enabled is not false

Stack: corosync
Current DC: node3 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Mon Mar 15 15:53:07 2021
Last change: Mon Mar 15 15:51:29 2021 by root via cibadmin on node2

3 nodes configured
1 resource configured

Online: [ node1 node2 node3 ]

Full list of resources:
 VirtualIP      (ocf::heartbeat:IPaddr2):       Stopped

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
As we can see, the resource VirtualIP is still stopped. To get more information, you can run the following command:
# crm_verify -L -V
(unpack_resources)  error: Resource start-up disabled since no STONITH resources have been defined
(unpack_resources)  error: Either configure some or disable STONITH with the stonith-enabled option
(unpack_resources)  error: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
Because we do NOT have shared data (Galera Cluster is a shared-nothing architecture) we do not need STONITH:
# pcs property set stonith-enabled=false
After stonith-enabled is set to false, the VIP will be started:
# pcs resource status
 VirtualIP      (ocf::heartbeat:IPaddr2):       Started node1

# ip -f inet addr show enp0s8
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    inet 192.168.1.122/24 brd 192.168.1.255 scope global dynamic enp0s8
       valid_lft 84918sec preferred_lft 84918sec
    inet 192.168.1.199/32 brd 192.168.1.255 scope global enp0s8
       valid_lft forever preferred_lft forever
Because quorum and fencing are also done by Galera Cluster itself, we do not want interference by Corosync/Pacemaker. Thus we set the no-quorum-policy to ignore:
# pcs property set no-quorum-policy=ignore
Graceful manual switchover
The rudest variant of moving a resource away from a node is to take it offline:
# pcs cluster stop node2
node2: Stopping Cluster (pacemaker)...
node2: Stopping Cluster (corosync)...

# pcs cluster start node2
node2: Starting Cluster...
A softer possibility for moving a resource away from a node is to put the node into standby:
# pcs node standby node2
Taking the node out of standby again will move the resource back to it:
# pcs node unstandby node2
Both methods have in common that the resource is moved back when the node is online again. This is possibly not what you want. The nicest way to move a resource away is the move command:
# pcs resource status
 VirtualIP      (ocf::heartbeat:IPaddr2):       Started node2

# pcs resource move VirtualIP node3

# pcs resource status
 VirtualIP      (ocf::heartbeat:IPaddr2):       Started node3
Prevent Resources from Moving back after Recovery
To prevent a resource from moving around, we can define a stickiness for the resource:
# pcs resource defaults
No defaults set

# pcs resource defaults resource-stickiness=100
Warning: Defaults do not apply to resources which override them with their own defined values

# pcs resource defaults
resource-stickiness: 100
In later tests I have seen that a resource stickiness of INFINITY gave somewhat better, but still not perfect, results.
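The corresponding command follows the same pattern as the one above:

# pcs resource defaults resource-stickiness=INFINITY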
Graphical Web User Interface
Pacemaker/Corosync also provides a Graphical Web User Interface. It can be reached via all IP addresses/interfaces of each node:
# netstat -tlpn | grep -e python -e PID
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:2224            0.0.0.0:*               LISTEN      16550/python3
tcp6       0      0 :::2224                 :::*                    LISTEN      16550/python3
It can simply be reached via the following link: https://127.0.0.1:2224/login
The user and password are the same as those used above when setting up the Cluster.
If you plan NOT to use the Web GUI, you can disable it on all nodes in the file /etc/default/pcsd (Debian, Ubuntu) or /etc/sysconfig/pcsd (CentOS), followed by a restart of the pcsd process.
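The relevant switch should look something like this (assuming your pcsd version provides the PCSD_DISABLE_GUI variable):

#
# /etc/default/pcsd (Debian/Ubuntu) resp. /etc/sysconfig/pcsd (CentOS)
#
PCSD_DISABLE_GUI=true

# systemctl restart pcsd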
Improvements
There is still some room for improvement: If a Galera node becomes not Synced (also including Donor/Desynced?) the VIP address should also move somewhere else. One possibility is to hook this into the wsrep_notify_command variable:
[mysqld]

wsrep_notify_command = pcs_standby_node.sh
The script pcs_standby_node.sh should cover the following scenarios (a sketch of the script follows the table and its footnotes):
| Scenario                             | w/o script | with script |
|--------------------------------------|------------|-------------|
| Machine halts suddenly (power off)   | OK         | OK          |
| Machine reboots/restarts             | OK         | OK          |
| Split brain                          | OK***      | OK***       |
| Instance restarts                    | NOK*       | OK          |
| Instance goes non-synced             | NOK*       | OK          |
| Instance dies (crash, OOM, kill -9)  | NOK*       | NOK**       |
| Max connections reached              | NOK*       | NOK**       |
* Your application will experience errors such as:
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.1.199' (111)
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 104
ERROR 1047 (08S01) at line 1: WSREP has not yet prepared node for application use
** For these last cases we need some more tooling...
*** Not tested but should work.
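A minimal sketch of such a script could look as follows. This is just my illustration, not a finished implementation: it assumes the usual --status argument which Galera passes to wsrep_notify_command and that the Pacemaker node name matches the short hostname:

#!/bin/bash
#
# pcs_standby_node.sh -- sketch of a wsrep_notify_command hook
#
# Galera calls this script with arguments like:
#   --status Synced --uuid <uuid> --primary yes --members <list> --index 0

NODE=$(hostname --short)
STATUS=''

# Pick out the --status argument and ignore everything else.
while [ $# -gt 0 ] ; do
    case "$1" in
        --status) STATUS="$2" ; shift 2 ;;
        *)        shift ;;
    esac
done

# Membership-only notifications come without --status: nothing to do.
if [ -z "${STATUS}" ] ; then
    exit 0
fi

case "${STATUS}" in
    Synced)
        # Node is usable again: allow the VIP to come back.
        sudo /usr/sbin/pcs node unstandby ${NODE}
        ;;
    *)
        # Node is not Synced (Donor/Desynced, Joiner, ...): push the VIP away.
        sudo /usr/sbin/pcs node standby ${NODE}
        ;;
esac

exit 0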
If you let Galera run the script now you will get some errors:
sudo /usr/sbin/pcs node unstandby node1
sudo: unable to change to root gid: Operation not permitted
sudo: unable to initialize policy plugin
ret=1
To make the script work we have to add the mysql user to the haclient group and add some ACLs [11]:
# grep haclient /etc/group
haclient:x:112:

# usermod -a -G haclient mysql

# pcs acl enable

# pcs acl role create standby_r description="Put node to standby" write xpath /cib

# pcs acl user create mysql standby_r

# pcs acl
Now the failover works quite smoothly and I have not seen any errors any more. Just sometimes the connections hang. I tried to reduce the hang by lowering tcp_retries2 to 3 as suggested here [10] but it did not help. If anybody has a hint please let me know!
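For reference, changing this kernel parameter works via the usual sysctl interface (the file name below is just an example):

# sysctl -w net.ipv4.tcp_retries2=3
net.ipv4.tcp_retries2 = 3

# echo 'net.ipv4.tcp_retries2 = 3' > /etc/sysctl.d/99-tcp-retries2.conf
# sysctl --system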
General thoughts
- A Corosync/Pacemaker Cluster is IMHO too complicated (!= KISS) for a simple VIP failover solution!
- Probably keepalived is the simpler solution. See also: [4, 5 and 6]
Literature
- [1] Pacemaker, Heartbeat, Corosync, WTF?
- [2] Corosync vs Pacemaker: wrong usage of "Corosync"
- [3] Configuring the iptables Firewall to Allow Cluster Components
- [4] Unbreakable MySQL Cluster with Galera and Linux Virtual Server (LVS)
- [5] Making HAProxy High Available for MySQL Galera Cluster
- [6] MariaDB master/master GTID based replication with keepalived VIP
- [7] Clusters from Scratch
- [8] Perform a Failover
- [9] Prevent Resources from Moving after Recovery
- [10] What value should I set for the tcp_retries2 parameter?
- [11] Setting user permissions for a Pacemaker cluster