You are here

Jörg Brühe

Subscribe to Jörg Brühe feed
FromDual RSS feed about MySQL, Galera Cluster, MariaDB and Percona Server
Updated: 8 min 36 sec ago

How the Lack of a Primary Key May Effectively Stop the Slave

Tue, 2017-05-02 09:50

Most (relational) DBAs and DB application developers know the concept of a primary key ("PK") and what it is good for. However, much too often one still encounters table definitions without a PK. True, the relational theory based on sets does not need a PK, and all operations (insert, select, update, delete) can also be done on tables for which no PK was defined. If performance doesn't matter (or the data volume is small, a typical situation in tests), the lack of a PK does not immediately cause negative consequences.

But recently, we had several customers who (independent of each other) had big tables without a PK in a replication setup which they wanted to delete, and they all suffered severely from that: Their replication did not progress, the slave lag grew larger and larger, and all this without reporting any error. When I say "larger and larger", I mean it: I'm not talking about minutes or even hours, I'm talking about days!


In all cases, the situation could be summarized as:

  • They were running a traditional, asynchronous replication using the "row" format.
  • They had a table without PK with many entries, say a million rows.
  • This table wasn't needed any more, and someone issued a "delete from T" statement on the master.
  • From this moment, the slave did not show any sign of progress: The output of "show slave status \G"
    - listed both the IO and the SQL slave thread as running,
    - reported the same log positions without any change,
    - did not mention any slave error,
    - but the slave lag grew in sync with the passing time.

Looking at the slave's machine load, say using "vmstat", you could see a certain amount of CPU load in user mode, possibly some "waiting for IO", and some read IO (group "io", column "bi" = "block input"). Checking with "top", you could see that the MySQL server process was the only significant CPU load, and closer inspection would reveal that it used roughly one CPU core.

(A side remark: The numbers of "vmstat" and "top" are not directly comparable: While "vmstat" reports the overall CPU usage as a percentage of the total available CPU power, "top" reports the CPU usage of the individual processes as the percentage of a single CPU core. So if you had a 4-core machine with only one active process running an infinite loop, "top" would show this process as using 100%, while "vmstat" would report the CPU as running 25% in user mode and being 75% idle. If you don't know it, look at "/proc/cpuinfo" to find the number of CPU cores.)

What was the Issue? From "top", we could tell that the MySQL server was using one CPU core at full speed. From "vmstat", we could tell that it might access some data from disk, but didn't write any significant amount. This sure looked like an infinite loop or a close relative of it, executed by only one thread.

Some of you may have heard that (under some conditions) replication may do a full table scan: This is it!

When the slave has to apply a change (for this discussion: a delete or an update) which is provided in row format, and the table has no PK (or other suitable index), the SQL thread does a full table scan searching for the matching row. Note that it doesn't stop on the first match, but rather continues the scan throughout the rest of the table.

To be honest: I don't understand why it doesn't stop after the first match. I also ran a modified test where some master rows had two matches on the slave (I had duplicated them by "insert select"): The "delete" was still replicated as before, and these additional slave rows were still present after the delete.

Until now, you may consider that "full table scan" as a claim without proof, and I agree with all demands for a more thorough check.

The Experiment For tests, I have a setup of two VMs, each running a MySQL server process, and they are configured for a traditional master-slave replication. In fact, I have it for the versions MySQL 5.5.44, 5.6.24, and 5.7.17. (Yes, I might add newer versions.)

So I designed an experiment:

  • Start both master and slave, make sure replication is running.
  • On the master, create a table without any key / index, neither primary nor secondary: CREATE TABLE for_delete ( inc_num int, # ascending numbers 1 .. 10,000 repeated for larger tables clock timestamp, # now() counted: 129 different values with 200 k rows bubble char(250) # repeat('Abcd', 60) ) DEFAULT CHARACTER SET latin1 ;
  • Insert into that table the desired number of rows (I had runs with 10 k, 20 k, 100 k, and 200 k)
  • on both master and slave, run "SELECT ... FROM information_schema.global_status WHERE ..." to get status counters,
  • Delete all rows of that table
  • sleep for a time depending on the row count (you will later see why)
  • get new status counters
  • compute the status counter differences

This experiment reliably reproduced the symptoms which the customers had reported: When the "delete" had been done on the master, the slave became busy for a long time, depending on the row count, and in this time it did not show any progress indicators. This holds for all three versions I checked.

So what did the slave do? It worked as assumed: For each row, it did a full table scan, continuing even after it had found a matching row. This can be easily seen in the status counters, which you will find in the result tables below.

In this article, I will show the results of tests with 200 k rows only, as they are the most impressive. As described above, the table was created anew for each run. Then, 10,000 rows were inserted, and this was repeated 20 times. Each such group had the value of "inc_num" go up from 1 to 10,000.

In all runs, the status counters "HANDLER_DELETE", "INNODB_ROWS_DELETED", and "INNODB_ROWS_READ" showed the value of 200,000, the table size. So the handler calls did not show any scan, it must be below that interface (= within the handler).

If the server does a table scan, it must access all data pages of the table, so there must be read requests for them to guarantee they are in the buffer pool. And yes, the status counters showed these read requests: Table 1: No Key or Index

200,000 rows, status counters on Master Slave

Version 5.5.44

INNODB_BUFFER_POOL_READ_REQUESTS 2,174,121 380,132,972 Ratio "read_requests" Slave/Master
174.8 "read_requests" per "rows_deleted" 10.9 1,900.7

Version 5.6.24

INNODB_BUFFER_POOL_READ_REQUESTS 2,318,685 379,762,583 Ratio "read_requests" Slave/Master
163.8 "read_requests" per "rows_deleted" 11.6 1,898.8

Version 5.7.17

INNODB_BUFFER_POOL_READ_REQUESTS 1,598,745 379,538,695 Ratio "read_requests" Slave/Master
237.4 "read_requests" per "rows_deleted" 8.0 1,897.7

While master and slave (from a logical point of view) do the same work (delete all rows from a 200 k rows table), the slave accesses many more pages than the master - depending on the version, the factor ranges from 164 to 237. (Note that in version 5.7 the slave didn't do worse than in earlier versions, rather the master did better. This shows clearly in the ratio of "innodb buffer pool read requests" to "rows deleted" on the master: This is about 11 in 5.5 and 5.6, but only 8 in 5.7.)

I won't bother you with the detailed results for 100 k rows. It is enough to say that for half as many rows, the master had half as many read requests, while the slave had a quarter. This means that the effort on the master grows proportional to table size, but on the slave it grows with the square of the table size. Accepting the fact that the slave always scans the full table, this is easy to understand: If the table has twice as many rows, there are twice as many pages to scan, and the number of scans also doubles.

Quadratic growth means that an increase of the table size by a factor ten lets the effort grow by a factor hundred! Now imagine I had tested with a million of rows ...

All this isn't new: In the public MySQL bug database, this is listed as bug #53375. The bug was reported on May 3, 2010, for MySQL 5.1. The bug is closed since December 15, 2011. This does not mean the full table scans were stopped - obviously, they are considered unavoidable, so the bug fix is to write an explaining message into the error log. In my tests, I encountered it only once, here it is:

[Note] The slave is applying a ROW event on behalf of a DELETE statement on table for_delete and is currently taking a considerable amount of time (61 seconds). This is due to the fact that it is scanning an index while looking up records to be processed. Consider adding a primary key (or unique key) to the table to improve performance. Avoiding the Issue

Of course, this problem is the consequence of sloppy table design: If there were a primary key, it would not occur. But this may not help those who currently have tables without a PK and dare not add one, because it would require prior testing and a maintenance window to alter the table on the master.

Luckily, there are remedies, and even more than the log message (quoted above) mentions: In my tests, I found that the slave will use any index on that table if one is available. (It even uses an index with low selectivity, which may increase the page request count.) I did 5 tests:

  • Create an index on "clock" (129 different values).
  • Create an index on "inc_num" (10,000 different values).
  • Create an index on both "clock" and "inc_num" (all combinations are distinct).
  • As before, and declare it as "unique".
  • Define the combination of "clock" and "inc_num" to be the primary key.

The "alter table" statement for this was always issued on the slave only, after the inserts, but before the delete.

In all three versions, the slave used any index available. The index on "clock" of course had many different rows per value, so it caused many even more page requests than the no-index case: between 2,800 and 3,165 per row deleted. Obviously, thios makes the problem even worse. But all other indexes were huge improvements, the primary key of course was the best choice. Here are the numbers: Table 2: With Added Key or Index

200,000 rows, status counters on Master Slave Slave Slave Slave Slave

"Alter table ..." on slave before "Delete":
Index on "clock" 129 values Index on "inc_num" 10,000 v. Index
on both
Unique I.
on both
Primary key on both Version 5.5.44

INNODB_BUFFER_POOL_READ_REQUESTS 2,174,121 633,132,032 10,197,554 4,429,243 4,421,357 1,810,734 Ratio "read_requests" Slave/Master
291.2 4.7 2.0 2.0 0.8 "read_requests" per "rows_deleted" 10.9 3,165.7 51.0 22.1 22.1 9.1

Version 5.6.24

INNODB_BUFFER_POOL_READ_REQUESTS 2,318,685 610,482,315 10,327,132 4,375,106 4,073,543 1,794,524 Ratio "read_requests" Slave/Master
342.1 5.0 2.1 2.0 0.9 "read_requests" per "rows_deleted" 11.6 3,052.4 51.6 21.9 20.4 9.0

Version 5.7.17

INNODB_BUFFER_POOL_READ_REQUESTS 1,598,745 559,398,633 9,531,444 3,859,343 3,836,390 1,720,731 Ratio "read_requests" Slave/Master
354.8 6.0 2.3 2.5 1.1 "read_requests" per "rows_deleted" 8.0 2,797.0 47.7 19.3 19.2 8.6

But one very important fact should be mentioned in addition to the numbers: The fact that it worked! By adding an index, I made the slave's schema differ from the master's. The primary key even totally changed the B-tree in which the rows are stored, and it made the slave drop the internal row ID which InnoDB had added on the master. Still, replication using the row format could handle these differences without problems.

The Way Out

So the good message is:

  • Even if you have a table without indexes or primary key, you can add these on the slave without breaking the replication.
  • If you suffer from slow replication on such a table, adding a good index or (even better) the PK will solve this problem.
  • In a replication setup, you can improve the schema on the slave and then do a failover, effectively improving it for all accesses without any maintenance window - just the short failover time.

This might help many DBAs who otherwise don't see a chance to improve a bad schema once it is used in production.

Take care!

Multiple MySQL Instances on a Single Machine

Thu, 2016-07-28 13:11

Typically, on a single machine (be it a physical or a virtual one) only a single MySQL instance (process) is running. This is perfectly ok for all those situations where a single instance is sufficient, like for storing small amounts of data (RedHat using MySQL for postfix, KDE using it for akonadi, ...), as well as those where a dedicated machine per MySQL instance is appropriate (high CPU load, memory fully loaded, availability requirements).

But there are also those users who want to (or would like to) have multiple instances which would still fit into a single machine. Even among them, a single instance per machine is typical. For this, there are good reasons:

  • MySQL comes with defaults for files (config file, error log, ...) and directories (data directory, binlogs, ...) which would cause conflicts between multiple instances (unless they are changed).
  • The scripts coming with MySQL, especially the automated start/stop with machine reboot/shutdown, are written for a single instance only.
  • Last but not least: The instructions. those in the manual as well as the many "How to setup ..." in the Web, cover a single instance only.
Because of this, users often either restrict themselves to a single instance, or they set up several virtual machines (or containers) holding a single instance each.

But that overhead (both in software and in labour) isn't necessary: There is a way out, supporting easy handling of multiple MySQL instances on a single machine directly, without containers or VMs. This is our "MyEnv" package, available for download here, licensed under the GPL.

What Does MyEnv Do?

MyEnv cares about two aspects which in combination provide easy use of multiple instances:

  • It helps to configure multiple MySQL instances without overlap, so they won't collide with each other.
  • It maintains separate environments, each to manage and access one specific instance.

Each environment contains the path to the binaries (so the instances can use different versions), the config file, the socket and port number, data directory, error log etc. The environment is specified by a name (choose a meaningful one!), and it is switched by using its name as a shell command. (MyEnv creates an alias for that.)

Administrative commands like "start" and "stop" will manage the instance of the current environment. MySQL client programs like "mysql" or "mysqldump" will access that instance.

MyEnv supports the autostart of instances at machine boot, configurable per instance - something which is impossible using only the tools of a MySQL distribution.

Of course, an instance started via MyEnv (either manually or via autostart) can be accessed by any other client program on the machine, or from any other machine in the network - all that is needed is the specification of the proper socket or network port.

Handling Multiple Binaries

In the previous section, I wrote the instances can use different versions. This is done by installing those different versions into different locations, controlled by MyEnv, and the directory with the binaries will become a component of the user's PATH variable, switched when the environment is switched. Obviously, this works only if the destination path of the installation can be controlled, which implies the tar.gz format - RPM or DEB packages have fixed destinations, so different versions would overwrite each other on installation.
But that is no severe limitation, as all MySQL versions are available in tar.gz format, and these are sufficiently generic to run on any reasonably current Linux distribution.
(Yes, that is something I forgot to mention: MyEnv is developed and tested on Linux only. You are welcome to try it on any other Unix platform, and we will gladly listen to your experiences and accept your contributions, but we do not actively pursue non-Linux platforms.)

This support for multiple versions makes MyEnv the perfect tool for application development: Using a single machine, you can let your application access the MySQL servers of different versions and can verify it works the way you want it to.

Similar, you can install binaries of MySQL (Oracle), Percona Server, or MariaDB, and verify your application is portable across them.

And the adventurous among us can use different binaries, from the same or different vendor(s), to test whether replication works across versions and/or vendors, all without the effort of installing a separate VM or container setup.

MyEnv and Galera Cluster Till now, I mentioned MySQL (and its variants), and many readers may associate that term with a traditional single instance. So I better state explicitly: Of course, such an instance can take part in replication, in any role: master, slave, or intermediate in multi-level replication.

But besides single instances and replication, there exists a different MySQL setup: Nodes combined to form a Galera Cluster. And again, let me state explicitly: Again of course, an instance controlled by MyEnv can be a node participating in a Galera Cluster.

Those readers who have experiance with Galera Cluster (or who have just read the documentation or blogs about it) know that to start the first node of a cluster a special command is needed, called "bootstrap" - a simple "start" will not do. So this command was also added to MyEnv, it can manage a Galera Cluster completely by its builtin commands.

RPM and DEB packages Above, I wrote that to install different versions you cannot use RPM or DEB packages. I did not write that MyEnv cannot use RPM or DEB - in fact it can, the absolute path names in these formats just limit this to a single version.

So you can install the RPM or DEB of your choice, disable its autostart, and then call MyEnv to create multiple instances. You will give them different names, specify different sockets and ports and use different data directories, but for all of them you will specify the same path "/usr". As a result, MyEnv will simply manage multiple instances of the same version.

You can configure them differently to test the consequences, or you can set them up to replicate among them - master and slave can run on the same machine. Of course, this will not give you the "high availability" or the "scale-out" benefits which are the typical reasons to use replication, but I trust this wasn't your purpose for this test.

Using binaries that include Galera, and configuring them properly, you can even run all nodes of a Galera Cluster as separate instances on a single machine. That may be considered to stretch the concept, because a single machine is a very different setup than separate machines, but it gives an idea of the possibilities opened by MyEnv.

Typical Use of MyEnv

Admitted: The claim to know what MyEnv is used for by others would be arrogant, and I do not uphold it. Nonetheless, we do know some use cases of people who downloaded MyEnv, and they are close to our internal use of the tool.

MyEnv allows to have multiple MySQL instances on the same machine, to manage them separately, and to access them using MySQL client programs or other applications. So it is the perfect setup for all those who need to access different versions: developers and software testers.

When we encounter some unexpected behaviour, we often want to know whether it is specific to some version or series, or is widespread. To check that, MyEnv is the perfect infrastructure: You write a test case to provoke the effect and run it on several versions, then you note the result and can tell whether it exists "since ages" or is new, whether it still occurs in current versions or will change with an upgrade - exactly the information you need to decide about an upgrade or write a bug report.

Database administrators and application developers use it to avoid nasty surprises with new versions, so their production instances will not suffer from unexpected functional changes. Setting up a test environment, especially for multiple versions, becomes cheap, much less ressources are needed. You don't need to copy your test code onto different machines, and you are sure you are running identical tests, so that you won't compare apples and oranges.


If all that made you curious, I invite you to look into the instructions, to download MyEnv and to try it. And of course, your feedback and reports are very welcome.

Take care!

Appendix: Where to Meet Us

All FromDual colleagues will deliver talks at the FrOSCon in St. Augustin near Cologne, Germany, on August 20 and 21, so that is a good opportunity for personal contact. As several talks will be delivered in English, the conference also meets the needs of attendants who cannot follow a German talk - check the programme. Froscon is a famous event, very interesting talks are promised, and I look forward to enjoy the community atmosphere there.

I will deliver a talk at the "Open Source Backup Conference" in Cologne, Germany, on September 26 and 27; this conference is held in English.

I do not have feedback yet about Percona Live in Amsterdam, I may attend that also.

And finally, FromDual will again have a booth and deliver talks at the DOAG conference on November 15 - 18 in Nuremberg, Germany. This is "the" event for Oracle users (at least in Germany, maybe in all Europe), and it has a separate track dealing with MySQL only.

We will be delighted to meet you face to face!

Past and Future Conferences, and Talks Around MySQL

Mon, 2016-04-11 15:25

Time flies, and my blogging frequency is quite low. More frequent would be better, but knowing myself I'll rather not promise anything ;-)

Still, it is appropriate to write some notes about CeBIT, the "Chemnitzer Linuxtage 2016", and future events.


CeBIT was running from March 14 to 18 (Monday till Friday) in Hannover, Germany, and I will leave the general assessment to the various marketing departments as well as to the regular visitors (to which I do not belong).

In order to meet our current customers as well as potential future ones, FromDual had a booth in the "Open Source Forum". We displayed a Galera Cluster, running on three tiny headless single-board Linux machines, and showed how it reacts to node failures and then recovers all by itself, without any administrator intervention. Many of our visitors were fascinated, because a HA solution would be a good fit in their solution architecture. We had got several stuffed dolphins "Sakila", the traditional MySQL symbol, and all of them found new homes (typically with the words "for my grandchild"). :-)

IMHO, the "Open Source Forum" had deserved a better visitor attraction than it really got - placing it into one hall with document management systems was no good fit, research and development might have been more appropriate.
The forum had an area for talks which were running all five days, I consider John "Maddog" Hall (who had provided an Alpha machine to Linus Torvalds decades ago) and Prof. Klaus Knopper (who is maintaining the "Knoppix" live distribution) the most prominent speakers. FromDual's Oli Sennhauser talked about the new features of MySQL 5.7, you can get the slides via the FromDual download page.

Chemnitzer Linux-Tage

The weekend following CeBIT, March 19 and 20, had been selected for the Chemnitzer Linux-Tage. Like in the previous years, the conference attracted many visitors from all over Germany as well as from some neighbouring countries, and both John Hall and Klaus Knopper had come there directly from Hannover - like me and several others.

As usual, the conference programme covered all aspects of Linux, the headline was "It is your project!". Databases are definitely not one of the major topics there, it is more about overall trends, distributions, communication, and many other aspects.

I delivered a talk about "RPM conventions - a Modern Tower of Bable", and it was well received. I am using various MySQL RPMs (from MySQL AB, Oracle, or RedHat) as examples to show different opinions about packaging, dependencies, installation actions, and compatibility issues, which partly originate from the diverging positions of software developer vs distributor. MySQL was used as a well-known example (but will interest my readers here), most of the items are also applicable to almost any software. Again, the slides (in German) are available on the FromDual web site. Your comments are welcome!

Open Source Data Center Conference

We all know that Open Source has become a major force in computing, so it is no surprise to have it as the subject for various conferences.

One of them is the "Open Source Data Center Conference" "OSDC", to be held in Berlin on April 26 - 28. Open Source database systems are one of the topics, and the programme committee accepted my talk "MySQL-Server in Teamwork - Replication and Galera Cluster". After the conference, I will upload it on the FromDual site and make it available for download.

Now, having told you all this, i will turn to customer issues again ...


On Files, the Space They Need, and the Space They Take

Tue, 2016-02-09 14:55


xfs Users, Take Care!

Recently, we had a customer ask: Why do many files holding my data take up vastly more space than their size is? That question may sound weird to you, but it is for real, and the customer's observation was correct. For a start, let's make sure we are using the same terms.

  • The size of a file is the number of bytes it will deliver if it is read sequentially from start to end.
  • The space it takes up is the sum of all disk pages which are used to hold the file's data, or to locate those data pages ("indirect" blocks in Unix/Linux terminology).

Every Unix/Linux admin knows (or at least should know) that a file may take up less disk space than its size is. This happens when not all bytes of the file were really written, but the write pointer was advanced via "seek()", leaving a gap. Disk pages which are completely contained in such a gap will not be written, and reading these positions will produce bytes containing zero. This is called a "sparse file". You will find some remarks about them in our blog at, or search the net for that term.

The Customer's Message

Now that we have brought those basics into active memory again, let's return to the original question: Can there be files which take up vastly more space than their size is? We will not consider potential administrative overhead (pointers to pages), because to the customer a file of slightly more than 4 GB was reported to take up 8.1 GB disk space - see this quote from his mail (file name changed):

# ls -l some_table#P#p01.ibd
-rw-rw---- 1 mysql mysql 4307550208 Jan 4 01:06 some_table#P#p01.ibd
# du -hs some_table#P#p01.ibd
8,1G some_table#P#p01.ibd

Luckily, the customer's mail mentioned the file system: It was not one of the "ext" family (ext2, ext3, or etx4), but rather they are using xfs. This gave me a hint to search for information, and Google provided several pointers, IMO the most helpful ones where these:

Both these texts describe that the Linux kernel includes a tuning function for xfs file systems, which is to pre-allocate pages at the end of a growing file. Originally, the amount was small (64 kB), but then the size was made a function of the file size - the larger the file, the more pages were pre-allocated. Hence, this is now called "dynamic speculative EOF preallocation". It is based on the assumption that the file will continue to grow, and these pre-allocated pages are adjacent, so the performance of later file use (especially reads) will be improved. To not waste disk space permanently, such pre-allocated pages will be cut from the file when it is closed.

Measuring File Size and Space Taken

To see this behavior in practice, I wrote a little shell script that lets a file grow in increments of 160 kB (= ten InnoDB pages of default size) without closing it. (You can find it attached.) In parallel, I checked the size ("ls -l --block-size=K") and the space allocated ("du -k"). With this script, I could easily observe the effect:

Test './try-xfs-prealloc' is running on TTY 'pts/10'.

Linux trift-6core 3.13.0-74-generic #118-Ubuntu SMP Thu Dec 17 22:52:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
/dev/mapper/vg1000-XFS_try on /XFS_try type xfs (rw)
Dateisystem 1K-Blöcke Benutzt Verfügbar Verw% Eingehängt auf
/dev/mapper/vg1000-XFS_try 52403200 33504 52369696 1% /XFS_try
-rw-rw-r-- 1 joerg joerg 480K Feb 4 16:56 /XFS_try/somedir/bigfile
960 /XFS_try/somedir/bigfile
-rw-rw-r-- 1 joerg joerg 960K Feb 4 16:57 /XFS_try/somedir/bigfile
960 /XFS_try/somedir/bigfile
-rw-rw-r-- 1 joerg joerg 1440K Feb 4 16:57 /XFS_try/somedir/bigfile
1984 /XFS_try/somedir/bigfile
-rw-rw-r-- 1 joerg joerg 1920K Feb 4 16:57 /XFS_try/somedir/bigfile
1984 /XFS_try/somedir/bigfile
-rw-rw-r-- 1 joerg joerg 2240K Feb 4 16:57 /XFS_try/somedir/bigfile
4032 /XFS_try/somedir/bigfile

(( several lines not quoted ))

-rw-rw-r-- 1 joerg joerg 11200K Feb 4 17:02 /XFS_try/somedir/bigfile
16320 /XFS_try/somedir/bigfile
-rw-rw-r-- 1 joerg joerg 11680K Feb 4 17:02 /XFS_try/somedir/bigfile
16320 /XFS_try/somedir/bigfile
No further writes, status:
-rw-rw-r-- 1 joerg joerg 12288000 Feb 4 17:02 /XFS_try/somedir/bigfile
16320 /XFS_try/somedir/bigfile

Writer process killed, status:
-rw-rw-r-- 1 joerg joerg 12288000 Feb 4 17:02 /XFS_try/somedir/bigfile
12000 /XFS_try/somedir/bigfile

While dynamic preallocation is a good idea for most files, it fails badly on MySQL data files: The MySQL server will not close them, in general, even when they won't grow any further (like a table partitioned by date). So this is what the customer detected: A table partition which had grown somewhat beyond 4 GB had got pages for another 4 GB pre-allocated, they were not released, and this happened for many files. Those of you who have ample disk space may say "who cares?", but there are others who have to care. For them, this feature has risky consequences, so they should try to prevent them.

Avoiding The Unlimited Growth

Basically, the only way out is to use the "allocsize" mount option, as described in the FAQ. InnoDB reads 64 pages of 16 kB at most, so "allocsize=1M" might be best.

Like the customer, many DBAs or SysAdmins may not be aware of that behaviour and might detect it only on the running system. Of course, the first question will be: "Can I fix that without downtime?" Immediately, a "mount -o remount" comes to mind, so I tried that: While my test script was running, I issued
sudo mount -o remount,allocsize=1M /XFS_try

Sadly, I must tell you it had no effect: The size of the pre-allocated space continued to grow, like in the original run. Even worse, this command also did not have any effect on a run I started after issuing it.

This proves that the value of "allocsize" cannot be changed for a mounted XFS file system, rather its value at mount time remains effective until the unmount. Only when I unmounted it and then mounted it anew, giving "allocsize=1M", did I see the fixed size as pre-allocation amount. From the DBA point of view, it means that a shutdown of the MySQL instance cannot be avoided for this change. (Of course, if we talk about a Galera cluster, the system remains available, because the nodes can be handled one at a time.)

Can You Get It Without Shutdown?

Now what if you really need to avoid a shutdown, but also need to get back the pre-allocated space urgently? As written above, this will happen only when the file is closed. So the question is: How can the DBA let the MySQL server close a table data file without interrupting the service? There seems to be a chance: the "flush tables" statement. The manual says:

Closes all open tables, forces all tables in use to be closed, and flushes the query cache. ...

FLUSH TABLES tbl_name [, tbl_name] ...
With a list of one or more comma-separated table names, this statement is like FLUSH TABLES with no names except that the server flushes only the named tables. ...

The text is identical for versions from 5.1 to 5.7.

But then, see the user comment by Simon Mudd on that page: No effect for InnoDB. To check this, I wrote a script that inserts rows into an InnoDB table, then let it run: The effect of preallocation is clearly visible. However, sometimes the space used may suddenly go down to the file size, then go up again. My impression is that XFS will react different to a plain file and an InnoDB table, because a file will grow sequentially at the end only while an InnoDB table also has writes to other blocks during its growth. At the end of the insert run, "ls -l" and "du" might show a big preallocation, but not in all runs. To really be sure, I repeated the experiment with the daemon "mysqld" running under "strace" control: No, I did not get a "close()" logged from the "flush table" command.

I had the opportunity to discuss it with a MySQL developer: Yes, that is correct, and it is intentional. InnoDB relies heavily on background threads, and they do not want to add the complexity of syncing these tasks with a "flush table" command. So there is no command that would guarantee the release of preallocated space.

Conclusion While xfs is a good file system for databases, the "dynamic speculative EOF preallocation" is a feature to be aware of, and you may want to limit its amount so that you don't have too much wasted space on your disk(s). Use the "allocsize=" mount option, and remember that it needs to be set before the mount.

Take care!

AttachmentSize Shell script to show XFS preallocation1.92 KB

How to Get a Galera Cluster Into Split Brain

Fri, 2015-10-23 17:02

"Split Brain" is the term commonly used for a cluster whose nodes have different contents, rather than identical as they should have. Typically, a "split brain" situation is the DBA's nightmare, and the Galera software is designed to avoid it. Galera is very successful in that avoidance, and it needs some special steps by the DBA to achieve "split brain". Here is how to do it - or, for most DBAs, what to avoid doing to not get a split-brain cluster.

Galera's Design

First, let's remember how Galera is operating:

  • The Galera software ensures that all nodes participating in a cluster will start from identical contents, by doing a "snapshot state transfer" (SST) of all current data to a newly joining node.
  • When the cluster is running, Galera transfers all changes (transactions) to all cluster nodes and applies them (or rolls back and ignores, in the case of a conflict).
  • If some connections get lost, all nodes check whether they "have quorum" (belong to a majority), and stop serving requests if they don't.
  • When a disconnected node re-joins the cluster, it gets all meantime changes transferred ("incremental state transfer" IST) and so makes its contents current.
  • Should that be impossible, because some of those changes have become unavailable (log purge), a full transfer (SST) is done.
By this design, the Galera software successfully avoids getting into a "split brain" situation.

Of course, the quorum is a well-known concept. The old term for it is "majority consensus", and the approach is built on a simple principle:
In any set (of cluster nodes), there cannot be two (or more) non-overlapping subsets which both contain a majority of the elements.
So if some loss of connectivity splits the cluster into subsets, at most one of them can "have quorum", all others will stop serving requests, and there cannot be two (or more) different directions in which the contents (data) changes.

What Galera introduced (compared to previous designs of distributed DBMSs) is the efficient transfer of changes and conflict detection / resolution ("certification" in Galera terms) at "commit" time that makes the system fast, while previous designs used "distributed locking" or other principles which added latency to many commands and so made their systems slow.

The Story

Let's get back to the "split brain" issue, of which I said that Galera avoids it, and also said it can be reached. Sounds contradictory? Well, there are more active components than just Galera. Here is a real-world case, as happened to (I won't say "achieved by") a customer:

Originally, they had set up a Galera cluster of three nodes; let's call them A, B, and C. This is started by bringing up node A as a stand-alone node, running MySQL with the "wsrep" plugin. Then, one after the other, nodes B and C are configured to join node A in forming a cluster, and started. As a result, there are three nodes communicating with each other that form the cluster. In addition, HAproxy is running somewhere, it will direct the clients to an active cluster node.

So far, so good: The cluster is running, clients connect and issue transactions, everything is ok, and the DBA/s turn/s to other tasks ...

Some time later, node A must be stopped to do some hardware maintenance. No problem, nodes B and C are running fine, they have quorum (2 of 3 is a majority), so the system is still available and operations continue. HAproxy detects that node A does not respond, so it directs all clients to B or C. The cluster architecture is serving its purpose of continuous availability even during a maintenance period.

Maintenance is done, node A is rebooted, its MySQL+Galera server process restarts. It comes up, HAproxy detects it as running and directs clients to it. All seems fine ...

Three hours later, someone has become suspicious, detects trouble, and node A is stopped. Why? What has happened?



What has happened?

Remember what I wrote about the cluster setup: It was

... started by bringing up node A as a stand-alone node, ... ... nodes B and C are configured to join node A ... ... three nodes communicating with each other that form the cluster.

These steps were sufficient to get the nodes up and running. What was missing, however, was to re-configure A from "stand-alone" to "member of cluster with B and C".

As a consequence, when A was restarted after the maintenance, it again came up stand-alone. It did not try to join the cluster (which would have triggered an IST of the meantime changes) or check for quorum (a stand-alone node is self-sufficient).

Based on A's configuration (as read from disk), all was fine.
Based on the concept of the A+B+C cluster, it was a plain, simple split-brain:
Some changes were done on B+C (which still had quorum), while others were done on A only (which was mis-configured).

Lesson to learn (or rather, to bring back into active brain memory):
If some configuration is changed at run-time, this change must also be done in the configuration files so that it is used on restart.

The typical example for such changes is a "set global" command modifying some dynamic variable, like "max_connections".
But in a Galera cluster, a node joining the others is also a dynamic configuration change, and it should ASAP be reflected in all configuration files. If this isn't done, the consequences might be as described above.


Now, most stories have a happy ending, and this one shouldn't be an exception:

Luckily, the application uses self-generated keys, similar to UUIDs, so the entries created on A did not conflict with those of B+C. Also, there were no changes of existing data, just inserts. So the situation could be corrected by extracting all new data from A, inserting them on B+C, and then resetting A's state so that it asked B+C for an SST. Uff!

Operational Advice

There are some tools available that will do such a transfer. However, it can be done completely with standard parts coming with MySQL:

  • "mysqldump" will extract the data from A.
  • Suitable options will make sure this exctract does not contain "drop" or "create" commands, and generates "insert ignore".
    Check the documentation for the options "--no-create-db", "--no-create-info", "--skip-add-drop-table", and "--insert-ignore".
  • If the old, common data are deleted first, both dumping and loading becomes faster, and duplicates are reduced / avoided.
  • "mysql" can be used to load these data into B or C.
Note, however, that conflicts will be ignored and not reported. Other tools or approaches might do that.

Had they used auto-increment keys, or had they modified existing data, it would have been much more complicated, and it might even have been impossible to combine all changes without losing some. I leave it to your imagination to think of such scenarios.

To repeat the lesson in DBMS / DBA terms:

  • The most important property of a database is consistency, it must be kept up at all times.
  • For database operations, the configuration on disk (in files) must be consistent to that in RAM (of the running processes), so any runtime changes must be reflected in the configuration files on disk to maintain consistency.

Percona's "pt-config-diff" can be used to compare a node's current variables to its configuration file.

Take care!

The Upcoming Leap Second

Mon, 2015-06-29 17:56

The press, be it the general daily newspaper or the computer magazines, is currently informing the public about an upcoming leap second, which will be taken in the night from June 30 to July 1 at 00:00:00 UTC. While we Europeans will enjoy our well-deserved sleep then, this will be at 5 PM (17:00) local time on June 30 for Califormia people, and during the morning of July 1 for people in China, Japan, Korea, or Australia. (Other countries not mentioned for the sake of brevity.) This is different from last time, when the leap second was taken in the night from Saturday to Sunday (2012-July-1 00:00:00 UTC), so it was a weekend everywhere on the globe.

We have got several requests from our customers about this upcoming leap second, whether they need to take any special precautions or whether they "are safe". Well, obviously nobody is "safe" from the leap second in the sense that it would circumvent them, everybody will encounter it on their systems. The concern is whether they have to expect any trouble.

For the people operating MySQL (or any other DBMS), this issue is threefold:

  • How will the operating system behave?
  • How will the database system behave?
  • How will the applications behave?
  • Let us look at the operating system first, dealing with Linux only. (All other operating systems don't show up with significant numbers in our customer base.)
    Linux measures the time in seconds (since Jan 1, 1970, 00:00:00 UTC), and it does not include the leap seconds in this counting. When the leap second is taken, the timestamp value (those seconds) will simply not be advanced at the end of the regular second, but it will re-use its current value in the leap second. As a result, the conversion of timestamps into common time reckoning will still produce values from 0 to 59 for the minute as well as for the second, there will not be a 60. Another consequence is that there is no way to tell the leap second from the preceding regular one.

    MySQL always takes the time information from the Linux kernel, so it will also use the same timestamp value (or "now()" result) for both the regular and the leap second. The MySQL manual describes this on its own leap second page, which has become inaccurate by some recent code changes: We could not reproduce the results given there (and will probably file a documentation bug about this). However, that difference does not affect the principle of the page's first paragraph.

    About the application, it is hard to claim anything - there are too many of them. However, it is obvious that an application might run into trouble if it managed to store a new timestamp value every second and assumed they are distinct: they will not be (unless the application manages to skip the leap second). Let's hope any application programmer creating such a high-resolution application was aware of the problem.

    So does it all look fine? Not completely: Following the last leap second (just three years ago), several administrators noticed that their machines became extraordinarily busy, which even let the power consumption rise significantly. This was caused by a bug in the Linux kernel, it let user processes resume immediately if they were trying to wait on a high-resolution timer. The exact details are beyond the scope of this article, your favourite search engine will provide you with more than enough references if you ask it.
    For our purposes, the important fact is that this also happened to MySQL server processes ("mysqld") because InnoDB was doing such operations.

    Back then, Sheeri K. Cabral (well-known in the MySQL blogosphere) published a cure which her team had discovered:
    This loop can be broken by simply setting the Linux kernel's clock to the current time.
    date -s "`date`"
    While this looks like a no-op at first sight, I assume that it will reset the sub-second time information and so will have a small effect, which obviously is sufficient to terminate that loop. (In practice, you should stop your NTP daemon before changing the date, and restart it afterwards. The exact command will depend on your Linux distribution.)

    Well, this was three years ago. The issue was reported to kernel developers, a fix was developed (which gives credit to Sheeri's report) and applied to newer kernels, and it w as also backported to older kernels. From my search, it seems that all reasonably maintained installations should have received that fix. For example, I have seen notices that Ubuntu 12.04 (since kernel 3.2.0-29.46) and 14.04 do have it, as does RedHat 6.4 (since kernel 2.6.32-298); for other distributions, I did not search.

    In addition to a current kernel, you obviously need current packages including the leap second information. Typically, this would be "ntp", "ntpdate", "tzdata", and related ones. Don't forget to restart your NTP daemon after installing these updates!

    So I wish all our readers that the leap second may not get them into trouble, or that (if worst comes to worst) the cure published by Sheeri may get them out. However, I have to remind you of the old truth which is attributed to Mark Twain (sometimes also to Niels Bohr):
    "It is difficult to make predictions, especially about the future."

    Linuxtag: Knowledge and People - and New Colleagues?

    Wed, 2015-02-04 10:49

    At FromDual, we are currently preparing for our participation in the "Chemnitzer Linux-Tage" in March.
    While we don't yet know whether the programme committee accepted our proposed talks, we will have a booth and hope for interesting exchanges with others from the MySQL, database, Linux, ... world. Of course, we will also mention that we are looking for additional colleagues - there are so many tasks that we need more people to handle them all. (In case you got curious, look here: )

    While I attended the Chemnitzer Linux-Tage already last year (thank you, Frank Hofmann, for recommending them!), it will be the first time there for FromDual as a company. I enjoyed the informal atmosphere, which made it easy to get into contact with new people, to learn, and to discuss both facts and opinions.

    To those of you who didn't yet attend such events, I can only say: You are missing some great experiences! It is good to get stimulated by others, by their problems and their experiences, so that one's thoughts are broadened and start considering new fields. Often, they lead to insights that are helpful for one's own daily work.

    In the opposite direction, we at FromDual hope that during the Linux-Tage we can help others to make (better) use of MySQL in all its variants, with its surrounding tools and products, so that everybody will profit.

    The Chemnitzer Linux-Tage will take place on March 21 and 22, 2015, in Chemnitz, Germany. Parts of the programme will probably be delivered in English, so even those of you who don't understand German might enjoy coming there. If you now consider a visit, you will find further information here:

    We will be happy to meet you in Chemnitz - to talk about whatever interests both you and us, maybe even about you joining us?

    Introducing Myself: Jörg Brühe

    Mon, 2015-01-19 20:56
    For some time already, FromDual's "Our Team" page lists me, and it even reveals that I joined in September, 2014. Also for some time, the list of FromDual blogs contains an entry "Jörg's Blog", but it doesn't lead to any entries. It is high time to fix this and create entries, starting with an introduction of myself.

    Often, in such introductions people use the phrase of "the new kid on the block". I won't. If I am to use those words, I will arrange them as "the kid on the new block". The reason is that I don't feel as a new kid in the MySQL village (or is it a city?), let alone in DBMS country.

    Ever since I left university (Technical University of Berlin, Germany), I have been involved in SQL DBMS development. After my previous product's team had been dissolved in Berlin and maintenance moved to Riga, Latvia, as a cost-cutting measure, I joined MySQL AB in 2004 as a member of the Build Team. Some of you will remember my name from MySQL release announcements, others may have seen it in bug reports, newsgroup entries, or other MySQL communication. Together with the other MySQL colleagues, I got acquired by Sun in 2008 and then by Oracle in 2010.

    In 2012, I decided I wanted a culture change, from a highly (and very strictly) organized corporation to a smaller, more flexible entity. That brought me to an internet site, where I worked as a DBA, concentrating on their (more than 100) MySQL installations. It was very interesting to view MySQL from the DBA side now, I got a big load of new experiences and learned a lot about using, running, and automating MySQL (rather than building and testing it).

    That might have continued, but as usual the world is changing, and my DBA job was no exception: Company strategy was to shift those tasks more to the various application teams and to do without a central point for MySQL. It is a sign how far MySQL knowledge has spread, and how little effort may be sufficient to run MySQL, that this worked. OTOH, this invalidated the assumptions on which my job was based, and I did not really fit the alternatives left in that company.

    In that situation, my contact to FromDual got renewed, and both the company and me felt we should be a good match. Till now, there is no indication we were wrong: I like the work, I feel I'm productive and make customers happy, the tasks are manifold and interesting (so interesting that writing the first blog gets delayed quite long), and I expect this will continue.

    "Ad multos annos!"