News

Why you should take care of MySQL data types

Shinguz - Wed, 2016-05-25 11:42

A customer reported last month that MySQL does a full table scan (FTS) if a query filters a VARCHAR column by an INT value. At first I told him that this was no longer true, because MySQL fixed this behaviour a long time ago. He showed me that I was wrong:

CREATE TABLE `test` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `data` varchar(64) DEFAULT NULL,
  `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `data` (`data`)
) ENGINE=InnoDB;

EXPLAIN SELECT * FROM test WHERE data = 42\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: test
   partitions: NULL
         type: ALL
possible_keys: data
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 522500
     filtered: 10.00
        Extra: Using where

EXPLAIN SELECT * FROM test WHERE data = '42'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: test
   partitions: NULL
         type: ref
possible_keys: data
          key: data
      key_len: 67
          ref: const
         rows: 1
     filtered: 100.00
        Extra: NULL

When I executed the query I got some more interesting information:

SELECT * FROM test WHERE data = '42';
Empty set (0.00 sec)

SELECT * FROM test WHERE data = 42;
+--------+----------------------------------+---------------------+
| id     | data                             | ts                  |
+--------+----------------------------------+---------------------+
|   1096 | 42a5cb4a3e76857a3efe7af44ba9f4dd | 2016-05-25 10:26:59 |
...
| 718989 | 42a1921fb2df42126d85f9586532eda4 | 2016-05-25 10:27:12 |
+--------+----------------------------------+---------------------+
767 rows in set, 65535 warnings (0.26 sec)

Looking at the warnings, we also find the reason: MySQL casts the column and not the value, which is a bit odd IMHO:

show warnings;
| Warning | 1292 | Truncated incorrect DOUBLE value: '80f52706c2f9de40472ec29a7f70c992' |
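To illustrate why the INT comparison matches rows that merely start with 42, here is a minimal Python sketch of a string-to-DOUBLE conversion that keeps only the leading numeric prefix. The function name to_double_like_mysql is hypothetical and the regular expression is only an approximation of the server's behaviour, not MySQL's actual implementation. The sample strings are the values shown in the output above.

```python
import re

# Leading numeric prefix: optional sign, digits with optional fraction,
# optional exponent. Everything after the prefix is ignored ("truncated").
_NUMERIC_PREFIX = re.compile(r"\s*[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")


def to_double_like_mysql(text):
    """Approximate an implicit string-to-DOUBLE cast (illustrative assumption)."""
    match = _NUMERIC_PREFIX.match(text)
    return float(match.group(0)) if match else 0.0


# Values taken from the query output and the warning shown above.
rows = [
    "42a5cb4a3e76857a3efe7af44ba9f4dd",
    "42a1921fb2df42126d85f9586532eda4",
    "80f52706c2f9de40472ec29a7f70c992",
]

# Comparing against the number 42 converts every string first, so each row
# has to be inspected and an index on the string column cannot help.
matches = [value for value in rows if to_double_like_mysql(value) == 42]
print(matches)
```

Because the conversion has to be applied to every stored value, each row must be read, which fits the full table scan seen in the execution plan.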

Somewhat suspicious, I looked at the warnings of the query execution plan again:

show warnings;
+---------+------+---------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                     |
+---------+------+---------------------------------------------------------------------------------------------+
| Warning | 1739 | Cannot use ref access on index 'data' due to type or collation conversion on field 'data'   |
| Warning | 1739 | Cannot use range access on index 'data' due to type or collation conversion on field 'data' |
+---------+------+---------------------------------------------------------------------------------------------+

I thought this was fixed, but it seems it is not. The following releases behave like this: MySQL 5.0.96, 5.1.73, 5.5.38, 5.6.25, 5.7.12 and MariaDB 5.5.41, 10.0.21 and 10.1.9.

The other way around, filtering the INT column by a number or by a string, works in both cases:

SELECT * FROM test WHERE id = 42;
+----+----------------------------------+---------------------+
| id | data                             | ts                  |
+----+----------------------------------+---------------------+
| 42 | 81d74057d7be8f20563da404bb1b3ab0 | 2016-05-25 10:26:56 |
+----+----------------------------------+---------------------+

SELECT * FROM test WHERE id = '42';
+----+----------------------------------+---------------------+
| id | data                             | ts                  |
+----+----------------------------------+---------------------+
| 42 | 81d74057d7be8f20563da404bb1b3ab0 | 2016-05-25 10:26:56 |
+----+----------------------------------+---------------------+

EXPLAIN SELECT * FROM test WHERE id = 42\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: test
   partitions: NULL
         type: const
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: const
         rows: 1
     filtered: 100.00
        Extra: NULL

Taxonomy upgrade extras: query tuning, explain, data type, sql

How to become a certified DBA

Cédric Bruderer - Tue, 2016-05-10 10:16

I recently managed to get my certification as MySQL 5.6 DBA, and was asked to write a blog post about it, because I had trouble finding the information I needed.

You may have noticed, too, that Oracle does not really supply much information about the certification. At least there is the MySQL documentation, which contains all the information you need.

I also recommend using a virtual Linux machine in combination with our tool MyEnv. This way you can simulate multiple scenarios, including replication set-ups, and if one or two servers die during your exercises, nobody gets mad at you.

When learning, make sure to have a look at the following topics:

  • Query tuning
  • Parameter tuning
  • MySQL client tools (mysqldump, mysqladmin, ...)
  • MySQL Audit Plugin
  • How to secure MySQL (especially the correct assignment of privileges)
  • How to use the Performance and Information Schema
  • Partitions
  • Replication
  • Backup and Recovery (both the physical and the logical variant)


The certification exam takes 150 minutes and contains 100 questions. At least 60% of your answers have to be correct in order to pass. If you keep a pace of one answer per minute, you will have about 50 minutes left to go over the answers you were not entirely sure about the first time.

MariaDB 10.2 Window Function Examples

Shinguz - Mon, 2016-04-18 22:39

MariaDB 10.2 has introduced some Window Functions for analytical queries.

See also: Window Functions, Window Functions, Window function and Rows and Range, Preceding and Following

Function ROW_NUMBER()

Simulate a row number (sequence) top 3

SELECT ROW_NUMBER() OVER (PARTITION BY NULL ORDER BY category_id) AS num
     , category.category_id
  FROM category
 LIMIT 3
;

or

SELECT ROW_NUMBER() OVER (ORDER BY category_id) AS num
     , category.category_id
  FROM category
 LIMIT 3
;

+-----+-------------+
| num | category_id |
+-----+-------------+
|   1 | ACTUAL      |
|   2 | ADJUSTMENT  |
|   3 | BUDGET      |
+-----+-------------+

ROW_NUMBER() per PARTITION

SELECT ROW_NUMBER() OVER (PARTITION BY store_type ORDER BY SUM(sf.store_sales) DESC) AS Nbr
     , s.store_type AS "Store Type", s.store_city AS City, SUM(sf.store_sales) AS Sales
  FROM store AS s
  JOIN sales_fact AS sf ON sf.store_id = s.store_id
 GROUP BY s.store_type, s.store_city
 ORDER BY s.store_type, Nbr
;

+-----+---------------------+---------------+------------+
| Nbr | Store Type          | City          | Sales      |
+-----+---------------------+---------------+------------+
|   1 | Deluxe Supermarket  | Salem         | 1091274.68 |
|   2 | Deluxe Supermarket  | Tacoma        |  993823.44 |
|   3 | Deluxe Supermarket  | Hidalgo       |  557076.84 |
|   4 | Deluxe Supermarket  | Merida        |  548297.64 |
|   5 | Deluxe Supermarket  | Vancouver     |  534180.96 |
|   6 | Deluxe Supermarket  | San Andres    |  518044.80 |
|   1 | Gourmet Supermarket | Beverly Hills |  619013.24 |
|   2 | Gourmet Supermarket | Camacho       |  357772.88 |
|   1 | Mid-Size Grocery    | Yakima        |  304590.92 |
|   2 | Mid-Size Grocery    | Mexico City   |  166503.48 |
|   3 | Mid-Size Grocery    | Victoria      |  144827.48 |
|   4 | Mid-Size Grocery    | Hidalgo       |  144272.84 |
+-----+---------------------+---------------+------------+

Function RANK()

Ranking of top 10 salaries

SELECT full_name AS Name, salary AS Salary
     , RANK() OVER(ORDER BY salary DESC) AS Rank
  FROM employee
 ORDER BY salary DESC
 LIMIT 10
;

+-----------------+----------+------+
| Name            | Salary   | Rank |
+-----------------+----------+------+
| Sheri Nowmer    | 80000.00 |    1 |
| Darren Stanz    | 50000.00 |    2 |
| Donna Arnold    | 45000.00 |    3 |
| Derrick Whelply | 40000.00 |    4 |
| Michael Spence  | 40000.00 |    4 |
| Maya Gutierrez  | 35000.00 |    6 |
| Pedro Castillo  | 35000.00 |    6 |
| Laurie Borges   | 35000.00 |    6 |
| Beverly Baker   | 30000.00 |    9 |
| Roberta Damstra | 25000.00 |   10 |
+-----------------+----------+------+

Function DENSE_RANK()

SELECT full_name AS Name, salary AS Salary
     , DENSE_RANK() OVER(ORDER BY salary DESC) AS Rank
  FROM employee
 ORDER BY salary DESC
 LIMIT 10
;

+-----------------+----------+------+
| Name            | Salary   | Rank |
+-----------------+----------+------+
| Sheri Nowmer    | 80000.00 |    1 |
| Darren Stanz    | 50000.00 |    2 |
| Donna Arnold    | 45000.00 |    3 |
| Derrick Whelply | 40000.00 |    4 |
| Michael Spence  | 40000.00 |    4 |
| Maya Gutierrez  | 35000.00 |    5 |
| Pedro Castillo  | 35000.00 |    5 |
| Laurie Borges   | 35000.00 |    5 |
| Beverly Baker   | 30000.00 |    6 |
| Roberta Damstra | 25000.00 |    7 |
+-----------------+----------+------+

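The difference between RANK() and DENSE_RANK() can be sketched in a few lines of Python. This is only an illustration of the semantics, not how MariaDB computes them; the helper names rank and dense_rank are made up for this sketch, and the salaries are the ten values from the tables above.

```python
def rank(values):
    """RANK() for ORDER BY value DESC: 1 + number of rows that sort strictly before."""
    return [1 + sum(other > value for other in values) for value in values]


def dense_rank(values):
    """DENSE_RANK() for ORDER BY value DESC: 1 + number of distinct larger values."""
    distinct = sorted(set(values), reverse=True)
    return [distinct.index(value) + 1 for value in values]


salaries = [80000, 50000, 45000, 40000, 40000, 35000, 35000, 35000, 30000, 25000]

print(rank(salaries))        # [1, 2, 3, 4, 4, 6, 6, 6, 9, 10]
print(dense_rank(salaries))  # [1, 2, 3, 4, 4, 5, 5, 5, 6, 7]
```

RANK() leaves gaps after ties (4, 4, 6), while DENSE_RANK() continues with the next number (4, 4, 5).
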
Aggregation Windows

SELECT full_name AS Name, salary AS Salary
     , SUM(salary) OVER(ORDER BY salary DESC) AS "Sum sal"
  FROM employee
 ORDER BY salary DESC
 LIMIT 10
;

+-----------------+----------+-----------+
| Name            | Salary   | Sum sal   |
+-----------------+----------+-----------+
| Sheri Nowmer    | 80000.00 |  80000.00 |
| Darren Stanz    | 50000.00 | 130000.00 |
| Donna Arnold    | 45000.00 | 175000.00 |
| Derrick Whelply | 40000.00 | 255000.00 |
| Michael Spence  | 40000.00 | 255000.00 |
| Laurie Borges   | 35000.00 | 360000.00 |
| Maya Gutierrez  | 35000.00 | 360000.00 |
| Pedro Castillo  | 35000.00 | 360000.00 |
| Beverly Baker   | 30000.00 | 390000.00 |
| Roberta Damstra | 25000.00 | 415000.00 |
+-----------------+----------+-----------+

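Note that rows with equal salaries share the same running total. With an ORDER BY and no explicit frame, the window covers all rows up to and including the peers of the current row. A small Python sketch of that behaviour follows; the function name running_sum_with_peers is hypothetical, and the salaries are those of the table above.

```python
def running_sum_with_peers(values):
    """SUM(x) OVER (ORDER BY x DESC) with the default frame: each row's total
    includes every row that sorts at or before it, so ties share one total."""
    return [sum(other for other in values if other >= value) for value in values]


salaries = [80000, 50000, 45000, 40000, 40000, 35000, 35000, 35000, 30000, 25000]

print(running_sum_with_peers(salaries))
# [80000, 130000, 175000, 255000, 255000, 360000, 360000, 360000, 390000, 415000]
```
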
Function CUME_DIST() and PERCENT_RANK()

SELECT s.store_state AS State, s.store_city AS City, SUM(e.salary) AS Salary
     , CUME_DIST() OVER (PARTITION BY State ORDER BY Salary) AS CumeDist
     , PERCENT_RANK() OVER (PARTITION BY State ORDER BY Salary) AS PctRank
  FROM employee AS e
  JOIN store AS s on s.store_id = e.store_id
 WHERE s.store_country = 'USA'
 GROUP BY s.store_name
 ORDER BY s.store_state, Salary DESC
;

+-------+---------------+-----------+--------------+--------------+
| State | City          | Salary    | CumeDist     | PctRank      |
+-------+---------------+-----------+--------------+--------------+
| CA    | Alameda       | 537000.00 | 1.0000000000 | 1.0000000000 |
| CA    | Los Angeles   | 221200.00 | 0.8000000000 | 0.7500000000 |
| CA    | San Diego     | 220200.00 | 0.6000000000 | 0.5000000000 |
| CA    | Beverly Hills | 191800.00 | 0.4000000000 | 0.2500000000 |
| CA    | San Francisco |  30520.00 | 0.2000000000 | 0.0000000000 |
| OR    | Salem         | 260220.00 | 1.0000000000 | 1.0000000000 |
| OR    | Portland      | 221200.00 | 0.5000000000 | 0.0000000000 |
| WA    | Tacoma        | 260220.00 | 1.0000000000 | 1.0000000000 |
| WA    | Spokane       | 223200.00 | 0.8571428571 | 0.8333333333 |
| WA    | Bremerton     | 221200.00 | 0.7142857143 | 0.6666666667 |
| WA    | Seattle       | 220200.00 | 0.5714285714 | 0.5000000000 |
| WA    | Yakima        |  74060.00 | 0.4285714286 | 0.3333333333 |
| WA    | Bellingham    |  23220.00 | 0.2857142857 | 0.1666666667 |
| WA    | Walla Walla   |  21320.00 | 0.1428571429 | 0.0000000000 |
+-------+---------------+-----------+--------------+--------------+

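The two columns follow simple formulas within each partition of N rows: CUME_DIST is the share of rows with a value less than or equal to the current one, and PERCENT_RANK is (rank - 1) / (N - 1). The following Python sketch, with hypothetical helper names, reproduces the California rows from the table above.

```python
def cume_dist(values):
    """Share of rows in the partition whose value is <= the current value."""
    n = len(values)
    return [sum(other <= value for other in values) / n for value in values]


def percent_rank(values):
    """(rank - 1) / (N - 1) with ascending rank; 0 for a single-row partition."""
    n = len(values)
    if n == 1:
        return [0.0]
    return [sum(other < value for other in values) / (n - 1) for value in values]


# Salary sums of the five CA stores, in the order shown in the table.
california = [537000, 221200, 220200, 191800, 30520]

print(cume_dist(california))     # [1.0, 0.8, 0.6, 0.4, 0.2]
print(percent_rank(california))  # [1.0, 0.75, 0.5, 0.25, 0.0]
```
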
Function NTILE()

SELECT promotion_name, media_type
     , TO_DAYS(end_date)-TO_DAYS(start_date) AS Duration
     , NTILE(4) OVER (PARTITION BY promotion_name ORDER BY DURATION) AS quartile
     , NTILE(5) OVER (PARTITION BY promotion_name ORDER BY DURATION) AS quintile
     , NTILE(100) OVER (PARTITION BY promotion_name ORDER BY DURATION) AS percentile
  FROM promotion
 WHERE promotion_name = 'Weekend Markdown'
 LIMIT 10
;

+------------------+-------------------------+----------+----------+----------+------------+
| promotion_name   | media_type              | Duration | quartile | quintile | percentile |
+------------------+-------------------------+----------+----------+----------+------------+
| Weekend Markdown | In-Store Coupon         |        2 |        1 |        1 |          9 |
| Weekend Markdown | Daily Paper             |        3 |        3 |        4 |         29 |
| Weekend Markdown | Radio                   |        3 |        4 |        4 |         36 |
| Weekend Markdown | Daily Paper, Radio      |        2 |        2 |        2 |         13 |
| Weekend Markdown | Daily Paper, Radio, TV  |        2 |        2 |        3 |         20 |
| Weekend Markdown | TV                      |        2 |        3 |        3 |         26 |
| Weekend Markdown | Sunday Paper            |        3 |        3 |        4 |         28 |
| Weekend Markdown | Daily Paper, Radio, TV  |        3 |        3 |        4 |         34 |
| Weekend Markdown | Daily Paper             |        2 |        1 |        2 |         10 |
| Weekend Markdown | Street Handout          |        2 |        2 |        2 |         18 |
| Weekend Markdown | Bulk Mail               |        3 |        4 |        5 |         37 |
| Weekend Markdown | Cash Register Handout   |        2 |        2 |        2 |         14 |
| Weekend Markdown | Daily Paper, Radio, TV  |        3 |        3 |        4 |         31 |
| Weekend Markdown | Sunday Paper            |        2 |        3 |        3 |         27 |
| Weekend Markdown | Sunday Paper, Radio, TV |        1 |        1 |        1 |          4 |
+------------------+-------------------------+----------+----------+----------+------------+

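NTILE(n) splits the ordered rows of a partition into n buckets of nearly equal size, with the first buckets receiving one extra row when the row count is not divisible by n. A minimal Python sketch of that distribution follows; the function name ntile and the row count of 10 are illustrative assumptions, not taken from the promotion table.

```python
def ntile(row_count, buckets):
    """Return the 1-based bucket number for each of row_count ordered rows."""
    base, extra = divmod(row_count, buckets)
    result = []
    for bucket in range(1, buckets + 1):
        # The first `extra` buckets hold one row more than the others.
        size = base + (1 if bucket <= extra else 0)
        result.extend([bucket] * size)
    return result


print(ntile(10, 4))  # [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
```
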
Taxonomy upgrade extras: mariadb, dwh, reporting, Analytics, Window Function, OLAP, Data Mart

Define preferred SST donor for Galera Cluster

Cédric Bruderer - Fri, 2016-04-15 18:00

One of our customers recently ran into a problem: he wanted to have a preferred donor for SST whenever a node came up. However, the node did not come up when the preferred donor was not running.

In the documentation, you can find the parameter wsrep_sst_donor, which makes the specified node the preferred SST donor. This is great, as long as the donor is actually running.

The problem can be fixed by adding a comma to the end of the value of wsrep_sst_donor, which looks like this:

wsrep_sst_donor="galera2,"

Note the comma at the end of the value. This trailing comma tells the node that galera2 is the preferred donor; if galera2 is not available, any other available node will be used as donor.

You can also specify a secondary node. In that case, one of the listed nodes must be available for the node to come up:

wsrep_sst_donor="galera2,galera1"

In this case, galera1 will be used as secondary donor if galera2 is not available. If neither is available, the node will refuse to come up.
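The selection rules described above can be sketched as a small Python function. This is only a model of the behaviour described in this post, not Galera's actual code; the function name choose_donor and the node name galera3 are hypothetical.

```python
def choose_donor(wsrep_sst_donor, available_nodes):
    """Return the donor a joining node would use, or None if it cannot join."""
    names = wsrep_sst_donor.split(",")
    # A trailing comma leaves an empty last entry: any node may serve as fallback.
    allow_any = names[-1] == ""
    for name in names:
        if name and name in available_nodes:
            return name  # first listed donor that is currently running
    if allow_any and available_nodes:
        # Arbitrary but deterministic pick for this sketch; the real cluster decides itself.
        return sorted(available_nodes)[0]
    return None


print(choose_donor("galera2,", {"galera1", "galera3"}))   # galera1
print(choose_donor("galera2,galera1", {"galera3"}))       # None
```

The first call falls back to another running node because of the trailing comma; the second returns None because neither listed donor is available.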

Taxonomy upgrade extras: Galera Cluster

Past and Future Conferences, and Talks Around MySQL

Jörg Brühe - Mon, 2016-04-11 15:25

Time flies, and my blogging frequency is quite low. More frequent would be better, but knowing myself I'd rather not promise anything ;-)

Still, it is appropriate to write some notes about CeBIT, the "Chemnitzer Linuxtage 2016", and future events.

CeBIT

CeBIT ran from March 14 to 18 (Monday to Friday) in Hannover, Germany, and I will leave the general assessment to the various marketing departments as well as to the regular visitors (to which I do not belong).

In order to meet our current customers as well as potential future ones, FromDual had a booth in the "Open Source Forum". We displayed a Galera Cluster, running on three tiny headless single-board Linux machines, and showed how it reacts to node failures and then recovers all by itself, without any administrator intervention. Many of our visitors were fascinated, because an HA solution would be a good fit in their solution architecture. We had brought several stuffed "Sakila" dolphins, the traditional MySQL symbol, and all of them found new homes (typically with the words "for my grandchild"). :-)

IMHO, the "Open Source Forum" deserved more visitors than it actually got. Placing it in one hall with document management systems was not a good fit; research and development might have been more appropriate.

The forum had an area for talks, which ran on all five days. I consider John "Maddog" Hall (who had provided an Alpha machine to Linus Torvalds decades ago) and Prof. Klaus Knopper (who maintains the "Knoppix" live distribution) the most prominent speakers. FromDual's Oli Sennhauser talked about the new features of MySQL 5.7; you can get the slides via the FromDual download page.

Chemnitzer Linux-Tage

The weekend following CeBIT, March 19 and 20, had been selected for the Chemnitzer Linux-Tage. Like in the previous years, the conference attracted many visitors from all over Germany as well as from some neighbouring countries, and both John Hall and Klaus Knopper had come there directly from Hannover - like me and several others.

As usual, the conference programme covered all aspects of Linux; the headline was "It is your project!". Databases are definitely not one of the major topics there. It is more about overall trends, distributions, communication, and many other aspects.

I delivered a talk about "RPM conventions - a Modern Tower of Babel", and it was well received. I used various MySQL RPMs (from MySQL AB, Oracle, or RedHat) as examples to show different opinions about packaging, dependencies, installation actions, and compatibility issues, which partly originate from the diverging positions of software developers and distributors. MySQL served as a well-known example (one that will interest my readers here), but most of the items apply to almost any software. Again, the slides (in German) are available on the FromDual web site. Your comments are welcome!

Open Source Data Center Conference

We all know that Open Source has become a major force in computing, so it is no surprise to have it as the subject for various conferences.

One of them is the "Open Source Data Center Conference" (OSDC), to be held in Berlin on April 26 - 28. Open Source database systems are one of the topics, and the programme committee accepted my talk "MySQL-Server in Teamwork - Replication and Galera Cluster". After the conference, I will upload the slides to the FromDual site and make them available for download.

Now, having told you all this, I will turn to customer issues again ...

Enjoy!

Galera Cache sizing

Shinguz - Mon, 2016-04-04 22:03

To synchronize the data between the Galera Cluster and a new or re-entering Galera node, Galera Cluster uses 2 different mechanisms:

  • For full synchronization of data: Snapshot State Transfer (SST).
  • For delta synchronization of data: Incremental State Transfer (IST).

The Incremental State Transfer (IST) is relevant when a node is already known to the Galera Cluster and left the cluster only a short time ago. This typically happens in a maintenance window during a rolling cluster restart.

The Galera Cache is a round-robin, file-based cache that keeps all the write-sets (= transactions + meta data) for a certain amount of time. This time, which should be longer than your planned maintenance window, depends on the size of the Galera Cache (default 128 Mbyte) and the traffic that occurs during your maintenance window.

If your traffic is bigger than what the Galera Cache can keep, Galera Cluster will fall back from IST to SST, which is a very expensive operation for big databases.

The size of the Galera Cache can be calculated from the delta of the sum of the following 2 Galera status variables before and after the maintenance window:

Galera Cache size = delta(wsrep_replicated_bytes + wsrep_received_bytes)

Ideally, you determine these values before your change happens, in a time window where you have roughly the same traffic as during your maintenance window.
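As a minimal sketch of this calculation, assume you have captured both status variables at the start and at the end of a representative time window. The function name wsrep_traffic_bytes and the sample numbers are hypothetical; only the two status variable names and the 128 Mbyte default come from the text above.

```python
def wsrep_traffic_bytes(before, after):
    """Delta of (wsrep_replicated_bytes + wsrep_received_bytes) between two samples."""
    keys = ("wsrep_replicated_bytes", "wsrep_received_bytes")
    return sum(int(after[key]) - int(before[key]) for key in keys)


# Hypothetical samples, e.g. taken from SHOW GLOBAL STATUS LIKE 'wsrep_re%'.
before = {"wsrep_replicated_bytes": 1_000_000_000, "wsrep_received_bytes": 2_000_000_000}
after = {"wsrep_replicated_bytes": 1_150_000_000, "wsrep_received_bytes": 2_250_000_000}

required = wsrep_traffic_bytes(before, after)
default_cache = 128 * 1024 * 1024  # default Galera Cache size of 128 Mbyte

print(required)                  # 400000000
print(required > default_cache)  # True: the default cache would be too small here
```

In this made-up case the window produced about 400 Mbyte of write-set traffic, so a node that was away for the whole window would need SST unless the Galera Cache is sized above that value.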

If you do not have a Galera Cluster in place yet, or if you do not have those values available, you can also use the amount of traffic written to the binary log or the amount of traffic written to the InnoDB transaction log (Innodb_os_log_written).

As a rough estimate, we have determined the following formulas for you:

Binary Log Traffic x 1.3 = Wsrep traffic (+/- 10%)

or

InnoDB Log File traffic x 0.6 = Wsrep traffic (+/- 10%)
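
The two rules of thumb above can be written as a small helper that also returns the +/- 10% band. The function name estimate_wsrep_traffic and the input of 100 Mbyte are illustrative assumptions; the factors 1.3 and 0.6 are the ones given above.

```python
BINLOG_FACTOR = 1.3      # Binary Log traffic x 1.3
INNODB_LOG_FACTOR = 0.6  # InnoDB Log File traffic x 0.6


def estimate_wsrep_traffic(source_bytes, factor, tolerance=0.10):
    """Return (low, estimate, high) in bytes for the expected wsrep traffic."""
    estimate = source_bytes * factor
    return (
        round(estimate * (1 - tolerance)),
        round(estimate),
        round(estimate * (1 + tolerance)),
    )


# Hypothetical measurement: 100 Mbyte of binary log written during the window.
low, estimate, high = estimate_wsrep_traffic(100_000_000, BINLOG_FACTOR)
print(low, estimate, high)
```

Sizing the Galera Cache towards the upper end of the band leaves some headroom, since the factors are only rough estimates.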
Taxonomy upgrade extras: Galera Cluster, cache, sizing

FromDual Nagios and Icinga plug-ins v1.0.1 for MySQL/MariaDB released

FromDual.en - Tue, 2016-02-23 18:27

FromDual is pleased to announce the release of the new version 1.0.1 of the FromDual Nagios and Icinga plug-ins for MySQL, Galera Cluster and MariaDB.

Download

The new FromDual Nagios plug-ins can be downloaded here.

In the inconceivable case that you find a bug in the FromDual Nagios plug-ins, please report it to our bug tracker.

Any feedback, statements and testimonials are welcome as well! Please send them to feedback@fromdual.com.

Changes in Nagios plug-ins 1.0.1
  • Help adapted to new 5.7 conventions for creating user in check_repl_mysql_heartbeat.pl
  • Output can be formatted for Centreon, Icinga and Nagios now in perf_mysql.pl
  • Support for Galera Cluster implemented in perf_mysql.pl
Taxonomy upgrade extras: nagios, icinga, plug-in, mysql, mariadb, percona server, Galera Cluster, monitoring, Operations, release

FromDual Nagios and Icinga plug-ins v1.0.1 for MySQL/MariaDB released

FromDual.en - Tue, 2016-02-23 18:27

FromDual has the pleasure to announce the release of the new version 1.0.1 of the FromDual Nagios and Icinga plug-ins for MySQL, Galera Cluster and MariaDB.

Download

The new FromDual Nagios plug-ins can be downloaded here.

In the inconceivable case that you find a bug in the FromDual Nagios plug-ins please report it to our bug tracker.

Any feedback, statements and testimonials are welcome as well! Please send them to feedback@fromdual.com.

Changes in Nagios plug-ins 1.0.1
  • Help adapted to new 5.7 conventions for creating user in check_repl_mysql_heartbeat.pl
  • Output can be formatted for Centreon, Icinga and Nagios now in perf_mysql.pl
  • Support for Galera Cluster implemented in perf_mysql.pl
Taxonomy upgrade extras: nagiosicingaplug-inmysqlmariadbpercona serverGalera ClustermonitoringOperationsrelease

FromDual Nagios and Icinga plug-ins v1.0.1 for MySQL/MariaDB released

FromDual.en - Tue, 2016-02-23 18:27

FromDual has the pleasure to announce the release of the new version 1.0.1 of the FromDual Nagios and Icinga plug-ins for MySQL, Galera Cluster and MariaDB.

Download

The new FromDual Nagios plug-ins can be downloaded here.

In the inconceivable case that you find a bug in the FromDual Nagios plug-ins please report it to our bug tracker.

Any feedback, statements and testimonials are welcome as well! Please send them to feedback@fromdual.com.

Changes in Nagios plug-ins 1.0.1
  • Help adapted to new 5.7 conventions for creating user in check_repl_mysql_heartbeat.pl
  • Output can be formatted for Centreon, Icinga and Nagios now in perf_mysql.pl
  • Support for Galera Cluster implemented in perf_mysql.pl
Taxonomy upgrade extras: nagiosicingaplug-inmysqlmariadbpercona serverGalera ClustermonitoringOperationsrelease

On Files, the Space They Need, and the Space They Take

Jörg Brühe - Tue, 2016-02-09 14:55

or

xfs Users, Take Care!

Recently, we had a customer ask: Why do many files holding my data take up vastly more space than their size is? That question may sound weird to you, but it is for real, and the customer's observation was correct. For a start, let's make sure we are using the same terms.

  • The size of a file is the number of bytes it will deliver if it is read sequentially from start to end.
  • The space it takes up is the sum of all disk pages which are used to hold the file's data, or to locate those data pages ("indirect" blocks in Unix/Linux terminology).

Every Unix/Linux admin knows (or at least should know) that a file may take up less disk space than its size is. This happens when not all bytes of the file were really written, but the write pointer was advanced via "seek()", leaving a gap. Disk pages which are completely contained in such a gap will not be written, and reading these positions will produce bytes containing zero. This is called a "sparse file". You will find some remarks about them in our blog at https://fromdual.com/mysql-cluster-sparse-files, or search the net for that term.
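
The difference between the two terms can be made visible with a few lines of Python. This is only a sketch: it assumes a Linux system, where st_blocks is counted in 512-byte units, and a file system that supports sparse files; the file name is arbitrary.

```python
# Sketch: create a sparse file and compare its size with the space it takes.
# Assumes Linux, where st_blocks is counted in 512-byte units.
import os
import tempfile


def size_and_space(path):
    """Return (size in bytes, allocated bytes) for the given file."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sparse_demo")
    with open(path, "wb") as f:
        f.write(b"start")
        f.seek(100 * 1024 * 1024)  # advance the write pointer, leaving a gap
        f.write(b"end")
    size, space = size_and_space(path)

print(f"size:  {size} bytes")
print(f"space: {space} bytes")
```

On a file system with sparse-file support, the reported space should be far below the size, because the pages inside the gap are never written.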

The Customer's Message

Now that we have brought those basics into active memory again, let's return to the original question: Can there be files which take up vastly more space than their size is? We will not consider potential administrative overhead (pointers to pages), because to the customer a file of slightly more than 4 GB was reported to take up 8.1 GB disk space - see this quote from his mail (file name changed):

# ls -l some_table#P#p01.ibd
-rw-rw---- 1 mysql mysql 4307550208 Jan 4 01:06 some_table#P#p01.ibd
# du -hs some_table#P#p01.ibd
8,1G some_table#P#p01.ibd

Luckily, the customer's mail mentioned the file system: It was not one of the "ext" family (ext2, ext3, or ext4), but rather xfs. This gave me a hint to search for information, and Google provided several pointers. IMO, the most helpful ones were these:

http://xfs.org/index.php/XFS_FAQ
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=055388a3188f56676c21e92962fc366ac8b5cb72

Both these texts describe that the Linux kernel includes a tuning function for xfs file systems, which is to pre-allocate pages at the end of a growing file. Originally, the amount was small (64 kB), but then the size was made a function of the file size - the larger the file, the more pages were pre-allocated. Hence, this is now called "dynamic speculative EOF preallocation". It is based on the assumption that the file will continue to grow, and these pre-allocated pages are adjacent, so the performance of later file use (especially reads) will be improved. To not waste disk space permanently, such pre-allocated pages will be cut from the file when it is closed.

Measuring File Size and Space Taken

To see this behavior in practice, I wrote a little shell script that lets a file grow in increments of 160 kB (= ten InnoDB pages of default size) without closing it. (You can find it attached.) In parallel, I checked the size ("ls -l --block-size=K") and the space allocated ("du -k"). With this script, I could easily observe the effect:

Test './try-xfs-prealloc' is running on TTY 'pts/10'.

Linux trift-6core 3.13.0-74-generic #118-Ubuntu SMP Thu Dec 17 22:52:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
/dev/mapper/vg1000-XFS_try on /XFS_try type xfs (rw)
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/mapper/vg1000-XFS_try 52403200 33504 52369696 1% /XFS_try
...
-rw-rw-r-- 1 joerg joerg 480K Feb 4 16:56 /XFS_try/somedir/bigfile
960 /XFS_try/somedir/bigfile
...
-rw-rw-r-- 1 joerg joerg 960K Feb 4 16:57 /XFS_try/somedir/bigfile
960 /XFS_try/somedir/bigfile
...
-rw-rw-r-- 1 joerg joerg 1440K Feb 4 16:57 /XFS_try/somedir/bigfile
1984 /XFS_try/somedir/bigfile
...
-rw-rw-r-- 1 joerg joerg 1920K Feb 4 16:57 /XFS_try/somedir/bigfile
1984 /XFS_try/somedir/bigfile
..
-rw-rw-r-- 1 joerg joerg 2240K Feb 4 16:57 /XFS_try/somedir/bigfile
4032 /XFS_try/somedir/bigfile

(( several lines not quoted ))

-rw-rw-r-- 1 joerg joerg 11200K Feb 4 17:02 /XFS_try/somedir/bigfile
16320 /XFS_try/somedir/bigfile
...
-rw-rw-r-- 1 joerg joerg 11680K Feb 4 17:02 /XFS_try/somedir/bigfile
16320 /XFS_try/somedir/bigfile
..
=====
No further writes, status:
-rw-rw-r-- 1 joerg joerg 12288000 Feb 4 17:02 /XFS_try/somedir/bigfile
16320 /XFS_try/somedir/bigfile

Writer process killed, status:
-rw-rw-r-- 1 joerg joerg 12288000 Feb 4 17:02 /XFS_try/somedir/bigfile
12000 /XFS_try/somedir/bigfile
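
A measurement loop of this kind could look roughly as follows. This is not the attached shell script but a Python sketch of the same idea, with an arbitrary file name and only a handful of steps:

```python
# Sketch of a measurement loop (not the attached script): append 160 kB
# chunks to a file that stays open and record size vs. allocated space.
# Assumes Linux, where st_blocks is counted in 512-byte units.
import os
import tempfile

CHUNK = 160 * 1024  # ten InnoDB pages of the default 16 kB size


def grow_and_measure(path, steps):
    """Append `steps` chunks without closing; return [(size, allocated), ...]."""
    samples = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        for _ in range(steps):
            os.write(fd, b"\0" * CHUNK)
            st = os.fstat(fd)
            samples.append((st.st_size, st.st_blocks * 512))
    finally:
        os.close(fd)  # closing is what lets xfs trim the preallocation
    return samples


with tempfile.TemporaryDirectory() as tmpdir:
    samples = grow_and_measure(os.path.join(tmpdir, "bigfile"), steps=5)

for size, allocated in samples:
    print(f"size {size // 1024:6d}K  allocated {allocated // 1024:6d}K")
```

On a file system without speculative preallocation the two columns should stay close together; the output quoted above suggests that on xfs with default settings the allocated column runs ahead of the size.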

While dynamic preallocation is a good idea for most files, it fails badly on MySQL data files: The MySQL server will not close them, in general, even when they won't grow any further (like a table partitioned by date). So this is what the customer detected: A table partition which had grown somewhat beyond 4 GB had got pages for another 4 GB pre-allocated, they were not released, and this happened for many files. Those of you who have ample disk space may say "who cares?", but there are others who have to care. For them, this feature has risky consequences, so they should try to prevent them.
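
To find out whether a system is affected, one can look for files whose allocated space is far above their size. The sketch below does that for a directory tree; the 64 MiB threshold and the demo file are arbitrary assumptions, and on a real server the directory would be the MySQL datadir.

```python
# Sketch: list files whose allocated space exceeds their size by more than
# a threshold. Threshold and demo file are illustrative assumptions.
# Assumes Linux, where st_blocks is counted in 512-byte units.
import os
import tempfile


def overallocated_files(directory, min_excess_bytes=64 * 1024 * 1024):
    """Return [(path, size, allocated), ...] sorted by excess, largest first."""
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is not accessible
            allocated = st.st_blocks * 512
            if allocated - st.st_size > min_excess_bytes:
                found.append((path, st.st_size, allocated))
    found.sort(key=lambda item: item[2] - item[1], reverse=True)
    return found


# Demo on a throw-away directory; on a server, pass the datadir instead.
with tempfile.TemporaryDirectory() as tmpdir:
    with open(os.path.join(tmpdir, "small.ibd"), "wb") as f:
        f.write(b"x" * 1000)
    report = overallocated_files(tmpdir)

for path, size, allocated in report:
    print(f"{allocated - size:>14d} bytes extra  {path}")
```

Files reported here are candidates for the preallocation effect, but other causes of extra allocation are possible, so the list is a starting point rather than a diagnosis.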

Avoiding The Unlimited Growth

Basically, the only way out is to use the "allocsize" mount option, as described in the FAQ. It replaces the dynamic amount with a fixed one. InnoDB reads 64 pages of 16 kB at most, that is 1 MB in one go, so "allocsize=1M" might be best.

Like the customer, many DBAs or SysAdmins may not be aware of that behaviour and might detect it only on the running system. Of course, the first question will be: "Can I fix that without downtime?" Immediately, a "mount -o remount" comes to mind, so I tried that: While my test script was running, I issued:
sudo mount -o remount,allocsize=1M /XFS_try

Sadly, I must tell you it had no effect: The size of the pre-allocated space continued to grow, like in the original run. Even worse, this command also did not have any effect on a run I started after issuing it.

This proves that the value of "allocsize" cannot be changed for a mounted XFS file system, rather its value at mount time remains effective until the unmount. Only when I unmounted it and then mounted it anew, giving "allocsize=1M", did I see the fixed size as pre-allocation amount. From the DBA point of view, it means that a shutdown of the MySQL instance cannot be avoided for this change. (Of course, if we talk about a Galera cluster, the system remains available, because the nodes can be handled one at a time.)
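
Before and after such a change it can help to check which options the kernel lists for the mount point. The following sketch parses text in /proc/mounts format; it assumes that a non-default allocsize shows up among the options of an xfs mount, and the sample line is hypothetical. The post does not report what is listed after a failed remount attempt, so the measured preallocation remains the decisive check.

```python
# Sketch: look up the mount options listed for a mount point by parsing
# text in /proc/mounts format. The sample line below is hypothetical.


def mount_options(mounts_text, mount_point):
    """Return the option list of the last entry mounted at mount_point."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            options = fields[3].split(",")
    return options


def allocsize_of(options):
    """Return the value of an allocsize= option, or None if absent."""
    for opt in options or []:
        if opt.startswith("allocsize="):
            return opt.split("=", 1)[1]
    return None


sample = "/dev/mapper/vg1000-XFS_try /XFS_try xfs rw,relatime,allocsize=1024k 0 0\n"
opts = mount_options(sample, "/XFS_try")
print(allocsize_of(opts))
```

On a live system the text would come from reading /proc/mounts instead of the sample string.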

Can You Get It Without Shutdown?

Now what if you really need to avoid a shutdown, but also need to get back the pre-allocated space urgently? As written above, this will happen only when the file is closed. So the question is: How can the DBA let the MySQL server close a table data file without interrupting the service? There seems to be a chance: the "flush tables" statement. The manual says:

FLUSH TABLES
Closes all open tables, forces all tables in use to be closed, and flushes the query cache. ...

FLUSH TABLES tbl_name [, tbl_name] ...
With a list of one or more comma-separated table names, this statement is like FLUSH TABLES with no names except that the server flushes only the named tables. ...
http://dev.mysql.com/doc/refman/5.6/en/flush.html

The text is identical for versions from 5.1 to 5.7.

But then, see the user comment by Simon Mudd on that page: No effect for InnoDB (currently, the top user comment on that "flush syntax" manual page, in all versions from 5.1 to 5.7).

To check this, I wrote a script that inserts rows into an InnoDB table, then let it run: The effect of preallocation is clearly visible. However, sometimes the space used may suddenly go down to the file size, then go up again. My impression is that XFS reacts differently to a plain file and to an InnoDB table, because a plain file grows sequentially at its end only, while an InnoDB table also has writes to other blocks during its growth. At the end of the insert run, "ls -l" and "du" might show a big preallocation, but not in all runs.

I issued a "flush tables test.let_me_grow" command, but the difference remained. To really be sure, I repeated the experiment with the daemon "mysqld" running under "strace" control: No, there was no "close()" logged from the "flush table" command.

I had the opportunity to discuss it with a MySQL developer: Yes, that is correct, and it is intentional. InnoDB relies heavily on background threads, and they do not want to add the complexity of syncing these tasks with a "flush table" command. So there is no command that would guarantee the release of preallocated space.

I have filed that as a bug: http://bugs.mysql.com/80319

Conclusion

While xfs is a good file system for databases, the "dynamic speculative EOF preallocation" is a feature to be aware of, and you may want to limit its amount so that you don't have too much wasted space on your disk(s). Use the "allocsize=" mount option, and remember that it needs to be set before the mount.

Take care!

Attachment: Shell script to show XFS preallocation (1.92 KB)

FOSDEM 2016 - MySQL slides about PERFORMANCE_SCHEMA available

FromDual.en - Wed, 2016-02-03 21:56

FOSDEM 2016 in Brussels (Belgium) on January 29/30 is over; it was very interesting and IMHO a big success.

For all those who could not attend FOSDEM 2016, our presentation slides about PERFORMANCE_SCHEMA and sys schema are available here:

PERFORMANCE_SCHEMA and sys schema - What can we do with it? (PDF, 406 kbyte)
