What can MySQL performance monitoring graphs tell you?

Thu, 2012-02-16 18:32 — Shinguz

Many of you may monitor their databases for different purposes. Beside alerting it is often good to also make some graphs from MySQL performance counters to see what is actually happening on your database.

The following graphs where made with our FromDual Performance Monitor for MySQL as a Service (MaaS) set-up. If you do not have the time to install a performance monitoring yourself please feel free to contact us for our MaaS solution.

Overview

First of all it is a good idea to have an overview of all the settings in you different databases and if they are compliant to your standards.

Here it looks like two of our databases are still running with Statement Bases Replication (SBR). Further the MySQL General Query Log is enabled which is non optimal for write performance and the default Storage Engine is still set to MyISAM which is not wanted in our case.

InnoDB

This server is mostly an InnoDB Server. We can see that we have some write traffic because the InnoDB Buffer Pool has constantly 15 - 20% of dirty pages. Further we see that we have a very constant level of dirty pages:

Here I guess, the database was restarted on Wednesday:

We can clearly see the positive impact of MySQL partitioning:

Read and write are here more or less equally high with some strong spikes:

If we have a closer look we can see that one typical spike is always at 06:00 in the morning. It is read and write so it is possibly NOT a backup but more likely a reporting or maintenance job:

This system does mostly write with a heavy read phase during midnight:

The read starts at 23:00 and ends at 03:30. It could be some nightly reporting?

Extreme read and write spikes. Not good for a system that likes to have close to real time behavior:

The consequences we can see immediately, Locking:

When we look at the InnoDB Locking we can see that this job at 06:00 in the morning causes some lockings. Up to 2.5 seconds. If this causes troubles we really have to investigate what is running at that time:

If we look at the last 14 days we can see a huge read spike some time ago, what has happened there? This has possibly influenced the whole system as well:

Here we have a special InnoDB read pattern. Can you see it? Every 3 hours at xx:30 it happens. We should try to find out what it is:

From time to time we can see some big transactions:

MyISAM

Currently and over the last few days our MyISAM key buffer was mostly empty. But the high water mark Key_blocks_used indicates that they were used in the past. So we should try to find out if this key buffer is still used and if not if we can free this memory (about 2.4 Gbyte):

From time to time we see some MyISAM key reads from memory. This could be caused by MyISAM table indexes or by temporary tables gone to disk. So not really much MyISAM traffic at all and it is mostly happening during week days:

This is a MyISAM mostly system but nevertheless we do not have MyISAM table locking issues:

Connections

We are not far from hitting max_connections. So either we increase this value or we think about why we have so many connections open concurrently. On February 5^th we had many concurrent running threads. That was surely not good for the whole system performance:

On February 12^th there was possibly something wrong with the application: We see many aborted clients what never had before:

It looks like thread cache was always big enough:

On this server we definitly had a problem and we hit the roof with the amount of connections. I am wondering if it would not be more useful to even lower max_connections here?

Network traffic

Network traffic was growing a bit last week but now is stable again. Our network should not have a problem with this load:

This is a read mostly database. The patter comes from sorts and handler_read_rnd (see below):

Handler Interface

Here we see that we do many handler_read_next and handler_read_rnd_next operations. This is equivalent to index range scans and full table scans. Our spike at February 11^th seems to come from such an index range scan:

This server does mostly read by a full table scan! There is a huge potential for optimization so the server could cover more load if the queries were done more optimal:

It seems there were some UPDATEs involved on 8^th:

We see many INSERTs but few to no DELETEs. This means the database size is steadly growing. We should think about archiving old data?

OK. This seems to be a maintenance job. Luckily it was set off-peak hours (08:00):

Somebody does evil things hear: handler_read_rnd is bad for performance. And it pretty much correlates with the network traffic from above:

Sort

Sort behavior seems to have changed significantly on Monday. We should find out if application has released a new version:

On this server we see some sort_merge_passes which is a sign of a too small sort buffer or just huge sorts:

Queries

The majority of queries sent against this database are SELECT's:

Temporary Tables

The use of temporary tables has changed. This is a sign again that something in the application was modified:

If we look a bit closer we can see that the use of temporary disk tables has increased. We should keep an eye on this:

Here we most probably have a problem with temporary tables on disk. It happens quite periodical and predictable so we can investigate who causes those:

If you look closer at it we can see that it is a similar pattern as with the data reads from above:

MySQL Process information

The number of page faults has changed dramatically over the weekend. So what was changed there?

Query Cache

There are no query cache hits but Query Cache is enabled! There are several reasons for this. We should investigate or disable Query Cache at all:

We have some low memory prunes. Shall we increase Query Cache size?

Hmmm. It is already quite big. Better to defragment it from time to time to get rid of the free memory:

Table Definition Cache

The Table Definition Cache is much to big here. A value of 256 (default) would be far more than enough. Better to release the resources:

This is the opposite case. Here the Table Definition Cache is much too small at least when the job at 23:00 is running:

Table (Open) Cache

Same situation here with the Table Open Cache. A value of 3 - 4k would be enough. Better to release the resources. There are some cases known where a too big table cache causes performance issues:

Same here. Too small at midnight jobs:

Binary logging (Master)

The Binary Log has a size of about 100 Mbyte (Debian/Ubuntu?) and is filled up every hour (ca. 25 kybte/s). During the night we have some less traffic, during day some more traffic: