
We build a data warehouse from the General Query Log

The design of a data warehouse differs from relational design. Data warehouses are often designed according to the concept of the star schema.

When building a data warehouse, you usually work backwards, starting from the questions you want answered:

  • What questions should my data warehouse be able to answer?
  • How do I have to design my model so that my questions can be answered easily?
  • Where do I get the data to populate the model?
  • How do I fill my model with the data?

As a training exercise, we investigated an issue that comes up with our support team from time to time: the system suddenly and unexpectedly starts to behave unusually, nobody has changed anything, and nobody knows why. An example from a customer last week: the system starts to become unstable at 3 pm, is then hard-restarted, and stabilises again from 4 pm onwards...

The easiest thing to do in such a case is to take a quick look at the database with the SHOW PROCESSLIST command; it then often becomes immediately clear where the problem lies. But customers often forget to do this, or they are not fast enough. The General Query Log was already switched on for this customer, so this was a great case for our General Query Log Data Warehouse!
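
In case you want to reproduce this: the General Query Log can be checked and switched on at runtime via the standard MySQL/MariaDB system variables:

SHOW GLOBAL VARIABLES LIKE 'general_log%';
SET GLOBAL general_log = ON;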

What questions should my data warehouse be able to answer?

The generic question for this problem should be something like: "Who or what caused my system to behave abnormally?"

In technical terms, the question would be something like:

  • Who: Which user or account was on the database with how many connections at the time in question? What was unusual about it?
  • What: Which queries were running in which schema on the system at the time in question? Which of these queries were unusual?

What should my model look like?

We can already derive some facts and dimensions from the question:

  • User or account (user + host)
  • Time
  • Connections
  • Schema
  • Queries

From this we derive four dimensions and the fact table:
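
A minimal sketch of this star schema in SQL could look as follows. The table and column names are taken from the queries further down; the data types and key definitions are assumptions:

-- Dimension tables (the time dimension is split into a date part and a time-of-day part):
CREATE TABLE date_dim (
  date_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  date    DATE NOT NULL,
  UNIQUE KEY (date)
);

CREATE TABLE time_dim (
  time_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  time    TIME NOT NULL,
  UNIQUE KEY (time)
);

CREATE TABLE connection_dim (
  connection_id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  connection_number INT UNSIGNED NOT NULL,
  user              VARCHAR(128) NOT NULL,
  hostname          VARCHAR(128) NOT NULL,
  UNIQUE KEY (connection_number, user, hostname)
);

CREATE TABLE schema_dim (
  schema_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  schema_name VARCHAR(64) NOT NULL,
  UNIQUE KEY (schema_name)
);

CREATE TABLE statement_dim (
  statement_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  command        VARCHAR(32) NOT NULL,
  statement_text TEXT
);

-- The fact table references all dimensions:
CREATE TABLE query_fact (
  query_fact_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  date_id       INT UNSIGNED NOT NULL,
  time_id       INT UNSIGNED NOT NULL,
  connection_id INT UNSIGNED NOT NULL,
  schema_id     INT UNSIGNED NOT NULL,
  statement_id  INT UNSIGNED NOT NULL
);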

Data source

Where the data comes from is easy to answer in this case: the customer provides their General Query Logs, or, for testing purposes, we can use the General Query Logs of our own systems.

How is the model populated?

Technically, this is known as an ETL process (Extract-Transform-Load). In our case, we have built a General Query Log parser that reads the General Query Log, prepares the data accordingly and saves it in the model.
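
The parser itself is beyond the scope of this article, but the load step can be sketched in SQL. A minimal sketch, assuming the table layout above with UNIQUE keys on the natural dimension columns; the literal values stand for one parsed log line:

-- Look up or create the dimension rows for one parsed log line.
INSERT INTO date_dim (date) VALUES ('2019-08-02')
    ON DUPLICATE KEY UPDATE date_id = LAST_INSERT_ID(date_id);
SET @date_id = LAST_INSERT_ID();

INSERT INTO time_dim (time) VALUES ('17:59:00')
    ON DUPLICATE KEY UPDATE time_id = LAST_INSERT_ID(time_id);
SET @time_id = LAST_INSERT_ID();

INSERT INTO connection_dim (connection_number, user, hostname)
VALUES (211, 'brman', 'localhost')
    ON DUPLICATE KEY UPDATE connection_id = LAST_INSERT_ID(connection_id);
SET @connection_id = LAST_INSERT_ID();

-- schema_dim and statement_dim are handled the same way, then the fact row
-- is written referencing all dimension keys:
INSERT INTO query_fact (date_id, time_id, connection_id, schema_id, statement_id)
VALUES (@date_id, @time_id, @connection_id, @schema_id, @statement_id);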

Checking the model

Finally, we come to checking the model. We used test data from one of our systems for this:

  • Which user was on the system at the time in question?
  • Which user had how many connections open at the time in question?

SELECT td.time, cd.user, COUNT(*) AS count
  FROM connection_dim AS cd
  JOIN query_fact AS qf ON qf.connection_id = cd.connection_id
  JOIN time_dim AS td ON td.time_id = qf.time_id
  JOIN date_dim AS dd ON dd.date_id = qf.date_id
 WHERE td.time BETWEEN '17:00' AND '18:30'
   AND dd.date = '2019-08-02'
 GROUP BY td.time, cd.user
 ORDER BY td.time ASC, cd.user
;
+----------+---------------+-------+
| time     | user          | count |
+----------+---------------+-------+
| 17:58:00 | UNKNOWN USER  |     1 |
| 17:59:00 | brman         |    58 |
| 17:59:00 | brman_catalog |    18 |
| 17:59:00 | root          |     5 |
| 18:00:00 | brman         |   296 |
| 18:00:00 | brman_catalog |     7 |
| 18:00:00 | root          |     3 |
| 18:01:00 | brman_catalog |    18 |
| 18:01:00 | root          |     3 |
| 18:06:00 | brman         |   266 |
| 18:06:00 | brman_catalog |     6 |
| 18:07:00 | brman         |    88 |
| 18:07:00 | brman_catalog |     7 |
| 18:10:00 | brman         |   211 |
| 18:10:00 | brman_catalog |    18 |
| 18:10:00 | root          |     4 |
| 18:11:00 | brman         |   141 |
| 18:11:00 | root          |     3 |
| 18:13:00 | brman         |     4 |
| 18:14:00 | brman         |   348 |
| 18:17:00 | brman         |   354 |
| 18:17:00 | brman_catalog |    12 |
| 18:17:00 | root          |     1 |
+----------+---------------+-------+

  • Which account was on the system at the time in question?
  • Which account had how many connections open at the time in question?

SELECT td.time, cd.user, cd.hostname, COUNT(*) AS count
  FROM connection_dim AS cd
  JOIN query_fact AS qf ON qf.connection_id = cd.connection_id
  JOIN time_dim AS td ON td.time_id = qf.time_id
  JOIN date_dim AS dd ON dd.date_id = qf.date_id
 WHERE td.time BETWEEN '17:00' AND '18:30'
   AND dd.date = '2019-08-02'
 GROUP BY td.time, cd.user, cd.hostname
 ORDER BY td.time ASC, cd.user
;
+----------+---------------+--------------+-------+
| time     | user          | hostname     | count |
+----------+---------------+--------------+-------+
| 17:58:00 | UNKNOWN USER  | UNKNOWN HOST |     1 |
| 17:59:00 | brman         | localhost    |    58 |
| 17:59:00 | brman_catalog | localhost    |    18 |
| 17:59:00 | root          | localhost    |     5 |
| 18:00:00 | brman         | localhost    |   296 |
| 18:00:00 | brman_catalog | localhost    |     7 |
| 18:00:00 | root          | localhost    |     3 |
| 18:01:00 | brman_catalog | localhost    |    18 |
| 18:01:00 | root          | localhost    |     3 |
| 18:06:00 | brman         | localhost    |   266 |
| 18:06:00 | brman_catalog | localhost    |     6 |
| 18:07:00 | brman         | localhost    |    88 |
| 18:07:00 | brman_catalog | localhost    |     7 |
| 18:10:00 | brman         | localhost    |   211 |
| 18:10:00 | brman_catalog | localhost    |    18 |
| 18:10:00 | root          | localhost    |     4 |
| 18:11:00 | brman         | localhost    |   141 |
| 18:11:00 | root          | localhost    |     3 |
| 18:13:00 | brman         | localhost    |     4 |
| 18:14:00 | brman         | localhost    |   348 |
| 18:17:00 | brman         | localhost    |   354 |
| 18:17:00 | brman_catalog | localhost    |    12 |
| 18:17:00 | root          | localhost    |     1 |
+----------+---------------+--------------+-------+

  • What was unusual about it?

SELECT cd.user, td.time, COUNT(*) AS count
  FROM connection_dim AS cd
  JOIN query_fact AS qf ON qf.connection_id = cd.connection_id
  JOIN time_dim AS td ON td.time_id = qf.time_id
  JOIN date_dim AS dd ON dd.date_id = qf.date_id
 WHERE td.time BETWEEN '17:00' AND '18:30'
   AND dd.date = '2019-08-02'
 GROUP BY td.time, cd.user
 ORDER BY cd.user ASC, td.time ASC
;
+---------------+----------+-------+
| user          | time     | count |
+---------------+----------+-------+
| brman         | 17:59:00 |    58 |
| brman         | 18:00:00 |   296 |
| brman         | 18:06:00 |   266 |
| brman         | 18:07:00 |    88 |
| brman         | 18:10:00 |   211 |
| brman         | 18:11:00 |   141 |
| brman         | 18:13:00 |     4 |
| brman         | 18:14:00 |   348 |
| brman         | 18:17:00 |   354 |
| brman_catalog | 17:59:00 |    18 |
| brman_catalog | 18:00:00 |     7 |
| brman_catalog | 18:01:00 |    18 |
| brman_catalog | 18:06:00 |     6 |
| brman_catalog | 18:07:00 |     7 |
| brman_catalog | 18:10:00 |    18 |
| brman_catalog | 18:17:00 |    12 |
| root          | 17:59:00 |     5 |
| root          | 18:00:00 |     3 |
| root          | 18:01:00 |     3 |
| root          | 18:10:00 |     4 |
| root          | 18:11:00 |     3 |
| root          | 18:17:00 |     1 |
| UNKNOWN USER  | 17:58:00 |     1 |
+---------------+----------+-------+

One could deduce from this, for example, that the user brman was unusually active during the period in question. Whether this is really unusual we cannot say: we have too little data, and the observed time period is too short.
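
Whether a value is unusual could be judged against a baseline once more days of data are loaded, for example the average and maximum number of statements per user and minute (a sketch against the same model):

SELECT per_minute.user, AVG(per_minute.cnt) AS avg_per_minute, MAX(per_minute.cnt) AS max_per_minute
  FROM ( SELECT cd.user, dd.date, td.time, COUNT(*) AS cnt
           FROM connection_dim AS cd
           JOIN query_fact AS qf ON qf.connection_id = cd.connection_id
           JOIN time_dim AS td ON td.time_id = qf.time_id
           JOIN date_dim AS dd ON dd.date_id = qf.date_id
          GROUP BY cd.user, dd.date, td.time
       ) AS per_minute
 GROUP BY per_minute.user
 ORDER BY per_minute.user
;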

  • Which queries were running on the system at the time in question and in which schema?
  • Which of these queries were unusual?

SELECT sd.schema_name, td.time, SUBSTR(std.statement_text, 1, 128) AS query
  FROM query_fact AS qf
  JOIN time_dim AS td ON td.time_id = qf.time_id
  JOIN schema_dim AS sd ON sd.schema_id = qf.schema_id
  JOIN statement_dim AS std ON std.statement_id = qf.statement_id
 WHERE td.time BETWEEN '17:00' AND '18:30'
   AND sd.schema_name = 'brman_catalog'
   AND std.command = 'Query'
 ORDER BY td.time, qf.statement_id
 LIMIT 10
;
+---------------+----------+----------------------------------------------------------------------------------------------------------------------------------+
| schema_name   | time     | query                                                                                                                            |
+---------------+----------+----------------------------------------------------------------------------------------------------------------------------------+
| brman_catalog | 17:59:00 | SET NAMES `utf8`                                                                                                                 |
| brman_catalog | 17:59:00 | SELECT COUNT ( * ) AS `cnt` FROM `information_schema` . `tables` WHERE `table_schema` = ? AND TABLE_NAME = ?                     |
| brman_catalog | 17:59:00 | SELECT COUNT ( * ) AS `cnt` FROM `information_schema` . `tables` WHERE `table_schema` = ? AND TABLE_NAME = ?                     |
| brman_catalog | 17:59:00 | CREATE TABLE `metadata` ( `id` TINYINT UNSIGNED NOT NULL AUTO_INCREMENT , `key` VARCHARACTER (?) NOT NULL , `value` VARCHARACTER |
| brman_catalog | 17:59:00 | INSERT INTO `metadata` ( `key` , `value` ) VALUES (...)                                                                          |
| brman_catalog | 17:59:00 | INSERT INTO `metadata` ( `key` , `value` ) VALUES (...)                                                                          |
| brman_catalog | 17:59:00 | CREATE TABLE `backups` ( `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT , `instance_name` VARCHARACTER (?) NOT NULL , `start_ts`  |
| brman_catalog | 17:59:00 | CREATE TABLE `backup_details` ( `backup_id` INTEGER UNSIGNED NOT NULL , `hostname` VARCHARACTER (?) NULL , `binlog_file` VARCHAR |
| brman_catalog | 17:59:00 | CREATE TABLE `files` ( `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT , `schema_name` VARCHARACTER (?) NULL , `original_name` VAR |
| brman_catalog | 17:59:00 | CREATE TABLE `binary_logs` ( `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT , `filename` VARCHARACTER (?) NOT NULL , `begin_ts` I |
+---------------+----------+----------------------------------------------------------------------------------------------------------------------------------+

Suggestions for improvement

Based on this first iteration, you can already see which questions the model cannot yet answer and where it is still too imprecise. This can then be improved in a second round...

Examples of this are:

  • The granularity of the time dimension, with an accuracy of one minute, may be too coarse. Would it make more sense to use seconds?
  • The question of how long a connection was open is not so easy to answer. Perhaps a further fact table would be appropriate here?
        SELECT cd.connection_number, cd.user, cd.hostname, tdf.time AS time_from, tdt.time AS time_to, (TIME_TO_SEC(tdt.time) - TIME_TO_SEC(tdf.time)) AS duration
          FROM connection_dim AS cd
          JOIN query_fact AS qf1 ON cd.connection_id = qf1.connection_id
          JOIN time_dim AS tdf ON tdf.time_id = qf1.time_id
          JOIN statement_dim AS sdf ON sdf.statement_id = qf1.statement_id
          JOIN query_fact AS qf2 ON cd.connection_id = qf2.connection_id
          JOIN time_dim AS tdt ON tdt.time_id = qf2.time_id
          JOIN statement_dim AS sdt ON sdt.statement_id = qf2.statement_id
         WHERE tdf.time BETWEEN '17:00' AND '18:30'
           AND sdf.command = 'Connect'
           AND sdt.command = 'Quit'
           AND (TIME_TO_SEC(tdt.time) - TIME_TO_SEC(tdf.time)) > 0
         ORDER BY tdf.time
        ;
        +-------------------+-------+-----------+-----------+----------+----------+
        | connection_number | user  | hostname  | time_from | time_to  | duration |
        +-------------------+-------+-----------+-----------+----------+----------+
        |               211 | brman | localhost | 17:59:00  | 18:00:00 |       60 |
        |               215 | root  | localhost | 18:00:00  | 18:17:00 |     1020 |
        |               219 | brman | localhost | 18:06:00  | 18:07:00 |       60 |
        |               225 | brman | localhost | 18:10:00  | 18:11:00 |       60 |
        |               226 | brman | localhost | 18:13:00  | 18:14:00 |       60 |
        +-------------------+-------+-----------+-----------+----------+----------+
        
  • Of course, it would be exciting to use an AI to solve the problem. How do you train it correctly, and does it find the problem once it has been trained?

So much for this little exercise in building a data warehouse...
