You are here
We build a data warehouse from the General Query Log
The design of a data warehouse differs from relational design. Data warehouses are often designed according to the concept of the star schema.
When building a data warehouse, you usually put the cart before the horse:
- What questions should my data warehouse be able to answer?
- How do I have to design my model so that my questions can be answered easily?
- Where do I get the data to populate the model?
- How do I fill my model with the data?
For training purposes, we have investigated an issue that arises from time to time with our support team: The system suddenly and unexpectedly starts to behave unusually, nobody has done anything and nobody knows why. Example with a customer last week: The system starts to become unstable at 3 pm, is then restarted hard and then stabilises again from 4 pm...
The easiest thing to do in such a case would be to quickly look at the database with the SHOW PROCESSLIST
command and then it often becomes immediately clear where the problem lies. But customers often forget this or they are not fast enough. The General Query Log was already switched on for this customer, so this would be a great case for our General Query Log Data Warehouse!
What questions should my data warehouse be able to answer?
The generic question for this problem should be something like: "Who or what caused my system to behave abnormally?"
In technical terms, the question would be something like:
- Who: Which user or account was on the database with how many connections at the time in question? What was unusual about it?
- What: Which queries were running in which schema on the system at the time in question? Which of these queries were unusual?
What should my model look like?
We can already derive some facts and dimensions from the question:
- User or account (user + host)
- Time
- Connections
- Schema
- Queries
And this also results in 4 dimensions and the fact table:
Data source
Where the data comes from is relatively easy to answer in this case: The customer provides his General Query Logs or you can also use the General Query Logs of our own systems for testing purposes.
How is the model populated?
Technically, this is known as an ETL process (Extract-Transform-Load). In our case, we have built a General Query Log parser that reads the General Query Log, prepares the data accordingly and saves it in the model.
Checking the model
And then we come to checking the model. We used test data from one of our systems for this:
- Which user was on the system at the time in question?
- Which user had how many connections open at the time in question?
SELECT td.time, cd.user, COUNT(*) AS count FROM connection_dim cd JOIN query_fact AS qf ON qf.connection_id = cd.connection_id JOIN time_dim AS td ON td.time_id = qf.time_id JOIN date_dim AS dd ON dd.date_id = qf.date_id WHERE td.time BETWEEN '17:00' AND '18:30' AND dd.date = '2019-08-02' GROUP BY td.time, cd.user ORDER BY td.time ASC, cd.user ; +----------+---------------+-------+ | time | user | count | +----------+---------------+-------+ | 17:58:00 | UNKNOWN USER | 1 | | 17:59:00 | brman | 58 | | 17:59:00 | brman_catalog | 18 | | 17:59:00 | root | 5 | | 18:00:00 | brman | 296 | | 18:00:00 | brman_catalog | 7 | | 18:00:00 | root | 3 | | 18:01:00 | brman_catalog | 18 | | 18:01:00 | root | 3 | | 18:06:00 | brman | 266 | | 18:06:00 | brman_catalog | 6 | | 18:07:00 | brman | 88 | | 18:07:00 | brman_catalog | 7 | | 18:10:00 | brman | 211 | | 18:10:00 | brman_catalog | 18 | | 18:10:00 | root | 4 | | 18:11:00 | brman | 141 | | 18:11:00 | root | 3 | | 18:13:00 | brman | 4 | | 18:14:00 | brman | 348 | | 18:17:00 | brman | 354 | | 18:17:00 | brman_catalog | 12 | | 18:17:00 | root | 1 | +----------+---------------+-------+
- Which account was on the system at the time in question?
- Which account had how many connections open at the time in question?
SELECT td.time, cd.user, cd.hostname, COUNT(*) AS count FROM connection_dim cd JOIN query_fact AS qf ON qf.connection_id = cd.connection_id JOIN time_dim AS td ON td.time_id = qf.time_id JOIN date_dim AS dd ON dd.date_id = qf.date_id WHERE td.time BETWEEN '17:00' AND '18:30' AND dd.date = '2019-08-02' GROUP BY td.time, cd.user, cd.hostname ORDER BY td.time ASC, cd.user ; +----------+---------------+--------------+-------+ | time | user | hostname | count | +----------+---------------+--------------+-------+ | 17:58:00 | UNKNOWN USER | UNKNOWN HOST | 1 | | 17:59:00 | brman | localhost | 58 | | 17:59:00 | brman_catalog | localhost | 18 | | 17:59:00 | root | localhost | 5 | | 18:00:00 | brman | localhost | 296 | | 18:00:00 | brman_catalog | localhost | 7 | | 18:00:00 | root | localhost | 3 | | 18:01:00 | brman_catalog | localhost | 18 | | 18:01:00 | root | localhost | 3 | | 18:06:00 | brman | localhost | 266 | | 18:06:00 | brman_catalog | localhost | 6 | | 18:07:00 | brman | localhost | 88 | | 18:07:00 | brman_catalog | localhost | 7 | | 18:10:00 | brman | localhost | 211 | | 18:10:00 | brman_catalog | localhost | 18 | | 18:10:00 | root | localhost | 4 | | 18:11:00 | brman | localhost | 141 | | 18:11:00 | root | localhost | 3 | | 18:13:00 | brman | localhost | 4 | | 18:14:00 | brman | localhost | 348 | | 18:17:00 | brman | localhost | 354 | | 18:17:00 | brman_catalog | localhost | 12 | | 18:17:00 | root | localhost | 1 | +----------+---------------+--------------+-------+
- What was unusual about it?
SELECT cd.user, td.time, COUNT(*) AS count FROM connection_dim cd JOIN query_fact AS qf ON qf.connection_id = cd.connection_id JOIN time_dim AS td ON td.time_id = qf.time_id JOIN date_dim AS dd ON dd.date_id = qf.date_id WHERE td.time BETWEEN '17:00' AND '18:30' AND dd.date = '2019-08-02' GROUP BY td.time, cd.user ORDER BY cd.user ASC, td.time ASC ; +---------------+----------+-------+ | user | time | count | +---------------+----------+-------+ | brman | 17:59:00 | 58 | | brman | 18:00:00 | 296 | | brman | 18:06:00 | 266 | | brman | 18:07:00 | 88 | | brman | 18:10:00 | 211 | | brman | 18:11:00 | 141 | | brman | 18:13:00 | 4 | | brman | 18:14:00 | 348 | | brman | 18:17:00 | 354 | | brman_catalog | 17:59:00 | 18 | | brman_catalog | 18:00:00 | 7 | | brman_catalog | 18:01:00 | 18 | | brman_catalog | 18:06:00 | 6 | | brman_catalog | 18:07:00 | 7 | | brman_catalog | 18:10:00 | 18 | | brman_catalog | 18:17:00 | 12 | | root | 17:59:00 | 5 | | root | 18:00:00 | 3 | | root | 18:01:00 | 3 | | root | 18:10:00 | 4 | | root | 18:11:00 | 3 | | root | 18:17:00 | 1 | | UNKNOWN USER | 17:58:00 | 1 | +---------------+----------+-------+
One could deduce here, for example, that the user brman had a relatively large number of open connections during the period in question. Whether this is unusual, we have too little data or the time period is too short.
- Which queries were running on the system at the time in question and in which schema?
- Which of these queries were unusual?
SELECT sd.schema_name, td.time, SUBSTR(std.statement_text, 1, 128) AS query FROM query_fact AS qf JOIN time_dim AS td ON td.time_id = qf.time_id JOIN schema_dim AS sd ON sd.schema_id = qf.schema_id JOIN statement_dim AS std ON std.statement_id = qf.statement_id WHERE td.time BETWEEN '17:00' AND '18:30' AND sd.schema_name = 'brman_catalog' AND std.command = 'Query' ORDER BY td.time, qf.statement_id LIMIT 10 ; +---------------+----------+----------------------------------------------------------------------------------------------------------------------------------+ | schema_name | time | query | +---------------+----------+----------------------------------------------------------------------------------------------------------------------------------+ | brman_catalog | 17:59:00 | SET NAMES `utf8` | | brman_catalog | 17:59:00 | SELECT COUNT ( * ) AS `cnt` FROM `information_schema` . `tables` WHERE `table_schema` = ? AND TABLE_NAME = ? | | brman_catalog | 17:59:00 | SELECT COUNT ( * ) AS `cnt` FROM `information_schema` . `tables` WHERE `table_schema` = ? AND TABLE_NAME = ? | | brman_catalog | 17:59:00 | CREATE TABLE `metadata` ( `id` TINYINT UNSIGNED NOT NULL AUTO_INCREMENT , `key` VARCHARACTER (?) NOT NULL , `value` VARCHARACTER | | brman_catalog | 17:59:00 | INSERT INTO `metadata` ( `key` , `value` ) VALUES (...) | | brman_catalog | 17:59:00 | INSERT INTO `metadata` ( `key` , `value` ) VALUES (...) | | brman_catalog | 17:59:00 | CREATE TABLE `backups` ( `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT , `instance_name` VARCHARACTER (?) NOT NULL , `start_ts` | | brman_catalog | 17:59:00 | CREATE TABLE `backup_details` ( `backup_id` INTEGER UNSIGNED NOT NULL , `hostname` VARCHARACTER (?) NULL , `binlog_file` VARCHAR | | brman_catalog | 17:59:00 | CREATE TABLE `files` ( `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT , `schema_name` VARCHARACTER (?) NULL , `original_name` VAR | | brman_catalog | 17:59:00 | CREATE TABLE `binary_logs` ( `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT , `filename` VARCHARACTER (?) NOT NULL , `begin_ts` I | +---------------+----------+----------------------------------------------------------------------------------------------------------------------------------+
Suggestions for improvement
Based on this first iteration of the model, you can already see which questions the model cannot yet answer or where the model is too imprecise. This can then be improved in a second round....
Examples of this are:
- The granularity of the time dimension may be too coarse with an accuracy of minutes. Would it make more sense to use seconds?
- The question of how long a connection was open is not so easy to answer. Perhaps a further fact table would be appropriate here?
SELECT cd.connection_number, cd.user, cd.hostname, tdf.time AS time_from, tdt.time AS time_to, (UNIX_TIMESTAMP(tdt.time) - UNIX_TIMESTAMP(tdf.time)) AS duration FROM connection_dim AS cd JOIN query_fact AS qf1 ON cd.connection_id = qf1.connection_id JOIN time_dim AS tdf ON tdf.time_id = qf1.time_id JOIN statement_dim AS sdf ON sdf.statement_id = qf1.statement_id JOIN query_fact AS qf2 ON cd.connection_id = qf2.connection_id JOIN time_dim AS tdt ON tdt.time_id = qf2.time_id JOIN statement_dim AS sdt ON sdt.statement_id = qf2.statement_id WHERE tdf.time BETWEEN '17:00' AND '18:30' AND sdf.command = 'Connect' AND sdt.command = 'Quit' AND (UNIX_TIMESTAMP(tdt.time) - UNIX_TIMESTAMP(tdf.time)) > 0 ORDER BY tdf.time ; +-------------------+-------+-----------+-----------+----------+----------+ | connection_number | user | hostname | time_from | time_to | duration | +-------------------+-------+-----------+-----------+----------+----------+ | 211 | brman | localhost | 17:59:00 | 18:00:00 | 60 | | 215 | root | localhost | 18:00:00 | 18:17:00 | 1020 | | 219 | brman | localhost | 18:06:00 | 18:07:00 | 60 | | 225 | brman | localhost | 18:10:00 | 18:11:00 | 60 | | 226 | brman | localhost | 18:13:00 | 18:14:00 | 60 | +-------------------+-------+-----------+-----------+----------+----------+
- Of course, it would be exciting if an AI were used to solve the problem. How do you train it correctly and does it find the problem once it has been trained?
So much for the little gimmick of building a data warehouse...
- Shinguz's blog
- Log in or register to post comments