Creating synthetic data sets for tuning SQL queries
When it comes to SQL query tuning for customers, we often get the slow-running SQL query and possibly, in the good cases, also the table structure. But very often, for various reasons, we do not get the data.
SQL query tuning on an empty table, or on a table with only little data, is not much fun: either the decisions of the optimizer have nothing to do with reality, or the response times do not really show whether your change has improved anything. For example, if your query response time was 2 ms before the change and 1 ms after, this can be the consequence of your improvement, but it is more probably just a hiccup of your system.
So what can you do to get valid results from the SQL query optimizer during SQL query tuning?
- The best case is that you get real data from the customer, in size and in content.
- The second best case is that you get real data from the customer in content. Then you can analyze this content and synthetically pump it up.
- The worst case is that you get no data at all from your customer. In this case you have to create your own data set, in size (this is easy) and in content (this is a bit more tricky).
So let us have a look at how we get to this synthetic data.
Creating data volume by pumping up the table
We get from the customer a slow query for a data-cleansing job on the call detail record (CDR) table. This table is used in telecom solutions like VoIP (Asterisk, OpenSIPS), PBX, and so on:
```sql
SELECT COUNT(*) FROM cdrs WHERE start >= end;
```
And fortunately we also get the CDR table:
```sql
CREATE TABLE `cdrs` (
  `uniqueid` varchar(40) NOT NULL,
  `callid` varchar(40) NOT NULL,
  `asteriskid` varchar(20) NOT NULL DEFAULT '',
  `machine` int(11) NOT NULL DEFAULT 0,
  `status` varchar(15) NOT NULL DEFAULT '',
  `start` TIMESTAMP(6) NOT NULL,
  `end` TIMESTAMP(6) NOT NULL,
  `scustomer` int(11) NOT NULL DEFAULT 0,
  `stype` varchar(30) NOT NULL DEFAULT '',
  `snumber` varchar(255) NOT NULL DEFAULT '',
  `dcustomer` int(11) NOT NULL DEFAULT 0,
  `dtype` varchar(30) NOT NULL DEFAULT '',
  `dnumber` varchar(255) NOT NULL DEFAULT '',
  PRIMARY KEY (`uniqueid`),
  KEY `server` (`callid`),
  KEY `start` (`start`),
  KEY `scustomer` (`scustomer`,`start`),
  KEY `dcustomer` (`dcustomer`,`start`),
  KEY `asteriskid` (`asteriskid`)
);
```
So how can we pump up this table to get a decent volume? For pumping up the table we use the idea behind the wheat (or rice) and chessboard problem:
We first insert one row and then pump the table up by inserting its own rows back into itself. This gives us exponential growth, and after about 20 doubling statements we have a million rows (2^20 = 1048576). Run time, however, may become an issue sooner or later:
```sql
INSERT INTO cdrs
SELECT UUID(), MD5(RAND()), '', 0, '', FROM_UNIXTIME(ROUND(RAND() * UNIX_TIMESTAMP(), 6)), 0, 0, '', '', 0, '', ''
;
Query OK, 1 row affected (0.001 sec)
Records: 1  Duplicates: 0  Warnings: 0

INSERT INTO cdrs
SELECT UUID(), MD5(RAND()), '', 0, '', FROM_UNIXTIME(ROUND(RAND() * UNIX_TIMESTAMP(), 6)), 0, 0, '', '', 0, '', ''
  FROM cdrs
;
-- Repeat this second query about 20 times to get 1 Mio rows.
Query OK, 1 row affected (0.001 sec)
Records: 1  Duplicates: 0  Warnings: 0

Query OK, 2 rows affected (0.001 sec)
Records: 2  Duplicates: 0  Warnings: 0

Query OK, 4 rows affected (0.000 sec)
Records: 4  Duplicates: 0  Warnings: 0

Query OK, 8 rows affected (0.000 sec)
Records: 8  Duplicates: 0  Warnings: 0

Query OK, 16 rows affected (0.001 sec)
Records: 16  Duplicates: 0  Warnings: 0

Query OK, 32 rows affected (0.002 sec)
Records: 32  Duplicates: 0  Warnings: 0

Query OK, 64 rows affected (0.002 sec)
Records: 64  Duplicates: 0  Warnings: 0

Query OK, 128 rows affected (0.094 sec)
Records: 128  Duplicates: 0  Warnings: 0

Query OK, 256 rows affected (1.406 sec)
Records: 256  Duplicates: 0  Warnings: 0

Query OK, 512 rows affected (2.747 sec)
Records: 512  Duplicates: 0  Warnings: 0

Query OK, 1024 rows affected (4.888 sec)
Records: 1024  Duplicates: 0  Warnings: 0
```
What happened here???
```
Query OK, 2048 rows affected (0.178 sec)
Records: 2048  Duplicates: 0  Warnings: 0

Query OK, 4096 rows affected (0.259 sec)
Records: 4096  Duplicates: 0  Warnings: 0

Query OK, 8192 rows affected (1.879 sec)
Records: 8192  Duplicates: 0  Warnings: 0

Query OK, 16384 rows affected (4.149 sec)
Records: 16384  Duplicates: 0  Warnings: 0

Query OK, 32768 rows affected (3.256 sec)
Records: 32768  Duplicates: 0  Warnings: 0

Query OK, 65536 rows affected (7.209 sec)
Records: 65536  Duplicates: 0  Warnings: 0

Query OK, 131072 rows affected (13.555 sec)
Records: 131072  Duplicates: 0  Warnings: 0
```
Buffer Pool seems to be full! More RAM helps more...
```
Query OK, 262144 rows affected (6 min 17.659 sec)
Records: 262144  Duplicates: 0  Warnings: 0
```
We increased the Buffer Pool (online!) by a factor of 6 and waited for the dirty pages to be flushed.
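For reference: recent MariaDB (10.2 and later) and MySQL (5.7 and later) versions allow resizing the InnoDB Buffer Pool online. A minimal sketch; the target size of 6 GiB below is an assumed example, not the actual value we used:

```sql
-- Hypothetical target size of 6 GiB; adapt to your machine and data volume.
SET GLOBAL innodb_buffer_pool_size = 6 * 1024 * 1024 * 1024;
```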
```
Query OK, 524288 rows affected (1 min 14.629 sec)
Records: 524288  Duplicates: 0  Warnings: 0
```
It definitely helped! More RAM helps more!!!
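By the way, if you do not want to paste the doubling statement about 20 times by hand, you can loop it in a stored procedure. This is a minimal sketch of our own (the name pump_up_cdrs is our invention, not part of the original setup); it assumes the single seed row has already been inserted:

```sql
DELIMITER //

CREATE PROCEDURE pump_up_cdrs(IN doublings INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  -- Each iteration doubles the table size by inserting a fresh random copy of all rows.
  WHILE i < doublings DO
    INSERT INTO cdrs
    SELECT UUID(), MD5(RAND()), '', 0, '', FROM_UNIXTIME(ROUND(RAND() * UNIX_TIMESTAMP(), 6)), 0, 0, '', '', 0, '', ''
      FROM cdrs;
    SET i = i + 1;
  END WHILE;
END
//

DELIMITER ;

-- 20 doublings of the single seed row yield 2^20 = 1048576 rows.
CALL pump_up_cdrs(20);
```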
How to get more or less useful or realistic data
| Description | Function | Example |
|---|---|---|
| Unique ID | `SELECT UUID();` | 09e16608-017f-11eb-9cc7-a8a15920b138 |
| Random float from 0 to just below 100 | `SELECT RAND() * 100;` | 82.15320322863124 |
| Random integer from 10 to 19 | `SELECT 10 + FLOOR(RAND() * 10);` | 10 |
| Random float for currencies | `SELECT ROUND(RAND() * 1000, 2);` | 628.07 |
| Random string | `SELECT UUID(), MD5(RAND()), CRC32(RAND());` | 232fccee-017f-11eb-9cc7-a8a15920b138 / 0e468db120211529f5fc2940994024a8 / 263783538 |
| Random timestamp ≥ 1970-01-01 00:00:00 UTC | `SELECT FROM_UNIXTIME(ROUND(RAND() * UNIX_TIMESTAMP(), 6));` | 1992-06-30 11:04:04.784335 |
| Random timestamp ≥ 2020-01-01 and < 2020-12-31 | `SELECT FROM_UNIXTIME(UNIX_TIMESTAMP('2020-01-01') + (365 * 24 * 3600 * RAND()));` | 2020-08-06 04:48:53.342219 |
| Some kind of email address | `SELECT CONCAT(CRC32(RAND()), '@', MD5(RAND()), '.com');` | 1619088853@6b20a5dad4522feee5efbfd3ebb17d71.com |
| Time range of 21 days from now | `SELECT FROM_UNIXTIME(@begin := UNIX_TIMESTAMP()), FROM_UNIXTIME(@begin + (86400 * 21));` | 2020-10-02 11:14:48 / 2020-10-23 11:14:48 |
| Street name | `SELECT CONCAT(CRC32(RAND()), 'street ', (1 + FLOOR(RAND() * 100)));` | 3416042219street 14 |
| Value selection | `SELECT CASE ROUND(RAND() * 3, 0) WHEN 1 THEN 'Value 1' WHEN 2 THEN 'Value 2' WHEN 3 THEN 'Value 3' ELSE 'All other values' END AS selection;` | Value 3 |
| 50% of data NULL | `IF(ROUND(RAND()) = 0, NULL, 1)` | NULL or 1 |
and there are surely many more possibilities...
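Such building blocks can also be combined to fill a whole column. As a sketch, here we give the status column of the cdrs table a non-uniform value distribution; the concrete status values and their shares are pure assumptions, not customer data:

```sql
UPDATE cdrs
   SET status = CASE FLOOR(RAND() * 10)
                  WHEN 0 THEN 'FAILED'     -- ~10% of rows
                  WHEN 1 THEN 'NO ANSWER'  -- ~10% of rows
                  ELSE 'ANSWERED'          -- ~80% of rows
                END
;
```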
Now back to the query. With no rows in the table, the query execution plan shown by EXPLAIN looks as follows:
```
SQL> EXPLAIN SELECT COUNT(*) FROM cdrs WHERE start >= end;
+------+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+------+-------------+
|    1 | SIMPLE      | cdrs  | ALL  | NULL          | NULL | NULL    | NULL |    1 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+------+-------------+
```
and the query run time is this:
```
SQL> SELECT COUNT(*) FROM cdrs WHERE start >= end;
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
1 row in set (0.004 sec)
```
After pumping up the table, the query execution plan shown by EXPLAIN looks like this:
```
SQL> EXPLAIN SELECT COUNT(*) FROM cdrs WHERE start >= end;
+------+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+---------+-------------+
|    1 | SIMPLE      | cdrs  | ALL  | NULL          | NULL | NULL    | NULL | 1016314 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+---------+-------------+
```
and the query run time is this:
```
SQL> SELECT COUNT(*) FROM cdrs WHERE start >= end;
+----------+
| COUNT(*) |
+----------+
|  1048576 |
+----------+
1 row in set (0.230 sec)
```
So we have a good starting point for testing tuning measures on this SQL query...
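One additional hedged hint from our side: after such a bulk load it can be worth refreshing the table statistics, so that the optimizer's row estimates reflect the new data volume:

```sql
-- Recalculate index statistics for the freshly pumped-up table.
ANALYZE TABLE cdrs;
```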
But caution: There is still a logical error in the data above... Did you find it?
Whether and how this query can be sped up is another story...
Skewed data distribution
What we have so far is a completely random data distribution, or at least random within a specific range. This is very often far from reality. So we have to manipulate the random data distribution a bit further, in the direction of what reflects our reality.
There are two extreme cases: we are searching for one single unique value, or all values are equal. In reality we are somewhere in between.
If you were using a UUID or a hash function to create the data, the values should be pretty unique, so the first extreme case is already covered. Or you can specifically UPDATE one single row to your needs:
```sql
UPDATE cdrs
   SET start = '2020-04-29 12:00:00.000000', end = '2020-04-29 12:13:13.999999'
 WHERE uniqueid = 'd00c7166-01a4-11eb-9cc7-a8a15920b138';
```
The other extreme case, all values being equal, can also be solved by a simple UPDATE statement:
```sql
UPDATE cdrs SET machine = 42;
```
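To verify which distribution you actually produced, a simple aggregate helps (our own addition, not from the original job):

```sql
-- Show how many rows ended up with each machine value.
SELECT machine, COUNT(*)
  FROM cdrs
 GROUP BY machine
 ORDER BY COUNT(*) DESC;
```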
If you want to set only every nth row to a specific value, the modulo operation might help:
```sql
SET @nth = 7;

-- If the table had a numeric id column, plain modulo would do:
UPDATE cdrs SET machine = 42 WHERE (id % @nth) = 0;

-- Our cdrs table has no numeric id, so we derive a quasi-random number from one
-- character of the UUID instead; machine is set to 42 wherever that number is
-- not a multiple of @nth:
UPDATE cdrs SET machine = IF(HEX(SUBSTR(uniqueid, 8, 1)) % @nth, 42, machine);
```
And finally, if you need monotonically increasing numbers, this is one possibility:
```sql
SET @row_number = 0;

UPDATE cdrs SET machine = (@row_number := @row_number + 1);
```
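And if you need something in between the two extremes, a skewed distribution, you can bend RAND() with POW(). This is our own sketch; the exponent 3 and the range of 1000 customers are arbitrary choices:

```sql
-- POW(RAND(), 3) pushes most values towards 0, so a few customers
-- end up with most of the calls (a heavy-hitter distribution).
UPDATE cdrs SET scustomer = 1 + FLOOR(POW(RAND(), 3) * 1000);
```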