
Creating synthetic data sets for tuning SQL queries

When it comes to SQL query tuning for customers, we often get the slow running SQL query and, in good cases, also the table structure. But very often, for various reasons, we do not get the data.

SQL query tuning on an empty table, or on a table with only little data, is not really fun: either the decisions of the optimizer have nothing to do with reality, or the response times do not show whether your change has actually improved anything. For example, if your query response time was 2 ms before the change and 1 ms after, this can be the consequence of your improvement, but it is more probably just a hiccup of your system.

So what can you do to get valid results from the optimizer during SQL query tuning?

  • The best case is that you get real data from the customer, in both size and content.
  • The second best case is that you get real data from the customer in content only. Then you can analyze this content and synthetically pump it up.
  • The worst case is that you get no data at all from your customer. In this case you have to create your own data set, in size (which is easy) and in content (which is a bit more tricky).

So let us have a look at how we get to this synthetic data.

Creating data volume by pumping up the table

We get from the customer a slow query for a data cleansing job on the call detail record (CDR) table. This kind of table is used in telecom solutions like VoIP (Asterisk, OpenSIPS), PBX and so on.

SELECT COUNT(*) FROM cdrs WHERE start >= end;

And fortunately we also get the CDR table:

CREATE TABLE `cdrs` (
  `uniqueid` varchar(40) NOT NULL,
  `callid` varchar(40) NOT NULL,
  `asteriskid` varchar(20) NOT NULL DEFAULT '',
  `machine` int(11) NOT NULL DEFAULT 0,
  `status` varchar(15) NOT NULL DEFAULT '',
  `start` TIMESTAMP(6) NOT NULL,
  `end` TIMESTAMP(6) NOT NULL,
  `scustomer` int(11) NOT NULL DEFAULT 0,
  `stype` varchar(30) NOT NULL DEFAULT '',
  `snumber` varchar(255) NOT NULL DEFAULT '',
  `dcustomer` int(11) NOT NULL DEFAULT 0,
  `dtype` varchar(30) NOT NULL DEFAULT '',
  `dnumber` varchar(255) NOT NULL DEFAULT '',
  PRIMARY KEY (`uniqueid`),
  KEY `server` (`callid`),
  KEY `start` (`start`),
  KEY `scustomer` (`scustomer`,`start`),
  KEY `dcustomer` (`dcustomer`,`start`),
  KEY `asteriskid` (`asteriskid`)
);

So how can we pump up this table to get a decent volume? For pumping up the table we use the idea behind the wheat and chessboard problem: repeated doubling.

We first insert one row and then pump the table up by inserting rows selected from itself. This gives us exponential growth, and after a few statements we have enough data (although execution time may become an issue sooner or later):

INSERT INTO cdrs
SELECT UUID(), MD5(RAND()), '', 0, '', FROM_UNIXTIME(ROUND(RAND() * UNIX_TIMESTAMP(), 6)), 0, 0, '', '', 0, '', ''
;
Query OK, 1 row affected (0.001 sec)
Records: 1  Duplicates: 0  Warnings: 0

INSERT INTO cdrs
SELECT UUID(), MD5(RAND()), '', 0, '', FROM_UNIXTIME(ROUND(RAND() * UNIX_TIMESTAMP(), 6)), 0, 0, '', '', 0, '', ''
FROM cdrs
;
-- Repeat this second query about 20 times to get 1 Mio (2^20 = 1048576) rows,
-- or automate it with the procedure sketched after the output below.

Query OK, 1 row affected (0.001 sec)
Records: 1  Duplicates: 0  Warnings: 0

Query OK, 2 rows affected (0.001 sec)
Records: 2  Duplicates: 0  Warnings: 0

Query OK, 4 rows affected (0.000 sec)
Records: 4  Duplicates: 0  Warnings: 0

Query OK, 8 rows affected (0.000 sec)
Records: 8  Duplicates: 0  Warnings: 0

Query OK, 16 rows affected (0.001 sec)
Records: 16  Duplicates: 0  Warnings: 0

Query OK, 32 rows affected (0.002 sec)
Records: 32  Duplicates: 0  Warnings: 0

Query OK, 64 rows affected (0.002 sec)
Records: 64  Duplicates: 0  Warnings: 0

Query OK, 128 rows affected (0.094 sec)
Records: 128  Duplicates: 0  Warnings: 0

Query OK, 256 rows affected (1.406 sec)
Records: 256  Duplicates: 0  Warnings: 0

Query OK, 512 rows affected (2.747 sec)
Records: 512  Duplicates: 0  Warnings: 0

Query OK, 1024 rows affected (4.888 sec)
Records: 1024  Duplicates: 0  Warnings: 0

What happened here???

Query OK, 2048 rows affected (0.178 sec)
Records: 2048  Duplicates: 0  Warnings: 0

Query OK, 4096 rows affected (0.259 sec)
Records: 4096  Duplicates: 0  Warnings: 0

Query OK, 8192 rows affected (1.879 sec)
Records: 8192  Duplicates: 0  Warnings: 0

Query OK, 16384 rows affected (4.149 sec)
Records: 16384  Duplicates: 0  Warnings: 0

Query OK, 32768 rows affected (3.256 sec)
Records: 32768  Duplicates: 0  Warnings: 0

Query OK, 65536 rows affected (7.209 sec)
Records: 65536  Duplicates: 0  Warnings: 0

Query OK, 131072 rows affected (13.555 sec)
Records: 131072  Duplicates: 0  Warnings: 0

The Buffer Pool seems to be full! More RAM helps more...

Query OK, 262144 rows affected (6 min 17.659 sec)
Records: 262144  Duplicates: 0  Warnings: 0

Increased the Buffer Pool (online!) 6 times and waited for the dirty pages to be flushed.

Query OK, 524288 rows affected (1 min 14.629 sec)
Records: 524288  Duplicates: 0  Warnings: 0

It definitely helped! More RAM helps more!!!
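
Instead of running the second INSERT manually again and again, the doubling can be automated with a small stored procedure. This is only a sketch; the procedure name pump_up_cdrs is made up:

DELIMITER //

CREATE PROCEDURE pump_up_cdrs(IN target_rows BIGINT)
BEGIN
  -- Caution: the table must already contain the first (seed) row,
  -- otherwise this loop never terminates.
  WHILE (SELECT COUNT(*) FROM cdrs) < target_rows DO
    -- Double the current table contents.
    INSERT INTO cdrs
    SELECT UUID(), MD5(RAND()), '', 0, '', FROM_UNIXTIME(ROUND(RAND() * UNIX_TIMESTAMP(), 6)), 0, 0, '', '', 0, '', ''
      FROM cdrs;
  END WHILE;
END
//

DELIMITER ;

CALL pump_up_cdrs(1000000);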

How to get more or less useful or realistic data

  • Unique ID: SELECT UUID(); → 09e16608-017f-11eb-9cc7-a8a15920b138
  • Random Float from 0 to <100: SELECT RAND() * 100; → 82.15320322863124
  • Random Integer from 10 to 19: SELECT 10 + FLOOR(RAND() * 10); → 10
  • Random Float for currencies: SELECT ROUND(RAND() * 1000, 2); → 628.07
  • Random Strings: SELECT UUID(), MD5(RAND()), CRC32(RAND()); → 232fccee-017f-11eb-9cc7-a8a15920b138, 0e468db120211529f5fc2940994024a8, 263783538
  • Random Timestamp between 1970-01-01 00:00:00 UTC and now: SELECT FROM_UNIXTIME(ROUND(RAND() * UNIX_TIMESTAMP(), 6)); → 1992-06-30 11:04:04.784335
  • Random Timestamp ≥ 2020-01-01 and < 2020-12-31: SELECT FROM_UNIXTIME(UNIX_TIMESTAMP('2020-01-01') + (365 * 24 * 3600 * RAND())); → 2020-08-06 04:48:53.342219
  • Some kind of email address: SELECT CONCAT(CRC32(RAND()), '@', MD5(RAND()), '.com'); → 1619088853@6b20a5dad4522feee5efbfd3ebb17d71.com
  • Time range of 21 days from now: SELECT FROM_UNIXTIME(@begin := UNIX_TIMESTAMP()), FROM_UNIXTIME(@begin + (86400 * 21)); → 2020-10-02 11:14:48, 2020-10-23 11:14:48
  • Street name: SELECT CONCAT(CRC32(RAND()), 'street ', (1 + FLOOR(RAND() * 100))); → 3416042219street 14
  • Value selection: SELECT CASE ROUND(RAND() * 3, 0) WHEN 1 THEN 'Value 1' WHEN 2 THEN 'Value 2' WHEN 3 THEN 'Value 3' ELSE 'All other values' END AS selection; → Value 3
  • 50% of the data NULL: SELECT IF(ROUND(RAND()) = 0, NULL, 1); → NULL or 1

and there are surely many more possibilities...
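
As a sketch of how these building blocks can be combined, the following SELECT produces more realistic-looking content for some of the CDR columns. The status values and the phone number format are pure assumptions for illustration:

SELECT UUID()                                               AS uniqueid
     , MD5(RAND())                                          AS callid
     , CONCAT('+41', 100000000 + FLOOR(RAND() * 900000000)) AS snumber   -- made-up number format
     , CASE FLOOR(RAND() * 3)                                            -- made-up status values
         WHEN 0 THEN 'ANSWERED'
         WHEN 1 THEN 'BUSY'
         ELSE        'FAILED'
       END                                                  AS status
     , FROM_UNIXTIME(UNIX_TIMESTAMP('2020-01-01') + (365 * 24 * 3600 * RAND())) AS start
;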

Now back to the query: with no rows in the table, the query execution plan from EXPLAIN looks as follows:

SQL> EXPLAIN SELECT COUNT(*) FROM cdrs WHERE start >= end;
+------+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+------+-------------+
|    1 | SIMPLE      | cdrs  | ALL  | NULL          | NULL | NULL    | NULL | 1    | Using where |
+------+-------------+-------+------+---------------+------+---------+------+------+-------------+

and the query run time is this:

SQL> SELECT COUNT(*) FROM cdrs WHERE start >= end;
+----------+
| COUNT(*) |
+----------+
|        0 |
+----------+
1 row in set (0.004 sec)

After pumping up the table, the query execution plan from EXPLAIN looks like this:

SQL> EXPLAIN SELECT COUNT(*) FROM cdrs WHERE start >= end;
+------+-------------+-------+------+---------------+------+---------+------+---------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+---------+-------------+
|    1 | SIMPLE      | cdrs  | ALL  | NULL          | NULL | NULL    | NULL | 1016314 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+---------+-------------+

and the query run time is this:

SQL> SELECT COUNT(*) FROM cdrs WHERE start >= end;
+----------+
| COUNT(*) |
+----------+
|  1048576 |
+----------+
1 row in set (0.230 sec)

So we have a good starting point for testing tuning measures on this SQL query...

But caution: there is still a logical error in the data above... Did you find it?

If and how this query can be sped up is another story...

Skewed data distribution

What we have got so far is a completely random data distribution, or at least random within a specific range. This is very often far from reality. So we have to influence or manipulate this random data distribution a bit, into the direction that it reflects our reality.
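
One simple trick to skew an otherwise uniform distribution is to raise RAND() to a power, which concentrates the values near the lower end of the range. A sketch, assuming about 1000 source customers of which a few should generate most of the calls:

-- POW(RAND(), 5) pushes most values towards 0, so low customer
-- numbers get many more calls than high ones.
UPDATE cdrs SET scustomer = FLOOR(POW(RAND(), 5) * 1000);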

There are two extreme cases: we are searching for exactly one unique value, or all values are equal. In reality we are somewhere in between.

If you were using a UUID or a hash function to create the data, the values should be pretty unique, so this extreme case is covered:

UPDATE cdrs SET start = '2020-04-29 12:00:00.000000', end = '2020-04-29 12:13:13.999999'
 WHERE uniqueid = 'd00c7166-01a4-11eb-9cc7-a8a15920b138';

Or, as above, you can specifically UPDATE one row to your needs. The other extreme case (all values are equal) can also be solved by a simple UPDATE statement:

UPDATE cdrs SET machine = 42;

If you want to set only every nth row to a specific value, the modulo operation might help:

SET @nth = 7;

-- This would work if the table had a numeric id column (cdrs does not):
UPDATE cdrs SET machine = 42 WHERE (id % @nth) = 0;

-- With the UUID primary key, one of its hex digits can be used instead:
UPDATE cdrs SET machine = IF(HEX(SUBSTR(uniqueid, 8, 1)) % @nth = 0, 42, machine);
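
A quick sanity check of the resulting distribution might look like this:

SELECT machine, COUNT(*) FROM cdrs GROUP BY machine ORDER BY machine;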

And finally, if you need monotonically increasing numbers, this is one possibility to do it:

SET @row_number = 0; 
UPDATE cdrs SET machine = (@row_number := @row_number + 1);
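
On recent versions (MariaDB 10.2 or MySQL 8.0 and newer) the same idea can be sketched with a window function instead of a user variable; the derived table also works around the restriction on selecting from the table being updated:

UPDATE cdrs
  JOIN (SELECT uniqueid, ROW_NUMBER() OVER (ORDER BY uniqueid) AS rn
          FROM cdrs) AS numbered USING (uniqueid)
   SET cdrs.machine = numbered.rn;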
