You are here
Query performance comparison between MariaDB ColumnStore and other Storage Engines
Storage Engines like InnoDB, Aria and MyISAM are Row Stores. They store rows one after the other in blocks or even directly in a single file (MyISAM). On the other hand a Column Store like MariaDB ColumnStore stores all the same attributes (columns) of the rows together in chunks.
This is how the table sales_fact
looks like:
CREATE TABLE `sales_fact` ( `product_id` int(11) NOT NULL, `time_id` int(11) NOT NULL, `customer_id` int(11) NOT NULL, `promotion_id` int(11) NOT NULL, `store_id` int(11) NOT NULL, `store_sales` decimal(10,2) NOT NULL, `store_cost` decimal(10,4) NOT NULL, `unit_sales` int(11) NOT NULL ) ENGINE=ColumnStore DEFAULT CHARSET=utf8; CREATE TABLE `sales_fact` ( `product_id` int(11) NOT NULL, `time_id` int(11) NOT NULL, `customer_id` int(11) NOT NULL, `promotion_id` int(11) NOT NULL, `store_id` int(11) NOT NULL, `store_sales` decimal(10,2) NOT NULL, `store_cost` decimal(10,4) NOT NULL, `unit_sales` int(11) NOT NULL, KEY `i_sales_customer_id` (`customer_id`), KEY `i_sales_product_id` (`product_id`), KEY `i_sales_promotion_id` (`promotion_id`), KEY `i_sales_store_id` (`store_id`), KEY `i_sales_time_id` (`time_id`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
As you may have noted the ColumnStore table does NOT contain indexes!
And this shows how it is stored in memory and on disk (left Row Store, right ColumnStore; in reality it is a bit more complicated):
Considering this and depending on your queries a ColumnStore can have some significant response time advantages compared to a Row Store. And the other way around as well!
Comparison with MyISAM
So let us compare some queries with the same set of data. We are still working with one single-node MariaDB ColumnStore as described here.
We have chosen MyISAM for comparison with MariaDB ColumnStore because it is the fastest storage engine we have so far for low concurrency SELECT
statements. As data set we have used the good old foodmart schema prepared for ColumnStore. We concentrated on the sales_fact
table because this is the table with the biggest amount of rows. The MyISAM key_buffer_size
was set to 128 Mibyte which is big enough for all the indices.
Simple SELECT
queries
SELECT COUNT(*) FROM table
As experienced MyISAM users know already for this query MyISAM is unbeatable fast:
SQL> SELECT COUNT(*) FROM foodmart_cs.sales_fact; 1 row in set (0.040 sec) SQL> SELECT COUNT(*) FROM foodmart_myisam.sales_fact; 1 row in set (0.000 sec)
This query is mainly to make sure that both tables have the same amount of data...
SELECT * FROM table
or CHECKSUM TABLE table
In these queries we do a full table scans. This is probably the worst pattern you can do to ColumnStore. And MyISAM is significantly faster in this:
SQL> SELECT * FROM foodmart_cs.sales_fact; 1078880 rows in set (1.833 sec) SQL> SELECT * FROM foodmart_myisam.sales_fact; 1078880 rows in set (0.607 sec)
But this pattern is NOT what a ColumnStore is made for...
The ColumnStore architecture consists of 2 different types of modules: The User Module (UM) consisting of the MariaDB Server (mysqld
), the Execution Manager (ExeMgr
) and the Distribution Managers (DMLProc
, DDLProc
and cpimport
). These processes are responsible for parsing SQL and distributing and executing the SQL statements.
The other type of module is the Performance Module (PM) consisting of the Managing and Monitoring Process (ProcMgr
and ProcMon
), the Primary Process (PrimProc
) which handles the query execution and the Performance Module process which handles loads and writes (WriteEngineServer
and cpimport
). The Performance Module basically performs the work and does the I/O operations.
So we have the communication from the MariaDB server to the User Module to the Performance Module and this in normal situation over a network. So it is obvious that this costs a lot of time for huge data sets.
This can be shown when we execute the CHECKSUM TABLE
command which does similar things like SELECT * FROM table
but does NOT return the full result set:
SQL> CHECKSUM TABLE foodmart_cs.sales_fact; +------------------------+------------+ | Table | Checksum | +------------------------+------------+ | foodmart_cs.sales_fact | 2218293488 | +------------------------+------------+ 1 row in set (1.370 sec) SQL> CHECKSUM TABLE foodmart_myisam.sales_fact; +----------------------------+------------+ | Table | Checksum | +----------------------------+------------+ | foodmart_myisam.sales_fact | 2218293488 | +----------------------------+------------+ 1 row in set (0.853 sec)
It is interesting that ColumnStore is less slower here compared to the full table scan. It looks like it can already parallelize some of the work in this step already?
SELECT min(column), max(column) FROM table
Now let us come to more data warehouse (DWH) like queries:
SQL> SELECT MIN(time_id), MAX(time_id) FROM foodmart_cs.sales_fact; 1 row in set (0.104 sec) SQL> SELECT MIN(time_id), MAX(time_id) FROM foodmart_myisam.sales_fact; 1 row in set (0.001 sec)
Also here MyISAM is horribly fast because it can short cut:
SQL> EXPLAIN SELECT MIN(time_id), MAX(time_id) FROM foodmart_myisam.sales_fact; +------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+ | id | SELECT_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+ | 1 | SIMPLE | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away | +------+-------------+-------+------+---------------+------+---------+------+------+------------------------------+
It would be interesting to compare those queries with InnoDB which cannot do these short cuts.
SELECT column, COUNT(*) FROM table GROUP BY column
This is a query where MyISAM cannot shortcut and we see already an advantage for MariaDB ColumnsStore.
SQL> SELECT time_id, COUNT(*) FROM foodmart_cs.sales_fact GROUP BY time_id; 673 rows in set (0.099 sec) SQL> SELECT time_id, COUNT(*) FROM foodmart_myisam.sales_fact GROUP BY time_id; 673 rows in set (0.138 sec)
It would be interesting to see those number with really huge amount of rows (109 to 1010 rows) which do not fit into caches any more...
SELECT column, COUNT(*) FROM table GROUP BY column ORDER BY column
The last result was pretty much chaotic. So let us test an ordered result:
SQL> SELECT time_id, COUNT(*) FROM foodmart_cs.sales_fact GROUP BY time_id ORDER BY time_id; 673 rows in set (0.100 sec) SQL> SELECT time_id, COUNT(*) FROM foodmart_myisam.sales_fact GROUP BY time_id ORDER BY time_id; 673 rows in set (0.136 sec)
The ORDER BY
is probably executed in the MariaDB Server. So for this small data set the network communication can be ignored. No difference observed.
SELECT COUNT(*) FROM table where column >= value
This SELECT
queries data from a very big range (99%). We know MyISAM uses the index by doing an index-only-scan. MariaDB ColumnStore does NOT have indexes:
SQL> SELECT COUNT(*) FROM foodmart_cs.sales_fact WHERE time_id >= 400; 1 row in set (0.054 sec) SQL> SELECT COUNT(*) FROM foodmart_myisam.sales_fact WHERE time_id >= 400; 1 row in set (0.159 sec) SQL> EXPLAIN SELECT COUNT(*) FROM foodmart_myisam.sales_fact WHERE time_id >= 400; +------+-------------+------------+-------+-----------------+-----------------+---------+------+---------+--------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+------------+-------+-----------------+-----------------+---------+------+---------+--------------------------+ | 1 | SIMPLE | sales_fact | range | i_sales_time_id | i_sales_time_id | 4 | NULL | 1055594 | Using where; Using index | +------+-------------+------------+-------+-----------------+-----------------+---------+------+---------+--------------------------+
ColumnStore outperforms MyISAM by factor of 3 already with a small data set.
The same query with a very small range of data (1%):
SQL> SELECT COUNT(*) FROM foodmart_cs.sales_fact WHERE time_id >= 1090; 1 row in set (0.042 sec) SQL> SELECT COUNT(*) FROM foodmart_myisam.sales_fact WHERE time_id >= 1090; 1 row in set (0.005 sec)
ColumnStore becomes only slightly faster with the smaller result set but here MyISAM key usage has a dramatic impact. Would be interesting to see the difference if the MyISAM data/key cannot be kept in memory any more.
SELECT SUM(column3) FROM table WHERE column1 = value AND column2 BETWEEN value AND value
SQL> SELECT SUM(store_sales), SUM(unit_sales) FROM foodmart_cs.sales_fact WHERE customer_id = 42 AND time_id BETWEEN 300 AND 1000; 1 row in set (0.072 sec) SQL> SELECT SUM(store_sales), SUM(unit_sales) FROM foodmart_myisam.sales_fact WHERE customer_id = 42 AND time_id BETWEEN 300 AND 1000; 1 row in set (0.002 sec)
As soon a MyISAM has an index with high cardinality and a strong filter it outperforms ColumnStore.
Let us go a step back to the query with the big range (99%). But forcing MyISAM to do a table access instead of an index-only-scan:
SQL> SELECT SUM(store_sales) FROM foodmart_cs.sales_fact WHERE time_id >= 400; 1 row in set (0.117 sec) SQL> SELECT SUM(store_sales) FROM foodmart_myisam.sales_fact WHERE time_id >= 400; 1 row in set (0.133 sec)
Ohh! Here the MariaDB optimizer was clever enough to see that a full table scan is cheaper than accessing the index:
SQL> EXPLAIN SELECT SUM(store_sales) FROM foodmart_myisam.sales_fact WHERE time_id >= 400; +------+-------------+------------+------+-----------------+------+---------+------+---------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+------------+------+-----------------+------+---------+------+---------+-------------+ | 1 | SIMPLE | sales_fact | ALL | i_sales_time_id | NULL | NULL | NULL | 1078880 | Using where | +------+-------------+------------+------+-----------------+------+---------+------+---------+-------------+
So MyISAM is only slightly slower than ColumnStore. But if we force MyISAM to use the index it becomes dramatically slow (about 9 times):
SQL> SELECT SUM(store_sales) FROM foodmart_myisam.sales_fact FORCE INDEX (i_sales_time_id) WHERE time_id >= 400; 1 row in set (1.040 sec)
So here again: MariaDB ColumnStore starts making fun if a huge amount of data is used...
SELECT SUM(column3) FROM table WHERE column1 = value OR column2 = value
We know that WHERE
clauses with OR are always bad for the optimizer. So let us try this:
SQL> SELECT SUM(store_sales), SUM(unit_sales) FROM foodmart_cs.sales_fact WHERE promotion_id = 0 OR store_id = 13; 1 row in set (0.209 sec) MariaDB [(none)]> SELECT calGetTrace()\G *************************** 1. row *************************** calGetTrace(): Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows BPS PM sales_fact 4995 (promotion_id,store_id,store_sales,unit_sales) 0 3163 0 0.172 132 TAS UM - - - - - - 0.153 1 TNS UM - - - - - - 0.000 1
ColumnStore has to touch 4 out of 8 column (50%) to get the result. How the Elapsed time is calculated I have to investigate some more...
SQL> SELECT SUM(store_sales), SUM(unit_sales) FROM foodmart_myisam.sales_fact WHERE promotion_id = 0 OR store_id = 13; 1 row in set (0.873 sec) SQL> EXPLAIN SELECT SUM(store_sales), SUM(unit_sales) FROM foodmart_myisam.sales_fact WHERE promotion_id = 0 OR store_id = 13; +------+-------------+------------+-------------+---------------------------------------+---------------------------------------+---------+------+--------+-----------------------------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+-------------+------------+-------------+---------------------------------------+---------------------------------------+---------+------+--------+-----------------------------------------------------------------+ | 1 | SIMPLE | sales_fact | index_merge | i_sales_promotion_id,i_sales_store_id | i_sales_promotion_id,i_sales_store_id | 4,4 | NULL | 706357 | Using union(i_sales_promotion_id,i_sales_store_id); Using where | +------+-------------+------------+-------------+---------------------------------------+---------------------------------------+---------+------+--------+-----------------------------------------------------------------+
MyISAM tries to short cut by an INDEX MERGE
operation which we know is not super fast but still better than a full table scan. But still MariaDB ColumnStore is about 4 times faster than MyISAM.
More complex SELECT
queries from data warehouse benchmarks.
I borrowed these queries from Øystein's blog:
SELECT SUM(sf.store_sales) AS tot_yearly FROM foodmart_cs.sales_fact AS sf JOIN foodmart_cs.store AS s ON s.store_id = sf.store_id WHERE s.store_name = 'Store 13' AND sf.unit_sales < (SELECT 0.2 * AVG(isf.unit_sales) FROM foodmart_cs.sales_fact AS isf WHERE isf.store_id = s.store_id ) ; ERROR 1815 (HY000): Internal error: IDB-3012: Scalar filter and semi join are not from the same pair of tables. SELECT SUM(sf.store_sales) AS tot_yearly FROM foodmart_myisam.sales_fact AS sf JOIN foodmart_myisam.store AS s ON s.store_id = sf.store_id WHERE s.store_name = 'Store 13' AND sf.unit_sales < (SELECT 0.2 * AVG(isf.unit_sales) FROM foodmart_myisam.sales_fact AS isf WHERE isf.store_id = s.store_id ) ; 1 row in set (0.184 sec)
So some more complex queries seems not to work with ColumnStore yet. I have to figure out yet what to do in this case...
So let us try some more complex queries from Sergei:
SELECT SUM(store_sales) FROM foodmart_cs.sales_fact AS sf JOIN foodmart_cs.customer AS c ON c.customer_id = sf.customer_id WHERE c.total_children BETWEEN 1 AND 2 AND sf.unit_sales BETWEEN 2 AND 4 ; 1 row in set, 1 warning (0.210 sec)
We get a warning when using JOIN
s:
SQL> show warnings; +-------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Level | Code | Message | +-------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Note | 9999 | Query Stats: MaxMemPct-0; NumTempFiles-0; TempFileSpace-0B; ApproxPhyI/O-0; CacheI/O-2651; BlocksTouched-2651; PartitionBlocksEliminated-0; MsgBytesIn-41KB; MsgBytesOut-74KB; Mode-Distributed | +-------+------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Seems to be some statistics about how the query/join was executed...
SELECT SUM(store_sales) FROM foodmart_myisam.sales_fact AS sf JOIN foodmart_myisam.customer AS c ON c.customer_id = sf.customer_id WHERE c.total_children BETWEEN 1 AND 2 AND sf.unit_sales BETWEEN 2 AND 4 ; 1 row in set (0.412 sec)
With JOIN
s I was a bit more sceptic because of my former experience with NDB but it looks like also with JOIN
s ColumnStore outperforms MyISAM with already a small amount of data.
The table customer
is a typical dimension table so this would probably be a candidate for hybrid approach (HTAP)?
Converting the customer
table to MyISAM did NOT make it significantly slower:
1 row in set, 1 warning (0.215 sec) +------+---------------+-------+------+---------------+------+---------+------+------+-------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +------+---------------+-------+------+---------------+------+---------+------+------+-------+ | 1 | PUSHED SELECT | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | +------+---------------+-------+------+---------------+------+---------+------+------+-------+ Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows CES UM - - - - - - 0.013 4093 BPS PM sf 4995 (customer_id,store_sales,unit_sales) 0 2636 0 0.182 132 HJS PM sf-c 4995 - - - - ----- - TAS UM - - - - - - 0.153 1 TNS UM - - - - - - 0.000 1
Converting the foodmart data warehouseschema to MariaDB ColumnsStore
Converting the existing foodmart schema to MariaDB ColumnStore was a bit cumbersome...
We used the normal mysqldump
and replace the storage engine:
$ zcat foodmart_dump.sql.gz | sed 's/ENGINE=InnoDB/ENGINE=ColumnStore/' | mysql --user=root foodmart_cs ERROR 1178 (42000) at line 25: The storage engine for the table doesn't support The syntax or the data type(s) is not supported by Columnstore. Please check the Columnstore syntax guide for supported syntax or data types.
The command ALTER TABLE ... ENGINE = ColumnStore
seems to be a bit more flexible but it is horribly slow...
The foodmart
dump optimized for MariaDB ColumnStore can be found here.
Problems between MariaDB server and ColumnStore dictionary
SQL> RENAME TABLE customer TO customer_a; Query OK, 0 rows affected (0.137 sec) SQL> CREATE TABLE customer LIKE customer_a; Query OK, 0 rows affected (0.018 sec) SQL> ALTER TABLE customer ENGINE = ColumnStore; ERROR 1815 (HY000): Internal error: CAL0009: Internal create table error for foodmart_cs.customer : table already exists (your schema is probably out-of-sync) SQL> DROP TABLE customer; Query OK, 0 rows affected (0.014 sec) SQL> ALTER TABLE customer ENGINE = ColumnStore; ERROR 1146 (42S02): Table 'foodmart_cs.customer' doesn't exist SQL> CREATE TABLE customer LIKE customer_a; Query OK, 0 rows affected (0.017 sec) SQL> ALTER TABLE customer ENGINE = ColumnStore; ERROR 1815 (HY000): Internal error: CAL0009: Internal create table error for foodmart_cs.customer : table already exists (your schema is probably out-of-sync)
To fix this the following DDL command will help:
SQL> CREATE TABLE customer (id INT) ENGINE = ColumnStore COMMENT='SCHEMA SYNC ONLY'; Query OK, 0 rows affected (0.006 sec)
Conclusion
MariaDB ColumnStore can outperform MyISAM already with a single-node set-up and a small dataset if you are choosing the right queries. It would be interesting to see the performance gains with a multi-node set-up and a huge data set. I was told that MariaDB ColumnStore makes sense from 100 Gibyte upwards... Our data set was about 100 Mibyte!
- Shinguz's blog
- Log in or register to post comments