Data Analytics: MyISAM vs ARCHIVE vs Infobright ICE vs Infobright IEE benchmark (part2)

14 May

This is Part2 of the MySQL based engines I have been testing lately.  Part1 covered time loading data, size on disk for the tables, and system specs.  On Part1 we could see how the ARCHIVE engine and both free and commercial versions of Infobright gave significant gains in disk space.  In Load times, the commercial version of Infobright (IEE) was the fastest; while MyISAM (disabling keys), ARCHIVE, and Infobright ICE were similar.  InnoDB performed poorly importing data from CSV, taking days to import.  InnoDB works better with INSERT statements than CSV imports.  The next Figure show query performance on this test data for two different queries:

query performance for different MySQL engines

*Infobrigtht is based in MySQL **Less is better

 

First, about the queries:

Query1 counts the number of rows grouped by day

Query2 performs aggregation (DISTINCT, AVG, SUM) of 3 fields and groups them by day

 

Infobright IEE is by far the fastest on both queries.  It does take advantage of multi-cores for single queries, from a top command it could be seen that it used the 16 cores.  InnoDB (uncompressed) and MyISAM did well for Query1, as the fields were indexed.  Besides the compressed InnoDB table, the rest of the engines performed similarly for Query2, with about 11 minutes of query execution.  Two queries are not enough to compare the engines, but since my IEE licence expired, can’t run any more experiments.  I did ran several queries on all to get a feeling of working with the engines, here are some insights and recommendations on the engines:

 

  • The compressed version of InnoDB is not worth using for OLAP.  Will have to test how it behaves with OLTP traffic compared to MyISAM, but I will discard it right away, you can use the COMPRESS() function on your queries to save the same space.

  • MyISAM has fast loading, medium size on disk and decent query performance if fields are indexed.  However, with the 2 years dataset, had many crashes having to repair the tables multiple times.  Had problems with large number of partitions (1 per day and 1 per week).  Also queries that would never finished or ended in: ERROR 9 (HY000): Unexpected eof found when reading file ‘/data/mysql/tmp/MYboApUd’ (Errcode: 175)

  • Would only use InnoDB if tables come from OLTP systems and don’t have to import data (snapshots maybe), or if data-size is not a problem.  The default mysqldump format is the fastest for loading from text files.

  • ARCHIVE is decent all around, disk space usage is minimal, loading time fast, query performance is not great, but is similar to other engines when using the whole table/partition and many fields.

  • All engines (except for IEE) have similar query times when using all the table/partition and most columns.  So if there is basic planning on the queries and the table is partitioned ARCHIVE can be a good candidate.

  • The only advantage of ICE is that it occupies less space, half than ARCHIVE.  As disadvantages have that it requires a separate installation.  No INSERTS, UPDATES, or DELETES.  The separate installation is also not capable of having MyISAM or other engines installed.

  • I would discard ICE as an option unless there is a chance that IEE would be bought in the future, to get familiar with the product.  However IEE doesn’t import ICE data.

  • IEE is by far better, but of course is not free.

  • Most queries on IEE are  almost instantaneous.  IEE would be an option if you need fast query performance on large amount of data, not real-time, but fast for interactive querying.  And MySQL compatible.

  • IEE performance improvements seem to be related to it’s ability to use many cores for the same query, and the way it splits the data to each core.

  • Will be interesting to see how an engine such as ARCHIVE or MyISAM can be improved to use multi-cores for a single query.  It doesn’t seem like Infobright will bring their commercial technologies to open source.

 

After these couple of months having to process 2 years of data, still don’t have a clear choice for big-data data analytics.  ARCHIVE is decent, but it can be tedious if you’re querying it often.  If you can have some planning, you can improve this by partitioning or using MyISAM on not so large tables and having major fields indexed.  If you need to work with data for a short-time you might give Infobright’s IEE a try.  They have a one-month trial.  As a side note, I had to contact them about the results I was getting, but had no reply from them (although they’ve sent me a lot of automated emails about my license).