Next is a brief quick benchmark/comparison of different MySQL based storage engines I have been working lately for Big Data analytics. The comparison includes: disk space used, load time, query performance, as well as some comments. It is not intended as a formal benchmark.
During the last few days I have running out of disk space in my 2TB partition I use for my research experiments. On that partition, I mainly have a MySQL database with partitioned tables by week and over 2 years of web log and performance data. At first, I was comparing InnoDB vs MyISAM query performance and disk usage. MyISAM is a quite faster than InnoDB loading data, specially when DISABLING KEYS first, but then, reenabling the keys was a problem MyISAM on large tables. MyISAM doesn’t seem to scale well to a large number of partitions, while InnoDB does. An advantage of MyISAM tables besides fast loading, is that the tables occupy less disk space than InnoDB: InnoDB occupies about 40% more space than MyISAM for this type of tables, consisting of various numeric IDs and a couple of URLs. However, had many crashes with MyISAM having to repair tables many times. For data analysis that is annoyance but not a serious problem. Wouldn’t use MyISAM in production/OLTP servers, maybe if back in the early 2000’s…
Anyhow, after optimizing the configuration for both engines and having to choose between:
InnoDB: reliable, but large size on disk and slow to load tables. It could take a week to load the 2 years of data.
- MyISAM: faster to load, medium size on disk (a bit less than in CSV), but unreliable for large tables
ARCHIVE: a compressed engine for MySQL, doesn’t support keys (except for the numeric primary key). Already familiar with it for backups and integrated into MySQL. Supports basic partitioning.
InfoBright ICE: a compressed column table storage, fork from MySQL, open source with fast loading. As cons, requires a different installation, and advanced features only in the commercial version.
InfoBright IEE: commercial version of the storage. Promesses multi-core improvements for query and loading over the open-source version. Decided to give it a try for comparison.
*InfoBright is based in MySQL **Less is better
This first figure shows the total space on disk for the different tested engines. InnoDB is larger, depending on how many indexes you have on the table. With the ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=4 InnoDB option, the table size was reduced to half. MyISAM a bit less than the original, ARCHIVE about one third, while both versions of InfoBright an 8:1 improvement.
The Original file, is a CSV file, with one month of data, ~180 million rows, 29 columns. About 25 columns are integers, 2 VARCHARS for visiting URL and referer, a couple of fixed short CHARS.
For Data Analytics, load time is also an important factor to consider besides size on disk. For loading data, the MySQL LOAD DATA INTO command was used, as is reported to be the fastest, and the recommended option for InfoBright. The next figures shows the load time:
*InfoBright is based in MySQL **Less is better
InnoDB is not displayed on this Figure as it took more than 48hrs to load the data; even disabling AUTOCOMMIT and UNIQUE_CHECKS. The compressed option took 13 days to load the data. MyISAM loading with KEYS DISABLED is definitely faster than the regular import (however had crashes trying to build keys for larger tables), takes about the same time than ICE, and a bit faster than ARCHIVE, but not a big winner. IEE takes a bit more than half the time, and it is the fastest. Looking at CPU utilization, IEE used many cores even when importing, the rest of the engines only 2.
The LOAD DATA command offers fast loading, especially for Infobright. However, it is not the most common storage format for backups. Usually mysqldump is used for backups, in SQL insert format. Using SQL inserts, Infobright takes as much as MyISAM to insert, so you need to prepare the data sets specifically.
All tests are performed with no other workload or user process running on the system.
The most important bottleneck for all DB systems is disk I/O. This is also the case here. I have been using a new SATA (using SATA-2) 3TB hard disk. There are faster disks, but for big data analytics, one needs to rely on commodity, large storage disks, and all tests are performed in equal conditions.
Disk benchmarks with hdparm: (souce Arch wiki)
hdparm -t --direct /dev/sda
sync;time bash -c "(dd if=/dev/zero of=bf bs=8k count=500000; sync)"
As for the system, I was first using a regular desktop with 3.4GB of RAM. Had many MyISAM crashes with it, and it was slow for this amount of data. I was able to get a server with 16-cores and 48GB of RAM, larger than what usually is used for HDFS. The specs are the following:
CPU (x16): Intel(R) Xeon(R) CPU L5630 @ 2.13GHz, 1596 MHz
Storage: Intel 82801JI (ICH10 Family) 4 port SATA IDE Controller
eth1 Intel 82575EB Gigabit Network Connection
eth2 Intel 82575EB Gigabit Network Connection
Disk: WDC WD30EZRX-00M
Intel X58 I/O Hub to ESI Port
Intel X58 I/O Hub PCI Express Root Port 1
Intel X58 I/O Hub PCI Express Root Port 3
Intel X58 I/O Hub PCI Express Root Port 7
Intel X58 I/O Hub PCI Express Root Port 9
Intel X58 I/O Hub PCI Express Root Port 10
Intel 82801JI (ICH10 Family) PCI Express Port 1
Intel 82801JI (ICH10 Family) PCI Express Port 5
Intel 82801JI (ICH10 Family) PCI Express Port 6
Intel 82801 PCI Bridge
Intel 82801JIR (ICH10R) LPC Interface Controller
Mysql Server 5.5.29
Infobright ICE 4.0.7-0-x86_64-ice
Infobright IEE 4.0.7-2-x86_64-eval