Archive | MySQL RSS feed for this section

Data Analytics: MyISAM vs ARCHIVE vs Infobright ICE vs Infobright IEE benchmark (part2)

14 May

This is Part2 of the MySQL based engines I have been testing lately.  Part1 covered time loading data, size on disk for the tables, and system specs.  On Part1 we could see how the ARCHIVE engine and both free and commercial versions of Infobright gave significant gains in disk space.  In Load times, the commercial version of Infobright (IEE) was the fastest; while MyISAM (disabling keys), ARCHIVE, and Infobright ICE were similar.  InnoDB performed poorly importing data from CSV, taking days to import.  InnoDB works better with INSERT statements than CSV imports.  The next Figure show query performance on this test data for two different queries:

query performance for different MySQL engines


Data Analytics: MyISAM vs ARCHIVE vs Infobright ICE vs Infobright IEE benchmark

3 Apr

Next is a brief quick benchmark/comparison of different MySQL based storage engines I have been working lately for Big Data analytics.  The comparison includes: disk space used, load time, query performance, as well as some comments.  It is not intended as a formal benchmark.

During the last few days I have running out of disk space in my 2TB partition I use for my research experiments.  On that partition, I mainly have a MySQL database with partitioned tables by week and over 2 years of web log and performance data.  At first, I was comparing InnoDB vs MyISAM query performance and disk usage.  MyISAM is a quite faster than InnoDB loading data, specially when DISABLING KEYS first, but then, reenabling the keys was a problem MyISAM on large tables.  MyISAM doesn’t seem to scale well to a large number of partitions, while InnoDB does.  An advantage of MyISAM tables besides fast loading, is that the tables occupy less disk space than InnoDB:  InnoDB occupies about 40% more space than MyISAM for this type of tables, consisting of various numeric IDs and a couple of URLs.  However, had many crashes with MyISAM having to repair tables many times.   For data analysis that is annoyance but not a serious problem.  Wouldn’t use MyISAM in production/OLTP servers, maybe if back in the early 2000’s…

 Anyhow, after optimizing the configuration for both engines and having to choose between:

  • InnoDB: reliable, but large size on disk and slow to load tables.  It could take a week to load the 2 years of data.

  • MyISAM: faster to load, medium size on disk (a bit less than in CSV), but unreliable for large tables

Decided to explore other non distributed file system options like Hadoop, with easy MySQL migration and found:

  • ARCHIVE: a compressed engine for MySQL, doesn’t support keys (except for the numeric primary key).  Already familiar with it for backups and integrated into MySQL.  Supports basic partitioning.

  • InfoBright ICE: a compressed column table storage, fork from MySQL, open source with fast loading.  As cons, requires a different installation, and advanced features only in the commercial version.

  • InfoBright IEE: commercial version of the storage.  Promesses multi-core improvements for query and loading over the open-source version. Decided to give it a try for comparison.



GROUP BY a SQL query by N minutes

2 Feb

Just a quick post about a tip to group in SQL [tested on MySQL] a date-time field by N number of minutes using FLOOR().

Functions such as DAY(), HOUR(), MINUTE() are very useful to group by dates, but what about 30, 15, or 10 minutes?


Calculating 95th, 99th, 50th (median) with a single MySQL query

16 Nov

As a first post, let’s start with a basic tip for SQL queries I use very often.  First let’s start with some basic info on percentiles:

Percentiles are one of the best indicators of how well our web site is performing, unlike averages.  Averages tend to hide information on outlier values, and while they might be showing you adequate numbers, they might be hiding how a significant portion of  users are seeing your site.

For websites, a good percentile to base measurements is the 95th percentile, while for network equipment the 99th.  This is because you might have some pages that are inherently longer to process, such an availability search for a specific product, while network equipment doesn’t have this constrain.  You should really use the number that best represents your workload, 95th seems to be set as the standard but you can also use the 85th according to the type of site you are monitoring.  The median (not average) corresponds to the 50th percentile.  (more…)