Calculating 95th, 99th, 50th (median) with a single MySQL query

16 Nov

As a first post, let’s start with a basic tip for SQL queries I use very often.  First let’s start with some basic info on percentiles:

Percentiles are one of the best indicators of how well our web site is performing, unlike averages.  Averages tend to hide information on outlier values, and while they might be showing you adequate numbers, they might be hiding how a significant portion of  users are seeing your site.

For websites, a good percentile to base measurements is the 95th percentile, while for network equipment the 99th.  This is because you might have some pages that are inherently longer to process, such an availability search for a specific product, while network equipment doesn’t have this constrain.  You should really use the number that best represents your workload, 95th seems to be set as the standard but you can also use the 85th according to the type of site you are monitoring.  The median (not average) corresponds to the 50th percentile. 

So what are percentiles? Imagine there are 100 students in a class and they all take a test, the 95th percentile is the 95th worst grade for the students.  But wait a minute, your users are not the students, you are!  If we are talking about page load times, the 95th percentile will almost be your worst response time, that 5% of your users are having this response time or worse.  Which might not be a small number is you have thousands of requests.

Try setting in your mind the 95th percentile as your new average!

Well, what is the cons on percentiles? Well, they are not simple to calculate.  After looking around at different queries, finally found a way to calculate them with good enough performance.

Enough text, let’s get to the  query:

SELECT
 CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(
  GROUP_CONCAT(field_name ORDER BY field_name SEPARATOR ','),
   ',', 95/100 * COUNT(*) + 1), ',', -1) AS DECIMAL) AS `95th Per`
 FROM table_name;

 

Where field_name is your response time field, table_name your table.  The 95, is the percentage you are after, such as: 99, 95, or 85.

The query look complicated, but really MySQL manages it with great speed.  It uses the GROUP_CONCAT function to order the response times, the takes the 95 position, by using the SUBSTRING_INDEX function, and finally CASTing the result to a DECIMAL

To calculate the median is the same, the next query calculates both the median and the 95th percentile:

SELECT
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(
GROUP_CONCAT(field_name ORDER BY field_name SEPARATOR ','),
',', 50/100 * COUNT(*) + 1), ',', -1) AS DECIMAL) AS `Median`,
CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(
GROUP_CONCAT(field_name ORDER BY field_name SEPARATOR ','),
',', 95/100 * COUNT(*) + 1), ',', -1) AS DECIMAL) AS `95th Per`
FROM table_name;

 

Just give it a try, I regularly.  Just don’t forget to check your group_concat_max_len maximum value, which normally is low on the default configuration, you can update it on runtime by executing this before the percentile query:

SET group_concat_max_len = 10485760; #10MB max length

 

Well let me know what you think, all the credit really goes to Roland Bouman, on his blog you can find a deeper discussion on the query and percentiles.

Switch to our mobile site