At my employer (MessageBird) we use VividCortex (nowadays known as SolarWinds DPM) for query insights, and it gives us a top 20 list of our most time-consuming queries on a specific host. As you make your slow queries on the primary faster, or offload them to read replicas, new ones rise to the top. That's how I found a puzzling query that at first glance seemed quite okay, yet sat at the top of the list.
The example query below is quite similar to the one I found, but I changed the subject a bit to make its usage clearer:
SELECT discount_percentage
  FROM customer_discounts
 WHERE customer_type=1
   AND customer_country=2
   AND valid_from <= NOW()
   AND valid_till > NOW()
 LIMIT 1
What this query attempts to do is get a valid discount percentage for a customer, segregated by customer_type and customer_country. In itself there seems to be little to no harm in this query.
The table definition and indexes also looked reasonable:
CREATE TABLE customer_discounts (
  id int(11) NOT NULL AUTO_INCREMENT,
  customer_type tinyint(3) NOT NULL,
  customer_country tinyint(3) NOT NULL,
  valid_from datetime NOT NULL,
  valid_till datetime NOT NULL,
  discount_percentage tinyint(3) NOT NULL,
  created_at datetime NOT NULL,
  PRIMARY KEY (id),
  KEY country_idx (customer_country),
  KEY discount_idx (customer_type, customer_country, valid_from, valid_till, discount_percentage)
)
We have a covering index for the query above, so we have nothing to worry about. Or at least, that's what you'd think until you look at it a little longer. In our case I saw multiple rows being read for a covering index lookup, so how can that be? The query contains two range conditions, and MySQL can only use an index for one range condition at a time. This means the index is only effective up to valid_from; beyond that point MySQL reads the matching entries from the index and applies the remaining WHERE conditions to them.
The index is quite selective for most of our discount codes, but what if we have years and years of expired discount codes lingering in our table? Let's suppose that on average each customer_type and customer_country combination has around 1000 discount codes in the table. This means that for each lookup MySQL reads on average 1000 entries from the index and applies the WHERE conditions to them. If you run hundreds of these queries per second (yes, this is a large store), suddenly this index isn't that efficient anymore.
This is what the execution plan would look like:
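As a sketch, assuming the ~1000 entries per combination described above (the exact numbers will differ per installation), the plan would look roughly like this:

```sql
EXPLAIN SELECT discount_percentage FROM customer_discounts
 WHERE customer_type=1 AND customer_country=2
   AND valid_from <= NOW() AND valid_till > NOW() LIMIT 1;

-- illustrative output:
-- type: range   key: discount_idx   rows: ~1000
-- Extra: Using where; Using index
```

The "Using index" tells us the query is answered entirely from the index, but the rows estimate shows how many index entries MySQL expects to read before the remaining WHERE conditions filter them down.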
So how do we fix this? We have to go back to what we actually want to achieve here: get the currently active (valid) discount for our customer segregation. If we look at our index, we are covering the start date (valid_from) of the discount first. And when we look at our query, we are asking for discounts that started before now. For most of our discount entries this condition is true, because most of them were valid somewhere in the past. I would argue that probably 99% of the discounts have a start date that satisfies this condition. That's what you get when you keep expired discounts in your table.
So what we need here is a paradigm shift. We need to stop looking at the past and look ahead into the future instead. We will most likely have far fewer discount entries whose end date lies in the future: older discount codes that are no longer valid will have an end date in the past and will not match our valid_till condition. So we need to change our index and swap the order of the start and end date, like this:
KEY discount_idx (customer_type, customer_country, valid_till, valid_from, discount_percentage)
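On an existing table this is a plain index swap; a minimal sketch (how to run this as online DDL depends on your MySQL version and table size):

```sql
ALTER TABLE customer_discounts
  DROP KEY discount_idx,
  ADD KEY discount_idx (customer_type, customer_country, valid_till, valid_from, discount_percentage);
```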
You may argue that we still have two range conditions and still can't use the index beyond valid_till. That's true, but at the same time reversing the two date columns means that for 99% of our queries the index lookup now matches only a single entry, while the query can still be answered entirely from the index.
And now our explain plan looks like this:
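Again as a sketch, the only thing that changes with the reversed index is the rows estimate:

```sql
EXPLAIN SELECT discount_percentage FROM customer_discounts
 WHERE customer_type=1 AND customer_country=2
   AND valid_from <= NOW() AND valid_till > NOW() LIMIT 1;

-- illustrative output:
-- type: range   key: discount_idx   rows: 1
-- Extra: Using where; Using index
```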
The plan didn't change that much, except that now indeed only a single row is read from the index. This means we no longer have to apply the remaining WHERE conditions to a large set of index entries.
In our own case this change didn't magically lower the load on the server or bump the query out of the top 20 list. We don't have a uniform distribution in this table: a few hundred combinations have thousands of rows, while thousands of combinations have maybe 3 or 4 rows. After the change the query is still at the top of that top 20 list, but its latency is distributed more evenly now, and the average has dropped by 20μs to 40μs (from around 280μs). It's a small drop, but at thousands of requests per second it certainly counts!
So how did this happen? Who was to blame for this? I can be very brief about that: everybody and nobody.
It all has to do with our human brains: we think of time as a linear concept where there is a start and an end to everything. Because we think about time linearly, we favour putting start before end. So to our human logic the most natural thing is to put the start column before the end column in a table definition; it makes it more readable for us. Just imagine reversing those two columns: to the database it doesn't matter in which order the columns are defined, but to us humans it probably wouldn't make much sense, it would simply look wrong, and many data architects would automatically switch the two back to the way they are "supposed" to be. We would do the same thing with our indexes if we saw those columns reversed.
So that's why I'm sharing this story here. The next time you create a table with a start and end date for expiring discounts, products, rates, or tickets, I hope you will remember this blog post and reverse the start and end date order in the index. If anyone disagrees with you, please refer them to this article.