Runaway history list

On one of the clusters at Spil we noticed a sudden increase in the length of the history list and a steep increase in the size of the ibdata file in the MySQL directory.
I did post a bit about this topic earlier regarding MySQL 5.5 but this cluster is still running 5.1 and unfortunately 5.1 does not have the same configurable options to influence the purging of the undo log…

History list

Now I did find a couple of great resources that explain the purge lag problem in detail: Pythian, DimitriK and Marco Tusa.

What it boils down to is that the purge delay is largely determined by the length of the history list (purge_lag) and the configured maximum purge lag (innodb_max_purge_lag):
((purge_lag / innodb_max_purge_lag) × 10) - 5 milliseconds.
On 5.5 it is also influenced by the number of purge threads and the purge batch size. I toyed around with these settings in my earlier post and tuning them helped. However, the only setting I could change on 5.1 is the maximum purge lag (innodb_max_purge_lag), which was already set to 0. In other words: I could not fiddle around with this. This time it wasn’t an upgrade to 5.5 either, so I could not blame that again. 😉
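To get a feel for what the formula does, here is a minimal sketch in Python. It assumes, as I read the MySQL documentation, that the delay only kicks in once the history list exceeds innodb_max_purge_lag and that a value of 0 means no delay at all. The function name and the threshold of 1000 are made up for illustration.

```python
def purge_delay_ms(history_list_length, innodb_max_purge_lag):
    """Delay in milliseconds added to each INSERT/UPDATE/DELETE, per the formula above."""
    # Assumption: a setting of 0 disables the delay entirely.
    if innodb_max_purge_lag <= 0:
        return 0.0
    # Assumption: no delay until the history list exceeds the threshold.
    if history_list_length <= innodb_max_purge_lag:
        return 0.0
    return (history_list_length / innodb_max_purge_lag) * 10 - 5


if __name__ == "__main__":
    # Hypothetical threshold of 1000, with history list lengths from this blog.
    for length in (200, 1800, 4000):
        print(length, purge_delay_ms(length, 1000))
```

With the setting at 0, as on our 5.1 cluster, the function always returns 0: writes are never slowed down, so nothing holds the history list back.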

So what was different on this server then? Well, the only difference was that it did have “a bit” of disk utilization: around 80% during peak hours. Since it is not used as a front-end server, this does not affect the end users, only the (background) job processes that process and store data on this server. However, it could be that due to the IO utilization purging started to lag behind and the history list grew too large for the server to catch up with under its current configuration.

How did we resolve it then? After I read this quote from Peter Zaitsev in Marco Tusa’s post, the solution became clear:

Running Very Long Transaction If you’re running very long transaction, be it even SELECT, Innodb will be unable to purge records for changes which are done after this transaction has started, in default REPEATABLE-READ isolation mode. This means very long transactions are very bad causing a lot of garbage to be accommodated in the database. It is not limited to undo slots. When we’re speaking about Long Transactions the time is a bad measure. Having transaction in read only database open for weeks does no harm, however if database has very high update rate, say 10K+ rows are modified every second even 5 minute transaction may be considered long as it will be enough to accumulate about 3 million of row changes.
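To put numbers on the last example in that quote, here is a back-of-the-envelope sketch. The function name is made up, and it simply assumes a constant update rate for as long as the transaction stays open.

```python
def row_changes_during(transaction_seconds, rows_modified_per_second):
    """Rough number of row changes that pile up while one transaction stays open."""
    return transaction_seconds * rows_modified_per_second


# The example from the quote: a 5 minute transaction at 10K modified rows per second.
print(row_changes_during(5 * 60, 10000))  # 3000000
```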

Transaction isolation defaults to REPEATABLE-READ and we favor it on many of our systems, especially because it performs better than READ-COMMITTED. However, a storage server that only runs background jobs does not need this isolation level, especially not if it was blocking the purge from being performed!

So in the end changing the transaction isolation to READ-COMMITTED did the job for us.
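One way to apply this for a single job process, rather than server-wide, is to switch the isolation level per session. The sketch below assumes the job talks to MySQL through some Python DB-API 2.0 driver. The helper name is made up and the connection object is whatever your driver hands you.

```python
def use_read_committed(connection):
    """Switch the current session to READ COMMITTED.

    `connection` is assumed to be any DB-API 2.0 connection to a MySQL server.
    """
    cursor = connection.cursor()
    try:
        # Only affects this session; other connections keep the server default.
        cursor.execute("SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED")
    finally:
        cursor.close()
```

To change the default for every connection instead, the transaction-isolation option in my.cnf can be set to READ-COMMITTED.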

Some other things: tomorrow my team is attending the MySQL User Group NL and in three weeks’ time I’ll be speaking at Percona London:
Percona Live London, December 3-4, 2012
So see you there!

MySQL 5.5 upgrade blues (part one)

At the company I work for we are still running Percona Server 5.1 in production and are slowly heading towards a Percona Server 5.5 rollout. It took a lot of preparation in the past few months (writing a my.cnf conversion script, for example) and a lot of testing. A couple of machines have already been upgraded to 5.5 this week to compare performance and stability. So far the machines have proved stable enough to keep them on 5.5 and even better: we already see a couple of benefits! However, the title wouldn’t have been blues if everything had been a breeze, right? 😉

The first problem we ran into was that our Cacti templates broke due to the changed InnoDB status output. So I headed over to the Cacti templates and checked the issue tracker to see whether the issue was already known. Apparently it was, but unfortunately it had not been fixed yet. Luckily enough, writing the fixes myself wasn’t much of a problem.

Secondly, we ran into the issue that the history list grew from a “steady” 200 to 4000 after the upgrade. Searching on this topic revealed a problem with the purge operations, but it was not clear to me what exactly the problem was. According to the MySQL documentation the default should suffice. Uhm, right?
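If you want to keep an eye on the history list without a graphing tool, the number can be pulled straight out of the SHOW ENGINE INNODB STATUS text. This is a minimal sketch, not what our Cacti templates do. It assumes the status text contains a line starting with "History list length", and the sample text is a made-up excerpt for illustration, not real output from our servers.

```python
import re

# Matches the "History list length N" line in the InnoDB status text.
HISTORY_RE = re.compile(r"^History list length\s+(\d+)", re.MULTILINE)


def history_list_length(innodb_status_text):
    """Return the history list length as an int, or None if the line is missing."""
    match = HISTORY_RE.search(innodb_status_text)
    return int(match.group(1)) if match else None


# Made-up excerpt for illustration only.
sample = "TRANSACTIONS\nHistory list length 4000\n"
print(history_list_length(sample))  # 4000
```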

Now I knew some things had changed in 5.5 regarding purging: a separate purge thread was already introduced in 5.1, but could it really have been that different? So I tried to find out what each and every purge variable does:

innodb_max_purge_lag 0
innodb_purge_batch_size 20
innodb_purge_threads 1
relay_log_purge ON

At first I assumed that increasing the batch size would make the purging more efficient: the larger the batch, the more it could handle, right?
Wrong: the larger you set it, the later it will purge! I found this in the MySQL documentation:

The granularity of changes, expressed in units of redo log records, that trigger a purge operation, flushing the changed buffer pool blocks to disk.

So the name is actually confusing! In our case I changed it from 20 to 40, which made things worse, and then from 40 to 10, which brought the history list down from 4000 to 1800.

Then I decided to see what the purge lag would do. Changing the purge lag as described by Peter did indeed lower the history list for a short while, but MMM also kicked the 5.5 server out of its pool because it started lagging behind in replication! So this is definitely something to keep in the back of your mind!

I did not change the purge threads to 0 since it is a machine that runs in our production environment. Also confusing is the deprecated innodb_use_purge_thread that could be set to different values than 0 and 1 but is marked as experimental.

This graph shows best how it worked out:
InnoDB transactions over a week

In the end I lowered the purge batch size to 1 and the history list dropped from 4000 back to its “normal” 200.

I’m positive a part two will come shortly, so stay tuned. 😉