It has been a while since I posted on this blog. The more in depth articles are all posted on the Spil Games Engineering blog and I’m overcrowded with work in the past months that I hardly have any time left.
One of the reasons for having too much on my plate was my own doing: I volunteered to be a conference committee member for Percona Live London and we, as a committee, worked our way through all proposals while you all were enjoying your holidays in the sun. Well, I must admit I did review a couple of them sitting in the sun, next to the pool enjoying a cold drink. ;)
I must say there were so many great proposals it was really hard to choose which ones would be left out.
I also proposed a talk for Percona Live London this year and my fellow committee members liked it enough to make it to the final schedule: Serve out any page with an HA Sphinx environment.
In basis it is a MySQL case study where I will show how we at Spil Games use Sphinx Search in three different use cases: our main search, friend search and serving out our content on any page. I’ll also describe how we handle high availability, how we handle index creation and show benchmark results between MySQL and Sphinx in various use cases.
In case you are interested in Sphinx or the benchmark results: the session will be on the 4th of November at 3:10pm – 4:00pm in Cromwell 1 & 2.
Also don’t hesitate to ask me things when I’m wandering around in the corridors and rooms. Or maybe we’ll meet at the MySQL Community Dinner?
See you next week!
Some days ago I got a call from our support engineer on duty that MySQL on one of our database servers was lagging more than 1000 seconds behind in replication and the server got kicked out of the pool because of the delay. He was unable to find out why and there was absolutely nothing in the mysql log files. When I got the call it was still lagging behind but the lag was slowly decreasing again.
After a quick peek in all our monitoring systems I isolated it to this message:
Cache Battery 0 in controller 0 is Charging (Ready) [probably harmless]
Apparently not that harmless! :P
Obviously we did encountered this situation a couple of times before but apparently there was no detection on this machine.
The relearn cycle happens every 90 days and gets first scheduled when the machine gets powered on. Now imagine this happening in a master-master setup where both machines were powered on at the same time. Lucky enough you can use omconfig to reschedule the cycle up to 7 days, but then you obviously need to have detection in place.
Why did nobody come up with the idea to have a dual battery backed up cache with alternating relearn cycles? That way you can have your battery relearn without the controller going back into write-through mode. ;)
At the company I work for we are still running Percona Server 5.1 in production and are slowly heading towards a Percona Server 5.5 rollout. It did take a lot of preparation in the past few months (write a my.cnf conversion script for example) and a lot of testing. A couple of machines already have been upgraded this week to 5.5 to compare performance and stability. So far the machines proved to be stable enough to keep them on 5.5 and even better: we already see a couple of benefits! However, the title wouldn’t have been blues if everything would have been a breeze, right? ;)
First problem we ran into was that our Cacti templates broke due to the changed InnoDB status output. So I headed towards the Cacti templates and looked in the issue tracker if the issue was already known. Apparently it was already known, but unfortunately not fixed yet. Lucky enough writing the fixes myself wasn’t much of a problem.
Secondly we ran into the issue that the history list was growing from a “steady” 200 to 4000 after upgrade. Searching on this topic revealed a problem with the purge operations but it was not clear to me what exactly was the problem. According to the MySQL documentation the default should suffice. Uhm, right?
Now I knew some things have changed in 5.5 regarding purging: a separate purge thread was already introduced in 5.1 but could it have been so different then? So I tried to find out what each and every purge variable would do:
At first I assumed that increasing the batch size would make the purging more efficient: the larger the batch the more it could handle, right?
Wrong: the larger you set it, the later it will purge! I found this on the MySQL documentation about it:
The granularity of changes, expressed in units of redo log records, that trigger a purge operation, flushing the changed buffer pool blocks to disk.
So the name is actually confusing! In our case it went from 20 to 40 making things worse and then from 40 to 10 making the history list go from 4000 to 1800.
Then I decided to see what the purge lag would do. Changing the purge lag as described by Peter did indeed lower the history list for a short while, but MMM also kicked the 5.5 server out of its pool because it started lagging behind in replication! So this is definitely something to keep in the back of your mind!
I did not change the purge threads to 0 since it is a machine that runs in our production environment. Also confusing is the deprecated innodb_use_purge_thread that could be set to different values than 0 and 1 but is marked as experimental.
I’m positive a part two will come shortly, so stay tuned. ;)
Just started MySQL Quicksand and hopefully it will turn out to be something fruitful. :)
Well, I feel the urge to share some experience and willing to learn from the feedback.
What to expect here?
Well I do have a lot of warstories to share and tons of stuff that I try to find out when reading other peoples blogs… More stuff to come soon!