Some days ago I got a call from our support engineer on duty that MySQL on one of our database servers was lagging more than 1000 seconds behind in replication and the server got kicked out of the pool because of the delay. He was unable to find out why and there was absolutely nothing in the mysql log files. When I got the call it was still lagging behind but the lag was slowly decreasing again.
After a quick peek in all our monitoring systems I isolated it to this message:
Cache Battery 0 in controller 0 is Charging (Ready) [probably harmless]
Apparently not that harmless! 😛
Obviously we did encountered this situation a couple of times before but apparently there was no detection on this machine.
The relearn cycle happens every 90 days and gets first scheduled when the machine gets powered on. Now imagine this happening in a master-master setup where both machines were powered on at the same time. Lucky enough you can use omconfig to reschedule the cycle up to 7 days, but then you obviously need to have detection in place.
Why did nobody come up with the idea to have a dual battery backed up cache with alternating relearn cycles? That way you can have your battery relearn without the controller going back into write-through mode. 😉
It took me a while to figure out why our new instance wasn’t graphing: the response time query was performing and the script was picking up its values. Also the rrd files were created but for some bizarre reason all values were set to “NaN”, in rrd/cacti terms: Not a Number. If you search on that subject you will come across a lot of (forum) postings stating you need to change your graph type to “GAUGE” or change the MIN/MAX values for your data templates. Strange as this was already set to a minimum of 0 and a maximum of an unsigned 64 bit integer.
After running the ss_get_mysql_stats.php script manually for these graphs I got the error stating that a -1 value was not allowed. Indeed the output of the script contained a -1 value for the last measurement and I quickly found the culprit: an uninitialized array key inside the script will cause it to return a -1 value. Now why was this array key not initialized? Simply because the query filling the array was capped to 13 rows instead of the expected 14 rows.
This left me with three options:
1. Change cacti templates to allow -1 value
2. Change cacti templates to only contain 13 data points instead of 14
3. Change the query in the ss_get_mysql_stats.php script
Naturally I patched the script and after 10 minutes the graphs started to get colorful again! 😉
So if you have the same problem as we do, you can find the patch attached to my bugreport:
Query reponse time and Query time histogram not graphing
I know, I haven’t been posting much lately. 5.5 upgrades got postponed due to the new storage platform needing my immediate attention and being a speaker at the Percona Live conference in April also needs a lot of attention.
One of the things I want to try out is running multiple MySQL instances on the same machine. The concept remained in the back of my mind ever since I attended Ryan Thiessen’s presentation on the MySQL conference 2011 but we never actually got a proper usecase for it. Well, with the new storage platform it would be really beneficial so an excellent use case to try it out! So what have I been busy with in the past week? That’s right: running multiple instances MySQL on one single server. 😉
Even though it is not well documented and nobody describes the process in depth it is not that complicated to get multiple instances running next to each other. However it does involve a lot of changes in the surrounding tools, scripts and monitoring. For example, this is what I changed so far:
1. MySQL startup script
Yes, you really want this baby to support multiple instances. I’ve learned my lesson with the wildgrowth of copies of the various MMM init scripts.
2. Templating of configs
If you want to maintain the instances well you should definitely start using a fixed template which includes a defaults file. In our case I created one defaults file for all instances and each and every instance will override the settings of the defaults file. Also some tuning parts are now separated from the main config.
3. Automation of adding new instances to a host
Apart from a bunch of config files, data directory you really want to have some intelligence when adding another instance. For example only the innodb_buffer_pool_size needs to be adjusted for each new instance you add.
4. Automation of removing instances from a host
Part of the step above: if you can add instances, they you need to be able to remove an instance. This should be done with care as it will be destructive. 😉
After this there is still a long extensive list of things to be taken care of:
1. Automation of replication setup
The plan is to keep things simple and have two hosts replicate all instances to each other. So the instance 3307 on host 1 will replicate to instance 3307 on host 2 (and back), instance 3308 on host 1 will replicate to instance 3308 on host 2 (and back), etc.
2. HA Monitoring needs thinking/replacement
I haven’t found a HA Monitoring tool that can handle multiple instances on one host.
Why is this a problem?
If only one of the MySQL instances needs maintenance you can’t use the current tools unless you are willing to make all other instances unavailable as well. Also what will you do when the connection pool of one instance gets exhausted? Or if one instance on both servers die?
3. Backup scripts needs some changes
Obviously our backup tools (wrapper scripts around xtrabackup) need some alteration. We are now running multiple instances, so we need to backup more than one database.
4. Cloning scripts need some changes
We have a script that can clone a live database (utilizing xtrabackup) to a new host. Apart from the fact that it assumes it needs to clone only one single database we might also go for full cloning of all instances
5. Monitoring needs to understand multiple instances
Our current (performance) monitoring tools, like Nagios/Cacti/etc, only assume one MySQL instance per host. At best I can implement the templates multiple times, but that also increases the number of other checks with the same factor.
And there obviously a lot more things I haven’t thought of yet. As you can see I’ll be quite occupied in the upcoming period…