A few days ago I got a call from our support engineer on duty: MySQL on one of our database servers was lagging more than 1000 seconds behind in replication, and the server had been kicked out of the pool because of the delay. He was unable to find out why, and there was absolutely nothing in the MySQL log files. When I got the call the server was still lagging, but the lag was slowly decreasing again.
After a quick peek in all our monitoring systems I isolated it to this message:
Cache Battery 0 in controller 0 is Charging (Ready) [probably harmless]
Apparently not that harmless!
Obviously we had encountered this situation a couple of times before, but apparently there was no detection for it on this machine.
The relearn cycle happens every 90 days and is first scheduled when the machine is powered on. Now imagine this happening in a master-master setup where both machines were powered on at the same time. Luckily you can use omconfig to postpone the cycle by up to 7 days, but then you obviously need to have detection in place.
Why did nobody come up with the idea of a dual battery-backed cache with alternating relearn cycles? That way the battery could relearn without the controller falling back to write-through mode.
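The kind of detection mentioned above could start out as small as the sketch below. It only assumes that the monitoring system hands over lines shaped like the one quoted earlier; the function names and the regular expression are mine, not part of any Dell tool, and the pattern is derived from that single example message.

```python
import re

# Assumed message layout, taken from the single example line in this post.
BATTERY_RE = re.compile(
    r"Cache Battery (?P<battery>\d+) in controller (?P<controller>\d+) "
    r"is (?P<state>.+?)(?: \[|$)"
)


def parse_battery_message(message):
    """Return (controller, battery, state), or None if the line does not match."""
    match = BATTERY_RE.search(message)
    if match is None:
        return None
    return (
        int(match["controller"]),
        int(match["battery"]),
        match["state"].strip(),
    )


def needs_attention(message):
    """True when the battery reports a charging state.

    A charging battery is treated as a hint that the controller may be
    running in write-through mode for a while.
    """
    parsed = parse_battery_message(message)
    if parsed is None:
        return False
    return "charging" in parsed[2].lower()


if __name__ == "__main__":
    line = "Cache Battery 0 in controller 0 is Charging (Ready) [probably harmless]"
    print(parse_battery_message(line), needs_attention(line))
```

Hooking this up to whatever pages the engineer on duty would at least turn "probably harmless" into something that gets looked at before replication falls behind.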
At first I thought it would be a simple case of the master and slaves being out of sync, with row-based replication failing because it could not find the row, but then I noticed all machines were actually still running statement-based replication. As far as I could recall we did that to work around another issue that was solved months ago, but for some reason we never switched back to row-based replication.
A simple SHOW SLAVE STATUS revealed something similar to this:
Last_Error: Error 'Duplicate entry '272369' for key 'PRIMARY'' on query. Default database: 'userdata'. Query: 'DELETE u, uf1, uf2, up FROM `users` u LEFT JOIN `friends` uf1 ON u.id = uf1.id LEFT JOIN `friends` uf2 ON u.id = uf2.friend_id LEFT JOIN `prefs` up ON u.id = up.id WHERE u.name = 'testuser';
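To avoid eyeballing that output every time, a check along the following lines could pick out the interesting parts. This is only a sketch: it assumes the row from SHOW SLAVE STATUS has already been fetched into a dictionary keyed by column name (for example through a dict-style cursor), the function name is hypothetical, and the default threshold of 1000 seconds is simply the figure from the incident described earlier.

```python
import re

# Matches the duplicate-key part of a Last_Error string like the one above.
DUPLICATE_RE = re.compile(
    r"Duplicate entry '(?P<value>[^']*)' for key '(?P<key>[^']*)'"
)


def summarize_slave_status(status, max_lag_seconds=1000):
    """Return a list of human-readable problems found in one status row."""
    problems = []

    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL here when the lag cannot be determined.
        problems.append("Seconds_Behind_Master is NULL (SQL thread probably stopped)")
    elif lag > max_lag_seconds:
        problems.append(f"lagging {lag}s behind the master")

    duplicate = DUPLICATE_RE.search(status.get("Last_Error") or "")
    if duplicate:
        problems.append(
            f"duplicate entry {duplicate['value']} for key {duplicate['key']}"
        )

    return problems


if __name__ == "__main__":
    row = {
        "Seconds_Behind_Master": None,
        "Last_Error": "Error 'Duplicate entry '272369' for key 'PRIMARY'' on query.",
    }
    for problem in summarize_slave_status(row):
        print(problem)
```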