At first I thought it would be an easy problem of the master and slaves being out of sync and the row-based replication failing on not finding the row, but then I noticed all machines were actually still running statement-based replication. As far as I could recall we did that to circumvent another issue that has already been solved months ago but for some reason we never put it back to row-based replication.
A simple SHOW SLAVE STATUS revealed something similar to this:
Last_Error: Error 'Duplicate entry '272369' for key 'PRIMARY'' on query. Default database: 'userdata'. Query: 'DELETE u, uf1, uf2, up FROM `users` u LEFT JOIN `friends` uf1 ON u.id = uf1.id LEFT JOIN `friends` uf2 ON u.id = uf2.friend_id LEFT JOIN `prefs` up ON u.id = up.id WHERE u.name = 'testuser';
So a delete query with a triple left join on depending tables. That’s weird I thought: delete queries should not complain about duplicate entries on the primary key… Of course I first tried to execute the query by hand and it ran just fine. So I executed a sql_slave_skip_counter and restarted replication and the replication started up just fine as it should. A fluke I thought…
Well, I was wrong: few seconds later it failed again. Same type of query with the same replication error. So I dug a bit deeper and executed it as a select query. As expected it gave me a couple of rows and after execution of the delete those records had disappeared from the database. All sorts of causes crossed my mind and I suspected setting the cluster to row-based replication could solve the problem, but I wasn’t 100% sure of that. Since it was a Saturday night (second busiest night of the week) I decided to postpone the digging till after the weekend. I put a slave-skip-erros in place with the idea that having too much data (it is a delete anyway!) is always better than having a slave that is far behind in replication.
The machine performed all other the replication fine during the weekend and a verification with mk-table-checksum revealed the only differences were to be found in the tables where I let the replication skip on errors. While looking into the possible causes it suddenly crossed my mind that the users tables contained a trigger for inserting messages into an email queue so I looked up the trigger in the database. It was not only one trigger, but actually three of them! This is the on delete trigger:
DROP TRIGGER IF EXISTS `userdata`.`ad_trigger`//
CREATE TRIGGER `userdata`.`ad_trigger` AFTER DELETE ON `userdata`.`users`
FOR EACH ROW BEGIN
INSERT INTO `queue` (`id`,`action`, `email`) VALUES (OLD.id, 'D', OLD.email);
So I immediately put my suspicion on the delete trigger and concentrated on that. I sources that confirmed this: inserting on an auto_increment field through a trigger fired through statement-based replication is not considered safe in MySQL 5.1.40 and above.
This seems to only happen if:
1) at least one of the transactions is preceded by autocommit=0 or is using an explicit
transactional statement such as ‘start transaction’ or ‘begin’.
2) the transaction under 1) is doing an insert, update or delete on an innodb table
3) the transaction under 1) causes a trigger to insert 2 or more rows into the table with
the auto_increment column (this table doesn’t need to be an innodb table – I’ve tried both
myisam an innodb)
4) before the transaction under 1) does a commit (or possibly a rollback) another
concurrent transaction causes an insert on the auto_increment table.
Also at the bottom of the issue:
For 5.1, We mark all the statements that invoke a trigger or call a function that updated autoinc fields as unsafe and will switch to row-format in mixed mode. But in statement mode, we didn’t emitted the warning.
So what happened? Quite simple: the output of triggers are not replicated when using statement-based replication, they are triggered on the slave again by the insert/update/delete statement that gets executed. Then the trigger starts up a transaction for each deleted row it will do an insert on the queue table which contain an auto_increment on a unnamed column in the insert. The reason why it gets executed multiple times is still a bit shady: I could not replicate this issue on our staging environment, but I suspect either the multiple records returned due to the left joins are executed at the same time causing MySQL to remove the record multiple times or we may have encountered some other issue here.
This led me to two simple conclusions:
- I got fooled by the replication error message pointing me in the wrong direction
- My initial thought was right: row based replication would have saved us from this problem.
If anyone has any idea why the trigger created a duplicate record inside the queue table this puzzle would be solved completely and I would be very very grateful. 😛
Sidenote: the tables and data are, obviously, fictional and data has been obfuscated but they do resemble the problem closely