Is Galera trx_commit=2 and sync_binlog=0 evil?

It has been almost 5 years since I posted on my personal MySQL related blog. In the past few years I have worked for Severalnines and blogging both on their corporate blog and here would be confusing. After that I forgot and neglected this blog a bit, but it’s time to revive this blog!

Speaking at Percona Live Europe – Amsterdam 2019

Why? I will be presenting at Percona Live Europe soon and this blog and upcoming content is the more in-depth part of some background stories in my talk on benchmarking: Benchmarking should never be optional. The talk will mainly cover why you should always benchmark your servers, clusters and entire systems.

See me speak at Percona Live Europe 2019

If you wish to see me present, you can receive 20% discount using this code: CMESPEAK-ART. Now let’s move on to the real content of this post!

Innodb_flush_log_at_trx_commit=2 and sync_binlog=0

At one of my previous employers we ran a Galera cluster of 3 nodes to store all shopping carts of their webshop. Any cart operation (adding a product to the basket, removing a product from the basket or increasing/decreasing the number of items) would end up as a database transaction. With such important information stored in this database, in a traditional MySQL asynchronous replication setup it would be essential to ensure all transactions are retained at all times. To be fully ACID compliant the master would have both innodb_flush_log_at_trx_commit set to 2 and sync_binlog set to 0 to ensure every transaction is written to the logs and flushed to disk. When every transaction has to wait for data to be written to the logs and flushed to disk, this will limit the number of cart operations you can do.

Somewhere in a dark past the company passed the number of cart operations possible on this host and one of the engineers found a Stackoverflow post instructing how to improve the performance of MySQL by “tuning” the combo of the two variables. Naturally this solved the immediate capacity problem, but sacrificed in consistency at the same time. As Jean-François Gagné pointed out in a blog post, you can lose transactions in MySQL when you suffer from OS crashes. This was inevitable to happen some day and when that day arrived a new solution had come available: Galera!

Galera and being crash-unsafe

Galera offers semi-synchronous replication to ensure your transaction has been committed on the other nodes in the cluster. You just spread your cluster over your entire infrastructure on multiple hosts in multiple racks. When a node crashes it will recover when rejoining and Galera will fix itself, right?

Why would you care about crash-unsafe situations?

The answer is a bit more complicated than a yes or a no. When an OS crash happens (or a kill -9), InnoDB can be more advanced than the data written to the binary logs. But Galera doesn’t use binary logs by default, right? No it doesn’t, but it uses GCache instead: this file stores all transactions committed (in the ring buffer) so it acts similar to the binary logs and acts similar to these two variables. Also if you have asynchronous slaves attached to Galera nodes, it will write to both the GCache and the binary logs simultaneously. In other words: you could create a transaction gap with a crash-unsafe Galera node.

However Galera will keep state of the last UUID and sequence number in the grastate.dat file in the MySQL root folder. Now when an OS crash happens, Galera will read the grastate.dat file on startup and on an unclean shutdown it encounters seqno: -1. While  Galera is running the file contains the seqno: -1 and only upon normal shutdown the grastate.dat is written. So when it finds seqno: -1, Galera will assume an unclean shutdown happened and if the node is joining an existing cluster (becoming part of the primary component) it will force a State Snapshot Transfer (SST) from a donor. This wipes all data on the broken node, copies all data and makes sure the joining node has the same dataset.

Apart from the fact that unclean shutdown always triggers a SST (bad if your dataset is large, but more on that in a future post), Galera is pretty much recovering itself and not so much affected by being crash-unsafe. So what’s the problem?

It’s not a problem until all nodes crash at the same time.

Full Galera cluster crash

Suppose all nodes crash at the same time, none of the nodes would have been shut down properly and all nodes would have seqno: -1 in the grastate.dat. In this case a full cluster recovery has to be performed where MySQL has to be started with the –wsrep-recover option. This will open the innodb header files, shutdown immediately and return the last known state for that particular node.

$ mysqld --wsrep-recover
...
2019-09-09 13:22:27 36311 [Note] InnoDB: Database was not shutdown normally!
2019-09-09 13:22:27 36311 [Note] InnoDB: Starting crash recovery.
...
2019-09-09 13:22:28 36311 [Note] WSREP: Recovered position: 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114428
...

Now we have three independent Galera nodes that each suffered from an unclean shutdown. This means all three have lost transactions up to one second before crashing. Even though all transactions committed within the cluster are theoretically the same as the cluster crashed at the same moment in time, this doesn’t mean all three nodes have the same number of transactions flushed to disk. Most probably all three nodes have a different last UUID and sequence number and even within this there could be gaps as transactions are executed in parallel. Are we back at eeny-meeny-miny-moe and just pick one of these nodes?

Can we consider Galera with trx_commit=2 and sync_binlog=0 to be evil?

Yes and no… Yes because we have potentially lost a few transactions so yes it’s bad for consistency. No because the entire cart functionality became unavailable and carts have been abandoned in all sorts of states. As the entire cluster crashed, customers couldn’t perform any actions on the carts anyway and had to wait until service had been restored. Even if a customer just finished a payment, in this particular case the next step in the cart could not have been saved due to the unavailability of the database. This means carts have been abandoned and some may actually have been paid for. Even without the lost transactions we would need to recover these carts and payments manually.

So to be honest: I think it doesn’t matter that much if you handle cases like this properly. Now if you would design your application right you would catch the (database) error after returning from the payment screen and create a ticket for customer support to pick this up. Even better would be to trigger a circuit breaker and ensure your customers can’t re-use their carts after the database has been recovered. Another approach would be to scavenge data from various sources and double check the integrity of your system.

The background story

Now why is this background to my talk because this doesn’t have anything to do with benchmarking? The actual story in my presentation is about a particular problem around hyperconverging an (existing) infrastructure. A hyperconverged infrastructure will sync every write to disk to at least one other hypervisor in the infrastructure (via network) to ensure that if the hypervisor dies, you can quickly spin up a new node on a different hypervisor. As we have learned from above: the data on a crashed Galera node is unrecoverable and will be deleted during the joining process (SST). This means it’s useless to sync Galera data to another hypervisor in a hyperconverged infrastructure. And guess what the risk is if you hyper-converge your entire infrastructure into a single rack? 😆

I’ll write more about the issues with Galera on a hyperconverged infrastructure in the next post!

Advertisements

Presenting Sphinx Search at Percona Live London

It has been a while since I posted on this blog. The more in depth articles are all posted on the Spil Games Engineering blog and I’m overcrowded with work in the past months that I hardly have any time left.

One of the reasons for having too much on my plate was my own doing: I volunteered to be a conference committee member for Percona Live London and we, as a committee, worked our way through all proposals while you all were enjoying your holidays in the sun. Well, I must admit I did review a couple of them sitting in the sun, next to the pool enjoying a cold drink. 😉
I must say there were so many great proposals it was really hard to choose which ones would be left out.

I also proposed a talk for Percona Live London this year and my fellow committee members liked it enough to make it to the final schedule: Serve out any page with an HA Sphinx environment.
In basis it is a MySQL case study where I will show how we at Spil Games use Sphinx Search in three different use cases: our main search, friend search and serving out our content on any page. I’ll also describe how we handle high availability, how we handle index creation and show benchmark results between MySQL and Sphinx in various use cases.

In case you are interested in Sphinx or the benchmark results: the session will be on the 4th of November at 3:10pm – 4:00pm in Cromwell 1 & 2.
Also don’t hesitate to ask me things when I’m wandering around in the corridors and rooms. Or maybe we’ll meet at the MySQL Community Dinner?
See you next week!

Presenting at FOSDEM and Percona Live

Very short update from my side: I’ll be presenting at FOSDEM in Brussels (1-2 February 2014) and Percona Live MySQL Conference in Santa Clara (1-4 April 2014).

FOSDEM
At FOSDEM I will present about Galera replication in real life which is concentrate around two use cases for Galera: adding new clusters to our sharded environment and migrating existing clusters into a new Galera cluster.

Percona Live MySQL Conference and Expo, April 1-4, 2014
At Percona Live MySQL Conference I will present about our globally distributed storage layers. Next to our globally sharded environment we have built a new environment called ROAR (Read Often, Alter Rarely) that also needs to be distributed globally.

Both are interesting talks and I really look forward to present at these great conferences. So if you have the way and means to attend either one: you should!

Percona Live London is already next Monday!

I’m currently wrapping up last things in the office to be prepared for my presentation about MySQL-Statsd at the Percona Live London conference next week. It will be a revised version of the talk I gave at the Percona Live Conference & Expo in Santa Clara, but this time it will be more focussed to the practical side and, obviously, the MySQL-Statsd daemon we open sourced recently, so if you missed that talk or think a follow up is needed you should definitely attend! Slides will also become immediately available after the talk.

I’m also looing forward to attend a lot of other talks, for instance The Shard Revisited: Tools and Techniques Used at Etsy and MySQL, the ebay Classifieds Way and of course the tutorials.

It is great to see the bustrip to the community dinner organized by MariaDB. I did not really mind taking the tube for 17 minutes, but getting a busride in a Roadmaster is obviously a lot more fun and comfortable!

So see you at the conference next Monday and Tuesday!

Spil Games MySQL StatsD project on Github

Yesterday evening I already announced the happy news on the Spil Games Engineering blog, you can find our MySQL StatsD daemon in our Github repository here:
https://github.com/spilgames/mysql-statsd

It is far from perfect and the todo list is still long. You are, as it is a public repo, welcome to collaborate with us and make it even better. I can’t wait to see the first pull-request coming in!

You can also talk to us directly at Percona Live London where we will bring a revised version of our talk at Santa Clara. This time it will be aimed more at the practicality: It will contain less comparison with existing tools, be less theoretical and focus more on the how to do it.

A bit of a shame to do it this way: I tried several times to add the Spil Games Engineering feed to the Planet MySQL feed adder, but it fails on the feed validation and suggests to validate it via feedvalidator.org. However feedvalidator.org validates it perfectly fine as a RSS 2.0 feed. I did use the Planet MySQL feedback form twice but no reply on that and I think I will file a bug report later today.

Organizing the MySQL UG NL on 22nd of February

It has been a bit of a rollercoaster ride for us since the Percona Live London posting. The team has expanded with two new DBAs, I was invided to give a talk on selected the Percona Live Conference & Expo 2013 in April and at the same time Spil Games is organizing the second MySQL User Group NL meeting on Friday February 22nd of February. I did not realize it is next week and never posted about it, so there it is!

The meeting schedule is as following:
17:00 Spil Games Pub Open
18:00 Introduction
18:15 “MySQL User Defined Functions” by Roland Bouman
19:00 Pizza
20:00 “Total cost of ownership” by Zsolt Fabian (Spil Games)

Before and after the meeting drinks and snacks will be served in our pub. You can chat up with others, mingle with the Spil Games employees or if you are very shy play some pool/foosball/pinball.

I’m happy we are presenting the TCO on this User Group. Zsolt will show his findings on several things you need to keep in mind if you wish to calculate your TCO, so it will be more a general guidance on how to do it yourself. Of course we will share some of our own WTFs/facepalms and other interesting facts we found during our own investigation. 😉

In case you are attending, there are several ways to get to the Spil Games HQ:
If you travel by car, just punch in our address in your navigation:
Arendstraat 23
1223 RE Hilversum
Do take notice our entrance has moved to the new building on our campus behind these nicely graffiti painted doors:
spilgames-gates

Second option would be coming by public transport.
Coming from the direction of Amsterdam/Amerfoort:
Take the train to Hilversum (central) and either walk to our new office using Google Maps (about 15 minutes walk). Otherwise you can take bus #2 (to Snelliuslaan) and hop off on the Minckelersstraat (ask the driver) and walk the remaining few hundred meters.

Coming from the direction of Utrecht:
Take the train to Hilversum Sportpark and walk to our new office using Google Maps (about 8 to 10 minutes walk).

Hope to see you all next Friday at the Spil HQ! 🙂

Percona Live London 2012 slides available

Many thanks to all those who attended my talk at the Percona Live London 2012 conference!
I did put the location in the last slide, but just in case you missed the last slide (or missed my talk) you can find them here:
http://spil.com/perconalondon2012

I did receive a couple of questions afterwards (in the hallways of the conference) that made me realize that I forgot to clear up a couple of things.

First of all the essence of shifting the data ownership of a specific GID towards a specific datacenter and ensuring data consistency also means one Erlang process within that very same datacenter is the owner of that data. This does also mean this Erlang process is the only that can write to the data of this GID. Don’t worry: for every GID there should be a process that is the data owner and Erlang should be able to cope with the enormous scale here.

Second of all the whole purpose of the satellite datacenter (all virtualized) is to have a disposable datacenter while the master datacenter (mostly virtualized, except for storage) is permanent. Imagine that next to the existing presence (master or satellite DC) in one country we also expect big growth due to the launch of a new game we could easily create a new satellite datacenter by getting a couple of machines in the cloud. This way our hybrid cloud can easily be expanded either by virtuals or by datacenters. I thought this was a bit too offtopic but apparently it raised some questions.

If you have any questions, don’t hesitate to ask! 🙂

Runaway history list

On one of the clusters at Spil we noticed a sudden increase in the length of the history list and a steep increase in the ibdata file in the MySQL directory.
I did post a bit about this topic earlier regarding MySQL 5.5 but this cluster is still running 5.1 and unfortunately 5.1 does not have the same configurable options to influence the purging of the undo log…

History list

Now I did find a couple of great resources that explain the purge lag problem into detail: Pythian, DimitriK and Marco Tusa.

What it boils down to is that the purge lag is largely influenced by the length of the history list and the purge lag:
((purge_lag/innodb_max_purge_lag)×10)–5 milliseconds.
On 5.5 it is also influenced by the number of purge threads and purge batch size. I toyed around with these settings in my earlier post and tuning them helped. However the only setting I could change on 5.1 is the purge lag in milliseconds that was already set to 0. In other words: I could not fiddle around with this. This time it wasn’t an upgrade to 5.5 either so I could not blame that again. 😉

So what was different on this server then? Well the only difference was that it did have “a bit” of disk utilization: around 80% during peak hours. Since it is not used as a front end server it does not affect the end users, but only the (background) job processes that process and store data on this server. However it could be the case that due to the IO utilization it started to lag behind and created a too large history list to catch up with its current configuration.

How did we resolve it then? After I read this quote of Peter Zaitsev on Marco Tusa‘s posting the solution became clear:

Running Very Long Transaction If you’re running very long transaction, be it even SELECT, Innodb will be unable to purge records for changes which are done after this transaction has started, in default REPEATABLE-READ isolation mode. This means very long transactions are very bad causing a lot of garbage to be accommodated in the database. It is not limited to undo slots. When we’re speaking about Long Transactions the time is a bad measure. Having transaction in read only database open for weeks does no harm, however if database has very high update rate, say 10K+ rows are modified every second even 5 minute transaction may be considered long as it will be enough to accumulate about 3 million of row changes.

The transaction isolation is default set to REPEATABLE-READ and we favor it on many of our systems, especially because it performs better than READ-COMMITTED. However a background job running storage server does not need this transaction isolation, especially not if it is was blocking the purge to be performed!

So in the end changing the transaction isolation to READ-COMMITTED did fix the job for us.

Some other things: tomorrow my team is attending the MySQL User Group NL and in three weeks time I’ll be speaking at Percona London:
Percona Live London, December 3-4, 2012
So see you there!

Using CURDATE() the wrong way

It has been a while since I wrote on this blog. Basically I had too much on my mind (expanding my department, holidays, etc) to actually write here and I’ll promise to post more regularly from now onwards. 😉

Syntax error!

Anyway, as the title already suggests: I found out how you can use CURDATE() in a wrong way. One of the developers in my company asked me to help him out as his code all of a sudden did not work properly anymore. Or even better: it used to process several thousands of rows and all of a sudden it processed none.

I looked at his code snippet and it was quite a long query with a lot of brackets:

SELECT SUM(some_count_col), logdate, loghour FROM logs
WHERE (logdate = CURDATE() AND loghour = HOUR(NOW())
GROUP BY logdate, loghour;

Column wise logdate is of the type DATE and loghour of the type TINYINT.
note that this is, obviously, not the original query, but it is similar

Apart from the fact that his usage of brackets makes the query quite unreadable I was quickly able to simplify the query to this: Continue reading “Using CURDATE() the wrong way”

The Percona Live MySQL Conference 2012

Thank you very much if you attended my session at the Percona Live MySQL Conference!
I promised some people to share my slides, so I posted them on the page at Percona:
Spil Games: Outgrowing an internet startup (Percona Live MySQL Conference 2012) on SlideShare
Click here if you need a direct link

My opinion of the conference is that it was amazing! The conference was very well organized, the atmosphere was great and I met so many great people that I had a tough time remembering all their names and companies. The contents of all talks were really well balanced and most of the ones I attended were very interesting.

The most interesting talk of the conference was the Scripting MySQL with Lua and libdrizzle inside Nginx. It was a shame only a few people attended the talk and that they ran out of time before they could complete the presentation. 😦

Apart from that I had a really great time and hope to see you all next year! (or later this year in London)