Is Galera trx_commit=2 and sync_binlog=0 evil?

It has been almost 5 years since I posted on my personal MySQL related blog. In the past few years I have worked for Severalnines, and blogging both on their corporate blog and here would have been confusing. After that I forgot about and neglected this blog a bit, but it’s time to revive it!

Speaking at Percona Live Europe – Amsterdam 2019

Why? I will be presenting at Percona Live Europe soon, and this blog and its upcoming content form the more in-depth background to some of the stories in my talk on benchmarking: Benchmarking should never be optional. The talk will mainly cover why you should always benchmark your servers, clusters and entire systems.

See me speak at Percona Live Europe 2019

If you wish to see me present, you can receive a 20% discount using this code: CMESPEAK-ART. Now let’s move on to the real content of this post!

Innodb_flush_log_at_trx_commit=2 and sync_binlog=0

At one of my previous employers we ran a Galera cluster of three nodes to store all shopping carts of their webshop. Any cart operation (adding a product to the basket, removing a product from the basket or increasing/decreasing the number of items) would end up as a database transaction. With such important information stored in this database, in a traditional MySQL asynchronous replication setup it would be essential to ensure all transactions are retained at all times. To be fully ACID compliant the master would have innodb_flush_log_at_trx_commit set to 1 and sync_binlog set to 1, so that every transaction is written to the logs and flushed to disk on commit. But when every transaction has to wait for data to be written to the logs and flushed to disk, this limits the number of cart operations you can do.
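
For reference, this is what the fully durable settings versus the relaxed combination from the title look like in my.cnf (a minimal sketch, showing only these two variables):

[mysqld]
# Fully durable (ACID compliant): write and flush the redo log and sync the
# binary log on every commit.
innodb_flush_log_at_trx_commit = 1
sync_binlog                    = 1

# The relaxed combination from the title: write the redo log at commit but
# flush it to disk only about once per second, and leave syncing of the
# binary log to the operating system.
# innodb_flush_log_at_trx_commit = 2
# sync_binlog                    = 0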

Somewhere in a dark past the company exceeded the number of cart operations this host could handle, and one of the engineers found a Stack Overflow post explaining how to improve MySQL performance by “tuning” this combination of variables to innodb_flush_log_at_trx_commit=2 and sync_binlog=0. Naturally this solved the immediate capacity problem, but sacrificed durability at the same time. As Jean-François Gagné pointed out in a blog post, you can lose transactions in MySQL when you suffer an OS crash. Such a crash was bound to happen some day, and when that day arrived a new solution had become available: Galera!

Galera and being crash-unsafe

Galera offers virtually synchronous replication: on commit your transaction’s writeset is replicated to and certified by the other nodes in the cluster. You just spread your cluster over your entire infrastructure, on multiple hosts in multiple racks. When a node crashes it will recover when rejoining and Galera will fix itself, right?

Why would you care about crash-unsafe situations?

The answer is a bit more complicated than a simple yes or no. When an OS crash happens (or a kill -9), InnoDB can be ahead of the data written to the binary logs. But Galera doesn’t use binary logs by default, right? No it doesn’t, but it uses the GCache instead: this file stores all committed writesets in a ring buffer, so it fulfils a role similar to the binary logs and suffers from the same durability trade-off. Also, if you have asynchronous slaves attached to Galera nodes, the node will write to both the GCache and the binary logs simultaneously. In other words: you could create a transaction gap with a crash-unsafe Galera node.

However, Galera keeps the cluster state UUID and the last sequence number in the grastate.dat file in the MySQL data directory. While Galera is running this file contains seqno: -1; only upon a clean shutdown is the real sequence number written. So when Galera reads grastate.dat on startup and finds seqno: -1, it assumes an unclean shutdown happened, and if the node is joining an existing cluster (becoming part of the primary component) it will force a State Snapshot Transfer (SST) from a donor. The SST wipes all data on the broken node and copies the full dataset over, making sure the joining node ends up with the same data.
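
For illustration, a grastate.dat written on a clean shutdown versus the one you find while the node is running (or after a crash) looks roughly like this (a sketch; newer Galera versions also record a safe_to_bootstrap flag):

# GALERA saved state -- after a clean shutdown
version: 2.1
uuid:    8bcf4a34-aedb-14e5-bcc3-d3e36277729f
seqno:   114428

# GALERA saved state -- while running / after a crash
version: 2.1
uuid:    8bcf4a34-aedb-14e5-bcc3-d3e36277729f
seqno:   -1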

Apart from the fact that an unclean shutdown always triggers an SST (painful if your dataset is large, but more on that in a future post), Galera pretty much recovers itself and is not much affected by being crash-unsafe. So what’s the problem?

It’s not a problem until all nodes crash at the same time.

Full Galera cluster crash

Suppose all nodes crash at the same time: none of them has been shut down properly and all of them have seqno: -1 in their grastate.dat. In this case a full cluster recovery has to be performed, where MySQL is started with the --wsrep-recover option. This opens the InnoDB header files, performs crash recovery, shuts down immediately and reports the last known state of that particular node.

$ mysqld --wsrep-recover
...
2019-09-09 13:22:27 36311 [Note] InnoDB: Database was not shutdown normally!
2019-09-09 13:22:27 36311 [Note] InnoDB: Starting crash recovery.
...
2019-09-09 13:22:28 36311 [Note] WSREP: Recovered position: 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114428
...

Now we have three independent Galera nodes that each suffered an unclean shutdown. This means all three may have lost up to the last second of transactions before the crash. Even though the set of transactions committed within the cluster is theoretically the same, as the whole cluster crashed at the same moment in time, this doesn’t mean all three nodes have the same number of transactions flushed to disk. Most probably all three nodes recover with a different last sequence number (the UUID part of the recovered position is the cluster state UUID and will be the same everywhere), and even within a single node there could be gaps, as transactions are applied in parallel. Are we back to eeny-meeny-miny-moe and just picking one of these nodes?
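
In practice the usual approach after a full cluster crash is to run the recovery on every node, compare the recovered positions and bootstrap the new cluster from the node with the highest sequence number; the other nodes then rejoin via SST. A minimal sketch (exact commands vary per distribution, and on recent Galera versions the chosen node also needs safe_to_bootstrap: 1 in its grastate.dat):

# on every node: recover and note the last known position
$ mysqld --wsrep-recover 2>&1 | grep 'Recovered position'
2019-09-09 13:22:28 36311 [Note] WSREP: Recovered position: 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114428

# on the node with the highest seqno: bootstrap a new cluster
$ mysqld --wsrep-new-cluster
# (or galera_new_cluster, depending on your distribution)

# on the remaining nodes: start normally, they will rejoin and SST from a donor
$ service mysql start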

Can we consider Galera with trx_commit=2 and sync_binlog=0 to be evil?

Yes and no… Yes, because we have potentially lost a few transactions, which is bad for durability and consistency. No, because the entire cart functionality became unavailable anyway and carts were abandoned in all sorts of states. As the entire cluster crashed, customers couldn’t perform any actions on their carts and had to wait until service was restored. Even if a customer had just finished a payment, the next step in the cart could not have been saved due to the unavailability of the database. This means carts were abandoned and some may actually have been paid for. Even without the lost transactions we would need to recover these carts and payments manually.

So to be honest: I think it doesn’t matter that much, as long as you handle cases like this properly. If you design your application right, you catch the (database) error after returning from the payment screen and create a ticket for customer support to pick it up. Even better would be to trigger a circuit breaker and ensure your customers can’t re-use their carts after the database has been recovered. Another approach would be to scavenge data from various sources and double-check the integrity of your system.

The background story

Now why is this background to my talk, when it doesn’t have anything to do with benchmarking? The actual story in my presentation is about a particular problem around hyperconverging an (existing) infrastructure. A hyperconverged infrastructure syncs every write to disk to at least one other hypervisor (over the network) to ensure that if a hypervisor dies, you can quickly spin up a new node on a different hypervisor. As we have learned above: the data on a crashed Galera node is unrecoverable and will be deleted during the joining process (SST). This means it’s useless to sync Galera data to another hypervisor in a hyperconverged infrastructure. And guess what the risk is if you hyperconverge your entire infrastructure into a single rack? 😆

I’ll write more about the issues with Galera on a hyperconverged infrastructure in the next post!


Presenting at FOSDEM and Percona Live

Very short update from my side: I’ll be presenting at FOSDEM in Brussels (1-2 February 2014) and at the Percona Live MySQL Conference in Santa Clara (1-4 April 2014).

FOSDEM
At FOSDEM I will present on Galera replication in real life, which concentrates on two use cases for Galera: adding new clusters to our sharded environment and migrating existing clusters into a new Galera cluster.

Percona Live MySQL Conference and Expo, April 1-4, 2014
At the Percona Live MySQL Conference I will present on our globally distributed storage layers. In addition to our globally sharded environment we have built a new environment called ROAR (Read Often, Alter Rarely) that also needs to be distributed globally.

Both are interesting talks and I really look forward to presenting at these great conferences. So if you have the ways and means to attend either one: you should!

Organizing the MySQL UG NL on 22nd of February

It has been a bit of a rollercoaster ride for us since the Percona Live London post. The team has expanded with two new DBAs, I was invited to give a talk at the Percona Live Conference & Expo 2013 in April, and at the same time Spil Games is organizing the second MySQL User Group NL meeting on Friday the 22nd of February. I did not realize it is next week and never posted about it, so here it is!

The meeting schedule is as follows:
17:00 Spil Games Pub Open
18:00 Introduction
18:15 “MySQL User Defined Functions” by Roland Bouman
19:00 Pizza
20:00 “Total cost of ownership” by Zsolt Fabian (Spil Games)

Before and after the meeting, drinks and snacks will be served in our pub. You can chat with others, mingle with the Spil Games employees or, if you are very shy, play some pool/foosball/pinball.

I’m happy we are presenting on TCO at this User Group meeting. Zsolt will show his findings on several things you need to keep in mind if you wish to calculate your TCO, so it will be more of a general guide on how to do it yourself. Of course we will share some of our own WTFs/facepalms and other interesting facts we found during our own investigation. 😉

In case you are attending, there are several ways to get to the Spil Games HQ:
If you travel by car, just punch our address into your navigation system:
Arendstraat 23
1223 RE Hilversum
Do note that our entrance has moved to the new building on our campus, behind these nicely graffiti-painted doors:
[photo: the graffiti-painted entrance gates of the Spil Games campus]

Second option would be coming by public transport.
Coming from the direction of Amsterdam/Amersfoort:
Take the train to Hilversum (central) and either walk to our new office using Google Maps (about a 15-minute walk), or take bus #2 (towards Snelliuslaan), hop off at the Minckelersstraat (ask the driver) and walk the remaining few hundred meters.

Coming from the direction of Utrecht:
Take the train to Hilversum Sportpark and walk to our new office using Google Maps (an 8 to 10 minute walk).

Hope to see you all next Friday at the Spil HQ! 🙂

Percona Live London 2012 slides available

Many thanks to all those who attended my talk at the Percona Live London 2012 conference!
I did put the location of the slides on the last slide, but just in case you missed it (or missed my talk) you can find them here:
http://spil.com/perconalondon2012

I did receive a couple of questions afterwards (in the hallways of the conference) that made me realize that I forgot to clear up a couple of things.

First of all, shifting the data ownership of a specific GID towards a specific datacenter and ensuring data consistency means that one Erlang process within that very same datacenter is the owner of that data. This also means this Erlang process is the only one that can write to the data of this GID. Don’t worry: for every GID there should be a process that is the data owner, and Erlang should be able to cope with the enormous scale here.

Second of all, the whole purpose of the satellite datacenter (all virtualized) is to have a disposable datacenter, while the master datacenter (mostly virtualized, except for storage) is permanent. Imagine that, next to the existing presence (master or satellite DC) in one country, we expect big growth due to the launch of a new game: we could easily create a new satellite datacenter by getting a couple of machines in the cloud. This way our hybrid cloud can easily be expanded, either with virtual machines or with whole datacenters. I thought this was a bit too off-topic, but apparently it raised some questions.

If you have any questions, don’t hesitate to ask! 🙂

The Percona Live MySQL Conference 2012

Thank you very much if you attended my session at the Percona Live MySQL Conference!
I promised some people to share my slides, so I posted them on the page at Percona:
Spil Games: Outgrowing an internet startup (Percona Live MySQL Conference 2012) on SlideShare
Click here if you need a direct link

My opinion of the conference is that it was amazing! The conference was very well organized, the atmosphere was great and I met so many great people that I had a tough time remembering all their names and companies. The contents of all talks were really well balanced and most of the ones I attended were very interesting.

The most interesting talk of the conference was Scripting MySQL with Lua and libdrizzle inside Nginx. It was a shame that only a few people attended and that the speakers ran out of time before they could complete the presentation. 😦

Apart from that I had a really great time and hope to see you all next year! (or later this year in London)

Running multiple MySQL instances in parallel

I know, I haven’t been posting much lately. The 5.5 upgrades got postponed because the new storage platform needed my immediate attention, and being a speaker at the Percona Live conference in April also takes a lot of attention.

One of the things I wanted to try out is running multiple MySQL instances on the same machine. The concept has remained in the back of my mind ever since I attended Ryan Thiessen’s presentation at the MySQL Conference 2011, but we never had a proper use case for it. Well, the new storage platform would really benefit from it, so it is an excellent use case to try it out! So what have I been busy with in the past week? That’s right: running multiple MySQL instances on one single server. 😉

Even though it is not well documented and nobody describes the process in depth, it is not that complicated to get multiple instances running next to each other. However, it does involve a lot of changes in the surrounding tools, scripts and monitoring. For example, this is what I have changed so far:

1. MySQL startup script
Yes, you really want this baby to support multiple instances. I’ve learned my lesson from the wild growth of copies of the various MMM init scripts.

2. Templating of configs
If you want to maintain the instances well, you should definitely start using a fixed template which includes a defaults file. In our case I created one defaults file for all instances, and every instance overrides the settings of the defaults file where needed. Also, some tuning parts are now separated from the main config (see the sketch after this list).

3. Automation of adding new instances to a host
Apart from creating a bunch of config files and a data directory, you really want some intelligence when adding another instance. For example, only the innodb_buffer_pool_size needs to be adjusted for each new instance you add.

4. Automation of removing instances from a host
The counterpart of the step above: if you can add instances, then you need to be able to remove an instance as well. This should be done with care as it is destructive. 😉
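
To give an idea of the config templating and the per-instance buffer pool tuning mentioned above, a minimal sketch (paths, ports, server IDs and sizes are made up for illustration):

# /etc/mysql/defaults.cnf -- settings shared by all instances
[mysqld]
user                            = mysql
skip_name_resolve
innodb_flush_log_at_trx_commit  = 1

# /etc/mysql/instances/3307.cnf -- overrides for the instance on port 3307
!include /etc/mysql/defaults.cnf
[mysqld]
port                     = 3307
socket                   = /var/run/mysqld/mysqld-3307.sock
datadir                  = /var/lib/mysql-3307
server_id                = 13307
innodb_buffer_pool_size  = 4G

# started by the init script with its own defaults file, e.g.:
# mysqld_safe --defaults-file=/etc/mysql/instances/3307.cnf &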

After this there is still a long extensive list of things to be taken care of:
1. Automation of replication setup
The plan is to keep things simple and have two hosts replicate all instances to each other. So instance 3307 on host 1 will replicate to instance 3307 on host 2 (and back), instance 3308 on host 1 to instance 3308 on host 2 (and back), and so on (see the sketch after this list).

2. HA Monitoring needs thinking/replacement
I haven’t found an HA monitoring tool that can handle multiple instances on one host.
Why is this a problem?
If only one of the MySQL instances needs maintenance, you can’t use the current tools unless you are willing to make all other instances unavailable as well. Also, what will you do when the connection pool of one instance gets exhausted? Or when one instance dies on both servers?

3. Backup scripts needs some changes
Obviously our backup tools (wrapper scripts around xtrabackup) need some alteration. We are now running multiple instances, so we need to back up more than one database.

4. Cloning scripts need some changes
We have a script that can clone a live database (utilizing xtrabackup) to a new host. Apart from the fact that it assumes it needs to clone only one single database, we might also go for full cloning of all instances.

5. Monitoring needs to understand multiple instances
Our current (performance) monitoring tools, like Nagios/Cacti/etc., assume only one MySQL instance per host. At best I can apply the templates multiple times, but that also multiplies the number of other checks by the same factor.
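
To illustrate the replication plan from point 1, a rough sketch of wiring up one instance pair (hostnames, credentials and binlog coordinates are made up; each instance needs its own server_id and binary logging enabled):

# on host2, connect to the local instance on port 3307 and point it at host1:3307
$ mysql --socket=/var/run/mysqld/mysqld-3307.sock
mysql> CHANGE MASTER TO
    ->   MASTER_HOST='host1', MASTER_PORT=3307,
    ->   MASTER_USER='repl', MASTER_PASSWORD='secret',
    ->   MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=107;
mysql> START SLAVE;

# repeat for instance 3308 against host1:3308, and configure the mirror image
# on host1 so each pair replicates in both directions (master-master).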

And there are obviously a lot more things I haven’t thought of yet. As you can see, I’ll be quite occupied in the upcoming period…

MySQL 5.5 upgrade blues (part two)

Shortly after the MySQL 5.5 upgrade the whole cluster was upgraded with extra RAM. This was a nice test to see how differently 5.1 and 5.5 behave when A) the InnoDB buffer pool is too small and B) the InnoDB buffer pool has enough room to fit everything in memory.

The MySQL 5.5 node showed just the same pattern in terms of disk utilization as the other nodes, both before (around 30% to 40%) and after the upgrade (4% to 5%), so not much difference at all. However, the number of free pages within the buffer pool is significantly lower (about 10%) than on the other nodes. This definitely needs some further investigation.
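
For anyone wanting to do the same comparison: the free and total page counters can be read from the buffer pool status variables (pages are 16KB by default), for example:

mysql> SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';

The ratio of Innodb_buffer_pool_pages_free to Innodb_buffer_pool_pages_total shows how much of the buffer pool is still unused, and Innodb_buffer_pool_pages_data how many pages actually hold data.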

Apart from that the machine is stable and it seems we will proceed with the upgrade on the whole cluster soon.

A side note: I’m happy to announce that I was selected as a speaker at the Percona Live MySQL Conference & Expo in San Francisco, April 2012. I’ll be talking about Spil Games (the company I work for) and how our new architecture will solve or at least ease the majority of our database issues.
[banner: Percona Live MySQL Conference, San Francisco, April 10-12th, 2012]