Is Galera trx_commit=2 and sync_binlog=0 evil?

It has been almost 5 years since I posted on my personal MySQL related blog. In the past few years I worked for Severalnines, and blogging both on their corporate blog and here would have been confusing. After that I neglected this blog a bit, but it’s time to revive it!

Speaking at Percona Live Europe – Amsterdam 2019

Why? I will be presenting at Percona Live Europe soon, and this post and the upcoming content form the more in-depth part of some background stories in my talk on benchmarking: Benchmarking should never be optional. The talk will mainly cover why you should always benchmark your servers, clusters and entire systems.

See me speak at Percona Live Europe 2019

If you wish to see me present, you can receive 20% discount using this code: CMESPEAK-ART. Now let’s move on to the real content of this post!

Innodb_flush_log_at_trx_commit=2 and sync_binlog=0

At one of my previous employers we ran a Galera cluster of 3 nodes to store all shopping carts of their webshop. Any cart operation (adding a product to the basket, removing a product from the basket or increasing/decreasing the number of items) would end up as a database transaction. With such important information stored in this database, in a traditional MySQL asynchronous replication setup it would be essential to ensure all transactions are retained at all times. To be fully ACID compliant the master would have innodb_flush_log_at_trx_commit set to 1 and sync_binlog set to 1, to ensure every transaction is written to the logs and flushed to disk. When every transaction has to wait for data to be written to the logs and flushed to disk, this limits the number of cart operations you can do.
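For reference, the two combinations look roughly like this in my.cnf (a sketch, not a complete configuration; adjust to your own setup):

```ini
[mysqld]
# Fully durable (ACID): flush and sync the InnoDB redo log on every commit,
# and fsync the binary log after every transaction.
innodb_flush_log_at_trx_commit = 1
sync_binlog                    = 1

# The relaxed combination from this post: write the redo log at commit but
# only flush it to disk about once per second, and leave binary log syncing
# to the OS. Much faster, but an OS crash can lose up to ~1 second of
# transactions.
# innodb_flush_log_at_trx_commit = 2
# sync_binlog                    = 0
```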

Somewhere in a dark past the company exceeded the number of cart operations possible on this host, and one of the engineers found a Stack Overflow post instructing how to improve the performance of MySQL by “tuning” the combination of these two variables. Naturally this solved the immediate capacity problem, but sacrificed durability at the same time. As Jean-François Gagné pointed out in a blog post, you can lose transactions in MySQL when you suffer from OS crashes. This was inevitably going to happen some day, and when that day arrived a new solution had become available: Galera!

Galera and being crash-unsafe

Galera offers virtually synchronous replication to ensure your transaction has been replicated to and certified by the other nodes in the cluster. You just spread your cluster over your entire infrastructure on multiple hosts in multiple racks. When a node crashes it will recover when rejoining and Galera will fix itself, right?

Why would you care about crash-unsafe situations?

The answer is a bit more complicated than a yes or a no. When an OS crash happens (or a kill -9), InnoDB can be further ahead than the data written to the binary logs. But Galera doesn’t use binary logs by default, right? No, it doesn’t, but it uses the GCache instead: this ring buffer file stores all committed write sets, so it acts similar to the binary logs and is affected similarly by these two variables. Also, if you have asynchronous slaves attached to Galera nodes, the node will write to both the GCache and the binary logs simultaneously. In other words: you could create a transaction gap with a crash-unsafe Galera node.

However, Galera keeps the state of the last UUID and sequence number in the grastate.dat file in the MySQL data directory. While Galera is running the file contains seqno: -1, and only upon a normal shutdown is the grastate.dat written with the last known position. So when Galera reads the grastate.dat on startup and finds seqno: -1, it will assume an unclean shutdown happened, and if the node is joining an existing cluster (becoming part of the primary component) it will force a State Snapshot Transfer (SST) from a donor. This wipes all data on the broken node, copies all data over and makes sure the joining node has the same dataset.
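A healthy grastate.dat after a clean shutdown looks roughly like this (UUID and seqno here are made-up values; after an unclean shutdown the seqno field would read -1 instead):

```ini
# GALERA saved state
version: 2.4
uuid:    8bcf4a34-aedb-14e5-bcc3-d3e36277729f
seqno:   114428
safe_to_bootstrap: 0
```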

Apart from the fact that an unclean shutdown always triggers an SST (bad if your dataset is large, but more on that in a future post), Galera pretty much recovers itself and is not much affected by being crash-unsafe. So what’s the problem?

It’s not a problem until all nodes crash at the same time.

Full Galera cluster crash

Suppose all nodes crash at the same time: none of the nodes would have been shut down properly and all nodes would have seqno: -1 in the grastate.dat. In this case a full cluster recovery has to be performed, where MySQL has to be started with the --wsrep-recover option. This will open the InnoDB header files, shut down immediately and return the last known state for that particular node.

$ mysqld --wsrep-recover
...
2019-09-09 13:22:27 36311 [Note] InnoDB: Database was not shutdown normally!
2019-09-09 13:22:27 36311 [Note] InnoDB: Starting crash recovery.
...
2019-09-09 13:22:28 36311 [Note] WSREP: Recovered position: 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114428
...

Now we have three independent Galera nodes that each suffered an unclean shutdown. This means all three may have lost transactions from up to one second before crashing. Even though the transactions committed within the cluster are theoretically the same, as the cluster crashed at the same moment in time, this doesn’t mean all three nodes have the same number of transactions flushed to disk. Most probably all three nodes have a different last sequence number, and even within the recovered range there could be gaps, as transactions are applied in parallel. Are we back at eeny-meeny-miny-moe and just pick one of these nodes?
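Not quite: the recovered positions can at least be compared. A minimal sketch (node names and positions are made up) that picks the node with the highest recovered sequence number as the bootstrap candidate:

```shell
# Recovered positions as reported by `mysqld --wsrep-recover` on each node
positions="node1 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114428
node2 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114431
node3 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114430"

# The node with the highest seqno has the most transactions flushed to disk;
# bootstrap the new cluster from that one and let the others SST from it.
best=$(echo "$positions" | awk -F'[ :]' '($3 + 0) > max { max = $3 + 0; node = $1 } END { print node }')
echo "Bootstrap from: $best"   # prints "Bootstrap from: node2"
```

This only finds the most advanced node; as noted above, gaps within the recovered range are still possible, so this is best-effort recovery rather than a guarantee that no transactions were lost.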

Can we consider Galera with trx_commit=2 and sync_binlog=0 to be evil?

Yes and no… Yes, because we have potentially lost a few transactions, so it’s bad for consistency. No, because the entire cart functionality became unavailable and carts were abandoned in all sorts of states. As the entire cluster crashed, customers couldn’t perform any actions on their carts anyway and had to wait until service was restored. Even if a customer had just finished a payment, in this particular case the next step in the cart could not have been saved due to the unavailability of the database. This means carts were abandoned and some may actually have been paid for. Even without the lost transactions we would need to recover these carts and payments manually.

So to be honest: I think it doesn’t matter that much, as long as you handle cases like this properly. If you design your application right, you catch the (database) error after returning from the payment screen and create a ticket for customer support to pick this up. Even better would be to trigger a circuit breaker and ensure your customers can’t re-use their carts until the database has been recovered. Another approach would be to scavenge data from various sources and double-check the integrity of your system.

The background story

Now why is this background to my talk, when it doesn’t have anything to do with benchmarking? The actual story in my presentation is about a particular problem around hyperconverging an (existing) infrastructure. A hyperconverged infrastructure will sync every write to disk to at least one other hypervisor in the infrastructure (via the network), to ensure that if a hypervisor dies you can quickly spin up a new node on a different hypervisor. As we have learned from the above: the data on a crashed Galera node is unrecoverable and will be deleted during the joining process (SST). This means it’s useless to sync Galera data to another hypervisor in a hyperconverged infrastructure. And guess what the risk is if you hyperconverge your entire infrastructure into a single rack? 😆

I’ll write more about the issues with Galera on a hyperconverged infrastructure in the next post!


Presenting Sphinx Search at Percona Live London

It has been a while since I posted on this blog. The more in-depth articles are all posted on the Spil Games Engineering blog, and I have been so overloaded with work in the past months that I hardly had any time left.

One of the reasons for having too much on my plate was my own doing: I volunteered to be a conference committee member for Percona Live London and we, as a committee, worked our way through all proposals while you all were enjoying your holidays in the sun. Well, I must admit I did review a couple of them sitting in the sun, next to the pool enjoying a cold drink. 😉
I must say there were so many great proposals it was really hard to choose which ones would be left out.

I also proposed a talk for Percona Live London this year and my fellow committee members liked it enough to make it to the final schedule: Serve out any page with an HA Sphinx environment.
At its core it is a MySQL case study where I will show how we at Spil Games use Sphinx Search in three different use cases: our main search, friend search and serving out our content on any page. I’ll also describe how we handle high availability, how we handle index creation, and show benchmark results between MySQL and Sphinx in various use cases.

In case you are interested in Sphinx or the benchmark results: the session will be on the 4th of November at 3:10pm – 4:00pm in Cromwell 1 & 2.
Also don’t hesitate to ask me things when I’m wandering around in the corridors and rooms. Or maybe we’ll meet at the MySQL Community Dinner?
See you next week!

Presenting at FOSDEM and Percona Live

Very short update from my side: I’ll be presenting at FOSDEM in Brussels (1-2 February 2014) and Percona Live MySQL Conference in Santa Clara (1-4 April 2014).

FOSDEM
At FOSDEM I will present about Galera replication in real life, which concentrates on two use cases for Galera: adding new clusters to our sharded environment and migrating existing clusters into a new Galera cluster.

Percona Live MySQL Conference and Expo, April 1-4, 2014
At Percona Live MySQL Conference I will present about our globally distributed storage layers. Next to our globally sharded environment we have built a new environment called ROAR (Read Often, Alter Rarely) that also needs to be distributed globally.

Both are interesting talks and I really look forward to presenting at these great conferences. So if you have the ways and means to attend either one: you should!

Percona Live London is already next Monday!

I’m currently wrapping up last things in the office to be prepared for my presentation about MySQL-Statsd at the Percona Live London conference next week. It will be a revised version of the talk I gave at the Percona Live Conference & Expo in Santa Clara, but this time it will be more focused on the practical side and, obviously, on the MySQL-Statsd daemon we open sourced recently. So if you missed that talk or think a follow-up is needed, you should definitely attend! Slides will become available immediately after the talk.

I’m also looking forward to attending a lot of other talks, for instance The Shard Revisited: Tools and Techniques Used at Etsy and MySQL, the eBay Classifieds Way, and of course the tutorials.

It is great to see the bus trip to the community dinner organized by MariaDB. I would not really mind taking the tube for 17 minutes, but getting a bus ride in a Routemaster is obviously a lot more fun and comfortable!

So see you at the conference next Monday and Tuesday!

Spil Games MySQL StatsD project on Github

Yesterday evening I already announced the happy news on the Spil Games Engineering blog, you can find our MySQL StatsD daemon in our Github repository here:
https://github.com/spilgames/mysql-statsd

It is far from perfect and the todo list is still long. As it is a public repo, you are welcome to collaborate with us and make it even better. I can’t wait to see the first pull request coming in!

You can also talk to us directly at Percona Live London, where we will bring a revised version of our Santa Clara talk. This time it will be aimed more at practicality: it will contain fewer comparisons with existing tools, be less theoretical and focus more on how to do it.

A bit of a shame to announce it this way: I tried several times to add the Spil Games Engineering feed to the Planet MySQL feed adder, but it fails on the feed validation and suggests validating it via feedvalidator.org. However, feedvalidator.org validates it perfectly fine as an RSS 2.0 feed. I used the Planet MySQL feedback form twice but got no reply, so I think I will file a bug report later today.

Percona Live London 2012 slides available

Many thanks to all those who attended my talk at the Percona Live London 2012 conference!
I did put the location in the last slide, but just in case you missed the last slide (or missed my talk) you can find them here:
http://spil.com/perconalondon2012

I did receive a couple of questions afterwards (in the hallways of the conference) that made me realize that I forgot to clear up a couple of things.

First of all, the essence of shifting the data ownership of a specific GID towards a specific datacenter while ensuring data consistency is that one Erlang process within that very same datacenter is the owner of that data. This also means this Erlang process is the only one that can write to the data of this GID. Don’t worry: for every GID there should be a process that is the data owner, and Erlang should be able to cope with the enormous scale here.

Second of all, the whole purpose of the satellite datacenter (all virtualized) is to have a disposable datacenter, while the master datacenter (mostly virtualized, except for storage) is permanent. Imagine that, next to the existing presence (master or satellite DC) in one country, we expect big growth due to the launch of a new game: we could easily create a new satellite datacenter by getting a couple of machines in the cloud. This way our hybrid cloud can easily be expanded, either with virtuals or with datacenters. I thought this was a bit too off-topic, but apparently it raised some questions.

If you have any questions, don’t hesitate to ask! 🙂

The Percona Live MySQL Conference 2012

Thank you very much if you attended my session at the Percona Live MySQL Conference!
I promised some people to share my slides, so I posted them on the page at Percona:
Spil Games: Outgrowing an internet startup (Percona Live MySQL Conference 2012) on SlideShare
Click here if you need a direct link

My opinion of the conference is that it was amazing! The conference was very well organized, the atmosphere was great and I met so many great people that I had a tough time remembering all their names and companies. The contents of all talks were really well balanced and most of the ones I attended were very interesting.

The most interesting talk of the conference was the Scripting MySQL with Lua and libdrizzle inside Nginx. It was a shame only a few people attended the talk and that they ran out of time before they could complete the presentation. 😦

Apart from that I had a really great time and hope to see you all next year! (or later this year in London)

Speaking at Percona Live MySQL conference

Suitcase packed? Check!
Laptop? Check!
Presentation? Check!
I’m ready for my departure to San Francisco tomorrow morning!

I already mentioned before that I will be a speaker at the MySQL conference, but the session has moved since. It is now scheduled Thursday between 1:00 and 1:50pm in ballroom E. Be there if you want to know more about what Spil Games is doing!

I also determined what the most interesting talks are going to be for me and here are some of the highlights:
One to Many: The Story of Sharding at Box (Wed 1:00 – 1:50pm)
Sounds very interesting to see how different their story is from the one at Spil Games.

The Etsy Shard Architecture: Starts with S and Ends With Hard (Wed 2:00 – 2:50)
Same as above but then with the difference that, from the description, it seems they are implementing almost the same solution as we are. 😀

Scripting MySQL with Lua and libdrizzle inside Nginx (Wed 3:30 – 4:20pm)
A very interesting thought: combining Nginx with Lua and a database connection through libdrizzle. It seems you can easily implement lightweight services this way. So definitely a recommended session!

Percona XtraDB Cluster: New HA solution (Thu 11:00 – 11:50)
Percona XtraDB Cluster came to me as a complete surprise earlier this year. I’ve been playing around with it a little bit and now that it has gone GA last week I’m even more anxious to attend this session. I think it could be a good candidate to become one of the building blocks for Spil Games in the future.

Common Schema: a framework for MySQL server administration (Thu 2:00 – 2:50)
I haven’t done much with Common Schema so far and it is already available on our platform so I think it would be a good idea to attend this session and get more practical insights.

So I’m off to SF in about 19 hours. If you are also attending the conference: see you there!