Hyperconverging and Galera cluster

What is hyperconverging?

Hyperconverging is the latest hype: do things more efficiently with the resources you have by cramming as many virtual machines as possible onto the same hypervisor. In theory this should allow you to mix and match various workloads to make optimum use of your hypervisors (e.g. all cores used 100% of the time, overbooking your memory up to 200%, moving virtual machines around like there is no tomorrow). Every cloud provider is hyperconverging their infrastructure, and this has pros and cons. The pro is that it’s much cheaper to run many different workloads; the con clearly shows when you encounter noisy neighbors. As Jeremy Cole said: “We are utilizing our virtual machines to the max. If you are on the same hypervisor as us, sorry!”

Apart from cloud providers, you could hyperconverge your infrastructure yourself. There are a few hardware/software vendors out there that will help you with that and at one of my previous employers we got a helping hand from one such vendor!

DIY hyperconverging

In our case the entire infrastructure was migrated to a new hyperconverged infrastructure consisting of multiple infrastructure clusters (read: four hypervisors in one chassis) in multiple data centers. Infra marked one of these DCs as suitable for our customer facing projects, as the peering was performed in that DC. The idea behind this new infrastructure is that a VM can basically run anywhere in your infrastructure and be copied in real time to another hypervisor within the same cluster (read: chassis). This copy process (including memory) obviously requires some (short) locking, but it worked amazingly well. We even had some software running that would move VMs around to optimize the workloads and still retain some spare capacity. Magic!

There was an additional benefit to choosing this vendor: if a hypervisor goes down, the same VM can be spun up immediately on another hypervisor, albeit without copying the memory contents. To make this possible, the disks are synced to at least one other hypervisor. Some cluster magic detects that one of the hypervisors is down and automagically spins up the same VMs on another (available) hypervisor that contains the latest data of those VMs. To spread the load among the various hypervisors, the advised replication factor for the disks is 2, where 2 means every disk is copied to (at least) two other hypervisors.

Hyperconverging Galera

Our Galera cluster consisted of three Galera nodes with three asynchronous read replicas attached (see image below).

Galera cluster with read slaves

In this picture every Galera node stores every transaction in the GCache, InnoDB flushes the transaction to disk (ibdata*) and asynchronous replication dictates another write to the binlogs. That means every transaction in a Galera node is already stored three times on disk.

The hyperconverged cluster where we hosted Galera had the replication factor set to 2. That means every byte written to disk will be written to at least two other storage controllers (VMs), as shown in the image below. This write operation over the network is synchronous, so the filesystem has to wait until both controllers have acknowledged the write. The latency of a single write is negligible, as the write itself is fast and performed over a low latency network. The magic behind this synchronous disk replication is out of scope for this blog post, but I can hint that a certain NoSQL database (named after some Greek mythology) is managing the storage layer.

Hyperconverge write amplification: every write to disk will be written three times!

This means every write to disk in our Galera nodes will also be synced to two additional hypervisors. To make matters worse, due to the (semi-)synchronous replication, all three Galera nodes perform the exact same operations at (almost) the exact same time!

1 transaction = 3 nodes × (3 local writes + 6 network writes) = 27 writes

As you can deduce from the simple formula above: 9 writes are performed locally and 18 writes are performed over the network. As every write to disk is replicated synchronously over the network, a single transaction adds a bit more than negligible latency when it spawns 18 network writes at the same time. And if 1 transaction can cause 18 synchronous writes over the network, imagine the latency you will encounter with a baseline of 200 transactions per second! And we’re not even counting the asynchronous replicas performing similar write operations again mere (milli)seconds later!
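To put a number on that baseline (a back-of-the-envelope calculation, nothing more):

200 transactions/s × 27 writes = 5,400 writes/s in total, of which 200 × 18 = 3,600 writes/s have to be acknowledged synchronously over the network.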

Galera managed to cope, but instability appeared at set intervals. We could trace these back to our so-called stock updates and pricing updates: every half hour stock levels were pushed from the warehouse database, and every few hours new pricing information was pushed via the enterprise service bus.

With more than a million products in the database, these torrents of writes quickly caused disk latency in the entire hyperconverged cluster, and we saw disk latency shoot up well beyond 80ms. At that point it no longer affected just the Galera cluster: it was causing cluster wide issues on the distributed storage layer as well. And to make matters even worse, latency on the entire network was also shooting up.

Benchmarking semi-synchronously replicated hyperconverged clusters

At first nobody believed us, even when we showed the graphs to the vendor. This new infrastructure was so much more expensive than our old one that it simply couldn’t be true. Only after conducting benchmarks, reproducing the latency on an empty test cluster, were we taken seriously. The benchmarks revealed that the write amplification saturated the network interfaces of the cluster, and we worked with the vendor on a solution to the problem. Even after upgrading the network (10G interface bonding, enabling jumbo frames, hypervisor tuning) we still saw latency issues.
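To give an idea of the kind of benchmark involved: a write-heavy sysbench run against an otherwise idle test cluster is enough to reproduce this class of problem. The exact tool and parameters we used were specific to our setup, so treat the command below purely as an illustrative sketch (hostname and credentials are made up):

$ sysbench oltp_write_only \
    --mysql-host=galera-test-1 --mysql-user=sbtest --mysql-password=... \
    --tables=16 --table-size=1000000 \
    --threads=64 --time=600 --report-interval=10 \
    run
# (run the same command with 'prepare' instead of 'run' first to create the test tables,
#  and watch the hypervisor network interfaces and disk latency while the benchmark runs)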

The core issue with our hyperconverged cluster was that there was no (separate) internal network handling the inter-hypervisor traffic. Sure, after these upgrades we could achieve double the number of transactions, but that didn’t solve the underlying problem: the write amplification still caused latency on other VMs and on the ingress and egress traffic of our applications.

Conclusion

We came to the conclusion that (semi-)synchronously replicated databases and hyperconverged infrastructures with high replication factors don’t match. Unfortunately the replication factor could only be set at cluster level and not per individual VM. The reasoning behind the synchronous disk replication also did not make sense for Galera (see also my previous blog post): Galera would wipe the disk contents of a crashed node anyway, and in general it takes quite some time for a database to recover, so a quick failover of the VM would not happen anyway. That’s why we ran Galera+ProxySQL in the first place: to allow a failover to happen within seconds!

We also ran other (semi-)synchronously replicated datastores, such as MongoDB, SOLR and Elasticsearch, and each and every one of them had basically the same lack of need for disk replication.

The only option left was to migrate the Galera cluster back to our old hardware which, luckily/sadly, was still switched on. At the same time we started working on a migration to a real cloud vendor, as they could offer us better performance without the risk of a single point of failure (e.g. a single data center).

So what difference would a benchmark up front have made?

This only happened because of bad requirements, drawn up without analyzing the workload that was supposed to be converged. If we had benchmarked beforehand, we would have seen these issues before migrating to the new hyperconverged infrastructure. It would have saved us from many instabilities, outages and post mortems. We might even have chosen a totally different setup, or have split our workloads over multiple (smaller) hyperconverged clusters.

This is one of the background stories of my talk Benchmarking Should Never Be Optional on Wednesday 2nd of October 2019 at Percona Live Europe in Amsterdam.


In my talk I will feature a few cases showing why you should always benchmark your systems up front. It’s not only about database benchmarking; in some cases the entire system requires benchmarking.

Is Galera trx_commit=2 and sync_binlog=0 evil?

It has been almost 5 years since I posted on my personal MySQL-related blog. In the past few years I worked for Severalnines, and blogging both on their corporate blog and here would have been confusing. After that I forgot and neglected this blog a bit, but it’s time to revive it!

Speaking at Percona Live Europe – Amsterdam 2019

Why? I will be presenting at Percona Live Europe soon, and this blog’s upcoming content covers the more in-depth background stories behind my benchmarking talk: Benchmarking should never be optional. The talk will mainly cover why you should always benchmark your servers, clusters and entire systems.


If you wish to see me present, you can receive 20% discount using this code: CMESPEAK-ART. Now let’s move on to the real content of this post!

Innodb_flush_log_at_trx_commit=2 and sync_binlog=0

At one of my previous employers we ran a Galera cluster of 3 nodes to store all shopping carts of their webshop. Any cart operation (adding a product to the basket, removing a product from the basket or increasing/decreasing the number of items) would end up as a database transaction. With such important information stored in this database, in a traditional MySQL asynchronous replication setup it would be essential to ensure all transactions are retained at all times. To be fully ACID compliant the master would have innodb_flush_log_at_trx_commit set to 1 and sync_binlog set to 1 (not 2 and 0, as this post originally stated; correction by Przemek Malkowski) to ensure every transaction is written to the logs and flushed to disk. When every transaction has to wait for data to be written to the logs and flushed to disk, this will limit the number of cart operations you can do.

Somewhere in a dark past the company hit the ceiling on the number of cart operations possible on this host, and one of the engineers found a Stack Overflow post explaining how to improve the performance of MySQL by “tuning” the combination of these two variables. Naturally this solved the immediate capacity problem, but it sacrificed consistency at the same time. As Jean-François Gagné pointed out in a blog post, with these settings you can lose transactions in MySQL when you suffer an OS crash. This was inevitably going to happen some day, and when that day arrived a new solution had become available: Galera!
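For reference, this is what the two combinations look like in my.cnf, limited to the two variables under discussion (a minimal sketch, not a full configuration):

# fully durable: redo log flushed and binlog synced on every commit
innodb_flush_log_at_trx_commit = 1
sync_binlog                    = 1

# the “tuned” combo: redo log flushed roughly once per second, binlog syncing left to the OS,
# so an OS crash can lose up to about a second of committed transactions
innodb_flush_log_at_trx_commit = 2
sync_binlog                    = 0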

Galera and being crash-unsafe

Galera offers semi-synchronous replication to ensure your transaction has been committed on the other nodes in the cluster. You just spread your cluster over your entire infrastructure on multiple hosts in multiple racks. When a node crashes it will recover when rejoining and Galera will fix itself, right?

Why would you care about crash-unsafe situations?

The answer is a bit more complicated than a simple yes or no. When an OS crash happens (or a kill -9), InnoDB can be further ahead than what has been written to the binary logs. But Galera doesn’t use binary logs by default, right? No it doesn’t, but it uses the GCache instead: this file stores all committed transactions in a ring buffer, so it acts much like the binary logs and is affected by these two variables in a similar way. Also, if you have asynchronous slaves attached to a Galera node, that node will write to both the GCache and the binary logs simultaneously. In other words: you could create a transaction gap with a crash-unsafe Galera node.

However, Galera keeps the state of the last UUID and sequence number in the grastate.dat file in the MySQL data directory. While Galera is running this file contains seqno: -1; the real position is only written to grastate.dat upon a normal shutdown. So when Galera starts up, reads grastate.dat and finds seqno: -1, it assumes an unclean shutdown happened, and if the node is joining an existing cluster (becoming part of the primary component) it will force a State Snapshot Transfer (SST) from a donor. This wipes all data on the broken node, copies all data over and makes sure the joining node has the same dataset.
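For illustration, this is roughly what grastate.dat looks like while the node is running or after an unclean shutdown (the uuid is copied from the recovery example further down purely as an example, and the safe_to_bootstrap field only exists in more recent Galera versions):

# GALERA saved state
version: 2.4
uuid:    8bcf4a34-aedb-14e5-bcc3-d3e36277729f
seqno:   -1
safe_to_bootstrap: 0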

Apart from the fact that an unclean shutdown always triggers an SST (bad if your dataset is large, but more on that in a future post), Galera pretty much recovers itself and is not that much affected by being crash-unsafe. So what’s the problem?

It’s not a problem until all nodes crash at the same time.

Full Galera cluster crash

Suppose all nodes crash at the same time: none of the nodes will have been shut down properly and all nodes will have seqno: -1 in their grastate.dat. In this case a full cluster recovery has to be performed, where MySQL is started with the --wsrep-recover option. This will open the InnoDB header files, shut down immediately and return the last known state for that particular node.

$ mysqld --wsrep-recover
...
2019-09-09 13:22:27 36311 [Note] InnoDB: Database was not shutdown normally!
2019-09-09 13:22:27 36311 [Note] InnoDB: Starting crash recovery.
...
2019-09-09 13:22:28 36311 [Note] WSREP: Recovered position: 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114428
...

Now we have three independent Galera nodes that each suffered an unclean shutdown. This means all three may have lost transactions from up to one second before crashing. Even though the transactions committed within the cluster are theoretically the same, as the whole cluster crashed at the same moment in time, this doesn’t mean all three nodes have the same number of transactions flushed to disk. Most probably all three nodes will recover a different last sequence number, and even then there could be gaps, as transactions are applied in parallel. Are we back at eeny-meeny-miny-moe where we just pick one of these nodes?
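In case you are wondering how to get out of this situation: the usual approach is to recover the position on every node, bootstrap the node with the highest seqno and let the other nodes rejoin (and SST) from it. A rough sketch, assuming a recent enough Galera that has safe_to_bootstrap; paths and service wrappers differ per distribution:

# on every node: recover the last committed position
$ mysqld --wsrep-recover 2>&1 | grep 'Recovered position'

# on the node with the highest seqno: allow it to bootstrap a new cluster
$ sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
$ galera_new_cluster   # or: mysqld --wsrep-new-cluster

# start MySQL normally on the remaining nodes; they will rejoin the cluster (and SST if needed)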

Can we consider Galera with trx_commit=2 and sync_binlog=0 to be evil?

Yes and no… Yes, because we have potentially lost a few transactions, so it’s bad for consistency. No, because the entire cart functionality became unavailable and carts were abandoned in all sorts of states anyway. As the entire cluster crashed, customers couldn’t perform any actions on their carts and had to wait until service had been restored. Even if a customer had just finished a payment, in this particular case the next step in the cart could not have been saved due to the unavailability of the database. This means carts have been abandoned while some may actually have been paid for. Even without the lost transactions we would need to recover these carts and payments manually.

So to be honest: I think it doesn’t matter that much, as long as you handle cases like this properly. If you design your application right, you catch the (database) error after returning from the payment screen and create a ticket for customer support to pick up. Even better would be to trigger a circuit breaker and ensure your customers can’t re-use their carts until the database has been recovered. Another approach would be to scavenge data from various sources and double check the integrity of your system.

The background story

Now why is this background to my talk, when it doesn’t seem to have anything to do with benchmarking? The actual story in my presentation is about a particular problem with hyperconverging an (existing) infrastructure. A hyperconverged infrastructure syncs every write to disk to at least one other hypervisor in the infrastructure (via the network), to ensure that if a hypervisor dies you can quickly spin up a new node on a different hypervisor. As we have learned above: the data on a crashed Galera node is unrecoverable and will be deleted during the joining process (SST). This means it’s useless to sync Galera data to another hypervisor in a hyperconverged infrastructure. And guess what the risk is if you hyper-converge your entire infrastructure into a single rack? 😆

I’ll write more about the issues with Galera on a hyperconverged infrastructure in the next post!

Presenting at FOSDEM and Percona Live

Very short update from my side: I’ll be presenting at FOSDEM in Brussels (1-2 February 2014) and Percona Live MySQL Conference in Santa Clara (1-4 April 2014).

FOSDEM
At FOSDEM I will present Galera replication in real life, which concentrates on two use cases for Galera: adding new clusters to our sharded environment and migrating existing clusters into a new Galera cluster.

Percona Live MySQL Conference and Expo, April 1-4, 2014
At the Percona Live MySQL Conference I will talk about our globally distributed storage layers. Next to our globally sharded environment we have built a new environment called ROAR (Read Often, Alter Rarely) that also needs to be distributed globally.

Both are interesting talks and I really look forward to presenting at these great conferences. So if you have the ways and means to attend either one: you should!

Organizing the MySQL UG NL on 22nd of February

It has been a bit of a rollercoaster ride for us since the Percona Live London post. The team has expanded with two new DBAs, I was invited to give a talk at the Percona Live Conference & Expo 2013 in April, and at the same time Spil Games is organizing the second MySQL User Group NL meeting on Friday the 22nd of February. I did not realize it is already next week and I never posted about it, so here it is!

The meeting schedule is as follows:
17:00 Spil Games Pub Open
18:00 Introduction
18:15 “MySQL User Defined Functions” by Roland Bouman
19:00 Pizza
20:00 “Total cost of ownership” by Zsolt Fabian (Spil Games)

Before and after the meeting, drinks and snacks will be served in our pub. You can chat with others, mingle with the Spil Games employees or, if you are very shy, play some pool/foosball/pinball.

I’m happy we are presenting the TCO talk at this User Group meeting. Zsolt will show his findings on several things you need to keep in mind if you wish to calculate your TCO, so it will be more of a general guide on how to do it yourself. Of course we will share some of our own WTFs/facepalms and other interesting facts we found during our own investigation. 😉

In case you are attending, there are several ways to get to the Spil Games HQ:
If you travel by car, just punch our address into your navigation:
Arendstraat 23
1223 RE Hilversum
Do take note that our entrance has moved to the new building on our campus, behind the nicely graffiti-painted doors.

The second option is coming by public transport.
Coming from the direction of Amsterdam/Amersfoort:
Take the train to Hilversum (central station) and either walk to our new office using Google Maps (about a 15 minute walk), or take bus #2 (towards Snelliuslaan), hop off at the Minckelersstraat (ask the driver) and walk the remaining few hundred meters.

Coming from the direction of Utrecht:
Take the train to Hilversum Sportpark and walk to our new office using Google Maps (an 8 to 10 minute walk).

Hope to see you all next Friday at the Spil HQ! 🙂

Percona Live London 2012 slides available

Many thanks to all those who attended my talk at the Percona Live London 2012 conference!
I did put the location in the last slide, but just in case you missed the last slide (or missed my talk) you can find them here:
http://spil.com/perconalondon2012

I did receive a couple of questions afterwards (in the hallways of the conference) that made me realize that I forgot to clear up a couple of things.

First of all, the essence of shifting the data ownership of a specific GID towards a specific datacenter while ensuring data consistency is that one Erlang process within that very same datacenter is the owner of that data. This also means this Erlang process is the only one that can write to the data of this GID. Don’t worry: for every GID there should be a process that is the data owner, and Erlang should be able to cope with the enormous scale here.

Second of all, the whole purpose of the satellite datacenters (all virtualized) is to have disposable datacenters, while the master datacenter (mostly virtualized, except for storage) is permanent. Imagine that, next to the existing presence (master or satellite DC) in one country, we expect big growth due to the launch of a new game: we could easily create a new satellite datacenter by getting a couple of machines in the cloud. This way our hybrid cloud can easily be expanded, either with virtuals or with datacenters. I thought this was a bit too offtopic, but apparently it raised some questions.

If you have any questions, don’t hesitate to ask! 🙂

The Percona Live MySQL Conference 2012

Thank you very much if you attended my session at the Percona Live MySQL Conference!
I promised some people to share my slides, so I posted them on the page at Percona:
Spil Games: Outgrowing an internet startup (Percona Live MySQL Conference 2012) on SlideShare
Click here if you need a direct link

My opinion of the conference is that it was amazing! The conference was very well organized, the atmosphere was great and I met so many great people that I had a tough time remembering all their names and companies. The contents of all talks were really well balanced and most of the ones I attended were very interesting.

The most interesting talk of the conference was Scripting MySQL with Lua and libdrizzle inside Nginx. It was a shame only a few people attended the talk and that they ran out of time before they could complete the presentation. 😦

Apart from that I had a really great time and hope to see you all next year! (or later this year in London)

Running multiple MySQL instances in parallel

I know, I haven’t been posting much lately. The 5.5 upgrades got postponed because the new storage platform needed my immediate attention, and being a speaker at the Percona Live conference in April also requires a lot of preparation.

One of the things I want to try out is running multiple MySQL instances on the same machine. The concept has remained in the back of my mind ever since I attended Ryan Thiessen’s presentation at the MySQL Conference 2011, but we never had a proper use case for it. Well, with the new storage platform it would be really beneficial, so it’s an excellent use case to try it out! So what have I been busy with in the past week? That’s right: running multiple MySQL instances on one single server. 😉

Even though it is not well documented and nobody describes the process in depth, it is not that complicated to get multiple instances running next to each other. However, it does involve a lot of changes in the surrounding tools, scripts and monitoring. For example, this is what I have changed so far:

1. MySQL startup script
Yes, you really want this baby to support multiple instances. I’ve learned my lesson from the proliferation of copies of the various MMM init scripts.

2. Templating of configs
If you want to maintain the instances well you should definitely start using a fixed template which includes a defaults file. In our case I created one defaults file for all instances, and each and every instance overrides the settings of the defaults file. Some tuning parts are now also separated from the main config (see the sketch after this list for an illustration of the idea).

3. Automation of adding new instances to a host
Apart from creating a bunch of config files and a data directory, you really want some intelligence when adding another instance. For example, only the innodb_buffer_pool_size needs to be adjusted for each new instance you add.

4. Automation of removing instances from a host
Part of the step above: if you can add instances, then you need to be able to remove an instance too. This should be done with care as it is destructive. 😉
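To make the shared-defaults-plus-overrides idea concrete, here is a minimal sketch using mysqld_multi. We rolled our own startup script and templating, so treat this purely as an illustration of the concept, not as what we actually run (paths and sizes are made up):

# /etc/my.cnf (simplified)
[mysqld_multi]
mysqld     = /usr/bin/mysqld_safe
mysqladmin = /usr/bin/mysqladmin

[mysqld1]
port                    = 3307
socket                  = /var/run/mysqld/mysqld1.sock
datadir                 = /var/lib/mysql-3307
innodb_buffer_pool_size = 4G

[mysqld2]
port                    = 3308
socket                  = /var/run/mysqld/mysqld2.sock
datadir                 = /var/lib/mysql-3308
innodb_buffer_pool_size = 4G

$ mysqld_multi start 1,2
$ mysqld_multi report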

After this there is still a long extensive list of things to be taken care of:
1. Automation of replication setup
The plan is to keep things simple and have two hosts replicate all instances to each other. So instance 3307 on host 1 will replicate to instance 3307 on host 2 (and back), instance 3308 on host 1 will replicate to instance 3308 on host 2 (and back), and so on (see the sketch after this list).

2. HA Monitoring needs thinking/replacement
I haven’t found a HA Monitoring tool that can handle multiple instances on one host.
Why is this a problem?
If only one of the MySQL instances needs maintenance you can’t use the current tools, unless you are willing to make all other instances unavailable as well. Also, what will you do when the connection pool of one instance gets exhausted? Or if the same instance dies on both servers?

3. Backup scripts needs some changes
Obviously our backup tools (wrapper scripts around xtrabackup) need some alteration. We are now running multiple instances, so we need to back up more than one database.

4. Cloning scripts need some changes
We have a script that can clone a live database (utilizing xtrabackup) to a new host. Apart from the fact that it assumes it needs to clone only one single database, we might also go for full cloning of all instances.

5. Monitoring needs to understand multiple instances
Our current (performance) monitoring tools, like Nagios/Cacti/etc, assume only one MySQL instance per host. At best I can apply the templates multiple times, but that also increases the number of other checks by the same factor.
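As a sketch of what the replication pairing from point 1 looks like in practice (hypothetical hostnames and credentials, classic pre-GTID syntax as used on 5.1/5.5):

-- on host2, instance 3307, pointing back at the same instance number on host1
CHANGE MASTER TO
  MASTER_HOST='host1',
  MASTER_PORT=3307,
  MASTER_USER='repl',
  MASTER_PASSWORD='...',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4;
START SLAVE;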

And there are obviously a lot more things I haven’t thought of yet. As you can see I’ll be quite occupied in the upcoming period…

MySQL 5.5 upgrade blues (part two)

Shortly after the MySQL 5.5 upgrade the whole cluster was upgraded with extra RAM. This was a nice test to see how differently 5.1 and 5.5 behave when A) the InnoDB buffer pool is too small and B) the InnoDB buffer pool has enough room to fit everything in memory.

The MySQL 5.5 node showed just the same pattern in terms of disk utilization as the other nodes, both before the upgrade (around 30% to 40%) and after the upgrade (4% to 5%), so not much difference at all. However, the number of free pages within the buffer pool is significantly lower (about 10%) than on the other nodes. This definitely needs some further investigation.

Apart from that the machine is stable and it seems we will proceed with the upgrade on the whole cluster soon.

A sidenote: I’m happy to announce that I was selected as a speaker at the Percona Live MySQL Conference & Expo in San Francisco, April 2012. I’ll be talking about Spil Games (the company I work for) and how our new architecture will solve or ease the majority of our database issues.