Putting Galera SST Compression on the benchmark

I mentioned in my Galera crash (un)safe post that it’s bad if a SST is triggered on a large dataset. On a Galera node an SST will simply wipe the contents of the MySQL data directory and copy a snapshot from another node in the cluster. For consistency this is a favorable thing to do, but from an HA perspective it isn’t: one node in the cluster is unavailable and another one is acting as donor. This means that in a three-node Galera cluster only two nodes are available, one of which has “degraded” performance.

This can have quite a high impact. For instance, the proxysql_galera_checker script only selects synced nodes to receive traffic. By default donor nodes will not receive any traffic, so the cluster now has only one node left to serve traffic. Don’t worry: if there are no synced nodes available the donor node will still be selected as available, so if you only have one node left (bootstrapping) it can still serve traffic. This, obviously, isn’t a situation you want to be in.
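The selection rule described above can be sketched in a few lines of Python. This is not the actual proxysql_galera_checker script, only an illustration of its fallback behaviour; the function name and the host names are made up for the example. The state codes are Galera's wsrep_local_state values (2 = Donor/Desynced, 4 = Synced).

```python
SYNCED = 4   # wsrep_local_state value for a Synced node
DONOR = 2    # wsrep_local_state value for a Donor/Desynced node


def select_nodes(states):
    """Return the hosts that should receive traffic.

    `states` maps a host name to its wsrep_local_state value.
    Synced nodes are preferred; donors are only used as a last resort.
    """
    synced = [host for host, state in states.items() if state == SYNCED]
    if synced:
        return sorted(synced)
    # No synced node left: fall back to donors so the cluster keeps serving.
    return sorted(host for host, state in states.items() if state == DONOR)


if __name__ == "__main__":
    # Node 4 is bootstrapped and acts as donor, node 1 is still joining (state 1).
    print(select_nodes({"node1": 1, "node4": DONOR}))       # ['node4']
    # Once node 1 has synced, the donor drops out of the pool again.
    print(select_nodes({"node1": SYNCED, "node4": DONOR}))  # ['node1']
```

The second call shows the side effect discussed later in this post: as soon as one node is synced, a node that is still acting as donor is removed from the pool.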

Partial failure on Galera

Now imagine we have a multi DC Galera cluster, as pictured below:

galera-multi-dc-small

A power failure happens in DC #1 and nodes 1 to 3 are all gone. As the nodes in DC #2 are not a majority anymore, they shut down out of safety. As I wrote before in the Galera crash (un)safe blog post: a crash-unsafe Galera cluster will force SST to happen on crashed nodes, and this is exactly what happened to the nodes in DC #1. The most favorable node to recover from is a cleanly shut down node in DC #2.

The bootstrap cluster automation performing the recovery of the cluster ran in sequential order through its configured hosts. Unfortunately the first node in the list was an unclean node in DC #1, forcing SST to happen on this host. If the other host in DC #2 had been chosen instead, a simple IST would have sufficed and the second node in the cluster would have been brought up within minutes. Now the automation was waiting for the SST on the node in DC #1 and this caused a delay of 4 hours before the next node in the automation could be brought up.

Painful GCache durability

Durability of the GCache on a Galera node is calculated by looking up the first and last item in the ring buffer and then calculating the time between the two. A Galera incremental state transfer (IST) can only happen if the delta of writesets requested by the joining node is present in the GCache on the donor node. This durability is basically the time window in which you are allowed to perform an IST from this node, and it is one of the metrics to keep an eye on. Our GCache durability was somewhere in the range of 45 minutes.
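If you cannot inspect the ring buffer itself, a rough estimate of the same window can be derived from the write rate. The sketch below is an approximation and not the method described above: it divides the GCache size by the number of writeset bytes per second, taken from two samples of the wsrep_replicated_bytes and wsrep_received_bytes status counters. The function name and the sample numbers are made up for illustration.

```python
def gcache_durability_seconds(gcache_bytes, sample_a, sample_b, interval_seconds):
    """Estimate how long the GCache can hold writesets at the observed write rate.

    Each sample is a dict with the counters wsrep_replicated_bytes and
    wsrep_received_bytes, taken `interval_seconds` apart on the same node.
    """
    def total(sample):
        return sample["wsrep_replicated_bytes"] + sample["wsrep_received_bytes"]

    written = total(sample_b) - total(sample_a)
    if written <= 0:
        # No writes observed: the buffer is not being recycled at all.
        return float("inf")
    bytes_per_second = written / interval_seconds
    return gcache_bytes / bytes_per_second


if __name__ == "__main__":
    # Hypothetical samples taken 60 seconds apart on a node with a 4GB GCache.
    first = {"wsrep_replicated_bytes": 10_000_000, "wsrep_received_bytes": 50_000_000}
    second = {"wsrep_replicated_bytes": 40_000_000, "wsrep_received_bytes": 115_000_000}
    seconds = gcache_durability_seconds(4 * 1024**3, first, second, 60)
    print(f"estimated GCache durability: {seconds / 60:.0f} minutes")
```

An estimate like this only holds while the write rate stays constant, so sample it during your busiest period (in our case: during the stock and pricing updates).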

When binary logging is set to row based replication and binlog_row_image is set to full (default), the data written to the binlog will contain the previous row state and the updated values. The same applies to the GCache ringbuffer: when set to full the entire previous state of the row will be written to the GCache.

On this particular cluster one of the tables looked similar to this:

CREATE TABLE product (
  pid BIGINT NOT NULL AUTO_INCREMENT,
  title varchar(255) NOT NULL,
  description TEXT NOT NULL,
  specifications TEXT NOT NULL,
  stock INT NOT NULL DEFAULT 0,
  price INT NOT NULL DEFAULT 0,
  image_1 VARCHAR(255) NOT NULL,
  image_2 VARCHAR(255) NOT NULL,
  image_3 VARCHAR(255) NOT NULL,
  thumbnail_1 VARCHAR(255) NOT NULL,
  thumbnail_2 VARCHAR(255) NOT NULL,
  thumbnail_3 VARCHAR(255) NOT NULL,
  ...
  another 50+ columns
  ...
PRIMARY KEY(pid)) ENGINE=InnoDB;

You might think a SST isn’t a big deal, but if you have a data size of 1TB it will take about 3 hours to transfer on a 1Gb network or, if you’re lucky, 21 minutes on a 10Gb network. The connection between DC #1 and #2 was a VPN that was limited in bandwidth (about 400Mbps) and moving the 400GB snapshot between node 4 and node 1 took about 4 hours to complete.
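To get a feeling for these numbers you can calculate the theoretical minimum yourself. The sketch below only divides the data size by the line speed, so it is a lower bound: it ignores protocol and VPN overhead and the speed at which the donor can read the snapshot, which is why transfers in practice take longer. The function name is made up for this example.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Lower bound for the time needed to push size_bytes over a link."""
    return size_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    GB = 10**9
    cases = [
        ("1TB over 1Gbps", 1000 * GB, 10**9),
        ("1TB over 10Gbps", 1000 * GB, 10 * 10**9),
        ("400GB over 400Mbps VPN", 400 * GB, 400 * 10**6),
    ]
    for label, size, speed in cases:
        seconds = transfer_seconds(size, speed)
        print(f"{label}: at least {seconds / 60:.0f} minutes")
```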

As the bootstrapped node was online according to the ProxySQL script, it now allowed writes to the cluster while being a donor node. The mass-update jobs (stock and pricing sync) started to run once again, altering all sorts of data inside the cluster. This means a SST taking 4 hours to complete would make IST impossible between the cleanly shut down nodes in DC #2! As a result a SST also had to happen between nodes 4 and 5 inside DC #2. Luckily the local network in DC #2 is a bit faster than the VPN, but it was a totally unnecessary recovery.

Recovery improvements

If we look at the outage, the cluster was unavailable for a few minutes thanks to ProxySQL sending all traffic to a single node. But according to the definition of high availability, a single node is not HA. So if we take into account that we need at least two nodes to be available, the recovery happens only after the first SST has completed. This means our Mean Time To Recover Service (MTTRS) on Galera was at least 4 hours!

MTTR-Galera-flow

But as explained earlier: the ProxySQL script will not allow traffic to be sent to a donor, unless there is no other node available. This means right after node 1 has been recovered the ProxySQL script will remove the bootstrapped node 4 from available nodes and we’re back to one single node availability. Technically then the MTTRS should be extended till after node 2 has been recovered. But we’ll keep this discussion out of the scope for now.

In the post mortem of this incident we identified a couple of improvements we could make to speed up the recovery of a Galera cluster:

  1. Faster network (between DC #1 and DC #2)
  2. Make GCache bigger
  3. Increase GCache durability
  4. Make the transfer smaller

The first option was not feasible as it would increase cost dramatically, so we focused on options 2 and 3 at first. The GCache was already set to 4GB, which is already quite large. So we simply changed the binlog_row_image parameter from full to minimal. Since most modifications per row were only a few bytes, this increased the durability from 45 minutes to somewhere between 4 and 5 days! I felt a lot more confident that unnecessary SSTs wouldn’t happen anymore.
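To put that improvement in perspective: with a fixed GCache size and, as an assumption, a steady number of row changes per second, durability is inversely proportional to the number of bytes written per change. The small calculation below derives how much smaller the writesets must have become to explain the jump from 45 minutes to 4 or 5 days. The function name is made up for this example.

```python
def implied_shrink_factor(before_minutes, after_days):
    """How many times smaller the write volume became, given both durabilities."""
    return after_days * 24 * 60 / before_minutes


if __name__ == "__main__":
    print(implied_shrink_factor(45, 4))  # 128.0
    print(implied_shrink_factor(45, 5))  # 160.0
```

In other words: on this workload the full row image was writing well over a hundred times more data into the GCache than strictly necessary.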

How to make a Galera State Snapshot Transfer smaller?

Now the fourth option proved to be a bit more challenging: how to decrease the data size of 400GB? We could just drop a couple of the tables larger than 50GB, but I think the product managers would be very angry at me. Another option would be to enable table compression, but I’d rather not go through the slow migration process on tables larger than 50GB with distributed storage on a hyperconverged infrastructure.

Another option is to compress during the creation of the xtrabackup snapshot and this is done for example by using the following configuration:

[sst]
compressor="gzip"
decompressor="gzip -dc"

This would save us probably something near half of the bytes to send over the line, and every byte compressed means we can send the SST faster, right? Naturally we wanted to test this and benchmark how much faster we would be able to move data between the two locations than our current uncompressed SST.

Benchmarking State Snapshot Transfers

We selected the following methods:

  1. Uncompressed
  2. Gzip
  3. Pigz (normal and fast)
  4. LZ4
  5. Snappy
  6. Qpress (LZH built in xtrabackup)

The method for testing was to limit the bandwidth between two nodes in our Galera test cluster (4 cores and 8GB of memory) to 300Mbit, force SST to happen on one node and then measure the time between the start of the SST and the moment the node had rejoined the cluster. The number of bytes transferred was measured through the network interface and we ran each method three times to get a reliable outcome.
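Measuring the bytes that pass through a network interface can be done in several ways. As one possible approach (an assumption, not necessarily how we collected our numbers), the sketch below reads the transmit counter from /proc/net/dev on Linux before and after a run. The interface name eth0 and the function names are placeholders.

```python
def tx_bytes(proc_net_dev_text, interface):
    """Return the transmitted-bytes counter for `interface` from /proc/net/dev text."""
    for line in proc_net_dev_text.splitlines():
        name, separator, counters = line.partition(":")
        if separator and name.strip() == interface:
            fields = counters.split()
            # 8 receive counters come first; the 9th counter is transmit bytes.
            return int(fields[8])
    raise KeyError(interface)


def read_proc_net_dev():
    """Read the live counters (Linux only)."""
    with open("/proc/net/dev") as handle:
        return handle.read()


def bytes_sent_during(action, interface="eth0", read_counters=read_proc_net_dev):
    """Run `action` and return how many bytes left `interface` in the meantime."""
    before = tx_bytes(read_counters(), interface)
    action()
    return tx_bytes(read_counters(), interface) - before
```

On the donor you would wrap the part that waits for the SST to finish in `bytes_sent_during`. Keep in mind that the counter includes all other traffic on that interface, so the test node should be otherwise idle.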

First of all the throughput tells us already a couple of things:

Galera SST benchmark network throughput

LZ4 and Snappy seem to be able to almost fully utilize the bandwidth of our test system, which is good.

We can see the throughput of gzip and pigz is very low. Gzip is renowned for being slow in the compression phase, but pigz (parallel gzip) is supposed to tackle this by allowing parallel execution of chunks of data. On the test machine the graphs of the gzip run were showing it actually was limited by the single core being utilized fully:

Benchmark SST donor GZIP (second set of peaks)

The peaks before the gzip runs are the uncompressed runs. Throughput really is dwarfed by the uncompressed run.

You might expect pigz to do better, but alas: even in fast compression mode it’s not much of an improvement:

Benchmark SST donor PIGZ fast compression (last three peaks)

Yes, it now compresses on all four cores, but the throughput is only twice as high.

So what about the total amount of bytes transferred?

Galera SST benchmark - total bytes transferred

Gzip and pigz are truly king (and queen) here: they save almost 60% of total bytes transferred! This is an amazing achievement and it saves a lot of network bandwidth. LZ4 and Snappy do a much worse job at 25% and 32% respectively.

Now the most important graph, the total duration that we were after:

Galera SST benchmark - total duration

LZ4 and Snappy are clear winners here by saving us 15% and 11% in duration of the SST.
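Why does the method that saves the most bytes lose on duration? A SST is a streaming pipeline, so a simplified way to look at it is that the slowest stage dictates the total time: either the compressor or the network. The sketch below models just that. The savings are taken from the benchmark above, but the compressor speeds are made-up illustrative numbers, and the model ignores decompression, disk speed and the prepare phase on the joiner, so don't expect it to reproduce the measured durations.

```python
def sst_seconds(data_bytes, saved_fraction, compressor_bytes_per_second,
                link_bits_per_second):
    """Estimate SST duration as the slowest of the two pipeline stages."""
    compressed_bytes = data_bytes * (1 - saved_fraction)
    network_time = compressed_bytes * 8 / link_bits_per_second
    compression_time = data_bytes / compressor_bytes_per_second
    return max(network_time, compression_time)


if __name__ == "__main__":
    data = 100 * 10**9   # hypothetical 100GB dataset
    link = 300 * 10**6   # 300Mbit, as in the benchmark
    methods = {
        # name: (fraction of bytes saved, assumed compressor speed in bytes/s)
        "uncompressed": (0.00, float("inf")),
        "gzip": (0.60, 15 * 10**6),
        "lz4": (0.25, 400 * 10**6),
        "snappy": (0.32, 300 * 10**6),
    }
    for name, (saved, speed) in methods.items():
        minutes = sst_seconds(data, saved, speed, link) / 60
        print(f"{name}: {minutes:.0f} minutes")
```

With a slow compressor the network is never the bottleneck, no matter how few bytes are sent. With a fast compressor the network stays saturated and every byte saved translates directly into time saved.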

Conclusion

We applied Snappy to our production Galera cluster. As our production servers were slightly different we were able to save about 20% on the duration of the state snapshot transfer.

I think the outcome would also be a bit more favorable for pigz on these production servers as they ran on 8 cores instead of only 4 cores. However I fear for the responsiveness of a bootstrapped server utilizing all 8 cores to the maximum for a few hours. Snappy and LZ4 may compress a bit less, but they are much “friendlier” to your donor.

And the final advice I can give you is to keep an eye on the GCache durability as this may be a life saver one day!

Want more?

This is one of the background stories of my talk Benchmarking Should Never Be Optional on Wednesday 2nd of October 2019 at Percona Live Europe in Amsterdam.

See me speak at Percona Live Europe 2019

In my talk I will feature a few cases why you should always benchmark your systems up front. It’s not only about database benchmarking, but in some cases even the entire system that requires benchmarking.

Hyperconverging and Galera cluster

What is hyperconverging?

Hyperconverging is the latest hype: do things more efficiently with the resources that you have by cramming as many virtual machines as possible onto the same hypervisor. In theory this should allow you to mix and match various workloads to make the optimum use of your hypervisor (e.g. all cores used 100% of the time, overbooking your memory up to 200%, moving virtuals around like there is no tomorrow). Any cloud provider is hyperconverging their infrastructure and this has pros and cons. The pro is that it’s much cheaper to run many different workloads, while the con clearly shows when you encounter noisy neighbors. As Jeremy Cole said: “We are utilizing our virtual machines to the max. If you are on the same hypervisor as us, sorry!”

Apart from cloud providers, you could hyperconverge your infrastructure yourself. There are a few hardware/software vendors out there that will help you with that and at one of my previous employers we got a helping hand from one such vendor!

DIY hyperconverging

In our case the entire infrastructure was migrated to a new hyperconverged infrastructure where we would have multiple infrastructure clusters (read: four hypervisors in one chassis) in multiple data centers. Infra marked one of these DCs suitable for our customer facing projects as the peering was performed in that DC. The idea behind this new infrastructure is that a VM can basically run anywhere in your infrastructure and be copied in real time to another hypervisor within the same cluster (read: chassis). This copy process (including memory) obviously required some (short) locking, but it worked amazingly well. We even had some software running that would move VMs around to optimize the workloads and still retain some spare capacity. Magic!

Now there was an additional benefit to choosing this vendor: if a hypervisor went down, the same VM could be spun up immediately on another hypervisor, albeit without copying the memory contents. To be able to do this, the disks are synced to at least one other hypervisor. This means some cluster magic detects one of the hypervisors being down and automagically spins up the same VMs on another (available) hypervisor that contains the latest data of this VM. To spread the load among various hypervisors it is advised to set the replication factor of the disks to 2, where 2 means every write is copied to (at least) two other hypervisors.

Hyperconverging Galera

Our Galera cluster consisted of three Galera nodes with three asynchronous read replicas attached (see image below).

Galera cluster with read slaves

In this picture every Galera node stores every transaction in the GCache, InnoDB flushes the transaction to disk (ibdata*) and asynchronous replication dictates another write to the binlogs. That means that every transaction in our Galera node will already be stored three times on disk.

The hyperconverged cluster where we hosted Galera had the replication factor set to 2. That means every byte written to disk will be written to at least two other storage controllers (VMs), as shown in the image below. This write operation over the network is synchronous, so the filesystem has to wait until both controllers have acknowledged the write. Latency of a single write is negligible as the write is super fast and performed over a low latency network. The magic behind this synchronous disk replication is out of the scope for this blog post, but I can hint that a certain NoSQL database (named after some Greek mythology) is managing the storage layer.

Hyperconverge write amplification: every write to disk will be written three times!

This means that every write to disk in our Galera node will also be synced to an additional two hypervisors. To make matters worse, due to semi-synchronous replication, all three Galera nodes perform the exact same operations at (almost) the exact same time!

1 transaction = 3 nodes × (3 writes locally + 6 writes over the network) = 27 writes

As you can guess from the simple formula above: 9 writes are performed locally and 18 writes are performed over the network. As every write to disk is performed synchronously over the network, the added latency is a bit more than negligible when one transaction spawns 18 writes over the network at the same time. As 1 transaction to Galera can cause 18 synchronous writes over the network, imagine what latency you will encounter if you have a baseline of 200 transactions per second! And we’re not even counting the asynchronous replicas performing similar write operations again mere (milli)seconds later!
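The same arithmetic as a small calculation, so you can plug in your own number of nodes and replication factor. Like the formula, it simplifies things by counting every logical write (GCache, InnoDB, binlog) as exactly one disk write; the function name is made up for this example.

```python
def write_amplification(nodes=3, local_writes_per_node=3, replication_factor=2):
    """Return (local, network, total) disk writes caused by one transaction."""
    local = nodes * local_writes_per_node
    # Every local write is synchronously copied to `replication_factor` hypervisors.
    network = local * replication_factor
    return local, network, local + network


if __name__ == "__main__":
    local, network, total = write_amplification()
    print(local, network, total)  # 9 18 27
    # Baseline of 200 transactions per second:
    print(network * 200)          # 3600 synchronous network writes per second
```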

Galera managed to cope most of the time: instability only happened at set intervals. We could trace these back to our so-called stock-updates or pricing-updates: every half hour stock levels were pushed from the warehouse database and every few hours new pricing information was also pushed via the enterprise service bus.

With more than a million products in the database these torrents of writes quickly caused disk latency in the entire hyperconverged cluster, and we have seen the disk latency shoot up well beyond 80ms. This no longer affected only the Galera cluster: it was causing cluster-wide issues on the distributed storage layer as well. And to make matters even worse: latency on the entire network was also shooting up.

Benchmarking semi-synchronously replicated hyperconverged clusters

At first nobody believed us, even when we showed the graphs to the vendor. This new infrastructure was so much more expensive than our old one that it simply couldn’t be true. Only after conducting benchmarks, reproducing the latency on an empty test cluster, were we taken seriously. The benchmarks revealed that the write amplification saturated the network interfaces of the cluster and we worked with the vendor on seeking a solution to the problem. Even after upgrading the network (10G interface bonding, enabling jumbo frames, hypervisor tuning) we still found latency issues.

The issue with our hyperconverged cluster was that there was no (separate) internal network handling the inter-hypervisor network traffic. Of course, after the network upgrade we could achieve double the number of transactions, but that didn’t solve the underlying issue of also causing latency on other VMs and on the ingress and egress network of our applications.

Conclusion

We came to the conclusion that (semi-)synchronous replicated databases and hyperconverged infrastructures with high replication factors don’t match. Unfortunately this replication factor could only be set on cluster level and not on an individual VM level. Also the reasoning behind the synchronous disk replication did not make sense (see also my previous blog post) as Galera would wipe the disk contents anyway and in general it would take quite some time for the database to recover, so a quick failover would not happen anyway. That’s why we ran Galera+ProxySQL in the first place: to allow us to have a failover happen within seconds!

We also ran other (semi-)synchronous replicated databases, for example MongoDB, SOLR and Elasticsearch, and each and every one of them had basically the same lack of need for disk replication.

The only option left was to migrate the Galera cluster back to our old hardware that, luckily/sadly, was still switched on. At the same time we started working on a migration to a real cloud vendor as they could offer us better performance without the risk of a single point of failure (e.g. single data center).

So what difference would a benchmark up front have made?

This only happened due to bad requirements, drawn up without analyzing the workload that was supposed to be converged. We would have seen these issues before migrating to the new hyperconverged infrastructure if we had benchmarked beforehand. We would have saved ourselves from many instabilities, outages and post mortems. We might even have chosen a totally different setup or have chosen to split our workloads over multiple (smaller) hyperconverged clusters.

Is Galera trx_commit=2 and sync_binlog=0 evil?

It has been almost 5 years since I posted on my personal MySQL related blog. In the past few years I have worked for Severalnines and blogging both on their corporate blog and here would be confusing. After that I forgot and neglected this blog a bit, but it’s time to revive this blog!

Speaking at Percona Live Europe – Amsterdam 2019

Why? I will be presenting at Percona Live Europe soon and this blog and upcoming content is the more in-depth part of some background stories in my talk on benchmarking: Benchmarking should never be optional. The talk will mainly cover why you should always benchmark your servers, clusters and entire systems.

If you wish to see me present, you can receive 20% discount using this code: CMESPEAK-ART. Now let’s move on to the real content of this post!

Innodb_flush_log_at_trx_commit=2 and sync_binlog=0

At one of my previous employers we ran a Galera cluster of 3 nodes to store all shopping carts of their webshop. Any cart operation (adding a product to the basket, removing a product from the basket or increasing/decreasing the number of items) would end up as a database transaction. With such important information stored in this database, in a traditional MySQL asynchronous replication setup it would be essential to ensure all transactions are retained at all times. To be fully ACID compliant the master would have both innodb_flush_log_at_trx_commit set to 1 and sync_binlog set to 1 (correction by Przemek Malkowski) to ensure every transaction is written to the logs and flushed to disk. When every transaction has to wait for data to be written to the logs and flushed to disk, this will limit the number of cart operations you can do.

Somewhere in a dark past the company exceeded the number of cart operations possible on this host and one of the engineers found a Stackoverflow post instructing how to improve the performance of MySQL by “tuning” the combo of the two variables. Naturally this solved the immediate capacity problem, but it sacrificed consistency at the same time. As Jean-François Gagné pointed out in a blog post, you can lose transactions in MySQL when you suffer from OS crashes. This was bound to happen some day, and when that day arrived a new solution had become available: Galera!

Galera and being crash-unsafe

Galera offers semi-synchronous replication to ensure your transaction has been committed on the other nodes in the cluster. You just spread your cluster over your entire infrastructure on multiple hosts in multiple racks. When a node crashes it will recover when rejoining and Galera will fix itself, right?

Why would you care about crash-unsafe situations?

The answer is a bit more complicated than a yes or a no. When an OS crash happens (or a kill -9), InnoDB can be more advanced than the data written to the binary logs. But Galera doesn’t use binary logs by default, right? No it doesn’t, but it uses GCache instead: this file stores all transactions committed (in the ring buffer), so it acts similarly to the binary logs and behaves similarly with respect to these two variables. Also, if you have asynchronous slaves attached to Galera nodes, it will write to both the GCache and the binary logs simultaneously. In other words: you could create a transaction gap with a crash-unsafe Galera node.

However Galera will keep state of the last UUID and sequence number in the grastate.dat file in the MySQL data directory. Now when an OS crash happens, Galera will read the grastate.dat file on startup and on an unclean shutdown it encounters seqno: -1. While Galera is running the file contains seqno: -1 and only upon normal shutdown is the actual sequence number written to grastate.dat. So when it finds seqno: -1, Galera will assume an unclean shutdown happened and if the node is joining an existing cluster (becoming part of the primary component) it will force a State Snapshot Transfer (SST) from a donor. This wipes all data on the broken node, copies all data and makes sure the joining node has the same dataset.
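You can check this yourself before starting a node. The sketch below parses the contents of a grastate.dat file and tells you whether the node was shut down cleanly. The sample contents are typical for recent Galera versions, but the exact set of fields may differ per version; the function names are made up for this example.

```python
def parse_grastate(text):
    """Parse the key: value lines of a grastate.dat file into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def was_clean_shutdown(state):
    """A seqno of -1 means the node was running or crashed when the file was last written."""
    return int(state["seqno"]) != -1


if __name__ == "__main__":
    sample = """# GALERA saved state
version: 2.1
uuid:    8bcf4a34-aedb-14e5-bcc3-d3e36277729f
seqno:   -1
safe_to_bootstrap: 0
"""
    state = parse_grastate(sample)
    print(state["uuid"], was_clean_shutdown(state))
```

A recovery automation could use a check like this to prefer cleanly shut down nodes over crashed ones.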

Apart from the fact that unclean shutdown always triggers a SST (bad if your dataset is large, but more on that in a future post), Galera is pretty much recovering itself and not so much affected by being crash-unsafe. So what’s the problem?

It’s not a problem until all nodes crash at the same time.

Full Galera cluster crash

Suppose all nodes crash at the same time: none of the nodes would have been shut down properly and all nodes would have seqno: -1 in the grastate.dat. In this case a full cluster recovery has to be performed where MySQL has to be started with the --wsrep-recover option. This will open the innodb header files, shut down immediately and return the last known state for that particular node.

$ mysqld --wsrep-recover
...
2019-09-09 13:22:27 36311 [Note] InnoDB: Database was not shutdown normally!
2019-09-09 13:22:27 36311 [Note] InnoDB: Starting crash recovery.
...
2019-09-09 13:22:28 36311 [Note] WSREP: Recovered position: 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114428
...

Now we have three independent Galera nodes that each suffered from an unclean shutdown. This means all three have lost transactions up to one second before crashing. Even though all transactions committed within the cluster are theoretically the same as the cluster crashed at the same moment in time, this doesn’t mean all three nodes have the same number of transactions flushed to disk. Most probably all three nodes have a different last UUID and sequence number and even within this there could be gaps as transactions are executed in parallel. Are we back at eeny-meeny-miny-moe and just pick one of these nodes?
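The usual guidance is to bootstrap the cluster from the node with the highest recovered sequence number. The sketch below extracts the recovered position from the --wsrep-recover output of each node and picks that node. It does nothing about the possible gaps mentioned above, and the host names and function names are made up for this example.

```python
import re

# Matches the log line: WSREP: Recovered position: <uuid>:<seqno>
RECOVERED = re.compile(r"WSREP: Recovered position: ([0-9a-f-]+):(-?\d+)")


def recovered_position(log_text):
    """Return (uuid, seqno) from mysqld --wsrep-recover output, or None."""
    match = RECOVERED.search(log_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def most_advanced(logs_by_host):
    """Return the host with the highest recovered seqno."""
    positions = {}
    for host, text in logs_by_host.items():
        position = recovered_position(text)
        if position is not None:
            positions[host] = position
    uuids = {uuid for uuid, _ in positions.values()}
    if len(uuids) != 1:
        # Either nothing was recovered or the nodes belong to different clusters.
        raise ValueError("cannot compare positions: expected exactly one cluster UUID")
    return max(positions, key=lambda host: positions[host][1])


if __name__ == "__main__":
    line = "2019-09-09 13:22:28 36311 [Note] WSREP: Recovered position: 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:{}"
    logs = {"node1": line.format(114428), "node2": line.format(114430), "node3": line.format(114425)}
    print(most_advanced(logs))  # node2
```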

Can we consider Galera with trx_commit=2 and sync_binlog=0 to be evil?

Yes and no… Yes because we have potentially lost a few transactions so yes it’s bad for consistency. No because the entire cart functionality became unavailable and carts have been abandoned in all sorts of states. As the entire cluster crashed, customers couldn’t perform any actions on the carts anyway and had to wait until service had been restored. Even if a customer just finished a payment, in this particular case the next step in the cart could not have been saved due to the unavailability of the database. This means carts have been abandoned and some may actually have been paid for. Even without the lost transactions we would need to recover these carts and payments manually.

So to be honest: I think it doesn’t matter that much if you handle cases like this properly. Now if you would design your application right you would catch the (database) error after returning from the payment screen and create a ticket for customer support to pick this up. Even better would be to trigger a circuit breaker and ensure your customers can’t re-use their carts after the database has been recovered. Another approach would be to scavenge data from various sources and double check the integrity of your system.

The background story

Now why is this background to my talk, when it doesn’t have anything to do with benchmarking? The actual story in my presentation is about a particular problem around hyperconverging an (existing) infrastructure. A hyperconverged infrastructure will sync every write to disk to at least one other hypervisor in the infrastructure (via network) to ensure that if the hypervisor dies, you can quickly spin up a new node on a different hypervisor. As we have learned from above: the data on a crashed Galera node is unrecoverable and will be deleted during the joining process (SST). This means it’s useless to sync Galera data to another hypervisor in a hyperconverged infrastructure. And guess what the risk is if you hyper-converge your entire infrastructure into a single rack? 😆

I’ll write more about the issues with Galera on a hyperconverged infrastructure in the next post!