10 reasons why MySQL Query Cache should be switched off

This post is not meant as a rant, so I hope it will not be interpreted this way. A few weeks ago I was asked the question why we decided to switch off the MySQL Query Cache on all our production servers. Couldn’t we just enable it to squeeze some extra performance out of those expensive 16 and 32 core machines? A question like this would make most MySQL database engineers cringe, but of course we know what’s wrong with the MySQL Query Cache. I could explain this perfectly well, but I knew it was going to be a lengthy write up. So I looked for an existing source that could explain this to anyone.

And to be honest: I couldn’t find a clear (blog) article that explains this to a person who isn’t a database engineer. Even worse: I could find many articles describing how easy it is to enable it for performance reasons, including a tutorial by DigitalOcean! Now if DigitalOcean is telling me how to improve performance, why should I doubt their words?

Time to clear things up once and for all by giving 10 reasons why you should switch it off! But not before I explain what the MySQL Query Cache actually is!

What is the MySQL Query Cache?

The MySQL Query Cache is nothing more than a slab of reserved memory within the MySQL daemon to store the results of a query. The next time someone executes the same query the result is readily available and can be returned instantly. The Query Cache was created in a time when hardware resources like CPU and disk were precious and expensive. This was back when single CPU (Linux) servers were king and a fast disk meant you probably bought a 10K rpm SCSI disk with a whopping 9GB of storage. To prevent expensive disk access and avoid wasting CPU cycles on ordering and filtering the data, the (simple) Query Cache was invented.

On a high level the MySQL Query Cache works as follows (side note: this is a simplified drawing and obviously not 100% accurate):

[Image: mysql-query-cache-high-level]
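Since the whole point of this post is switching it off: checking whether the Query Cache is enabled on a 5.6/5.7 server and disabling it at runtime looks roughly like this (a minimal sketch; note that the Query Cache was deprecated in MySQL 5.7.20 and removed entirely in 8.0):

-- see whether the cache is enabled and how big it is
SHOW GLOBAL VARIABLES LIKE 'query_cache%';
-- hit/miss/prune counters tell you whether it is doing anything useful
SHOW GLOBAL STATUS LIKE 'Qcache%';
-- switch it off at runtime; make it permanent by setting
-- query_cache_type=0 and query_cache_size=0 in my.cnf
SET GLOBAL query_cache_type = OFF;
SET GLOBAL query_cache_size = 0;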


SELECT … FOR UPDATE on non-existent rows

TL;DR

SELECT … FOR UPDATE has a (not so) surprising side effect on non-existent rows: it could cause a (serious) performance penalty and even prevent you from inserting new rows at all.

Locking rows for update

A development team of ours was working on an application that needed to ensure an update on a single row item isn’t modified by another transaction. Naturally they started making use of SELECT … FOR UPDATE to lock the row before updating it. This worked excellently to keep anyone else from updating this row. However, during a load test they started to get lock wait timeouts on inserts of totally unrelated items, and they asked me to look into this.

SELECT … FOR UPDATE is described as following in the MySQL documentation:
A SELECT ... FOR UPDATE reads the latest available data, setting exclusive locks on each row it reads. Thus, it sets the same locks a searched SQL UPDATE would set on the rows.
So far so good: this is expected behavior. The documentation also doesn’t mention locking anything but the rows it reads.

I asked the team whether they were attempting to insert the same data as they were locking in the other transaction and they said they were not.

In pseudo code they were doing this:
SELECT ... WHERE uuid='some-uuid' FOR UPDATE;
if row { UPDATE row }
else { INSERT row }

The uuid column here is the primary key of the table. This code executed fine and had no issues by itself as a single transaction.
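In actual SQL the pattern looks roughly like this (a sketch with made-up table and column names, not the team’s real schema):

START TRANSACTION;
SELECT * FROM items WHERE uuid = 'some-uuid' FOR UPDATE;
-- if the SELECT returned a row:
UPDATE items SET some_column = 'new value' WHERE uuid = 'some-uuid';
-- if the SELECT returned nothing:
INSERT INTO items (uuid, some_column) VALUES ('some-uuid', 'new value');
COMMIT;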

You may wonder: why not use INSERT … ON DUPLICATE KEY UPDATE or REPLACE INTO?
First of all, we only insert occasionally, so the insert part would hit an existing row 99% of the time. Second, we may only be updating a single column within a single row, which implies we would need to know the entire row up front whenever we insert or replace it.

No row to update

Now what would happen if there is no row to update?

According to the description in the MySQL documentation it sets an exclusive lock on each row it reads, but what about when there is no row to read? This other excerpt on gap locks hints at what it actually does:
For example, SELECT c1 FROM t WHERE c1 BETWEEN 10 and 20 FOR UPDATE; prevents other transactions from inserting a value of 15 into column t.c1, whether or not there was already any such value in the column, because the gaps between all existing values in the range are locked.
If there is no row, a lock will still be set on the associated index entries. This shouldn’t be bad, right? We can still insert other rows, right? Wrong: it isn’t a gap lock alone, but a next-key lock!

Since the lock is set on a non-existent index entry, a next-key lock is created. Depending on where in the index you would insert, you may find a whole range being locked, as the insert has to happen within this range. With our version of UUID this shouldn’t happen very often because of the random factor, but it can still happen frequently if you only have a few rows in your table. In that case the gaps between UUIDs are large and, due to the random factor, you may end up locking large gaps within the index. As cross-region latency is present on this system, the next-key locks are held longer and the chance of a collision within a gap also increases a bit. So that explains the behavior during the load test. All’s well that ends well?

Nasty side effect of Next-Key locks

There is one nasty side effect of the next-key lock: if the requested value is greater than the largest value in the index, it locks everything above that largest value up to infinity.

So what would happen to a table where the primary key is sequential like an integer? For example this table with the following rows:
CREATE TABLE sometable (
id int(11) NOT NULL,
some_data varchar(255) NOT NULL default '',
PRIMARY KEY (id)
);
INSERT INTO sometable VALUES
(1, 'test'),
(2, 'test'),
(3, 'test'),
(4, 'test'),
(5, 'test'),
(10, 'test'),
(11, 'test'),
(12, 'test'),
(13, 'test'),
(14, 'test'),
(15, 'test');

This would create a gap between 5 and 10 and a gap from 15 till infinity.

When we select (FOR UPDATE) within the gap between 5 and 10, we create a next-key lock between 5 and 10 and nobody can insert new rows inside this gap. We can still insert new rows at the end of the table though. However, if we select a row with an id greater than 15, we place a next-key lock from 15 to infinity. This means nobody can append anything to this table anymore until we have committed our transaction! This can become a serious bottleneck if you insert more than you update.
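You can reproduce this with the table above and two sessions, assuming InnoDB and the default REPEATABLE READ isolation level:

-- session 1: select a non-existent id beyond the largest value (15)
START TRANSACTION;
SELECT * FROM sometable WHERE id = 20 FOR UPDATE;
-- 0 rows returned, but a next-key lock is now held from 15 up to infinity

-- session 2: this blocks until session 1 commits, or fails with a lock wait timeout
INSERT INTO sometable VALUES (25, 'test');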

Conclusion

I wasn’t aware of the fact that SELECT … FOR UPDATE actually locked the index for rows that don’t exist yet. I would have reasoned that if there were no rows to be selected, there wouldn’t be anything to lock. And if there wouldn’t be anything to lock, the whole SELECT … FOR UPDATE would simply fizzle.

Even though SELECT … FOR UPDATE sounds like a great way to ensure your transaction is the only one that modifies a specific row, it’s quite dangerous as it can lock out or delay other transactions. If you take our example above, a safer way to do it (in pseudo code) is:
SELECT ... WHERE uuid='some-uuid';
if row { SELECT ... WHERE uuid='some-uuid' FOR UPDATE; UPDATE row }
else { INSERT row }

This ensures the lock is only set when there actually is a row present, at the expense of an additional query.
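In SQL that could look roughly as follows (again a sketch with made-up names; note the insert path can still race with another transaction inserting the same uuid, so the application should be prepared to handle a duplicate key error):

START TRANSACTION;
SELECT * FROM items WHERE uuid = 'some-uuid';
-- row found: now lock it and update it
SELECT * FROM items WHERE uuid = 'some-uuid' FOR UPDATE;
UPDATE items SET some_column = 'new value' WHERE uuid = 'some-uuid';
-- no row found: just insert, no lock on a non-existent row needed
INSERT INTO items (uuid, some_column) VALUES ('some-uuid', 'new value');
COMMIT;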

Edit: added that UUID gaps can be significant

The ProxySQL multiplexing wild goose chase

TL;DR – We encountered multiplexing issues with ProxySQL and after going around in circles for a while we found that the impact of mysql-auto_increment_delay_multiplex and mysql-connection_delay_multiplex_ms was not documented. Read the blog post why you should check these values in your configuration and check how your codebase handles last insert identifiers!

At my present company we are using a multi-layer ProxySQL setup to route our traffic to the appropriate database cluster’s primary or replica hosts. For this post it doesn’t matter whether you run a single or a multi-layer setup, have many backend servers configured or not, so I’ll simplify our architecture to a single ProxySQL layer where the application connects to all three proxies evenly:

[Image: proxysql-multiplexing-wild-goose-chase-topology]

The reason for having N+2 proxies is that it ensures we retain high availability after the failure of a single node. I’ve also added a fourth (canary) node and sent a small amount of traffic to it to see how any change would affect multiplexing.

One of the key features of ProxySQL is that it’s able to multiplex connections. In short: ProxySQL will attempt to send queries to your database servers over idle backend connections. In effect, queries from separate incoming connections can be sent to the master over as few backend connections as possible, lowering the connection overhead on your database servers. This also means queries from the same incoming connection may be multiplexed over multiple backend connections. The limitation is that you can’t multiplex queries within a transaction, nor when you hold locks, as this requires the connection to stick with the same backend connection. See also the ProxySQL multiplexing documentation for every condition under which multiplexing will be (temporarily) disabled.

Problems encountered with high connection counts

After migrating a single database into multiple shards, we found that ProxySQL was hardly multiplexing connections at all, and with every shard adding more connections this wasn’t a scalable solution. The number of ingress connections is about 600 per shard, while the number of egress connections stuck at around 400 per shard. That puts the egress-to-ingress ratio at about 66%, which is not a good sign if ProxySQL is supposed to be multiplexing. A good and effective ratio would be more in the lines of 5%. We were certain ProxySQL was multiplexing on other hostgroups before, as the ratio was far more favorable there.
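The quickest way to see this ratio is on the ProxySQL admin interface; a sketch of the counters we kept an eye on (the exact names may differ slightly per ProxySQL version):

SELECT Variable_Name, Variable_Value
FROM stats_mysql_global
WHERE Variable_Name IN ('Client_Connections_connected',
                        'Server_Connections_connected');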

[Image: proxysql-multiplexing-wild-goose-chase-first-layer-high-connections]

Incoming and outgoing ProxySQL connections are ridiculously high

Why is this a big issue then? For us, having more than 2000 active frontend connections and about 1400 backend connections meant ProxySQL was using a lot of CPU time. On a 4 core machine we noticed our CPUs were over 70% busy all the time, which wouldn’t allow us to lose a single ProxySQL node anymore. To stay within safe margins we first upgraded all ProxySQL hosts to 8 cores, and this kept the CPUs within the region of 40% to 50%. Adding a new shard would increase the incoming connections by 600 and the backend connections by 400. That would only allow us to add maybe one or two shards before we would once again not be able to lose a ProxySQL node, and would have to upgrade once more. Adding more ProxySQL hosts would work, but as we have a multi-layer ProxySQL architecture we would need to scale all layers to keep up with the incoming connections on the other layers as well. In other words: our sharded environment just faced a new challenge, as scaling out wasn’t an (easy) option anymore.

If you search for ProxySQL and multiplexing, the top results will be the ProxySQL multiplexing documentation and a few blog posts describing the wonders of ProxySQL multiplexing. The documentation describes a few conditions under which multiplexing is disabled, so naturally we examined whether our application was hitting any of them. We went through the general log, the query digest inside ProxySQL and tcpdumps, and concluded it wasn’t hitting any of these conditions, so in theory multiplexing should have been working. But as the high incoming and backend connection counts already established, it clearly wasn’t. Our first dead end.

Investigating ProxySQL bug reports

Then we started to investigate ProxySQL bug reports to see if anything matched there. At first we found issue #1897 where multiplexing was erroneously disabled, and there were a few other solved issues that hinted we should try to upgrade. However, upgrading a production environment without knowing the impact is never favorable, so instead we added a fourth ProxySQL host and sent 1% of the traffic towards it. This allowed us to easily upgrade the host to various ProxySQL versions and see if that would resolve the issue. Unfortunately we found none of the upgrades resolved our multiplexing issues. Another dead end.

We did notice the set names metric in ProxySQL was increasing fast, and this led us to issue #1019, where a multi-layered ProxySQL setup can have issues with set names when the database has a different character set than ProxySQL’s default character set. This was the case for us, but the documentation doesn’t mention that set names influences multiplexing, and the earlier upgrade to 2.0.8 didn’t resolve our issues. At least we found out why the number of set names was increasing. Another dead end.

Then one of our developers pointed me towards ProxySQL issue #2339, in which user wjordan requests a documentation update on temporary conditions under which multiplexing is disabled. He describes the mysql-auto_increment_delay_multiplex and mysql-connection_delay_multiplex_ms variables as missing from this page. I had totally ignored this issue when I searched through the ProxySQL issues, as the title contained the word documentation. Insert facepalm here.

The two variables are basically a workaround for issue #1048 (Problem with “LAST_INSERT_ID()” returning 0), issue #1421 (Forward SELECTs on LAST_INSERT_ID) and issue #1452 (select last_insert_id() as ID), bundled in issue #1828. Ironically issue #1828 is the source of issue #1897, so I went full circle there! Insert second facepalm here.

How ProxySQL delays multiplexing

So what do these two variables do then? First we have to understand the origin of the problem here: with multiplexing enabled, whenever an insert or update would cause an auto increment column to increase, the LAST_INSERT_ID() query would not be guaranteed to be run on the same connection or right after the insert happened. So either it would be proxied over another connection (returning a wrong identifier or 0) or it would be run out of order due to multiplexing (again returning a wrong identifier or 0). These two variables allow you to have the connection stop multiplexing for X-number of queries or X-number of milliseconds after an insert or update happened on an auto increment column.

But wait! How does ProxySQL detect that an auto increment column increased then? That’s quite simple: on every successful query the MySQL binary protocol returns an OK packet to the client. Whenever an insert or update triggers an auto increment column, this OK packet also contains the last inserted identifier. This is what triggers ProxySQL to stop multiplexing for X queries or X milliseconds.

But wait! Why did the majority of applications still function correctly before ProxySQL 1.4.14 then? That’s also quite simple: most programming languages use native MySQL drivers. For instance PHP PDO makes use of mysqlnd, the MySQL native driver. Just like ProxySQL, mysqlnd reads the OK packet and stores the last inserted identifier internally. So when you call the lastInsertId function in PDO, it retrieves the value from mysqlnd internally without sending a query. In general you can assume database drivers never run a SELECT LAST_INSERT_ID() against a database. However, you should be cautious with some ORMs, like Hibernate, that actually depend on queries like this.

[Image: proxysql-multiplexing-wild-goose-chase-mysqlnd-native-driver]

mysqlnd stores the last insert identifier internally

The default for mysql-auto_increment_delay_multiplex is 5 and for mysql-connection_delay_multiplex_ms it is 0. So whenever ProxySQL encounters an OK packet with the last inserted identifier set, it will disable multiplexing on that connection for 5 consecutive queries. This basically locks the frontend connection to the backend connection for at least 5 queries, to allow an ORM (or application) to run a LAST_INSERT_ID() query on the same connection. New incoming connections will then have to use a different connection from the connection pool. Obviously, whenever ProxySQL encounters another OK packet with a last inserted identifier it will reset the counter to 5 consecutive queries again.

Back to our own problem statement

So why did this happen only to the sharded databases then? The primary key of our main table in the sharded database contains a UUID and not an auto increment. There was, however, an auto increment column present on the table to facilitate a sequence number in our records. We clearly overlooked this in our investigation as well. There are various reasons why we currently have to keep this column, so it wasn’t feasible to remove the column for now.

Also, our sharded databases have a read/write ratio of around 3:2. That means 2 out of 5 queries will cause an auto increment to trigger and thus lock the frontend connection to a backend connection for at least 5 consecutive queries! With that ratio we will hardly ever multiplex! Once we changed this variable on our canary ProxySQL we immediately saw a significant drop in backend connections. First we set mysql-auto_increment_delay_multiplex to 0, which caused all 16 incoming connections to be proxied over an average of 0 active backend connections! When we changed the value to 1 it averaged over 3 connections!
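For reference, the change itself is done on the ProxySQL admin interface, roughly like this (a sketch):

-- show the current values
SELECT * FROM global_variables WHERE variable_name LIKE 'mysql-%delay_multiplex%';
-- change the default of 5 to 0 and make it active and persistent
UPDATE global_variables SET variable_value = '0'
 WHERE variable_name = 'mysql-auto_increment_delay_multiplex';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;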

[Image: proxysql-multiplexing-wild-goose-chase-connection-drop-canary]

Drop in active backend connections after setting mysql-auto_increment_delay_multiplex to 0

Before applying this to all other ProxySQL hosts, there was one more hurdle: both variables are global, so they apply to all hostgroups. Once we established our codebase never ran a LAST_INSERT_ID() anywhere we changed it on all other ProxySQL hosts. The difference is night and day:

[Image: proxysql-multiplexing-wild-goose-chase-first-layer-connection-drop]

Effect of setting mysql-auto_increment_delay_multiplex to 0 on first ProxySQL layer

And the CPU usage dropped to a more sane 30%:

[Image: proxysql-multiplexing-wild-goose-chase-first-layer-cpu-usage-drop]

CPU usage decreases dramatically after multiplexing works again

Conclusion

There are clearly three problems with the introduction of mysql-auto_increment_delay_multiplex and mysql-connection_delay_multiplex_ms:

  1. Both variables are global variables. If you change them, they will apply to all ProxySQL hostgroups, servers, users and connections
  2. The default of mysql-auto_increment_delay_multiplex is 5. This means multiplexing will be disabled or less effective on any write-intensive workload that touches an auto increment column, regardless of whether your application actually uses LAST_INSERT_ID() or not.
  3. With the introduction of both variables in version 1.4.14, the documentation on both variables was updated. However, their impact on multiplexing was never documented.

By the time this blog post goes public I have already made a change to the ProxySQL documentation on multiplexing. I’ve also created a feature request for ProxySQL for the ability to control the delay per server or per user. But after I submitted the feature request I realized that this might actually make things worse: even more variables, configurations and query rules to check, and it doesn’t fix the problem at its root. It would be much cleaner for ProxySQL to store the last inserted identifier returned by the OK packet in a variable bound to the frontend connection that made the request. This identifier could then be returned whenever that frontend connection issues a query that contains the LAST_INSERT_ID function.

Putting Galera SST Compression on the benchmark

I mentioned in my Galera crash (un)safe post that it’s bad if an SST is triggered on a large dataset. On a Galera node an SST will simply wipe the contents of the MySQL data directory and copy a snapshot from another node in the cluster. For consistency this is a favorable thing to do, but from an HA perspective it isn’t: one node in the cluster is unavailable and another one is acting as donor. This means that if you have a three node Galera cluster, only two are available, of which one has “degraded” performance.

This can have quite a high impact: for instance, the proxysql_galera_checker script only selects synced nodes to receive traffic. By default donor nodes will not receive any traffic, and now the cluster only has one node left to serve traffic. Don’t worry: if there are no synced nodes available the donor node will still be selected as available, so in case you only have one node left (bootstrapping) it can still serve traffic. This, obviously, isn’t a favorable situation to be in.

Partial failure on Galera

Now imagine we have a multi DC Galera cluster, as pictured below:

[Image: galera-multi-dc-small]

A power failure happens in DC #1 and nodes 1 to 3 are all gone. As the nodes in DC #2 no longer form a majority, they shut down out of safety. As I wrote before in the Galera crash (un)safe blog post: a crash-unsafe Galera cluster will force an SST to happen on crashed nodes, and this is what happened to the nodes in DC #1 in our case. The most favorable node to recover from is a cleanly shut down node in DC #2.

The cluster bootstrap automation performing the recovery ran through its configured hosts in sequential order. Unfortunately the first node in the list was an unclean node in DC #1, forcing an SST to happen on this host. If the other host in DC #2 had been chosen, a simple IST would have sufficed and the second node in the cluster would have been brought up within minutes. Instead the automation was waiting for the SST on the node in DC #1, which delayed bringing up the next node in the automation by 4 hours.

Painful GCache durability

Durability of the GCache on a Galera node is calculated by looking up the first and last item in the ring buffer and then calculating the time between the two. A Galera incremental state transfer (IST) can only happen if the delta of writesets requested by the joining node is still present in the GCache on the donor node. This durability is basically the window of time in which you can still perform an IST from this node, and it is one of the metrics to keep an eye on. Our GCache durability was somewhere in the range of 45 minutes.

When binary logging is set to row based replication and binlog_row_image is set to full (the default), the data written to the binlog will contain the previous row state as well as the updated values. The same applies to the GCache ring buffer: when set to full, the entire previous state of the row is written to the GCache.

On this particular cluster one of the tables looked similar to this:

CREATE TABLE product (
  pid BIGINT NOT NULL AUTO_INCREMENT,
  title varchar(255) NOT NULL,
  description TEXT NOT NULL,
  specifications TEXT NOT NULL,
  stock INT NOT NULL DEFAULT 0,
  price INT NOT NULL DEFAULT 0,
  image_1 VARCHAR(255) NOT NULL,
  image_2 VARCHAR(255) NOT NULL,
  image_3 VARCHAR(255) NOT NULL,
  thumbnail_1 VARCHAR(255) NOT NULL,
  thumbnail_2 VARCHAR(255) NOT NULL,
  thumbnail_3 VARCHAR(255) NOT NULL,
  ...
  another 50+ columns
  ...
PRIMARY KEY(pid)) ENGINE=InnoDB;

You might think an SST isn’t a big deal, but if you have a data size of 1TB it will take about 3 hours to transfer on a 1Gb network or, if you’re lucky, 21 minutes on a 10Gb network. The connection between DC #1 and #2 was a VPN that was limited in bandwidth (about 400Mbps), and moving the 400GB snapshot between node 4 and node 1 took about 4 hours to complete.

As the bootstrapped node was online according to the ProxySQL script, it allowed writes to the cluster while being a donor node. The mass-update jobs (stock and pricing sync) started to run once again, altering all sorts of data inside the cluster. This means an SST taking 4 hours to complete made IST impossible between the cleanly shut down nodes in DC #2! As a result an SST also happened between nodes 4 and 5 inside DC #2. Luckily the local network in DC #2 is a bit faster than the VPN, but it was a totally unnecessary recovery.

Recovery improvements

If we look at the outage, the cluster was only unavailable for a few minutes thanks to ProxySQL sending all traffic to a single node. But according to the definition of high availability, a single node is not HA. So if we take into account that we need at least two nodes to be available, recovery only happens after the first SST has completed. This means our Mean Time To Recover Service (MTTRS) on Galera was at least 4 hours!

[Image: MTTR-Galera-flow]

But as explained earlier: the ProxySQL script will not allow traffic to be sent to a donor unless there is no other node available. This means that right after node 1 has been recovered, the ProxySQL script will remove the bootstrapped node 4 from the available nodes and we’re back to single node availability. Technically the MTTRS should then be extended until after node 2 has been recovered, but we’ll keep that discussion out of scope for now.

In the post mortem of this incident we identified a couple of improvements we could make to speed up the recovery of a Galera cluster:

  1. Faster network (between DC #1 and DC #2)
  2. Make GCache bigger
  3. Increase GCache durability
  4. Make the transfer smaller

The first option was not feasible as it would increase cost dramatically, so we focused on options 2 and 3 first. The GCache was already set to 4GB, which is quite large. So we simply changed the binlog_row_image parameter from full to minimal. Since most modifications per row were only a few bytes, this increased the durability from 45 minutes to somewhere between 4 and 5 days! I felt a lot more confident that unnecessary SSTs wouldn’t happen anymore.
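The change itself is a one-liner; a sketch (it only affects sessions that connect after the change, so persist it in my.cnf as well):

SHOW GLOBAL VARIABLES LIKE 'binlog_row_image';
SET GLOBAL binlog_row_image = 'minimal';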

How to make a Galera State Snapshot Transfer smaller?

Now the fourth option proved to be a bit more challenging: how do you decrease a data size of 400GB? We could just drop a couple of the tables larger than 50GB, but I think the product managers would be very angry at me. Another option would be to enable table compression, but I’d rather not go through the slow migration process on tables larger than 50GB with distributed storage on a hyperconverged infrastructure.

Another option is to compress during the creation of the xtrabackup snapshot, which is done for example by using the following configuration:

[sst]
compressor="gzip"
decompressor="gzip -dc"

This would probably save us something near half of the bytes to send over the line, and every byte compressed means we can send the SST faster, right? Naturally we wanted to test this and benchmark how much faster we would be able to move data between the two locations compared to our current uncompressed SST.

Benchmarking State Snapshot Transfers

We selected the following methods:

  1. Uncompressed
  2. Gzip
  3. Pigz (normal and fast)
  4. LZ4
  5. Snappy
  6. Qpress (QuickLZ, built into xtrabackup)

The method for testing was to limit the bandwidth between two nodes in our Galera test cluster (4 cores and 8GB of memory) to 300Mbit, force an SST to happen on one node and then measure the time between the start of the SST and the moment the node had rejoined the cluster. The number of bytes transferred was measured on the network interface, and we re-ran each method three times to get a reliable outcome.

First of all, the throughput already tells us a couple of things:

[Image: Galera SST benchmark network throughput]

LZ4 and Snappy seem to be able to almost fully utilize the bandwidth of our test system, which is good.

We can see that the throughput of gzip and pigz is very low. Gzip is renowned for being slow in the compression phase, but pigz (parallel gzip) is supposed to tackle this by compressing chunks of data in parallel. On the test machine the graphs of the gzip run showed it was actually limited by a single core being fully utilized:

[Image: Benchmark SST donor GZIP (second set of peaks)]

The peaks before the gzip runs are the uncompressed runs. The gzip throughput is really dwarfed by the uncompressed runs.

You may think pigz does any better, but alas, even in fast compression mode it’s not much better:

[Image: Benchmark SST donor PIGZ fast compression (last three peaks)]

Yes, it now compresses on all four cores, but the throughput is only twice as high.

So what about the total amount of bytes transferred?

[Image: Galera SST benchmark – total bytes transferred]

Gzip and pigz are truly king (and queen) here: they save almost 60% of total bytes transferred! This is an amazing achievement and it saves a lot of network bandwidth. LZ4 and Snappy do a much worse job at 25% and 32% respectively.

Now the most important graph, the total duration that we were after:

[Image: Galera SST benchmark – total duration]

LZ4 and Snappy are clear winners here by saving us 15% and 11% in duration of the SST.

Conclusion

We applied Snappy to our production Galera cluster. As our production servers were slightly different we were able to save about 20% on the duration of the state snapshot transfer.

I think the outcome would also have been a bit more favorable for pigz on these production servers, as they ran on 8 cores instead of only 4. However I fear for the responsiveness of a bootstrapped (donor) server utilizing all 8 cores to the maximum for a few hours. Snappy and LZ4 may compress a bit less, but they are much “friendlier” to your donor.
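For completeness: configuring one of the faster compressors follows the same pattern as the gzip example earlier. An LZ4 variant could look like this (a sketch; the exact binary names and flags depend on what is installed on both donor and joiner):

[sst]
compressor="lz4"
decompressor="lz4 -d"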

And the final advice I can give you is to keep an eye on the GCache durability as this may be a life saver one day!

Want more?

This is one of the background stories of my talk Benchmarking Should Never Be Optional on Wednesday 2nd of October 2019 at Percona Live Europe in Amsterdam.

See me speak at Percona Live Europe 2019

In my talk I will feature a few cases of why you should always benchmark your systems up front. It’s not only about database benchmarking; in some cases the entire system requires benchmarking.

Hyperconverging and Galera cluster

What is hyperconverging?

Hyperconverging is the latest hype: do things more efficiently with the resources you have by cramming as many virtual machines as possible onto the same hypervisor. In theory this should allow you to mix and match various workloads to make optimum use of your hypervisor (e.g. all cores used 100% of the time, overbooking your memory up to 200%, moving virtuals around like there is no tomorrow). Every cloud provider is hyperconverging their infrastructure, and this has pros and cons. The pro is that it’s much cheaper to run many different workloads; the con clearly is the noisy neighbors you may encounter. As Jeremy Cole said: “We are utilizing our virtual machines to the max. If you are on the same hypervisor as us, sorry!”

Apart from cloud providers, you could hyperconverge your infrastructure yourself. There are a few hardware/software vendors out there that will help you with that and at one of my previous employers we got a helping hand from one such vendor!

DIY hyperconverging

In our case the entire infrastructure was migrated to a new hyperconverged infrastructure where we would have multiple infrastructure clusters (read: four hypervisors in one chassis) in multiple data centers. Infra marked one of these DCs suitable for our customer facing projects, as the peering was done in that DC. The idea behind this new infrastructure is that a VM can basically run anywhere in your infrastructure and be copied in real time to another hypervisor within the same cluster (read: chassis). This copy process (including memory) obviously required some (short) locking, but it worked amazingly well. We even had some software running that would move VMs around to optimize the workloads and still retain some spare capacity. Magic!

Now there was an additional benefit to choosing this vendor: if a hypervisor goes down, the same VM can be spun up immediately on another hypervisor, albeit without copying the memory contents. To be able to do this, the disks are synced to at least one other hypervisor. Some cluster magic detects that one of the hypervisors is down and automagically spins up the same VMs on another (available) hypervisor that contains the latest data of those VMs. To spread the load among the various hypervisors, the advised replication factor of the disks is 2, where 2 means the data is copied to (at least) two other hypervisors.

Hyperconverging Galera

Our Galera cluster consisted of three Galera nodes with three asynchronous read replicas attached (see image below).

[Image: Galera cluster with read slaves]

In this picture every Galera node stores every transaction in the GCache, InnoDB flushes the transaction to disk (ibdata*) and asynchronous replication dictates another write to the binlogs. That means that every transaction in our Galera node will already be stored three times on disk.

The hyperconverged cluster where we hosted Galera had the replication factor set to 2. That means every byte written to disk will be written to at least two other storage controllers (VMs), as shown in the image below. This write operation over the network is synchronous, so the filesystem has to wait until both controllers have acknowledged the write. The latency of this write is negligible, as the write is super fast and performed over a low latency network. The magic behind this synchronous disk replication is out of scope for this blog post, but I can hint that a certain NoSQL database (named after some Greek mythology) is managing the storage layer.

[Image: Hyperconverged write amplification: every write to disk will be written three times!]

This means that every write to disk in our Galera node will also be synced to an additional two hypervisors. To make matters worse, due to the (semi-)synchronous replication, all three Galera nodes perform the exact same operations at (almost) the exact same time!

1 transaction = 3 nodes × (3 local writes + 6 network writes) = 27 writes

As you can see from the simple formula above, 9 writes are performed locally and 18 writes are performed over the network. As every write to disk is replicated synchronously over the network, this adds more than negligible latency when a single transaction spawns 18 network writes at the same time. And if 1 transaction to Galera can cause 18 synchronous writes over the network, imagine what latency you will encounter with a baseline of 200 transactions per second! And we’re not even counting the asynchronous replicas performing similar write operations mere (milli)seconds later!

Galera managed to cope, but instability happened at set intervals. We could trace these back to our so-called stock updates and pricing updates: every half hour stock levels were pushed from the warehouse database, and every few hours new pricing information was pushed via the enterprise service bus.

With more than a million products in the database, these torrents of writes quickly caused disk latency in the entire hyperconverged cluster, and we have seen the disk latency shoot up well beyond 80ms. This no longer affected just the Galera cluster: it was causing cluster-wide issues on the distributed storage layer as well. And to make matters even worse: latency on the entire network was also shooting up.

Benchmarking semi-synchronously replicated hyperconverged clusters

At first nobody believed us, even when we showed the graphs to the vendor. This new infrastructure was so much more expensive than our old one that it simply couldn’t be true. Only after conducting benchmarks, reproducing the latency on an empty test cluster, were we taken seriously. The benchmarks revealed that the write amplification saturated the network interfaces of the cluster, and we worked with the vendor on finding a solution to the problem. Even after upgrading the network (10G interface bonding, enabling jumbo frames, hypervisor tuning) we still found latency issues.

The issue with our hyperconverged cluster was that there was no (separate) internal network handling the inter-hypervisor traffic. Sure, we could now achieve double the number of transactions, but that didn’t solve the underlying issue of causing latency on other VMs and on the ingress and egress network of our applications.

Conclusion

We came to the conclusion that (semi-)synchronously replicated databases and hyperconverged infrastructures with high replication factors don’t match. Unfortunately the replication factor could only be set at cluster level and not per individual VM. Also, the reasoning behind the synchronous disk replication did not make sense for us (see also my previous blog post): Galera would wipe the disk contents anyway, and in general it takes quite some time for the database to recover, so a quick failover would not happen. That’s why we ran Galera+ProxySQL in the first place: to allow a failover to happen within seconds!

We also ran other (semi-)synchronously replicated databases, for example MongoDB, SOLR and Elasticsearch, and each and every one of them had basically the same lack of need for disk replication.

The only option left was to migrate the Galera cluster back to our old hardware that, luckily/sadly, was still switched on. At the same time we started working on a migration to a real cloud vendor as they could offer us better performance without the risk of a single point of failure (e.g. single data center).

So what difference would a benchmark up front have made?

This only happened due to bad requirements, without analyzing the workload that was supposed to be converged. We would have seen these issues before migrating to the new hyperconverged infrastructure if we had benchmarked beforehand. It would have saved us from many instabilities, outages and post mortems. We might even have chosen a totally different setup, or have chosen to split our workloads over multiple (smaller) hyperconverged clusters.

This is one of the background stories of my talk Benchmarking Should Never Be Optional on Wednesday 2nd of October 2019 at Percona Live Europe in Amsterdam.

See me speak at Percona Live Europe 2019

In my talk I will feature a few cases of why you should always benchmark your systems up front. It’s not only about database benchmarking; in some cases the entire system requires benchmarking.

Is Galera trx_commit=2 and sync_binlog=0 evil?

It has been almost 5 years since I posted on my personal MySQL related blog. In the past few years I worked for Severalnines, and blogging both on their corporate blog and here would have been confusing. After that I forgot about and neglected this blog a bit, but it’s time to revive it!

Speaking at Percona Live Europe – Amsterdam 2019

Why? I will be presenting at Percona Live Europe soon, and this blog and its upcoming content are the more in-depth part of some of the background stories in my talk on benchmarking: Benchmarking should never be optional. The talk will mainly cover why you should always benchmark your servers, clusters and entire systems.

See me speak at Percona Live Europe 2019

If you wish to see me present, you can receive 20% discount using this code: CMESPEAK-ART. Now let’s move on to the real content of this post!

Innodb_flush_log_at_trx_commit=2 and sync_binlog=0

At one of my previous employers we ran a Galera cluster of 3 nodes to store all shopping carts of their webshop. Any cart operation (adding a product to the basket, removing a product from the basket or increasing/decreasing the number of items) would end up as a database transaction. With such important information stored in this database, in a traditional MySQL asynchronous replication setup it would be essential to ensure all transactions are retained at all times. To be fully ACID compliant the master would have innodb_flush_log_at_trx_commit set to 1 and sync_binlog set to 1 (correction by Przemek Malkowski) to ensure every transaction is written to the logs and flushed to disk. When every transaction has to wait for data to be written to the logs and flushed to disk, this limits the number of cart operations you can do.
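For reference, this is the difference between the two configurations (a sketch; both variables are dynamic, but you would normally persist them in my.cnf):

-- fully durable, ACID compliant settings
SET GLOBAL innodb_flush_log_at_trx_commit = 1;
SET GLOBAL sync_binlog = 1;

-- the relaxed combination this post is about: faster,
-- but an OS crash can lose up to a second of transactions
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
SET GLOBAL sync_binlog = 0;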

Somewhere in a dark past the company exceeded the number of cart operations possible on this host, and one of the engineers found a Stack Overflow post instructing how to improve the performance of MySQL by “tuning” the combination of these two variables. Naturally this solved the immediate capacity problem, but it sacrificed durability at the same time. As Jean-François Gagné pointed out in a blog post, you can lose transactions in MySQL when you suffer an OS crash. This was inevitably going to happen some day, and when that day arrived a new solution had become available: Galera!

Galera and being crash-unsafe

Galera offers semi-synchronous replication to ensure your transaction has been committed on the other nodes in the cluster. You just spread your cluster over your entire infrastructure on multiple hosts in multiple racks. When a node crashes it will recover when rejoining and Galera will fix itself, right?

Why would you care about crash-unsafe situations?

The answer is a bit more complicated than a simple yes or no. When an OS crash happens (or a kill -9), InnoDB can be further ahead than the data written to the binary logs. But Galera doesn’t use binary logs by default, right? No it doesn’t, but it uses the GCache instead: this file stores all committed transactions (in a ring buffer), so it acts similarly to the binary logs and is affected by these two variables in a similar way. Also, if you have asynchronous slaves attached to Galera nodes, it writes to both the GCache and the binary logs simultaneously. In other words: you can create a transaction gap with a crash-unsafe Galera node.

However, Galera keeps the state of the last UUID and sequence number in the grastate.dat file in the MySQL data directory. While Galera is running this file contains seqno: -1, and only upon a normal shutdown is the real position written to grastate.dat. So when Galera finds seqno: -1 on startup, it assumes an unclean shutdown happened and, if the node is joining an existing cluster (becoming part of the primary component), it forces a State Snapshot Transfer (SST) from a donor. This wipes all data on the broken node, copies all data and makes sure the joining node has the same dataset.
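To illustrate, the grastate.dat of a node that is running (or that crashed) looks roughly like this; only after a clean shutdown does it contain the real last committed seqno (the uuid below is just the one from the recovery example further down):

# GALERA saved state
version: 2.1
uuid:    8bcf4a34-aedb-14e5-bcc3-d3e36277729f
seqno:   -1
safe_to_bootstrap: 0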

Apart from the fact that an unclean shutdown always triggers an SST (bad if your dataset is large, but more on that in a future post), Galera pretty much recovers by itself and isn’t that much affected by being crash-unsafe. So what’s the problem?

It’s not a problem until all nodes crash at the same time.

Full Galera cluster crash

Suppose all nodes crash at the same time: none of the nodes would have been shut down properly and all of them would have seqno: -1 in their grastate.dat. In this case a full cluster recovery has to be performed, where MySQL is started with the --wsrep-recover option. This will open the InnoDB header files, shut down immediately and return the last known position for that particular node.

$ mysqld --wsrep-recover
...
2019-09-09 13:22:27 36311 [Note] InnoDB: Database was not shutdown normally!
2019-09-09 13:22:27 36311 [Note] InnoDB: Starting crash recovery.
...
2019-09-09 13:22:28 36311 [Note] WSREP: Recovered position: 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114428
...

Now we have three independent Galera nodes that each suffered an unclean shutdown. This means all three may have lost the transactions from up to one second before the crash. Even though the transactions committed within the cluster are theoretically the same, as the cluster crashed at the same moment in time, this doesn’t mean all three nodes have the same number of transactions flushed to disk. Most probably all three nodes return a different recovered position, and even within that there could be gaps, as transactions are executed in parallel. Are we back to eeny-meeny-miny-moe and just picking one of these nodes?

Can we consider Galera with trx_commit=2 and sync_binlog=0 to be evil?

Yes and no… Yes, because we have potentially lost a few transactions, so it’s bad for consistency. No, because the entire cart functionality became unavailable and carts were abandoned in all sorts of states. As the entire cluster crashed, customers couldn’t perform any actions on their carts anyway and had to wait until service was restored. Even if a customer had just finished a payment, in this particular case the next step in the cart could not have been saved due to the unavailability of the database. This means carts were abandoned and some may actually have been paid for. Even without the lost transactions we would need to recover these carts and payments manually.

So to be honest: I think it doesn’t matter that much, as long as you handle cases like this properly. If you design your application right, you catch the (database) error after returning from the payment screen and create a ticket for customer support to pick this up. Even better would be to trigger a circuit breaker and ensure your customers can’t re-use their carts until the database has been recovered. Another approach would be to scavenge data from various sources and double check the integrity of your system.

The background story

Now why is this background to my talk, when it doesn’t have anything to do with benchmarking? The actual story in my presentation is about a particular problem around hyperconverging an (existing) infrastructure. A hyperconverged infrastructure will sync every write to disk to at least one other hypervisor in the infrastructure (over the network) to ensure that if the hypervisor dies, you can quickly spin up a new node on a different hypervisor. As we have learned above: the data on a crashed Galera node is unrecoverable and will be deleted during the joining process (SST). This means it’s useless to sync Galera data to another hypervisor in a hyperconverged infrastructure. And guess what the risk is if you hyperconverge your entire infrastructure into a single rack? 😆

I’ll write more about the issues with Galera on a hyperconverged infrastructure in the next post!

Presenting Sphinx Search at Percona Live London

It has been a while since I posted on this blog. The more in-depth articles are all posted on the Spil Games Engineering blog, and I have been so overloaded with work in the past months that I hardly have any time left.

One of the reasons for having too much on my plate was my own doing: I volunteered to be a conference committee member for Percona Live London and we, as a committee, worked our way through all proposals while you all were enjoying your holidays in the sun. Well, I must admit I did review a couple of them sitting in the sun, next to the pool enjoying a cold drink. 😉
I must say there were so many great proposals it was really hard to choose which ones would be left out.

I also proposed a talk for Percona Live London this year and my fellow committee members liked it enough to make it to the final schedule: Serve out any page with an HA Sphinx environment.
In essence it is a MySQL case study where I will show how we at Spil Games use Sphinx Search in three different use cases: our main search, friend search and serving out our content on any page. I’ll also describe how we handle high availability and index creation, and show benchmark results comparing MySQL and Sphinx in various use cases.

In case you are interested in Sphinx or the benchmark results: the session will be on the 4th of November at 3:10pm – 4:00pm in Cromwell 1 & 2.
Also don’t hesitate to ask me things when I’m wandering around in the corridors and rooms. Or maybe we’ll meet at the MySQL Community Dinner?
See you next week!

Presenting at FOSDEM and Percona Live

Very short update from my side: I’ll be presenting at FOSDEM in Brussels (1-2 February 2014) and Percona Live MySQL Conference in Santa Clara (1-4 April 2014).

FOSDEM
At FOSDEM I will present about Galera replication in real life, which concentrates on two use cases for Galera: adding new clusters to our sharded environment and migrating existing clusters into a new Galera cluster.

Percona Live MySQL Conference and Expo, April 1-4, 2014
At Percona Live MySQL Conference I will present about our globally distributed storage layers. Next to our globally sharded environment we have built a new environment called ROAR (Read Often, Alter Rarely) that also needs to be distributed globally.

Both are interesting talks and I really look forward to presenting at these great conferences. So if you have the ways and means to attend either one: you should!

Percona Live London is already next Monday!

I’m currently wrapping up the last things in the office to be prepared for my presentation about MySQL-Statsd at the Percona Live London conference next week. It will be a revised version of the talk I gave at the Percona Live Conference & Expo in Santa Clara, but this time it will be more focused on the practical side and, obviously, on the MySQL-Statsd daemon we open sourced recently. So if you missed that talk or think a follow-up is needed, you should definitely attend! Slides will also become available immediately after the talk.

I’m also looking forward to attending a lot of other talks, for instance The Shard Revisited: Tools and Techniques Used at Etsy and MySQL, the ebay Classifieds Way, and of course the tutorials.

It is great to see the bus trip to the community dinner organized by MariaDB. I did not really mind taking the tube for 17 minutes, but getting a ride in a Routemaster is obviously a lot more fun and comfortable!

So see you at the conference next Monday and Tuesday!

Spil Games MySQL StatsD project on Github

Yesterday evening I already announced the happy news on the Spil Games Engineering blog, you can find our MySQL StatsD daemon in our Github repository here:
https://github.com/spilgames/mysql-statsd

It is far from perfect and the todo list is still long. You are, as it is a public repo, welcome to collaborate with us and make it even better. I can’t wait to see the first pull-request coming in!

You can also talk to us directly at Percona Live London, where we will bring a revised version of our talk from Santa Clara. This time it will be aimed more at practicality: it will contain fewer comparisons with existing tools, be less theoretical and focus more on how to do it.

A bit of a shame to have to do it this way: I tried several times to add the Spil Games Engineering feed to the Planet MySQL feed adder, but it fails the feed validation and suggests validating it via feedvalidator.org. However, feedvalidator.org validates it perfectly fine as an RSS 2.0 feed. I used the Planet MySQL feedback form twice but got no reply, so I think I will file a bug report later today.