Changing the xbcloud filesize | xtrabackup streaming to cloud storage

When Percona released Percona XtraBackup 2.4.14 on May 1st 2019 they included a very important feature: streaming a backup to S3 and GCS using the xbcloud binary! My team implemented streaming to shorten the total time for taking our backups. We found that streaming our encrypted backups to GCS was faster than making them locally, but that is a subject for another blog post. The matter at hand in today’s post is that we found xbcloud was creating an enormous number of small files: xtrabackup creates files of 10MB by default. No matter what we tried, we couldn’t get the xbcloud binary to make larger files. In the end we managed to solve it with the (public) help of Percona!

TL;DR

The issue is that you can set --read-buffer-size to change the size of the files xbcloud creates. However, if you also encrypt your backup, this change gets lost. You then need to set --encrypt-chunk-size to the same size as your --read-buffer-size, and this will produce the larger file size.
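
For reference, below is a sketch of the kind of backup pipeline this applies to. The bucket name, key file and 200M sizes are illustrative, credentials are omitted, and option names can differ slightly between Percona XtraBackup versions, so double-check them against your own version:

xtrabackup --backup --stream=xbstream --target-dir=/tmp \
  --read-buffer-size=200M \
  --encrypt=AES256 --encrypt-key-file=/etc/mysql/backup.key --encrypt-chunk-size=200M \
  | xbcloud put --storage=google --google-bucket=my-backup-bucket \
      --parallel=7 "full-backup-$(date +%F)"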

The xbcloud problem

The problem lies in how a backup is constructed. Xtrabackup handles the MySQL part of the backup and pipes the backup stream to the xbstream and xbcloud binaries. The xbcloud binary then streams the data to the backup bucket in 10MB chunks. This means that every time your backup stream reaches 10MB, a new 10MB file is created in the bucket. The reasoning behind this is that if you hit a network error, not the whole backup is lost and xbcloud can re-upload just the failed 10MB file instead of everything. The number of retries by xbcloud is 3.

The issue we found is that large backups (8TB of data or larger) will create many many many files. Browsing through the GCS bucket shows you Google doesn’t even wish to estimate how many files there are unless you go to the final page. But a quick calculation says to me that an uncompressed and encrypted 1TB table should end up in at least 100,000 files. Unfortunately there are a couple of downsides attached to this.

Many many many small 10MB files are uploaded. Even GCS doesn't dare to say how many...

Reliability of storage

When we look at the Cloud, we assume everything is stable and networking is always there. Google assumes GCS is reliable and only publishes numbers on the uptime of GCS (99.9%, so 3 nines). Let’s take this SLA and then assume a failure rate of 1 out of 1,000 attempts to store a file in GCS. As xbcloud has 3 retry attempts, in theory we would have a chance of 1 out of 1,000,000,000 of getting 3 consecutive errors on the same file. But when we get into big numbers of files to upload, the expected number of files failing all retries in a batch of 800,000 files is 800,000 × 10⁻⁹ = 0.0008. That means on our total batch we have a likelihood of about 0.08% of a permanently failed file store. That doesn’t sound too bad, does it?

But it is far more likely that consecutive errors happen, as there is a reason why the file storage failed in the first place. It could be that a queue is full or that the pod running the API that accepts the request is failing. So the likelihood of a file storage failure is far bigger than what I just made up in the previous paragraph. Also, the larger the total number of files gets, the more likely you will end up with an error on storing a file. This error will then be propagated upwards by xbcloud to xtrabackup, which in turn makes the entire backup fail. Retaking an 8TB backup that takes 6 hours isn’t going to be my definition of fun.

And this is exactly what we encountered. Large backups tended to fail more often and even our automated retry mechanism on the backup would fail to produce a backup from time to time. With the help of support we have improved things by adding more retries for the file uploads. This lowered the chance of failed file stores, but it could still happen. So if we can lower the number of files we upload, we have less chance of a failed backup.

Cost of file operations

With cloud providers you pay mainly for storage used. However, people tend to forget you also have to pay for file operations. For instance, with GCS uploading a file is a Class A operation and this costs $0.05 per 10,000 operations for normal storage and $0.10 per 10,000 operations for Nearline and Coldline storage. When we create a backup, we store it in a temporary bucket on normal storage at first. We then stage the backup (test and verify) before moving it to a protected bucket that retains it for 30 days on Nearline. That means for a backup of 1TB (100,000 files), we actually do 100,000 Class A operations (store) and 100,000 Class B operations (get files to verify) in normal storage and 100,000 Class A operations (moving to the protected bucket) in Nearline storage. The costs for the operations alone in these various buckets are already $0.50, $0.04 and $1.00 per backup. This means the total cost for a 1TB backup is $1.54 for all bucket operations.

The costs for storing 1TB of data for a month in Nearline storage is $10.00. This means $1.54 costs for bucket operations make up a significant portion (about 13%) of the total costs for the backup. So if we can create larger files (chunks), we would be able to lower the total number of files and we would also be able to lower the total number of operations on these buckets.

Xtrabackup backup duration

Another potential issue of uploading 100,000 files is that there is extra overhead for transferring every individual file. The xbcloud binary streams files to the cloud provider one by one, so it has to wait for every file to complete, and 10MB files are relatively small and quick to upload. Transferring these files is done via HTTPS, so there is also SSL overhead. You can make xbcloud upload files in parallel with the --parallel parameter; we configured 7 parallel streams and this makes it a bit better. However, all 7 parallel streams will still be waiting individually for files to complete during the transfer. We raised this concern, but were told file upload benchmarks showed there was no added benefit from uploading files larger than the default size. To me this felt very counterintuitive.

Why do we want a shorter backup duration? Remember we also stage our backups, where we test and verify them? This process takes time as well and we’d rather have these backups in our verified protected backup bucket as quickly as possible. If a backup takes 8 hours to create, our backup verification process would also take at least 8 hours. This means we only have a usable backup 16 hours after we initiated it. We then have to catch up 16 hours of transactions to get our recovered host back to 0 seconds of replication lag. If we can reduce the backup time to 4 hours, we would only have to catch up 12 hours of transactions. That is a 25% reduction on our time to recover for this specific database.

Fixing the file size problem

Naturally we reached out to our support and even with their help we couldn’t make it work. We knew it had to do with the --read-buffer-size parameter for xtrabackup, but increasing this to 100MB still created 10MB files inside the bucket. We suspected the xbcloud binary creating the files was to blame, assuming it contained a hard-coded value. Our ticket remained open for over six months without a proper resolution.

We were also aware of a bug report (PXB-1877) on the public bugtracker of Percona that described our exact problem. We actively contributed to and commented on this bug as well. In late December 2021 the bug was kicked up once more and, as you can read in the bug report, the same results were shared there as well. In mid February it was kicked up again by the same person, as he was having similar issues to ours: high cost and failed backups.

This time Marcelo Altmann responded once again that --read-buffer-size could be used to control this. We responded that we had already tried that, but failed to see any result. This time Marcelo asked us to share the full command. When we shared it he noticed we were using encryption via xtrabackup as well and told us to also set --encrypt-chunk-size to the same size as --read-buffer-size. This fixed our issue and we’ve been happy ever after!

Conclusion

What probably caused the problem in the first place is that xtrabackup also pipes the stream through the xbcrypt binary. This then messes up the file size set via --read-buffer-size. We implemented the change from external encryption to internal encryption some time before streaming backups to a bucket. Also, this behavior is not documented anywhere by Percona.

We now run with a file size of 200MB, which means a 20× reduction in the number of files. All of a sudden Google finds it a lot easier to show file count estimates for our buckets. We also definitely see a big improvement on large database clusters: backup duration dropped from 18,000 seconds to 13,000 seconds.

Backup duration on a huge database dropped from 18K seconds to 13K seconds after setting both encrypt chunk size and read buffer size

I do have to add that on small database clusters (smaller than 1TB) we do see an increased backup duration. The example below shows this; it is about 10 minutes on an 18-minute backup.

Backup duration on a small database increased from 1K seconds to 1.3K seconds after setting both encrypt chunk size and read buffer size

We can’t see the improvement in cost effectiveness yet as our buckets are inside a shared project. However, as we have achieved a 20× reduction in the number of files, we expect the cost for bucket operations to become negligible compared to the storage costs.

What surprised us most were numerous claims that there wouldn’t be any benefit from increasing the file size. Perhaps the tests conducted by Percona were performed on hosts within the range between 100GB and 2TB of data. In this range you can probably see there is no real benefit from increasing these buffers, especially not to very large sizes.

One word of caution: don’t increase these two buffers too much. We toyed with various sizes and when we reached 1GB, it caused the xtrabackup chain to try to allocate an extra 2GB of memory for the buffers. Naturally the OOM killer dropped by to say hi. Not recommended.

Paradigm shift on double date range scans

At my employer (Messagebird) we are using Vividcortex (nowadays known as SolarWinds DPM) for query insights, and this gives us a top 20 list of our most time consuming queries on a specific host. When you make one of your slow queries on the primary faster, or offload it to the read replicas, new ones will rise to the top. That’s how I found a puzzling query that at first glance seemed to be quite okay, but was at the top of the list.

The example query below is quite similar to the one I found, but I changed the subject a bit to make its usage clearer:

SELECT discount_percentage FROM customer_discounts WHERE customer_type=1 AND customer_country=2 AND valid_from <= NOW() AND valid_till > NOW() LIMIT 1

What this query attempts to do is get a valid discount percentage for a customer segregated by customer_type and customer_country. In itself there seems to be little to no harm in this query.

Also the table definition and indexes looked reasonable:

CREATE TABLE customer_discounts (
  id int(11) NOT NULL AUTO_INCREMENT,
  customer_type tinyint(3) NOT NULL,
  customer_country tinyint(3) NOT NULL,
  valid_from datetime NOT NULL,
  valid_till datetime NOT NULL,
  discount_percentage tinyint(3) NOT NULL,
  created_at datetime NOT NULL,
  PRIMARY KEY (id),
  KEY country_idx (customer_country, valid_from, valid_till),
  KEY discount_idx (customer_type, customer_country, valid_from, valid_till, discount_percentage)
) ENGINE=InnoDB;

Covering index

We have a covering index for the query above, so we have nothing to worry about. Or at least, that is until you think about it a little longer. In our case I saw multiple rows being read from the index for a covering-index query, so how can that be? The query contains a double range scan (on valid_from and valid_till), and an index can only be used for the first of the two ranges. This means our index is only effective for customer_type, customer_country and valid_from; MySQL then takes the rows found in the index and applies the remaining WHERE conditions to them.

The index is pretty selective for most of our discount codes, but what if we have years and years of discount codes lingering around in our table? Let’s suppose that on average each customer_type and customer_country combination has around 1,000 discount codes in our table. This means for each lookup MySQL reads on average 1,000 entries from the index and applies the remaining WHERE conditions to them. If you have hundreds of queries per second (yes, this is a large store), suddenly this index isn’t that efficient anymore.

This is what the execution plan would look like:

{
  "query_block": {
    "select_id": 1,
    "table": {
      "access_type": "range",
      "attached_condition": "(where)",
      "filtered": 33,
      "key": "discount_idx",
      "key_length": "12",
      "possible_keys": [
        "discount_idx",
        "country_idx"
      ],
      "rows": 867,
      "table_name": "customer_discounts",
      "using_index": true
    }
  }
}

Paradigm shift

So how do we fix this? We have to go back to what we want to achieve here: we want to get the currently active (valid) discount for our customer segregation. If we look at our index, we cover the start date (valid_from) of the discount first. When we look at our query, we are querying for discounts that start before today. For most of our discount entries this condition is true, because most of our discount entries were valid somewhere in the past. I would argue that probably 99% of the discounts have a start date that satisfies this condition. That’s what you get if you keep expired discounts in your table.

So what we need here is a paradigm shift. We need to stop looking at the past and instead look ahead into the future. Most probably we will have far fewer discount entries where the end date lies somewhere in the future. Older discount codes that are no longer valid will have an end date in the past and will not match our valid_till condition. So we need to change our index and switch the order of the start and end date, like this:

KEY discount_idx (customer_type,
customer_country,
valid_till,
valid_from,
discount_percentage)
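
In practice the swap can be applied with a single ALTER statement (a sketch; on a very large or busy table you may prefer an online schema change tool):

ALTER TABLE customer_discounts
  DROP INDEX discount_idx,
  ADD INDEX discount_idx (customer_type, customer_country, valid_till, valid_from, discount_percentage);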

Profit

You may argue that we still have the double range scan and we still can’t use the index beyond valid_till. Yes, that is true, but reversing the two date columns means that for 99% of our queries the index lookup now finds only a single matching entry, while the query can still return all data from the index.

And now our explain plan looks like this:

{
  "query_block": {
    "select_id": 1,
    "table": {
      "access_type": "range",
      "attached_condition": "(where)",
      "filtered": 33,
      "key": "discount_idx",
      "key_length": "12",
      "possible_keys": [
        "discount_idx",
        "country_idx"
      ],
      "rows": 1,
      "table_name": "customer_discounts",
      "using_index": true
    }
  }
}

The plan didn’t change that much, except that we now indeed have only a single row returned. This means we don’t have to apply a separate WHERE to the returned result set.

In our own case this change didn’t magically lower the load on the server or bump the query out of the top 20 list. We don’t have a uniform distribution in this table, but a few hundred combinations with thousands of rows and thousands of combinations with maybe 3 or 4 rows. After the change this query is still at the top of that top 20 list, but its latency is now more evenly distributed and has dropped by 20μs to 40μs (from around 280μs). It’s a small drop, but on thousands of requests per second it certainly counts!

Conclusion

So how did this happen? Who was to blame for this? I can be very brief about that: everybody and nobody.

It all has to do with our human brains: we think of time as a linear concept where there is a start and an end to everything. Because we think about time linearly, we favour putting start before end. So to our human logic the most natural thing to do is to put the start column before the end column in a table definition. This makes it more readable for us humans. Just imagine reversing those two columns: it doesn’t matter to the database in which order the columns are, but to us humans it probably wouldn’t make much sense, it simply looks wrong, and many data architects would automatically switch the two back to the way they are supposed to be. We would do the same thing with our indexes if we saw those columns reversed.

So that’s the reason why I’m sharing this story here. Next time when you create a table with a start and end date for expiring discounts/products/rates/tickets, I hope you will remember this blogpost and reverse the start and end date order in the index. If anyone disagrees with you, please refer them to this article.

10 reasons why MySQL Query Cache should be switched off

This post is not meant as a rant, so I hope it will not be interpreted this way. A few weeks ago I was asked the question why we decided to switch off the MySQL Query Cache on all our production servers. Couldn’t we just enable it to squeeze some extra performance out of those expensive 16 and 32 core machines? A question like this would make most MySQL database engineers cringe, but of course we know what’s wrong with the MySQL Query Cache. I could explain this perfectly well, but I knew it was going to be a lengthy write up. So I looked for an existing source that could explain this to anyone.

And to be honest: I couldn’t find a clear (blog) article that explains this to a person who isn’t a database engineer. Even worse: I could find many articles describing how easy it is to enable it for performance reasons, including a tutorial by DigitalOcean! Now if DigitalOcean is telling me how to improve performance, why should I doubt their words?

Time to clear things up once and for all by giving 10 reasons why you should switch it off! But not before I explain what the MySQL Query Cache actually is!

What is the MySQL Query Cache?

The MySQL Query Cache is nothing more than a slab of reserved memory within the MySQL daemon to store the results of a query. The next time someone executes the same query the result is readily available and can be returned instantly. The Query Cache was created in a time when hardware resources like CPU and disk were precious and expensive. This was back when single-CPU (Linux) servers were king and a fast disk meant you probably bought a 10K rpm SCSI disk with a whopping 9GB of storage. To prevent expensive disk access and avoid wasting CPU cycles on ordering and filtering data, the (simple) query cache was invented.

At a high level the MySQL Query Cache works as follows (side note: this is a simplified drawing and obviously not 100% accurate):

High-level overview of the MySQL Query Cache


SELECT … FOR UPDATE on non-existent rows

TL;DR

SELECT … FOR UPDATE has a (not so) surprising side effect on non-existent rows: it could cause a (serious) performance penalty and even prevent you from inserting new rows at all.

Locking rows for update

A development team of ours was working on an application that needed to ensure an update on a single row isn’t modified by another transaction. Naturally they started making use of SELECT … FOR UPDATE to lock the row before updating it. This worked excellently for keeping anyone else from updating the row. However, during a load test they started to get lock wait timeouts on new inserts of totally unrelated items, and they asked me to look into this.

SELECT … FOR UPDATE is described as follows in the MySQL documentation:
A SELECT ... FOR UPDATE reads the latest available data, setting exclusive locks on each row it reads. Thus, it sets the same locks a searched SQL UPDATE would set on the rows.
So far so good: this behavior is expected to happen. It also doesn’t mention anything about locking anything but the rows it reads.

I asked the team whether they were attempting to insert the same data as they were locking in the other transaction and they said they were not.

In pseudo code they were doing this:
SELECT ... WHERE uuid='some-uuid' FOR UPDATE;
if row { UPDATE row }
else { INSERT row }

The uuid column here is the primary key of the table. This code executed fine and had no issues by itself as a single transaction.

You may wonder why not use the INSERT … ON DUPLICATE KEY UPDATE or REPLACE INTO?
First of all we are inserting only occasionally, so that would fail the insert command 99% of the time. Second of all we may only be updating a single column within a single row, so that implies we would need to know the entire row up front when we have to insert or replace the row.

No row to update

Now what would happen if there is no row to update?

According to the description in the MySQL documentation it sets an exclusive lock on each row it reads, but what about when there is no row to read? This other excerpt, on gap locks, hints at what actually happens:
For example, SELECT c1 FROM t WHERE c1 BETWEEN 10 and 20 FOR UPDATE; prevents other transactions from inserting a value of 15 into column t.c1, whether or not there was already any such value in the column, because the gaps between all existing values in the range are locked.
If there is no row, a lock will still be set on the associated index entries. This shouldn’t be bad, right? We can still insert other rows, right? Wrong: it isn’t a gap lock alone, but a next-key lock!

Since the lock is set on a non-existent index entry, a next-key lock is created. Depending on where you would insert in the index, you may find a whole range being locked as the insert needs to happen within this range. With our version of UUID this shouldn’t happen very often as there is a random factor, but it still can happen often if you only have a few rows in your table. In that case the gaps between UUIDs are large and, due to the random factor, you may end up locking large gaps within the index. As cross-region latency is present on this system, the next-key locks are held longer and the chance of a collision within a gap also increases a little bit. So that explains the behavior during the load test. So all’s well that ends well?

Nasty side effect of Next-Key locks

There is one nasty side effect with the next-key lock: if the index value is greater than the largest value in the table, it locks everything above the largest value up to infinity.

So what would happen to a table where the primary key is sequential, like an integer? For example this table with the following rows:
CREATE TABLE sometable (
  id int(11) NOT NULL,
  some_data varchar(255) NOT NULL DEFAULT '',
  PRIMARY KEY (id)
);
INSERT INTO sometable VALUES
  (1, 'test'),
  (2, 'test'),
  (3, 'test'),
  (4, 'test'),
  (5, 'test'),
  (10, 'test'),
  (11, 'test'),
  (12, 'test'),
  (13, 'test'),
  (14, 'test'),
  (15, 'test');

This would create a gap between 5 and 10 and a gap from 15 till infinity.

When we select within the gap between 5 and 10, we create a next-key lock between 5 and 10 and other transactions can’t insert new rows inside this gap. New rows can still be inserted at the end of the table though. However, if we select a non-existent row with an id greater than 15, we put a next-key lock on everything from 15 till infinity. This means nobody can append anything to this table anymore until we have committed our transaction! This can become a serious bottleneck if you insert more than you update.
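
A minimal way to reproduce this on the example table (a sketch, assuming the default REPEATABLE READ isolation level; the id values are just illustrative):

-- Session 1: select a non-existent id above the current maximum (15)
BEGIN;
SELECT * FROM sometable WHERE id = 20 FOR UPDATE;
-- returns 0 rows, but locks the range from 15 up to infinity

-- Session 2: a completely unrelated insert at the end of the table
INSERT INTO sometable VALUES (16, 'test');
-- blocks until session 1 commits or rolls back, and may fail with
-- "Lock wait timeout exceeded" (innodb_lock_wait_timeout, 50s by default)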

Conclusion

I wasn’t aware of the fact that SELECT … FOR UPDATE actually locked the index for rows that don’t exist yet. I would have reasoned that if there were no rows to be selected, there wouldn’t be anything to lock. And if there wouldn’t be anything to lock, the whole SELECT … FOR UPDATE would simply fizzle.

Even though SELECT … FOR UPDATE sounds like a great way to ensure your transaction is the only one who modifies a specific row, it’s quite dangerous as it could lock out or delay other transactions. If you would take our example above, the safe way to do it (in pseudo code) is:
SELECT ... WHERE uuid='some-uuid';
if row { SELECT ... WHERE uuid='some-uuid' FOR UPDATE; UPDATE row }
else { INSERT row }

This ensures the lock is only set when there actually is a row present; however, this comes at the expense of an additional query.

Edit: added that UUID gaps can be significant

The ProxySQL multiplexing wild goose chase

TL;DR – We encountered multiplexing issues with ProxySQL and after going around in circles for a while we found that the impact of mysql-auto_increment_delay_multiplex and mysql-connection_delay_multiplex_ms was not documented. Read the blog post why you should check these values in your configuration and check how your codebase handles last insert identifiers!

At my present company we are using a multi-layer ProxySQL setup to route our traffic to the appropriate database cluster’s primary or replica hosts. For this post it doesn’t matter whether you run a single or a multi-layer setup, have many backend servers configured or not, so I’ll simplify our architecture to a single ProxySQL layer where the application connects to all three proxies evenly:

Simplified topology: the application connects evenly to all ProxySQL nodes

The reason for having N+2 proxies is that this ensures us that we can retain high availability after failure of a single node. I’ve also added a fourth (canary) node and sent a small amount of traffic to that to see how any change would affect multiplexing.

One of the key features of ProxySQL is that it’s able to multiplex connections. In short: ProxySQL will attempt to send queries to your database servers over idle backend connections. In effect, queries from separate incoming connections can be sent to the master over as few backend connections as possible, lowering the connection overhead on your database servers. This also means queries from the same incoming connection can be multiplexed over multiple backend connections. The limitation is that you can’t multiplex queries within a transaction, nor if you place locks, as these require the session to stick to the same backend connection. See also the ProxySQL multiplexing documentation for every condition under which multiplexing will be (temporarily) disabled.

Problems encountered with high connection counts

After migrating a single database into multiple shards, we found that ProxySQL was hardly multiplexing connections at all, and with every shard adding more connections this wasn’t a scalable solution. The number of ingress connections is about 600 per shard, while the number of egress connections stuck around 400 per shard. That means the ratio of egress to ingress connections is about 66%, which is not a good sign as ProxySQL is supposed to be multiplexing. A good and effective ratio would be more along the lines of 5%. We were certain ProxySQL was multiplexing on other hostgroups before, as the ratio was far more favorable there.
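
The ratio itself came from our monitoring, but it can also be checked on the ProxySQL admin interface (port 6032 by default); a sketch of the kind of queries involved:

-- frontend vs backend connection counters
SELECT * FROM stats_mysql_global
WHERE variable_name IN ('Client_Connections_connected', 'Server_Connections_connected');

-- per-backend connection pool usage
SELECT hostgroup, srv_host, ConnUsed, ConnFree, ConnOK
FROM stats_mysql_connection_pool
ORDER BY hostgroup, srv_host;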

Incoming and outgoing ProxySQL connections are ridiculously high

Why is this a big issue then? For us, having more than 2000 active frontend connections and about 1400 backend connections meant ProxySQL was using a lot of CPU time. On a 4-core machine we noticed our CPUs were over 70% busy all the time, which wouldn’t allow us to lose a single ProxySQL node anymore. To stay within safe margins we first upgraded all ProxySQL hosts to 8 cores and this kept the CPUs within the region of 40% to 50%. Adding a new shard would increase the incoming connections by 600 and the backend connections by 400. That would only allow us to add maybe one or two shards before we would once again not be able to lose a ProxySQL node, and would have to upgrade once more. Adding more ProxySQL hosts would work, but as we have a multi-layer ProxySQL architecture we need to scale all layers to keep up with the incoming connections on the other layers as well. In other words: our sharded environment just faced a new challenge, as scaling out wasn’t an (easy) option anymore.

If you search for ProxySQL and multiplexing, the top results will be the ProxySQL multiplexing documentation and a few blog posts describing the wonders of ProxySQL multiplexing. The documentation describes a few conditions under which multiplexing is disabled, so naturally we examined whether our application was meeting any of these conditions. We examined the general log, the query digest inside ProxySQL and tcpdumps, and concluded it wasn’t meeting any of these conditions; theoretically multiplexing should have been functioning. But as we had already established from the high incoming and backend connection counts, it clearly wasn’t. Our first dead end.

Investigating ProxySQL bug reports

Then we started to investigate ProxySQL bug reports to see if anything matched there. At first we found issue #1897, where multiplexing was erroneously disabled, and there were a few other solved issues that hinted we should try to upgrade. However, upgrading a production environment without knowing the impact is never favorable, so instead we added a fourth ProxySQL host and sent 1% of the traffic towards it. This allowed us to easily upgrade the host to various ProxySQL versions and see if this would resolve the issue. Unfortunately we found none of the upgrades resolved our multiplexing issues. Another dead end.

We did notice the set names metric in ProxySQL was increasing fast, and this led us to issue #1019, where a multi-layered ProxySQL setup can have issues with set names when the database has a different character set than ProxySQL’s default character set. This was the case for us, but the documentation doesn’t mention that set names influences multiplexing, and the earlier upgrade to 2.0.8 didn’t resolve our issues. At least we found out why the number of set names was increasing. Another dead end.

Then one of our developers pointed me towards ProxySQL issue #2339, in which user wjordan requests a documentation update on temporary conditions under which multiplexing is disabled. He describes the mysql-auto_increment_delay_multiplex and mysql-connection_delay_multiplex_ms variables as missing from that page. I had totally ignored this issue when I searched through the ProxySQL issues as the title contained the word documentation. Insert facepalm here.

The two variables basically are a workaround for issue #1048 (Problem with “LAST_INSERT_ID()” returning 0), issue #1421 (Forward SELECTs on LAST_INSERT_ID) and issue #1452 (select last_insert_id() as ID) and are bundled in issue #1828. Ironically issue #1828 is the source for issue #1897, so I went full circle there! Insert second facepalm here.

How ProxySQL delays multiplexing

So what do these two variables do then? First we have to understand the origin of the problem here: with multiplexing enabled, whenever an insert or update would cause an auto increment column to increase, the LAST_INSERT_ID() query would not be guaranteed to be run on the same connection or right after the insert happened. So either it would be proxied over another connection (returning a wrong identifier or 0) or it would be run out of order due to multiplexing (again returning a wrong identifier or 0). These two variables allow you to have the connection stop multiplexing for X-number of queries or X-number of milliseconds after an insert or update happened on an auto increment column.

But wait! How does ProxySQL detect that an auto increment column increased then? That’s quite simple: on every successful query the MySQL binary protocol returns an OK packet to the client. Whenever an insert or update triggers an auto increment column, this OK packet also contains the last inserted identifier. This is what triggers ProxySQL to stop multiplexing for X number of queries or milliseconds.

But wait! Why did the majority of applications still function correctly before ProxySQL 1.4.14 then? That’s also quite simple: most programming languages use native MySQL drivers. For instance, PHP PDO makes use of mysqlnd, the native MySQL driver for PHP. Just like ProxySQL, mysqlnd reads the OK packet and stores the last inserted identifier internally. So when you use the lastInsertId function in PDO, it retrieves the value that mysqlnd stored internally. In general you can assume database drivers never run a SELECT LAST_INSERT_ID() against the database. However, you should be cautious with some ORMs, like Hibernate, that actually do depend on queries like this.

mysqlnd stores the last insert identifier internally

The default for mysql-auto_increment_delay_multiplex is 5 and for mysql-connection_delay_multiplex_ms it is 0. So whenever ProxySQL encounters an OK packet with the last inserted identifier set, it will disable multiplexing on that connection for 5 consecutive queries. This basically locks the frontend connection to the backend connection for at least 5 queries, to allow an ORM (or application) to run a LAST_INSERT_ID() query on the same connection. New incoming connections will then have to use a different connection from the connection pool. Obviously, whenever ProxySQL encounters another OK packet with a last inserted identifier, it resets the counter to 5 consecutive queries again.

Back to our own problem statement

So why did this happen only to the sharded databases then? The primary key of our main table in the sharded database contains a UUID and not an auto increment. There was, however, an auto increment column present on the table to facilitate a sequence number in our records. We clearly overlooked this in our investigation as well. There are various reasons why we currently have to keep this column, so it wasn’t feasible to remove the column for now.

Also our sharded databases have a read/write ratio of around 3:2. That means 2 out of 5 queries will cause an auto increment to trigger and thus lock the frontend connection to a backend connection for at least 5 consecutive queries! With that ratio we will hardly ever multiplex! Once we changed this variable on our canary ProxySQL we immediately saw a significant drop in backend connections on ProxySQL. First we set mysql-auto_increment_delay_multiplex to 0, which caused all 16 incoming connections to be proxied over an average of 0 connections! When we changed the value to 1 it averaged over 3 connections!
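
For reference, this is roughly how the change can be applied on the ProxySQL admin interface (a sketch; keep in mind it is a global setting):

-- on the ProxySQL admin interface
UPDATE global_variables SET variable_value='0'
WHERE variable_name='mysql-auto_increment_delay_multiplex';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;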

Drop in active backend connections after setting mysql-auto_increment_delay_multiplex to 0

Before applying this to all other ProxySQL hosts, there was one more hurdle: both variables are global, so they apply to all hostgroups. Once we established our codebase never ran a LAST_INSERT_ID() anywhere we changed it on all other ProxySQL hosts. The difference is night and day:

Effect of setting mysql-auto_increment_delay_multiplex to 0 on the first ProxySQL layer

And the CPU usage dropped to a more sane 30%:

CPU usage decreases dramatically after multiplexing works again

Conclusion

There are clearly three problems with the introduction of mysql-auto_increment_delay_multiplex and mysql-connection_delay_multiplex_ms:

  1. Both variables are global variables. If you change them, they will apply to all ProxySQL hostgroups, servers, users and connections
  2. The default of mysql-auto_increment_delay_multiplex is set to 5. This means multiplexing will be disabled or less effective on any write-intensive workload that contains an auto increment column, regardless of whether your application actually uses LAST_INSERT_ID or not.
  3. With the introduction of both variables in version 1.4.14, the documentation on the variables themselves was updated. However, their impact on multiplexing was never documented.

By the time this blog post goes public I have already made a change to the ProxySQL documentation on multiplexing. I’ve also created a feature request for ProxySQL for the ability to control the delay per server or per user. But after I submitted the feature request I realized that this might actually make things worse: even more variables, configurations and query rules to check, and it doesn’t fix the problem at its root. It would be much cleaner for ProxySQL to store the last inserted identifier returned by the OK packet in a variable bound to the frontend connection that made the request. This identifier can then be returned whenever that frontend connection runs a query that contains the LAST_INSERT_ID function.

Putting Galera SST Compression on the benchmark

I mentioned in my Galera crash (un)safe post that it’s bad if an SST is triggered on a large dataset. On a Galera node an SST will simply wipe the contents of the MySQL data directory and copy a snapshot from another node in the cluster. For consistency this is a favorable thing to do, but from an HA perspective it isn’t: one node in the cluster is unavailable and another one is acting as donor. This means that if you have a three-node Galera cluster, only two nodes are available, of which one has “degraded” performance.

This can have quite a high impact: for instance, in the proxysql_galera_checker script only synced nodes are selected to receive traffic. By default donor nodes will not receive any traffic, and now the cluster only has one node left to serve traffic. Don’t worry: if there are no synced nodes available the donor node will still be selected as available, so in case you only have one node left (bootstrapping) it can still serve traffic. This, obviously, isn’t a situation you want to be in.

Partial failure on Galera

Now imagine we have a multi DC Galera cluster, as pictured below:

Multi-DC Galera cluster: nodes 1 to 3 in DC #1 and nodes 4 and 5 in DC #2

A power failure happens in DC #1 and nodes 1 to 3 are all gone. As the nodes in DC #2 no longer form a majority, they shut down out of safety. As I wrote before in the Galera crash (un)safe blog post: a crash-unsafe Galera cluster will force an SST to happen on crashed nodes, and this is what happened to the nodes in DC #1 in our case. The most favorable node to recover from is a cleanly shut down node in DC #2.

The cluster bootstrap automation performing the recovery ran through its configured hosts in sequential order. Unfortunately the first node in the list was an unclean node in DC #1, forcing an SST to happen on this host. If the other host in DC #2 had been chosen, a simple IST would have sufficed and the second node in the cluster would have been brought up within minutes. Instead the automation waited for the SST on the node in DC #1, which delayed bringing up the next node in the automation by 4 hours.

Painful GCache durability

Durability of the GCache on a Galera node is calculated by looking up the first and last item in the ring buffer and then calculating the time between the two. A Galera incremental state transfer (IST) can only happen if the delta of writesets requested by the joining node is still present in the GCache on the donor node. This durability is basically the window of time in which you can perform an IST from this node, and it is one of the metrics to keep an eye on. Our GCache durability was somewhere in the range of 45 minutes.

When binary logging is set to row based replication and binlog_row_image is set to full (default), the data written to the binlog will contain the previous row state and the updated values. The same applies to the GCache ringbuffer: when set to full the entire previous state of the row will be written to the GCache.

On this particular cluster one of the tables looked similar to this:

CREATE TABLE product (
  pid BIGINT NOT NULL AUTO_INCREMENT,
  title varchar(255) NOT NULL,
  description TEXT NOT NULL,
  specifications TEXT NOT NULL,
  stock INT NOT NULL DEFAULT 0,
  price INT NOT NULL DEFAULT 0,
  image_1 VARCHAR(255) NOT NULL,
  image_2 VARCHAR(255) NOT NULL,
  image_3 VARCHAR(255) NOT NULL,
  thumbnail_1 VARCHAR(255) NOT NULL,
  thumbnail_2 VARCHAR(255) NOT NULL,
  thumbnail_3 VARCHAR(255) NOT NULL,
  -- ... another 50+ columns ...
  PRIMARY KEY (pid)
) ENGINE=InnoDB;

You might think an SST isn’t a big deal, but if you have a data size of 1TB it will take about 3 hours to transfer on a 1Gb network or, if you’re lucky, 21 minutes on a 10Gb network. The connection between DC #1 and DC #2 was a VPN that was limited in bandwidth (about 400Mbps), and moving the 400GB snapshot between node 4 and node 1 took about 4 hours to complete.

As the bootstrapped node was online according to the ProxySQL script, it allowed writes to the cluster while being a donor node. The mass-update jobs (stock and pricing sync) started to run once again, altering all sorts of data inside the cluster. This means an SST taking 4 hours to complete made IST impossible between the cleanly shut down nodes in DC #2: 4 hours of writes far exceeds the 45 minutes of GCache durability on the donor. As a result, an SST also had to happen inside DC #2, between nodes 4 and 5. Luckily the local network in DC #2 is a bit faster than the VPN, but it was a totally unnecessary recovery.

Recovery improvements

If we look at the outage, the cluster was only unavailable for a few minutes thanks to ProxySQL sending all traffic to the single remaining node. But according to the definition of high availability, a single node is not HA. So if we take into account that we need at least two nodes to be available, recovery only happens after the first SST has completed. This means our Mean Time To Recover Service (MTTRS) on Galera was at least 4 hours!

Timeline of the Galera cluster recovery (MTTRS)

But as explained earlier: the ProxySQL script will not allow traffic to be sent to a donor unless there is no other node available. This means that right after node 1 has been recovered, the ProxySQL script removes the bootstrapped node 4 from the available nodes and we’re back to single-node availability. Technically the MTTRS should then be extended until after node 2 has been recovered, but we’ll keep that discussion out of scope for now.

In the post mortem of this incident we identified a couple of improvements we could make to speed up the recovery of a Galera cluster:

  1. Faster network (between DC #1 and DC #2)
  2. Make GCache bigger
  3. Increase GCache durability
  4. Make the transfer smaller

The first option was not feasible as it would increase cost dramatically, so we focused on options 2 and 3 first. The GCache was already set to 4GB, which is quite large. So we simply changed the binlog_row_image parameter from full to minimal. Since most modifications per row were only a few bytes, this increased the durability from 45 minutes to somewhere between 4 and 5 days! I felt a lot more confident that unnecessary SSTs wouldn’t happen anymore.
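
The change itself is a single line in the MySQL configuration (a sketch; before switching, verify that nothing downstream, such as CDC tooling or point-in-time recovery procedures, relies on full row images):

[mysqld]
# log only the changed columns (plus what is needed to identify the row)
# instead of the full before/after row image
binlog_row_image = minimal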

How to make a Galera State Snapshot Transfer smaller?

Now the fourth option proved to be a bit more challenging: how do you decrease a data size of 400GB? We could just drop a couple of the tables larger than 50GB, but I think the product managers would be very angry with me. Another option would be to enable table compression, but I’d rather not go through the slow migration process on tables larger than 50GB with distributed storage on a hyperconverged infrastructure.

Another option is to compress during the creation of the xtrabackup snapshot and this is done for example by using the following configuration:

[sst]
compressor="gzip"
decompressor="gzip -dc"

This would probably save us nearly half of the bytes to send over the line, and every byte saved means we can send the SST faster, right? Naturally we wanted to test this and benchmark how much faster we would be able to move data between the two locations compared to our current uncompressed SST.
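
Swapping in another compressor only means changing the [sst] section; pigz, for example, accepts the same flags as gzip (a sketch, assuming the binary is installed on both donor and joiner):

[sst]
compressor="pigz"
decompressor="pigz -dc"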

Benchmarking State Snapshot Transfers

We selected the following methods:

  1. Uncompressed
  2. Gzip
  3. Pigz (normal and fast)
  4. LZ4
  5. Snappy
  6. Qpress (LZH built in xtrabackup)

The method for testing was to limit the bandwidth between two nodes in our Galera test cluster (4 cores and 8GB of memory) to 300Mbit, force an SST to happen on one node and then measure the time between the start of the SST and the moment the node rejoined the cluster. The number of bytes transferred was measured on the network interface, and we re-ran each method three times to get a reliable outcome.
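
A rough sketch of how a single run could be measured on the joining node; the interface name and the exact tc parameters are illustrative:

# throttle the interface to 300Mbit
tc qdisc add dev eth0 root tbf rate 300mbit burst 32kbit latency 400ms

RX_BEFORE=$(cat /sys/class/net/eth0/statistics/rx_bytes)
START=$(date +%s)
# ... force the SST and wait until the node reports Synced again ...
END=$(date +%s)
RX_AFTER=$(cat /sys/class/net/eth0/statistics/rx_bytes)
echo "duration: $((END - START))s, bytes received: $((RX_AFTER - RX_BEFORE))"

# remove the bandwidth limit again
tc qdisc del dev eth0 root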

First of all the throughput tells us already a couple of things:

Galera SST benchmark – network throughput

LZ4 and Snappy seem to be able to almost fully utilize the bandwidth of our test system, which is good.

We can see the throughput of gzip and pigz is very low. Gzip is renowned for being slow in the compression phase, but pigz (parallel gzip) is supposed to tackle this by compressing chunks of data in parallel. On the test machine, the graphs of the gzip run showed it was actually limited by a single core being fully utilized:

Benchmark SST donor GZIP (second set of peaks)

The peaks before the gzip runs are the uncompressed runs. Throughput really is dwarfed by the uncompressed run.

You may think pigz would do any better, but alas, even in fast compression mode it’s not much better:

Benchmark SST donor PIGZ fast compression (last three peaks)

Yes, it now compresses on all four cores, but the throughput is only twice as high.

So what about the total amount of bytes transferred?

Galera SST benchmark – total bytes transferred

Gzip and pigz are truly king (and queen) here: they save almost 60% of total bytes transferred! This is an amazing achievement and it saves a lot of network bandwidth. LZ4 and Snappy do a much worse job at 25% and 32% respectively.

Now the most important graph, the total duration that we were after:

Galera SST benchmark – total duration

LZ4 and Snappy are clear winners here by saving us 15% and 11% in duration of the SST.

Conclusion

We applied Snappy to our production Galera cluster. As our production servers were slightly different we were able to save about 20% on the duration of the state snapshot transfer.

I think the outcome would also be a bit more favorable for pigz on these production servers as they ran on 8 cores instead of only 4 cores. However I fear for the responsiveness of a bootstrapped server utilizing all 8 cores to the maximum for a few hours. Snappy and LZ4 may compress a bit less, but they are much “friendlier” to your donor.

And the final advice I can give you is to keep an eye on the GCache durability as this may be a life saver one day!

Want more?

This is one of the background stories of my talk Benchmarking Should Never Be Optional on Wednesday 2nd of October 2019 at Percona Live Europe in Amsterdam.

See me speak at Percona Live Europe 2019

In my talk I will feature a few cases why you should always benchmark your systems up front. It’s not only about database benchmarking, but in some cases even the entire system that requires benchmarking.

Hyperconverging and Galera cluster

What is hyperconverging?

Hyperconverging is the latest hype: do things more efficiently with the resources that you have by cramming as many virtual machines as possible onto the same hypervisors. In theory this should allow you to mix and match various workloads to make optimum use of your hypervisor (e.g. all cores used 100% of the time, overbooking your memory up to 200%, moving virtuals around like there is no tomorrow). Every cloud provider is hyperconverging their infrastructure, and this has pros and cons. The pro is that it’s much cheaper to run many different workloads; the con clearly shows when you encounter noisy neighbors. As Jeremy Cole said: “We are utilizing our virtual machines to the max. If you are on the same hypervisor as us, sorry!”

Apart from cloud providers, you could hyperconverge your infrastructure yourself. There are a few hardware/software vendors out there that will help you with that and at one of my previous employers we got a helping hand from one such vendor!

DIY hyperconverging

In our case the entire infrastructure was migrated to a new hyperconverged infrastructure where we would have multiple infrastructure clusters (read: four hypervisors in one chassis) in multiple data centers. Infra marked one of these DCs as suitable for our customer-facing projects, as the peering was performed in that DC. The idea behind this new infrastructure is that a VM can basically run anywhere in your infrastructure and be copied in realtime to another hypervisor within the same cluster (read: chassis). This copy process (including memory) obviously required some (short) locking, but it worked amazingly well. We even had some software running that would move VMs around to optimize the workloads and still retain some spare capacity. Magic!

Now there was an additional benefit to choosing this vendor: if a hypervisor went down, the same VM could be spun up immediately on another hypervisor, albeit without copying the memory contents. To be able to do this, the disks are synced to at least one other hypervisor. This means some cluster magic detects that one of the hypervisors is down and automagically spins up the same VMs on another (available) hypervisor that contains the latest data of those VMs. To spread the load among the various hypervisors, the replication factor of the disks is advised to be set to 2, which means every write is copied to (at least) two other hypervisors.

Hyperconverging Galera

Our Galera cluster consisted of three Galera nodes with three asynchronous read replicas attached (see image below).

Galera cluster with read slaves

In this picture every Galera node stores every transaction in the GCache, InnoDB flushes the transaction to disk (ibdata*) and asynchronous replication dictates another write to the binlogs. That means that every transaction in our Galera node will already be stored three times on disk.

The hyperconverged cluster where we hosted Galera had the replication factor set to 2. That means every byte written to disk will be written to at least two other storage controllers (VMs), as shown in the image below. This write operation over the network is synchronous, so the filesystem has to wait until both controllers have acknowledged the write. Latency of a single write is negligible as the write is super fast and performed over a low-latency network. The magic behind this synchronous disk replication is out of scope for this blog post, but I can hint that a certain NoSQL database (named after some Greek mythology) is managing the storage layer.

Hyperconverged write amplification: every write to disk will be written three times!

This means that every write to disk in our Galera node will also be synced to an additional two hypervisors. To make matters worse, due to the (semi-)synchronous replication, all three Galera nodes perform the exact same operations at (almost) the exact same time!

1 transaction = 3 Galera nodes × (3 local writes + 6 network writes) = 9 local writes + 18 network writes = 27 writes in total

As you can see from the simple formula above: 9 writes are performed locally and 18 writes are performed over the network. As every write to disk is replicated synchronously over the network, a single transaction adds more than negligible latency when it spawns 18 network writes at the same time. If 1 transaction to Galera can cause 18 synchronous writes over the network, imagine what latency you will encounter with a baseline of 200 transactions per second (that is 3,600 synchronous network writes per second)! And we’re not even counting the asynchronous replicas performing similar write operations again mere (milli)seconds later!

Galera managed to cope, but instability happened at set intervals. We could trace these back to our so-called stock updates and pricing updates: every half hour stock levels were pushed from the warehouse database, and every few hours new pricing information was pushed via the enterprise service bus.

With more than a million products in the database, these torrents of writes quickly caused disk latency in the entire hyperconverged cluster, and we have seen the disk latency shoot up well beyond 80ms. This no longer affected just the Galera cluster; it was causing cluster-wide issues on the distributed storage layer as well. And to make matters even worse: latency on the entire network was also shooting up.

Benchmarking semi-synchronously replicated hyperconverged clusters

At first nobody believed us, even when we showed the graphs to the vendor. This new infrastructure was so much more expensive than our old one that it simply couldn’t be true. Only after conducting benchmarks, reproducing the latency on an empty test cluster, were we taken seriously. The benchmarks revealed that the write amplification saturated the network interfaces of the cluster, and we worked with the vendor on seeking a solution to the problem. Even after upgrading the network (10G interface bonding, enabling jumbo frames, hypervisor tuning) we still found latency issues.

The issue with our hyperconverged cluster was that there was no (separate) internal network handling the inter-hypervisor traffic. Of course we could now achieve double the number of transactions, but that didn’t solve the underlying issue of causing latency on other VMs and on the ingress and egress network traffic of our applications.

Conclusion

We came to the conclusion that (semi-)synchronously replicated databases and hyperconverged infrastructures with high replication factors don’t match. Unfortunately this replication factor could only be set at cluster level and not at the level of an individual VM. Also, the reasoning behind the synchronous disk replication did not make sense for us (see also my previous blog post), as Galera would wipe the disk contents anyway and in general it would take quite some time for the database to recover, so a quick failover would not happen anyway. That’s why we ran Galera+ProxySQL in the first place: to allow a failover to happen within seconds!

We also ran other (semi-)synchronously replicated databases, for example MongoDB, SOLR and Elasticsearch, and each and every one of them had basically the same lack of need for disk replication.

The only option left was to migrate the Galera cluster back to our old hardware that, luckily/sadly, was still switched on. At the same time we started working on a migration to a real cloud vendor as they could offer us better performance without the risk of a single point of failure (e.g. single data center).

So what difference would a benchmark up front have made?

This only happened due to bad requirements, drawn up without analyzing the workload that was supposed to be converged. We would have seen these issues before migrating to the new hyperconverged infrastructure if we had benchmarked beforehand. It would have saved us from many instabilities, outages and post mortems. We might even have chosen a totally different setup, or have chosen to split our workloads over multiple (smaller) hyperconverged clusters.

This is one of the background stories of my talk Benchmarking Should Never Be Optional on Wednesday 2nd of October 2019 at Percona Live Europe in Amsterdam.

See me speak at Percona Live Europe 2019

In my talk I will feature a few cases why you should always benchmark your systems up front. It’s not only about database benchmarking, but in some cases even the entire system that requires benchmarking.

Is Galera trx_commit=2 and sync_binlog=0 evil?

It has been almost 5 years since I posted on my personal MySQL-related blog. In the past few years I worked for Severalnines, and blogging both on their corporate blog and here would have been confusing. After that I forgot about and neglected this blog a bit, but it’s time to revive it!

Speaking at Percona Live Europe – Amsterdam 2019

Why? I will be presenting at Percona Live Europe soon, and this blog and its upcoming content are the more in-depth part of some background stories in my talk on benchmarking: Benchmarking should never be optional. The talk will mainly cover why you should always benchmark your servers, clusters and entire systems.

See me speak at Percona Live Europe 2019

If you wish to see me present, you can receive 20% discount using this code: CMESPEAK-ART. Now let’s move on to the real content of this post!

Innodb_flush_log_at_trx_commit=2 and sync_binlog=0

At one of my previous employers we ran a Galera cluster of 3 nodes to store all shopping carts of their webshop. Any cart operation (adding a product to the basket, removing a product from the basket or increasing/decreasing the number of items) would end up as a database transaction. With such important information stored in this database, in a traditional MySQL asynchronous replication setup it would be essential to ensure all transactions are retained at all times. To be fully ACID compliant the master would have innodb_flush_log_at_trx_commit set to 1 and sync_binlog set to 1 (correction by Przemek Malkowski) to ensure every transaction is written to the logs and flushed to disk. When every transaction has to wait for data to be written to the logs and flushed to disk, this limits the number of cart operations you can do.
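
In my.cnf terms, the difference between the two combinations discussed here looks roughly like this (a sketch for illustration):

[mysqld]
# fully durable (ACID) settings: flush and sync on every commit
innodb_flush_log_at_trx_commit = 1
sync_binlog = 1

# the "tuned" combination found on Stack Overflow: much faster,
# but an OS crash can lose up to roughly a second of transactions
# innodb_flush_log_at_trx_commit = 2
# sync_binlog = 0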

Somewhere in a dark past the company exceeded the number of cart operations this host could handle, and one of the engineers found a Stack Overflow post explaining how to improve MySQL performance by “tuning” the combination of these two variables. Naturally this solved the immediate capacity problem, but sacrificed durability at the same time. As Jean-François Gagné pointed out in a blog post, you can lose transactions in MySQL when you suffer an OS crash. This was inevitably going to happen some day, and when that day arrived a new solution had become available: Galera!
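The “tuned” combination would have looked something like this (again a sketch, not the exact change made back then; the trade-off is that up to roughly a second of committed transactions can be lost on an OS crash):

[mysqld]
# Write the redo log at commit, but only flush it to disk about once per second
innodb_flush_log_at_trx_commit = 2
# Let the operating system decide when to sync the binary log
sync_binlog                    = 0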

Galera and being crash-unsafe

Galera offers semi-synchronous replication to ensure your transaction has been committed on the other nodes in the cluster. You just spread your cluster over your entire infrastructure on multiple hosts in multiple racks. When a node crashes it will recover when rejoining and Galera will fix itself, right?

Why would you care about crash-unsafe situations?

The answer is a bit more complicated than a yes or a no. When an OS crash happens (or a kill -9), InnoDB can be ahead of the data written to the binary logs. But Galera doesn’t use binary logs by default, right? No it doesn’t, but it uses the GCache instead: this file stores all committed transactions in a ring buffer, so it acts similarly to the binary logs and is affected similarly by these two variables. Also, if you have asynchronous slaves attached to Galera nodes, the node will write to both the GCache and the binary logs simultaneously. In other words: you could create a transaction gap on a crash-unsafe Galera node.

However, Galera keeps the state of the last UUID and sequence number in the grastate.dat file in the MySQL data directory. While Galera is running this file contains seqno: -1, and only upon a normal shutdown is the grastate.dat written with the actual position. So when Galera reads the grastate.dat on startup and finds seqno: -1, it will assume an unclean shutdown happened, and if the node is joining an existing cluster (becoming part of the primary component) it will force a State Snapshot Transfer (SST) from a donor. This wipes all data on the broken node, copies over the donor’s data and makes sure the joining node ends up with the same dataset.
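For illustration, a grastate.dat after an unclean shutdown looks roughly like this (the UUID is taken from the recovery output below; the exact fields depend on your Galera version):

# GALERA saved state
version: 2.1
uuid:    8bcf4a34-aedb-14e5-bcc3-d3e36277729f
seqno:   -1
safe_to_bootstrap: 0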

Apart from the fact that an unclean shutdown always triggers an SST (bad if your dataset is large, but more on that in a future post), Galera pretty much recovers itself and isn’t affected that much by being crash-unsafe. So what’s the problem?

It’s not a problem until all nodes crash at the same time.

Full Galera cluster crash

Suppose all nodes crash at the same time: none of them would have been shut down properly and all of them would have seqno: -1 in their grastate.dat. In this case a full cluster recovery has to be performed, where MySQL is started with the --wsrep-recover option. This opens the InnoDB header files, shuts down immediately and returns the last known state for that particular node.

$ mysqld --wsrep-recover
...
2019-09-09 13:22:27 36311 [Note] InnoDB: Database was not shutdown normally!
2019-09-09 13:22:27 36311 [Note] InnoDB: Starting crash recovery.
...
2019-09-09 13:22:28 36311 [Note] WSREP: Recovered position: 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114428
...

Now we have three independent Galera nodes that each suffered an unclean shutdown. This means all three may have lost transactions from up to one second before the crash. Even though the transactions committed within the cluster are theoretically the same, as the cluster crashed at the same moment in time, this doesn’t mean all three nodes have the same number of transactions flushed to disk. Most probably all three nodes recover a different last sequence number, and even within that there could be gaps as transactions are applied in parallel. Are we back to eeny-meeny-miny-moe and just pick one of these nodes?
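In practice you would compare the recovered positions of all three nodes and bootstrap the cluster from the most advanced one, for example (a sketch with hypothetical sequence numbers; exact commands and paths depend on your distribution and Galera version, assuming the default datadir /var/lib/mysql):

# Recovered positions after running mysqld --wsrep-recover on each node (hypothetical)
# node1: 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114428
# node2: 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114431   <- most advanced, bootstrap this one
# node3: 8bcf4a34-aedb-14e5-bcc3-d3e36277729f:114430

# On the most advanced node only:
sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat
mysqld --wsrep-new-cluster &
# The other two nodes are then started normally and rejoin via SST

But as the rest of this post argues, picking the most advanced node still doesn’t guarantee you recovered every transaction the cluster acknowledged.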

Can we consider Galera with trx_commit=2 and sync_binlog=0 to be evil?

Yes and no… Yes, because we have potentially lost a few transactions, so it’s bad for consistency. No, because the entire cart functionality became unavailable and carts were abandoned in all sorts of states anyway. As the entire cluster crashed, customers couldn’t perform any actions on their carts and had to wait until service had been restored. Even if a customer had just finished a payment, the next step in the cart could not have been saved due to the unavailability of the database. This means carts were abandoned and some of them may actually have been paid for. Even without the lost transactions we would need to recover these carts and payments manually.

So to be honest: I think it doesn’t matter that much, as long as you handle cases like this properly. If you design your application right, you catch the (database) error after returning from the payment screen and create a ticket for customer support to pick this up. Even better would be to trigger a circuit breaker and ensure your customers can’t re-use their carts after the database has been recovered. Another approach would be to scavenge data from various sources and double-check the integrity of your system.

The background story

Now why is this background to my talk, when it doesn’t have anything to do with benchmarking? The actual story in my presentation is about a particular problem with hyperconverging an (existing) infrastructure. A hyperconverged infrastructure syncs every write to disk to at least one other hypervisor (over the network), to ensure that if a hypervisor dies you can quickly spin up a new node on a different hypervisor. As we have learned above: the data on a crashed Galera node is unrecoverable and will be deleted during the joining process (SST). This means it’s useless to sync Galera data to another hypervisor in a hyperconverged infrastructure. And guess what the risk is if you hyperconverge your entire infrastructure into a single rack? 😆

I’ll write more about the issues with Galera on a hyperconverged infrastructure in the next post!

Presenting Sphinx Search at Percona Live London

It has been a while since I posted on this blog. The more in-depth articles are all posted on the Spil Games Engineering blog, and I have been so overloaded with work in the past months that I hardly have any time left.

One of the reasons for having too much on my plate was my own doing: I volunteered to be a conference committee member for Percona Live London and we, as a committee, worked our way through all proposals while you all were enjoying your holidays in the sun. Well, I must admit I did review a couple of them sitting in the sun, next to the pool enjoying a cold drink. 😉
I must say there were so many great proposals it was really hard to choose which ones would be left out.

I also proposed a talk for Percona Live London this year, and my fellow committee members liked it enough for it to make the final schedule: Serve out any page with an HA Sphinx environment.
In essence it is a MySQL case study in which I will show how we at Spil Games use Sphinx Search in three different use cases: our main search, friend search and serving out our content on any page. I’ll also describe how we handle high availability and index creation, and show benchmark results comparing MySQL and Sphinx in various use cases.

In case you are interested in Sphinx or the benchmark results: the session will be on the 4th of November at 3:10pm – 4:00pm in Cromwell 1 & 2.
Also don’t hesitate to ask me things when I’m wandering around in the corridors and rooms. Or maybe we’ll meet at the MySQL Community Dinner?
See you next week!

Presenting at FOSDEM and Percona Live

Very short update from my side: I’ll be presenting at FOSDEM in Brussels (1-2 February 2014) and Percona Live MySQL Conference in Santa Clara (1-4 April 2014).

FOSDEM
At FOSDEM I will present Galera replication in real life, which centers around two use cases for Galera: adding new clusters to our sharded environment and migrating existing clusters into a new Galera cluster.

Percona Live MySQL Conference and Expo, April 1-4, 2014
At the Percona Live MySQL Conference I will present our globally distributed storage layers. Next to our globally sharded environment we have built a new environment called ROAR (Read Often, Alter Rarely) that also needs to be distributed globally.

Both are interesting talks and I really look forward to presenting at these great conferences. So if you have the ways and means to attend either one: you should!