Changing the xbcloud filesize | xtrabackup streaming to cloud storage

When Percona released Percona XtraBackup 2.4.14 on May 1st, 2019, they included a very important feature: streaming a backup to S3 and GCS using the xbcloud binary! My team implemented streaming to shorten the total time for taking our backups. We found that streaming our encrypted backups to GCS was faster than writing them locally, but that is a subject for another blog post. The matter at hand in today's post is that we found xbcloud was creating an enormous number of small files. By default, xtrabackup creates files of 10MB, and no matter what we tried, we couldn't get the xbcloud binary to make larger files. In the end we managed to solve it with the (public) help of Percona!

TL;DR

The issue is that you can set --read-buffer-size to change the size of the files created. However, if you also encrypt your backup, this change gets lost. You then also need to set --encrypt-chunk-size to the same value as --read-buffer-size, and only then will the larger file size actually be used.
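To make the TL;DR concrete, here is a rough sketch of what such a backup command could look like. This is not our exact command: the paths, key file, bucket name and the 200M value are placeholders, the xbcloud authentication options are omitted, and you should check the Percona documentation for your version.

```bash
# The essential part is that --read-buffer-size and --encrypt-chunk-size are set
# to the same value; with only --read-buffer-size an encrypted stream still ends
# up as 10MB objects in the bucket.
xtrabackup --backup --stream=xbstream --target-dir=/tmp/backup \
    --encrypt=AES256 --encrypt-key-file=/etc/mysql/backup.key \
    --read-buffer-size=200M --encrypt-chunk-size=200M \
  | xbcloud put --storage=google --google-bucket=my-backup-bucket \
      --parallel=7 "my-cluster-$(date -I)-full"
```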

The xbcloud problem

The problem lies in how a streaming backup is constructed. xtrabackup handles the MySQL part of the backup and pipes the stream through the xbstream and xbcloud binaries. The xbcloud binary then streams the data to the backup bucket in 10MB chunks: every time your backup stream reaches 10MB, a new 10MB file is created in the bucket. The reasoning behind this is that if you hit a network error, only that chunk is lost and xbcloud can re-upload the failed 10MB file instead of the entire backup. xbcloud retries a failed upload up to 3 times.
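You can see this chunking for yourself by listing a backup in the bucket, for instance with gsutil (the bucket and backup names below are placeholders):

```bash
# With the defaults every object is roughly 10MB (10485760 bytes); counting the
# objects shows how many chunks a single backup consists of.
gsutil ls -l gs://my-backup-bucket/my-cluster-2022-02-15-full/ | head -n 5
gsutil ls gs://my-backup-bucket/my-cluster-2022-02-15-full/ | wc -l
```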

The issue we found is that large backups (8TB of data or larger) create many, many, many files. Browsing through the GCS bucket shows that Google doesn't even wish to estimate how many files there are unless you go to the final page. A quick calculation tells me that an uncompressed and encrypted 1TB table ends up as at least 100,000 files. Unfortunately there are a couple of downsides attached to this.

Many many many small 10MB files are uploaded. Even GCS doesn't dare to say how many...

Reliability of storage

When we look at the cloud, we assume everything is stable and networking is always there. Google assumes GCS is reliable and only publishes numbers on the uptime of GCS (99.9%, so three nines). Let's take this SLA and assume a failure rate of 1 out of 1,000 attempts to store a file in GCS. As xbcloud makes 3 attempts per chunk, in theory the chance of 3 consecutive errors on the same chunk is 1 in 1,000,000,000. But once we get into the big numbers, say 800,000 files to upload, the expected number of chunks that fail all 3 attempts becomes 800,000 × 10⁻⁹ = 0.0008. That means on our total batch we have a likelihood of roughly 0.08% of a permanently failed file store. That doesn't sound too bad, does it?
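As a back-of-the-envelope illustration of these numbers (the 1-in-1,000 failure rate is an assumption for the sake of argument, not a published figure):

```bash
# p = chance a single chunk upload fails, n = number of chunks,
# 3 upload attempts per chunk, as in the calculation above.
awk 'BEGIN {
  p = 1/1000; n = 800000; attempts = 3;
  p_perm = p ^ attempts;                  # chunk fails all attempts: 1e-9
  printf "expected permanently failed chunks: %.4f\n", n * p_perm;
  printf "chance of at least one failed chunk (backup fails): %.4f%%\n",
         (1 - (1 - p_perm) ^ n) * 100;
}'
```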

But it is far more likely that consecutive errors happen, as there is a reason why the file store failed in the first place. It could be that a queue is full or that the pod running the API that accepts the request is failing. So the likelihood of a file storage failure is far bigger than the number I just made up in the previous paragraph. Also, the larger the total number of files gets, the more likely you are to end up with an error on storing a file. This error is then propagated upwards by xbcloud to xtrabackup, which in turn makes the entire backup fail. Retaking a backup of 8TB with a duration of 6 hours isn't going to be my definition of fun.

And this is exactly what we encountered. Large backups tended to fail more often and even our automated retry mechanism on the backup would fail to produce a backup from time to time. With the help of support we have improved things by adding more retries for the file uploads. This lowered the chance of failed file stores, but it could still happen. So if we can lower the number of files we upload, we have less chance of a failed backup.

Cost of file operations

With cloud providers you pay mainly for the storage used. However, people tend to forget that you also have to pay for file operations. For instance, with GCS uploading a file is a Class A operation, which costs $0.05 per 10,000 operations for Standard storage and $0.10 per 10,000 operations for Nearline and Coldline storage; retrieving a file is a Class B operation, which costs $0.004 per 10,000 operations on Standard storage. When we create a backup, we first store it in a temporary bucket on Standard storage. We then stage the backup (test and verify) before moving it to a protected bucket that retains it for 30 days on Nearline. That means that for a backup of 1TB (100,000 files), we actually do 100,000 Class A operations (store) and 100,000 Class B operations (get the files to verify) on Standard storage and 100,000 Class A operations (moving to the protected bucket) on Nearline storage. The costs for these operations alone are $0.50, $0.04 and $1.00 per backup respectively, so the total operation cost for a 1TB backup is $1.54.
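The same calculation as a quick sketch, using the list prices quoted above:

```bash
# Back-of-the-envelope cost of the bucket operations for a 1TB backup split into
# 100,000 chunks (prices in USD per 10,000 operations, as quoted above).
awk 'BEGIN {
  chunks = 100000;
  class_a_standard = 0.05  / 10000;   # upload to the temporary Standard bucket
  class_b_standard = 0.004 / 10000;   # reads during verification
  class_a_nearline = 0.10  / 10000;   # move into the protected Nearline bucket
  total = chunks * (class_a_standard + class_b_standard + class_a_nearline);
  printf "operation cost per 1TB backup: $%.2f\n", total;   # ~ $1.54
}'
```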

The cost of storing 1TB of data for a month in Nearline storage is $10.00, which means that $1.54 for bucket operations makes up a significant portion (about 13%) of the total cost of the backup. So if we can create larger files (chunks), we can lower the total number of files and with that the total number of operations on these buckets.

Xtrabackup backup duration

Another potential issue of uploading 100,000 files is the extra overhead of transferring every individual file. The xbcloud binary streams the files to the cloud provider one by one, so it has to wait for every file to complete, and 10MB files are relatively small and quick to upload. Transfers happen over HTTPS, so there is SSL overhead as well. You can make xbcloud upload files in parallel with the --parallel parameter; we configured 7 parallel streams and this makes it a bit better. However, each of our 7 parallel streams still waits individually for its files to complete during the transfer. We raised this concern, but were told that file upload benchmarks showed no added benefit from uploading files larger than the default size. To me this felt very counterintuitive.

Why do we want a shorter backup duration? Remember that we also stage our backups to test and verify them? This process takes time as well, and we'd rather have the backups in our verified, protected backup bucket as quickly as possible. If a backup takes 8 hours to create, our backup verification process also takes at least 8 hours. This means we have a usable backup no sooner than 16 hours after we initiated it, and we then have to catch up 16 hours of transactions to get a recovered host back to 0 seconds of replication lag. If we can reduce the backup time to 4 hours, we only have to catch up 12 hours of transactions: a 25% reduction in our time to recover for this specific database.

Fixing the file size problem

Naturally we reached out to our support contacts, but even with their help we couldn't make it work. We knew it had to do with the --read-buffer-size parameter of xtrabackup, but increasing this to 100MB still produced 10MB files inside the bucket. We suspected the xbcloud binary creating the files was to blame, perhaps because it contained a hard-coded value. Our ticket remained open for over six months without a proper resolution.

We were also aware of a bug report (PXB-1877) on the public bug tracker of Percona that described our exact problem. We actively contributed and commented on this bug as well. In late December 2021 the bug was bumped once more and, as you can read in the bug report, the same results were shared there too. In mid February it was bumped again by the same person, as he was having similar issues to ours: high cost and failed backups.

This time Marcelo Altmann responded once again that --read-buffer-size could be used to control this. We responded that we had already tried that, but failed to see any result. This time, however, Marcelo asked us to share the full command. When we shared it he noticed we were using encryption via xtrabackup as well, and told us to also set --encrypt-chunk-size to the same value as --read-buffer-size. This fixed our issue and we've been happy ever after!

Conclusion

What probably caused the problem in the first place is that xtrabackup also pipes the stream through the xbcrypt binary, and this breaks up the file size set with --read-buffer-size. We had switched from external encryption to xtrabackup's built-in encryption some time before we started streaming backups to a bucket. This behavior is also not documented anywhere by Percona.

We now run with a file size of 200MB, which means a 20x reduction in the number of files. All of a sudden Google finds it a lot easier to show file count estimates for our buckets. We also see a clear improvement on large database clusters: backup duration dropped from 18,000 seconds to 13,000 seconds.

Backup duration on a huge database dropped from 18K seconds to 13K seconds after setting both encrypt chunk size and read buffer size

I do have to add that on small database clusters (smaller than 1TB) we do see an increased backup duration. The example below shows this; it is about 10 minutes on an 18 minute backup.

Backup duration on a small database increased from 1K seconds to 1.3K seconds after setting both encrypt chunk size and read buffer size

We can't see the improvement in cost effectiveness yet, as our buckets are inside a shared project. However, as we have achieved a 20x reduction in the number of files, we expect the cost of bucket operations to become negligible compared to the storage costs.

What surprised us most were the numerous claims that there wouldn't be any benefit from increasing the file size. Perhaps the tests conducted by Percona were performed on hosts with between 100GB and 2TB of data. In that range you probably see no real benefit from increasing these buffers, especially not to very large sizes.

One word of caution: don't increase these two buffers too much. We toyed with various sizes, and when we reached 1GB it caused the xtrabackup chain to try to allocate an extra 2GB of memory for the buffers. Naturally the OOM killer dropped by to say hi. Not recommended.

10 reasons why MySQL Query Cache should be switched off

This post is not meant as a rant, so I hope it will not be interpreted that way. A few weeks ago I was asked why we decided to switch off the MySQL Query Cache on all our production servers. Couldn't we just enable it to squeeze some extra performance out of those expensive 16- and 32-core machines? A question like this would make most MySQL database engineers cringe, but of course we know what's wrong with the MySQL Query Cache. I could explain this perfectly well, but I knew it was going to be a lengthy write-up. So I looked for an existing source that could explain it to anyone.

And to be honest: I couldn't find a clear (blog) article that explains this to a person who isn't a database engineer. Even worse: I could find many articles describing how easy it is to enable it for performance reasons, including a tutorial by DigitalOcean! Now if DigitalOcean is telling me how to improve performance, why should I doubt their words?

Time to clear things up once and for all by giving 10 reasons why you should switch it off! But not before I explain what the MySQL Query Cache actually is!

What is the MySQL Query Cache?

The MySQL Query Cache is nothing more than a slab of reserved memory within the MySQL daemon that stores the results of queries. The next time someone executes exactly the same query, the result is readily available and can be returned instantly. The Query Cache was created in a time when hardware resources like CPU and disk were precious and expensive. This was back when single-CPU (Linux) servers were king and a fast disk meant you probably bought a 10K rpm SCSI disk with a whopping 9GB of storage. To prevent expensive disk access and avoid wasting CPU cycles on ordering and filtering data, the (simple) query cache was invented.

On a high level the MySQL Query Cache works as follows (side note: this is a simplified drawing and obviously not 100% accurate):

[Figure: mysql-query-cache-high-level — simplified flow of a query through the MySQL Query Cache]
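If you want to check whether the Query Cache is even enabled on your own server, and switch it off right away, here is a minimal sketch (assuming a local mysql client with admin credentials; the Query Cache exists up to MySQL 5.7 and was removed in 8.0):

```bash
# Is the query cache enabled, and what is it doing?
mysql -e "SHOW GLOBAL VARIABLES LIKE 'query_cache%'; SHOW GLOBAL STATUS LIKE 'Qcache%';"

# Switch it off at runtime; also set query_cache_type=0 and query_cache_size=0
# in my.cnf so the change survives a restart.
mysql -e "SET GLOBAL query_cache_type = 0; SET GLOBAL query_cache_size = 0;"
```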

Continue reading “10 reasons why MySQL Query Cache should be switched off”

Runaway history list

On one of the clusters at Spil we noticed a sudden increase in the length of the history list and a steep increase in the size of the ibdata file in the MySQL directory.
I posted a bit about this topic earlier for MySQL 5.5, but this cluster is still running 5.1, and unfortunately 5.1 does not have the same configurable options to influence the purging of the undo log…
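For reference, the history list length we were watching can be pulled from the InnoDB status output (assuming a local mysql client with the PROCESS privilege):

```bash
# The TRANSACTIONS section of the InnoDB status report contains the current
# history list length; on a healthy server it hovers around a low, stable value.
mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -i "History list length"
```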

History list

Now I did find a couple of great resources that explain the purge lag problem in detail: Pythian, DimitriK and Marco Tusa.

What it boils down to is that the delay InnoDB imposes on incoming DML is determined by the length of the history list (the purge lag) relative to innodb_max_purge_lag:
delay = ((purge_lag / innodb_max_purge_lag) × 10) − 5 milliseconds.
On 5.5 purging is also influenced by the number of purge threads and the purge batch size. I toyed around with these settings in my earlier post and tuning them helped. However, the only setting I could change on 5.1 is innodb_max_purge_lag, which was already set to 0, meaning no delay is imposed at all. In other words: I could not fiddle around with this. This time it wasn't an upgrade to 5.5 either, so I could not blame that again. 😉

So what was different on this server then? Well, the only difference was that it did have "a bit" of disk utilization: around 80% during peak hours. Since it is not used as a front-end server this does not affect end users, only the (background) job processes that process and store data on this server. However, it could be that due to the IO utilization the purge started to lag behind and built up a history list too large to catch up on with its current configuration.

How did we resolve it then? After I read this quote from Peter Zaitsev on Marco Tusa's post, the solution became clear:

Running Very Long Transaction If you’re running very long transaction, be it even SELECT, Innodb will be unable to purge records for changes which are done after this transaction has started, in default REPEATABLE-READ isolation mode. This means very long transactions are very bad causing a lot of garbage to be accommodated in the database. It is not limited to undo slots. When we’re speaking about Long Transactions the time is a bad measure. Having transaction in read only database open for weeks does no harm, however if database has very high update rate, say 10K+ rows are modified every second even 5 minute transaction may be considered long as it will be enough to accumulate about 3 million of row changes.

The transaction isolation level defaults to REPEATABLE-READ and we favor it on many of our systems, especially because it performs better for us than READ-COMMITTED. However, a server running background storage jobs does not need this isolation level, especially not if it was blocking the purge from being performed!

So in the end, changing the transaction isolation level to READ-COMMITTED did the job for us.
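For completeness, this is roughly how such a change can be applied on MySQL 5.1 (a sketch, not our exact change; note that sessions that are already open keep the isolation level they started with):

```bash
# Change the default isolation level for new connections at runtime; MySQL 5.1
# exposes this as the tx_isolation variable. Note: with binary logging enabled,
# 5.1 requires row-based or mixed binlog format when using READ-COMMITTED.
mysql -e "SET GLOBAL tx_isolation = 'READ-COMMITTED';"

# And persist it across restarts in my.cnf:
#   [mysqld]
#   transaction-isolation = READ-COMMITTED
```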

Some other things: tomorrow my team is attending the MySQL User Group NL, and in three weeks' time I'll be speaking at Percona Live London:
Percona Live London, December 3-4, 2012
So see you there!