5mins of Postgres E29: Postgres lock conflicts on Amazon Aurora & tuning max_standby_streaming_delay and hot_standby_feedback

Today, we're talking about Postgres lock conflicts on Amazon Aurora and tuning max_standby_streaming_delay and hot_standby_feedback: why you should implement retry logic with Aurora, and how you can avoid query conflicts on Postgres with Amazon Aurora's fast clones. We'll also look at an effective setup for these settings from the Twitch team.





Transcript

Let's have a look.

In this blog post by Wanda and Avi from the AWS team, they talk about how Amazon Aurora differs from regular RDS Postgres and describe how you have to watch out for lock conflicts that can cause query cancellations on the readers.

The difference between Amazon Aurora and Amazon RDS

To start with, they describe the architectural differences between Amazon Aurora and RDS. In a Multi-AZ setup on Amazon RDS, each node has its own independent storage. On Amazon Aurora, on the other hand, each node still has its own cache, but the actual data on disk is shared between the nodes. This shared storage is also what enables Amazon Aurora to launch new readers quickly.

The downsides of Amazon Aurora’s data sharing between nodes

Wanda and Avi then talk about the downside of this architecture, which has to do with handling conflicts between the different states of these nodes.

For example, imagine you run a DML operation on the primary. This operation generates transaction log records (WAL) in Postgres. These log records have to be sent to all reader nodes to keep their caches in sync. When a reader does not have the affected page in its cache, it can simply ignore the record: if the page is accessed later, the reader can read it from the shared storage, which is always kept in sync between the nodes.

A query that's currently running, on either the primary or a reader node, can hold locks on pages. In the blog post, they talk about lock conflicts amongst other things. They generate some test data and show a statement that, when run without any conflict, finishes in 72 seconds. No problem.

Then, they produce a conflict on purpose: they start a query on the reader and then drop an index on the primary. Dropping the index causes the reader to cancel its statement.
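The exact test setup is in their blog post; the sketch below (table, index, and error text are illustrative, not copied from the post) shows the general shape of such a conflict on a streaming-replication reader:

```sql
-- On the reader: start a long-running query that reads the table
SELECT count(*) FROM orders WHERE customer_id = 42;

-- On the primary, while the reader query is still running:
DROP INDEX orders_customer_id_idx;

-- Replaying the DROP INDEX requires an exclusive lock on the table. Once the
-- reader has waited longer than max_standby_streaming_delay, its query is
-- cancelled with an error along these lines:
--   ERROR:  canceling statement due to conflict with recovery
--   DETAIL: User was holding a relation lock for too long.
```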

This is an issue that can also occur on regular Postgres, but there are a couple of Aurora-specific things to understand.

Understanding max_standby_streaming_delay for Postgres and Amazon Aurora

Earlier in their post, Wanda and Avi describe max_standby_streaming_delay. First of all, Amazon Aurora does not give you the option to turn this setting off (on regular Postgres, setting it to -1 means the standby waits forever). You can tune the setting, but you can't set it higher than 30 seconds.
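On a self-managed Postgres standby you'd change this in postgresql.conf; on Aurora it lives in the parameter group, subject to that 30-second cap. Either way, a quick way to check the current value on a reader:

```sql
-- Show the current setting on the reader / standby
SHOW max_standby_streaming_delay;

-- More detail: value, unit, and whether a reload is enough to change it
SELECT name, setting, unit, context
FROM pg_settings
WHERE name = 'max_standby_streaming_delay';
```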

The other thing to understand is that if Aurora encounters a conflict between a reader and the data coming in from the primary, it will wait for up to 60 seconds; that's a second, internal timeout. After those 60 seconds, Aurora will restart the reader node.

If you see your reader nodes restarting because you have a lot of conflicts, this is why.

This is Aurora's mechanism to make sure that a reader's cache doesn't diverge from the primary for more than 60 seconds.

They talk about a few ways to address these issues on Aurora. First of all, they generally recommend that reader queries finish before the max_standby_streaming_delay limit is hit; for example, add missing indexes or implement table partitioning. This kind of query tuning matters even more on an Aurora reader, because it helps you avoid these conflicts.
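One way to spot queries at risk is to look for statements on the reader that have already been running longer than your delay budget; a minimal sketch, using a 30-second threshold to match Aurora's maximum:

```sql
-- Active queries on the reader that have been running for more than 30 seconds
SELECT pid,
       now() - query_start AS runtime,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '30 seconds'
ORDER BY runtime DESC;
```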

Why you should implement retry logic with Amazon Aurora

They also recommend implementing retry logic. When a cancellation occurs, rerunning the same query will often succeed, because the conflicting activity on the primary has typically finished by then, even if the query takes longer than 30 seconds. Retry logic is important with Aurora.
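What the retry logic needs to catch is the recovery-conflict cancellation itself. Postgres reports these statement cancellations with the message below (typically under SQLSTATE 40001, serialization_failure, though you should verify with your driver); the query shown is just a hypothetical example:

```sql
-- Hypothetical long-running query on the reader that hits a conflict:
SELECT sum(total) FROM orders;
--   ERROR:  canceling statement due to conflict with recovery
--   SQLSTATE: 40001

-- In the application: when this error is returned, re-run the same statement
-- a limited number of times (ideally with a short backoff) before giving up.
```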


The other thing they mention is Aurora's fast cloning feature. Because a fast clone is its own independent cluster, queries running against a clone avoid these conflicts entirely.

Avoiding query conflicts on Postgres with Amazon Aurora's fast clones

In general, this is not something that only happens on Aurora: query conflicts can occur in a regular Postgres streaming replication setup as well. The Postgres documentation describes a few other aspects that are good to understand.

For example, the documentation mentions that when tables are regularly and heavily updated on the primary, you will often see cancellations of long-running queries on the standby. For such tables, you may actually want to shift read queries away from the readers and onto the writer, because that way you avoid the lock conflicts.

Understanding hot_standby_feedback for Postgres and Amazon Aurora

The other thing the documentation mentions here is the hot_standby_feedback setting. hot_standby_feedback is always enabled on Aurora, but on other Postgres installations you may choose to turn it off. It tells the primary which rows the secondary is still using, so that VACUUM skips over rows that are still needed by queries on the secondary.

It has the benefit of avoiding lock conflicts, but it has the downside of potentially causing table bloat.
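On a regular streaming-replication primary you can see this mechanism at work: with hot_standby_feedback on, each standby reports the oldest transaction its queries still need, and VACUUM won't remove row versions newer than that horizon. Aurora manages replication differently, so treat this as a standard-Postgres sketch:

```sql
-- On the primary: the transaction horizon each standby is holding back
SELECT application_name, client_addr, backend_xmin
FROM pg_stat_replication;
```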

This is something you should be aware of on Aurora, and something the AWS blog post did not talk about: because hot_standby_feedback is always turned on, it's possible that VACUUM can't do its work effectively and you'll see table bloat occurring.

An effective setup for hot_standby_feedback and max_standby_streaming_delay on Postgres

The other thing I want to mention is how Twitch uses Postgres. This is a blog post from a few years ago. In it, one of the things Aaron talks about is that sometimes they have data scientists or analysts running ad hoc queries. In order to be able to do that effectively, they set up readers that have hot_standby_feedback turned off and max_standby_streaming_delay set to six hours.

This means that the secondary will queue up the WAL: if there's a conflict, it will simply delay applying the WAL for up to six hours. It will also not tell the primary that there are active queries on the secondary, and thus not cause bloat. This is a great configuration on regular Postgres for making sure your secondaries can be used for long-running queries, but it will also increase replication lag on that reader, which is a trade-off you have to accept in this setup. On Aurora, this is a perfect use case for a fast clone.
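For reference, here is a minimal sketch of that kind of configuration on a self-managed, dedicated analytics standby (the six-hour value mirrors the Twitch post; both parameters only need a configuration reload, and you could set them in postgresql.conf instead of via ALTER SYSTEM):

```sql
-- On the dedicated analytics standby only
ALTER SYSTEM SET max_standby_streaming_delay = '6h';
ALTER SYSTEM SET hot_standby_feedback = off;
SELECT pg_reload_conf();
```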

Thank you so much for listening. This was 5mins of Postgres. Subscribe to our YouTube channel or follow us on Twitter to hear about next week's episodes. Talk to you next week!
