Avoiding deadlocks in Postgres migrations
In today’s episode 93 of “5mins of Postgres”, we're talking about how to avoid deadlocks in Postgres migrations, as well as some surprising lock behavior with migrations that I've personally encountered.
Transcript
Deadlocks and other surprising behavior in Postgres migrations
To start with, this is a blog post by Vanessa Ung on the Dovetail engineering blog.
Using separate transactions when changing multiple tables
The first thing that she describes is the issue you can run into when a migration makes two different changes and parts of your workload conflict with it. Imagine you have a migration where you're altering the RLS policy on one table, and you're also altering a different table where you're adding a constraint. These are related in principle, so it makes sense to do them as one migration.
But the issue is that you might have a user running a request where, within one transaction, they do a SELECT from the "comment" table and then a SELECT from the "comment_thread" table. The important thing to note here is that the order is essentially reversed: the migration first changes the "comment_thread" table and then the "comment" table, whereas the queries run the other way around.
If you're unlucky and that second transaction runs at the same time as the migration, you will get a deadlock. As Vanessa describes,
- first process A will take an AccessShareLock on the "comment" table,
- then process B running the migration will take an AccessExclusiveLock on the "comment_thread" table,
- and then process A will also try to get a lock on the "comment_thread" table, but it will not be able to get its AccessShareLock because that table is now exclusively locked.
Now, unfortunately, process B cannot proceed either: it wants to get an AccessExclusiveLock on the "comment" table, but that table is already locked with an AccessShareLock by process A's transaction.
At that point there's no way to resolve this without canceling one of the two processes. Now, the good news is that Postgres has what's called the deadlock detector. Behind the scenes, once a process has been waiting on a lock for deadlock_timeout (one second by default), the deadlock detector checks for these kinds of cycles in the lock graph.
It will essentially say: “rather than letting both processes wait forever, let's cancel one of them. That will resolve the situation and then things will continue.” Of course, that means your application may need to issue a retry, either on the migration or on the transaction that was canceled.
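If you want to see or adjust how long Postgres waits before running that check, the relevant setting is deadlock_timeout (this is standard Postgres behavior, not something specific to the post):

```sql
-- How long a lock wait lasts before the deadlock detector kicks in
SHOW deadlock_timeout;          -- 1s by default

-- Can be raised on busy systems to make the check cheaper,
-- at the price of detecting deadlocks slightly later
SET deadlock_timeout = '2s';
```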
There are other ways to improve the situation. For example, if a migration involves multiple tables, it often makes sense not to do too much at once in the same transaction. Even though it's nice to have transactional DDL in Postgres, packing many changes into one transaction often causes exactly these types of problems in production workloads.
And so one best practice is to split the migration into one transaction per table; that way you avoid the potential deadlocks that come from locking multiple tables in a single transaction.
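A minimal sketch of that split, reusing the placeholder statements from above:

```sql
-- Migration 1: only touches "comment_thread"
BEGIN;
ALTER POLICY thread_access_policy ON comment_thread USING (created_by = current_user);
COMMIT;

-- Migration 2: only touches "comment", in a separate transaction
BEGIN;
ALTER TABLE comment ADD CONSTRAINT body_not_empty CHECK (length(body) > 0);
COMMIT;
```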
Which locks are being acquired by your migration?
Now, the other thing that's useful to do sometimes is to look at which locks are actually being acquired by your migration, because you might have more complex relationships between the tables involved.
What you can do is, before you issue the COMMIT on a migration, look at the pg_locks view, which shows you the internal view of Postgres' lock manager. You can look at all locks currently held on the server, or filter down to just the current PID. At that point the migration has already done its work; it just hasn't committed the transaction, and the locks are only released at the end of the transaction.
This is a good way to get a sense for which locks you actually need to complete the work.
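As a rough sketch, using the placeholder constraint from earlier, you could run the migration in a transaction and inspect pg_locks before deciding to commit:

```sql
BEGIN;
ALTER TABLE comment ADD CONSTRAINT body_not_empty CHECK (length(body) > 0);

-- Locks held by this session so far (they stay until COMMIT/ROLLBACK)
SELECT locktype, relation::regclass AS relation, mode, granted
  FROM pg_locks
 WHERE pid = pg_backend_pid();

COMMIT;  -- or ROLLBACK if you were only checking
```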
Be explicit about lock order
Now, the other thing that Vanessa recommends here is being explicit about your lock order. Be aware, this is actually hard to do in practice. Unfortunately, most of the time when application engineers are making changes, they won't be aware of the locking considerations on the database side.
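If you do want to be explicit, one approach (standard Postgres, not specifically from the post) is to take the table locks up front, in the same order that the rest of your application touches those tables:

```sql
BEGIN;
-- Acquire locks up front, in the same order the application uses
-- ("comment" before "comment_thread" in the earlier example)
LOCK TABLE comment IN ACCESS EXCLUSIVE MODE;
LOCK TABLE comment_thread IN ACCESS EXCLUSIVE MODE;

-- ... the actual DDL statements follow here ...
COMMIT;
```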
Really, this is where splitting things up makes it easier, because then you don't have to be conscious of these different orders. And so to me the big call to action here is to avoid complex transactions. Complex transactions, unless you truly need them, usually make things worse in the database rather than better.
So, unless you really need that atomicity guarantee, it's much better to have work split into multiple parts that are not part of the same transaction, because then you don't have to worry about the ordering as much.
Linting database migrations
The last thing that Vanessa mentions is that it's good to lint your database migrations. At Dovetail they have an internal tool to analyze the statements; it's not mentioned which tool that is. Personally, in the Ruby on Rails world, one tool that I've seen frequently used is strong_migrations by Andrew Kane (we’re big fans of Andrew here and have featured him previously in [5mins of Postgres E74: Storing and querying vector data in Postgres with pgvector](https://pganalyze.com/blog/5mins-postgres-vectors-pgvector)). strong_migrations runs as part of your Ruby on Rails migrations and checks for a number of known cases where a migration might be doing something unsafe.
It prompts the engineer to explicitly confirm that they're intentionally doing something risky, like running a very slow operation that will lock the whole table for a long time.
A quick example to showcase locking behavior
Now, on that last note, I want to show you something that I personally find surprising when I write migrations on a Postgres database.
For that, I'm going to pull up my local database and I'll illustrate this by first showing you how we want things to work (feel free to follow along in the video above). I'm intentionally running a query that is very slow: it does a pg_sleep(60), but it's also doing a count. What I'm illustrating here is a long-running query on the "users" table that we're working with.
Now in parallel, I'm also doing a completely unrelated read, for example on my own email address, just getting my own user ID. And as expected, that's really fast: two milliseconds, and I've got the data. It's a single row, it's super simple. This is what we would expect: they are two independent operations, so there's no reason why one should block the other.
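The two concurrent sessions might look roughly like this; the pg_sleep/count combination is what I used in the demo, but the exact column names and email are placeholders:

```sql
-- Session 1: a deliberately slow query on "users" (runs for ~60 seconds)
SELECT count(*) FROM users, pg_sleep(60);

-- Session 2: an unrelated, fast point lookup on the same table
SELECT id FROM users WHERE email = 'lukas@example.com';
```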
Now, I'm running a DDL statement where I'm adding a column to the table. In Postgres, that's generally a very fast operation, because I'm just modifying the catalogs to add that extra column; I don't have to rewrite any data in the table for this. It's a nullable column without a default, so Postgres simply treats it as NULL for all existing rows.
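The DDL in question is just an ordinary ADD COLUMN (the column name here is a placeholder):

```sql
-- Normally near-instant: only the catalogs change, no table rewrite needed
ALTER TABLE users ADD COLUMN nickname text;
```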
I'll start this, and you'll notice that it is not fast. Because that count query is still running concurrently, the ALTER TABLE can't get the exclusive lock it needs on the table, so it has to wait. The thing that's really surprising, if I go back here, is that the query that was really fast earlier is now really slow as well.
And this is what oftentimes causes production issues. What's happening is that the DDL statement itself is really fast, but it's blocked by that other activity. Until the slow query finished, the DDL statement couldn't finish, and any other activity on the table is blocked too, because in Postgres this very simple SELECT statement ended up behind the DDL statement in the lock queue. Postgres intentionally doesn't let that really fast SELECT jump ahead. The table isn't actually exclusively locked yet, somebody has merely requested an exclusive lock. But because of that pending request from the DDL statement, other SELECTs on the table get slowed down as well.
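When this happens, a quick way to see who is waiting on whom is pg_blocking_pids() together with pg_stat_activity (standard Postgres, not something specific to this demo):

```sql
-- Show sessions waiting on a lock and the PIDs they are blocked behind
SELECT pid,
       state,
       wait_event_type,
       pg_blocking_pids(pid) AS blocked_by,
       left(query, 60) AS query
  FROM pg_stat_activity
 WHERE wait_event_type = 'Lock';
```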
If you're seeing surprising behavior like this, where you're making a really simple migration, you're just adding a column, and for some reason A, it's very slow and B, your whole production application is down, oftentimes the true root cause is a slow query or a long-running transaction on that table.
Now, one last note: on production systems, it makes a lot of sense to set statement_timeout. In newer Postgres versions, you can also set an idle-in-transaction timeout. If you have that in place and your statement timeout is 1 minute, your migration will be blocked for at most 1 minute.
That might still show some impact if there are slow queries, but at least we're not talking about 30 or 60 minutes. Sometimes I've seen runaway queries or stuck processes cause really bad problems; statement_timeout and idle_in_transaction_session_timeout really help you avoid that.
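As a sketch, these settings can be applied per role or per session; the one-minute values and the role name here are just example placeholders, not a recommendation from the episode:

```sql
-- Apply to every session of the application role (example values)
ALTER ROLE app_user SET statement_timeout = '1min';
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '1min';

-- Or just for the current session, e.g. while running a migration
SET statement_timeout = '1min';
```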
If you enjoyed E93 of 5mins of Postgres, why not subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and Twitter to get updates about new episodes?