Finding the root cause of locking problems in Postgres

In today’s E99 of “5mins of Postgres” we're showing how to go about finding the source of a locking problem in Postgres. Specifically, we are looking at how to debug heavyweight locks, how to end a process that’s holding a lock in Postgres, and general helpful settings for handling Postgres locking.





Transcript

Debugging a lock problem in Postgres

We'll start with this blog post by Jesse Soyland on the Crunchy Data blog. Jesse is a customer success engineer on the Crunchy team and as part of that job, he often works with customers that have a locking problem. In his experience, these types of issues can quickly take down an entire production database or application.

Now, before we jump into this post, it's important to note that there are two different types of locks that we might be talking about when we talk about locks in Postgres:

  1. heavyweight locks
  2. LWLocks

Heavyweight locks are the locks that a user controls. They are acquired, for example, when you say LOCK TABLE, or when you make changes to the database that require a certain lock level. These all result from queries that you, as the user of the database, execute.
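As a minimal sketch, here are two ways a heavyweight lock gets acquired, one explicit and one implicit (the orders table is hypothetical):

```sql
BEGIN;
-- Explicitly acquire a heavyweight lock on a table:
LOCK TABLE orders IN ACCESS EXCLUSIVE MODE;
-- ... do work while holding the lock ...
COMMIT;

-- DDL acquires heavyweight locks implicitly; this ALTER TABLE needs
-- an ACCESS EXCLUSIVE lock on the table for the duration of the statement:
ALTER TABLE orders ADD COLUMN note text;
```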

There are also what are called LWLocks (lightweight locks), which are the internal locks inside Postgres.

When you look at wait event data, you might find a particular LWLock that points to contention on a particular internal Postgres subsystem.
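As a quick illustration, this is one way to sample that wait event data from pg_stat_activity:

```sql
-- Rows with wait_event_type = 'LWLock' indicate contention on an
-- internal Postgres subsystem (e.g. WAL insertion or buffer mapping):
SELECT pid, wait_event_type, wait_event, state
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock';
```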

What we're talking about here, when we talk about locks, are the heavyweight locks, the user-controlled locks.

Understanding heavyweight locks in Postgres

The most basic way to debug a locking problem is to look at the pg_stat_activity and pg_locks views.

Jesse starts by looking at pg_stat_activity, which - in his example - shows us that a specific connection is currently waiting for a lock on a relation. The statement in question is an ALTER TABLE. This should be very fast, right? It's just doing ALTER TABLE ADD COLUMN, and adding a column is an online operation.

So, generally speaking, this is not something that should be slow, but because there is a blocking lock, it is not able to succeed.
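A query along these lines (a sketch, not the exact query from Jesse's post) surfaces such a waiting connection:

```sql
-- Find sessions currently waiting on a heavyweight lock:
SELECT pid, state, wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```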

If we want to understand why this is not succeeding, one way to look at it is to search for this PID in the "pg_locks" view. "pg_locks" is an internal view on the Postgres lock manager for heavyweight locks.

This is very helpful, as it is where you can go to find out which process is holding which locks and also which locks are granted vs. which locks are still being waited on.

We can start by asking: for this particular process, this particular connection, which of the locks are not yet granted - what is it waiting on? In Jesse's example, we can see that it's waiting on a relation lock. The relation is a table in his case, but it could also be an index. We see that it's waiting on a particular table, identified by its OID.
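For example, with a placeholder PID of 12345, a lookup like this shows the ungranted lock requests:

```sql
-- Which locks has this backend requested that were not yet granted?
SELECT locktype, relation::regclass AS relation, mode, granted
FROM pg_locks
WHERE pid = 12345
  AND NOT granted;
```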

We can use this to ask: which other connections, which other processes, are holding that lock? In this case, it shows us a single connection that's causing the problem. We can then look at pg_stat_activity again and identify that it's actually an idle-in-transaction connection that never committed its transaction. It presumably made some kind of modification to that same table, and that's why it's blocking the DDL statement.
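A sketch of that second lookup, with a hypothetical orders table standing in for the relation found in the previous step:

```sql
-- Which backends hold a granted lock on that same relation,
-- and what are they doing right now?
SELECT l.pid, l.mode, a.state, left(a.query, 60) AS query
FROM pg_locks l
JOIN pg_stat_activity a USING (pid)
WHERE l.relation = 'orders'::regclass
  AND l.granted;
```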

There are various ways of running lock queries that help you do this more automatically. If you have a Postgres monitoring tool, you could have it run such a query in the background. Jesse's post shows one version of that type of query, and the Postgres Wiki collects several other lock blocking queries - we'll get to a bit more on that in a second.

But let's say you have identified which process is blocking things. Then you have to think about: how can you end this process?


How to end a process that’s holding a lock in Postgres

How can you get the thing that is blocking to complete? If it's an idle-in-transaction connection and you have the necessary level of access, you could commit or roll back that transaction. If that doesn't work, you can use either pg_cancel_backend or, in the case of an idle transaction, pg_terminate_backend. pg_cancel_backend cancels a running query, but if the connection is in the idle-in-transaction state, you actually need to terminate the connection to get out of that state. In the particular situation Jesse describes, we would have needed to run pg_terminate_backend to resolve the problem.
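Both functions take the PID of the backend in question (12345 is a placeholder):

```sql
-- Cancels the currently running query; a session that is idle in
-- transaction has no running query, so this won't resolve that case:
SELECT pg_cancel_backend(12345);

-- Terminates the whole connection, rolling back its open transaction
-- and releasing its locks:
SELECT pg_terminate_backend(12345);
```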

Helpful settings for handling Postgres locking

Jesse mentions two more things that I think are helpful for a lot of us that run into these problems.

log_lock_waits

First of all, you can enable a setting called log_lock_waits. I would say anybody who runs a production Postgres should enable this. For example, on Crunchy Bridge this is actually enabled by default.

What log_lock_waits does is cause a log event to be emitted every time a lock has been waited on for longer than the deadlock timeout in Postgres, which defaults to 1 second. Even though a lock wait is not a deadlock, the check runs on the same schedule as the deadlock detector. If lock waits are taking longer than 1 second, this is a great way to find out about it. The log line tells you which statement is waiting, what it's waiting on, and which PID holds the lock.
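Enabling it looks like this; both settings can be changed with a configuration reload, no restart needed:

```sql
-- Log whenever a lock wait exceeds deadlock_timeout (default: 1s):
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET deadlock_timeout = '1s';
SELECT pg_reload_conf();
```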

lock_timeout

Another way to get more safety here is to set a lock_timeout. Imagine you have a migration where you're doing an ALTER TABLE, and you think it's safe to run, but actually, because a concurrent operation is still holding a conflicting lock, the ALTER TABLE is really slow.

Setting a lock_timeout will cause this to return early with an error, similar to a canceled statement. You will know pretty quickly that there's a locking problem! So, especially for migrations, it's a very good idea to set a lock_timeout. Make it 10 seconds, make it 30 seconds, or whatever you like. But have a way to ensure the system doesn't just wait forever on a lock and then potentially cause downtime.
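For example, in the session that runs the migration (the orders table is again hypothetical):

```sql
-- Give up on lock acquisition after 10 seconds instead of queueing
-- indefinitely; only affects this session:
SET lock_timeout = '10s';
ALTER TABLE orders ADD COLUMN note text;
-- If the lock isn't granted in time, the statement fails with:
-- ERROR: canceling statement due to lock timeout
```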

Postgres lock trees

Now, I do want to talk a little bit more about the implications of schema migrations or DDL statements waiting on other activity to complete in order to do their work.

For this, I want to go back to a very good post that Nikolay from Postgres.ai wrote a couple of years ago. In it, he talks about different queries to analyze what he called “Postgres lock trees” or “lock queues”. In his article he describes how you have to be careful that there are different layers of lock waits.

One process waiting on another process… This doesn't necessarily paint a full picture. There is a lot of detail to his post, but ultimately what I want to talk about is him showing us a tree of lock waits.

In his example, we have an UPDATE statement which is ultimately the holder of the lock that's of interest. We also have a DELETE statement that's waiting on that UPDATE statement. Then we also have an ALTER TABLE statement that's waiting for both of these to complete. Lastly, and this is where it gets really tricky, we have two SELECT statements that are also waiting.

This is the thing that's really surprising to most people! You would think a SELECT statement isn't modifying anything, so why would it need to wait on these other, modifying statements?

Unfortunately, the problem is that the ALTER TABLE statement, even though it is just an ADD COLUMN, needs an exclusive lock (ACCESS EXCLUSIVE) on the whole table for a very brief moment. Because later lock requests queue up behind it, no other SELECT statement can start on that table while the ALTER TABLE is waiting.

The way that Postgres queues this for you is that it says:

  1. Let's have this UPDATE statement finish
  2. Let's have this DELETE statement finish
  3. Then, run the ALTER TABLE
  4. Then, let all the SELECTs run

That means that once you have an ALTER TABLE waiting on other activity, it's very risky, because it causes SELECT statements to also wait. I've seen production outages caused by just this little detail!
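As a sketch of how to spot such a situation yourself, the built-in pg_blocking_pids function gives you the edges of the lock tree (this is not Nikolay's exact query):

```sql
-- Flat view of who waits on whom; following blocked_by recursively
-- reconstructs the full lock tree:
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```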

So you really want to be careful and respond quickly when you're seeing lock problems in Postgres. In particular, when it's about exclusive locks taken by a table migration.

Lock monitoring with pganalyze

Because this is so complicated, we actually ended up building a dedicated lock monitoring feature in pganalyze. I just want to show you a quick example of this because I still use this frequently and I think it's really useful.

What this feature does is give you a UI that says: this particular statement is waiting on this particular other statement. The feature is able to traverse the lock tree, or lock graph.

If you're running any kind of production Postgres database, it's very important to have monitoring for blocked sessions - either know how to run a manual query yourself, or have a product like pganalyze that can do this for you automatically behind the scenes.

I hope you learned something new from E99 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and Twitter to get updates about new episodes!
