5mins of Postgres E40: How to debug slow Postgres UPDATEs and bloated tables with auto_explain & pageinspect

Today, we talk about why Postgres UPDATEs are getting slower over time, HOT updates, auto_explain, the pageinspect extension, and reasons why VACUUM won’t remove dead rows from a table.





Transcript

Let's get to it!

Debugging slow Postgres UPDATEs and bloated tables

This blog post by Laurenz from the Cybertec team describes a customer situation that Laurenz ran into. In his article, Laurenz walks us through an example with two tables: an item table and a sum table.

The item table gets items inserted, and the sum table continuously gets updated with the running total as new items come in. This is done through a trigger. The sum table doesn't have any indexes, so we should be able to get HOT updates.
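To make the setup concrete, here is a minimal sketch of what such a schema and row-level trigger could look like. The table and column names (item.value, sum.total) are assumptions for illustration, not necessarily the exact definitions from Laurenz's article:

```sql
-- Hypothetical schema, roughly matching the shape of the example
CREATE TABLE item (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    value numeric NOT NULL
);

CREATE TABLE sum (
    total numeric NOT NULL  -- no indexes, so UPDATEs on this table can be HOT
);
INSERT INTO sum (total) VALUES (0);

-- Row-level trigger: fires once per inserted item and updates the total
CREATE FUNCTION add_to_sum() RETURNS trigger AS $$
BEGIN
    UPDATE sum SET total = total + NEW.value;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER item_add_to_sum
    AFTER INSERT ON item
    FOR EACH ROW
    EXECUTE FUNCTION add_to_sum();
```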

In his example, Laurenz's script takes 60 seconds to load a hundred thousand rows, which is pretty slow. This is when using a trigger that does an UPDATE statement. But, if we drop the trigger, the script runs in less than 17 milliseconds!

So clearly, doing all these UPDATEs is a problem. How do we prove though that UPDATEs are actually getting slow over time? Maybe the UPDATEs are always slow?

Understanding Postgres queries with auto_explain

Laurenz uses an extension called auto_explain. I often like to use auto_explain, because it shows you exactly what occurred when a query was slow. Feel free to check out my article on how we visualize auto_explain and provide insights at pganalyze. If I rerun an UPDATE, it might be faster after the fact, but if I experienced the slowness at a particular point in time, only auto_explain can tell me what really went on at that time.

auto_explain gets enabled through shared_preload_libraries. You can then set various auto_explain settings to configure when it logs the execution plan. Each statement that runs longer than auto_explain.log_min_duration will be output to the Postgres logs, where we can then take a look at the performance and other details of the execution plan.

Using log_min_duration and log_timing

log_min_duration = 0 means every single statement's plan gets output to your logs. This is a configuration I would not use in production. Instead, what I usually recommend is setting log_min_duration to something like 1000 milliseconds, and turning log_timing off, since collecting per-node timing can have some overhead.
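As a rough sketch, the matching postgresql.conf settings could look something like this - these are real auto_explain parameters, but the values just mirror the ballpark recommendation above, not universal defaults:

```
shared_preload_libraries = 'auto_explain'  # requires a restart

auto_explain.log_min_duration = 1000   # log plans for statements slower than 1s
auto_explain.log_analyze = on          # include actual row counts and runtimes
auto_explain.log_buffers = on          # include the "Buffers" statistics
auto_explain.log_timing = off          # skip per-node timing to reduce overhead
```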


What's interesting is, as we look at the first UPDATE, you can see that it's pretty fast. The first UPDATE takes 0.017 milliseconds. I have rarely seen statements that fast in production.

The last UPDATE in the chain is actually much slower. If we compare these two times, you can see 0.017 milliseconds versus 1.6 milliseconds. So that's about a hundred times slower for the last UPDATE.

If you look at the statistics, everything looks pretty similar except the "Buffers" statistics. These are counted in pages: Postgres was looking at 443 8 kB pages. Note that the "hit" portion here means this data was already in the shared buffer cache. We can conclude that this wasn't necessarily slow from an I/O perspective, but it was slow in terms of looking at a lot of data.
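If you want to reproduce this kind of measurement yourself outside of auto_explain, you could run something like the following; the table and column names are the assumed ones from the sketch above, and the ROLLBACK keeps the test UPDATE from persisting:

```sql
BEGIN;
-- Shows the actual runtime plus "Buffers: shared hit=..." for the UPDATE
EXPLAIN (ANALYZE, BUFFERS)
    UPDATE sum SET total = total + 1;
ROLLBACK;
```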

Understanding Postgres UPDATE and VACUUM

In Postgres, an UPDATE works by creating what's called a new row version and then marking the old row version as dead. In this particular example, the last UPDATE simply had a lot more row versions to look through.
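If you want to see those row versions for yourself, the hidden system columns give a quick view (again using the assumed sum table from the sketch above):

```sql
-- ctid is the physical location (page, offset) of the current row version;
-- xmin/xmax are the transaction IDs that created / invalidated it.
-- After many UPDATEs, the offset part of ctid keeps climbing within the page
-- until the old versions get pruned away.
SELECT ctid, xmin, xmax, total FROM sum;
```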

VACUUM is supposed to clean this up. Usually you have autovacuum running, and autovacuum will remove these dead row versions produced by your statements. The problem, though, is that the COPY statement that loads the data is a single transaction, and VACUUM can only remove row versions that are older than any currently active transaction. That's why it's very important to reduce long-running transactions, to make sure VACUUM can do its work and finish what it's trying to do.
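One way to spot the long-running transactions that hold VACUUM back is pg_stat_activity. A query along these lines works as a sketch - the five minute threshold is an arbitrary example:

```sql
-- Sessions with an open transaction; backend_xmin is the oldest
-- transaction ID this session still forces VACUUM to keep visible
SELECT pid, xact_start, state, backend_xmin, left(query, 60) AS query
  FROM pg_stat_activity
 WHERE backend_xmin IS NOT NULL
   AND xact_start < now() - interval '5 minutes'
 ORDER BY xact_start;
```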


HOT pruning in Postgres

Now, there is this notion of HOT pruning. What that means is when you have a HOT (Heap-Only Tuple) update - so there are no index entries referencing the new row version - Postgres can actually do a “mini-VACUUM” of sorts, where it just prunes the page that you're modifying, because it’s already modifying that page anyway. This can usually dramatically improve the efficiency of a heavy UPDATE workload like this.
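To check whether your updates are actually ending up as HOT updates, you can compare the counters in pg_stat_user_tables. A small sketch, using the assumed table name from above:

```sql
-- n_tup_hot_upd close to n_tup_upd means most updates were HOT
SELECT relname, n_tup_upd, n_tup_hot_upd, n_dead_tup
  FROM pg_stat_user_tables
 WHERE relname = 'sum';
```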

However, that only works if you don't have transactions holding things open.

In Laurenz’s particular case, because we have this long-running transaction, it doesn't actually work to clean up these dead row versions.

Understanding Postgres pages with pageinspect

At the end, Laurenz shows the pageinspect extension. pageinspect is very useful if you want to prove what's going on. For example, Laurenz inspects the very first page of the table and shows how all of the row versions in that page are marked as dead, but also marked as HOT updated. You want to see successful HOT updates, but because of that long-running transaction, HOT pruning wasn't able to keep up.
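As a sketch of how you could do this kind of inspection yourself (pageinspect requires superuser privileges, and the table name is again the assumed sum table):

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- Look at the line pointers and tuple headers of the first page (block 0).
-- A set t_xmax plus the HOT-related bits in the infomask columns tell you
-- whether a row version is dead and part of a HOT update chain.
SELECT lp, lp_flags, t_xmin, t_xmax, t_ctid, t_infomask2, t_infomask
  FROM heap_page_items(get_raw_page('sum', 0));
```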

The fix in this particular situation was using what's called a statement-level trigger, which ultimately moves the UPDATE of this particular table out of the per-row processing of the COPY statement. By doing so, we avoid piling up all those row-by-row UPDATEs inside one long-running transaction. In this particular case that took the overall load down to 19 milliseconds!
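Here is a sketch of what such a statement-level trigger could look like, using a transition table (available since PostgreSQL 10). It is an illustration with the assumed table and column names, not necessarily Laurenz's exact fix:

```sql
DROP TRIGGER IF EXISTS item_add_to_sum ON item;

-- Fires once per statement instead of once per row, so the sum table
-- only receives a single UPDATE for the whole data load
CREATE FUNCTION add_to_sum_stmt() RETURNS trigger AS $$
BEGIN
    UPDATE sum
       SET total = total + (SELECT coalesce(sum(value), 0) FROM new_items);
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER item_add_to_sum_stmt
    AFTER INSERT ON item
    REFERENCING NEW TABLE AS new_items
    FOR EACH STATEMENT
    EXECUTE FUNCTION add_to_sum_stmt();
```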

Now, just a quick note: if you are seeing this in production, oftentimes it will show up as bloated tables, as we saw in this particular case. When you run VACUUM (VERBOSE), or when autovacuum logs its activity, you can see this notion of dead rows that are "not yet removable". This is really the main thing to look out for when autovacuum doesn't seem to be catching up and doesn't seem to be keeping your tables at the size you expect them to be. There are a couple of other causes, but long-running transactions are the main one that I've seen.
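If you want to check this on a specific table, a quick sketch (again with the assumed table name):

```sql
-- VACUUM VERBOSE reports how many dead row versions it could remove,
-- and how many it had to keep because some transaction might still need them
VACUUM (VERBOSE) sum;

-- Dead tuple counts and the last (auto)vacuum runs for the table
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
  FROM pg_stat_user_tables
 WHERE relname = 'sum';
```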

Thank you so much for listening. This was 5mins of Postgres. Feel free to subscribe to our YouTube channel and follow us on Twitter to hear about next week's episode!
