How to use HOT Updates in Postgres to make CLUSTER work better

In E72 of “5mins of Postgres” we're gonna talk about how to use HOT updates to use CLUSTER more effectively, and how we can utilize CLUSTER to get the last bit of performance out of an index scan.



Share this episode: Click here to share this episode on twitter, or sign up for our newsletter and subscribe to our YouTube channel.


Transcript

Let's jump right in!

Utilizing CLUSTER to speed up index scans in Postgres

This is a blog post by Laurenz on the Cybertec blog. In this blog post, Laurenz talks about how we can utilize CLUSTER to get the last bit of performance out of an index scan.

Usually, you have to run CLUSTER regularly to maintain good performance, but in his article, Laurenz shows a trick that lets us avoid re-clustering whilst having concurrent updates on a table.

First of all, some background on why we care about clustering. When you have your Postgres table, also called the “heap”, the order of the rows doesn't necessarily matter. For example, you might have an index scan on the left here that finds records 1, 2, 3, but records 1, 2, 3 might be anywhere in that table.

Typically, Postgres tables are structured as blocks of 8 kB pages. Even if you're just finding 3 records, you might end up loading 3 x 8 kB, even if the individual records are very small.

This is because of the physicality of the table, as that data might be spread across multiples of these 8 kB pages. CLUSTER lets us order the table in a certain way. In Laurenz’ particular case, if we cluster on the same column that the index is ordered by, we can see that the index will actually return the results from a single 8 kB block in this particular case, which is much more efficient in terms of I/O on the table.

The downside of this is that we actually have to run CLUSTER to make that change, and we can, of course, only have one physical table order.

Understanding table sorting via the correlation column

If you're interested in whether your table is sorted in a certain way, you can look at the correlation column, which you can find in the pg_stats table. This lets you know whether you have a physical order of the rows that correlates with that given column, which is a number between minus one and one. If it's zero, that means there's no correlation between the physical and the logical order. This is also used by the Postgres planner to estimate the cost of index range scans.

The drawbacks of Postgres CLUSTER

CLUSTER is an expensive operation. First of all, CLUSTER will lock your table in an access exclusive mode, so you can't even read the table whilst you're running CLUSTER, so it's very disruptive for any production system.

Second, you need twice the disk space for a table in order to run CLUSTER, which presents its own problem sometimes. Now, there are ways to run CLUSTER online using tools like pg_repack, but still it is something that you're not gonna run every day.

Using HOT Updates for making CLUSTER more effective in Postgres

Now, the insight here really comes from the fact that when you have a HOT update in Postgres, also called a “Heap Only Tuple”, then that preserves the order on the table. Generally speaking, a HOT update occurs when you are making an UPDATE where you're only modifying the columns that are not included in any of your indexes on the table.

Second, what also needs to be the case is that there's enough space on the page. Even if something is HOT capable, meaning you're not updating an index column, a HOT update might not actually occur because there is not enough space on the physical page to fit the HOT update.

Download Free eBook: How To Get 3x Faster Postgres

Optimizing Postgres with the fillfactor setting

This relates to the setting fillfactor in Postgres. Fill factor determines how many individual records are added into a single 8 kB page. The lower the fill factor, the more empty space there is for HOT updates to occur, because there is just free space in each page. Of course, if you make fill factor really small, let's say 10%, then you would have a very bloated table, a very sparse table, which slows down sequential scans and makes caching less effective. So we don't wanna lower it too much.

The question in this context here is: We know that HOT updates help us preserve the order because they occur in the same page. But we also know that reducing fillfactor too much is a problem. What would be the right fillfactor setting?

Laurenz runs a benchmark using pg_bench. You should probably benchmark this with a table that roughly looks like your table that you're actually running in production. For example the individual row sizes will influence how fillfactor works.

In Laurenz’ case, the rows are very small. We have one bigint column and two integer columns. What we're testing with is different per table fillfactor settings. We're using the generate_series function to add 10 million rows, and then on the key column, for each of these series values, we're generating a hash so that we have a different distribution than the actual series value. We're creating an index on the key column, and then we're clustering the table using that index.

Afterwards, we do VACUUM ANALYZE. Once we've done that, we'll now run an UPDATE statement and that's going to simulate the cluster order deviating because of the changes to the table. Here we're making 60 million updates to the table with 6 concurrent clients.

Then, we're measuring the results as determined by that correlation column in pg_stats. We're also looking at how many HOT updates occurred and we'll also look at the total table size to understand how much they were bloated.

I encourage you to look at Laurenz’ example in his blog post. It paints a very clear picture. If you have fillfactor 100 in his particular test case, the correlation is only 0.5. That means that the physical and logical row order have clearly changed after these updates. Not all of the updates were HOT updates either, even though they could have been HOT updates, and that's because the fill factor was high enough that there just was no space on the pages.

As we lower the fill factor, correlation improves, which shows that more HOT updates are better for correlation. You can see at fillfactor 75, we have perfect correlation and with 100% HOT updates.

We can also see that the index bloat is zero for the cases where we have near perfect HOT updates. That is easily explained because if you don't modify the place the index points to, it still points to the same page, then Postgres can follow the HOT chain to the actual current row version.

Generally, what this shows is that if you have a table that doesn't have many inserts but has a lot of updates and you see CLUSTER improving performance a lot, then tuning your table for HOT updates is even more important. The trade-off here often is to not over-index, because if you over-index, then you might actually have problems with updates not being HOT capable, and so they become regular updates that may be placed anywhere in the table destroying your clustered order.

Thanks for joining E72 of our 5mins of Postgres show. Subscribe to our YouTube channel, sign up for our newsletter and follow us on LinkedIn and Twitter to get updates about new episodes!

What we have discussed in this episode of 5mins of Postgres


Enjoy blog posts like this?

Get them once a month to your inbox