How to use HOT Updates in Postgres to make CLUSTER work better
In E72 of “5mins of Postgres” we're gonna talk about how to use HOT updates to use CLUSTER more effectively, and how we can utilize CLUSTER to get the last bit of performance out of an index scan.
- Utilizing CLUSTER to speed up index scans in Postgres
- Understanding table sorting via the correlation column
- The drawbacks of Postgres CLUSTER
- Using HOT Updates for making CLUSTER more effective in Postgres
- Optimizing Postgres with the fillfactor setting
- What we have discussed in this episode of 5mins of Postgres
Let's jump right in!
This is a blog post by Laurenz on the Cybertec blog. In this blog post, Laurenz talks about how we can utilize
CLUSTER to get the last bit of performance out of an index scan.
Usually, you have to run
CLUSTER regularly to maintain good performance, but in his article, Laurenz shows a trick that lets us avoid re-clustering whilst having concurrent updates on a table.
First of all, some background on why we care about clustering. When you have your Postgres table, also called the “heap”, the order of the rows doesn't necessarily matter. For example, you might have an index scan on the left here that finds records 1, 2, 3, but records 1, 2, 3 might be anywhere in that table.
Typically, Postgres tables are structured as blocks of 8 kB pages. Even if you're just finding 3 records, you might end up loading 3 x 8 kB, even if the individual records are very small.
This is because of the physicality of the table, as that data might be spread across multiples of these 8 kB pages.
CLUSTER lets us order the table in a certain way. In Laurenz’ particular case, if we cluster on the same column that the index is ordered by, we can see that the index will actually return the results from a single 8 kB block in this particular case, which is much more efficient in terms of I/O on the table.
The downside of this is that we actually have to run
CLUSTER to make that change, and we can, of course, only have one physical table order.
If you're interested in whether your table is sorted in a certain way, you can look at the correlation column, which you can find in the
pg_stats table. This lets you know whether you have a physical order of the rows that correlates with that given column, which is a number between minus one and one. If it's zero, that means there's no correlation between the physical and the logical order. This is also used by the Postgres planner to estimate the cost of index range scans.
CLUSTER is an expensive operation. First of all, CLUSTER will lock your table in an access exclusive mode, so you can't even read the table whilst you're running CLUSTER, so it's very disruptive for any production system.
Second, you need twice the disk space for a table in order to run
CLUSTER, which presents its own problem sometimes. Now, there are ways to run
CLUSTER online using tools like
pg_repack, but still it is something that you're not gonna run every day.
Now, the insight here really comes from the fact that when you have a HOT update in Postgres, also called a “Heap Only Tuple”, then that preserves the order on the table. Generally speaking, a HOT update occurs when you are making an UPDATE where you're only modifying the columns that are not included in any of your indexes on the table.
Second, what also needs to be the case is that there's enough space on the page. Even if something is HOT capable, meaning you're not updating an index column, a HOT update might not actually occur because there is not enough space on the physical page to fit the HOT update.
This relates to the setting
fillfactor in Postgres. Fill factor determines how many individual records are added into a single 8 kB page. The lower the fill factor, the more empty space there is for HOT updates to occur, because there is just free space in each page. Of course, if you make fill factor really small, let's say 10%, then you would have a very bloated table, a very sparse table, which slows down sequential scans and makes caching less effective. So we don't wanna lower it too much.
The question in this context here is: We know that HOT updates help us preserve the order because they occur in the same page. But we also know that reducing fillfactor too much is a problem. What would be the right fillfactor setting?
Laurenz runs a benchmark using
pg_bench. You should probably benchmark this with a table that roughly looks like your table that you're actually running in production. For example the individual row sizes will influence how fillfactor works.
In Laurenz’ case, the rows are very small. We have one
bigint column and two
integer columns. What we're testing with is different per table fillfactor settings. We're using the
generate_series function to add 10 million rows, and then on the key column, for each of these series values, we're generating a hash so that we have a different distribution than the actual series value. We're creating an index on the key column, and then we're clustering the table using that index.
Afterwards, we do
VACUUM ANALYZE. Once we've done that, we'll now run an
UPDATE statement and that's going to simulate the cluster order deviating because of the changes to the table. Here we're making 60 million updates to the table with 6 concurrent clients.
Then, we're measuring the results as determined by that correlation column in
pg_stats. We're also looking at how many HOT updates occurred and we'll also look at the total table size to understand how much they were bloated.
I encourage you to look at Laurenz’ example in his blog post. It paints a very clear picture. If you have fillfactor 100 in his particular test case, the correlation is only 0.5. That means that the physical and logical row order have clearly changed after these updates. Not all of the updates were HOT updates either, even though they could have been HOT updates, and that's because the fill factor was high enough that there just was no space on the pages.
As we lower the fill factor, correlation improves, which shows that more HOT updates are better for correlation. You can see at fillfactor 75, we have perfect correlation and with 100% HOT updates.
We can also see that the index bloat is zero for the cases where we have near perfect HOT updates. That is easily explained because if you don't modify the place the index points to, it still points to the same page, then Postgres can follow the HOT chain to the actual current row version.
Generally, what this shows is that if you have a table that doesn't have many inserts but has a lot of updates and you see
CLUSTER improving performance a lot, then tuning your table for HOT updates is even more important. The trade-off here often is to not over-index, because if you over-index, then you might actually have problems with updates not being HOT capable, and so they become regular updates that may be placed anywhere in the table destroying your clustered order.