5mins of Postgres E38: When to use BRIN indexes in Postgres, tuning pages_per_range and determining correlation with pg_stats
In this article and video, we talk about tuning a Postgres BRIN index and how to decide whether to use BRIN based on pg_stats statistics and the correlation column. We also talk about the impact block ranges can have and a visual model on how to think about BRIN indexes.
Let's get into it!
We'll start with this blog post by Janet Carson, where she describes how she scored a big win by adding a properly thought out BRIN index to a particular table. As a side note, I really like the notion of BRIN indexes as something you need to think about because BRIN indexes are sometimes a really good fit, but sometimes a really bad fit – so it is important to understand when they work well.
BRIN indexes are sometimes a really good fit, but sometimes a really bad fit – so it is important to understand when they work well.
A BRIN index is a block range index. It's essentially a very lightweight index that works well for tables where the data is banded together in physical storage. Janet describes this in her article: if you have a table where you're
- adding data chronologically and
- you also have a date column
then you can think of the physical layout of the table to be banded by date. If you had a particular date for one row, the date on the next row is most likely going to be the same date or the next day's date. There is a correlation between the physical order and the values of a particular column.
The way that BRIN works is that it stores the minimum and maximum value of the indexed column for each block range. A block range would be something like “page 1 to page 10”. You can then imagine it would say for the date column in our example: the minimum value I've seen is this, and the maximum value I've seen is that.
When Postgres executes a query and it uses a BRIN index, what it's able to do is to use the BRIN index to identify which part of the table to look at. A BRIN index is always going to be fuzzy, it's not going to give you an exact row, like a B-tree index does, but it is very small, and that has a lot of benefits in terms of caching behavior and such.
When Postgres executes a query and it uses a BRIN index, what it's able to do is to use the BRIN index to identify which part of the table to look at.
An important decision you need to make when you create a BRIN index is to define how many pages you want to put in each of these block ranges.
The default would be 128. But if you have a very narrow table, that might be too much. You actually want BRIN to be more specific, to have less pages per block. If you have a very wide table where you have very few rows per page, then you might want to increase that block size. Janet describes a query that you can use to estimate how many rows you have in one page of your table. As she notes: if you don't do the math, you'll never know. I agree, this is something where you should calculate the right number for your table.
The other thing that Janet mentions is that multi-column BRIN indexes are actually really cheap, because BRIN is a very small index and it doesn't matter if you add additional columns. Index search effectiveness is the same, regardless of which indexed columns the query uses.
There is another blog post on BRIN indexes by Paul Ramsey from the Crunchy Data team and it covers a lot of the same things as Janet's post. Paul describes a visual way of how you could think of this. Check out his example in the article, page 1 and 2 are part of one BRIN block, and page 3 and 4 are part of the other BRIN block. On the value that's being indexed, in his case a year, you can see that the min and the max are saved for the block range. This is how Postgres would then use the BRIN index to locate a particular block and a particular page it's looking at.
Now, if I'm in a situation where I need to make a decision whether to use BRIN or not, what I can make use of is the pg_stats table. pg_stats is used by the Postgres planner when it makes the decision on which index to use or how to get the data in the first place.
pg_stats is updated when you run
ANALYZE or autovacuum runs
ANALYZE for you. Pg_stats provides a correlation column. This column gives you the statistical correlation between
- the physical row ordering,
- the actual structure on disk and
- the logical ordering of column values.
We can use this to get an understanding around if BRIN could be a good fit for our workload.
Check out the video above where I go to one of our own databases to show you what that correlation column looks like. I have a table called indexing engine runs. This is used inside pganalyze to track when we make index recommendations. This table is an append-only table and you'll note that we actually do not have a BRIN index, and I'll get to why in a second, but I want to describe to you what could be a use case for a BRIN index.
I can query the pg_stats table and I can look at that correlation column for this particular table. What I can see is the "run_at" column. You can think of this as a timestamp we assign each time we run our indexing engine. As this timestamp is the current time, the value of "run_at" highly correlates with the physical order. You can see in the video that the correlation for “run_at” is close to 1, which is an almost perfect correlation. But, for example, if I looked at "server_id" that does not correlate.
If I had a use case for querying solely by the "run_at" column, that would be a good fit for a BRIN index. However, let's say I want to query which server had a particular indexing engine run: that would be a bad fit for a BRIN index, because BRIN would query the whole table in search of a particular value.
To conclude: You can use the pg_stats table to make decisions on whether a column is a good fit for a BRIN index or not.