Benchmarking multi-column, covering and hash indexes in Postgres
Today, we talk about benchmarking multi-column indexes versus combining multiple single-column indexes in Postgres, and whether B-Tree or hash indexes are a better fit. We also look at the kinds of workloads where each index type outperforms the others.
Let's start benchmarking!
In this blog post, Kaarel Moppel references another post that came out a couple of weeks earlier.
The original post by Simon Eskildsen talked about the difference between index merges versus composite indexes, and looked at both Postgres and MySQL. The conclusion from Simon's article was that composite indexes perform much better than having multiple separate indexes.
Composite indexes perform much better than having multiple separate indexes.
In his article, Kaarel takes a very close look at the Postgres aspect of this. He runs some additional benchmarks and shares his benchmark scripts.
In his example, he uses a primary key, which is a `bigint`, and two text columns that are not being indexed. Then we have three "cardinality columns": these are the columns that we're querying by, and specifically, we are querying by the `int1000` and `int100` columns. Respectively, they have 1000 and 100 unique values.
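Based on that description, the table could look something like this (the table and column names here are assumptions for illustration, not copied from Kaarel's script):

```sql
-- Hypothetical sketch of the benchmark table
CREATE TABLE test_table (
  id      bigint PRIMARY KEY,
  text1   text,  -- not indexed
  text2   text,  -- not indexed
  int1000 int,   -- ~1000 unique values
  int100  int    -- ~100 unique values
);
```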
The test data is created using the `generate_series` function and amounts to about 10 million rows, which results in a 13 GB table. This is roughly the same size as in Simon's original post.
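A data fill along these lines could be sketched as follows (the filler expressions and value ranges are assumptions matching the described cardinalities):

```sql
-- Hypothetical data fill: ~10 million rows with filler text
-- and low-cardinality integer columns
INSERT INTO test_table (id, text1, text2, int1000, int100)
SELECT
  i,                       -- primary key
  md5(i::text),            -- filler text column
  md5((i * 2)::text),      -- second filler text column
  (random() * 999)::int,   -- ~1000 distinct values
  (random() * 99)::int     -- ~100 distinct values
FROM generate_series(1, 10000000) AS i;
```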
The SQL script that Kaarel passes to pgbench has two variables defined. For both of these variables he gets a random value. You can use the `random` function in the pgbench script to alternate which portions of the table you're looking at.
Kaarel then goes on and does a `SELECT COUNT(*)` from that table and looks for these two values in the table. This is the same test that Simon ran, but I think it's nice to be able to simply run a script and test the different results.
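Such a pgbench script could look roughly like this (the table and variable names are assumptions for illustration):

```sql
-- Hypothetical pgbench script: pick a random value for each
-- cardinality column, then count the matching rows
\set int1000 random(0, 999)
\set int100 random(0, 99)
SELECT count(*) FROM test_table WHERE int1000 = :int1000 AND int100 = :int100;
```

You would then run it with something like `pgbench -f script.sql -T 1800` for a 30-minute run.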
He does two different tests:
- he tests with a single composite index on both columns
- he tests with two separate indexes, one on each column
He runs these tests for 30 minutes each.
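The two index setups being compared could be created like this (index and table names are placeholders):

```sql
-- Variant 1: a single composite index covering both columns
CREATE INDEX ON test_table (int1000, int100);

-- Variant 2: two separate single-column indexes
CREATE INDEX ON test_table (int1000);
CREATE INDEX ON test_table (int100);
```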
Additionally, he notes that `pg_stat_statements` could be used for reading timing information in more detail. You can call `pg_stat_statements_reset()` between runs to get the totals for each individual run.
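In practice that workflow might look like this (assuming the `pg_stat_statements` extension is installed):

```sql
-- Reset the statistics before each run so its totals are isolated
SELECT pg_stat_statements_reset();

-- After the run, inspect timing totals for the top queries
SELECT query, calls, total_exec_time, mean_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 5;
```

Note that the `total_exec_time` / `mean_exec_time` column names apply to Postgres 13 and newer.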
In addition to the two types of indexes that were looked at in Simon's initial article, Kaarel is also looking at an index that uses the `INCLUDE` syntax. The `INCLUDE` syntax in Postgres says that a column, in this case the `int100` column, should not be used for searching the index. It's not used to reduce the search space, but it is present in the leaf pages of the index. Once the index scan has found the matching portion of the index, that extra data is available and can be used to either return the result directly or do some additional filtering, as is done here.
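A covering index of this shape could be created like this (the names are placeholders matching the earlier sketch):

```sql
-- int1000 is used for searching the B-Tree;
-- int100 is only stored in the leaf pages, not used for the search
CREATE INDEX ON test_table (int1000) INCLUDE (int100);
```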
The execution plan, as an example, still does an Index Only Scan, but instead of the `INCLUDE`d column appearing in the index condition, that column now shows up in the filter.
The other thing Kaarel tests is a hash index. Hash indexes are single-column indexes, so you have to create them on both columns separately. Here, this always results in a Bitmap Index Scan on each index, followed by a BitmapAnd, and then a Bitmap Heap Scan on the table. It does have to do more work, because it actually has to look at the table itself.
Hash indexes are single-column indexes. You have to create them on both columns separately.
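The hash index setup could be created like this (again with placeholder names):

```sql
-- Hash indexes only support single-column equality lookups,
-- so one index per column is needed
CREATE INDEX ON test_table USING hash (int1000);
CREATE INDEX ON test_table USING hash (int100);
```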
Now let's look at the results, which paint a very clear picture. The composite index query on the 10 million row table takes 0.06 milliseconds. That's almost a hundred times faster than using two separate B-Tree indexes. It's also faster than using a covering index that includes both columns, and about a hundred times faster than a hash index as well.
Clearly, composite indexes, when you can design them right and they match your workload, are the best choice. The structure of a B-Tree means that with a composite index, Postgres can navigate directly to the relevant portion of the index and ignore everything else, causing less I/O, fewer pages to be looked at, and much better performance overall.
Clearly, composite indexes, when you can design them right and they match your workload, are the best choice.
Kaarel points out two interesting things. First of all, why the hash indexes didn't work better: if you look at the index sizes, the two separate B-Tree indexes are almost eight times smaller than the hash indexes. This is because hash indexes don't have deduplication. Kaarel's tests were run with Postgres 15, and as of Postgres 13, two versions prior, B-Tree indexes have deduplication, which significantly reduces index size (we looked at this in more detail in our blog post "Lessons Learned from Running Postgres 13: Better Performance, Monitoring & More" - check out the section "Smaller Indexes with B-Tree Deduplication").
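If you want to compare index sizes on your own system, one way is to query the statistics views (the table name here is a placeholder):

```sql
-- Compare on-disk sizes of all indexes on the benchmark table
SELECT indexrelname, pg_size_pretty(pg_relation_size(indexrelid))
FROM pg_stat_user_indexes
WHERE relname = 'test_table';
```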
A smaller index loads faster and is processed faster, so hash indexes are at a big disadvantage here. If you had very long text columns, on the other hand, a B-Tree index would have to store much of the text in the index itself, whereas a hash index only stores a fixed-size hash of each value. For workloads like that, hash indexes could perform better.
Kaarel also notes that he disregarded data modifications. Oftentimes something will be fast in a benchmark but slow in practice: if the index is bloated, for example after a lot of page splits, if VACUUM isn't doing its job correctly, or if the visibility map isn't updated so Index Only Scans can't be used, then the numbers might look very different.
In short: you should watch the query metrics on your production database as well, not just in benchmarks.