Benchmarking multi-column, covering and hash indexes in Postgres

Today, we talk about benchmarking multi-column vs. multiple combined indexes in Postgres, and whether B-Tree or hash indexes are a better fit. We also look into cases and workloads where different indexes are better than other ones.



Share this episode: Click here to share this episode on twitter, or sign up for our newsletter and subscribe to our YouTube channel.


Transcript

Let's start benchmarking!

Benchmarking indexes in Postgres

In this blog post by Kaarel Moppel he's referencing another post that came up a couple of weeks ago.

The original post by Simon Eskildsen talked about the difference between index merges versus composite indexes, and looked at both Postgres and MySQL. The conclusion from Simon's article was that composite indexes perform much better than having multiple separate indexes.

Composite indexes perform much better than having multiple separate indexes.

In his article, Kaarel takes a very close look at the Postgres aspect of this. He runs some additional benchmarks and shares his benchmark scripts.

Postgres benchmarking with UNLOGGED tables

The benchmark that Kaarel ran here was with an UNLOGGED table. This makes sure that the WAL logging doesn't impact our benchmark.

In his example, he uses a primary key, which is a bigint, and two text columns that are not being indexed. Then we have three "cardinality columns", these are the columns that we're querying by, and specifically, we are querying by the int1000 and int100 column. Respectively, they have either 1000 unique values or 100 unique values.

The test data is using a generate_series function and generates about 10 million rows, which results in a 13 GB table. This is roughly the same size as in Simon's original post.

The SQL script that Kaarel pastes into pgbench has two variables defined. For both of these variables he gets a random value. You can use the random function in the pgbench script to alternate which portions of the table you're looking at.

Kaarel then goes on and does a SELECT COUNT(*) from that table and looks for these two values in the table. This is the same test that Simon ran, but I think it's nice to be able to simply run a script and test the different results.

Benchmarking Postgres index options

He does two different tests:

  1. he tests it with an index that's on both columns
  2. he tests with two separate indexes, one on each column.

He runs these tests for 30 minutes each.

Additionally, he notes that pg_stat_statements could be used for reading timing information in more detail. You can use the stats reset function between runs to get the totals for all the runs.

Benchmarking a Postgres index with INCLUDE

In addition to the two types of indexes that were looked at in Simon’s initial article, here Kaarel is also looking at an index that uses the INCLUDE syntax. The INCLUDE syntax in Postgres says this column, in this case the int100 column, should not be used for filtering the index. It's not used to reduce the search space, but it is present in the index in the leaf pages. Once the index has found which portion of the index is the result, then it has extra data available that can be used to either return the result or do some additional filtering, like is used here.

The execution plan, as an example, still does an Index Only Scan, but instead of the column that's in the INCLUDE being in the index condition, that column is now in the filter.

Benchmarking a Postgres hash index

The other thing Kaarel tests is a hash index. Hash indexes are single column indexes. You have to create them in both columns separately. Here, this always results in a Bitmap Index Scan on the index, then does a BitmapAnd, and then does a Bitmap Heap Scan on the table. It does have to do more work, because it actually has to look at the table itself.

Hash indexes are single column indexes. You have to create them in both columns separately.

Now let's look at the results here. They are painting a very clear picture. The composite index for querying 10 million rows takes 0.06 milliseconds. That's almost a hundred times faster than using two separate B-Tree indexes. It's also faster than using a covering index that includes both the columns and is a hundred times faster than a hash index as well.

Improving Postgres performance with composite indexes

Clearly, composite indexes, when you can design them right and they match your workload, are the best choice. The structure of a B-Tree means that if you have a composite index, it's able to filter much faster through the relevant portions of the index and it can ignore everything else, thus causing less I/O to be loaded, less pages to be looked at, and it just performs much better.

Clearly, composite indexes, when you can design them right and they match your workload, are the best choice.

Kaarel points out two interesting things. First of all, why the hash indexes didn't work better. If you look at the size of the indexes, you can see that the B-Tree indexes, two separate indexes, are almost eight times smaller than the hash index. This is because hash indexes don't have deduplication. Kaarel’s tests are run with Postgres 15, and as of Postgres 13, so two versions prior, B-Tree indexes have deduplication, and the deduplication really improves how small indexes are (we have looked at this in more detail in our blog post "Lessons Learned from Running Postgres 13: Better Performance, Monitoring & More" - check out the section “Smaller Indexes with B-Tree Deduplication”).

A smaller index will load faster and process faster. Clearly, hash indexes are at a big disadvantage here. If you had very long text columns, on the other hand, the B-Tree index would have to store much more of the text column in the index itself, versus the hash index would be able to do that more efficiently. For different workloads, hash indexes could perform better.

Kaarel also notes that he disregarded data modifications. Oftentimes in a benchmark something will be fast, but when a query is slow in practice, if the index is bloated, for example if you had a lot of page splits on an index, or VACUUM wasn't doing its job correctly, or if your visibility map isn't updated so Index Only Scans can't be used, then the numbers might look very different.

In short: you should watch the query metrics on your production database as well, not just in benchmarks.

Thanks for joining us for today’s episode of 5mins of Postgres. Subscribe to our YouTube channel, sign up for our newsletter and follow us on LinkedIn and Twitter to get updates about new episodes!

What we have discussed in this episode of 5mins of Postgres

Learn more about Indexing in Postgres


Enjoy blog posts like this?

Get them once a month to your inbox