5mins of Postgres E6: Optimizing Postgres Text Search with Trigrams and GiST indexes
Today, we're looking at optimizing Postgres text search with trigrams and take a closer look at GiST indexes.
Share this episode: Click here to post to twitter.
Let's have a look together!
Trigrams in general are a way of splitting up a string, like "hello" for example, into individual three-character pieces that are easier to process or easier to compare with. It's one of the ways you can do text search in Postgres. It's not the same as full text search in Postgres (for full text search see our blog posts about Full Text Search in Rails and Postgres and Full Text Search with Postgres Django). Trigrams are more generic and they can also be used for things like regular expressions, which you cannot use full text search for.
In his article, Alex Klibisz is using the Amazon review dataset and in the post you can see the general shape: the table size is about a gigabyte of data.
In the searches, he's going to look for an exact spelling of a specific author, Michael Lewis, and then he's also going to look at a misspelling to show the benefit of trigram searches.
He performs variations of this query. He has a certain input query, "Michael Lewis" and then he does a similarity search against the summary column and then he also sorts the result by the distance operator. If something is more closely matching they should be listed earlier.
The simplest example of a query here takes about 1 minute 30 seconds. Now the question is: how can we make that faster?
Typically, when you want to make things faster indexes are the way to go.
Now, we don't want to create a B-tree index. Because we're not doing a standard equality comparison. In Postgres, there's two index types that can help you support a trigram search. There's a GiST index and a GIN index. Alex chooses the GiST index type because we also want to sort by the distance, we don't just want to filter the data. We'll get back to how GiST indexes are structured, there's actually a great second post that we can look at.
What is important to note is that there is a parameter called
siglen that was introduced in Postgres 13, not 12, which is a typo in Alex’ article. But in Postgres 13, this new
siglen parameter was introduced and with that
siglen parameter you can define for trigram operator classes how precise the index is.
So a smaller
siglen is less precise, a larger
siglen is more precise. The trade-off here is index size versus lookup performance. Now, Alex is benchmarking both
siglen=64 and then
siglen =256. Four times as much. What's interesting here is that the first version takes 4.5 seconds, the higher siglen version takes less than two seconds. That's really good improvement here! Certainly something worth trying benchmarking for your workload. The correct value might be different.
What if I now want to index all the individual other columns? There were three other columns that we were interested in. What happens if I want to query all of them? Alex tries different approaches. Ultimately he ends up with a fast result. He goes from 90 seconds in the beginning, from the sequential scan to a hundred milliseconds if he uses that technique with the expression index that concatenates the text columns. It's a very good evidence-based article that gives you a good introduction of how to optimize trigram searches.
When I was looking at this article I was interested in signature length and I tried to understand it more because I hadn't really looked at it in the past. And so I went to this other article that was written back in 2020 that describes more broadly what GiST indexes do and how they work.
The relevant parts to our conversation here are in the early section where it talks about the structure of a GiST index. A GiST index, like other index types, is split into
- leaf pages or leaf nodes
- internal pages,
- and a root page.
The leaf notes contain your actual data, in this case our trigram values. The internal pages, depending on the GiST operator class, look very different. In the
pg_trgm operator class, what the internal pages contain is essentially a bitmap. If you set the signature length, you define the size of that bitmap. Later in this article they describe this visually. They give the example of Postgres full text search, not trigram based search, but the same principle applies. Instead of the index storing something like this, which is for example how a B-tree index would work, they do store a bitmap, a hash in a sense, of the input data. Those bitmaps are then matched against the query.
For example, in the article you can see the different words, in their case using the full text search logic, and they get mapped to the bitmap value. In that bitmap it's always a single bit that's flipped positive.
If you set the signature length, what you're defining is the size of this bitmap.
Longer signature length results in a longer bitmap, meaning it's more precise. Now, when a search is done the index tree looks like this. It traverses this tree to find the leaf pages. In the article you can see only one third of the tree is traversed. Instead of having to look at the full set of leaf pages, it only has to look at one third of the leaf pages to determine which ones actually match.
If you’re interested in learning more about this topic you can read my eBook about Indexing in Postgres.
Thank you so much for listening. This was 5mins of Postgres episode 6 and talk to you next week!