5mins of Postgres E23: Fuzzy text search & case-insensitive ICU collations in Postgres
Today, we're going to talk about fuzzy text search in Postgres with LIKE/ILIKE, trigrams and Levenshtein distance, as well as case-insensitive pattern matching.
Let's have a look.
First of all, in this blog post by Brendan Scullion, we're going to talk about how you can go from not-so-fuzzy to the fuzziest search. The reason you might care about fuzzy search is that people misspell words, use American vs. British English spellings, or write in other languages. Or you may see creative spelling, like "heyyy". If you have user-generated content, you probably care about fuzzy search.
The easiest pattern matching in Postgres is "LIKE". I had actually forgotten this, but "LIKE" supports two different ways of matching! What I usually use is the "%" character, which matches zero or more characters in a string. You can also match a single character of any value with "_".
LIKE is cool, I use it a lot when I'm just looking at data.
LIKE has a companion "ILIKE", which is case insensitive search. We'll come back to different ways of doing that later as well.
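As a quick sketch of these operators, here's what they look like against a hypothetical "users" table (the table and data are just for illustration):

```sql
-- "%" matches zero or more characters, "_" matches exactly one character
SELECT name FROM users WHERE name LIKE 'Ann%';   -- matches Ann, Anna, Annette
SELECT name FROM users WHERE name LIKE '_nn';    -- matches Ann, but not Anna

-- ILIKE is the case-insensitive variant
SELECT name FROM users WHERE name ILIKE 'ann%';  -- matches ann, Ann, ANNE
```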
More sophisticated are regular expression searches. Postgres actually has a pretty good regular expression engine, with a good amount of engineering behind it. You can use regular expressions in Postgres either with the SIMILAR TO keyword, or with the "~" operator and its siblings.
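To illustrate both forms, again with a hypothetical "users" table:

```sql
-- SIMILAR TO uses SQL regex syntax (a blend of LIKE and POSIX regexes),
-- and implicitly anchors the pattern to the whole string
SELECT name FROM users WHERE name SIMILAR TO '(Ann|Anne)%';

-- "~" matches a POSIX regex; "~*" is its case-insensitive sibling
SELECT name FROM users WHERE name ~ '^Ann(e|a)?$';
SELECT name FROM users WHERE name ~* '^ann';
```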
If you want to improve performance for LIKE, you can use the varchar_pattern_ops operator class (or text_pattern_ops for text columns). This is what you specify when you run a "CREATE INDEX" command. You can use these when the LIKE pattern is left-anchored, meaning the pattern starts with literal text and the wildcard only appears on the right side of the search string. In that case, these special operator classes allow the search to use the index.
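Here is a minimal sketch, assuming a hypothetical "users" table with a varchar "name" column (use text_pattern_ops instead if the column is of type text):

```sql
-- Create a B-tree index with the pattern ops operator class
CREATE INDEX users_name_pattern_idx ON users (name varchar_pattern_ops);

-- Left-anchored: literal prefix, wildcard only on the right -- can use the index
SELECT * FROM users WHERE name LIKE 'Ann%';

-- Not left-anchored: leading wildcard -- cannot use this index
SELECT * FROM users WHERE name LIKE '%nne';
```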
The more sophisticated way of doing text search in Postgres is full text search. It breaks down the data into individual lexemes, and then when you're searching, it also breaks down the search query into individual portions that then match up against these stored lexemes.
To improve Postgres full text search performance, first of all, you want to store that data: you can use the generated columns feature (new in Postgres 12) to automatically generate a tsvector column from your dataset. Then you index that column, the tsvector, with a GIN index.
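This combination of a generated tsvector column plus a GIN index might look like the following, assuming a hypothetical "posts" table with "title" and "body" columns:

```sql
-- Generated tsvector column (Postgres 12+), kept up to date automatically
ALTER TABLE posts ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english', coalesce(title, '') || ' ' || coalesce(body, ''))
  ) STORED;

-- Inverted index over the stored lexemes
CREATE INDEX posts_search_idx ON posts USING GIN (search_vector);

-- to_tsquery breaks the search input into lexemes as well,
-- and "@@" matches the query against the stored tsvector
SELECT title FROM posts
WHERE search_vector @@ to_tsquery('english', 'fuzzy & search');
```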
GIN indexes have some performance trade-offs. Because GIN indexes are inverted indexes. That means that it doesn't store one value per row, it actually stores multiple (or zero) values for each row. That of course means that maintenance, as well as size, are a bit more unpredictable. If you want to learn more about GIN indexes, read our blog post about it here. If you are interested in full text search in Rails or full text search in Django, we have two helpful articles for you on our blog.
Another approach to search in Postgres is using the trigram extension, pg_trgm. The way this works is you break the text into sets of three characters. For "Hello world", "hel" for example is one trigram, "ell" is the next one, and so on. That way you can then match on things that are similar, based on how many trigrams they share. You can improve the performance of trigram search by creating a GiST or a GIN index; for similarity search, a GiST index is usually best. We have a dedicated blog post about searching with trigrams on the pganalyze blog. You might also enjoy episode 6 of 5mins of Postgres, where we talked about optimizing Postgres Text Search with Trigrams and GiST indexes.
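A short sketch of how this looks in practice, again with a hypothetical "users" table (note that pg_trgm lowercases the text and pads it with spaces before extracting trigrams):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Inspect the trigrams pg_trgm generates for a string
SELECT show_trgm('Hello');

-- GiST index using the trigram operator class
CREATE INDEX users_name_trgm_idx ON users USING GIST (name gist_trgm_ops);

-- "%" is the similarity operator; similarity() returns a score between 0 and 1
SELECT name, similarity(name, 'Jon') FROM users WHERE name % 'Jon';
```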
If you want to get even fuzzier, you can use Levenshtein distance to match against things. Levenshtein distance is essentially how many single-character edits you have to make in order to turn one string into another string. You can see how this works pretty well for name searches. There isn't really a way to index that effectively, but what the author suggests is that there are other functions that do index well, so you can combine them.
The soundex function matches things that sound similar, like "Anne" with an "e" at the end, or without one. It's very focused on the English language, so that is a limitation. You can create an expression index on that soundex function, and then combine it with the levenshtein function to get a name search, for example.
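Combining the two might look like this: soundex narrows the candidates via an indexable equality check, and levenshtein then filters the small remainder. Both functions live in the fuzzystrmatch extension; the "users" table is hypothetical:

```sql
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

SELECT soundex('Anne'), soundex('Ann');  -- both produce A500
SELECT levenshtein('Anne', 'Ann');       -- 1 edit apart

-- Expression index makes the soundex equality check fast...
CREATE INDEX users_name_soundex_idx ON users (soundex(name));

-- ...then levenshtein refines the pre-filtered rows
SELECT name FROM users
WHERE soundex(name) = soundex('Ann')
  AND levenshtein(name, 'Ann') <= 2;
```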
Another article, posted recently by Laurenz from the Cybertec team, is on the related topic of case-insensitive search.
There are different ways of doing this. The one that most folks are probably familiar with is using the lower() function. You can also create a B-tree index on the lower() expression so those lookups stay fast.
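Sketched out, with a hypothetical "users" table and "email" column:

```sql
-- Case-insensitive comparison by lowercasing both sides
SELECT * FROM users WHERE lower(email) = lower('Ann@Example.com');

-- Expression index so the lower() comparison can use a B-tree index
CREATE INDEX users_email_lower_idx ON users (lower(email));
```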
An alternative way of doing this is with the "citext" extension. Instead of creating a column with the text data type, you create it with the "citext" data type. You can then match against this column without doing anything special; behind the scenes it makes a similar lower() function call.
One thing to know here: regular expression matching on citext is actually not case-insensitive by default, so you will still have to use the case-insensitive operator ("~*"). That might be a bit surprising if you don't watch out for it!
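A minimal sketch, with a hypothetical "accounts" table:

```sql
CREATE EXTENSION IF NOT EXISTS citext;

CREATE TABLE accounts (email citext PRIMARY KEY);

-- Equality is case-insensitive with no extra work
SELECT * FROM accounts WHERE email = 'Ann@Example.com';

-- But "~" stays case-sensitive on citext; use "~*" for case-insensitive regexes
SELECT * FROM accounts WHERE email ~* '^ann@';
```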
Last but not least, a pretty cool feature since Postgres 12 is case-insensitive ICU collations. ICU is the modern way of doing international sorting and matching of text, because that's very locale dependent: it depends on the country and the language, and there are all kinds of different rules for that. ICU represents these different rules, and lets you create your own collation based on a standard ICU locale. Laurenz creates a case-insensitive ICU collation and then simply references that collation when he defines his columns. In the query, you keep using the regular equality operator, nothing special there, and that's pretty sweet. "LIKE" is actually blocked here, and the same goes for regular expression matches. Laurenz describes this a little bit more in his post.
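A sketch of the approach, assuming a hypothetical "people" table (the ICU locale "und-u-ks-level2" means: undetermined language, comparison strength level 2, which ignores case differences):

```sql
-- Nondeterministic ICU collation that compares case-insensitively (Postgres 12+)
CREATE COLLATION case_insensitive (
  provider = icu,
  locale = 'und-u-ks-level2',
  deterministic = false
);

CREATE TABLE people (name text COLLATE case_insensitive);

-- Plain equality is now case-insensitive; note that LIKE and
-- regex operators are not supported on nondeterministic collations
SELECT * FROM people WHERE name = 'ann';
```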
At the very end, Laurenz has done a performance comparison. This was surprising to me; I had forgotten that lower() is actually pretty slow. If you are doing an explicit lower() in your queries, there can be more than a tenfold difference between lower() and citext. Case-insensitive ICU collations are actually a good trade-off here, because they're slightly more standard than citext. In general, it's good to be aware of these options, even if you end up using lower() because it's the simplest thing to do.
Thank you so much for listening. This was 5mins of Postgres. If you are interested in indexing, we just uploaded the recording of our indexing webinar from last week to our YouTube channel. If you subscribe to the channel or follow us on Twitter, you will also hear about next week's episode. Thank you so much and talk to you next week.
PS: If you’re interested: we have an eBook about Efficient Search in Rails with Postgres. You can download it here.