Similarity in Postgres and Rails using Trigrams

Trigram Example


You typed "postgras", did you mean "postgres"?

Use the best tool for the job. It seems like solid advice, but there's something to say about keeping things simple. There is a training and maintenance cost that comes with supporting an ever growing number of tools. It may be better advice to use an existing tool that works well, although not perfect, until it hurts. It all depends on your specific case.

Postgres is an amazing relational database, and it supports more features than you might initially think! It has full text search, JSON documents, and support for similarity matching through its pg_trgm module.

Today, we will break down how to use pg_trgm for a light-weight, built-in similarity matcher. Why are we doing this? Well, before reaching for a tool purpose-built for search such as Elasticsearch, potentially complicating development by adding another tool to your development stack, it's worth seeing if Postgres suits your application's needs! You may be surprised!

In this article, we will look at how it works under the covers, and how to use it efficiently in your Rails app.

What are Trigrams?

Trigrams, a subset of n-grams, break text down into groups of three consecutive letters. Let's see an example: postgres. It is made up of six groups: pos, ost, stg, tgr, gre, res.

This process of breaking a piece of text into smaller groups allows you to compare the groups of one word to the groups of another word. Knowing how many groups are shared between the two words allows you to make a comparison between them based on how similar their groups are.

Postgres Trigram example

Postgres' pg_trgm module comes with a number of functions and operators to compare strings. We'll look at the show_trgm and similarity functions, along with the % operator below:

select
  show_trgm('postgras') as tri1, -- {"  p"," po","as ",gra,ost,pos,ras,stg,tgr}
  show_trgm('postgres') as tri2, -- {"  p"," po","es ",gre,ost,pos,res,stg,tgr}
  similarity('postgras','postgres'), -- 0.5
  'postgras' % 'postgres' -- TRUE

The show_trgm function isn't one you'd necessarily use day-to-day, but it's good to see how Postgres breaks a string down into trigrams. You'll notice something interesting here, that two spaces are added to the beginning of the string, and a single space is added to the end.

This is done for a couple of reasons:

The first reason is that it allows trigram calculations on words with less than three characters, such as Hi.

Secondly, it ensures the first and last characters are not overly de-emphasized for comparisons. If we used only strict triplets, the first and last letters in longer words would each occur in only a single group: with padding they occur in three (for the first letter) and two (for the last). The last letter is less important for matching, which means that postgres and postgrez are more similar than postgres and postgras, even though they are both off by a single character.

The similarity function compares the trigrams from two strings and outputs a similarity number between 1 and 0. 1 means a perfect match, and 0 means no shared trigrams.

Lastly, we have the % operator, which gives you a boolean of whether two strings are similar. By default, Postgres uses the number 0.3 when making this decision, but you can always update this setting.

Download Free eBook: Efficient Search in Rails with Postgres

Ruby Trigram example

You don't need to know how to build a trigram in order to use them in Postgres, but it doesn't hurt to dive deeper and expand your knowledge. Let's take a look at how to implement something similar ourselves in Ruby.

The first method will take a string, and output an array of trigrams, adding two spaces to the front, and one to the back of the original string, just like Postgres does.

def trigram(word)
  return [] if word.strip == ""

  parts = []
  padded = "  #{word} ".downcase
  padded.chars.each_cons(3) { |w| parts << w.join }
  parts
end

p trigram("postgras")
# ["  p", " po", "pos", "ost", "stg", "tgr", "gra", "ras", "as "]
p trigram("postgres")
# ["  p", " po", "pos", "ost", "stg", "tgr", "gre", "res", "es "]

Next up, we'll compare the trigrams from our two words together, giving a ratio of how similar they are:

def similarity(word1, word2)
  tri1 = trigram(word1)
  tri2 = trigram(word2)

  return 0.0 if [tri1, tri2].any? { |arr| arr.size == 0 }

  # Find number of trigrams shared between them
  same_size = (tri1 & tri2).size
  # Find unique total trigrams in both arrays
  all_size = (tri1 | tri2).size

  same_size.to_f / all_size
end

p similarity("postgras", "postgres")
# 0.5

Now that we have our similarity calculator, we can implement a simple similar? method, which checks if the similarity is above the threshold of 0.3:

def similar?(word1, word2)
  similarity(word1, word2) >= 0.3
end

p similar?("postgras", "postgres")
# true

Using Trigrams in Rails

There aren't too many gotchas in order to use these similarity functions and operators within your Rails app, but there are a couple!

Below we have a migration to create a cities table. When indexing the name column, to ensure that querying with the similarity operator stays fast, we'll need to ensure that we use either a gin or gist index. We do this by indicating using: :gin. In addition to that, we have to pass the opclass option opclass: :gin_trgm_ops, so it knows which type of gin index to create.

Unless you have already enabled the pg_trgm extension, you will most likely receive an error, but this is easily fixed by adding enable_extension :pg_trgm to your migration.

class CreateCities < ActiveRecord::Migration[6.0]
  def change
    enable_extension :pg_trgm

    create_table :cities do |t|
      t.string :name, null: false
      t.timestamps
      t.index :name, opclass: :gin_trgm_ops, using: :gin
    end
  end
end

Now that we have the pg_trgm extension enabled, and have correctly indexed the table, we can use the similarity operator % inside of our where clauses, such as in the scope below:

class City < ApplicationRecord
  scope :name_similar, ->(name) { where("name % :name", name: name) }
end

City.name_similar("Torono").count
# SELECT COUNT(*) FROM "cities" WHERE (name % 'Torono')

Showing the closest matches for a term based on its similarity

We may not want to only limit by similarity using the % operator, but also order the results from most similar to least similar. Take the example query and its result below:

select name, similarity(name, 'Dease Lake')
from cities
where name % 'Dease Lake'
order by 2 desc

This query finds cities which have a name similar to Dease Lake, but you can see that we actually get seven results back, though we can clearly see that there was an exact match. Ideally then, we wouldn't just limit our query by similarity, but put it in the correct order as well.

Dease Lake  1
Deer Lake   0.5
Lake Louise 0.375
Lynn Lake   0.33333334
Red Lake    0.33333334
Cat Lake    0.33333334
Baker Lake  0.3125

We can do this by updating our scope to order by similarity. We have to be careful about this, because in order to use the similarity function, we need to pass in the user input of 'Dease Lake'. To avoid SQL injection attacks and to ensure safe string quoting, we'll use the quote_string method from ActiveRecord::Base.

class City < ApplicationRecord
  scope :name_similar, ->(name) {
    quoted_name = ActiveRecord::Base.connection.quote_string(name)
    where("name % :name", name: name).
      order(Arel.sql("similarity(name, '#{quoted_name}') DESC"))
  }
end

Now when we use the name_similar scope, the result will be ordered with the most similar city first, allowing us to find Dease Lake:

City.name_similar("Dease Lake").first.name
# => Dease Lake

And the SQL produced looks like:

SELECT "cities".*
FROM "cities"
WHERE (name % 'Dease Lake')
ORDER BY similarity(name, 'Dease Lake') DESC
LIMIT $1

Conclusion

In this article, we took a dive into the pg_trgm extension, seeing first what trigrams actually are, and then how we can practically use similarity functions and operators in our Rails apps. This allows us to improve keyword searching, by finding similar, rather than exact matches. We also managed to accomplish all of this without adding an additional backend service, or too much additional complexity to our application.

Share this article: If you liked this article we'd appreciate it if you'd tweet it to your peers.

Download Free eBook: Advanced Database Programming with Rails and Postgres

About the Author

Leigh Halliday is a guest author for the pganalyze blog. He is a developer based out of Canada who works at FlipGive as a full-stack developer. He writes about Ruby and React on his blog and publishes React tutorials on YouTube.


Enjoy blog posts like this?

Get them once a month to your inbox