Full Text Search in Milliseconds with Rails and PostgreSQL

Leigh HallidayBy Leigh Halliday
April 16, 2020

Postgres Full Text Search Example

Imagine the following scenario: You have a database full of job titles and descriptions, and you’re trying to find the best match. Typically you’d start by using an ILIKE expression, but this requires the search phrase to be an exact match. Then you might use trigrams, allowing spelling mistakes and inexact matches based on word similarity, but this makes it difficult to search using multiple words. What you really want to use is Full Text Search, providing the benefits of ILIKE and trigrams, with the added ability to easily search through large documents using natural language.

To summarize, here is a quick overview of popular built-in Postgres search options:

Postgres FeatureTypical Use CaseCan be indexed?Performance
LIKE/ILIKEWildcard-style search for small dataSometimesUnpredictable
pg_trgmSimilarity search for names, etcYes (GIN/GIST)Good
Full Text SearchNatural language searchYes (GIN/GIST)Good

In this article, we are going to learn about the inner workings of Full Text Search in Postgres and how to easily integrate Full Text Search into your Rails application using a fantastic gem named pg_search. We will learn how to search multiple columns at once, to give one column precedence over another, and how to optimize our Full Text Search implementation, taking a single query from 130ms to 7ms.

The full source code used in this article can be found here. Instructions on how to run this application locally and how to load the sample data referenced within this article can be found in the README.

The Foundations of Full Text Search

Let's break down the basics of Full Text Search, defining and explaining some of the most common terms you'll run into. Taking the text “looking for the right words”, we can see how Postgres stores this data internally, using the to_tsvector function:

SELECT to_tsvector('english', 'looking for the right words');
-- 'look':1 'right':4 'word':5

In the above SQL we have some text; often referred to as a document when talking about Full Text Search. A document must be parsed and converted into a special data type called a tsvector, which we did using the function to_tsvector.

The tsvector data type is comprised of lexemes. Lexemes are normalized key words which were contained in the document that will be used when searching through it. In this case we used the english language dictionary to normalize the words, breaking them down to their root. This means that words became word, and looking became look, with very common words such as for and the being removed completely, to avoid false positives.

SELECT to_tsvector('english', 'looking for the right words') @@ to_tsquery('english', 'words');
-- TRUE

The @@ operator allows us to check if a query (data type tsquery) exists within a document (data type tsvector). Much like tsvector, tsquery is also normalized prior to searching the document for matches.

SELECT
  ts_rank(
    to_tsvector('looking for the right words'),
    to_tsquery('english', 'words')
   );
-- 0.06079271

The ts_rank function takes a tsvector and a tsquery, returning a number that can be used when sorting the matching records, allowing us to sort the results from highest to lowest ranking.

Now that you have seen a few examples, let’s have a look at one last one before getting to Rails. Following, you can see an example of a query which searches through the jobs table where we are storing the title and description of each job. Here we are searching for the words ruby and rails, grabbing the 3 highest ranking results.

SELECT
  id,
  title,
  ts_rank(
    to_tsvector('english', title) || to_tsvector('english', description),
    to_tsquery('english', 'ruby & rails')
  ) AS rank
FROM jobs
WHERE
  to_tsvector('english', title) || to_tsvector('english', description) @@
  to_tsquery('english', 'ruby & rails')
ORDER BY rank DESC
LIMIT 3

The highest ranking result is a job with the title "Ruby on Rails Developer"... perfect! The full results of this query are:

[
  {
    "id": 1,
    "title": "Ruby on Rails Developer",
    "rank": 0.40266925
  },
  {
    "id": 109,
    "title": "Senior Ruby Developer - Remote",
    "rank": 0.26552397
  },
  {
    "id": 151,
    "title": "Team-Lead Developer",
    "rank": 0.14533159
  }
]

This query is actually concatenating (using ||) two tsvector fields together. This allows us to search both the title and the description at the same time. Later, we'll see how to give additional weight (precedence) to the title column.

Implementing Postgres Full Text Search in Rails

With a basic understanding of Full Text Search under our belts, it's time to take our knowledge over to Rails. We will be using the pg_search Gem, which can be used in two ways:

  1. Multi Search: Search across multiple models and return a single array of results. Imagine having three models: Product, Brand, and Review. Using Multi Search we could search across all of them at the same time, seeing a single set of search results. This would be perfect for adding federated search functionality to your app.
  2. Search Scope: Search within a single model, but with greater flexibility.

We will be focusing on the Search Scope approach in this article, as it lets us dive into the configuration options available when working with Full Text Search in Rails. Let's add the Gem to our Gemfile and get started:

# Gemfile
gem 'pg_search', '~> 2.3', '>= 2.3.2'

With that done, we can include a module in our Job model, and define our first searchable field:

class Job < ApplicationRecord
  include PgSearch::Model
  pg_search_scope :search_title, against: :title
end

This adds a class level method to Job, allowing us to find jobs with the following line, which automatically returns them ranked from best match to worst.

Job.search_title('Ruby on Rails')

If we were to append to_sql to the above Ruby statement, we can see the SQL that is being generated. I have to warn you, it’s a bit messy, but that is because it handles not only searching, but also putting the results in the correct order using the ts_rank function.

SELECT
  "jobs".*
FROM
  "jobs"
  INNER JOIN (
    SELECT
      "jobs"."id" AS pg_search_id,
      (ts_rank((to_tsvector('simple', coalesce("jobs"."title"::text, ''))), (to_tsquery('simple', ''' ' || 'Ruby' || ' ''') && to_tsquery('simple', ''' ' || 'on' || ' ''') && to_tsquery('simple', ''' ' || 'Rails' || ' ''')), 0)) AS rank
    FROM
      "jobs"
    WHERE ((to_tsvector('simple', coalesce("jobs"."title"::text, ''))) @@ (to_tsquery('simple', ''' ' || 'Ruby' || ' ''') && to_tsquery('simple', ''' ' || 'on' || ' ''') && to_tsquery('simple', ''' ' || 'Rails' || ' ''')))) AS pg_search_5d9a17cb70b9733aadc073 ON "jobs"."id" = pg_search_5d9a17cb70b9733aadc073.pg_search_id
ORDER BY
  pg_search_5d9a17cb70b9733aadc073.rank DESC,
  "jobs"."id" ASC

Configuring pg_search

There are a number of ways you can configure pg_search: From support for prefixes and negation, to specifying which language dictionary to use when normalizing the document, as well as adding multiple, weighted columns.

By default pg_search uses the simple dictionary, which does zero normalization, but if we wanted to normalize our document using the english dictionary, searching across both the title and description, it would look like:

class Job < ApplicationRecord
  include PgSearch::Model
  pg_search_scope :search_job,
                  against: %i[title description],
                  using: { tsearch: { dictionary: 'english' } }
end

We can perform a search in the same way we did before: Job.search_job("Ruby on Rails"). If we wanted to give higher precedence to the title column, we can add weighting scores to each of the columns, with possible values of: A, B, C, D.

class Job < ApplicationRecord
  include PgSearch::Model
  pg_search_scope :search_job,
                  against: { title: 'A', description: 'B' },
                  using: { tsearch: { dictionary: 'english' } }
end

When you start combining columns, weighting them, and choosing which dictionary provides the best results, it really comes down to trial and error. Play around with it, try some queries and see if the results you get back match with what you are expecting!

Optimizing Full Text Search Queries in Rails

We have a problem! The query that is produced by Job.search_job("Ruby on Rails") takes an astounding 130ms. That may not seem like such a large number, but it is astounding because there are only 145 records in my database. Imagine if there were thousands! The majority of time is spent in the to_tsvector function. We can verify this by running this streamlined query below, which takes almost as much time to execute as the full query which actually finds the matching jobs:

SELECT to_tsvector('english', description) FROM jobs;
-- ~130ms

SELECT description FROM jobs;
-- ~15ms

This tells me that the slowness is in re-parsing and normalizing the document into a tsvector data type every single time the query is executed. The folks at thoughtbot have a great article about Full Text Search optimizations, where they add a pre-calculated tsvector column, keeping it up-to-date with triggers. This is great because it allows us to avoid re-parsing our document for every query and also lets us index this column!

There is a similar but slightly different approach I want to cover today which I learned by reading through the Postgres documentation. It also involves adding a pre-calculated tsvector column, but is done using a stored generated column. This means we don't need any triggers! It should be noted that this approach is only available in Postgres 12 and above. If you are using version 11 or earlier, the approach in the thoughtbot article is probably still the best one.

As we are venturing into the territory of more custom Postgres functionality, not easily supported by the Rails schema file in Ruby, we'll want to switch the schema format from :ruby to :sql. This line can be added to the application.rb file:

config.active_record.schema_format = :sql

Now, let's generate a migration to add a new column to the jobs table which will be automatically generated based on the setweight and to_tsvector functions:

class AddSearchableColumnToJobs < ActiveRecord::Migration[6.0]
  def up
    execute <<-SQL
      ALTER TABLE jobs
      ADD COLUMN searchable tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(description,'')), 'B')
      ) STORED;
    SQL
  end

  def down
    remove_column :jobs, :searchable
  end
end

Note that, as of the writing of this article, Postgres always requires a generated column to be a “Stored” column. That means it actually occupies space in your table and gets written on each INSERT/UPDATE. This also means that when you add a generated column to a table, it will require a rewrite of the table to actually set the values for all existing rows. This may block other operations on your database.

With our tsvector column added (which is giving precedence to the title over the description, is using the english dictionary, and is coalescing null values into empty strings), we're ready to add an index to it. Either GIN or GiST indexes can be used to speed up full text searches, but Postgres recommends GIN as the preferred index due to GiST searches being lossy, which may produce false matches. We'll add it concurrently to avoid locking issues when adding an index to large tables.

class AddIndexToSearchableJobs < ActiveRecord::Migration[6.0]
  disable_ddl_transaction!

  def change
    add_index :jobs, :searchable, using: :gin, algorithm: :concurrently
  end
end

The last thing we need to do is to tell pg_search to use our tsvector searchable column, rather than re-parsing the title and description fields each time. This is done by adding the tsvector_column option to tsearch:

class Job < ApplicationRecord
  include PgSearch::Model
  pg_search_scope :search_job,
                  against: { title: 'A', description: 'B' },
                  using: {
                    tsearch: {
                      dictionary: 'english', tsvector_column: 'searchable'
                    }
                  }
end

With this optimization done, we have gone from around 130ms to 7ms per query... not bad at all!

Conclusion

Let’s have a look at a real-life data set. We can prove the precision of our approach by looking at my database: Out of 145 jobs pulled from the GitHub and Hacker News job APIs, searching for "Ruby on Rails" returns the following results:

[
  "Ruby on Rails Developer",
  "Senior Ruby Developer - Remote",
  "Gobble (YC W14) – Senior Full Stack Software Engineers – Toronto, On",
  "DevOps (Remote - Europe)",
  "CareRev (YC S16) Is Hiring a Senior Back End Engineer in Los Angeles",
  "Software Engineer, Full Stack (Rails, React)",
  "Software Engineer",
  "Technology Solutions Developer"
]

To summarize:

We have shown how to use Postgres' Full Text Search within Rails and also how to customize it both in terms of functionality, but also in terms of performance. We ended up with a performant and flexible solution right inside the database we were already using.

Many use cases for Full Text Search can be implemented directly inside Postgres, avoiding the need to install and maintain additional services such as Elasticsearch.

If you find this article useful and want to share it with your peers you can tweet about it here.


Sign up for the pganalyze newsletter

Receive infrequent emails about interesting Postgres content around the web, new pganalyze feature releases, and new pganalyze ebooks. No spam, we promise.