Introducing Query Tuning Workbooks to safely tune Postgres queries on production with pganalyze!

Advanced Active Record: Using Subqueries in Rails

Active Record provides a great balance between the ability to perform simple queries simply, and also the ability to access the raw SQL sometimes required to get our jobs done. In this article, we will see a number of real-life examples of business needs that may arise at our jobs.

They will come in the form of a request for data from someone else at the company, where we will first translate the request into SQL, and then into the Rails code necessary to find those records. We will be covering five different types of subqueries to help us find the requested data.

Let's take a look at why subqueries matter:

Canvas 1 Layer 1 Postgres Rails Advanced Active Record with Subqueries: Simple usage of Active Record: SELECT AVG(salary) FROM employees 99306.4 SELECT * FROM employees WHERE salary > 99306.4 Result SELECT * FROM employees WHERE salary > (SELECT AVG(salary) FROM employees) Result

In the first case, without subqueries, we are going to the database twice: First to get the average salary, and then again to get the result set. With a subquery, we can avoid the extra roundtrip, getting the result directly with a single query.

Working with Active Record in Rails

Active Record is a little like a walled garden. It protects us as developers (and our users) from the harsh realities of what lies beyond those walls: Differences in SQL between databases (MySQL, Postgres, SQLite), knowing how to properly escape strings to avoid SQL injection attacks, and generally providing an elegant abstraction to interact with our database using the language of our choice, Ruby.

But, SQL is extremely powerful! By understanding the SQL that Active Record is executing, we can open the gate in our walled garden to reach beyond what you may think is possible to accomplish in Rails, taking advantage of optimizations and flexibility that may be difficult to achieve otherwise.

What are Subqueries in Rails

In this article, we will be learning how to use subqueries in Active Record. Subqueries are what their name implies: A query within a query. We will look at how to embed subqueries into the SELECT, FROM, WHERE, and HAVING clauses of SQL, to meet the demands of our business counterparts who are asking to view data in different and interesting ways.

We'll be playing the role of a developer fielding questions from HR. They are asking for reports about our employees at BCE (Best Company Ever), and we'll do our best to find the data they need using Active Record.

The source code for this article is available on GitHub.

An Overview of our Data

Our database has 4 tables:

  • roles: The job roles of our employees (Finance, Engineering, Sales, HR, etc...)
  • employees: The people that work for BCE
  • performance_reviews: Performance reviews carried out by an employee's manager, giving them a score between 0 and 100
  • vacations: Keeping track of when employees have taken vacation

Using https://dbdiagram.io/ we're able to see how these tables relate to each other:

An Overview of how 4 tables in our database relate to each other

If you are following along, the rails db:seed command will generate 1,000 employees, 1,000 vacations, and 10,000 performance reviews.

The Where Subquery

Now that we have our data set and we’re ready to go let’s help our HR team with their first request:

Leigh, could you find us all the employees that make more than the average salary at BCE?

Here we will use a subquery within the WHERE clause to find the employees that match HR's request:

SELECT *
FROM employees
WHERE
  employees.salary > (
    SELECT avg(salary)
    FROM employees)

My first attempt at replicating the query above looked like this:

Employee.where('salary > :avg', avg: Employee.average(:salary))

But what it produced was two queries: One to find the average, and a second to query employees with a salary greater than that number. Not technically wrong, but it doesn't line up with the SQL we were going for. There is also a potential performance impact of two round-trip requests to the database server, along with potential inconsistencies if a new employee making $1B/year is hired between queries one and two. Although this is unlikely in this particular scenario, it’s something to consider as a potential risk.

-- find the average
SELECT AVG("employees"."salary") FROM "employees"
-- find the employees
SELECT "employees".* FROM "employees" WHERE (salary > 99306.4)

What we shouldn’t forget about Active Record is that certain methods, such as average(:salary), actually execute the query and return a result, while other methods implement Method Chaining, allowing you to chain multiple Active Record methods together, building up more complex SQL statements prior to their execution.

Employee.where('salary > (:avg)', avg: Employee.select('avg(salary)'))

This produces the SQL we want, but note that we had to wrap the placeholder condition :avg in brackets, because the database wants subqueries wrapped in brackets as well.

Because the seed data is generated randomly, your results will vary from mine, but I am seeing 487 matching employees, getting a result that looks like this:

#<ActiveRecord::Relation [#<Employee id: 4, role_id: 5, name: "Bob Williams", salary: 127053.0, created_at: "2020-04-26 18:42:53", updated_at: "2020-04-26 18:42:53">, #<Employee id: 5, role_id: 4, name: "Bob Florez", salary: 149218.0, created_at: "2020-04-26 18:42:53", updated_at: "2020-04-26 18:42:53">, ...]>

Download Free eBook: Advanced Database Programming with Rails and Postgres

Where Not Exists

Leigh, we would like to encourage employees to have a healthy work-life balance, and were hoping you could provide us with a list of all the employees who have yet to take any vacation time.

For this case, NOT EXISTS is a perfect fit, since it only matches records that do not have a match in the subquery. An alternative is to perform a left outer join, only choosing the records with no matches on the right side. This is referred to as an anti-join, where the purpose of the join is to find records that do not have a matching record.

SELECT *
FROM employees
WHERE
  NOT EXISTS (
    SELECT 1
    FROM vacations
    WHERE vacations.employee_id = employees.id)

If you're interested in the LEFT OUTER JOIN equivalent, it might look like this:

SELECT employees.*
FROM
  employees
  LEFT OUTER JOIN vacations ON vacations.employee_id = employees.id
WHERE vacations.id IS NULL

The subquery depends on a match between the employees.id column and the vacations.employee_id column, making it a correlated subquery. Because Rails follows standard naming conventions when querying (the downcased plural form of our model), we can add the above condition into our subquery without too much difficulty.

Employee.where(
  'NOT EXISTS (:vacations)',
  vacations: Vacation.select('1').where('employees.id = vacations.employee_id')
)

Using my seed data, I am seeing 369 employees that have yet to take any vacations.

#<ActiveRecord::Relation [#<Employee id: 2, role_id: 2, name: "Alice Florez", salary: 86920.0, created_at: "2020-04-26 18:42:53", updated_at: "2020-04-26 18:42:53">, #<Employee id: 5, role_id: 4, name: "Bob Florez", salary: 149218.0, created_at: "2020-04-26 18:42:53", updated_at: "2020-04-26 18:42:53">, ...]>

The Select Subquery

Leigh, could you provide us with a list of employees, including the average salary of a BCE employee, and how much this employee's salary differs from the average?

SELECT
  *,
  (SELECT avg(salary)
    FROM employees) avg_salary,
  salary - (
    SELECT avg(salary)
    FROM employees) above_avg
FROM employees

Because the subquery is repeated, we can save ourselves a little bit of hassle by placing the subquery SQL into a variable that we'll embed into the outer query. The to_sql method is perfect for this, but it's also fantastic to peak into the SQL that Rails is producing without actually executing the query.

avg_sql = Employee.select('avg(salary)').to_sql

Employee.select(
  '*',
  "(#{avg_sql}) avg_salary",
  "salary - (#{avg_sql}) avg_difference"
)

This query does not limit the results in any way, but instead selects two additional columns (avg_salary and avg_difference). Looking at the first three results, I am seeing:

[
  {"id"=>1, "role_id"=>1, "name"=>"Joe Serna", "salary"=>86340.0, "avg_salary"=>99306.4, "avg_difference"=>-12966.399999999994}, 
  {"id"=>2, "role_id"=>2, "name"=>"Alice Florez", "salary"=>86920.0, "avg_salary"=>99306.4, "avg_difference"=>-12386.399999999994}, 
  {"id"=>3, "role_id"=>3, "name"=>"Amanda Florez", "salary"=>93600.0, "avg_salary"=>99306.4, "avg_difference"=>-5706.399999999994}
]

As with any SQL query, there are often many ways to arrive at the same result. In this example we used subqueries to find the average employee salary, but it may have been better to use window functions instead. They give us the same result, but provide a simpler query which is actually more performant as well. Even on a small dataset of 1000 employees, this query takes approximately 12ms vs 18ms for the subquery equivalent.

SELECT
  *,
  avg(salary) OVER () AS avg_salary,
  salary - avg(salary) OVER () AS avg_salary
FROM
  employees

The window function approach is actually easier to write in Rails as well!

Employee.select(
  '*',
  "avg(salary) OVER () avg_salary",
  "salary - avg(salary) OVER () avg_difference"
)

The From Subquery

Leigh, we'd like to know the average performance review score given across all our managers.

After clarifying with HR, they are looking to take the average score each manager has given, and then take the average of those averages. In other words, the average average. When you are dealing with an aggregate of aggregates, it needs to be accomplished in two steps. This can be done using a subquery as the FROM clause, essentially giving us a temporary table to then select from, allowing us to find the average of those averages.

SELECT avg(avg_score) reviewer_avg
FROM (
  SELECT reviewer_id, avg(score) avg_score
  FROM performance_reviews
  GROUP BY reviewer_id) reviewer_avgs

To keep our Ruby code clean, we'll place the subquery into a variable which can then be embedded into the main query.

from_sql =
  PerformanceReview.select(:reviewer_id, 'avg(score) avg_score').group(
    :reviewer_id
  ).to_sql

PerformanceReview.select('avg(avg_score) reviewer_avg').from(
  "(#{from_sql}) as reviewer_avgs"
).take.reviewer_avg

The result of this query is 50.652. This makes sense given that the seed data used a random value between 1 and 100 (rand(1..100)).

Download Free eBook: Efficient Search in Rails with Postgres

The Having Subquery

Leigh, certain reviewers are consistently giving low performance review scores. Could you find us a list of all the managers whose average score is 25% below our company average? We need to find out what is happening.

We will start by joining the employees table to the performance_reviews table where the employee is the reviewer (a manager), and then take their average score. Then we will filter out these managers using a HAVING clause to only include those whose score increased by 25% is still lower than the company average.

SELECT
  employees.*,
  avg(score) avg_score,
  (SELECT avg(score)
    FROM performance_reviews) company_avg
FROM
  employees
  INNER JOIN performance_reviews
    ON performance_reviews.reviewer_id = employees.id
GROUP BY employees.id
HAVING
  avg(score) < 0.75 *
    (SELECT avg(score)
    FROM performance_reviews)

You'll notice that I actually included two subqueries in the above SQL. Because the SQL was saved to a variable (avg_sql), we were able to reuse this both within the SELECT portion of the query, and also within the HAVING clause.

avg_sql = PerformanceReview.select('avg(score)').to_sql

Employee.joins(:employee_reviews).select(
  'employees.*',
  'avg(score) avg_score',
  "(#{avg_sql}) company_avg"
).group('employees.id').having("avg(score) < 0.75 * (#{avg_sql})")

The result of this query gives me 103 employees, and the first three of them look like:

[
  {"id"=>173, "role_id"=>1, "name"=>"Bob Williams", "salary"=>109206.0, "avg_score"=>23.75, "company_avg"=>50.04}, 
  {"id"=>390, "role_id"=>5, "name"=>"Bob Serna", "salary"=>127559.0, "avg_score"=>26.0, "company_avg"=>50.04}, 
  {"id"=>802, "role_id"=>4, "name"=>"Alice Halliday", "salary"=>94956.0, "avg_score"=>35.88, "company_avg"=>50.04}
]

Conclusion

In this article we were able to see a number of (somewhat) real-life examples of real business needs translating first into SQL, and then into the Rails code necessary to find those records. A backend developer's career will consist in most likely hundreds of similar requests!

Active Record gives us the ability to perform simple queries simply, but also lets us access the raw SQL which is sometimes required to get our jobs done. Subqueries are a perfect example of that, and we saw how to create subqueries in Rails and Active Record in the SELECT, FROM, WHERE, and HAVING clauses of an SQL statement. As we have seen in the examples above, with the expressiveness of Active Record, one doesn’t have to resort to writing completely in SQL to use a subquery.

Share this article: If you liked this article we’d appreciate it if you’d tweet it to your peers.

You might also be interested in

Learn more about how to make the most of Postgres and Ruby on Rails:

About the author

Leigh Halliday is a guest author for the pganalyze blog. He is a developer based out of Canada who works at FlipGive as a full-stack developer. He writes about Ruby and React on his blog and publishes React tutorials on YouTube.


Enjoy blog posts like this?

Get them once a month to your inbox