Advanced Active Record: Using Subqueries in Rails
Active Record provides a great balance between the ability to perform simple queries simply, and also the ability to access the raw SQL sometimes required to get our jobs done. In this article, we will see a number of real-life examples of business needs that may arise at our jobs.
They will come in the form of a request for data from someone else at the company, where we will first translate the request into SQL, and then into the Rails code necessary to find those records. We will be covering five different types of subqueries to help us find the requested data.
Let's take a look at why subqueries matter:
In the first case, without subqueries, we are going to the database twice: First to get the average salary, and then again to get the result set. With a subquery, we can avoid the extra roundtrip, getting the result directly with a single query.
Working with Active Record in Rails
Active Record is a little like a walled garden. It protects us as developers (and our users) from the harsh realities of what lies beyond those walls: Differences in SQL between databases (MySQL, Postgres, SQLite), knowing how to properly escape strings to avoid SQL injection attacks, and generally providing an elegant abstraction to interact with our database using the language of our choice, Ruby.
But, SQL is extremely powerful! By understanding the SQL that Active Record is executing, we can open the gate in our walled garden to reach beyond what you may think is possible to accomplish in Rails, taking advantage of optimizations and flexibility that may be difficult to achieve otherwise.
What are Subqueries in Rails
In this article, we will be learning how to use subqueries in Active Record. Subqueries are what their name implies: A query within a query. We will look at how to embed subqueries into the SELECT
, FROM
, WHERE
, and HAVING
clauses of SQL, to meet the demands of our business counterparts who are asking to view data in different and interesting ways.
We'll be playing the role of a developer fielding questions from HR. They are asking for reports about our employees at BCE (Best Company Ever), and we'll do our best to find the data they need using Active Record.
The source code for this article is available on GitHub.
An Overview of our Data
Our database has 4 tables:
- roles: The job roles of our employees (Finance, Engineering, Sales, HR, etc...)
- employees: The people that work for BCE
- performance_reviews: Performance reviews carried out by an employee's manager, giving them a score between 0 and 100
- vacations: Keeping track of when employees have taken vacation
Using https://dbdiagram.io/ we're able to see how these tables relate to each other:
If you are following along, the rails db:seed
command will generate 1,000 employees, 1,000 vacations, and 10,000 performance reviews.
The Where Subquery
Now that we have our data set and we’re ready to go let’s help our HR team with their first request:
Leigh, could you find us all the employees that make more than the average salary at BCE?
Here we will use a subquery within the WHERE
clause to find the employees that match HR's request:
SELECT *
FROM employees
WHERE
employees.salary > (
SELECT avg(salary)
FROM employees)
My first attempt at replicating the query above looked like this:
Employee.where('salary > :avg', avg: Employee.average(:salary))
But what it produced was two queries: One to find the average, and a second to query employees with a salary greater than that number. Not technically wrong, but it doesn't line up with the SQL we were going for. There is also a potential performance impact of two round-trip requests to the database server, along with potential inconsistencies if a new employee making $1B/year is hired between queries one and two. Although this is unlikely in this particular scenario, it’s something to consider as a potential risk.
-- find the average
SELECT AVG("employees"."salary") FROM "employees"
-- find the employees
SELECT "employees".* FROM "employees" WHERE (salary > 99306.4)
What we shouldn’t forget about Active Record is that certain methods, such as average(:salary)
, actually execute the query and return a result, while other methods implement Method Chaining, allowing you to chain multiple Active Record methods together, building up more complex SQL statements prior to their execution.
Employee.where('salary > (:avg)', avg: Employee.select('avg(salary)'))
This produces the SQL we want, but note that we had to wrap the placeholder condition :avg
in brackets, because the database wants subqueries wrapped in brackets as well.
Because the seed data is generated randomly, your results will vary from mine, but I am seeing 487 matching employees, getting a result that looks like this:
#<ActiveRecord::Relation [#<Employee id: 4, role_id: 5, name: "Bob Williams", salary: 127053.0, created_at: "2020-04-26 18:42:53", updated_at: "2020-04-26 18:42:53">, #<Employee id: 5, role_id: 4, name: "Bob Florez", salary: 149218.0, created_at: "2020-04-26 18:42:53", updated_at: "2020-04-26 18:42:53">, ...]>
Where Not Exists
Leigh, we would like to encourage employees to have a healthy work-life balance, and were hoping you could provide us with a list of all the employees who have yet to take any vacation time.
For this case, NOT EXISTS
is a perfect fit, since it only matches records that do not have a match in the subquery. An alternative is to perform a left outer join, only choosing the records with no matches on the right side. This is referred to as an anti-join, where the purpose of the join is to find records that do not have a matching record.
SELECT *
FROM employees
WHERE
NOT EXISTS (
SELECT 1
FROM vacations
WHERE vacations.employee_id = employees.id)
If you're interested in the LEFT OUTER JOIN equivalent, it might look like this:
SELECT employees.*
FROM
employees
LEFT OUTER JOIN vacations ON vacations.employee_id = employees.id
WHERE vacations.id IS NULL
The subquery depends on a match between the employees.id
column and the vacations.employee_id
column, making it a correlated subquery. Because Rails follows standard naming conventions when querying (the downcased plural form of our model), we can add the above condition into our subquery without too much difficulty.
Employee.where(
'NOT EXISTS (:vacations)',
vacations: Vacation.select('1').where('employees.id = vacations.employee_id')
)
Using my seed data, I am seeing 369 employees that have yet to take any vacations.
#<ActiveRecord::Relation [#<Employee id: 2, role_id: 2, name: "Alice Florez", salary: 86920.0, created_at: "2020-04-26 18:42:53", updated_at: "2020-04-26 18:42:53">, #<Employee id: 5, role_id: 4, name: "Bob Florez", salary: 149218.0, created_at: "2020-04-26 18:42:53", updated_at: "2020-04-26 18:42:53">, ...]>
The Select Subquery
Leigh, could you provide us with a list of employees, including the average salary of a BCE employee, and how much this employee's salary differs from the average?
SELECT
*,
(SELECT avg(salary)
FROM employees) avg_salary,
salary - (
SELECT avg(salary)
FROM employees) above_avg
FROM employees
Because the subquery is repeated, we can save ourselves a little bit of hassle by placing the subquery SQL into a variable that we'll embed into the outer query. The to_sql
method is perfect for this, but it's also fantastic to peak into the SQL that Rails is producing without actually executing the query.
avg_sql = Employee.select('avg(salary)').to_sql
Employee.select(
'*',
"(#{avg_sql}) avg_salary",
"salary - (#{avg_sql}) avg_difference"
)
This query does not limit the results in any way, but instead selects two additional columns (avg_salary
and avg_difference
). Looking at the first three results, I am seeing:
[
{"id"=>1, "role_id"=>1, "name"=>"Joe Serna", "salary"=>86340.0, "avg_salary"=>99306.4, "avg_difference"=>-12966.399999999994},
{"id"=>2, "role_id"=>2, "name"=>"Alice Florez", "salary"=>86920.0, "avg_salary"=>99306.4, "avg_difference"=>-12386.399999999994},
{"id"=>3, "role_id"=>3, "name"=>"Amanda Florez", "salary"=>93600.0, "avg_salary"=>99306.4, "avg_difference"=>-5706.399999999994}
]
As with any SQL query, there are often many ways to arrive at the same result. In this example we used subqueries to find the average employee salary, but it may have been better to use window functions instead. They give us the same result, but provide a simpler query which is actually more performant as well. Even on a small dataset of 1000 employees, this query takes approximately 12ms vs 18ms for the subquery equivalent.
SELECT
*,
avg(salary) OVER () AS avg_salary,
salary - avg(salary) OVER () AS avg_salary
FROM
employees
The window function approach is actually easier to write in Rails as well!
Employee.select(
'*',
"avg(salary) OVER () avg_salary",
"salary - avg(salary) OVER () avg_difference"
)
The From Subquery
Leigh, we'd like to know the average performance review score given across all our managers.
After clarifying with HR, they are looking to take the average score each manager has given, and then take the average of those averages. In other words, the average average. When you are dealing with an aggregate of aggregates, it needs to be accomplished in two steps. This can be done using a subquery as the FROM
clause, essentially giving us a temporary table to then select from, allowing us to find the average of those averages.
SELECT avg(avg_score) reviewer_avg
FROM (
SELECT reviewer_id, avg(score) avg_score
FROM performance_reviews
GROUP BY reviewer_id) reviewer_avgs
To keep our Ruby code clean, we'll place the subquery into a variable which can then be embedded into the main query.
from_sql =
PerformanceReview.select(:reviewer_id, 'avg(score) avg_score').group(
:reviewer_id
).to_sql
PerformanceReview.select('avg(avg_score) reviewer_avg').from(
"(#{from_sql}) as reviewer_avgs"
).take.reviewer_avg
The result of this query is 50.652
. This makes sense given that the seed data used a random value between 1 and 100 (rand(1..100)
).
The Having Subquery
Leigh, certain reviewers are consistently giving low performance review scores. Could you find us a list of all the managers whose average score is 25% below our company average? We need to find out what is happening.
We will start by joining the employees
table to the performance_reviews
table where the employee is the reviewer (a manager), and then take their average score. Then we will filter out these managers using a HAVING
clause to only include those whose score increased by 25% is still lower than the company average.
SELECT
employees.*,
avg(score) avg_score,
(SELECT avg(score)
FROM performance_reviews) company_avg
FROM
employees
INNER JOIN performance_reviews
ON performance_reviews.reviewer_id = employees.id
GROUP BY employees.id
HAVING
avg(score) < 0.75 *
(SELECT avg(score)
FROM performance_reviews)
You'll notice that I actually included two subqueries in the above SQL. Because the SQL was saved to a variable (avg_sql
), we were able to reuse this both within the SELECT
portion of the query, and also within the HAVING
clause.
avg_sql = PerformanceReview.select('avg(score)').to_sql
Employee.joins(:employee_reviews).select(
'employees.*',
'avg(score) avg_score',
"(#{avg_sql}) company_avg"
).group('employees.id').having("avg(score) < 0.75 * (#{avg_sql})")
The result of this query gives me 103 employees, and the first three of them look like:
[
{"id"=>173, "role_id"=>1, "name"=>"Bob Williams", "salary"=>109206.0, "avg_score"=>23.75, "company_avg"=>50.04},
{"id"=>390, "role_id"=>5, "name"=>"Bob Serna", "salary"=>127559.0, "avg_score"=>26.0, "company_avg"=>50.04},
{"id"=>802, "role_id"=>4, "name"=>"Alice Halliday", "salary"=>94956.0, "avg_score"=>35.88, "company_avg"=>50.04}
]
Conclusion
In this article we were able to see a number of (somewhat) real-life examples of real business needs translating first into SQL, and then into the Rails code necessary to find those records. A backend developer's career will consist in most likely hundreds of similar requests!
Active Record gives us the ability to perform simple queries simply, but also lets us access the raw SQL which is sometimes required to get our jobs done. Subqueries are a perfect example of that, and we saw how to create subqueries in Rails and Active Record in the SELECT
, FROM
, WHERE
, and HAVING
clauses of an SQL statement. As we have seen in the examples above, with the expressiveness of Active Record, one doesn’t have to resort to writing completely in SQL to use a subquery.
Share this article: If you liked this article we’d appreciate it if you’d tweet it to your peers.
You might also be interested in
Learn more about how to make the most of Postgres and Ruby on Rails:
- eBook: Best Practices for Optimizing Postgres Query Performance
- Effectively Using Materialized Views in Ruby on Rails
- Efficient GraphQL queries in Ruby on Rails & Postgres
About the author
Leigh Halliday is a guest author for the pganalyze blog. He is a developer based out of Canada who works at FlipGive as a full-stack developer. He writes about Ruby and React on his blog and publishes React tutorials on YouTube.