Speeding up partial COUNT(*) in Postgres by using LIMIT in a subquery

In today’s E103 of “5mins of Postgres” we look at optimizing the performance of COUNT statements when only a subset of the data needs to be counted, through use of a LIMIT in a sub-SELECT, or what I like to call a "limited COUNT". We also discuss how this can be represented effectively in a web application.





Transcript

In this blog post, Lukas Eder expands on a previous post where he talked about the benefits of using SQL EXISTS instead of counting.

EXISTS vs COUNT(*)

The big argument there is that a COUNT(*) in Postgres has to look at the actual rows to check visibility. That's an expensive operation: if you have a lot of records, doing a COUNT across all of them is not a good idea. In the example from Lukas' previous post, he showed that if you do a SELECT COUNT(*) on an "actors" table, but all you really care about is whether there are any actors in that table matching a particular name, it's much more efficient to do a SELECT EXISTS. And inside the EXISTS, you're doing a sub-select.

The argument here is that as soon as one row gets returned, Postgres can return early because it has proven that at least one value exists.
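As a sketch, the two approaches look like this (the actors table and the name filter here are illustrative, following the shape of Lukas' example):

-- Has to inspect every matching row just to compare the total against zero
SELECT COUNT(*) > 0
  FROM actors
  WHERE last_name = 'MILLER';

-- Can return as soon as the first matching row is found
SELECT EXISTS (
  SELECT 1
    FROM actors
    WHERE last_name = 'MILLER'
);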

Today we want to talk about a very similar situation: not checking for the existence of a single value, but knowing whether there are at least 2 values, 10 values or 100 values. Essentially, I don't care about the precise number, I just want to know whether there are at least that many elements.

Partial COUNT(*)

What Lukas argues against is writing something like a "SELECT COUNT(*)" and then putting a boolean condition around it.

If you did this on the SQL side, you'd write a sub-SELECT saying "SELECT COUNT(*)", and then on the outside you'd do a SELECT that just compares the result with ">= 2".

The argument is that this is not performant, because what we should be doing instead is asking the database to count at most two values. Here we're still getting the records, but we LIMIT that to two, and then we return.

The big benefit here is that the database doesn't have to count things unnecessarily: you don't care whether there are 3 or 4 rows, you just care whether there are at least 2.
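As a sketch (reusing the illustrative actors table from above), the variant Lukas argues against and the limited variant look roughly like this:

-- Counts all matching rows, then compares the total against 2
SELECT (SELECT COUNT(*)
          FROM actors
          WHERE last_name = 'MILLER') >= 2;

-- Counts at most 2 rows: the subquery stops fetching once the LIMIT is hit
SELECT COUNT(*) >= 2
  FROM (SELECT 1
          FROM actors
          WHERE last_name = 'MILLER'
        LIMIT 2) limited_actors;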

Lukas shows that if we don't have a LIMIT, with just the initial queries he showed, this particular query is doing a Nested Loop, and that Nested Loop is estimated to return 55 rows. In the actual execution it returns 56 rows, so the estimate from Postgres' planner is roughly correct in this case.

Then things get aggregated, and ultimately we return a single row: the Boolean expression that we are evaluating, i.e. whether there are two or more rows.
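If you want to see these estimates and actuals yourself, you can run the unlimited variant under EXPLAIN ANALYZE (a sketch on the illustrative actors table from above; the plan shape and row counts will depend on your own schema and data):

-- Shows the planner's row estimates next to the actual rows returned
EXPLAIN (ANALYZE)
SELECT (SELECT COUNT(*)
          FROM actors
          WHERE last_name = 'MILLER') >= 2;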

Using LIMIT in a subquery to improve count performance

With the LIMIT in place that Lukas just described, you can see that Postgres will actually still estimate that there are going to be 55 rows returned, but in the actual execution it only returns two rows. It's then able to terminate early.

There's not a big difference in this particular query, but you can see that the second query here is slightly faster. The first one took 0.039 milliseconds. The second one took 0.024 milliseconds.

As Lukas states here, if you're counting all the records, that's going to be consistently and significantly slower than if you're doing a LIMIT.
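To observe the early termination yourself, you can run the limited variant under EXPLAIN ANALYZE as well and compare the actual row count of the subquery against its estimate (again a sketch on the illustrative table):

-- The inner scan reports at most 2 actual rows, regardless of the estimate
EXPLAIN (ANALYZE)
SELECT COUNT(*) >= 2
  FROM (SELECT 1
          FROM actors
          WHERE last_name = 'MILLER'
        LIMIT 2) limited_actors;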

Now, it can sometimes be a little hard to see how you would actually use this in practice.

I want to show an example of a change that I made in our own application, pganalyze, where we added a LIMIT to a COUNT statement to make things load much faster.

A practical example inside the pganalyze app

The context here is: inside pganalyze, we have these query pages, and I'm showing you the version of the query page before the change. The relevant portion to look for is roughly in the middle of the screen, where you see "Log Entries" showing about 74,000 log entries. These are all the log entries for this particular query in the system.

These are things like auto_explain output, slow query logs and so on. The idea is that when I'm looking at this query details page in the app, I don't actually need to know that there are exactly 74,000 log entries. This is a great example where, instead of doing an exact COUNT, we can use this LIMIT technique to make it much faster.

Let me jump to the console; this is running locally.

Going from 75,000 rows to 101 rows

When we implemented the change I'm about to show you, we saw quite a significant improvement in production. Locally, of course, it's going to be slightly less pronounced. Here we're fetching roughly the same time range: we have 75,000 rows being returned, and with a "SELECT COUNT(*)" at the top of the statement we get the exact count.

Now, let's look at the execution plan to understand better what this is actually doing. It's arguably a bit hard to read, but the gist is that this is a partitioned table: Postgres scans the indexes and returns a small portion of the data from each partition, and at the end it aggregates everything to get that COUNT. We essentially aggregate those 75,000 rows that we're getting from the whole partition tree.

Now, what we want to compare this with is: what if we do a LIMIT?

Let me just illustrate for clarity again. This is the version with a simple COUNT(*):

SELECT COUNT(*)
  FROM log_lines_7d
  WHERE server_id = '82dd2475-2e29-44f2-bdf9-667c0a15c830'
        AND database_id = 23
        AND query_fingerprint IN ('\x98f04a54d842e630')
        AND occurred_at >= '2024-02-15 01:56:56.532925' AND occurred_at <= '2024-02-22 01:56:56.532976'
        AND log_line_parent_id IS NULL;

And then the version that is a lot faster is this version right here:

SELECT count(*) AS count FROM (
  SELECT 1
    FROM log_lines_7d
    WHERE server_id = '82dd2475-2e29-44f2-bdf9-667c0a15c830'
          AND database_id = 23
          AND query_fingerprint IN ('\x98f04a54d842e630')
          AND occurred_at >= '2024-02-15 01:56:56.532925' AND occurred_at <= '2024-02-22 01:56:56.532976'
          AND log_line_parent_id IS NULL
  LIMIT 101
) limited_count;

What we're doing is we have "SELECT COUNT(*)" on the outside, but on the inside we have this SELECT 1 (or, in Lukas' example, SELECT *), and we LIMIT it.

And we limit it here to 101. I'm going to explain in a second why it's 101 and not 100. But the idea is that this only has to fetch 101 rows from this log lines table, and then the query can return.

Performance impact: 35ms to 5ms

Just to illustrate this in terms of the timing difference: if we run the first version with "SELECT COUNT(*)" a couple of times, it takes about 35 milliseconds. And now if we run the version that has a LIMIT in place, that runs in a much nicer 5 milliseconds.

Again, this is locally; in production the exact count would probably be even slower. But the difference here is roughly 35 milliseconds for the version where we do an exact count and get the full number, versus the version that just asks "Do we have at least a hundred records?", which is going to be much faster.

The trick here, again, is that Postgres will return early from the index scans on the different partitions. It's a bit hard to read because of the Append, but in the Append node that pulls the data from the different partitions, you can see that only 101 rows are being returned, versus the full 75,000 rows in the other case.

Representing partial counts in a web UI

So this is a very good technique, but you do have to think about, of course, how you present it in the user experience of your application. Previously, we would show the exact count here, and you can still get the exact count if you click on it. But sometimes you just want a hint of "there are at least this many records".

And so what we ended up changing is that on the client side we now show "100+". This comes back to why we fetch 101 rows, not 100: if there are exactly 100 rows, we do want to show you "100" log entries. But if there are at least 101, we just show "100" with the plus sign at the end.
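In pganalyze this formatting happens on the client side, but the label logic is equivalent to this small SQL sketch (the VALUES list stands in for counts that the limited query might return):

-- With LIMIT 101, a count of 101 means "more than 100", so we show "100+";
-- at exactly 100 rows we can still show the precise number.
SELECT counted,
       CASE WHEN counted > 100 THEN '100+'
            ELSE counted::text
       END AS label
  FROM (VALUES (42), (100), (101)) AS t(counted);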

From a user experience perspective that doesn't make a big difference, but the page in this case loads noticeably faster because we added that LIMIT at the end.

I hope you learned something new from E103 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and X/Twitter to get updates about new episodes!
