Out of range planner statistics and "get_actual_variable_range" in Postgres
In today’s E102 of “5mins of Postgres” we talk about how Postgres handles situations where planner statistics are out of date. We look at the logic that decides when the planner goes to the actual indexes on a table to get the maximum value, using the get_actual_variable_range function, and at the performance implications of this approach.
- How ANALYZE impacts Postgres row estimates
- What Postgres does when the statistics are out of range
- Looking at the Postgres source: Selectivity functions and get_actual_variable_range
- Performance issues with get_actual_variable_range over the years
- A recent example of a surprising planner slowdown
- VACUUM with INDEX_CLEANUP ON
- In conclusion
- What we have discussed in this episode of 5mins of Postgres
Transcript
In this blog post by Franck Pachot, he talks about how Postgres, as well as Yugabyte, handles out-of-range statistics. For today, we're just going to focus on Postgres.
The idea is that when a query in Postgres filters on a value that is outside the range of what ANALYZE has collected, Postgres has special behavior to handle this out-of-range data.
ANALYZE in Postgres runs periodically and collects table statistics: it samples your table to determine which values are typical and how frequent they are, and it builds, amongst other things, histograms on that data.
What can happen is that if new data gets inserted after the most recent ANALYZE, you'll have data that's out of sync with the statistics on the table. Postgres can detect such situations and will go to the actual table and the actual indexes to determine which query plan to use. Oftentimes you would see this when you have something like a sequence or a current timestamp, where the value keeps increasing.
How ANALYZE impacts Postgres row estimates
Franck has a test here where he creates a table. He also turns autovacuum off, because we don't want automatic ANALYZE to happen; we want to run an explicit ANALYZE to test this behavior. The table has an "id" column, a timestamp column, and a text column. The initial data is just going to sit on the February 8th timestamp, and it's going to have just one single text value as well. We can see Postgres' row estimate, as well as what the actual rows are. The row estimates are slightly off: Postgres estimates that 20,000 rows match, but actually 40,000 rows are returned.
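To make this concrete, here is a minimal sketch of that kind of test setup. The table name "demo", the column names, the dates and the row counts are just illustrative, not Franck's exact script:

```sql
-- Illustrative test table: autovacuum disabled so no automatic ANALYZE runs
CREATE TABLE demo (
  id  bigint GENERATED ALWAYS AS IDENTITY,
  ts  timestamptz,
  val text
) WITH (autovacuum_enabled = off);

-- Load the initial data, all sitting on a single timestamp with a single text value
INSERT INTO demo (ts, val)
SELECT timestamptz '2023-02-08 00:00:00', 'initial'
FROM generate_series(1, 40000);

-- Compare the planner's row estimate to the actual row count for a range filter
EXPLAIN (ANALYZE, TIMING OFF)
SELECT count(*) FROM demo WHERE ts >= timestamptz '2023-02-07';
```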
And when you're looking at EXPLAIN plans, the row estimates are actually very important, because they drive, amongst other things, how Postgres decides which JOIN type to use. It's important that your row estimates are at least directionally correct.
Franck does a bunch of tests and summarizes them in this table. What we can see is that right now, with the current data and without having run ANALYZE, Postgres will estimate roughly the same number of rows being returned for each of the different timestamps.
And that's really because there are no statistics about this yet. Now, when we run ANALYZE, Postgres builds its internal statistics that are stored for each column. And so now it's much more accurate in being able to determine that for February 7th we have about 40,000 rows, and this matches the actual number of rows.
That's how it's supposed to work: the planner is able to use these statistics to plan quickly from the cached data.
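If you want to see what ANALYZE actually stored, you can look at the pg_stats view. Continuing with the hypothetical "demo" table from above:

```sql
-- Build fresh per-column statistics for the table
ANALYZE demo;

-- Inspect the cached statistics the planner works from,
-- including the histogram bounds for the timestamp column
SELECT most_common_vals, histogram_bounds
FROM pg_stats
WHERE tablename = 'demo' AND attname = 'ts';
```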
What Postgres does when the statistics are out of range
Now, Franck does a few more tests. What he's doing is adding additional data on subsequent days. Without having to run ANALYZE, Postgres will actually be able to estimate how many rows were added.
And so in this case, it will actually assume that on February 8th we added more rows, and the actual rows match the estimate. But it also estimates that on February 9th there's some data, even though no rows were actually added on that day. Postgres just doesn't have any precise statistics about this.
What you can do to help Postgres understand what additional data might be available is to create an index: Postgres is able to use the index as part of its planner estimation.
And so you can see here that once we have an index on the timestamp column, Postgres is actually able to accurately predict that on February 9th there is no data.
The benefit here is that, as we add more data, Postgres is able to estimate that there is more data coming in for these new timestamps just by looking at the index, assessing the shape of the table from the index's maximum value.
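As a small illustration, again using the hypothetical "demo" table, an index on the timestamp column is all that's needed. Once it exists, the planner can peek at the index's actual maximum value when a filter falls outside the cached histogram:

```sql
-- Create the index the planner can consult for the actual min/max values
CREATE INDEX demo_ts_idx ON demo (ts);

-- A filter on a date beyond what ANALYZE saw should now get a much better estimate
EXPLAIN SELECT count(*) FROM demo WHERE ts >= timestamptz '2023-02-09';
```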
Looking at the Postgres source: Selectivity functions and get_actual_variable_range
Sometimes it's interesting to understand how this works behind the scenes. The good news is that Postgres' source code is easily accessible, and the selectivity logic, i.e. deciding how many rows will match, is mostly located in the "selfuncs.c" file.
Here we have different selectivity functions for different operators. In this case we're doing a range comparison: we're saying this value is larger than a fixed constant value. We can see that Postgres uses the "ineq_histogram_selectivity" function to determine selectivity for such operators. The important thing to understand here is that sometimes Postgres will not just use the internal histograms that are stored in the statistics.
If Postgres determines that you're looking at the first or the last bucket of those internal histograms, then it also looks at the index, through the "get_actual_variable_range" function.
The idea here is: as the table is changing and you're adding more values at the tail ends of those histograms, Postgres wants to have a more accurate estimate. And so, only under certain conditions, "get_actual_variable_range" gets called for these inequality comparisons.
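One rough way to check whether your queries are hitting that out-of-range territory is to compare the last histogram bound ANALYZE stored against the actual maximum value in the table. This is just an illustrative query against the hypothetical "demo" table from earlier:

```sql
-- Compare the newest value ANALYZE knew about with the table's current maximum;
-- filters beyond the last histogram bound are the out-of-range case
SELECT
  (SELECT max(ts) FROM demo) AS actual_max,
  (histogram_bounds::text::timestamptz[])[
    array_length(histogram_bounds::text::timestamptz[], 1)
  ] AS last_histogram_bound
FROM pg_stats
WHERE tablename = 'demo' AND attname = 'ts';
```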
Performance issues with get_actual_variable_range over the years
Now, this has actually been a bit of a problem over the years. If you were using Postgres, let's say 6-7 years ago, you might've seen workloads where this becomes a bottleneck. And again, the problem is that this gets called during planning.
This is not something that you would usually expect: usually planning is fast, and you wouldn't expect it to look at the actual table or the actual indexes. This is one of those rare cases where Postgres does do that.
As recently as 2022, Tom Lane committed a patch to improve the performance of "get_actual_variable_range" around dead tuples.
The problem is that if you have a lot of updates to your table, a lot of changes, a lot of deletes, then you will have dead index entries that get in the way of Postgres finding the actual maximum value through the index. And the challenge, of course, is that this can make things really slow.
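If you suspect this is happening on your system, a quick sanity check is to look at dead tuple counts and the last vacuum times in pg_stat_user_tables. Tables with many dead rows and a stale vacuum are the candidates where this planning step can get expensive:

```sql
-- Find tables with lots of dead rows and check when they were last vacuumed
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```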
A recent example of a surprising planner slowdown
What I found interesting recently is that Mark Callaghan runs what he calls "The Insert Benchmark".
He compares different database technologies, including Postgres, and he looks at performance regressions over time. Earlier in January, he had this post where he said, "Hmm, this is weird. I'm seeing a performance regression in Postgres 14".
And then, earlier this week, he said, "Hey, this wasn't actually a performance regression". What's interesting is that it again comes back to this "get_actual_variable_range" function.
VACUUM with INDEX_CLEANUP ON
What he was seeing on Postgres 14 was a slowdown caused by a behavior change: in Postgres 14, VACUUM was changed to sometimes skip the index vacuuming phase when it's not necessary, that is, when only a small portion of the table changed. The way his insert benchmark was written, this made the "get_actual_variable_range" function more expensive, because VACUUM wasn't cleaning up indexes on the same schedule as it did on 13 and earlier.
The fix here was to run VACUUM explicitly with the "INDEX_CLEANUP ON" option. This forces VACUUM to do index cleanup, even if VACUUM on 14+ would otherwise skip it.
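For reference, on the hypothetical "demo" table from earlier that would look like this:

```sql
-- Force VACUUM to always process the indexes,
-- even when Postgres 14+ would otherwise decide index vacuuming can be skipped
VACUUM (INDEX_CLEANUP ON) demo;
```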
On a production system, I probably wouldn't use this, since for the most part it's a good idea to avoid the extra overhead of the index vacuuming cycle when you can.
In conclusion
And so, all in all, if you're running into a problem where suddenly your planning is slow, and you know it's a table with lots of dead rows, I would encourage you to take a look at your vacuum schedule. And if it's your own self-managed server, it's also worth running perf top to understand whether you might be running into a bottleneck with get_actual_variable_range.
But then again, this is mostly a problem of the past: since those recent improvements that Tom made, you will have one slow planning cycle, but subsequent planning cycles are able to reuse the work from that earlier, slow cycle.
Therefore, in recent times this is not something I've heard about much, but it's still good to know that this is one of the cases where the planner will look at the actual indexes as part of the planning process.
I hope you learned something new from E102 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and X/Twitter to get updates about new episodes!