5mins of Postgres E24: Tuning the Postgres statistics target & understanding selectivity for spatial indexes
Today, we are talking about tuning the Postgres statistics target, why the default is 100 and how the sampling factor behind it was derived, and about Postgres selectivity, specifically selectivity for spatial indexes.
Let's jump in.
In this blog post by Paul Ramsey from the Crunchy Data team, he talks about making use of statistics and selectivity in Postgres to optimize your indexes and includes some specific notes on spatial indexes as well. He starts out by talking about statistics targets. The default is controlled with the
default_statistics_target setting in Postgres. This setting determines how much information Postgres collects from your table to then make planning decisions later on.
The default statistics target is 100, but as Paul points out, it actually means that you're gathering a sample of 30,000 rows!
This might be non-obvious, and I decided to take a quick look at the Postgres source code to understand where this magic number of 300 comes from. If you go to the Postgres source and look at the analyze.c file, you'll find that the source comments base it on a 1998 paper about random sampling for histogram construction. The idea is: we're looking for a hundred values and their frequencies, so to get a representative sample we need to look at a number of rows that's quite a bit higher than that. The math in that paper ultimately works out to a factor of about 305, which Postgres simplifies to 300.
That's where the sample size comes from: 300 times the statistics target.
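Putting numbers on that, the sample size is simply 300 times the statistics target. A quick sketch (the values shown in comments are the stock defaults, not output you'd see from SHOW itself):

```sql
-- Check the current default (100 unless you've changed it)
SHOW default_statistics_target;

-- Sample size used by ANALYZE = 300 * statistics target:
--   300 * 100   =  30,000 rows sampled (default)
--   300 * 1,000 = 300,000 rows sampled (if raised to 1,000)
```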
Now, the statistics target may not be enough. If your data has a Pareto-type distribution and Postgres only keeps a hundred actual values in its statistics system, the long tail would not be represented. Statistics about the long tail would be wrong, because Postgres wouldn't necessarily have kept those values.
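To see what ANALYZE actually kept for a column, you can query the pg_stats view. A quick look might be something like this (the table and column names here are hypothetical):

```sql
-- most_common_vals / most_common_freqs hold at most
-- "statistics target" entries (100 by default), so values
-- in the long tail beyond that are not individually tracked
SELECT most_common_vals, most_common_freqs, histogram_bounds
  FROM pg_stats
 WHERE tablename = 'orders'        -- hypothetical table
   AND attname   = 'customer_id';  -- hypothetical column
```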
There are some tweaks you can make to optimize that. The first one is that you can change the statistics target on a per-column basis. In the simplest situation, if you see Postgres making bad estimates for your plans, and you've decided that the problem is likely statistics, you don't need to raise the default_statistics_target setting. Instead, you can alter a specific column and set its statistics target to a higher value. Afterwards you'd have to run ANALYZE to actually improve things.
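In practice that's just two statements (the table and column names are examples):

```sql
-- Raise the statistics target for just this one column,
-- leaving default_statistics_target untouched
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 1000;

-- Re-gather statistics so the planner actually sees
-- the larger sample (300 * 1,000 rows for this column)
ANALYZE orders;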
To put all of this in context, let's introduce the concept of selectivity. If a condition matches a lot of rows, it's not very selective; if it matches only a small portion of the table, say 1%, it's very selective.
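One way to see selectivity at work is to compare the planner's row estimate against the table size. A sketch with a hypothetical table:

```sql
-- The planner's "rows=" estimate, divided by the table's total
-- row count, is the estimated selectivity of the WHERE clause.
-- If the table has 100,000 rows and the plan estimates
-- rows=1000, the condition is about 1% selective.
EXPLAIN SELECT * FROM orders WHERE status = 'refunded';
```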
Paul works a lot with spatial data. When you're filtering on geographic data, oftentimes you would use something like a bounding box. We can see two filters in the example in his article. On the left side, you can see the small red square and on the right side, you can see the bigger blue square. You might think that the small red square is actually more selective, right? Because it looks smaller. It is a smaller area to look for versus the blue might be less selective, might select more data.
Really, that's not necessarily true.
If your data is distributed in a way where the red square is looking at a portion of the spatial data that has a lot of data points in that area, then your red might actually be less selective because you're looking at more data points within that red rectangle versus the blue square here. This is why it's very important to understand which portion of the table you're looking at.
How does Postgres actually know this in the case of spatial data? What PostGIS keeps is essentially a spatial histogram, and that histogram can tell you which portion of the map has more values and which portion has fewer values. You can see where you have more dots in the middle, you have numbers like 23 or 20 representing the number of rows in that area. Versus on the right side in his example, you have, for example, single ones or just zeros.
The way we can use this during query planning is that we look at the area you're searching for and match it against that spatial histogram. In the case of the blue rectangle, we're most likely going to match about two data points overall, and in the case of the red rectangle, we're going to get roughly 33 values. Note that this is a simplification: PostGIS actually keeps a more sophisticated spatial histogram that scales the bucket bounds based on the amount of data points in each bucket. The more data points there are, the smaller the width of that histogram bucket, to make sure estimates can be more specific in the areas where there's more data.
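A bounding-box filter like the ones in the article might look like this (the table and geometry column names are made up; && is the PostGIS bounding-box overlap operator, and ST_MakeEnvelope takes xmin, ymin, xmax, ymax and an SRID):

```sql
-- The planner's estimate depends on the spatial histogram,
-- not on the size of the box: this envelope is matched
-- against the histogram cells it overlaps
SELECT count(*)
  FROM places
 WHERE geom && ST_MakeEnvelope(-122.5, 37.7, -122.3, 37.8, 4326);
```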
Last, but not least, what's important to note: All of what we've talked about here is per column. Postgres also has the extended statistics system, which runs across columns. That's the system that you would use to tell Postgres about functional dependencies between two columns, which can greatly impact the costing for a query.
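Creating such cross-column statistics is a single statement. As a sketch, if a city column functionally determined a zip column (the table and column names here are examples):

```sql
-- Tell Postgres that these two columns are correlated, so it
-- doesn't multiply their selectivities as if independent
CREATE STATISTICS city_zip_stats (dependencies)
  ON city, zip FROM addresses;

-- Extended statistics are also gathered by ANALYZE
ANALYZE addresses;
```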
Before we close, I did briefly want to mention that Amazon Aurora now has support for Postgres 14. If you are on Aurora, it's a good time to consider upgrading. You may be surprised that they haven't included the latest patch release, 14.4, but the good news is that the main reason for that patch release was the fix for the CREATE INDEX CONCURRENTLY bug. That fix was backported, so the critical CONCURRENTLY bug fix is in Aurora's release as well.