How to use CREATE STATISTICS to improve a plan's row estimate, and why it matters

In E92 of “5mins of Postgres”, we're going to talk about creating functional dependency statistics to improve plan cost estimates in Postgres. We’re looking at a problem that I myself encountered last week when I was trying to optimize the estimates we're getting for a particular query inside our own system.



Which types of statistics does Postgres collect?

Before we jump into the example, let me quickly refresh your mind about statistics. In Postgres, there are two types of statistics that the planner makes use of when it tries to understand how many rows will be matched by a particular query.

Single column statistics

The first kind are single column statistics, which are always automatically collected. You don't have to do anything for that.

It happens automatically in the background as part of a process called AutoAnalyze. AutoAnalyze samples data from your tables to assess how frequent certain values are in each column. For example, it will calculate things like the n_distinct number, which represents how many unique entries a column contains relative to the total number of rows.
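As a quick sketch, you can inspect those single-column statistics yourself through the pg_stats view. The table name here is hypothetical; adjust it to your schema:

```sql
-- Inspect single-column statistics collected by ANALYZE / AutoAnalyze.
-- "issues" is a hypothetical table name.
SELECT attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'issues';
```

Note that a negative n_distinct value means Postgres stores the distinct count as a fraction of the row count (for example, -1 means every row has a unique value).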

Extended statistics

Now, we also have extended statistics. The big difference with extended statistics is that Postgres will not create them automatically for you, but it often makes sense to create them yourself in order to improve planner cost estimates and nudge query plans in the right direction.

Functional dependency statistics in Postgres

The statistics type we're going to talk about today is functional dependency statistics. These help the planner when you have two columns that are dependent on each other, meaning that for a given value in column A, you will always, or most of the time, see a particular value in column B. They let you tell the planner about a relationship between two columns that it otherwise wouldn't know about.

The way you create these statistics is with the CREATE STATISTICS command. We specify the dependencies statistics type to collect functional dependency statistics, and list two or more columns as part of the statistics object, for which the planner will then collect data. We then need to run ANALYZE before we can actually see the collected statistics. In the example shown, the column at position one, which is a zip code, has a clear functional dependency on the column at position five, which means that for a given zip code, you'll only ever get one city.
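As a minimal sketch, assuming a zipcodes table with zip and city columns (the table, column, and object names are hypothetical), creating such a statistics object looks like this:

```sql
-- Register a functional dependency statistics object on two columns
CREATE STATISTICS stts_zip_city (dependencies)
  ON zip, city FROM zipcodes;

-- CREATE STATISTICS only registers the object; ANALYZE collects the data
ANALYZE zipcodes;
```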

A real-life example from pganalyze

Now, I want to show you an actual example where I ran into this in our own database, because it paints the picture pretty clearly of where you might see this yourself. In our database we have a table called issues. It tracks issues that pganalyze finds in our customers' databases, flagging common Postgres problems. In the table shown, you can see that we associate the data with a particular organization, with a particular server, and in some cases with individual databases. One of our queries filters on a particular organization ID.

If we just look at the EXPLAIN output, you'll see that in this very simple case we have an index on this column, and Postgres estimates that 250 rows are returned. Another query we often find ourselves running filters on both of these columns: we specify the server, but we also know the organization ID, so it's included in the query.
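Sketched out, the two queries look roughly like this. The table and column names follow the example; the literal IDs are made up:

```sql
-- Filtering on the organization alone: a simple index scan
EXPLAIN SELECT * FROM issues WHERE organization_id = 123;

-- Filtering on both correlated columns
EXPLAIN SELECT * FROM issues
 WHERE organization_id = 123 AND server_id = 456;
```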

Now, what Postgres does by default is a bit of a problem here. In this particular situation it uses a BitmapAnd, which is actually rare to see: it performs a bitmap index scan on each of the two indexes. As you can see, it estimates that one index scan will return 250 results, and the other 110 results.

Then, after the BitmapAnd, it estimates just one result, which is not true. If we run an EXPLAIN ANALYZE here, you'll find that the query actually returns 488 rows. The misestimate here is pretty big.


What Postgres does behind the scenes when it calculates selectivities

What Postgres does behind the scenes when it calculates selectivities, if it doesn't have extended statistics, is simply multiply the selectivities of the individual parts of the query. Internally, each selectivity is a very small number between zero and one.
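To make the multiplication concrete, here is a worked example with hypothetical selectivities and a hypothetical table size. Note how the product, scaled by the row count, rounds to an estimate of a single row:

```sql
-- Hypothetical numbers: two small per-clause selectivities multiplied together
SELECT 0.0003 * 0.0004 AS combined_selectivity,             -- 0.00000012
       round(0.0003 * 0.0004 * 5000000) AS estimated_rows;  -- 5M-row table: 0.6, rounds to 1
```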

In this case, the first selectivity is probably something like 0.0003. Postgres then multiplies that with the other number, which is also very small, say 0.0004. Because those two numbers are multiplied, the result becomes even smaller, which is how we end up with a row estimate of 1.

Now, let's run CREATE STATISTICS on this table for the two columns that are referenced. After running it, we'll actually have to run ANALYZE, because CREATE STATISTICS just registers the statistics object in the database; it doesn't yet collect the actual data. Surprisingly enough, we'll find that the estimate hasn't changed. The reason for this is that on this particular table, I know the default statistics target is not going to be sufficient.
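For our issues table, the statements would look roughly like this. The statistics object name and column names are assumptions based on the example:

```sql
CREATE STATISTICS stts_issues_org_server (dependencies)
  ON organization_id, server_id FROM issues;

-- Required: the object is only registered until we ANALYZE
ANALYZE issues;
```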

What I'm actually going to do is increase the size of that statistics object. I'm intentionally going to collect more statistics, which will take longer. Now I run ANALYZE again. This is slower than before, but behind the scenes it gathers a lot more data, which improves the functional dependency information, so we can clearly show the planner that these two columns are correlated. Once the ANALYZE has finished, I can rerun my EXPLAIN, and now we get an estimate of 500 rows, which, remember, is close to the actual row count of 488.
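One way to do that, available in Postgres 13 and newer, is to raise the statistics target on the statistics object itself. The object name here matches the earlier hypothetical example; on older versions, you would instead raise the per-column targets with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS:

```sql
-- Raise the sample size for this statistics object
-- (by default it follows default_statistics_target, usually 100)
ALTER STATISTICS stts_issues_org_server SET STATISTICS 5000;

ANALYZE issues;  -- slower, but samples many more rows
```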

To end, I want to show you briefly what this actually looks like behind the scenes in Postgres itself. When Postgres plans a query, the part to look at, if you're ever interested in the nitty-gritty details, is the clausesel.c file. That's what drives these selectivity estimates, which in turn define how many rows the planner expects your query to return. The general approach is that Postgres first tries to apply extended statistics to as many clauses as possible. Only for the parts where it couldn't apply them does it fall back to the regular method, which is to take the product of the individual selectivities. This is where mis-estimates come from, because that product can often be wrong.

Postgres will then look at the actual extended statistics information. In addition to the functional dependency statistics we looked at today, more modern Postgres versions also support what are called multivariate MCV lists, which help you improve estimates even further for the most common values in your table. In the case of functional dependency statistics, we ultimately land in a function that first tries to find the strongest dependency between the columns involved. In the situation where we're talking about city and zip code, for example, it will prefer zip code pointing to the city over the other way around.

In the situation with organization ID and server, it would prefer server pointing to the organization ID, because that's the stronger correlation. With that, it will essentially say:

  1. First, multiply the strength f of the functional dependency with the selectivity estimate for the leading column. In the zip code case, where f is one, that would be one multiplied by the zip code's selectivity.
  2. Then add 1 minus f (the strength of that dependency),
  3. multiplied by the regular combined selectivity.

What that means is: the stronger the dependency, the more the estimate is biased towards the selectivity of the leading column, whereas with a very weak functional dependency, it behaves more like the regular code path.
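Put together, the steps above amount to: selectivity = f * P(leading) + (1 - f) * P(a) * P(b). Here is a worked version with hypothetical numbers, showing how a strong dependency (f = 0.97) pulls the combined selectivity almost all the way up to the leading column's selectivity of 0.0004, rather than the tiny plain product of 0.00000012:

```sql
SELECT 0.97 * 0.0004                   -- f * selectivity of leading column
     + (1 - 0.97) * (0.0004 * 0.0003)  -- (1 - f) * regular product
  AS combined_selectivity;             -- roughly 0.000388
```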

In a sense it's very simple, but this really makes a big difference in your plan cost estimates. The reason it often matters to have an accurate row count is that it influences which JOIN order Postgres chooses. For example, with a nested loop it greatly influences which table gets queried first, and which table ends up on the inner side of the nested loop, which in turn influences index usage. If you see row mis-estimates, I would highly recommend looking at functional dependency statistics. Also consider looking at the actual statistics data to get all the details.
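To look at that data, you can query the pg_stats_ext view, available in Postgres 12 and newer. The statistics object name is the hypothetical one from earlier:

```sql
-- The dependencies column shows the recorded strengths,
-- keyed by column attribute numbers
SELECT statistics_name, dependencies
FROM pg_stats_ext
WHERE statistics_name = 'stts_issues_org_server';
```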

If you liked E92 of 5mins of Postgres, feel free to subscribe to our YouTube channel, sign up for our newsletter and follow us on LinkedIn and Twitter to get updates about new episodes!
