A Postmortem on High Memory Usage with Prepared Statements
In today’s episode 68 of “5mins of Postgres” we're going to talk about how using partitioning together with prepared statements can lead to out of memory problems.
Let's jump in!
This is a blog post by Christopher on the Prefect blog, where he describes a post-mortem where they ran into a problem and what they ultimately coined ”the single Postgres setting that almost took down the database”.
Now, the backstory here is that at Prefect, they were essentially doubling their traffic every two months. Generally, that's great, but if you think about your database, that might not be so good. What they were trying to do is get ahead of the game.
Roughly a month ago they ended up doing two things:
They scaled up their primary Postgres 14 instance that is running on CloudSQL in GCP, essentially adding more hardware. They introduced table partitioning. Specifically, they introduced table partitioning for big tables that had 400 million rows at the time, and they anticipated to have over a billion rows by the time that partitioning work would be completed.
Generally, partitioning is a best practice in Postgres as it helps you reduce autovacuum overhead and makes queries faster in many cases. We talked about partitioning previously in 5mins of Postgres episode 18 and episode 22. We also have a dedicated article about PostgreSQL partitioning in Django.
One thing that they noticed as they were making both of these changes at the same time is that even though they increased the size of their instance, their memory utilization on the database didn't improve much. Even though they added more hardware, added more RAM, they still saw problems with memory utilization being quite high.
And this was happening at the same time as rolling out these new table partitions. They were unsure whether it was related to the table partitioning or something else completely. Whilst they were wondering about this problem, at the end of last month their hand was forced because suddenly their database restarted with an out of memory event. The memory utilization was correlated with active connections. Meaning when they had more active connections, then memory utilization would be higher.
One thing that they did to improve stability is to reduce the number of connections. A couple of days later, the database restarted again. Clearly, whatever they were doing wasn't enough. And so there was still a problem going on.
They ended up going through all the different Postgres configuration settings, like
shared_buffers and especially
work_mem. And so you might think that
work_mem was actually the problem here. But, it was not, to take that away. There was actually a different problem going on!
As they were debugging this, another thing that they noticed is that their application had a connection pool on the application side which was resetting every two hours. What that meant was that there was a new connection being created every two hours and the old one being retired. Around that time the memory usage fluctuated. So new connections did not have this problem, only old connections had this issue.
Now, the change that they made to fix this was to recycle their connections more often. At the same time they also made this seemingly unrelated change, which is that they changed some of their tracing configuration. We'll get back to that in a moment.
These two changes that happened around the same time significantly improved the situation. Previously, they had 60 to 80% of memory utilization. After the two changes got deployed, the memory utilization dropped down to 20 to 25%. An amazing difference, this is exactly what they were looking for.
Was it the connection pooling or was it that seemingly unrelated change? They realized that it was actually the tracing change that they made, so not the connection pooling, but the other change and the backstory here is, what they term "load-bearing comments".
An additional instrumentation that they enabled was essentially adding query comments to each query. This can oftentimes be a best practice, however, the one thing to be aware of is that a lot of times when you add these comments, they actually end up making the text of the statements unique. For example we have a
SELECT statement and then this additional sqlcommenter comment which will add some comment like: “where does this come from, like which driver, which framework”. It will also add a trace parent information. The idea behind the trace parent is that you can trace your queries across the application and the database by having the same trace information. What that does though is it makes the statement text unique.
They remembered a previous issue they had on a different database. What they observed was a typical prepared statement behavior, which is that when you run the same prepared statement five times, the sixth execution behaves special.
This is hard coded logic inside Postgres.
Once you have the same prepared statement executing six times, Postgres will also calculate what's called a generic plan. A generic plan is a plan that Postgres doesn't have to recreate for each execution. Instead it's independent of the parameter values.
Sometimes that can be a huge benefit because you're saving on the planning time, but the downside can be that you are always using that generic plan. The big issue that they ran into was actually something we covered previously in 5mins of Postgres E18, which is when you have prepared statements and you have partitions then partition pruning - which removes partitions from the planning - doesn't occur during the planning phase, it only occurs during execution time! What that means is that you have a lot more overhead when you have a lot of partitions, because Postgres is not going to be able to remove partitions you're not querying.
This is a blog post by Amit from the EDB team. And Amit actually worked on a patch that improves this, what's called initial pruning, so that generic plans can also benefit from it.
At the time I had hoped that Postgres 16 would see this patch committed. Unfortunately, there was some more back and forth on this mailing list thread. It looks like there is now a path forward, but it was a bit late in the Postgres 16 development cycle. We'll probably have to wait until Postgres 17 for this to be fixed upstream.
The root problem here is that if you use partitioning and you have generic plans, then you'll have a lot more overhead because Postgres will have to keep around all the partitions in the relcache of each connection, and will also cache a plan that's pretty heavy, in the sense that it has a lot of nodes.
What they turned to in this case was the plan cache mode setting. You can tell Postgres to not do generic plans. What you can do instead is say
plan_cache_mode = force_custom_plan. What this does, even if you're executing the same statement over and over again, is that Postgres will always replan, will always make a new plan, and through that avoid this problem of doing generic plans with partitions.
You can set this on individual connections right before you run the query. You don't necessarily have to set this database wide, you could make it specific to the queries that use a lot of petitions. Definitely worth investigating if you know you're using prepared statements and you have a lot of partitions.