5mins of Postgres E5: Finding the root cause of high CPU utilization in Postgres with EXPLAIN and the Linux perf command
Today, we'll take a look at a hairy Postgres incident and how we can use Postgres EXPLAIN and the Linux perf command to debug high CPU utilization in Postgres.
Share this episode: Click here to post to twitter.
Let's start diving in.
In this blog post by Jeremy Schneider, he describes a situation he encountered at AWS where they were debugging a particular customer problem. What's important to know is that this was a migration situation. There was a major version upgrade, and what they were seeing was that on the new database the application wasn't working at all anymore.
The system CPU utilization was spiked to 100%! The system was pretty unresponsive, I imagine.
When they looked at
pg_stat_activity, they also didn't see any wait events or anything that would indicate which part of Postgres could be busy with the queries. They were able to see one thing, which is very helpful in debugging here, which is that they were consistently seeing a particular query show up. They were seeing this rather simple query which was looking at the data in a particular table and it was comparing an
IN list of values against two columns and then was also doing an
AND comparison on a timestamp column. When you look at Jeremy’s article, you'll notice that these are fairly large multi column indexes: for example,
timestamp column one is actually the last column of that long index which we'll come back to, but it's partially related to what the ultimate root cause was found out to be.
The same queries and same workload was behaving fine on the old systems, but on this new system, in the new version, it was not working as expected. Here's an important reminder: make sure to
ANALYZE after you upgrade a major Postgres version. There are some issues with outdated statistics after upgrades, and so always run
ANALYZE after you've done a major version upgrade.
There are a couple of things for how you can find out whether Postgres was choosing a bad query plan. One thing that I always recommend people do, and it's also mentioned here in the article is consider using
auto_explain. If there is a bad query plan that was executed it will log that plan to the Postgres logs. This will help you find the root cause of slow queries. If a query goes for multiple seconds then it just outputs the explain plan of that specific execution into the Postgres logs. On Jeremy’s particular server, they didn't have that enabled unfortunately. And it didn't necessarily help, because the other thing that was in place here: they are using a
statement_timeout of about two minutes. If you have a
statement_timeout enabled in Postgres, both
pg_stat_statements won't show these queries that are timing out.
pg_stat_statements only considers completed queries, it doesn't consider canceled queries like a timeout would cause. auto_explain also does not output a canceled query. You should look at your Postgres logs and if you're seeing a particular query time out a lot, then you should look at that independently of other statistics you may be looking at.
The point where this got really interesting was when they went from looking at the Postgres statistics tables to the Linux system level tools. If you have a hundred percent CPU utilization, what I have seen a lot of experts in the Postgres world do is they will actually drop down to the Linux level and they'll run Linux system profilers, like perf to debug what exactly time is spent on.
Perf is a rather basic program, there are scripts that you can write to make it easier to use or to output certain data in a nicer to read manner. For example, Jeremy and team generated a flame graph based on the perf data they were collecting and this flame graph was showing them what the system was busy with. This made them compare the profiles between the good system and the bad system. They were essentially able to identify the same query on the good system. There was something that was odd about it, which was that the
BitmapAnd function was called a lot less frequently on the good system, and on the bad system it was called quite a lot. With that in mind, Jeremy ended up having a hunch.
You sometimes need to work off a hunch and try things out.
In this case Jeremy wanted to try something and he ended up doing an
explain on the query in some variant, and he was able to reproduce the bad query plan. As you can see in his article, in the bad query plan case, it was doing a bitmap index scan on the second index and it was expecting to get 2 billion rows, the whole table, but it was still using that index and was still attempting that index scan. What the system was churning on was just that bitmap index scan being pulled into the plan when it really shouldn't have.
For various reasons they didn't change the indexes, which is where I would have started, but they instead went and tried to override what Postgres was doing by using a CTE and force the query into the good plan. And that was the solution here.
If you encounter high CPU utilization problems yourself, what I would recommend is look at this Postgres Wiki page that describes how to use the perf profile tool. Note: This is only usable if you run Postgres on a virtual machine. If you're a customer of Amazon, for example, you actually would need Amazon to run this. If you have your own Postgres in a virtual machine, then you can just do this yourself. It's pretty straightforward, actually.
Last, but not least: Jeremy also referenced this set of tools by Tanel Poder. Tanel’s tools are intended to be low overhead profiling that you can keep running for a longer time. Tanel in particular has a tool called run_xcpu and Jeremy referenced in his post that, internally at AWS, they have similar tooling to do low-frequency profiles over a longer timeframe. They are using that same perf tool underneath the hood which gets the raw Linux system data that you can then analyze.
Thank you so much for listening. This was 5mins of Postgres. See you next week!