5mins of Postgres E30: Postgres performance puzzle, Linux readahead and tuning effective_io_concurrency
Today, we're going to talk about an interesting Postgres performance puzzle by Jeremy Schneider, Linux readahead and how to tune effective_io_concurrency. The default Linux readahead setting assumes sequential scans ahead, but you might want to change it, together with effective_io_concurrency, which tunes how much Postgres is telling the Linux Kernel to look ahead.
Let's get into it.
This blog post by Jeremy Schneider was posted a couple of weeks ago and it was posing a puzzle for the reader to solve. I'll admit I was lazy at the time and I didn't actually solve the puzzle, but Jeremy got a few interesting responses and we'll take a look today at the puzzle itself and also at the solutions.
In his puzzle, he was counting records in a table and was limiting it by filtering on a certain column. There are two columns here that are relevant. The first query with the first column took 500 milliseconds. The second query took over a hundred seconds.
That 600 times difference is pretty substantial and surprising.
The question is: "can you guess what the data in the table is?" - the solution to this puzzle: the first column was just counting up - like 1, 2, 3 - as it was inserting data, versus the second column was random data, and this was distributed all across the index. You may also see behaviors like this, by the way, if you're using random UUIDs, we've talked about this previously, but random UUIDs are very ineffective in terms of index composition.
In this follow-up blog post, Jeremy goes through how to debug this problem. He starts with an
EXPLAIN ANALYZE, and does two things here that I think are worth noting.
First of all, he specifies the
EXPLAIN. That's always a good idea to specify when you run an
The second thing he does, and this is something I don't actually do as often myself, is that he enables
log_executor_stats. This runs an operating system function called
getrusageto collect extra metrics during the execution of the query. This will tell you what happened in the operating system level process, like how much data was read on a file system.
What's odd about this first query is that if you divide the IO time by the number of blocks that were being read, the average latency of a read request is very low. Each page that was being read was very fast.
The second query takes over a hundred seconds and reads a lot of blocks. If you're reading a lot of data, that's going to be slow no matter what. We know that this index layout is ineffective.
If you're reading a lot of data, that's going to be slow no matter what.
Let's just look at a smaller portion of the index. When you look at the numbers, you'll find that the individual IOs, the individual page that's being loaded, takes about 0.4 milliseconds. Underneath the hood, this is using a gp2 EBS volume. This is a network request, so the 0.4 milliseconds, that's actually reasonable.
You can do some more analysis, but what's interesting to understand is that Linux readahead is the reason that the first query is much faster. The way that somebody who works on this in the Linux kernel describes it is: Readahead takes a risk, reads more than is needed. And then if that brings a reward, as in somebody actually requests that data later, then it reads even more ahead because it thinks that you're reading sequentially.
Instead of you having to wait on each page, it reads ahead and it fetches that data into the OS page cache for you to read from the cache instead.
Jeremy does a little bit more analysis here in terms of readahead. The main point that I would make is that readahead helps you improve performance, especially on a network EBS volume.
Linux readahead is only effective when you have sequential data.
There are cases where Postgres can help Linux. If you have a Bitmap Heap Scan, so it looks at the table data, Postgres actually has the index itself that will tell it where in the table it's looking for data. effective_io_concurrency in Postgres tunes how much Postgres is telling the Linux Kernel to look ahead.
When you set
effective_io_concurrency to a value, let's say 10, that will say instead of getting one page from the main table at a time, it will read ahead multiple pages by telling the Linux kernel: Hey, I am going to need these other pages soon.
This is very useful! If you have a system that has high concurrency supported on the drive, for example an EBS volume, the default value here is not going to be good enough.
Surprisingly, Amazon RDS does not set a useful default value, in my opinion. They don't change the default and default is one, which means a single request is going to be requested ahead, which doesn't really help you.
Somebody actually did a benchmark of this a couple of years ago, where they were comparing two cases.
- A sparse bitmap scan, which effectively means it's looking at a lot of different locations in the table
- and a dense bitmap scan, where they were looking at very closely related data that's right next to each other on the table.
If you have a sparse bitmap scan, what you want is a high
effective_io_concurrency. You want something like 32, 128 or even 512. Depending on how many active queries you're running, maybe 32 makes more sense, but certainly the default of one doesn't make sense.
On the other hand, if you have a very dense Bitmap Heap Scan, so if you're looking at very closely related data, turn
effective_io_concurrency to 0, because then you defer back to the Linux readahead behavior, the default, which will assume sequential scans ahead, instead of trying to override that behavior with Postgres.