Optimizing bulk loads in Postgres, and how COPY helps with cache performance
In episode 80 of “5mins of Postgres” we're going to talk about optimizing bulk load performance in Postgres and how the COPY ring buffer works. We compare different methods of doing INSERTs and show why COPY is the fastest option. We also use pgbench and pg_buffercache to compare client-side vs. server-side data generation and its impact on shared buffers.
Transcript
Let's take a look!
In this blog post, Laurenz Albe from the Cybertec blog talks about the different ways you can load data into Postgres most efficiently.
Using COPY for bulk loading data into Postgres vs. INSERT
He starts by creating a table and inserts 10 million rows in a couple of different ways:
- One transaction for each INSERT
- Single INSERTs in a single transaction
- Single INSERTs with a prepared statement
- Using a multi-line INSERT
- Using COPY
As mentioned above, first of all he inserts in a way where there's one transaction for each INSERT. That's the least efficient way to do it.
Then, he runs single INSERTs in a transaction, which means he's issuing individual statements, but they're all in a single transaction. This already gives quite a performance boost.
Then, he does single INSERTs, but with a prepared statement in one transaction. This avoids some of the parsing and planning overhead Postgres otherwise incurs for each subsequent statement execution.
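For a rough idea of what that pattern looks like (the table and column names here are just illustrative, not taken from Laurenz's post), it could be something like:

```sql
-- Illustrative table, not the exact one from the article
CREATE TABLE instab (x bigint, y bigint);

BEGIN;
-- Parse and plan the statement once ...
PREPARE ins (bigint, bigint) AS
    INSERT INTO instab (x, y) VALUES ($1, $2);
-- ... then execute it repeatedly, only sending parameters
EXECUTE ins(1, 1);
EXECUTE ins(2, 2);
EXECUTE ins(3, 3);
COMMIT;
```

In practice, most client drivers give you this behavior through the extended query protocol or their own prepared statement API rather than SQL-level PREPARE/EXECUTE.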
Next, he does a multi-line INSERT: an INSERT into the table with multiple rows listed in the VALUES clause. This is fairly efficient, and Laurenz iterates on it further by combining it with a prepared statement in one transaction.
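A multi-row INSERT simply lists several rows in a single VALUES clause; a minimal sketch, using the same illustrative table, would be:

```sql
-- Assumes the illustrative table: CREATE TABLE instab (x bigint, y bigint);
-- One statement (and one round trip) inserts many rows at once
INSERT INTO instab (x, y) VALUES
    (1, 1),
    (2, 2),
    (3, 3),
    (4, 4);
```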
Now, the last one here is my favorite and, if you can use it, it's the best option, as you probably know: COPY.
COPY is not part of the SQL standard; it's a Postgres-specific API that's narrowly focused on bulk loading data. It has its own way of sending the data over the wire protocol, so it's something you have to support on the application side, but it is going to be the fastest way to load data.
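As a sketch of what that can look like on the SQL side (table, columns and file path are placeholders):

```sql
-- Assumes the illustrative table: CREATE TABLE instab (x bigint, y bigint);

-- Server-side COPY: the file has to be readable by the Postgres server process
COPY instab (x, y) FROM '/path/to/data.csv' (FORMAT csv);

-- Over the wire, clients (e.g. psql's \copy or libpq's PQputCopyData)
-- stream rows into a COPY ... FROM STDIN statement instead
COPY instab (x, y) FROM STDIN (FORMAT csv);
```

From psql, the \copy meta-command wraps COPY FROM STDIN, so you can load a file that lives on the client machine.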
In his article, Laurenz shows a breakdown of the different methods. The most important thing to realize is that COPY is just a lot, lot faster. Clearly, the worst thing you can do is single INSERTs: for 10 million rows on Laurenz's machine, which is a laptop, the single INSERTs took 9,000 seconds. COPY, on the other hand, took just 14 seconds. COPY is also quite a bit faster than bulk INSERTs: the multi-row INSERTs took roughly 50 seconds, and COPY is still about four times faster than that. Clearly, COPY is the winner; nothing is as fast as COPY. If you have to use INSERTs, do try to use multi-row INSERTs, because they are the most efficient way of doing INSERT statements.
How to tune your Postgres settings for bulk loads
Laurenz has a few tips on how you should tune your Postgres settings for bulk loads. This is important because Postgres has a checkpointing process that ultimately has to flush all the written data out to disk. How often that happens depends a lot on two settings:
- the max_wal_size setting, and
- the checkpoint_timeout setting.
Checkpoints can be triggered either by max_wal_size being reached, which is a size in gigabytes, or by checkpoint_timeout being reached, which is a time-based limit. Generally, you want max_wal_size to be large enough that you hit the checkpoint_timeout timeframe instead. For example, having your checkpoints every 5 minutes or every 10 minutes is better than having a checkpoint every 10 gigabytes of WAL.
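As a hedged sketch of what adjusting that could look like (the values are purely illustrative, not a recommendation for your workload):

```sql
-- Check the current values
SHOW max_wal_size;
SHOW checkpoint_timeout;

-- Allow more WAL to accumulate between checkpoints, and space timed
-- checkpoints further apart (illustrative values only)
ALTER SYSTEM SET max_wal_size = '16GB';
ALTER SYSTEM SET checkpoint_timeout = '15min';

-- Both settings take effect after a configuration reload
SELECT pg_reload_conf();
```

With log_checkpoints enabled, the Postgres log tells you whether each checkpoint was started because of “time” or “wal”, which makes it easy to verify you're mostly getting timed checkpoints.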
On many systems, what makes sense is to raise max_wal_size so you get more timed checkpoints, and then in some cases consider increasing checkpoint_timeout as well. Laurenz concludes this section with: for bulk loads, you should use COPY.
Postgres INSERT vs COPY impact on shared buffers
I do want to point out one more thing, and for this I actually pulled up Postgres 16 in a little benchmark.
What I want to show you here is the impact that COPY has on the contents of shared buffers, and ultimately why COPY is going to be much better for concurrent read behavior on your database. What we're going to do is use podman on my local machine and compare pgbench doing COPY against pgbench doing INSERTs.
We're going to do that through pgbench's ability to generate the data either client side or server side. This is going to generate 128 megabytes of data, which is also the default for shared_buffers on a stock Postgres install. I'm going to run this in my terminal, and we're first going to set up the COPY version of this.
I'm going to initialize this with the client-side generation of data in pgbench. This uses COPY.
I'm also going to do the same in another terminal, where I'm going to use the server-side generation. Note the uppercase “G” at the end of that pgbench initialization. That version is going to use an INSERT statement on the server side.
Now, I'm going to compare what the pg_stat_io data looks like for each of these.
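For example, a query along these lines against the Postgres 16 pg_stat_io view breaks down writes and relation extends by I/O context:

```sql
-- Which backends wrote/extended relations, and in which I/O context
SELECT backend_type, context, extends, writes
FROM pg_stat_io
WHERE object = 'relation'
  AND (extends > 0 OR writes > 0)
ORDER BY extends DESC;
```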
On the INSERT version, thanks to pg_stat_io, we can see how the data was written. You can see that in the normal I/O context, we had about 16,000 extends, which means 16,000 new pages had to be created in our tables. On the COPY side, you can see that we have the same number of extends, also about 16,000, but we have them in the bulkwrite context.
This shows you the big difference between COPY and INSERT: Postgres has a particular way in its I/O subsystem to handle bulk writes, using a ring buffer instead of just using shared buffers in the regular I/O context.
The reason that matters can easily be seen with pg_buffercache. pg_buffercache is an extension that's bundled with Postgres, which you can use to introspect what's in Postgres' shared buffers.
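For example, a query along the lines of the one in the pg_buffercache documentation shows how many 8 kB buffers each relation currently occupies in shared buffers:

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Number of shared buffers (8 kB pages) used per relation in the current database
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```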
When I was running the INSERTs, I can see about 16,000 buffers in shared buffers that are part of the pgbench_accounts table. Essentially, that means the whole table is currently in shared buffers. Now, if I run the same thing on the COPY version of the benchmark, you can see that only 2,000 buffers are in shared buffers.
This means that when you're doing INSERTs, you're fully using shared buffers to run through all the inserts, because each insert has to pass through the Postgres shared buffer cache. With COPY, Postgres also has to pass all the data through shared buffers, but it reuses a small set of buffers, the ring buffer, instead of filling up the whole cache.
This means that when you're doing a bulk load, it's much better to use COPY, because a lot of other data could remain in shared buffers; it didn't have to be evicted. In the case of INSERTs, there might have been a lot of data that you actually wanted to keep in shared buffers for reads, but it had to be moved out to make room for the INSERTs. That's why, generally, on a busy system it's also better to use COPY rather than INSERTs, to improve caching performance.
Thanks for joining us for E80 of 5mins of Postgres. Subscribe to our YouTube channel, sign up for our newsletter and follow us on LinkedIn and Twitter to get updates about new episodes!