Optimizing bulk loads in Postgres, and how COPY helps with cache performance
In episode 80 of “5mins of Postgres” we're going to talk about optimizing bulk load performance in Postgres and how the COPY ring buffer works. We compare different methods of doing INSERTs and show why COPY is the fastest option. We also use pgbench and pg_buffercache to compare client-side vs. server-side data generation and its impact on shared buffers.
Transcript
Let's take a look!
In this blog post, Laurenz Albe from the Cybertec blog talks about the different ways you can load data into Postgres most efficiently.
Using COPY for bulk loading data into Postgres vs. INSERT
He starts by creating a table and inserts 10 million rows in a couple of different ways:
- One transaction for each INSERT
- Single INSERTs in a single transaction
- Single INSERTs with a prepared statement
- Using a multi-line INSERT
- Using COPY
As mentioned above, first of all he inserts in a way where there's one transaction for each INSERT. That's the least efficient way to do it.
Then, he runs single INSERTs in a transaction, which means he's issuing individual statements, but they're all in a single transaction. This already gives quite a performance boost.
Then, he does single INSERTs, but with a prepared statement in one transaction. This avoids some of the parsing and planning overhead Postgres otherwise incurs for each subsequent statement execution.
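For a rough idea of what that pattern looks like (the table and column names here are just illustrative, not taken from Laurenz's post), it could be something like:

```sql
-- Illustrative table, not the exact one from the article
CREATE TABLE instab (x bigint, y bigint);

BEGIN;
-- Parse and plan the statement once ...
PREPARE ins (bigint, bigint) AS
    INSERT INTO instab (x, y) VALUES ($1, $2);
-- ... then execute it repeatedly, only sending parameters
EXECUTE ins(1, 1);
EXECUTE ins(2, 2);
EXECUTE ins(3, 3);
COMMIT;
```

In practice, most client drivers give you this behavior through the extended query protocol or their own prepared statement API rather than SQL-level PREPARE/EXECUTE.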
Next, he does a multi-line INSERT: an INSERT into the table with multiple rows listed in the VALUES clause. This is fairly efficient, and Laurenz iterates on it further by combining it with a prepared statement in one transaction.
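A multi-row INSERT simply lists several rows in a single VALUES clause; a minimal sketch, using the same illustrative table, would be:

```sql
-- Assumes the illustrative table: CREATE TABLE instab (x bigint, y bigint);
-- One statement (and one round trip) inserts many rows at once
INSERT INTO instab (x, y) VALUES
    (1, 1),
    (2, 2),
    (3, 3),
    (4, 4);
```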
Now, the last one here is my favorite and, if you can use it, it's the best option, as you probably know: COPY.
COPY is not part of the SQL standard; it's a Postgres-specific API that's narrowly focused on bulk loading data. It has its own way of sending the data over the wire protocol, so it's something you have to support on the application side, but it is going to be the fastest way to load data.
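As a sketch of what that can look like on the SQL side (table, columns and file path are placeholders):

```sql
-- Assumes the illustrative table: CREATE TABLE instab (x bigint, y bigint);

-- Server-side COPY: the file has to be readable by the Postgres server process
COPY instab (x, y) FROM '/path/to/data.csv' (FORMAT csv);

-- Over the wire, clients (e.g. psql's \copy or libpq's PQputCopyData)
-- stream rows into a COPY ... FROM STDIN statement instead
COPY instab (x, y) FROM STDIN (FORMAT csv);
```

From psql, the \copy meta-command wraps COPY FROM STDIN, so you can load a file that lives on the client machine.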
In his article, Laurenz shows a breakdown of the different methods. The most important thing to realize is that COPY is just a lot, lot faster. Clearly, the worst thing you can do is single INSERTs: for 10 million rows on Laurenz's machine, which is a laptop, the single INSERTs took 9,000 seconds. COPY, on the other hand, took just 14 seconds. COPY is also quite a bit faster than bulk INSERTs: the multi-row INSERTs took roughly 50 seconds, and COPY is still about four times faster than that. Clearly, COPY is the winner; nothing is as fast as COPY. If you have to use INSERTs, do try to use multi-row INSERTs, because they are the most efficient way of doing INSERT statements.
How to tune your Postgres settings for bulk loads
Laurenz has a few tips on how you should tune your Postgres settings for bulk loads. This is important because Postgres has a checkpointing process that ultimately has to flush all the written data out to disk. How often that happens depends a lot on two settings:
- the max_wal_size setting, and
- the checkpoint_timeout setting.
Checkpoints can be triggered either by max_wal_size being reached, which is a size in gigabytes, or by checkpoint_timeout being reached, which is a time-based limit. Generally, you want max_wal_size to be large enough that you hit the checkpoint_timeout timeframe instead. For example, having your checkpoints every 5 minutes or every 10 minutes is better than having a checkpoint every 10 gigabytes of WAL.
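As a hedged sketch of what adjusting that could look like (the values are purely illustrative, not a recommendation for your workload):

```sql
-- Check the current values
SHOW max_wal_size;
SHOW checkpoint_timeout;

-- Allow more WAL to accumulate between checkpoints, and space timed
-- checkpoints further apart (illustrative values only)
ALTER SYSTEM SET max_wal_size = '16GB';
ALTER SYSTEM SET checkpoint_timeout = '15min';

-- Both settings take effect after a configuration reload
SELECT pg_reload_conf();
```

With log_checkpoints enabled, the Postgres log tells you whether each checkpoint was started because of “time” or “wal”, which makes it easy to verify you're mostly getting timed checkpoints.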
On many systems, what makes sense is to raise max_wal_size so you get more timed checkpoints, and then in some cases consider increasing checkpoint_timeout as well. Laurenz concludes this section with: for bulk loads, you should use COPY.
Postgres INSERT vs COPY impact on shared buffers
I do want to point out one more thing, and for this I actually pulled up Postgres 16 in a little benchmark.
What I want to show you here is the impact that COPY has on the contents of shared buffers, and ultimately why COPY is going to be much better for concurrent read behavior on your database. What we're going to do is use podman on my local machine and compare pgbench doing COPY against pgbench doing INSERTs.
We're going to do that through pgbench's ability to generate the data either client side or server side. This is going to generate 128 megabytes of data, which is also the default for shared_buffers on a stock Postgres install. I'm going to run this in my terminal, and we're first going to set up the COPY version of this.
I'm going to initialize this with the client-side generation of data in pgbench. This uses COPY.
I'm also going to do the same in another terminal, where I'm going to use the server-side generation. Note the uppercase “G” at the end of that pgbench initialization. That version is going to use an INSERT statement on the server side.
Now, I'm going to compare what the pg_stat_io data looks like for each of these.
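For example, a query along these lines against the Postgres 16 pg_stat_io view breaks down writes and relation extends by I/O context:

```sql
-- Which backends wrote/extended relations, and in which I/O context
SELECT backend_type, context, extends, writes
FROM pg_stat_io
WHERE object = 'relation'
  AND (extends > 0 OR writes > 0)
ORDER BY extends DESC;
```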
On the INSERT version, thanks to pg_stat_io, we can see how the data was written. You can see that in the normal I/O context, we had about 16,000 extends, which means 16,000 new pages had to be created in our tables. On the COPY side, you can see that we have the same number of extends, also about 16,000, but we have them in the bulkwrite context.
This shows you the big difference between COPY and INSERT: Postgres has a particular way in its I/O subsystem to handle bulk writes, using a ring buffer instead of just using shared buffers in the regular I/O context.
The reason that matters can easily be seen with pg_buffercache. pg_buffercache is an extension that's bundled with Postgres, which you can use to introspect what's in Postgres' shared buffers.
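For example, a query along the lines of the one in the pg_buffercache documentation shows how many 8 kB buffers each relation currently occupies in shared buffers:

```sql
CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Number of shared buffers (8 kB pages) used per relation in the current database
SELECT c.relname, count(*) AS buffers
FROM pg_buffercache b
JOIN pg_class c
  ON b.relfilenode = pg_relation_filenode(c.oid)
 AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                           WHERE datname = current_database()))
GROUP BY c.relname
ORDER BY buffers DESC
LIMIT 10;
```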
When I was running the INSERTs, I can see about 16,000 buffers in shared buffers that are part of the pgbench_accounts table. Essentially, that means the whole table is currently in shared buffers. Now, if I run the same thing on the COPY version of the benchmark, you can see that only 2,000 buffers are in shared buffers.
This means that when you're doing INSERTs, you're fully using shared buffers to run through all the inserts, because each insert has to pass through the Postgres shared buffer cache. With COPY, Postgres also has to pass all the data through shared buffers, but it reuses a small set of buffers, the ring buffer, instead of filling up the whole cache.
This means that when you're doing a bulk load, it's much better to use COPY, because a lot of other data could remain in shared buffers; it didn't have to be evicted. In the case of INSERTs, there might have been a lot of data that you actually wanted to keep in shared buffers for reads, but it had to be moved out to make room for the INSERTs. That's why, generally, on a busy system it's also better to use COPY rather than INSERTs, to improve caching performance.
Thanks for joining us for E80 of 5mins of Postgres. Subscribe to our YouTube channel, sign up for our newsletter and follow us on LinkedIn and Twitter to get updates about new episodes!