5mins of Postgres E10: max_wal_size, Postgres full page writes and UUID vs BIGINT primary keys
In today’s episode 10, we'll take a look at max_wal_size, full-page writes and why you might want to choose BIGINT instead of a UUID for your primary key column.
Let's dive in!
In this blog post by Dave Page, he talks about the
max_wal_size parameter. He talks about how they've gotten up to 10 times performance increase after tuning the
The reason that
max_wal_size exists is to control how much of the Write Ahead Log in Postgres you want to keep after a checkpoint. A checkpoint is a moment where Postgres writes all the pending changes that are stored in the WAL to disk and persists a known safe state of the database.
Oftentimes, you would want to increase
max_wal_size. The other thing to consider with
max_wal_size is that when
max_wal_size is reached it causes an immediate checkpoint to occur, and these immediate checkpoints are pretty expensive because Postgres tries to get them done quickly.
max_wal_size is not reached, by default checkpoints happen on a timed schedule. Checkpoints happen every
checkpoint_timeout parameter intervals. This could be five minutes or 30 minutes. The idea is that not all the writes happen at the same time, but instead they're spread out over the whole time of that checkpoint.
What you want is
max_wal_size to be large enough that you can have timed checkpoints and
checkpoint_timeout to be small enough to cause timed checkpoints instead of WAL-based checkpoints. When you're tuning this, you may also want to tune other associated parameters, like
checkpoint_completion_target, which affects timed check points.
What you want is max_wal_size to be large enough that you can have timed checkpoints and checkpoint_timeout to be small enough to cause timed checkpoints instead of WAL-based checkpoints.
There's also a
full_page_writes parameter which we'll get to in a moment. If you want to monitor whether you have timed checkpoints or WAL-based checkpoints, there's a
pg_stat_bgwriter view in Postgres that tells you how many of your checkpoints were done overall and then how many of them were timed.
Usually, for best performance, you would want more timed checkpoints, rather than WAL-based checkpoints.
When you see there are a lot of WAL-based checkpoints, then it probably makes sense to tune both
max_wal_size and the
Now, there is another blog post this reminded me of, that Tomas Vondra wrote a couple of years ago. In this blog post he talks about the impact of full-page writes.
Postgres, by default, uses eight kilobyte pages. Now, an eight kilobyte page would usually be split into two four kilobyte pages on a Linux file system.
That means when you have a write in Postgres, and you write out this eight kilobyte page, two separate writes happen on the Linux file system. If your server crashes right in that moment it can cause part of the Postgres page to be written out, but the other part not being written out!
Later on, when your system recovers, the recovery itself is not safe because the recovery might try to read part of the page that's old and part of the page that's new, resulting in a system that's corrupted, but you don't really know what's a known good state.
Postgres solves this torn page problem using the Write Ahead Log.
The Write Ahead Log solves that by recording what's called a full-page image. The first time after a checkpoint the page is modified, the whole page is written into the WAL. The known good state is written to the WAL for the page and on recovery it doesn't need to go read the potentially bad data from the main table heap. Instead, it reads the full page image from the WAL stream and replaces the on disk representation with that page. This is data corruption prevention.
Important to mention: Changing a single byte on an eight kilobyte page will log the whole page into WAL. It's really easy to make small changes that amplify significantly. This also means that typically there's a short burst of full-page writes right after a checkpoint.
Where you will see this pretty significantly is in the difference between using a UUID and a BIGSERIAL or BIGINT data type for your primary key column. Tomas shows a test where they saw two gigabytes of WAL being generated with BIGSERIAL, but 40 gigabytes with a random UUID. So there is a 20 times difference between the two data types serving the same benefit, which is uniquely identifying a record!
The reason this occurred is in a B-tree index, the UUID, if it's random, the index entries will be placed on different leaf pages. Versus with a BIGSERIAL, it's very likely that will land them on the same pages. If you use BIGSERIAL, then you’ll get consecutive values and so in the ordered B-tree index, they get placed next to each other. Versus a UUID gets placed all over the place if it's random.
The conclusion here is it's highly recommended that you don't use random UUIDs for write heavy tables, but instead you either use integers that count up, or you use something like a timestamp prefixed UUID, called ULIDs, UL IDs, those are much more effective while still using a UUID data type.
It's highly recommended that you don't use random UUIDs for write heavy tables.