Using pgbench to load data faster, and the random_normal function in PG16
Today, in episode 50 of 5mins of Postgres, we're going to talk about two things. We're going to talk about Kaarel Moppel's post on how to generate lots of test data with Postgres, using pgbench to load data faster, and we're going to talk about the new "random_normal" function that was recently added to Postgres 16.
Let's start this off!
In this blog post, Kaarel describes how to generate a lot of data for testing your disks or testing some of the database configuration, or doing other kinds of tests in Postgres.
Kaarel points out that you may want to do that when you're trying to understand what happens when your data doesn't fit into RAM anymore, or when testing how DDL operations behave:
- What happens if I create a table, and then I "ALTER COLUMN"?
- Is the "ALTER COLUMN" going to rewrite the table?
- Is it going to be slow or is it going to be fast because Postgres does not do a table rewrite?
You may also want to test your "low on disk" alerting setup. You might want to do things like testing your block latency, or looking at your I/O scheduler configuration. These are the reasons why you may want to have a large table to test with.
The simplest way of just generating a bunch of data is to use generate_series. We'll actually come back to generate_series at the end of this episode. In general, generate_series is very useful and powerful. With it, we can generate 30 million rows, which is roughly a 1 GB table, within a few seconds.
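As a minimal sketch of that approach (the table and column names are made up for illustration; a single integer column of 30 million rows comes out at roughly 1 GB):

```sql
-- ~30 million rows, roughly 1 GB on disk
CREATE UNLOGGED TABLE test_data AS
SELECT generate_series(1, 30000000) AS id;
```

You can add more columns to the SELECT list (e.g. `md5(id::text)`) if you want wider rows.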
If you want to create something more interesting with at least a handful of tables, pgbench is a great way to do that. Let's continue by looking at how we can use pgbench and its different initialization options.
We'll start with just the scale factor, which defines how large the test dataset should be. One scale factor unit equals 100,000 banking schema account rows. The banking schema is the example benchmark schema used by pgbench. A scale factor of 5,000 generates roughly 73 GB of data. You can also work backwards from a target size, e.g. calculate which scale factor you'd need for 100 GB of data. But for now, we'll just use 5,000 as the scale factor. In Kaarel's test, it takes 870 seconds to load that data into the Postgres database.
Oftentimes, when you're doing tests like this, you don't necessarily need the data to be persisted, and oftentimes the whole WAL logging process in Postgres will slow things down and skew your tests. Postgres has the concept of UNLOGGED tables. Unlogged tables are not WAL-logged: they're not crash safe and they're also not replicated to a replica. But they are very useful, and they're actually more useful than temporary tables because they are visible to all the users on the database server, they get auto-vacuumed and auto-analyzed, and they don't bloat the Postgres catalog as much.
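pgbench can create its benchmark tables as unlogged directly (the database name here is an assumption):

```shell
# skip WAL for the benchmark tables: faster load, but not crash-safe
pgbench -i -s 5000 --unlogged-tables testdb
```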
Another thing we can do, if we just care about the table having "a bunch of data", is to modify the pgbench init phase options. You can tell pgbench which steps it should run in the initialization phase, each identified by a letter (by default, pgbench runs "dtgvp"):
- "d" for dropping existing tables
- "t" for creating new tables
- "g" for generating data
- "v" for vacuum
- "p" for primary keys
- "f" foreign keys.
Oftentimes, you may want to skip "p" or "f" to not have indexes at all. Kaarel also skips the vacuum option in his article.
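Skipping the vacuum, primary key, and foreign key steps looks like this (the database name is an assumption):

```shell
# only drop, create tables, and generate data: no vacuum, no indexes
pgbench -i -s 5000 -I dtg testdb
```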
Doing this gives us a huge performance boost because indexes are expensive. A 3x improvement just in terms of data load time! If you just want the data, this is a great way of doing that.
Let's say we really just care about benchmarking I/O. We want to see how good or bad the I/O subsystem of our server is. We can use the fillfactor storage parameter in Postgres to control how much data goes into each table page. The default for tables is 100%: Postgres will try to get as much data as possible into a single data page. But with the fill factor option, we can say: only fill 10%, so the same rows take roughly ten times the disk space. If we just care about generating some I/O, this is a good option.
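pgbench exposes this during initialization via its fill factor option (the database name is an assumption):

```shell
# fill each page only 10%, inflating the on-disk size ~10x for I/O tests
pgbench -i -s 5000 --fillfactor=10 testdb
```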
There are various other ways to speed this up. One thing I found interesting in Kaarel's article is a strategy that uses the partitions option in pgbench: instead of generating all the data into one big table, you create a partitioned table and have clients work on different partitions in parallel, inserting into the table itself. He generates a bunch of test data first and then keeps inserting that data into itself. This keeps the work somewhat separate and lets you stress the I/O on the system even more and see where it fails first.
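Creating the accounts table partitioned looks like this (the database name is an assumption; the partition count is just an example):

```shell
# split pgbench_accounts into 16 hash partitions
pgbench -i -s 5000 --partitions=16 --partition-method=hash testdb
```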
The other thing I want to mention today is a new feature in Postgres 16: a new function called "random_normal" (have a look at depesz's article on it). It lets us generate a random value that's not uniformly random, but instead follows a normal distribution.
We can use this to generate what looks like a normal curve. If you're trying to model a certain type of dataset, being able to draw values from a normal distribution, rather than repeating the same values over and over, will be quite useful in Postgres 16 for generating test data.
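A quick sketch of what that looks like (random_normal takes a mean and a standard deviation, both defaulting to 0 and 1; the numbers here are made up):

```sql
-- PG16+: five values clustered around a mean of 100 with stddev 15
SELECT round(random_normal(100, 15)) AS value
FROM generate_series(1, 5);
```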
I hope this upcoming year we'll see lots of benchmarking and your Postgres databases continue to perform well.
I wish you a happy new year and thank you so much for listening, this was 5mins of Postgres. Subscribe to our YouTube channel to hear about next week's episode and talk to you next week.