The surprising logic of the Postgres work_mem setting, and how to tune it

In today’s E119 of “5mins of Postgres” we discuss tuning the Postgres work_mem setting for your workload, and why it can be quite confusing to interpret the meaning of work_mem correctly for a given query. We also discuss the impact of hash_mem_multiplier, and why it's default changed from 1.0 to 2.0 in Postgres 15.



Share this episode: Click here to share this episode on LinkedIn. Feel free to sign up for our newsletter and subscribe to our YouTube channel.


Transcript

We'll start with this blog post by Shaun Thomas on the Tembo blog from earlier this week. This is a pretty good comprehensive post where Shaun talks about all different aspects of Postgres' memory settings.

But I want to focus on work_mem specifically and add a few more details beyond what Shaun has in this post. So, first of all, work_mem by default in Postgres is set to 4MB. And as Shaun notes here, this is one of the first values that users might want to modify in an attempt to make queries run faster. And oftentimes the effect of this is that if you increase this, Postgres will be able to do more operations in memory instead of having to do them on disk, which oftentimes leads to a speed up in a query.

The risk of too high work_mem: Running out of memory

Now if you get this wrong however, if you set this too high and you have a system with a lot of concurrency, what you will see is that Postgres will crash with a message like "Out of memory: Killed process ... (postgres)", as described in the Postgres documentation.

And usually what this means is the Linux out of memory (OOM) killer went and killed your Postgres process, unfortunately, oftentimes the postmaster process, the root of the Postgres process hierarchy. Which means that ultimately your Postgres will crash completely if you've exhausted the memory thanks to having too high work_mem.

One of the ways you can address this is you can use the "overcommit_memory" setting and set that to 2, to essentially prevent the Linux kernel from over-committing memory:

sysctl -w vm.overcommit_memory=2

Depending, if you have various sidecars on your server, or if you want to use swap, this may or may not make sense, but if you have this set, you're much safer to set "work_mem" more aggressively because you don't crash the whole server.

But let's come back a little bit to, how can we set work_mem to the right setting?

Surprising facts about work_mem

So one of the most confusing things about work_mem, and I had to remind myself again as part of researching this episode is that, as Andres Freund noted in a mailing list post earlier this week:

One thing to note is that postgres' work_mem is confusing - it applies to individual query execution nodes, not the whole query.

What this means is that oftentimes for a complex query, Postgres will use multiples of work_mem. Let's illustrate this with a specific example.

Demonstrating work_mem's memory usage per plan node

Now, I'm going to create a table locally here on Postgres 17 beta1:

CREATE TABLE test AS SELECT c, c c2 FROM generate_series(0, 10000000) x(c);

This table has 10 million records, and it's 346 MB in size.

And now I'm going to execute an EXPLAIN ANALYZE of a simple query that selects from the table, ordering by the first column and then it reorders that result by a different column:

EXPLAIN ANALYZE WITH x AS (SELECT * FROM test ORDER BY c) SELECT * FROM x ORDER BY c2;
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=2496995.72..2522018.44 rows=10009088 width=8) (actual time=1816.799..2253.805 rows=10000001 loops=1)
   Sort Key: test.c2
   Sort Method: quicksort  Memory: 468872kB
   ->  Sort  (cost=1308175.94..1333198.66 rows=10009088 width=8) (actual time=892.513..1291.049 rows=10000001 loops=1)
         Sort Key: test.c
         Sort Method: quicksort  Memory: 468872kB
         ->  Seq Scan on test  (cost=0.00..144378.88 rows=10009088 width=8) (actual time=0.031..427.474 rows=10000001 loops=1)
 Planning Time: 0.084 ms
 Execution Time: 2612.252 ms
(9 rows)

As we can see, that uses an in-memory sort. And that's because I have already set the work_mem to a sufficiently high number, 458 MB of memory are needed here to have Postgres sort this in memory, as opposed to sorting on disk:

SHOW work_mem;
 work_mem 
----------
 458MB
(1 row)

Now, what we want to prove here is that Postgres is actually going to use not just 458 MB of memory once, but it's actually going to use it twice, because we have two Sort nodes here.

And so what I'm going to do here is I'm going to execute the EXPLAIN ANALYZE again, and whilst that's executing, I'm going to log the backend memory contexts from a different connection (61 is the PID of the process we want information about):

SELECT pg_log_backend_memory_contexts(61);
 pg_log_backend_memory_contexts 
--------------------------------
 t
(1 row)

This is a bit tricky to get right from a timing perspective, but essentially what this will do is for a running process, it will dump all the current memory allocations that Postgres has. And so what we can see here in the Postgres logs, where this is output, we can see that this process uses 951 MB of memory whilst that query was executing:

2024-06-16 19:09:08.848 UTC [61] LOG:  logging memory contexts of PID 61
...
2024-06-16 19:09:08.848 UTC [61] LOG:  level: 4; TupleSort main: 240132128 total in 2 blocks; 6848 free (0 chunks); 240125280 used
2024-06-16 19:09:08.848 UTC [61] LOG:  level: 5; TupleSort sort: 8192 total in 1 blocks; 7952 free (0 chunks); 240 used
2024-06-16 19:09:08.848 UTC [61] LOG:  level: 6; Caller tuples: 243269520 total in 39 blocks; 3268248 free; 240001272 used
2024-06-16 19:09:08.848 UTC [61] LOG:  level: 4; TupleSort main: 240132128 total in 2 blocks; 6848 free (0 chunks); 240125280 used
2024-06-16 19:09:08.848 UTC [61] LOG:  level: 5; TupleSort sort: 8192 total in 1 blocks; 7952 free (0 chunks); 240 used
2024-06-16 19:09:08.848 UTC [61] LOG:  level: 6; Caller tuples: 226492304 total in 37 blocks; 6352488 free; 220139816 used
...
2024-06-16 19:09:08.848 UTC [61] LOG:  Grand total: 951952232 bytes in 312 blocks; 10260760 free (242 chunks); 941691472 used

We can also prove that this memory is being used by sorts. In the above output of "pg_log_backend_memory_contexts" you will see "TupleSort main", "TupleSort sort" and "Caller tuples". These are essentially both related to the sort operation occurring, "tuple sort" is the internal mechanism in Postgres that does the sorting.

And so you can see that per Sort node we are using about 480 MB of memory, which does roughly match the memory that Postgres also output in the EXPLAIN plan (Sort Method: quicksort Memory: 468872kB).

Now, of course, if you did, you know, a much lower work_mem setting, if we did "work_mem = 4MB" for example, what you would see is that the EXPLAIN ANALYZE now uses the slower on disk sort:

SET work_mem = '4MB';
EXPLAIN ANALYZE WITH x AS (SELECT * FROM test ORDER BY c) SELECT * FROM x ORDER BY c2;
                                                          QUERY PLAN                                                          
------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=3041206.80..3066204.32 rows=9999010 width=8) (actual time=2656.435..3163.010 rows=10000001 loops=1)
   Sort Key: test.c2
   Sort Method: external merge  Disk: 176160kB
   ->  Sort  (cost=1580243.69..1605241.21 rows=9999010 width=8) (actual time=1289.112..1814.388 rows=10000001 loops=1)
         Sort Key: test.c
         Sort Method: external merge  Disk: 176160kB
         ->  Seq Scan on test  (cost=0.00..144278.10 rows=9999010 width=8) (actual time=0.525..416.553 rows=10000001 loops=1)
 Planning Time: 0.756 ms
 Execution Time: 3423.307 ms
(9 rows)

But it does also use a lot less memory:

2024-06-16 20:48:07.238 UTC [61] LOG:  logging memory contexts of PID 61
...
2024-06-16 20:48:07.239 UTC [61] LOG:  level: 4; TupleSort main: 8192 total in 1 blocks; 6328 free (0 chunks); 1864 used
2024-06-16 20:48:07.240 UTC [61] LOG:  level: 5; TupleSort sort: 2949576 total in 14 blocks; 264552 free (59 chunks); 2685024 used
2024-06-16 20:48:07.240 UTC [61] LOG:  level: 6; Caller tuples: 8080 total in 1 blocks; 8048 free; 32 used
2024-06-16 20:48:07.240 UTC [61] LOG:  level: 4; TupleSort main: 2105384 total in 2 blocks; 6848 free (0 chunks); 2098536 used
2024-06-16 20:48:07.240 UTC [61] LOG:  level: 5; TupleSort sort: 278648 total in 7 blocks; 144304 free (26 chunks); 134344 used
2024-06-16 20:48:07.240 UTC [61] LOG:  level: 6; Caller tuples: 8080 total in 1 blocks; 8048 free; 32 used
...
2024-06-16 20:48:07.240 UTC [61] LOG:  Grand total: 7267728 bytes in 256 blocks; 1046744 free (327 chunks); 6220984 used

So, this really is the trade-off here, if I want more performance now I used 1 GB of memory.

Parallel query will use multiples of work_mem

Now, there are two more aspects that I want to mention. The first one is that parallel query will also increase how much "work_mem" is being used, as Andres noted in the same mailinglist discussion:

When you use parallel workers, each of the parallel workers can use the "full" work_mem, rather than work_mem being evenly distributed across workers.

[In the discussed plan on the mailinglist] we can see each of the Hash nodes use ~1GB, which is due to (1 leader + 4 workers) * work_mem = 5 * 200MB = 1GB.

In parallel query, Postgres allows each worker as well as the leader process to use up to work_mem for each of the query execution notes, which means if you have, for example, 4 parallel workers (max_parallel_workers_per_gather = 4) and you have 1 leader, the process that initiates the parallel query.

And then you really have 5 in this case, times work_mem, which is 200 MB in this example, meaning 1 GB. So in this case, that means each of the plan nodes is allowed to use 1 GB of memory if it executes in parallel.

Hash operations will use multiples of work_mem (hash_mem_multiplier)

The other case to know about is the "hash_mem_multiplier" setting. So this setting controls how much memory, hash based operations can use. And the important thing to know about here is that the default as of Postgres 15 is 2.0, which means that a hash based operation can use twice the work_mem base amount. And you may even want to set this higher, if you want to allow more memory for hash operations compared to sort operations for a given work_mem base amount.

And so the background here is, as Peter Geoghegan noted in the 2022 mailing list thread about the Postgres 15 change, that essentially the characteristics for how much performance increase you get for hashing operation as compared to a sorting operations, is very different:

The relationship between memory availability and overall performance/throughput has very significant differences when we compare sort-based nodes with hash-based nodes. It's hard to make reliable generalizations about how the performance/throughput of hash-based nodes will be affected as memory is subtracted, even if we optimistically assume that requirements are fairly fixed. Data cardinality tends to make the picture complicated, just for starters. But overall, as a general rule, more memory tends to make everything go faster.

On the other hand, sort-based nodes (e.g., GroupAggregate) have very predictable performance characteristics, and the possible upside of allowing a sort node to use more memory is quite bounded. There is a relatively large drop-off when we go from not being able to fit everything in memory to needing to do an external sort. But even that drop-off isn't very big -- not in absolute terms. More importantly, there is hardly any impact as we continue to subtract memory (or add more data). We'll still be able to do a single pass external sort with only a small fraction of the memory needed to sort everything in memory, which (perhaps surprisingly) is mostly all that matters.

For sort based operations, whilst you do see a drop-off when things don't fit in memory and Postgres needs to do an external sort, in practice this has a fixed penalty. Versus for hashing operations, it can oftentimes be the case that you will see drastic performance difference if something can be hashed in memory, versus Postgres might not even choose a hash operation if things spill to disk.

And so it may make sense to experiment with using a higher "hash_mem_multiplier", but keeping work_mem low. The Postgres documentation recommends going up to 8.0 for example in the hash_mem_multiplier, and so really the motivation here is you could keep your work_mem lower whilst increasing your hash_mem_multiplier to essentially emphasize that hash operations should be fast, but big sorts are okay to spill to disk.

How should I tune work_mem for a given system?

Now, let's close with, for a given system what should I do?

Christophe Pettus wrote a good post about this back in 2023, where he essentially said do not use a formula, because work_mem is about the most workload dependent setting in Postgres that you can find.

And so really what you should do is you should do some iterative testing. So what Christophe suggests here, start with a certain work_mem setting, lets suppose in the tens of megabytes. Then run your system on a realistic production load with "log_temp_files = 0". This will cause your logs to contain temporary file messages like:

LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp15967.0", size 200204288
STATEMENT: alter table pgbench_accounts add primary key (aid)

If there are no temporary file messages, then you're done because probably it's a safe setting, and assuming you have plenty of memory available, you don't really have to change anything.

If you however see a lot of temporary file messages, you could increase work_mem. So the important thing to note here is that you could set work_mem on a per-connection or per-user basis. So you might want to give your interactive queries more work_mem so they return faster, but the background process that maybe sort a lot of data, you may want to intentionally keep those on disk, so that they don't consume all the memory. And so ultimately what Christophe is saying, do that iterative process and then find a given work_mem setting.

If you absolutely must have a formula, Christophe gives a suggestion here:

(average freeable memory * 4) / max_connections

But again, this really is just meant for, what's a conservative default, that's not the tiny default of 4 MB.

In reality, you really should be looking at your temporary files and you should make an informed decision, potentially using the hash_mem_multiplier to distinguish between hash and sort operations as well.

I hope you learned something new from E119 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and X/Twitter to get updates about new episodes!

What we have discussed in this episode of 5mins of Postgres


Enjoy blog posts like this?

Get them once a month to your inbox