Waiting for Postgres 19: Reduced timing overhead for EXPLAIN ANALYZE with RDTSC

In today’s E122 of “5mins of Postgres” we're talking about the upcoming Postgres 19 release: how a change in Postgres' instrumentation handling reduces the overhead of timing measurements in EXPLAIN ANALYZE by using the RDTSC instruction, and why this will allow turning on auto_explain.log_timing for more workloads.

We dive into the recently committed change that I (Lukas) authored together with Andres Freund and David Geier. See the full transcript with examples below.





Transcript

Welcome back to 5mins of Postgres! Today we talk about a change in the upcoming Postgres 19 release that will lower timing overhead for EXPLAIN ANALYZE.

This is a change that I contributed myself together with Andres Freund and David Geier, and we've actually worked on it for a couple of years now. But for this release, we sat down and really figured out all the little details that make this work. It was committed recently to the Postgres 19 development branch, and to be clear, it might still be taken out of the final release if any issues are found, but right now I think there's a decent chance it stays in.

Postgres 19 will be released in September or October, and feature freeze just happened and the beta release will come out sometime in May this year. Now let me show you a little bit more about what this change is about.

The problem of slow timing measurements

Back in 2020, Andres Freund started a mailing list thread where he was essentially saying: when you run EXPLAIN ANALYZE on a query, the query looks a lot slower than it actually is. In this example, Andres created a table with 50 million rows:

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;

Very simple table, and then he ran a COUNT(*) on that table:

SELECT count(*) FROM lotsarows;

If I run the COUNT(*) without any EXPLAIN, I get a runtime of about 1,900 milliseconds. If I run EXPLAIN ANALYZE with TIMING OFF (back in that release, BUFFERS was off by default as well), I get a runtime of about 2,300 milliseconds. Now, if I turn TIMING ON, the runtime more than doubles compared to the actual query time: instead of taking 1,900 milliseconds, the query now takes 4,200 milliseconds:

-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)

-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)

-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)

And first of all, that's a problem because it skews what my actual performance is. If I'm doing testing with EXPLAIN ANALYZE and don't realize that timing has overhead, I'll think my query is slower than it actually is. The other issue is with auto_explain: usually we recommend people turn log_timing off. For example, in pganalyze's install instructions we like recommending auto_explain, but today we always tell people to turn timing off, because we don't think it's safe to use on most production systems without knowing your workload better.
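
For reference, a typical auto_explain setup following that guidance might look like this in postgresql.conf (the threshold values here are illustrative, not pganalyze's exact recommendations):

```ini
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '1s'  # log plans for queries slower than this
auto_explain.log_analyze = on         # include actual row counts
auto_explain.log_timing = off         # skip per-node timing to keep overhead low
auto_explain.log_format = 'json'
```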

If we look at the problem in more detail, Andres did a little profile to see where that overhead is coming from:

-   95.49%     0.00%  postgres     postgres                 [.] agg_retrieve_direct (inlined)
   - agg_retrieve_direct (inlined)
      - 79.27% fetch_input_tuple
         - ExecProcNode (inlined)
            - 75.72% ExecProcNodeInstr
               + 25.22% SeqNext
               - 21.74% InstrStopNode
                  + 17.80% __GI___clock_gettime (inlined)
               - 21.44% InstrStartNode
                  + 19.23% __GI___clock_gettime (inlined)
               + 4.06% ExecScan
      + 13.09% advance_aggregates (inlined)
        1.06% MemoryContextReset

RDTSC vs RDTSCP

So first of all, in that profile we see the InstrStartNode and InstrStopNode calls. Those are calls that Postgres adds when instrumentation is on, so when I'm running an EXPLAIN ANALYZE, and we can see that most of their time is spent in the clock_gettime function. On a modern Linux system, this is not actually a syscall. Instead, it directly executes RDTSCP, a special CPU instruction that reads what's called the time stamp counter (TSC).

Think of the time stamp counter as a value that keeps counting up, counting cycles, but in a way that isn't influenced by power state changes or other effects that might skew it. So it's actually pretty reliable. Now the problem is that RDTSCP waits until all prior instructions have finished, and by instructions we mean CPU instructions. So the timing is not just getting the time, it's also blocking other activity from occurring.

It effectively prevents the CPU from running instructions in parallel. Now, there is a different instruction called RDTSC, without the P, and this instruction does not block other concurrent instructions. With that in the picture, the performance overhead of the timing drops drastically.

In this particular example Andres ran at the time, instead of the query taking 4,200 milliseconds, it actually took only 2,600 milliseconds:

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                          QUERY PLAN                                                           │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1)                         │
│   ->  Seq Scan on lotsarows  (cost=0.00..721239.16 rows=50000016 width=0) (actual time=0.006..1512.886 rows=50000000 loops=1) │
│ Planning Time: 0.028 ms                                                                                                       │
│ Execution Time: 2610.256 ms                                                                                                   │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)

Time: 2610.589 ms (00:02.611)

This was mainly a prototype at the time. A lot of the complexity, and part of the reason why this took so long to get implemented, is that we needed to make sure this works on all kinds of different systems that Postgres gets used on.

The new timing_clock_source Postgres setting

One of the things we ended up adding, based on discussions on the mailing lists, is a new setting to control whether this gets used. With the new "timing_clock_source" setting you control the clock source: by default, the TSC clock source is used automatically on x86-64 CPUs that are modern enough to have the right instructions; alternatively, you can force the old way of using the system clock, or explicitly set the TSC clock source.

Now in Postgres, we're splitting this into two different use cases. For things like EXPLAIN ANALYZE, where we don't necessarily care about a very short, exactly precise measurement, it's more about the cumulative time taken, we use the RDTSC instruction. In other cases, where we care about higher precision and it's still a short runtime, we use the RDTSCP instruction, which has higher overhead. There is a lot of supporting code to make this work in different environments; if you're interested in how that works, look at the "instr_time.c" file.

Live demo on Postgres 19 development branch

I want to show you an actual example of what this improvement looks like in the 19 branch. Here I'm using an SSH client, because my own machine is actually a MacBook, and this initial release only focuses on getting the fast timing in for x86-64. ARM has a similar instruction, but there are some outstanding issues for ARM machines. So right now I'm connected via SSH to a different machine. This machine sits right next to me, it's this little Framework Desktop here, but that one is an x86 machine.

And so now, with my Postgres branch already built, I'm first going to run the pg_test_timing utility, which measures the overhead of taking timing measurements. Here we get three different measurements:

System clock source: clock_gettime (CLOCK_MONOTONIC)
Average loop time including overhead: 18.80 ns
Histogram of timing durations:
   <= ns   % of total  running %      count
       0       0.0000     0.0000          0
       1       0.0000     0.0000          0
       3       0.0000     0.0000          0
       7       0.0000     0.0000          0
      15      12.7533    12.7533   20353931
      31      87.2357    99.9890  139225930
...

Clock source: RDTSCP
Average loop time including overhead: 16.94 ns
Histogram of timing durations:
   <= ns   % of total  running %      count
       0       0.0000     0.0000          0
       1       0.0000     0.0000          0
       3       0.0000     0.0000          0
       7       0.0000     0.0000          0
      15      31.1807    31.1807   55204578
      31      68.8159    99.9966  121836600
...

Fast clock source: RDTSC
Average loop time including overhead: 11.69 ns
Histogram of timing durations:
   <= ns   % of total  running %      count
       0       0.0000     0.0000          0
       1       0.0000     0.0000          0
       3       0.0000     0.0000          0
       7       0.0000     0.0000          0
      15      83.5188    83.5188  214321443
      31      16.4789    99.9977   42287217
...

TSC frequency in use: 2993629 kHz
TSC frequency from calibration: 2994357 kHz
TSC clock source will be used by default, unless timing_clock_source is set to 'system'.

We get the built-in system clock source, clock_gettime, which takes about 18.8 nanoseconds per time measurement. Next is RDTSCP, which, again, blocks out-of-order instructions; that one takes 16.9 nanoseconds. And then with RDTSC it takes 11.7 nanoseconds. So clearly RDTSC has less overhead here, roughly 40% less in this test timing program. I also see which frequency gets used, and whether the new clock source will be used by default. If I don't want to use it, I have to set timing_clock_source to system explicitly.

The only reason that would make sense, by the way, is if your TSC is emulated in some way, for example in certain virtualized environments, so that timing measurements are not stable. In that case, timing_clock_source = system might give you stable measurements.

Now I can run psql and show you the actual example. I already have the table that Andres created loaded here as well. First of all, I'll turn on \timing; this is on the psql side and just gives me the client-side runtime. Now I'm doing a SELECT COUNT(*):

postgres=# SELECT count(*) FROM lotsarows;
  count   
----------
 50000000
(1 row)

Time: 268.466 ms

This is a more modern machine, so counting the same 50 million rows goes a little faster: I get about 260 - 270 milliseconds of runtime here.

Let's start with EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF). I'm not doing a lot of extra work really, just counting how many rows got returned:

postgres=# EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) SELECT count(*) FROM lotsarows;
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=482655.97..482655.98 rows=1 width=8) (actual rows=1.00 loops=1)
   ->  Gather  (cost=482655.75..482655.96 rows=2 width=8) (actual rows=3.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=481655.75..481655.76 rows=1 width=8) (actual rows=1.00 loops=3)
               ->  Parallel Seq Scan on lotsarows  (cost=0.00..429572.40 rows=20833340 width=0) (actual rows=16666666.67 loops=3)
 Planning Time: 0.174 ms
 Execution Time: 297.043 ms
(8 rows)

Time: 297.535 ms

That's pretty simple.

And then if I turn TIMING ON, still with the TSC clock source, I get a measurement of about 350 milliseconds:

postgres=# EXPLAIN (ANALYZE, TIMING ON, BUFFERS OFF) SELECT count(*) FROM lotsarows;
                                                                      QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=482655.97..482655.98 rows=1 width=8) (actual time=349.687..351.719 rows=1.00 loops=1)
   ->  Gather  (cost=482655.75..482655.96 rows=2 width=8) (actual time=349.606..351.709 rows=3.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=481655.75..481655.76 rows=1 width=8) (actual time=347.932..347.933 rows=1.00 loops=3)
               ->  Parallel Seq Scan on lotsarows  (cost=0.00..429572.40 rows=20833340 width=0) (actual time=0.149..201.918 rows=16666666.67 loops=3)
 Planning Time: 0.186 ms
 Execution Time: 351.773 ms
(8 rows)

Time: 352.171 ms

I'm still seeing what I would say is about a 20 - 25% overhead here. So it's not free, but it's substantially better than with the system clock source.

If I do SET timing_clock_source = system, and I do the timing again, you see a drastic difference:

SET timing_clock_source = 'system';
EXPLAIN (ANALYZE, TIMING ON, BUFFERS OFF) SELECT count(*) FROM lotsarows;
                                                                      QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=482655.97..482655.98 rows=1 width=8) (actual time=799.624..801.496 rows=1.00 loops=1)
   ->  Gather  (cost=482655.75..482655.96 rows=2 width=8) (actual time=799.535..801.488 rows=3.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=481655.75..481655.76 rows=1 width=8) (actual time=797.885..797.887 rows=1.00 loops=3)
               ->  Parallel Seq Scan on lotsarows  (cost=0.00..429572.40 rows=20833340 width=0) (actual time=0.073..417.005 rows=16666666.67 loops=3)
 Planning Time: 0.115 ms
 Execution Time: 801.529 ms
(8 rows)

Time: 801.979 ms

Just for clarity, if I ran a regular SELECT count(*) here, it would take 260 milliseconds to run the actual query:

postgres=# SELECT count(*) FROM lotsarows;
  count   
----------
 50000000
(1 row)

Time: 263.824 ms

And with the old system clock source, I get a runtime of 800 milliseconds, versus 355 milliseconds with the new TSC clock source:

SET timing_clock_source = 'tsc';
EXPLAIN (ANALYZE, TIMING ON, BUFFERS OFF) SELECT count(*) FROM lotsarows;
                                                                      QUERY PLAN                                                                      
------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=482655.97..482655.98 rows=1 width=8) (actual time=353.401..355.238 rows=1.00 loops=1)
   ->  Gather  (cost=482655.75..482655.96 rows=2 width=8) (actual time=353.292..355.229 rows=3.00 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=481655.75..481655.76 rows=1 width=8) (actual time=351.081..351.082 rows=1.00 loops=3)
               ->  Parallel Seq Scan on lotsarows  (cost=0.00..429572.40 rows=20833340 width=0) (actual time=0.131..200.584 rows=16666666.67 loops=3)
 Planning Time: 0.150 ms
 Execution Time: 355.291 ms
(8 rows)

Time: 355.690 ms

So a drastic difference, and to me this also makes a difference for many systems: I would now feel comfortable turning auto_explain on with log_timing enabled, because most queries are not this extreme. To be clear, many realistic queries have much less repetition of these instrumentation start and stop calls.

Previously you would've seen 5-10% overhead on average; now you'll probably see 2-3% on average, which for many systems is a good trade-off to have the full instrumentation data available in auto_explain.

There are many other new features coming up in Postgres 19, and you'll hear more about them in upcoming episodes.

I hope you learned something new from E122 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn to get updates about new episodes!
