pganalyze Blog - Articles by Lukas Fittl

Introducing pg_query for Postgres 16 - Parsing SQL/JSON, Windows support, PL/pgSQL parse mode & more

Lukas Fittl — Thu, 11 Jan 2024 12:00:00 GMT

Parsing SQL queries and turning them into a syntax tree is not a simple task. Especially when you want to support special syntax that is specific to a particular database engine, like Postgres. And when you’re working with queries day in day out, like we do at pganalyze, understanding the actual intent of a query, which tables it scans, which columns it filters on, and such, is essential.

Almost 10 years ago, we determined that in order to create the best product for monitoring and optimizing Postgres, we needed to parse queries the way that Postgres does. We released the first version of pg_query back in 2014, and have seen many different projects outside of pganalyze utilize our open-source project. For example, to support migration use cases, create linting tools, or check which queries an application executes (see our post from 2021 for some examples). And to name just one vanity metric, the Ruby binding for pg_query has been downloaded an incredible 34 million times!

Today, we’re excited to announce the new pg_query release based on the Postgres 16 parser, which introduces support for running on Windows (a frequently requested addition), alternate query parse modes (e.g. to parse PL/pgSQL assignments), as well as parsing and deparsing new Postgres syntax, such as SQL/JSON. We’ve released updated Ruby, Rust and Go bindings, and expect bindings maintained by the community, such as for Node.js and Python, to be updated soon as well.

In this post, we showcase how to use pg_query in your application, and a few benefits of the new release. But first, let’s go back to the basics - how does pg_query work?

pg_query, the Postgres parser as a standalone C library

At its core, pg_query is all about making the “raw_parser” function from Postgres available. We’ve written about this in more detail in the original pg_query announcement, but the quick summary is:

We apply a tiny amount of patches on top of Postgres, e.g. to help with parsing $n parameter references in queries from pg_stat_statements
We utilize libclang to build a tree of dependencies between functions and global variables in the Postgres source code
In some cases, we apply mocks to avoid entering parts of Postgres we don’t need (e.g., functions that access the file system)
We locate all the source code necessary for the functions we want to call (like “raw_parser”), and remove all other code, to make sure the compiler doesn’t do unnecessary work, or pull in functionality we don’t need
From the built-in node definitions (which are C structs), we automatically create output functions for JSON and protocol buffers, to make it convenient to write bindings in other programming languages

Overall, this results in a library that can parse SQL text and return a Postgres parse tree for you to work with and modify, whilst supporting the full syntax that Postgres itself supports.

From an end user perspective that means you can, for example in the Ruby library, use the following code to parse a query, and find out which table it's querying:

require 'pg_query'
parsed_query = PgQuery.parse("SELECT * FROM users")
puts parsed_query.tree.stmts.first.stmt.select_stmt.from_clause.first.range_var.inspect
# => <PgQuery::RangeVar: catalogname: "", schemaname: "", relname: "users", inh: true, relpersistence: "p", location: 14>

The parse tree structs are automatically generated as protocol buffer definitions based on Postgres’ internal structs located in parsenodes.h and adjacent files, and the language-specific bindings can use each language’s protobuf libraries to have properly typed structs as well.

The main change in the core parsing functionality in this release is that we’ve added support for compiling libpg_query on Windows (with either MSVC, or an MSYS2 stack using MinGW/etc), a frequently requested feature.

Using query fingerprints to identify queries across servers

Besides parsing itself, there was another major use case that we needed to solve for pganalyze: The ability to group queries together.

Postgres itself generates a “queryid” to support this. Originally part of pg_stat_statements, it has been part of Postgres core since Postgres 14, and is generated when “compute_query_id” is enabled (automatically done when using pg_stat_statements). However, the Postgres queryid has its flaws: Besides not always grouping together as well as it could (e.g. in the case of IN lists), it’s not portable. If you ran the same query on two different servers, you would get two different query IDs. This difference in query IDs is primarily explained by the fact that Postgres determines which tables a query references based on the relation OIDs. But those OIDs are not stable across servers, as they are internal identifiers.

With the pg_query fingerprint we intentionally went another way: We utilize the name (and schema) of the table, as it is present in the raw parse tree that pg_query has access to, when generating a unique identifier for a query.

There are of course many other parts of a query we also take into consideration, e.g. referenced columns, expressions, functions, etc. To enable grouping we do not include constant values in the fingerprint, to ensure that two similar queries get the same fingerprint:

PgQuery.fingerprint("SELECT * FROM users WHERE id = 1")
# => "a0ead580058af585"
PgQuery.fingerprint("SELECT * FROM users WHERE id = 2")
# => "a0ead580058af585"
PgQuery.fingerprint("SELECT * FROM users WHERE email = $1")
# => "e213d9d32c7097d5"
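
To illustrate why leaving out constants makes similar queries group together, here is a deliberately simplified Python sketch. It is not how pg_query computes fingerprints: pg_query works on the parse tree, while this toy version (the function name toy_fingerprint and the regular expressions are illustrative assumptions) only rewrites the query text and would be fooled by comments, quoted identifiers and many other cases. Treating $n parameter references like constants is also an assumption of this sketch.

```python
import hashlib
import re


def toy_fingerprint(sql: str) -> str:
    """Hash a query after replacing constants with a placeholder (toy version)."""
    text = re.sub(r"'(?:[^']|'')*'", "?", sql)        # string literals
    text = re.sub(r"\$\d+", "?", text)                # $n parameter references
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "?", text)    # numeric literals
    text = re.sub(r"\s+", " ", text).strip().lower()  # whitespace and case
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[:16]


same_a = toy_fingerprint("SELECT * FROM users WHERE id = 1")
same_b = toy_fingerprint("SELECT * FROM users WHERE id = 2")
other = toy_fingerprint("SELECT * FROM users WHERE email = $1")

print(same_a == same_b)  # the two id lookups differ only in a constant
print(same_a == other)   # a different column gives a different hash
```

The hash values produced here have nothing in common with real pg_query fingerprints. Only the grouping behavior is comparable.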

What else can we use fingerprints for? One use case that we’ve heard about from pganalyze customers, is to use query fingerprints to help identify the same query on both the application side and the database.

Specifically, by using pg_query in application side tracing to tag a query, and then, when looking at a slow trace, using that data in pganalyze to find more detailed information about database-side performance. This also inspired our recent integration with OpenTelemetry, which solves the same use case in a slightly different way.

Utilizing deparsing to upgrade queries to Postgres 16 SQL/JSON syntax

Now to something new in the Postgres 16 release! In Postgres 16, one of the bigger syntax changes was the addition of SQL/JSON. And pg_query fully supports that, both for parsing, as well as deparsing (which allows you to turn a syntax tree back into a SQL statement).

We can use the pg_query deparser to write the equivalent of a codemod for SQL statements, that rewrites the legacy syntax into the more standard SQL/JSON syntax.

For example, imagine we have many places where we build JSON objects manually in SQL using the “json_build_object” function, and wanted to replace that with the new JSON_OBJECT syntax:

q = PgQuery.parse("SELECT json_build_object('key1', 1, 'key2', 'val');")
q.walk! do |node|
  next unless node.is_a?(PgQuery::Node) && node.node == :func_call
  func_name = node.func_call.funcname[0].string.sval
  if func_name == 'json_build_object'
    exprs = node.func_call.args.each_slice(2).map do |key, value|
      PgQuery::Node.from(
        PgQuery::JsonKeyValue.new(
          key: key,
          value: PgQuery::JsonValueExpr.new(raw_expr: value)
        )
      )
    end
    node.inner = PgQuery::JsonObjectConstructor.new(exprs: exprs)
  end
end
q.deparse
# => "SELECT JSON_OBJECT('key1': 1, 'key2': 'val')"

Each release, we test the pg_query deparser for completeness with the full set of Postgres regression tests, and be it SQL/JSON, or other new syntax, you can rest assured that pg_query supports it.

Alternate parse modes to work with PL/pgSQL expressions

Since Postgres 14, PL/pgSQL expressions are now parsed through the regular “raw_parser” functionality, by passing a special mode flag that then allows for PL/pgSQL specific syntax.

We didn’t support this in pg_query before, but thanks to a contribution by Landan Cheruka, there is now a way to parse PL/pgSQL expressions directly with pg_query.

Let’s first utilize parse_plpgsql to parse a function definition, the example taken from the Postgres documentation:

CREATE OR REPLACE FUNCTION cs_fmt_browser_version(v_name varchar,
                                                  v_version varchar)
RETURNS varchar AS $$
BEGIN
  IF v_version IS NULL THEN
    RETURN v_name;
  END IF;
  RETURN v_name || '/' || v_version;
END;$$;

{
  "PLpgSQL_function": {
    "datums": [
      { "PLpgSQL_var": { "refname": "v_name", "datatype": { "PLpgSQL_type": { "typname": "UNKNOWN" } } } },
      { "PLpgSQL_var": { "refname": "v_version", "datatype": { "PLpgSQL_type": { "typname": "UNKNOWN" } } } },
      { "PLpgSQL_var": { "refname": "found", "datatype": { "PLpgSQL_type": { "typname": "UNKNOWN" } } } }
    ],
    "action": {
      "PLpgSQL_stmt_block": {
        "body": [
          {
            "PLpgSQL_stmt_if": {
              "cond": {
                "PLpgSQL_expr": { "query": "v_version IS NULL", "parseMode": 2 }
              },
              "then_body": [
                {
                  "PLpgSQL_stmt_return": {
                    "expr": {
                      "PLpgSQL_expr": { "query": "v_name", "parseMode": 2 }
                    }
...
            "PLpgSQL_stmt_return": {
              "expr": {
                "PLpgSQL_expr": { "query": "v_name || '/' || v_version", "parseMode": 2 }
...
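
If you have such a tree as JSON, the expression texts can be collected with a short recursive walk. The following Python sketch assumes the JSON shape shown above. The SAMPLE_TREE literal is a hand-abbreviated stand-in that only contains the fields visible in the excerpt (the elided parts are left out), and the function name collect_plpgsql_exprs is made up for this example.

```python
def collect_plpgsql_exprs(node):
    """Return the "query" text of every PLpgSQL_expr found in a nested JSON value."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "PLpgSQL_expr" and isinstance(value, dict) and "query" in value:
                found.append(value["query"])
            else:
                found.extend(collect_plpgsql_exprs(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_plpgsql_exprs(item))
    return found


# Hypothetical, abbreviated sample following the structure of the excerpt above
SAMPLE_TREE = {
    "PLpgSQL_function": {
        "action": {
            "PLpgSQL_stmt_block": {
                "body": [
                    {"PLpgSQL_stmt_if": {
                        "cond": {"PLpgSQL_expr": {"query": "v_version IS NULL", "parseMode": 2}},
                        "then_body": [
                            {"PLpgSQL_stmt_return": {
                                "expr": {"PLpgSQL_expr": {"query": "v_name", "parseMode": 2}}}}
                        ],
                    }},
                    {"PLpgSQL_stmt_return": {
                        "expr": {"PLpgSQL_expr": {"query": "v_name || '/' || v_version", "parseMode": 2}}}},
                ]
            }
        }
    }
}

for expr in collect_plpgsql_exprs(SAMPLE_TREE):
    print(expr)
```

Each collected string is a candidate for being parsed on its own.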

In this function parse tree, you can see the different PLpgSQL_expr expressions, but the actual expression is just text. We can now use the new pg_query_parse_opts function to turn that text into a parse tree:

#include <pg_query.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  PgQueryParseResult result;

  result = pg_query_parse_opts("v_name || '/' || v_version", PG_QUERY_PARSE_PLPGSQL_EXPR);

  if (result.error) {
    printf("error: %s at %d\n", result.error->message, result.error->cursorpos);
  } else {
    printf("%s\n", result.parse_tree);
  }

  pg_query_free_parse_result(result);

  return 0;
}

And that gives us a regular parse tree to work with:

{
  "version": 160001,
  "stmts": [
    {
      "stmt": {
        "SelectStmt": {
          "targetList": [
            {
              "ResTarget": {
                "val": {
                  "A_Expr": {
                    "kind": "AEXPR_OP",
…

We’re still in the process of updating language bindings to support optionally using these parse modes, and would be curious to hear about more use cases for working with PL/pgSQL and pg_query.

A shout-out to the community

pg_query wouldn’t be the same without the community!

We want to expressly call out:

Lele Gaifax for maintaining the Python binding “pglast” and proactively testing libpg_query PRs
Landan Cheruka for adding support for alternate parse modes
Anuraag Agrawal for contributions to enable use in WebAssembly (see pg_query_go without cgo)
Mehmet Emin KARAKAŞ for the many deparser improvements over the years
Philipp Steinrötter for creating the Postgres Language Server based on pg_query.rs, and giving lots of good feedback on how things could work better
And everyone else who contributed to libpg_query and related projects!

Looking ahead, we also hope to continue conversations with the Postgres community on how we could upstream parts of pg_query, so that a query parsing library could be provided directly as part of Postgres.

In conclusion

We’re excited about the new pg_query version, and we’re always happy to hear about new use cases you find for using it to work with Postgres queries. If you have ideas on how pg_query could be better, feel free to open an issue on GitHub.

And if you’ve benefited from pg_query in the past, and have not yet tried out pganalyze to optimize your Postgres performance, you can try out pganalyze with our free 14-day trial.


Postgres 16: Cumulative I/O statistics with pg_stat_io

Lukas Fittl — Tue, 14 Feb 2023 12:00:00 GMT

One of the most common questions I get from people running Postgres databases at scale is:
How do I optimize the I/O operations of my database?

Historically, getting a complete picture of all the I/O produced by a Postgres server has been challenging. To start with, Postgres splits its I/O activity into writing the WAL stream, and reads/writes to the data directory. The real challenge is understanding second-order effects around writes: Typically the write to the data directory happens after the transaction commits, and understanding which process actually writes to the data directory (and when) is hard.

This whole situation has become an even bigger challenge in the cloud, when faced with provisioned IOPS, or worse, having to pay for individual I/Os like on Amazon Aurora. Often the solution has been to look at parts of the system that have instrumentation (such as individual queries), to get at least some sense for where the activity is happening.

Last weekend, a major improvement to the visibility into I/O activity was committed to the upcoming Postgres 16 by Andres Freund, and authored by Melanie Plageman, with documentation contributed by Samay Sharma. My colleague Maciek Sakrejda and I have reviewed this patch through its various iterations, and we're very excited about what it brings to Postgres observability.

Welcome, pg_stat_io. Let's take a look:

Querying system-wide I/O statistics in Postgres
Use cases for pg_stat_io
Sneak peek: Visualizing pg_stat_io in pganalyze
The future of I/O observability in Postgres

Querying system-wide I/O statistics in Postgres

Let's start by using a local Postgres built fresh from the development branch. Note that Postgres 16 is still under heavy development, not even at beta stage, and should definitely not be used in production. For this I followed the new cheatsheet for using the Meson build system (also new in Postgres 16), which significantly speeds up the build and test process.

We can start by querying pg_stat_io to get a sense for which information is tracked, omitting rows that are empty:

SELECT * FROM pg_stat_io WHERE reads <> 0 OR writes <> 0 OR extends <> 0;

    backend_type     | io_object | io_context |  reads   | writes  | extends | op_bytes | evictions |  reuses  | fsyncs |          stats_reset          
---------------------+-----------+------------+----------+---------+---------+----------+-----------+----------+--------+-------------------------------
 autovacuum launcher | relation  | normal     |       19 |       5 |         |     8192 |        13 |          |      0 | 2023-02-13 11:50:27.583875-08
 autovacuum worker   | relation  | normal     |    15972 |    2494 |    2894 |     8192 |     17430 |          |      0 | 2023-02-13 11:50:27.583875-08
 autovacuum worker   | relation  | vacuum     |  5754853 | 3006563 |       0 |     8192 |      2056 |  5752594 |        | 2023-02-13 11:50:27.583875-08
 client backend      | relation  | bulkread   | 25832582 |  626900 |         |     8192 |    753962 | 25074439 |        | 2023-02-13 11:50:27.583875-08
 client backend      | relation  | bulkwrite  |     4654 | 2858085 | 3259572 |     8192 |    998220 |  2209070 |        | 2023-02-13 11:50:27.583875-08
 client backend      | relation  | normal     |   960291 |  376524 |  159497 |     8192 |   1103707 |          |      0 | 2023-02-13 11:50:27.583875-08
 client backend      | relation  | vacuum     |   128710 |       0 |       0 |     8192 |      1221 |   127489 |        | 2023-02-13 11:50:27.583875-08
 background worker   | relation  | bulkread   | 39059938 |  590896 |         |     8192 |    802939 | 38253662 |        | 2023-02-13 11:50:27.583875-08
 background worker   | relation  | normal     |   257533 |  118972 |       0 |     8192 |    256437 |          |      0 | 2023-02-13 11:50:27.583875-08
 background writer   | relation  | normal     |          |  243142 |         |     8192 |           |          |      0 | 2023-02-13 11:50:27.583875-08
 checkpointer        | relation  | normal     |          |  390141 |         |     8192 |           |          |  18812 | 2023-02-13 11:50:27.583875-08
 standalone backend  | relation  | bulkwrite  |        0 |       0 |       8 |     8192 |         0 |        0 |        | 2023-02-13 11:50:27.583875-08
 standalone backend  | relation  | normal     |      689 |     983 |     470 |     8192 |         0 |          |      0 | 2023-02-13 11:50:27.583875-08
 standalone backend  | relation  | vacuum     |       10 |       0 |       0 |     8192 |         0 |        0 |        | 2023-02-13 11:50:27.583875-08
(14 rows)

At a high level, this information can be interpreted as:

Statistics are tracked for a given backend type, I/O object type (i.e. whether it's a temporary table), and I/O context (more on that later)
The main statistics are counting I/O operations: reads, writes and extends (a special kind of write to resize data files)
For each I/O operation the size in bytes is noted to help interpret the statistics (currently always block size, i.e., usually 8kB)
Additionally, the number of shared buffer evictions, ring buffer re-uses and fsync calls are tracked

On Postgres 16, this system-wide information will always be available. You can find the complete details of each field in the Postgres documentation.

Note that pg_stat_io shows logical I/O operations issued by Postgres. Whilst this often eventually maps to an actual I/O to a disk (especially in the case of writes), the operating system has its own caching and batching mechanism, and will, for example, often split up an 8kB write into two individual 4kB writes to the file system.

Generally we can assume that this captures all I/O issued by Postgres, except for:

I/O for writing the Write-Ahead-Log (WAL)
Special cases such as tables being moved between tablespaces
Temporary files (such as used for sorts, or extensions like pg_stat_statements)

Note that temporary relations are tracked (they are not the same as temporary files): In pg_stat_io these are marked as io_object = "temp relation" - you may otherwise be familiar with them being called "local buffers" in other statistics views.

With the basics in place, we can take a closer look at some use cases and learn why this matters.

Use cases for pg_stat_io

Tracking Write I/O activity in Postgres

Lifecycle of a write in Postgres, and what is currently not visible in most statistics

When looking at a write in Postgres, we need to look beyond what a client sees as the query runtime, or something like pg_stat_statements can track. Postgres has a complex set of mechanisms that guarantee durability of writes, whilst allowing clients to return quickly, trusting that the server has persisted the data in a crash safe manner.

The first thing that Postgres does to persist data is to write it to the WAL. Once this has succeeded, the client will receive confirmation that the write has been successful. But what happens afterwards is where the additional statistics tracking comes in handy.

For example, if you look at a given INSERT statement in pg_stat_statements, the shared_blks_written field is often going to tell you next to nothing, because the actual write to the data directory typically occurs at a later time, in order to batch writes for efficiency and to avoid I/O spikes.

In addition to writing the WAL, Postgres will also update the shared (or local) buffers for the write. Such an update will mark the buffer page in question as "dirty".

Then, in most cases, another process is responsible for actually writing the dirty page to the data directory. There are three main process types to consider:

The background writer: Runs continuously in the background to write out (some) dirty pages
The checkpointer: Runs on a scheduled basis, or based on amount of WAL written, and writes out all dirty pages not yet written
All other process types, including regular client backends: Write out dirty pages if they need to evict the buffer page in question

The main thing to understand is when the third case occurs - because it can drastically slow down queries. Even a simple "SELECT" might have to suddenly write to disk, before it has enough space in shared buffers to read in its data.

Historically you were already able to see some of this activity through the pg_stat_bgwriter view, specifically the fields whose names start with buffers_. However, this was incomplete, did not consider autovacuum activity explicitly, and did not let you understand the root cause of a write (e.g. a buffer eviction).

With pg_stat_io you can simply look at the writes field, and see both an accurate aggregate number, as well as exactly which process in Postgres actually ended up writing your data to disk.
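
To make the "which process wrote the data" question concrete, here is a small Python sketch that totals the writes column per backend type. The rows are copied from the sample pg_stat_io output shown earlier in this post. The tuple layout and the function name writes_by_backend are illustrative choices, not part of any Postgres or pganalyze API.

```python
from collections import defaultdict

# (backend_type, io_context, writes), copied from the sample output above
ROWS = [
    ("autovacuum launcher", "normal", 5),
    ("autovacuum worker", "normal", 2494),
    ("autovacuum worker", "vacuum", 3006563),
    ("client backend", "bulkread", 626900),
    ("client backend", "bulkwrite", 2858085),
    ("client backend", "normal", 376524),
    ("client backend", "vacuum", 0),
    ("background worker", "bulkread", 590896),
    ("background worker", "normal", 118972),
    ("background writer", "normal", 243142),
    ("checkpointer", "normal", 390141),
    ("standalone backend", "bulkwrite", 0),
    ("standalone backend", "normal", 983),
    ("standalone backend", "vacuum", 0),
]


def writes_by_backend(rows):
    """Sum the writes counter per backend type, treating NULL (None) as 0."""
    totals = defaultdict(int)
    for backend_type, _io_context, writes in rows:
        totals[backend_type] += writes or 0
    return dict(totals)


totals = writes_by_backend(ROWS)
for backend_type, writes in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{backend_type:20} {writes:>10}")
```

In this sample, client backends issued more writes than the checkpointer and background writer combined, although most of those writes happened in the bulkread and bulkwrite contexts rather than in the normal context.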

Improve workload stability and sizing shared_buffers by monitoring shared buffer evictions

One of the most important metrics that pg_stat_io helps clarify is the situation where a buffer page in shared buffers is evicted. Since shared buffers is a fixed size pool of pages (each 8kB in size, on most Postgres systems), what is cached inside it matters a great deal - especially when your working set exceeds shared buffers.

By default, if you're on a self-managed Postgres, the shared_buffers setting is set to 128MB - or about 16,000 pages. Let's imagine you end up having loaded something through a very inefficient index scan, that ended up consuming all 128MB.

What happens when you suddenly read something completely different? Postgres has to go and remove some of the old data from cache - also known as evicting a buffer page.

This eviction has two main effects:

Data that was in Postgres buffer cache before, is no longer in the cache (note it may still be in the OS page cache)
If the page that was evicted was marked as "dirty", the process evicting it also has to write the old page to disk

Both of these aspects matter for sizing shared buffers, and pg_stat_io can clearly show this by tracking evictions for each backend type across the system. Further, if you see a sudden spike in evictions, and then suddenly a lot of reads, it can help you infer that the cached data that was evicted, was actually needed again shortly afterwards. If in doubt, you can use the pg_buffercache extension to look at the current shared buffers contents in detail.
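
Because the counters in pg_stat_io are cumulative, spotting a spike means comparing two snapshots taken at different times. The Python sketch below shows the idea, using the evictions values from the two sample outputs in this post as the "earlier" and "later" snapshots. The dictionary layout and the function name counter_deltas are assumptions made for illustration. A real collector would also have to check that stats_reset did not change between the two snapshots.

```python
def counter_deltas(before, after):
    """Return after - before for every key present in both snapshots."""
    return {key: after[key] - before[key] for key in after if key in before}


# Keys are (backend_type, io_context); values are the evictions column
earlier = {
    ("autovacuum worker", "normal"): 17430,
    ("autovacuum worker", "vacuum"): 2056,
    ("client backend", "bulkread"): 753962,
    ("client backend", "vacuum"): 1221,
}
later = {
    ("autovacuum worker", "normal"): 17785,
    ("autovacuum worker", "vacuum"): 2588,
    ("client backend", "bulkread"): 754610,
    ("client backend", "vacuum"): 1221,
}

deltas = counter_deltas(earlier, later)
for (backend_type, io_context), evictions in deltas.items():
    print(f"{backend_type:18} {io_context:9} +{evictions}")
```

Dividing each delta by the time between the snapshots gives a rate that can be plotted or alerted on.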

Tracking cumulative I/O activity by autovacuum and manual VACUUMs

It's a fact that every Postgres server needs the occasional VACUUM - whether you schedule it manually, or have autovacuum take care of it for you. It helps clean up dead rows and makes space re-usable, and it freezes pages to prevent transaction ID wraparound.

But there is such a thing as VACUUMing too often. If not tuned correctly, VACUUM and autovacuum can have a dramatic effect on I/O activity. Historically the best bet was to look at the output of log_autovacuum_min_duration, which will give you information like this:

  LOG:  automatic vacuum of table "mydb.pg_toast.pg_toast_42593": index scans: 0
        pages: 0 removed, 13594 remain, 13594 scanned (100.00% of total)
        tuples: 0 removed, 54515 remain, 0 are dead but not yet removable
        removable cutoff: 11915, which was 6 XIDs old when operation ended
        new relfrozenxid: 11915, which is 4139 XIDs ahead of previous value
        frozen: 13594 pages from table (100.00% of total) had 54515 tuples frozen
        index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
        avg read rate: 0.113 MB/s, avg write rate: 0.113 MB/s
        buffer usage: 13614 hits, 13602 misses, 13600 dirtied
        WAL usage: 40786 records, 13600 full page images, 113072608 bytes
        system usage: CPU: user: 0.26 s, system: 0.52 s, elapsed: 939.84 s

From the buffer usage you can determine that this single VACUUM had to read 13602 pages, and marked 13600 pages as dirty. But what if we want to get a more complete picture, and across all our VACUUMs?

With pg_stat_io, you can now see a system-wide measurement of the impact of VACUUM, by looking at everything marked as io_context = 'vacuum', or associated with the autovacuum worker backend type:

SELECT * FROM pg_stat_io WHERE backend_type = 'autovacuum worker' OR (io_context = 'vacuum' AND (reads <> 0 OR writes <> 0 OR extends <> 0));

    backend_type    | io_object | io_context |  reads  | writes  | extends | op_bytes | evictions | reuses  | fsyncs |          stats_reset          
--------------------+-----------+------------+---------+---------+---------+----------+-----------+---------+--------+-------------------------------
 autovacuum worker  | relation  | bulkread   |       0 |       0 |         |     8192 |         0 |       0 |        | 2023-02-13 11:50:27.583875-08
 autovacuum worker  | relation  | normal     |   16306 |    2494 |    2915 |     8192 |     17785 |         |      0 | 2023-02-13 11:50:27.583875-08
 autovacuum worker  | relation  | vacuum     | 5824251 | 3028684 |       0 |     8192 |      2588 | 5821460 |        | 2023-02-13 11:50:27.583875-08
 client backend     | relation  | vacuum     |  128710 |       0 |       0 |     8192 |      1221 |  127489 |        | 2023-02-13 11:50:27.583875-08
 standalone backend | relation  | vacuum     |      10 |       0 |       0 |     8192 |         0 |       0 |        | 2023-02-13 11:50:27.583875-08
(5 rows)

In this particular example, in sum, the autovacuum worker has read 44.4 GB of data (5,824,251 buffer pages), and written 23.1GB (3,028,684 buffer pages).
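
The conversion from buffer pages to data volume is a multiplication by op_bytes. As a minimal sketch (the helper name pages_to_gib is made up for this example), the figures above work out when using binary units, i.e. 2^30 bytes per GB:

```python
OP_BYTES = 8192  # op_bytes column: the block size used in the sample output


def pages_to_gib(pages, op_bytes=OP_BYTES):
    """Convert a count of I/O operations into GiB (2**30 bytes)."""
    return pages * op_bytes / 2**30


read_gib = pages_to_gib(5_824_251)     # reads, autovacuum worker / vacuum row
written_gib = pages_to_gib(3_028_684)  # writes, autovacuum worker / vacuum row
print(f"read {read_gib:.1f} GiB, written {written_gib:.1f} GiB")
```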

If you track these statistics over time, it will help you have a crystal-clear picture of whether autovacuum is to blame for an I/O spike during business hours. It will also help you make changes to tune autovacuum with more confidence, e.g. making autovacuum more aggressive to prevent bloat.

Visibility into bulk read/write strategies (sequential scans and COPY)

Have you ever used COPY in Postgres to load data? Or read data from a table using a sequential scan? You may not know that in most cases, this data does not pass through shared buffers in the regular way. Instead, Postgres uses a special dedicated ring buffer that ensures that most of shared buffers is undisturbed by such large activities.

Before pg_stat_io, it was near impossible to understand this activity in Postgres, as there was simply no tracking for it. Now, we can finally see both bulk reads (typically large sequential scans) and bulk writes (typically COPY in), and the I/O activity they cause.

You can simply filter for the new bulkwrite and bulkread values in io_context, and have visibility into this activity:

SELECT * FROM pg_stat_io WHERE io_context IN ('bulkread', 'bulkwrite') AND (reads <> 0 OR writes <> 0 OR extends <> 0);

    backend_type    | io_object | io_context |  reads   | writes  | extends | op_bytes | evictions |  reuses  | fsyncs |          stats_reset          
--------------------+-----------+------------+----------+---------+---------+----------+-----------+----------+--------+-------------------------------
 client backend     | relation  | bulkread   | 25900458 |  627059 |         |     8192 |    754610 | 25141667 |        | 2023-02-13 11:50:27.583875-08
 client backend     | relation  | bulkwrite  |     4654 | 2858085 | 3259572 |     8192 |    998220 |  2209070 |        | 2023-02-13 11:50:27.583875-08
 background worker  | relation  | bulkread   | 39059938 |  590896 |         |     8192 |    802939 | 38253662 |        | 2023-02-13 11:50:27.583875-08
 standalone backend | relation  | bulkwrite  |        0 |       0 |       8 |     8192 |         0 |        0 |        | 2023-02-13 11:50:27.583875-08
(4 rows)

In this example, there is 495 GB of bulk read activity, and 21 GB of bulk write activity we had no good way of identifying before. However, and most importantly, we don't have to worry about the evictions count here - these are all evictions from the special bulk read / bulk write ring buffer, not from regular shared buffers.
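
For reference, here is one plausible way to arrive at these totals from the output above, sketched in Python: sum the reads of all bulkread rows and the writes of all bulkwrite rows, then multiply by op_bytes. This reading of how the numbers were derived is an assumption, as are the variable names. The units are binary (2^30 bytes).

```python
OP_BYTES = 8192

# (backend_type, io_context, reads, writes), copied from the output above
ROWS = [
    ("client backend", "bulkread", 25900458, 627059),
    ("client backend", "bulkwrite", 4654, 2858085),
    ("background worker", "bulkread", 39059938, 590896),
    ("standalone backend", "bulkwrite", 0, 0),
]

bulk_read_pages = sum(reads for _, ctx, reads, _ in ROWS if ctx == "bulkread")
bulk_write_pages = sum(writes for _, ctx, _, writes in ROWS if ctx == "bulkwrite")

bulk_read_gib = bulk_read_pages * OP_BYTES / 2**30
bulk_write_gib = bulk_write_pages * OP_BYTES / 2**30

print(f"bulk reads:  {bulk_read_pages} pages, {bulk_read_gib:.1f} GiB")
print(f"bulk writes: {bulk_write_pages} pages, {bulk_write_gib:.1f} GiB")
```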

Sneak peek: Visualizing pg_stat_io in pganalyze

It's still a while until Postgres 16 is released (usually September or October each year), but to help test things (and because it's exciting!) I took a quick stab at updating pganalyze in an experimental branch to collect pg_stat_io metrics and visualize them over time.

Here is a very early look at what this may look like in the future:

Experimental view of what pg_stat_io could look like when visualized over time

Even though this is just running locally on my laptop, already we can see a clear pattern where writes are done by the checkpointer and background writer processes, most of the time. We can also see my checkpoint_timeout being set to 5min (the default), with both writes and fsyncs happening like clockwork - note the workload is periodic every 10 minutes, so every second checkpoint has less work to do.

However, we can also clearly see a spike in activity - and that spike can be easily explained: To generate more database activity, I triggered a big daily background process around 8:10pm UTC. The large amount of data read caused the working set to momentarily exceed shared buffers, and caused a large number of buffer evictions, which then forced the client backend to write out buffer pages unexpectedly.

On this system I have a very small shared_buffers setting (the default, 128 MB). I should probably increase shared_buffers...

The future of I/O observability in Postgres

A lot of the ground work for pg_stat_io actually happened previously in Postgres 15, through the new cumulative statistics system using shared memory.

Before Postgres 15, statistics tracking had to go through the statistics collector (an obscure process that received UDP packets from individual processes part of Postgres), which was slow and error prone. This historically limited the ability to collect more advanced statistics easily. As the addition of pg_stat_io shows, it is now much easier to track additional information about how Postgres operates.

Amongst the immediate improvements that are already being discussed are:

Tracking of system-wide buffer cache hits (to allow calculating an accurate buffer cache hit ratio)
Cumulative system-wide I/O times (not just I/O counts as currently present in pg_stat_io)
Better cumulative WAL statistics (i.e. going beyond what pg_stat_wal offers)
Additional I/O tracking for tables and indexes

Our team at pganalyze is excited to have helped shape the new pg_stat_io view, and we look forward to continuing to work with the community on making Postgres better.


PS: If you're interested in learning more about optimizing Postgres I/O performance and costs you can check out our webinar recording.


How Postgres Chooses Which Index To Use For A Query

Lukas Fittl — Fri, 01 Apr 2022 12:00:00 GMT

Using Postgres sometimes feels like magic. But sometimes the magic is too much, such as when you are trying to understand the reason behind a seemingly bad Postgres query plan.

I've often found myself in a situation where I asked myself: "Postgres, what are you thinking?". Staring at an EXPLAIN plan, seeing a Sequential Scan, and being puzzled as to why Postgres isn't doing what I am expecting.

This has led me down the path of reading the Postgres source, in search of answers. Why is Postgres choosing a particular index over another one, or not choosing an index altogether?

In this blog post I aim to give an introduction to how the Postgres planner analyzes your query, and how it decides which indexes to use. Additionally, we’ll look at a puzzling situation where the join type can impact which indexes are being used.

We’ll look at a lot of Postgres source code, but if you are short on time, you might want to jump to how B-tree index costing works, and why Nested Loop Joins impact index usage.

We’ll also talk about an upcoming pganalyze feature at the very end!

A tour of Postgres: Parse analysis and early stages of planning
Where Index Scans are made
New features coming soon to pganalyze
Conclusion
Other helpful resources

A tour of Postgres: Parse analysis and early stages of planning

To start with, let’s look at a query’s lifecycle in Postgres. There are four important steps in how a query is handled:

Parsing: Turning query text into an Abstract Syntax Tree (AST)
Parse analysis: Turning table names into actual references to table objects
Planning: Finding and creating the optimal query plan
Execution: Executing the query plan

For understanding how the planner chooses which indexes to use, let’s first take a look at what parse analysis does.

Whilst there are multiple entry points into parse analysis, depending on whether you have query parameters or not, the core function in parse analysis is transformStmt (source):

/*
* transformStmt -
*    recursively transform a Parse tree into a Query tree.
*/
Query *
transformStmt(ParseState *pstate, Node *parseTree)
{

This takes the raw parse tree output (from the first step), and returns a Query struct. It has a lot of specific cases, as it handles both regular SELECTs as well as UPDATEs and other DML statements. Note that utility statements (DDL, etc) mostly get passed through to the execution phase.

Since we are interested in tables and indexes, let’s take a closer look at how parse analysis handles the FROM clause:

void
transformFromClause(ParseState *pstate, List *frmList)
{
   ListCell   *fl;
 
   /*
    * The grammar will have produced a list of RangeVars, RangeSubselects,
    * RangeFunctions, and/or JoinExprs. Transform each one (possibly adding
    * entries to the rtable), check for duplicate refnames, and then add it
    * to the joinlist and namespace.
    */
   foreach(fl, frmList)
   {
       …
 
       n = transformFromClauseItem(pstate, n,
                                   &nsitem,
                                   &namespace);
 
…
/*
* transformFromClauseItem -
*    Transform a FROM-clause item, adding any required entries to the
*    range table list being built in the ParseState, and return the
*    transformed item ready to include in the joinlist.  Also build a
*    ParseNamespaceItem list describing the names exposed by this item.
*    This routine can recurse to handle SQL92 JOIN expressions.
*/
static Node *
transformFromClauseItem(ParseState *pstate, Node *n,
                       ParseNamespaceItem **top_nsitem,
                       List **namespace)
{
…

Postgres already distinguishes between the range table list (essentially a list of all the tables referenced by the query), and the joinlist. This distinction will also be visible at a later point in the planner.

Note that at this point Postgres has not yet made up its mind which indexes to use - it just decided that the FROM reference you called “foobar” is actually the table “foobar” in the “public” schema with OID 16424.

This information now gets stored in the Query struct, which is the result of the parse analysis phase. This Query struct is then passed into the planner, and that’s where it gets interesting.

Four levels of planning a query

Commonly we would start with the standard_planner (source) function as an entry point into the Postgres planner:

PlannedStmt *
standard_planner(Query *parse, const char *query_string, int cursorOptions,
                ParamListInfo boundParams)
{

This takes our Query struct, and ultimately returns a PlannedStmt. For reference, the PlannedStmt struct (source) looks like this:

/* ----------------
*      PlannedStmt node
*
* The output of the planner is a Plan tree headed by a PlannedStmt node.
* PlannedStmt holds the "one time" information needed by the executor.
* ----------------
*/
typedef struct PlannedStmt
{
   NodeTag     type;
 
   CmdType     commandType;    /* select|insert|update|delete|utility */
 
…
 
   struct Plan *planTree;      /* tree of Plan nodes */
 
…

The tree of plan nodes is what you would be familiar with if you’ve looked at an EXPLAIN output before - ultimately EXPLAIN is based on walking that plan tree and showing you a text/JSON/etc version of it.

The core function of the planner is best described in these lines of standard_planner:

/* primary planning entry point (may recurse for subqueries) */
root = subquery_planner(glob, parse, NULL,
                        false, tuple_fraction);

/* Select best Path and turn it into a Plan */
final_rel = fetch_upper_rel(root, UPPERREL_FINAL, NULL);
best_path = get_cheapest_fractional_path(final_rel, tuple_fraction);

top_plan = create_plan(root, best_path);

The planner first creates what are called “paths” using the subquery_planner (which may recursively call itself), and then the planner picks the best path. Based on this best path, the actual plan tree is constructed.
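To make the difference between candidate paths and the final plan more tangible, here is a minimal Python sketch of picking the cheapest path. It is a toy model under stated assumptions, not the Postgres implementation: the Path class, the cost numbers and the cheapest_path helper are made up for illustration, and the fraction handling only loosely mirrors what tuple_fraction is for (a value between 0 and 1 means only that share of the rows is expected to be fetched, for example because of a LIMIT; 0 means all rows).

```python
from dataclasses import dataclass


@dataclass
class Path:
    """Hypothetical stand-in for a planner path with two cost numbers."""
    name: str
    startup_cost: float  # cost before the first row can be returned
    total_cost: float    # cost to return all rows


def fractional_cost(path, fraction):
    """Cost of fetching only `fraction` of the rows (0 = fetch everything)."""
    if fraction <= 0 or fraction >= 1:
        return path.total_cost
    # Interpolate between startup and total cost
    return path.startup_cost + fraction * (path.total_cost - path.startup_cost)


def cheapest_path(paths, fraction=0.0):
    """Pick the path with the lowest cost for the requested fraction."""
    return min(paths, key=lambda p: fractional_cost(p, fraction))


# Two made-up candidates for the same sorted result
paths = [
    Path("Sort + Seq Scan", startup_cost=100.0, total_cost=120.0),
    Path("Index Scan", startup_cost=0.3, total_cost=400.0),
]

print(cheapest_path(paths).name)        # all rows: Sort + Seq Scan
print(cheapest_path(paths, 0.01).name)  # 1% of rows: Index Scan
```

The point of the sketch is only that "best" depends on the costs attached to each path, which is why the rest of this post focuses on how paths are created and costed.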

For understanding how the planner chose which indexes to use, we must therefore look at paths, not at plan nodes. Let’s see what subquery_planner (source) does:

/*--------------------
* subquery_planner
*    Invokes the planner on a subquery.  We recurse to here for each
*    sub-SELECT found in the query tree.
…
*/
PlannerInfo *
subquery_planner(PlannerGlobal *glob, Query *parse,
                PlannerInfo *parent_root,
                bool hasRecursion, double tuple_fraction)
{

As described in the comment, this handles each sub-SELECT separately - but note that even if the original query contains a written sub-SELECT, the planner may optimize it away by pulling it up into the parent query’s planning process, if possible.

For the purposes of focusing on index choice, here are the two key parts of subquery_planner:

/*
 * Do the main planning.  If we have an inherited target relation, that
 * needs special processing, else go straight to grouping_planner.
 */
if (parse->resultRelation &&
    rt_fetch(parse->resultRelation, parse->rtable)->inh)
    inheritance_planner(root);
else
    grouping_planner(root, false, tuple_fraction);

…

/*
 * Make sure we've identified the cheapest Path for the final rel.  (By
 * doing this here not in grouping_planner, we include initPlan costs in
 * the decision, though it's unlikely that will change anything.)
 */
set_cheapest(final_rel);

This method also optimizes for the cheapest path - we’ll see more on that in a moment. But for now, let’s go deeper down the rabbit hole and look at grouping_planner (source):

/* --------------------
 * grouping_planner
 *    Perform planning steps related to grouping, aggregation, etc.
 *
 * This function adds all required top-level processing to the scan/join
 * Path(s) produced by query_planner.
 *
 * --------------------
 */
static void
grouping_planner(PlannerInfo *root, bool inheritance_update,
                double tuple_fraction)
{

Reading through its code, it turns out we’re still not there. It’s actually query_planner that we are looking for, as described in this comment:

RelOptInfo *current_rel;
…
/*
* Generate the best unsorted and presorted paths for the scan/join
* portion of this Query, ie the processing represented by the
* FROM/WHERE clauses.  (Note there may not be any presorted paths.)
* We also generate (in standard_qp_callback) pathkey representations
* of the query's sort clause, distinct clause, etc.
*/
current_rel = query_planner(root, standard_qp_callback, &qp_extra);

Before we dive into the query_planner method, let’s pause for a moment and look at what the result of query_planner is, the RelOptInfo struct:

Breaking down a query into tables being scanned (RelOptInfo and RestrictInfo structs)

In the Postgres planner, RelOptInfo is best described as the internal representation of a particular table that is being scanned (with either a sequential scan, or an index scan).

When trying to understand how Postgres interprets your query, adding debug information that shows RelOptInfo would be the closest that you can get to seeing which tables Postgres is going to scan, and how it makes a decision between different scan methods, such as an Index Scan.

RelOptInfo (source) has many details to it, but the key parts for our focus on indexing are these:

/*----------
* RelOptInfo
*      Per-relation information for planning/optimization
…
*      pathlist - List of Path nodes, one for each potentially useful
*                 method of generating the relation
… 
*      baserestrictinfo - List of RestrictInfo nodes, containing info about
*                  each non-join qualification clause in which this relation
*                  participates (only used for base rels)
…
*      joininfo  - List of RestrictInfo nodes, containing info about each
*                  join clause in which this relation participates
…
*/
typedef struct RelOptInfo
{
…
   List       *pathlist;       /* Path structures */
…
   List       *baserestrictinfo;   /* RestrictInfo structures (if base rel) */
…
   List       *joininfo;       /* RestrictInfo structures for join clauses
                                * involving this rel */
…
}

Before we interpret this, let’s look at RestrictInfo (source):

/*
* Restriction clause info.
*
* We create one of these for each AND sub-clause of a restriction condition
* (WHERE or JOIN/ON clause).  Since the restriction clauses are logically
* ANDed, we can use any one of them or any subset of them to filter out
* tuples, without having to evaluate the rest.
..
*/
typedef struct RestrictInfo
{
   NodeTag     type;
   Expr       *clause;         /* the represented clause of WHERE or JOIN */
…
}

A note on terminology: This references “base relations”, which are relations (aka tables) that are looked at solely on their individual basis, as compared to in the context of a JOIN.

In the code sample, RestrictInfo is how our WHERE clause and JOIN conditions get represented. This is the part that is key to understanding how Postgres compares your query against the indexes that exist.

You can think about it this way - for each table that’s included in the query, Postgres generates two lists of “restriction” clauses:

Base restriction clauses: Typically part of your WHERE clause, and are expressions that involve only the table itself - for example users.id = 123
Join clauses: Typically part of your JOIN clause, and expressions that involve multiple tables - for example users.id = comments.user_id

Note the reason that Postgres calls these “restriction” clauses is because they restrict (or filter) the amount of data that is being returned from your table. And how can we effectively filter data from a table? By using an index!

The base restriction clauses will typically be used to filter down the amount of data that is being returned from the table. But join clauses oftentimes will not, as they are only used as part of the matching of rows that happens during the JOIN operation.

The one exception to this is Nested Loop Joins - but we’ll come back to that.
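The split into the two lists can be pictured with a small Python sketch. This is a simplified illustration, assuming each clause is just a piece of text plus the set of tables it references; the classify_clauses helper and the data layout are made up for this example and are not how Postgres represents RestrictInfo nodes.

```python
def classify_clauses(clauses):
    """Sort clauses into per-table base restriction and join clause lists.

    `clauses` is a list of (clause_text, referenced_tables) pairs.
    """
    base_restrictions = {}  # table -> clauses that only involve that table
    join_clauses = {}       # table -> clauses that involve other tables too
    for text, tables in clauses:
        target = base_restrictions if len(tables) == 1 else join_clauses
        for table in tables:
            target.setdefault(table, []).append(text)
    return base_restrictions, join_clauses


# The two example clauses mentioned above
clauses = [
    ("users.id = 123", {"users"}),
    ("users.id = comments.user_id", {"users", "comments"}),
]

base, join = classify_clauses(clauses)
print(base)  # only users has a base restriction clause
print(join)  # both users and comments see the join clause
```

Note how the join clause shows up for both tables it references, whilst the base restriction clause only belongs to users.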

Choosing different paths and scan methods

Let’s go back to query_planner (source), and what it does:

/*
* query_planner
*    Generate a path (that is, a simplified plan) for a basic query,
*    which may involve joins but not any fancier features.
*
* Since query_planner does not handle the toplevel processing (grouping,
* sorting, etc) it cannot select the best path by itself.  Instead, it
* returns the RelOptInfo for the top level of joining, and the caller
* (grouping_planner) can choose among the surviving paths for the rel.
…
*/
RelOptInfo *
query_planner(PlannerInfo *root,
             query_pathkeys_callback qp_callback, void *qp_extra)
{
…
   /*
    * Construct RelOptInfo nodes for all base relations used in the query.
    */
   add_base_rels_to_query(root, (Node *) parse->jointree);
 
…
   /*
    * Ready to do the primary planning.
    */
   final_rel = make_one_rel(root, joinlist);
 
   return final_rel;
}

The main point of query_planner itself is to create a set of RelOptInfo nodes, do a bunch of processing involving them, and then pass them to make_one_rel. As that name says, it creates one “final rel”, which is also a RelOptInfo node, that is then used to create our final plan.

We’ve looked at a bunch of code already, but now it’s time to get to the exciting part!

The implementation of make_one_rel (source) sits in a file with the important sounding name of allpaths.c - and as referenced earlier, when we talk about plan choices, we need to understand which path is chosen, as that is used to then create a plan node.

/*
 * make_one_rel
 *    Finds all possible access paths for executing a query, returning a
 *    single rel that represents the join of all base rels in the query.
 */
RelOptInfo *
make_one_rel(PlannerInfo *root, List *joinlist)
{
…
   /*
    * Compute size estimates and consider_parallel flags for each base rel.
    */
   set_base_rel_sizes(root);
 
…
 
   /*
    * Generate access paths for each base rel.
    */
   set_base_rel_pathlists(root);
   /*
    * Generate access paths for the entire join tree.
    */
   rel = make_rel_from_joinlist(root, joinlist);
 
   return rel;
}

Paths are chosen in three steps:

Estimate the sizes of the involved tables
Find the best path for each base relation
Find the best path for the entire join tree

The first step is mainly concerned with size estimates as they relate to the output of scanning the relation. This impacts the cost and rows numbers you are familiar with from EXPLAIN - and this may impact joins, but typically should not directly impact index usage.

Now step 2 is key to our goal here. And set_base_rel_pathlists ultimately calls set_plain_rel_pathlist (source), which finally looks like what we are interested in:

/*
 * set_plain_rel_pathlist
 *    Build access paths for a plain relation (no subquery, no inheritance)
 */
static void
set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
{
   …
 
   /* Consider sequential scan */
   add_path(rel, create_seqscan_path(root, rel, required_outer, 0));
 
   /* If appropriate, consider parallel sequential scan */
   if (rel->consider_parallel && required_outer == NULL)
       create_plain_partial_paths(root, rel);
 
   /* Consider index scans */
   create_index_paths(root, rel);
 
   /* Consider TID scans */
   create_tidscan_paths(root, rel);
}

Where Index Scans are made

Creating the two types of index scans: plain vs parameterized

Let’s look at create_index_paths (source), since we want to see how indexes are picked:

/*
* create_index_paths()
*    Generate all interesting index paths for the given relation.
*    Candidate paths are added to the rel's pathlist (using add_path).
*
* To be considered for an index scan, an index must match one or more
* restriction clauses or join clauses from the query's qual condition,
* or match the query's ORDER BY condition, or have a predicate that
* matches the query's qual condition.
*
* There are two basic kinds of index scans.  A "plain" index scan uses
* only restriction clauses (possibly none at all) in its indexqual,
* so it can be applied in any context.  A "parameterized" index scan uses
* join clauses (plus restriction clauses, if available) in its indexqual.
* When joining such a scan to one of the relations supplying the other
* variables used in its indexqual, the parameterized scan must appear as
* the inner relation of a nestloop join; it can't be used on the outer side,
* nor in a merge or hash join.
…
*/
void
create_index_paths(PlannerInfo *root, RelOptInfo *rel)
{
…
   /* Examine each index in turn */
   foreach(lc, rel->indexlist)
   {
       IndexOptInfo *index = (IndexOptInfo *) lfirst(lc);
 
       …
 
       /*
        * Ignore partial indexes that do not match the query.
        */
       if (index->indpred != NIL && !index->predOK)
           continue;
 
       /*
        * Identify the restriction clauses that can match the index.
        */
       match_restriction_clauses_to_index(root, index, &rclauseset);
 
       /*
        * Build index paths from the restriction clauses.  These will be
        * non-parameterized paths.  Plain paths go directly to add_path(),
        * bitmap paths are added to bitindexpaths to be handled below.
        */
       get_index_paths(root, rel, index, &rclauseset,
                       &bitindexpaths);
 
       /*
        * Identify the join clauses that can match the index.  For the moment
        * we keep them separate from the restriction clauses.
        */
       match_join_clauses_to_index(root, rel, index,
                                   &jclauseset, &joinorclauses);
…
       /*
        * If we found any plain or eclass join clauses, build parameterized
        * index paths using them.
        */
       if (jclauseset.nonempty || eclauseset.nonempty)
           consider_index_join_clauses(root, rel, index,
                                       &rclauseset,
                                       &jclauseset,
                                       &eclauseset,
                                       &bitjoinpaths);
   }
 
…
}

There are a lot of things to take in here - and we’ve already removed BitmapOr/BitmapAnd index scans from this code sample.

First of all, this builds two types of index scans:

Plain index scans, that only use the base restriction clauses
Parameterized index scans, that use both base restriction clauses and join clauses

We’ll talk more about the second case in a moment.
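As a rough illustration of the two kinds, here is a Python sketch that builds candidate index paths for a hypothetical two-table query (t1 filtered on t1.field, joined to t2 on t1.id = t2.t1_id). The index_paths function, the tuple layout and the index names are assumptions made for this example. It also simplifies heavily: it only handles single-column indexes and only creates a plain path when a restriction clause matches, whereas Postgres considers more cases (for example ORDER BY).

```python
def index_paths(index_name, index_column, restriction_clauses, join_clauses):
    """Return toy index paths for one single-column index.

    restriction_clauses: list of (column, clause_text)
    join_clauses:        list of (column, clause_text, outer_table)
    """
    rclauses = [c for c in restriction_clauses if c[0] == index_column]
    jclauses = [c for c in join_clauses if c[0] == index_column]

    paths = []
    if rclauses:
        # Plain path: usable in any context, needs no other table
        paths.append({
            "index": index_name,
            "kind": "plain",
            "quals": [c[1] for c in rclauses],
            "required_outer": set(),
        })
    if jclauses:
        # Parameterized path: values must be supplied by the outer table
        paths.append({
            "index": index_name,
            "kind": "parameterized",
            "quals": [c[1] for c in rclauses + jclauses],
            "required_outer": {c[2] for c in jclauses},
        })
    return paths


# t1: WHERE t1.field = '123', index on t1(field)
t1_paths = index_paths(
    "t1_field_idx", "field",
    restriction_clauses=[("field", "t1.field = '123'")],
    join_clauses=[("id", "t1.id = t2.t1_id", "t2")],
)

# t2: no WHERE clause, index on t2(t1_id) only matches the join clause
t2_paths = index_paths(
    "t2_t1_id_idx", "t1_id",
    restriction_clauses=[],
    join_clauses=[("t1_id", "t2.t1_id = t1.id", "t1")],
)

print([p["kind"] for p in t1_paths])
print([(p["kind"], p["required_outer"]) for p in t2_paths])
```

In this toy model the index on t2 can only be used through a parameterized path that depends on t1, which matches the restriction described in the source comment: such a scan has to sit on the inner side of a nested loop.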

Other key aspects to understand:

Partial indexes (i.e. those with an attached WHERE clause on the index definition) are matched against the set of restriction clauses and discarded here if they don’t match
Each index is considered both for an Index Scan and Index Only Scan (through the “build_index_paths” method), as well as for a Bitmap Heap Scan / Bitmap Index Scan
Each potential way of using an index gets a cost assigned - and this cost decides whether Postgres actually chooses the index (see earlier notion of the “best path”), or not

For understanding how costing works, you can look at the cost_index function (source), which gets called from build_index_paths through a few hoops.

/*
* cost_index
*    Determines and returns the cost of scanning a relation using an index.
…
* In addition to rows, startup_cost and total_cost, cost_index() sets the
* path's indextotalcost and indexselectivity fields.  These values will be
* needed if the IndexPath is used in a BitmapIndexScan.
*/
void
cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
          bool partial_path)
{
…
   /*
    * Call index-access-method-specific code to estimate the processing cost
    * for scanning the index, as well as the selectivity of the index (ie,
    * the fraction of main-table tuples we will have to retrieve) and its
    * correlation to the main-table tuple order.
    */
   amcostestimate(root, path, loop_count,
                  &indexStartupCost, &indexTotalCost,
                  &indexSelectivity, &indexCorrelation,
                  &index_pages);

Whilst there are other factors in costing an index scan, the main responsibility falls to the Index Access Method.

Understanding B-tree index cost estimates

The most common index access method (or index type) is B-tree, so let’s look at btcostestimate:

void
btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
              Cost *indexStartupCost, Cost *indexTotalCost,
              Selectivity *indexSelectivity, double *indexCorrelation,
              double *indexPages)
{
…
   /*
    * For a btree scan, only leading '=' quals plus inequality quals for the
    * immediately next attribute contribute to index selectivity (these are
    * the "boundary quals" that determine the starting and stopping points of
    * the index scan).
    */
   indexBoundQuals = …
 
   /*
    * If the index is partial, AND the index predicate with the
    * index-bound quals to produce a more accurate idea of the number of
    * rows covered by the bound conditions.
    */
   selectivityQuals = add_predicate_to_index_quals(index, indexBoundQuals);
 
   btreeSelectivity = clauselist_selectivity(root, selectivityQuals,
                                             index->rel->relid,
                                             JOIN_INNER,
                                             NULL);
   numIndexTuples = btreeSelectivity * index->rel->tuples;
…
   costs.numIndexTuples = numIndexTuples;
   genericcostestimate(root, path, loop_count, &costs);
…

As you can see a lot revolves around determining how many index tuples will be matched by the scan - as that’s the main expensive portion of querying a B-tree index.

The first step is determining the boundaries of the index scan, as it relates to the data stored in the index. In particular this is relevant for multi-column B-tree indexes, where only a subset of the columns might match the query.

You may have heard before about the best practice of ordering B-tree columns so the columns that are queried by an equality comparison (“=” operator) are put first, followed by one optional inequality comparison (a range operator such as “<” or “>”), followed by any other columns. This recommendation is based on the physical structure of the B-tree index, and the cost model also reflects this constraint.

Put differently: The more specific you are with matching equality comparisons, the fewer parts of the index have to be scanned. This is represented here by the calculation of “btreeSelectivity”. If this number is small, the cost of the index scan will be less, as determined by “genericcostestimate” based on the estimated number of index tuples being scanned.
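The rule from the btcostestimate comment (leading “=” quals plus inequality quals on the immediately next column) can be sketched in a few lines of Python. This is a simplified model for illustration only: the boundary_quals function, the column names a, b, c and the representation of quals as (column, operator) pairs are assumptions, not Postgres data structures.

```python
RANGE_OPS = {"<", "<=", ">", ">="}


def boundary_quals(index_columns, quals):
    """Return the quals that bound the scanned portion of a B-tree index.

    index_columns: column names in index order
    quals:         list of (column, operator) pairs from the query
    """
    bound = []
    for column in index_columns:
        column_quals = [q for q in quals if q[0] == column]
        has_equality = any(op == "=" for _, op in column_quals)
        # Equality and range comparisons on this column narrow the scan
        bound.extend(q for q in column_quals
                     if q[1] == "=" or q[1] in RANGE_OPS)
        if not has_equality:
            # Without "=" on this column, later columns no longer
            # determine where the scan starts and stops
            break
    return bound


# WHERE a = 1 AND b > 5 AND b <= 9 AND c = 3
quals = [("a", "="), ("b", ">"), ("b", "<="), ("c", "=")]

print(boundary_quals(["a", "b", "c"], quals))  # c = 3 is not a boundary qual
print(boundary_quals(["a", "c", "b"], quals))  # all four quals bound the scan
```

With the index defined as (a, b, c), the equality on c does not narrow the scanned range in this model, whilst an index on (a, c, b) lets all four comparisons act as boundary quals.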

For creating the ideal B-tree index, you would:

Focus on indexing columns used in equality comparisons
Index the columns with the best selectivity (i.e. being most specific), so that only a small portion of the index has to be scanned
Involve a small number of columns (possibly only one), to keep the index size small - and thus reduce the total number of pages in the index

If you follow these steps, you will create a B-tree index that has a low cost, and that Postgres should choose.

Now, there is one more thing we wanted to talk about, and that involves the notion of Parameterized Index Scans:

Parameterized Index Scans, or: Why Nested Loops are sometimes a good join type

As noted earlier, when Postgres looks at the potential index scans, it creates both plain index scans, and parameterized index scans.

Plain index scans only involve parts of your query that involve the table itself, and would typically reference the clauses found in the WHERE clause.

Parameterized index scans on the other hand involve the part of your query that references two different tables. Oftentimes you would find these clauses in the JOIN clause.

Let’s take a look at a practical example. Assume the following schema and indexes:

CREATE TABLE t1 (
  id bigint PRIMARY KEY,
  field text
);
CREATE TABLE t2 (
  id bigint PRIMARY KEY,
  t1_id bigint,
  other_field text
);
CREATE INDEX t1_field_idx ON t1(field);
CREATE INDEX t2_t1_id_idx ON t2(t1_id);

And this query:

SELECT *
FROM t1
JOIN t2 ON (t1.id = t2.t1_id)
WHERE t1.field = '123'

We have two tables to scan - t1 and t2.

For t1, we can utilize a plain index scan on the t1_field_idx index - and that will perform well, since we have a specific value that is present in the query, that ideally matches a small number of rows.

When we run an EXPLAIN on the query, the simplest plan will look like this:

EXPLAIN SELECT *
FROM t1
JOIN t2 ON (t1.id = t2.t1_id)
WHERE t1.field = '123';

                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Hash Join  (cost=13.74..37.26 rows=5 width=88)
   Hash Cond: (t2.t1_id = t1.id)
   ->  Seq Scan on t2  (cost=0.00..20.70 rows=1070 width=48)
   ->  Hash  (cost=13.67..13.67 rows=6 width=40)
         ->  Bitmap Heap Scan on t1  (cost=4.20..13.67 rows=6 width=40)
               Recheck Cond: (field = '123'::text)
               ->  Bitmap Index Scan on t1_field_idx  (cost=0.00..4.20 rows=6 width=0)
                     Index Cond: (field = '123'::text)
(8 rows)
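The Seq Scan cost in this plan can be reproduced by hand, which helps to demystify the numbers. The Python sketch below assumes the default planner cost settings (seq_page_cost = 1.0, cpu_tuple_cost = 0.01, cpu_operator_cost = 0.0025) and assumes that the planner estimated the still-empty t2 table at 10 pages; the seqscan_total_cost function is a made-up name and a simplification of the sequential scan cost model.

```python
# Default planner cost settings (assumed unchanged)
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01
CPU_OPERATOR_COST = 0.0025


def seqscan_total_cost(pages, tuples, filter_operators=0):
    """Simplified total cost of a sequential scan.

    pages:            estimated number of table pages to read
    tuples:           estimated number of rows in the table
    filter_operators: number of operators evaluated per row in a Filter
    """
    disk_cost = SEQ_PAGE_COST * pages
    cpu_cost = (CPU_TUPLE_COST + CPU_OPERATOR_COST * filter_operators) * tuples
    return disk_cost + cpu_cost


# Seq Scan on t2: rows=1070 from the plan, 10 pages is an assumption
t2_seqscan_cost = seqscan_total_cost(pages=10, tuples=1070)
print(round(t2_seqscan_cost, 2))
```

Under these assumptions the result lines up with the 20.70 total cost shown for the Seq Scan on t2: 10 for reading the pages, plus 10.70 for processing the estimated 1070 rows.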

As we can see Postgres uses a Sequential Scan on t2. Let’s add some more data into the tables, to see if that changes the plan:

INSERT INTO t1 SELECT val, val::text FROM generate_series(0, 1000) AS x(val);
INSERT INTO t2 SELECT val, val, val::text FROM generate_series(0, 1000) AS x(val);

Note that we are effectively creating exactly one entry that matches the t1.field = '123' condition, and we are also creating exactly one t2 entry for each t1 entry.

If we re-run the EXPLAIN, we get the following plan:

                                  QUERY PLAN                                  
------------------------------------------------------------------------------
 Nested Loop  (cost=0.55..16.60 rows=1 width=30)
   ->  Index Scan using t1_field_idx on t1  (cost=0.28..8.29 rows=1 width=11)
         Index Cond: (field = '123'::text)
   ->  Index Scan using t2_t1_id_idx on t2  (cost=0.28..8.29 rows=1 width=19)
         Index Cond: (t1_id = t1.id)
(5 rows)

As you can see, we now get an index scan on t2_t1_id_idx. This shows a Parameterized Index Scan in action - this is only possible because the join chosen by Postgres is a Nested Loop - not a Hash Join or Merge Join.

A quick summary of how different join types impact index usage:

Merge Join: Needs sorted output from the scan node (thus can benefit from a sorted index like B-tree), but doesn't use the JOIN clause to restrict the data when scanning the table
Hash Join: Doesn’t need sorted output, and doesn’t use the JOIN clause to restrict the data when scanning the table
Nested Loop Join: Doesn’t need sorted output from the scan node, but for one of the two tables uses the JOIN clause to restrict the data when scanning the table
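To illustrate the difference for t2, here is a toy Python model of the two join strategies, using the same data as the INSERT statements above. It is a sketch, not how the Postgres executor works: a dictionary stands in for the t2_t1_id_idx index, and the function names are made up. It only counts how many t2 rows each strategy has to look at.

```python
from collections import defaultdict

# Same data as the INSERT statements: values 0..1000 in both tables
t1 = [{"id": v, "field": str(v)} for v in range(0, 1001)]
t2 = [{"id": v, "t1_id": v, "other_field": str(v)} for v in range(0, 1001)]

# Dictionary standing in for the t2_t1_id_idx index
t2_t1_id_idx = defaultdict(list)
for row in t2:
    t2_t1_id_idx[row["t1_id"]].append(row)


def nested_loop_join(outer_rows):
    """For each outer row, fetch matching t2 rows through the index."""
    visited = 0
    result = []
    for outer in outer_rows:
        # The join clause t2.t1_id = t1.id is used as the index condition
        for inner in t2_t1_id_idx.get(outer["id"], []):
            visited += 1
            result.append((outer["id"], inner["id"]))
    return result, visited


def hash_join(outer_rows):
    """Hash the filtered t1 rows, then scan all of t2 and probe the hash."""
    hashed = defaultdict(list)
    for outer in outer_rows:
        hashed[outer["id"]].append(outer)
    visited = 0
    result = []
    for inner in t2:  # sequential scan, the join clause does not restrict it
        visited += 1
        for outer in hashed.get(inner["t1_id"], []):
            result.append((outer["id"], inner["id"]))
    return result, visited


# WHERE t1.field = '123'
matching_t1 = [row for row in t1 if row["field"] == "123"]

nl_result, nl_visited = nested_loop_join(matching_t1)
hj_result, hj_visited = hash_join(matching_t1)
print(nl_visited, hj_visited)  # t2 rows visited by each strategy
```

Both strategies return the same joined row, but in this model the nested loop touches a single t2 row whilst the hash join reads all 1001.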

Understanding what’s in your WHERE, your JOIN clause and your likely JOIN type is key, as all three will impact index usage.

If you see a surprising Sequential Scan, you might want to review whether all possible index scans were parameterized index scans, and how the plan changes when you add an additional WHERE clause.

New features coming soon to pganalyze

If you find you’re having a hard time reasoning about all of this, you are not alone!

The reason we’ve spent a lot of time looking through these parts of the Postgres source code, is because they form the basis of a new upcoming version of the Index Advisor.

And as part of the new Index Advisor, we’ll show you additional information for all scans on a table, to help you assess how Postgres uses existing indexes, and what the best indexing strategy might be.

Here is a sneak peek from our current design iteration:

The same WHERE clause and JOIN clause data from the Postgres planner is shown in the Scans list, to help you make an assessment of how Postgres builds Plain Index Scans and Parameterized Index Scans for your queries.

But more on this another day!

Conclusion

In this post we chased through the Postgres source code until we found the place where indexing decisions happen. We looked at B-tree costing in particular, and at a puzzling case of how Nested Loops can affect index usage, by allowing the use of Parameterized Index Scans.

If you optimize your queries, it helps to understand which tables you are scanning, and what the involved WHERE and JOIN clauses are. Additionally, it’s important to understand the different join types, and that only Nested Loop joins can use the JOIN clause itself as an index condition, through Parameterized Index Scans.

Other helpful resources

Postgres in 2021: An Observer's Year In Review

Lukas Fittl — Fri, 07 Jan 2022 12:00:00 GMT

Every January, the pganalyze team takes time to sit down to reflect on the year gone by. Of course, we are thinking about pganalyze, our customers and how we can improve our product. But, more importantly, we always take a bird's-eye view of what has happened in our industry, and specifically in the Postgres community. As you can imagine: A lot!

So we thought: Instead of trying to summarize everything, let's review what happened with the Postgres project, and what is most exciting from our personal perspective. Coincidentally, a new Postgres Commitfest has just started, so it's the perfect time to read about new functionality that is being proposed by the PostgreSQL community.

The following are my own thoughts on the past year of Postgres, and a few of the things that I'm excited about looking ahead. Let's take a look:

Postgres Performance: Sometimes it's the small things
Does autovacuum dream of 64-bit Transaction IDs?
EXPLAIN: Nested Loops can be deceiving
Extended Statistics: Help the Postgres planner do its job better
Crustaceous Postgres: Using Rust For Extensions & more
Other highlights from Postgres development in 2021
Conclusion

Postgres Performance: Sometimes it's the small things

To start with, I wanted to look at one very specific change that I actually hadn't noticed until recently.

Specifically: The performance of IN clauses, and the work done to improve performance for long IN lists in Postgres 14.

First, let's set up a test table with some data that we can query:

CREATE TABLE tbl (
    id int
);

INSERT INTO tbl SELECT i FROM generate_series(1,100000) n(i);

Now, we run a very simple query with a long IN list on Postgres 13:

postgres=# SELECT count(*) FROM tbl WHERE id IN ([... 1000 integer values ...]);
 count 
-------
  1000
(1 row)

Time: 360.520 ms

This is noticeably slow. With Postgres 14 however:

postgres=# SELECT count(*) FROM tbl WHERE id IN ([... 1000 integer values ...]);
 count 
-------
  1000
(1 row)

Time: 12.246 ms

An amazing 30x improvement! Note that this is most pronounced with Sequential Scans, or other situations where the executor makes a lot of comparisons, i.e. when the expression shows up as a Filter clause.
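One way to picture where a speedup like this comes from: the Postgres 14 change is commonly described as replacing a linear search through the list of constants with a hash table lookup. The Python sketch below is a toy model of that difference, not the executor code. The function names are made up, and the sizes are reduced (10,000 rows and 100 values) to keep it quick to run.

```python
rows = range(1, 10001)          # stand-in for the id column of tbl
in_list = list(range(1, 101))   # stand-in for the constants in the IN list


def count_linear(rows, values):
    """Check each row by walking the list until a match is found."""
    comparisons = 0
    matches = 0
    for row in rows:
        for value in values:
            comparisons += 1
            if row == value:
                matches += 1
                break
    return matches, comparisons


def count_hashed(rows, values):
    """Build a hash set once, then do one lookup per row."""
    lookup = set(values)
    probes = 0
    matches = 0
    for row in rows:
        probes += 1
        if row in lookup:
            matches += 1
    return matches, probes


linear_matches, linear_comparisons = count_linear(rows, in_list)
hashed_matches, hashed_probes = count_hashed(rows, in_list)

print(linear_matches, linear_comparisons)
print(hashed_matches, hashed_probes)
```

Both approaches find the same 100 matching rows, but the linear version needs close to one million comparisons, whilst the hashed version needs one lookup per row. The gap grows with the length of the list, which fits the observation that long IN lists evaluated as a Filter benefit the most.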

The reason I like this change is that it demonstrates what the Postgres community does well: Refine the existing system, and optimize clear inefficiencies, without requiring users to change their queries.

Of course, there are many other exciting performance efforts, here are a few:

Postgres 14: Connection scaling improvements
Postgres 14: Memoization of Nested Loops
Postgres 14: libpq pipelining
In Development: libpq compression
In Development: Asynchronous I/O and direct I/O (see also this presentation by Andres Freund)

Does autovacuum dream of 64-bit Transaction IDs?

Now, on to a much bigger topic. If you've scaled Postgres, you've likely come to meet the archenemy of a large Postgres installation: VACUUM, or rather its cousin, autovacuum, which cleans up dead tuples from your tables and advances the transaction ID horizon in Postgres.

Much has been said (1, 2, 3) about what happens when you hit Transaction ID (TXID) Wraparound, a situation in which Postgres is unable to start a new transaction. A recent blog post illustrating Notion's motivation to shard their Postgres deployment, puts it well:

More worrying was the prospect of transaction ID (TXID) wraparound, a safety mechanism in which Postgres would stop processing all writes to avoid clobbering existing data. Realizing that TXID wraparound would pose an existential threat to the product, our infrastructure team doubled down and got to work.

- Garrett Fidalgo - Herding elephants: Lessons learned from sharding Postgres at Notion

The root cause here is actually very simple. Transaction IDs are stored as 32-bit integers in Postgres, for example on individual rows in a table, to identify when the row first became visible to other transactions.
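To see why 32 bits are a problem, it helps to look at how Transaction IDs have to be compared once the counter can wrap around. The Python sketch below mimics comparing two 32-bit XIDs with modulo-2^32 arithmetic, which is the general idea behind how Postgres decides whether one regular XID is older than another. The function names are made up for this example, and special reserved XIDs are ignored.

```python
def to_int32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def xid_precedes(a, b):
    """True if XID a is considered older than XID b (circular comparison)."""
    return to_int32(a - b) < 0


old_xid = 100

# An XID issued just before the counter wraps is older than one issued after
print(xid_precedes(0xFFFFFFF0, 5))

# Up to about 2 billion XIDs later, old_xid still looks like the past
print(xid_precedes(old_xid, old_xid + 2**31 - 1))

# A little further, and old_xid suddenly appears to be in the future
print(xid_precedes(old_xid, old_xid + 2**31 + 1))
```

The last case is what has to be avoided: once more than roughly two billion transactions separate a row's XID from the current one, the comparison flips. This is why old rows need to be frozen by VACUUM in time, and why Postgres stops accepting new transactions rather than let this happen.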

Most people would agree that moving from 32-bit to 64-bit Transaction IDs is a good idea. There have been multiple attempts over the years, but in recent weeks a new patch by Maxim Orlov has kickstarted a new discussion.

Whilst the community's motivation to fix this is certainly there, the early reviews give a glimpse of what needs to be considered when moving to 64-bit TXIDs:

32-bit systems will have issues with atomic read/write of shared transaction ID variables
Extremely long-running transactions could fail if they exceed the new "short transaction ID" boundary (which remains at 32-bit in this patch)
On-disk format - keeping compatibility with the old format vs rewriting all data when an old cluster is upgraded (this patch tries to avoid changing the on-disk format)
Multixact freeze still needs to happen at a somewhat regular frequency (one of the activities that VACUUM takes care of today)
Memory overhead of larger 64-bit IDs in hot code paths (e.g. those optimized by recent connection scalability improvements)

And Peter Geoghegan puts it succinctly in reviewing the patch:

I believe that a good solution to the problem that this patch tries to solve needs to be more ambitious. I think that we need to return to first principles, rather than extending what we have already.

Despite the email thread being titled "Add 64-bit XIDs into PostgreSQL 15", given these concerns, it's extremely unlikely that a change like this would make it into Postgres 15 at this point - but one can dream, and look ahead to Postgres 16.

Looking for something you can use today?

Postgres 14 brought two great improvements in the area of VACUUM and bloat reduction: (1) The new bottom-up index deletion for B-tree indexes, (2) The new VACUUM "emergency mode" that provides better protection against impending TXID Wraparound.
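If you want to see which tables are closest to needing this protection, a sketch like the following can help. vacuum_failsafe_age is the Postgres 14 setting that controls when the emergency mode kicks in. The LIMIT of 10 is an arbitrary choice for illustration.

```sql
-- Threshold (in transaction IDs) at which VACUUM switches to its failsafe
SHOW vacuum_failsafe_age;

-- Tables with the oldest unfrozen transaction IDs in the current database
SELECT c.oid::regclass AS table_name,
       age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind IN ('r', 'm', 't')
ORDER BY xid_age DESC
LIMIT 10;
```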

EXPLAIN: Nested Loops can be deceiving

Commitfests are about encouraging code reviews, first and foremost. Whilst looking through patches, I noticed a small one, which adds additional information about Nested Loops to EXPLAIN.

The patch was initially proposed back in 2020, and saw some minor refactorings in 2021, but no-one had reviewed it yet in this Commitfest. So I took the opportunity to review it.

First, to understand what the patch aims to do, let's look at a common EXPLAIN output for a Nested Loop:

                                   QUERY PLAN                                    
---------------------------------------------------------------------------------
 Nested Loop (actual rows=23 loops=1)
   Output: tbl1.col1, tprt.col1
   ->  Seq Scan on public.tbl1 (actual rows=5 loops=1)
         Output: tbl1.col1
   ->  Append (actual rows=5 loops=5)
         ->  Index Scan using tprt1_idx on public.tprt_1 (actual rows=2 loops=5)
               Output: tprt_1.col1
               Index Cond: (tprt_1.col1 < tbl1.col1)
         ->  Index Scan using tprt2_idx on public.tprt_2 (actual rows=3 loops=4)
               Output: tprt_2.col1
               Index Cond: (tprt_2.col1 < tbl1.col1)
         ->  Index Scan using tprt3_idx on public.tprt_3 (actual rows=1 loops=2)
               Output: tprt_3.col1
               Index Cond: (tprt_3.col1 < tbl1.col1)
...

Based on this we might assume that each loop produces 5 rows, as the existing "actual rows" statistic shows the average across all loops.

But this example shows well where the math already doesn't add up: The parent Append node returns 5 rows on average, but the child node "actual rows" add up to 6. And the top Nested Loop node returns 23 rows, but we can't see clearly which index these rows are being found in.

With the patch in place, we get an extra row with Loop information:

                                   QUERY PLAN                                    
---------------------------------------------------------------------------------
 Nested Loop (actual rows=23 loops=1)
   Output: tbl1.col1, tprt.col1
   ->  Seq Scan on public.tbl1 (actual rows=5 loops=1)
         Output: tbl1.col1
   ->  Append (actual rows=5 loops=5)
         Loop Min Rows: 2  Max Rows: 6  Total Rows: 23
         ->  Index Scan using tprt1_idx on public.tprt_1 (actual rows=2 loops=5)
               Loop Min Rows: 2  Max Rows: 2  Total Rows: 10
               Output: tprt_1.col1
               Index Cond: (tprt_1.col1 < tbl1.col1)
         ->  Index Scan using tprt2_idx on public.tprt_2 (actual rows=3 loops=4)
               Loop Min Rows: 2  Max Rows: 3  Total Rows: 11
               Output: tprt_2.col1
               Index Cond: (tprt_2.col1 < tbl1.col1)
         ->  Index Scan using tprt3_idx on public.tprt_3 (actual rows=1 loops=2)
               Loop Min Rows: 1  Max Rows: 1  Total Rows: 2
               Output: tprt_3.col1
               Index Cond: (tprt_3.col1 < tbl1.col1)
...

You can see how much clearer the picture is with this. We can understand that both tprt1_idx and tprt2_idx contributed about equally to the result. We can also see that some loop iterations have smaller row counts (2), while other iterations have higher counts (6). When TIMING is turned on, you also get information on the min/max time of the loop iterations.
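The rounding that makes the averages misleading can be reproduced with a quick calculation. This sketch takes the Total Rows and loops values from the patched plan and divides them the way the "actual rows" figure is derived in the Postgres versions discussed here (total rows divided by loops, displayed as a whole number). The VALUES list is hand-copied from that plan.

```sql
SELECT node,
       total_rows,
       loops,
       round(total_rows::numeric / loops, 2) AS avg_rows,
       round(total_rows::numeric / loops)    AS shown_as_actual_rows
FROM (VALUES
        ('Append',               23, 5),
        ('Index Scan on tprt_1', 10, 5),
        ('Index Scan on tprt_2', 11, 4),
        ('Index Scan on tprt_3',  2, 2)
     ) AS t(node, total_rows, loops);
-- avg_rows comes out as 4.60, 2.00, 2.75 and 1.00,
-- which EXPLAIN displays as 5, 2, 3 and 1
```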

Given the prevalence of slow query plans that contain a Nested Loop, this appears to be a very useful patch. The main open item with this patch appears to be the slight overhead caused by collecting additional statistics - something to be discussed further on the mailing list.

Interested in other EXPLAIN improvements? Here's what happened recently:

Postgres 14: pg_stat_statements queryid is now built into core, and shows in EXPLAIN output
In Development: Showing applied extended statistics in EXPLAIN (to quote my colleague Maciek: "Oh neat, that's pretty cool.")
In Development: Showing I/O timings spent reading/writing temp buffers in EXPLAIN

Extended Statistics: Help the Postgres planner do its job better

Going back to what you can use today: Extended Statistics on Expressions, released in Postgres 14.

Let's back up there for a moment. If you are not familiar, extended statistics allow you to collect additional statistics about table contents, so the Postgres planner can provide better query plans.

The general syntax is like this:

CREATE STATISTICS [ IF NOT EXISTS ] statistics_name
    [ ( statistics_kind [, ... ] ) ]
    ON column_name, column_name [, ...]
    FROM table_name

Before Postgres 14 you could already create extended statistics that help the planner understand the correlation between two columns, which is often necessary to avoid selectivity mis-estimates.
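As a sketch of that pre-14 use case, assume a hypothetical addresses table where city and zip are strongly correlated. The table, column and statistics names below are made up for illustration. The dependencies and ndistinct kinds are the ones Postgres has supported since version 10.

```sql
CREATE TABLE addresses (
    city text,
    zip  text
);

-- Tell the planner that city and zip are not independent of each other
CREATE STATISTICS addresses_city_zip (dependencies, ndistinct)
    ON city, zip
    FROM addresses;

-- Extended statistics are only populated by ANALYZE
ANALYZE addresses;
```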

With the new extended statistics for expressions, you can inform the planner how selective a particular expression is, which in turn leads to better query plans. Here is an example of how to use this:

CREATE TABLE tbl (
    a timestamptz
);
CREATE STATISTICS st ON date_trunc('month', a) FROM tbl;

This will cause Postgres to collect statistics not only about a itself (which it does by default), but also about the results of the expression that uses the date_trunc function. You can find a complete example in the Postgres docs.
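To check what was collected, you can load some data, run ANALYZE, and look at the pg_stats_ext_exprs view that was added in Postgres 14. The generated sample data below is an assumption for illustration only.

```sql
-- Hourly timestamps across one year (illustrative data)
INSERT INTO tbl
SELECT generate_series('2021-01-01'::timestamptz,
                       '2021-12-31'::timestamptz,
                       interval '1 hour');

ANALYZE tbl;

-- One row per expression covered by an extended statistics object
SELECT statistics_name, expr, n_distinct
FROM pg_stats_ext_exprs
WHERE tablename = 'tbl';
```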

In addition to this, there are many changes in-flight that are being discussed:

In Development: Improve selectivity estimates when extended statistics are present
In Development: Extended statistics for Var op Var clauses / Expr op Expr
In Development: Estimating JOINs using extended statistics

Crustaceous Postgres: Using Rust For Extensions & more

A side topic that isn't actually about Postgres development itself, but still pretty exciting on a larger scale: Postgres and Rust. As you probably know, Postgres itself is written in C, and that is unlikely to change.

However, there are two great examples of Rust being used to augment the Postgres ecosystem.

First, you can write Postgres extensions in Rust using pgx, and by now this approach has matured to the point that even established extension authors such as the TimescaleDB team have started adopting Rust for some of their projects, such as the TimescaleDB toolkit.

Second, there are new systems being developed that build on Postgres, that utilize Rust as their language of choice, e.g. for networked services. The most interesting development in 2021 in this regard is the work of the team at ZenithDB, that is working on an Apache 2.0-licensed variant of a shared disk-type scale-out architecture (similar to Amazon Aurora), built on Postgres, with services written in Rust.

Conclusion

The above might feel quite extensive, but that's not nearly all of the things that have happened with Postgres in 2021. I'm excited to be part of such a vibrant community contributing to making Postgres continuously better and am eager to see what's to come for Postgres in 2022!

At pganalyze we're committed to providing the best Postgres monitoring and observability to help you uncover deep insights about Postgres performance. Whether your Postgres runs in the cloud, your on-premises data center, or a Raspberry Pi: You can give pganalyze a try.


Understanding Postgres GIN Indexes: The Good and the Bad

Lukas Fittl — Thu, 02 Dec 2021 12:00:00 GMT

Adding, tuning and removing indexes is an essential part of maintaining an application that uses a database. Oftentimes, our applications rely on sophisticated database features and data types, such as JSONB, array types or full text search in Postgres. A simple B-tree index does not work in such situations, for example to index a JSONB column. Instead, we need to look beyond, to GIN indexes.

Almost 15 years ago to the day, GIN indexes were added in Postgres 8.2, and they have since become an essential tool in the application DBA’s toolbox. GIN indexes can seem like magic, as they can index what a normal B-tree cannot, such as JSONB data types and full text search. With this great power comes great responsibility, as GIN indexes can have adverse effects if used carelessly.

In this article, we’ll take an in-depth look at GIN indexes in Postgres, building on, and referencing many great articles that have been written over the years by the community. We’ll start by reviewing what GIN indexes can do, how they are structured, and their most common use cases, such as for indexing JSONB columns, or to support Postgres full text search.

But understanding the fundamentals is only part of the puzzle. It’s much better when we can also learn from real world examples on busy databases. We’ll review a specific situation that the GitLab database team found themselves in this year, as it relates to write overhead caused by GIN indexes on a busy table with more than 1000 updates per minute.

And we’ll conclude with a review of the trade-offs between the GIN write overhead and the possible performance gains. Plus: We’ve added support for GIN index recommendations to the pganalyze Index Advisor.

To start with, let’s review what a GIN index looks like:

GIN Index in Postgres: What is it actually?
- Indexing tsvector columns for Postgres full text search
- Indexing LIKE searches with Trigrams and gin_trgm_ops
PostgreSQL, JSONB and GIN Indexes
- Postgres GIN index for JSONB columns using jsonb_ops and jsonb_path_ops
Multi-Column GIN Indexes, and Combining GIN and B-tree indexes
The downside of GIN Indexes: Expensive Updates
GIN index support in the pganalyze Index Advisor
Conclusion
Other helpful resources

GIN Index in Postgres: What is it actually?

“The GIN index type was designed to deal with data types that are subdividable and you want to search for individual component values (array elements, lexemes in a text document, etc)” - Tom Lane

The GIN index type was initially created by Teodor Sigaev and Oleg Bartunov, first released in Postgres 8.2, on December 5, 2006 - almost 15 years ago. Since then, GIN has seen many improvements, but the fundamental structure remains similar. GIN stands for "Generalized Inverted iNdex". "Inverted" refers to the way that the index structure is set up, building a table-encompassing tree of all column values, where a single row can be represented in many places within the tree. By comparison, a B-tree index generally has one location where an index entry points to a specific row.

Another way of explaining GIN indexes comes from a presentation by Oleg Bartunov and Alexander Korotkov at PGConf.EU 2012 in Prague. They describe a GIN index like the table of contents in a book, where the heap pointers (to the actual table) are the page numbers. Multiple entries can be combined to yield a specific result, like the search for “compensation accelerometers” in this example:

It’s important to note that the exact mapping of a column of a given data type is dependent on the GIN index operator class. That means, instead of having a uniform representation of data in the index, like with B-trees, a GIN index can have very different index contents depending on which data type and operator class you are using. Some data types, such as JSONB, have more than one GIN operator class to support the optimal index structure for specific query patterns.

Before we move on, one more thing to know: GIN indexes only support Bitmap Index Scans (not Index Scan or Index Only Scan), due to the fact that they only store parts of the row values in each index page. Don’t be surprised when EXPLAIN always shows Bitmap Index / Heap Scans for your GIN indexes.

Let’s take a look at a few examples:

Indexing tsvector columns for Postgres full text search

The initial motivation for GIN indexes was full text search. Before GIN was added, there was no way to index full text search in Postgres, instead requiring a very slow sequential scan of the table.

We’ve previously written about Postgres full text search with Django, as well as how to do it with Ruby on Rails on the pganalyze blog.

A simple example for a full text search index looks like this:

CREATE INDEX pgweb_idx ON pgweb USING GIN (to_tsvector('english', body));

This uses an expression index to create a GIN index that contains the indexed tsvector values for each row. You can then query like this:

SELECT title
FROM pgweb
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'friend');

As described in the Postgres documentation, the tsvector GIN index structure is focused on lexemes:

“GIN indexes are the preferred text search index type. As inverted indexes, they contain an index entry for each word (lexeme), with a compressed list of matching locations. Multi-word searches can find the first match, then use the index to remove rows that are lacking additional words.”
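To see what ends up being indexed, you can call to_tsvector directly on a sample string. The sentence below is an arbitrary example. The result lists the normalized lexemes with their positions, with stop words removed by the english configuration.

```sql
SELECT to_tsvector('english', 'GIN indexes make searching documents fast');
```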

GIN indexes are the best starting point when using Postgres Full Text Search. There are situations where a GIST index might be preferred (see the Postgres documentation for details), and if you run your own server you could also consider the newer RUM index types available through an extension.

Let's see what else GIN has to offer:

Indexing LIKE searches with Trigrams and gin_trgm_ops

Sometimes Full Text Search isn't the right fit, but you find yourself needing to index a LIKE search on a particular column:

CREATE TABLE test_trgm (t text);
SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';

Due to the nature of the LIKE operation, which supports arbitrary wildcard expressions, this is fundamentally hard to index. However, the pg_trgm extension can help. When you create an index like this:

CREATE INDEX trgm_idx ON test_trgm USING gin (t gin_trgm_ops);

Postgres will split the row values into trigrams, allowing indexed searches:

EXPLAIN SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';

                               QUERY PLAN                               
------------------------------------------------------------------------
 Bitmap Heap Scan on test_trgm  (cost=16.00..20.02 rows=1 width=32)
   Recheck Cond: (t ~~ '%foo%bar'::text)
   ->  Bitmap Index Scan on trgm_idx  (cost=0.00..16.00 rows=1 width=0)
         Index Cond: (t ~~ '%foo%bar'::text)
(4 rows)

Effectiveness of this method varies with the exact data set. But when it works, it can speed up searches on arbitrary text data quite significantly.
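If you want to see the trigrams that pg_trgm derives from a value, the extension's show_trgm function returns them as an array. Note that the extension has to be created before the gin_trgm_ops operator class can be used. The sample string here is arbitrary.

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Returns the set of three-character sequences extracted from the text
SELECT show_trgm('foobar');
```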

PostgreSQL, JSONB and GIN Indexes

JSONB was added to Postgres in version 9.4, about eight years after GIN indexes were introduced - and it shows the flexibility of the GIN index type that they are the preferred way to index JSONB columns.

Postgres GIN index for JSONB columns using jsonb_ops and jsonb_path_ops

With JSONB in Postgres we gain the flexibility of not having to define our schema upfront, but instead we can dynamically add data to a column in our table in JSON format.

The most basic GIN index example for JSONB looks like this:

CREATE TABLE test (
  id bigserial PRIMARY KEY,
  data jsonb
);
INSERT INTO test(data) VALUES ('{"field": "value1"}');
INSERT INTO test(data) VALUES ('{"field": "value2"}');
INSERT INTO test(data) VALUES ('{"other_field": "value42"}');
CREATE INDEX ON test USING gin(data);

As you can see with EXPLAIN, this is able to use the index, for example when querying for all rows that have the field key defined:

EXPLAIN SELECT * FROM test WHERE data ? 'field';

                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=8.00..12.01 rows=1 width=40)
   Recheck Cond: (data ? 'field'::text)
   ->  Bitmap Index Scan on test_data_idx  (cost=0.00..8.00 rows=1 width=0)
         Index Cond: (data ? 'field'::text)
(4 rows)

The way this gets stored is based on the keys and values of the JSONB data. In the above test data, the default jsonb_ops operator class would store the following values in the GIN index, as separate entries: field, other_field, value1, value2, value42. Depending on the search the GIN index will combine multiple index entries to satisfy the specific query conditions.

Now, we can also use the non-default jsonb_path_ops operator class with a JSONB GIN index. This uses an optimized GIN index structure that would instead store the above data as three individual entries using a hash function: hashfn(field, value1), hashfn(field, value2) and hashfn(other_field, value42).

The jsonb_path_ops class is intended to efficiently support containment queries. First we specify the operator class during index creation:

CREATE INDEX ON test USING gin(data jsonb_path_ops);

And then we can use it for queries such as the following:

EXPLAIN SELECT * FROM test WHERE data @> '{"field": "value1"}';

                                 QUERY PLAN                                  
-----------------------------------------------------------------------------
 Bitmap Heap Scan on test  (cost=8.00..12.01 rows=1 width=40)
   Recheck Cond: (data @> '{"field": "value1"}'::jsonb)
   ->  Bitmap Index Scan on test_data_idx1  (cost=0.00..8.00 rows=1 width=0)
         Index Cond: (data @> '{"field": "value1"}'::jsonb)
(4 rows)
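One trade-off to keep in mind: jsonb_path_ops supports the containment operator (@>) and the jsonpath operators (@? and @@), but not the key-existence operator (?) used earlier. With both indexes in place, a sketch like this lets you compare their sizes. The index names are the auto-generated ones shown in the EXPLAIN output, so adjust them if yours differ.

```sql
SELECT c.relname AS index_name,
       pg_size_pretty(pg_relation_size(c.oid)) AS index_size
FROM pg_class c
WHERE c.relname IN ('test_data_idx', 'test_data_idx1');
```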

As you can see, it’s easy to index a JSONB column. Note that you could technically also index JSONB with other index types by taking specific parts of the data. For example, we could use a B-tree expression index to index the values stored under the field key:

CREATE INDEX ON test USING btree ((data ->> 'field'));

The Postgres query planner will then use the specific expression index behind the scenes, if your query matches the expression:

EXPLAIN SELECT * FROM test WHERE data->>'field' = 'value1';

                                 QUERY PLAN                                 
---------------------------------------------------------------------------
 Index Scan using test_expr_idx on test  (cost=0.13..8.15 rows=1 width=40)
   Index Cond: ((data ->> 'field'::text) = 'value1'::text)
(2 rows)

There is one more thing we should look at with finding the right GIN index, and that is multi-column GIN indexes.

Multi-Column GIN Indexes, and Combining GIN and B-tree indexes

Oftentimes you’ll have queries that filter on a column that uses a data type that’s ideal for GIN indexes, such as JSONB, but you are also filtering on another column that is more of a typical B-tree index candidate:

CREATE TABLE records (
  id bigserial PRIMARY KEY,
  customer_id int4,
  data jsonb
);

SELECT * FROM records WHERE customer_id = 123 AND data @> '{ "location": "New York" }';

In addition you might have a query like the following:

SELECT * FROM records WHERE customer_id = 123;

And you are considering which index to create for the two queries combined.

There are two fundamental strategies you can take:

(1) Create two separate indexes, one on customer_id using a B-tree, and one on data using GIN
- In this situation, for the first query, Postgres might use BitmapAnd to combine the index search results from both indexes to find the affected rows
- Whilst the idea of using two separate indexes sounds great in theory, in practice it often turns out to be the worse performing option. You can find some discussions about this on the Postgres mailing lists.
(2) Create one multi-column GIN index on both customer_id and data
- Note that multi-column GIN indexes don’t help much with making the index more effective, but they can help cover multiple queries with the same index

For implementing the second strategy, we need the help of the “btree_gin” extension in Postgres (part of contrib) that contains operator classes for data types that are not subdividable.

You can create the extension and the multi-column index like this:

CREATE EXTENSION btree_gin;
CREATE INDEX ON records USING gin (customer_id, data);

Note that index column order does not matter for GIN indexes. And as we can see, this gets used during query planning:

EXPLAIN SELECT * FROM records WHERE customer_id = 123 AND data @> '{ "location": "New York" }';

                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 Bitmap Heap Scan on records  (cost=16.01..20.03 rows=1 width=41)
   Recheck Cond: ((customer_id = 123) AND (data @> '{"location": "New York"}'::jsonb))
   ->  Bitmap Index Scan on records_customer_id_data_idx  (cost=0.00..16.01 rows=1 width=0)
         Index Cond: ((customer_id = 123) AND (data @> '{"location": "New York"}'::jsonb))
(4 rows)

It’s rather uncommon to use multi-column GIN indexes, but depending on your workload it might make sense. Remember that larger indexes mean more I/O, making index lookups slower, and writes more expensive.
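For comparison, the first strategy needs no extension. A minimal sketch, relying on the default index names that Postgres assigns:

```sql
-- Strategy (1): two separate indexes that Postgres may combine with BitmapAnd
CREATE INDEX ON records USING btree (customer_id);
CREATE INDEX ON records USING gin (data);

-- Compare the resulting plan against the multi-column GIN index
EXPLAIN SELECT * FROM records
WHERE customer_id = 123 AND data @> '{ "location": "New York" }';
```

Running both variants against production-sized data is the most reliable way to decide between them.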

The downside of GIN Indexes: Expensive Updates

As you saw in the examples above, GIN indexes are special because they often contain multiple index entries per single row that is being inserted. This is essential to enable the use cases that GIN supports, but causes one significant problem: Updating the index is expensive.

Due to the fact that a single row can cause tens or, in the worst case, hundreds of index entries to be updated, it’s important to understand the special fastupdate mechanism of GIN indexes.

By default, fastupdate is enabled for GIN indexes. It causes index updates to be deferred and applied in batches, which reduces the overhead of a single UPDATE at the expense of having to do the work at a later point.

The data that is deferred is kept in the special pending list, which then gets flushed to the main index structure in one of three situations:

The gin_pending_list_limit (default of 4MB) is reached during a regular index update
Explicit call to the gin_clean_pending_list function
Autovacuum on the table with the GIN index (GIN pending list cleanup happens at the end of vacuum)
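To check which limits apply on your system, you can look at the server-wide setting and at any per-index overrides. This sketch only reads catalog data. The reloptions column is NULL for indexes that use the defaults.

```sql
-- Server-wide default for the pending list size
SHOW gin_pending_list_limit;

-- GIN indexes and any storage parameters set on them
-- (e.g. fastupdate or gin_pending_list_limit)
SELECT c.oid::regclass AS index_name, c.reloptions
FROM pg_class c
JOIN pg_am am ON am.oid = c.relam
WHERE c.relkind = 'i'
  AND am.amname = 'gin';
```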

As you can imagine, flushing the pending list can be quite an expensive operation. One symptom of index write overhead with GIN is therefore that every Nth INSERT or UPDATE statement is suddenly a lot slower. This happens when you run into the first scenario above, where the gin_pending_list_limit is reached.

This exact situation happened to the team at GitLab recently. Let’s look at a real life example of where GIN updates became a problem.

GIN trigram indexes: A lesson from GitLab

The team at GitLab often publishes their discussions of database optimizations publicly, and we can learn a lot from these interactions. A recent example discussed a GIN trigram index that caused merge requests to be quite slow occasionally:

“We can see there are a number of slow updates for updating a merge request. The interesting thing here is that we see very little locking statements (locking is logged after 5 seconds waiting), which suggests something else is occurring to make these slow.”

This was determined to be caused by the GIN pending list:

“Anecdotally, cleaning the gin index pending-list for the description field on the merge_requests table can cost multiple seconds. The overhead does increase when there are more pending entries to write to the index. In this informal survey of manually running gin_clean_pending_list( 'index_merge_requests_on_description_trigram'::regclass ) the duration varied between 465 ms and 3155 ms.”

The team further investigated, and determined that the GIN pending list was flushed a very high number of times during business hours:

“this gin index's pending list fills up roughly once every 2.7 seconds during the peak hours of a normal weekday.”

If you want to read the full story, GitLab’s Matt Smiley has done an excellent analysis of the problem they’ve encountered.

As we can see, getting good data about the actual overhead of GIN pending list updates is critical.

Measuring GIN pending list overhead and size

To validate whether the GIN pending list is a problem on a busy table, we can do a few things:

First, we could utilize the pgstatginindex function together with something like psql’s \watch command to keep a close eye on a particular index:

CREATE EXTENSION pgstattuple;
SELECT * FROM pgstatginindex('myindex');

 version | pending_pages | pending_tuples 
---------+---------------+----------------
       2 |             0 |              0
(1 row)
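Combined with \watch, this could look like the following in psql. The index name and the one-second interval are placeholders.

```sql
-- Re-run the query every second until interrupted with Ctrl+C
SELECT now() AS sampled_at, pending_pages, pending_tuples
FROM pgstatginindex('myindex') \watch 1
```

A pending_tuples value that repeatedly climbs and then drops back to zero suggests that the pending list is being flushed.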

Second, if you run your own database server, you can use “perf” dynamic tracepoints to measure calls to the ginInsertCleanup function in Postgres:

sudo perf probe -x /usr/lib/postgresql/14/bin/postgres ginInsertCleanup
sudo perf stat -a -e probe_postgres:ginInsertCleanup -- sleep 60

An alternate method, using DTrace, was described in a 2019 PGCon talk. The authors of that talk also ended up visualizing different gin_pending_list_limit and work_mem settings:

As they discovered, the memory limit during flushing of the pending list makes quite a noticeable difference.

If you don't have the luxury of direct access to your database server, you can estimate how often the pending list fills up based on the average size of index tuples and other statistics.

Now, if we determine that we have a problem, what can we do about it?

Strategies for dealing with GIN pending list update issues

There are multiple alternate ways you can resolve issues like the one GitLab encountered:

(1) Reduce gin_pending_list_limit
- Have more frequent, smaller flushes
- This may sound odd - but gin_pending_list_limit started out as being determined by work_mem (instead of being its own setting), and is only configurable separately since Postgres 9.5 - explaining the 4MB default, which may be too high in some cases
(2) Increase gin_pending_list_limit
- Have more opportunities to clean up the list outside of the regular workload
(3) Turn off fastupdate
- Take the overhead with each individual INSERT/UPDATE
(4) Tune autovacuum to run more often on the table, in order to clean the pending list
(5) Explicitly call gin_clean_pending_list(), instead of relying on autovacuum
(6) Drop the GIN index
- If you have alternate ways of indexing the data, for example using expression indexes

Depending on your workload one or multiple of these approaches could be a good fit.
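As a rough sketch, the first four options map to the following commands. The names myindex and mytable are placeholders, and the specific values are arbitrary examples rather than recommendations.

```sql
-- (1) / (2): change the pending list limit for one index (value in kilobytes)
ALTER INDEX myindex SET (gin_pending_list_limit = 1024);

-- (3): turn off fastupdate for one index
ALTER INDEX myindex SET (fastupdate = off);

-- (4): make autovacuum trigger earlier on the table
ALTER TABLE mytable SET (autovacuum_vacuum_scale_factor = 0.01);
```

Note that turning off fastupdate only affects future insertions. Entries already in the pending list stay there until they are flushed by VACUUM or gin_clean_pending_list().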

In addition, it’s important to ensure you have sufficient memory available during the GIN pending list cleanup. The memory limit used for the pending list flush can be confusing, and is not related to the size of gin_pending_list_limit. Instead it uses the following Postgres settings:

work_mem during regular INSERT/UPDATE
maintenance_work_mem during gin_clean_pending_list() call
autovacuum_work_mem during autovacuum
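For the explicit cleanup call, that means you can raise maintenance_work_mem just for the session doing the flush. A sketch, with a placeholder index name and an arbitrary memory value:

```sql
SET maintenance_work_mem = '256MB';

-- Flushes the pending list into the main index structure and
-- returns the number of pending list pages that were removed
SELECT gin_clean_pending_list('myindex'::regclass);

RESET maintenance_work_mem;
```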

Last but not least, you may want to consider partitioning or sharding a table that encounters problems like this. It may not be the easiest thing to do, but scaling GIN indexes to heavy write workloads is quite a tricky business.

GIN index support in the pganalyze Index Advisor

Not sure if your workload could utilize a GIN index, or which index to create for your queries?

We have now added initial support for GIN and GIST index recommendations to the pganalyze Index Advisor.

Here is an example of a GIN index recommendation for an existing tsvector column:

Note that the costing and size estimation logic for GIN and GIST indexes is still being actively developed.

We recommend trying out the Index Advisor recommendation on your own system to assess its effectiveness, as well as monitoring the production table for write overhead after you have added an index. You may also need to tweak your queries to make use of a particular index.

Conclusion

GIN indexes are powerful, and often the only way to index certain queries and data types. But with great power comes great responsibility. Use GIN indexes wisely, especially on tables that are heavily written to.

And when you are not sure which GIN index could work, try out the pganalyze Index Advisor.

If you want to share this article with your peers, feel free to tweet it.

Other helpful resources

Using Postgres CREATE INDEX: Understanding operator classes, index types & more

How we deconstructed the Postgres planner to find indexing opportunities

Efficient Search in Rails with Postgres (PDF eBook)

Efficient Postgres Full Text Search in Django

Full Text Search in Milliseconds with Rails and PostgreSQL

Efficient Pagination in Django and Postgres

eBook: Effective Indexing in Postgres

Webinar: How To Reason About Indexing Your Postgres Database

5mins of Postgres E17: Demystifying Postgres for application developers: A mental model for tables and indexes

pganalyze Index Advisor for Postgres


How we deconstructed the Postgres planner to find indexing opportunities

Lukas Fittl — Tue, 02 Nov 2021 12:00:00 GMT

Everyone who has used Postgres has directly or indirectly used the Postgres planner. The Postgres planner is central to determining how a query gets executed, whether indexes get used, how tables are joined, and more. When Postgres asks itself “How do we run this query?”, the planner answers.

And just like Postgres has evolved over decades, the planner has not stood still either. It can sometimes be challenging to understand what exactly the Postgres planner does, and which data it bases its decisions on.

Earlier this year we set out to gain a deep understanding of the planner to improve indexing tools for Postgres. Based on this work we launched the first iteration of the pganalyze Index Advisor over a month ago, and have received an incredible amount of feedback and overall response.

In this post we take a closer look at how we extracted the planner into a standalone library, just like we did with pg_query. We then assess how this approach compares to an actual running server, and what is possible now that we can run the planner code. Based on this we look at how we used its decision making know-how to find indexing opportunities, and review the topic of clause selectivity, and how we incorporated feedback by a Postgres community member.

Planning a Postgres query without a running database server
Understanding Postgres clause selectivity
- How we incorporated Postgres community feedback
Creating the best index, vs creating “good enough” indexes
- Join us for design research sessions
Conclusion

Planning a Postgres query without a running database server

At pganalyze we offer performance recommendations for production database systems, without requiring complex installation steps or version upgrades. Whilst Postgres’ extension system is very capable, and we have many ideas on what we could track or do inside Postgres itself, we intentionally decided not to focus on a Postgres extension for giving index advice.

There are three top motivations for not creating an extension:

Index decisions often happen during development, where the database that you are working with is not production sized
Not everyone has direct access to the production database - it’s important we create tooling that can be used by the whole development team
Adopting a new Postgres extension on a production database is risky, especially if the code is new - and you may not be able to install custom extensions (e.g. on Amazon RDS)

We’ve thus focused on creating something that runs separately from Postgres, but knows how Postgres works. Our approach is inspired by our work on pg_query, and enables planning a query solely based on the query text, the schema definition, and table statistics.

We utilized libclang to automatically extract source code from Postgres, just like we've done for pg_query. Whilst for pg_query we extracted a little bit over 100,000 lines of Postgres source, for the planner we extracted almost 470,000 lines of Postgres source, more than 4x the amount of code. For reference, Postgres itself is almost 1,000,000 lines of source code (as determined by sloccount).

Examples of code from Postgres we didn’t use: The executor (except for some initialization routines), the storage subsystem, frontend code, and various specialized code paths.

A good amount of engineering time later, we ended up with a seemingly simple function in a C library that takes a query and a schema definition, and returns a result similar to an EXPLAIN plan:

/*
 * Plan the provided query utilizing the schema definition and the
 * provided table statistics, and return an EXPLAIN-like result.
 */
PgPlanResult pg_plan(const char *query, const char *schema_and_statistics)
{
  …
}

This function is deterministic: when you pass the same set of input parameters, you will always get the same output.

This required some additional modifications to the extracted code (we have about 90 small patches to adjust certain code paths), especially in places where Postgres does the rare on-demand checking of file sizes, or looking at the B-tree meta page. All of these are instead a fixed input parameter, defined using SET commands in the schema definition.

How accurate is this planning process?

Let’s take a look at one of our own test queries:

WITH unused_indexes AS MATERIALIZED (
  SELECT schema_indexes.id, schema_indexes.name, schema_indexes.last_used_at, schema_indexes.database_id, schema_indexes.table_id 
    FROM schema_indexes
         JOIN schema_tables ON (schema_indexes.table_id = schema_tables.id)
   WHERE schema_indexes.database_id IN (1)
         AND schema_tables.invalidated_at_snapshot_id IS NULL
         AND schema_indexes.invalidated_at_snapshot_id IS NULL
         AND schema_indexes.is_valid
         AND NOT schema_indexes.is_unique AND 0 <> ALL (schema_indexes.columns)
         AND schema_indexes.last_used_at < now() - '14 day'::interval
)
SELECT ui.id, ui.name, ui.last_used_at, ui.database_id, ui.table_id 
  FROM unused_indexes ui
 WHERE COALESCE((
         SELECT size_bytes
           FROM schema_index_stats_35d sis
          WHERE sis.schema_index_id = ui.id
                AND collected_at = '2021-10-31 06:40:04' LIMIT 1), 0) > 32768

This query is used inside the pganalyze application to find indexes that were not in use in the last 14 days. Running EXPLAIN (FORMAT JSON) for the query on our production system, we get a result like this:

 [
   {
     "Plan": {
       "Node Type": "CTE Scan",
       …
       "Startup Cost": 3172.85,
       "Total Cost": 3311.01, 
       "Plan Rows": 11,
       "Plan Width": 60,
       "Filter": "(COALESCE((SubPlan 2), '0'::bigint) > 32768)",
       "Plans": [
         {
           "Node Type": "Nested Loop",
           …
           "Startup Cost": 1.12,
           "Total Cost": 3172.85,
           "Plan Rows": 32,
           "Plan Width": 63, 
           "Inner Unique": true,
           "Plans": [
             {
               "Node Type": "Index Scan",
              "Parent Relationship": "Outer", 
               …
               "Index Name": "index_schema_indexes_on_database_id",
               …
               "Startup Cost": 0.56,
               "Total Cost": 2581.00,
               "Plan Rows": 69,
               "Plan Width": 63,
               "Index Cond": "(database_id = 1)",
               "Filter": "(is_valid AND (NOT is_unique) AND (last_used_at < '2021-10-17'::date) AND (0 <> ALL (columns)))"
             },
             {
               "Node Type": "Index Scan",
               "Parent Relationship": "Inner",
               …
               "Index Name": "schema_tables_pkey",
               …
               "Startup Cost": 0.56,
               "Total Cost": 8.58,
               "Plan Rows": 1,
               "Plan Width": 8,
               "Index Cond": "(id = schema_indexes.table_id)",
               "Filter": "(invalidated_at_snapshot_id IS NULL)"
             }
...

Note that we are intentionally running EXPLAIN without ANALYZE, since we care about the cost-based estimation model used by the planner.

And now, running the same query, with its schema definition and production statistics (but not the actual table data!) provided to the pg_plan function:

[
  {
    "Plan": {
      "Node ID": 0,
      "Node Type": "CTE Scan",
      …
      "Startup Cost": 3181.43,
      "Total Cost": 3324.07,
      "Plan Rows": 11,
      "Plan Width": 60,
      "Filter": "(COALESCE((SubPlan 2), '0'::bigint) > 32768)",
      "Plans": [
        {
          "Node ID": 1,
          "Node Type": "Nested Loop",
          …
          "Startup Cost": 1.12,
          "Total Cost": 3181.43,
          "Plan Rows": 33,
          "Plan Width": 63,
          "Inner Unique": true,
          "Plans": [
            {
              "Node ID": 2,
              "Node Type": "Index Scan",
              "Parent Relationship": "Outer", 
              …
              "Index Name": "index_schema_indexes_on_database_id",
              …
              "Startup Cost": 0.56,
              "Total Cost": 2581.00,
              "Plan Rows": 70,
              "Plan Width": 63,
              "Index Cond": "(database_id = 1)",
              "Filter": "(is_valid AND (NOT is_unique) AND (last_used_at < '2021-10-17'::date) AND (0 <> ALL (columns)))"
            },
            {
              "Node ID": 3,
              "Node Type": "Index Scan",
              "Parent Relationship": "Inner",
              …
              "Index Name": "schema_tables_pkey",
              …
              "Startup Cost": 0.56,
              "Total Cost": 8.58,
              "Plan Rows": 1,
              "Plan Width": 8,
              "Index Cond": "(id = schema_indexes.table_id)",
              "Filter": "(invalidated_at_snapshot_id IS NULL)"
            }
            …

As you can see, for this query the plan cost estimation is within a 1% margin of the actual production estimates. That means, we provided the Postgres planner the exact same input parameters as used on the actual database server, and the cost calculation matched almost to the dot.
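To make the "within 1%" claim concrete, here is a small Python check that uses only the cost figures shown in the two outputs above. The function names are ours, chosen for illustration.

```python
# (startup cost, total cost) per node, copied from the two plans above.
production = {"CTE Scan": (3172.85, 3311.01), "Nested Loop": (1.12, 3172.85)}
standalone = {"CTE Scan": (3181.43, 3324.07), "Nested Loop": (1.12, 3181.43)}


def relative_difference(expected: float, actual: float) -> float:
    """Return |actual - expected| / expected as a fraction."""
    return abs(actual - expected) / expected


def compare_total_costs(expected_plan: dict, actual_plan: dict) -> dict:
    """Map each node type to the relative difference in total cost."""
    return {
        node: relative_difference(expected_plan[node][1], actual_plan[node][1])
        for node in expected_plan
    }


differences = compare_total_costs(production, standalone)
for node, diff in differences.items():
    print(f"{node}: total cost differs by {diff:.2%}")
```

For these two nodes the total costs differ by roughly 0.4% and 0.3% respectively, and the two Index Scan nodes have identical costs in both outputs.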

Now that we’ve established a basis for running the planner and getting cost estimates, let’s look at what we can do with this.

Finding multiple possible plan paths, not just the best path

When the Postgres planner plans a query, it is under time-sensitive circumstances. That is, all extra work to find a better plan would lead to the planner itself being slow. To be fast, the planner quickly throws away plan options it does not consider worth pursuing.

That unfortunately means we can’t just run EXPLAIN with a flag that says “show me all possible plan variants” - the planner code is simply not written in a way that makes this possible, at least not today.

However, with our pg_plan logic running outside the server itself, we do not have these strict speed requirements, and can therefore spend more time looking at alternatives and keeping them around for analysis. For example, here is the internal information we have for a scan node on a table, that illustrates the different paths that could be taken to fulfill the query:

    "Scans": [
      {
        "Node ID": 2,
        "Relation OID": 16398,
        "Restriction Clauses": [
	      …
        ],
        "Plans": [
          {
            "Plan": {
              "Node Type": "Index Scan",
              "Index Name": "schema_indexes_table_id_name_idx",
              …
              "Startup Cost": 0.68,
              "Total Cost": 352.94,
              "Index Cond": "(table_id = id)",
              "Filter": "(is_valid AND (NOT is_unique) AND (last_used_at < '2021-10-17'::date) AND (database_id = 1) AND (0 <> ALL (columns)))"
            }
          },
          {
            "Plan": {
              "Node Type": "Index Scan",
              "Index Name": "index_schema_indexes_on_database_id",
              ...
              "Startup Cost": 0.56,
              "Total Cost": 2581.00,
              "Index Cond": "(database_id = 1)",
              "Filter": "(is_valid AND (NOT is_unique) AND (last_used_at < '2021-10-17'::date) AND (0 <> ALL (columns)))"
            }
          },
          {
            "Plan": {
              "Node ID": 0,
              "Node Type": "Seq Scan",
              ...
              "Startup Cost": 0.00,
              "Total Cost": 3933763.60,
              "Filter": "((invalidated_at_snapshot_id IS NULL) AND is_valid AND (NOT is_unique) AND (last_used_at < '2021-10-17'::date) AND (database_id = 1) AND (0 <> ALL (columns)))"
            }
          },

As you can see, the Seq Scan option was clearly more expensive and was not chosen. You can also see the different index options and their costs.

What is especially interesting with this plan is that there was actually a cheaper index scan available, but Postgres did not end up using it in the final plan. This is because the Nested Loop ended up being cheaper by using the schema_indexes table as the outer table in the nested loop. The first index could only have been used if the Nested Loop relationship was inverted. That is, if table_id values were used as the input to the schema_indexes scan, instead of table_id values being the output that's matched against the schema_tables table's id column.
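The distinction between "cheapest scan" and "cheapest scan that fits the chosen join order" can be illustrated with a short Python sketch over the three paths listed above. The needs_join_input flag is our own annotation, derived from the Index Cond "(table_id = id)", which depends on values from the other table. It is not a field in the actual output.

```python
# Scan paths for schema_indexes, with costs copied from the output above.
# "needs_join_input" is a hypothetical annotation: True when the index
# condition depends on values supplied by the other side of the join.
scan_paths = [
    {"index": "schema_indexes_table_id_name_idx", "total_cost": 352.94,
     "needs_join_input": True},
    {"index": "index_schema_indexes_on_database_id", "total_cost": 2581.00,
     "needs_join_input": False},
    {"index": None, "total_cost": 3933763.60,  # the Seq Scan
     "needs_join_input": False},
]


def cheapest(paths):
    """Return the path with the lowest total cost."""
    return min(paths, key=lambda path: path["total_cost"])


def cheapest_as_outer(paths):
    """Return the cheapest path that can run without input from a join,
    which is what the outer side of a Nested Loop requires."""
    return cheapest([p for p in paths if not p["needs_join_input"]])


print("Cheapest scan overall:", cheapest(scan_paths)["index"])
print("Cheapest scan usable as outer table:", cheapest_as_outer(scan_paths)["index"])
```

The sketch only compares scan costs in isolation. The real planner weighs the cost of the whole join, which is why the more expensive scan can still be part of the cheapest overall plan.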

As you can see, this data can be especially useful when determining why a particular index wasn’t used, or to consider how to consolidate indexes. In the pganalyze Index Advisor this is surfaced visually in the advanced analysis view:

Note we also indicate the individual filter clauses for the scan, and show which indexes are matching each clause.

Making index recommendations based on restriction clauses

In addition to comparing different existing indexes, we can use the data available to the planner to ask the question “What would the best index look like?”.

For a scan like the above example, we get a list of restriction clauses, which is a combination of the WHERE clauses as well as the JOIN condition. For index scans to work as expected, one or more of the clauses need to match the index definition.

The data looks like this for each scan:

        "Restriction Clauses": [
          {
            "ID": 1,
            "Expression": "schema_indexes.is_valid",
            "Selectivity": 0.9926,
            "Relation Column": "is_valid"
          },
          {
            "ID": 2,
            "Expression": "(schema_indexes.database_id = 1)",
            "Selectivity": 0.0001,
            "OpExpr": {
              "Operator": {
                "Oid": 416,
                "Name": "=",
                "Left Type": "bigint",
                "Right Type": "integer",
                "Result Type": "boolean",
                "Source Func": "int84eq"
              }
            },
            "Relation Column": "database_id"
          },
         ...

Using this data we then attempt a best guess at making a new index, run a CREATE INDEX command behind the scenes, and re-run the Postgres planner to reconsider the new index. If the cost of the new scan improves on the initial scan we make a recommendation and note the difference in estimated cost.

In summary, you can imagine the Index Advisor working roughly like this:
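A minimal Python sketch of that loop, under stated assumptions: plan_cost stands in for a call into the standalone planner and is passed in as a parameter, and evaluate_candidate is a hypothetical name. Neither is the actual Index Advisor API.

```python
from typing import Callable, Optional


def evaluate_candidate(
    query: str,
    schema: str,
    create_index_sql: str,
    plan_cost: Callable[[str, str], float],
) -> Optional[dict]:
    """Plan the query with and without the candidate index.

    Returns a recommendation when the candidate lowers the estimated
    cost, and None otherwise.
    """
    cost_before = plan_cost(query, schema)
    # "Create" the index by appending it to the schema definition, then
    # let the planner reconsider the query.
    cost_after = plan_cost(query, schema + "\n" + create_index_sql)
    if cost_after < cost_before:
        return {
            "index": create_index_sql,
            "cost_before": cost_before,
            "cost_after": cost_after,
        }
    return None


def fake_plan_cost(query: str, schema: str) -> float:
    """Stand-in planner for the demo: the costs are made-up numbers."""
    return 10.0 if "CREATE INDEX" in schema else 100.0


recommendation = evaluate_candidate(
    "SELECT * FROM t WHERE a = 1",
    "CREATE TABLE t (a int);",
    "CREATE INDEX ON t (a);",
    fake_plan_cost,
)
print(recommendation)
```

Passing the planner in as a function keeps the comparison logic separate from how the costs are produced, so the same loop works with a stub in tests.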

Understanding Postgres clause selectivity

If you look closely at the earlier advanced analysis screenshot, you will notice a new field that we’ve just made available in a new Index Advisor update: Selectivity.

What is Selectivity? It indicates what fraction of rows of the table will be matched by the particular clause of the query. This information is then used by Postgres to estimate the row count that a node returns, as well as determine the cost of that plan node.

Selectivity estimations are front and center to how the planner operates, but they are unfortunately hidden behind the scenes, and historically one would have had to resort to counting/filtering the actual data to confirm how frequent certain values are, or do manual queries against the Postgres catalog.

How does the planner know the selectivity? Counting actual table rows would be very expensive in time-sensitive situations. Instead, it primarily relies on the pg_statistic table (often accessed through the pg_stats view for debugging), which keeps table statistics collected by the ANALYZE command in Postgres. You can learn more about how the Postgres planner uses statistics in the Postgres documentation.

The data in the pg_stats view can be queried like this:

SELECT * FROM pg_stats WHERE tablename = 'z' AND attname = 'a';

-[ RECORD 1 ]----------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
schemaname             | public
tablename              | z
attname                | a
inherited              | f
null_frac              | 0
avg_width              | 4
n_distinct             | 17
most_common_vals       | {2,3,7,12,13,4,1,5,11,14,9,6,10,8,15,16,0}
most_common_freqs      | {0.0653,0.06446667,0.063766666,0.06363333,0.063533336,0.063433334,0.0629,0.061966665,0.061833333,0.0618,0.0611,0.0605,0.0604,0.060366668,0.059666667,0.0332,0.032133333}
histogram_bounds       | 
correlation            | 0.061594862
most_common_elems      | 
most_common_elem_freqs | 
elem_count_histogram   |

In this example you can see that there are a total of 17 distinct values (n_distinct), with values between 1 and 15 having roughly equal frequency (about 6% each), and 0 and 16 being about half as frequent (most_common_vals/most_common_freqs). None of the rows have NULL values (null_frac).

Now, to have accurate plans in the Index Advisor, this same information can be provided using the new special SET commands in the schema definition:

SET pganalyze.avg_width.public.z.a = 4;
SET pganalyze.correlation.public.z.a = 0.061594862;
SET pganalyze.most_common_freqs.public.z.a = '{0.0653,0.06446667,0.063766666,0.06363333,0.063533336,0.063433334,0.0629,0.061966665,0.061833333,0.0618,0.0611,0.0605,0.0604,0.060366668,0.059666667,0.0332,0.032133333}';
SET pganalyze.most_common_vals.public.z.a = '{2,3,7,12,13,4,1,5,11,14,9,6,10,8,15,16,0}';
SET pganalyze.n_distinct.public.z.a = 17;
SET pganalyze.null_frac.public.z.a = 0;

You can learn how to retrieve this information, as well as all the available settings, in the Index Advisor documentation.
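As an illustration of the naming pattern used in the SET commands above, here is a short Python sketch that renders a dictionary of column statistics into that form. The helper names are ours, and it only covers numeric values and numeric arrays as in the example. The Index Advisor documentation remains the reference for the supported settings.

```python
def pg_array_literal(values) -> str:
    """Render a list of numbers as a Postgres array literal, e.g. {1,2,3}."""
    return "{" + ",".join(str(value) for value in values) + "}"


def statistics_to_set_commands(schema: str, table: str, column: str, stats: dict) -> list:
    """Build one SET command per statistic, following the
    pganalyze.<statistic>.<schema>.<table>.<column> pattern shown above."""
    commands = []
    for name in sorted(stats):
        value = stats[name]
        if isinstance(value, (list, tuple)):
            # Arrays are passed as quoted array literals.
            rendered = "'" + pg_array_literal(value) + "'"
        else:
            rendered = str(value)
        commands.append(f"SET pganalyze.{name}.{schema}.{table}.{column} = {rendered};")
    return commands


column_stats = {
    "null_frac": 0,
    "avg_width": 4,
    "n_distinct": 17,
    "most_common_vals": [2, 3, 7, 12, 13, 4, 1, 5, 11, 14, 9, 6, 10, 8, 15, 16, 0],
    "correlation": 0.061594862,
}

commands = statistics_to_set_commands("public", "z", "a", column_stats)
print("\n".join(commands))
```

Text values would additionally need quoting and escaping, which this sketch does not attempt.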

Based on this data we can now calculate the selectivity of a clause like z.a = 12 to determine that it is 0.0636. Or put differently, the planner estimates that 6.36% of the table would match this condition. This same information is now directly visible in the Index Advisor, when viewing the advanced analysis.
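For an equality clause on a value that appears in most_common_vals, the estimate is simply the matching entry in most_common_freqs. The Python sketch below reproduces the 0.0636 figure from the statistics shown earlier. The table size of 1,000,000 rows is an assumption for the example, and values outside the most-common list (where Postgres falls back to other statistics) are not modeled.

```python
from typing import Optional

# Copied from the pg_stats output for public.z.a above.
most_common_vals = [2, 3, 7, 12, 13, 4, 1, 5, 11, 14, 9, 6, 10, 8, 15, 16, 0]
most_common_freqs = [
    0.0653, 0.06446667, 0.063766666, 0.06363333, 0.063533336, 0.063433334,
    0.0629, 0.061966665, 0.061833333, 0.0618, 0.0611, 0.0605, 0.0604,
    0.060366668, 0.059666667, 0.0332, 0.032133333,
]


def equality_selectivity(value, vals, freqs) -> Optional[float]:
    """Return the recorded frequency if value is a most-common value.

    Returns None otherwise; other estimation paths are out of scope here.
    """
    for candidate, frequency in zip(vals, freqs):
        if candidate == value:
            return frequency
    return None


def estimated_rows(selectivity: float, table_rows: int) -> int:
    """Estimate the matching row count as selectivity times table size."""
    return round(selectivity * table_rows)


selectivity = equality_selectivity(12, most_common_vals, most_common_freqs)
print(f"selectivity of z.a = 12: {selectivity:.4f}")
print("estimated rows:", estimated_rows(selectivity, 1_000_000))  # assumed table size
```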

How we incorporated Postgres community feedback

At this point we’d also like to give a shout-out to Hubert Lubaczewski (aka “depesz”), who reviewed the initial version of the index advisor, had some critical feedback, and provided an example we could investigate further.

Based on improvements we've made, we now take selectivity estimates into account for index suggestions. In particular, we give priority to columns with low selectivity, i.e. those that match a small number of rows. Note this requires use of SET commands in addition to the raw schema data for the best results.
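As a rough illustration of that prioritization (not the actual Index Advisor logic, which also has to consider operators, data types and more), the Python sketch below sorts the two restriction clauses from the earlier output by their selectivity. The function name is hypothetical.

```python
# Column and selectivity copied from the "Restriction Clauses" output above.
restriction_clauses = [
    {"column": "is_valid", "selectivity": 0.9926},
    {"column": "database_id", "selectivity": 0.0001},
]


def prioritize_columns(clauses):
    """Order columns so that the clause matching the fewest rows comes first."""
    ordered = sorted(clauses, key=lambda clause: clause["selectivity"])
    return [clause["column"] for clause in ordered]


column_order = prioritize_columns(restriction_clauses)
print(column_order)
```

Here database_id (matching about 0.01% of rows) is ranked ahead of is_valid (matching over 99% of rows), which on its own would filter out almost nothing.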

With these recent changes the pganalyze Index Advisor recommendation matches depesz's handcrafted index suggestion in the blog post:

This is a good example of how our planner-based index advisor approach can be improved and tuned, as its behavior is modeled on Postgres itself.

Creating the best index, vs creating “good enough” indexes

Another question that came up in multiple conversations is “Should I just create all indexes that the Index Advisor recommends for each query?”.

Unless you have just a handful of queries, the answer to that is no - you shouldn’t just create every index, because that would slow down writes to the table, as they have to update each index separately.

Today, the best way to utilize the index advisor for a whole database is to try out different CREATE INDEX statements - and make sure to update the schema definition with your index definition, to have the Index Advisor make a determination based on the existing indexes.

But we are taking this a step further. The work we are currently doing in this area is focused on two aspects:

Utilize query workload data from pg_stat_statements to weigh common queries heavier in index recommendations, and come up with “good enough” indexes that cover more queries
Estimate the write overhead of a new index, based on the number of updates/deletes/inserts on a table, as well as the estimated index size

With this, not only are we targeting better summary recommendations, but we also want to help you determine when you can consolidate indexes, where you have two existing very similar indexes.

Curious to learn more? Sign up to join us for a design research session:

Join us for design research sessions

If you are up for testing early prototypes and answering questions to help us understand your workflows better, then we would like to invite you to our design research sessions:

pganalyze Design Research Sign-Up

Conclusion

Realistically, there will always be a trial-and-error aspect in making indexing decisions. But good tools can help guide you in those decisions, no matter your level of Postgres know-how. Our goal with the pganalyze Index Advisor is to make indexing an activity that can be done by the whole team, and where it’s easy to get a CREATE INDEX statement to start working from.

As you see, the Index Advisor is based on the core logic of Postgres itself, and that forms the basis for making complex assessments behind the scenes. We believe in an iterative process and sharing what we’ve learned, and hope to continue the conversation on how to make indexing better for Postgres.

We've recently made a number of updates to the Index Advisor. Give it a try, and use the new SET table statistics syntax for best results. Encounter an issue with the Index Advisor? You can provide feedback through our dedicated discussion board on GitHub, or send us a support request for in-app functionality.

If you want to share this article with your peers, feel free to tweet it.


A better way to index your Postgres database: pganalyze Index Advisor

Lukas Fittl — Thu, 23 Sep 2021 12:00:00 GMT

When you run an application with a relational database attached, you will no doubt have encountered this question: Which indexes should I create?

For some of us, indexing comes naturally, and B-tree, GIN and GiST are words of everyday use. And for some of us it’s more challenging to find out which index to create, taking a lot of time to get right. But what unites us is that creating and tweaking indexes is part of our job when we use a relational database such as Postgres in production. We need to get indexes right, in order to make sure our application performs well.

There are multiple ways to determine which indexes get used in your Postgres database. For example, you may choose to query the pg_stat_user_indexes view. There are Postgres extensions like HypoPG to try out hypothetical indexes on your database server. And some of us may decide to go ahead and simply index every column on every table.

But the reality nowadays is that modern apps are complex, and applications built on Postgres grow at an incredible pace. This makes indexing more important, but also more challenging than ever. As developers we want to focus on what matters, and not spend hours investigating which Postgres index to create.

At the beginning of this year we set out to improve the status quo for indexing with Postgres. And today, after many months of effort and having published an eBook about indexing in Postgres, we’re excited to announce the new pganalyze Index Advisor for Postgres.

Before we dive into all the details, let’s take a step back and ask ourselves “How could we determine which index to create?”

Postgres Indexing: Is machine learning the answer?
How Postgres determines when to use an index
Creating the best Postgres index for your query
Review existing indexes with the Index Advisor
Try out the Index Advisor for free with the standalone tool
Automatic index advisor for your production queries in pganalyze
pganalyze Index Advisor and new pricing plans
Conclusion

Postgres Indexing: Is machine learning the answer?

It’s 2021, and of course we had to ask ourselves - is this a problem that requires ML and AI? Couldn’t we just train a model to create the right indexes for us?

We turned to GitHub Copilot, the most sophisticated AI-based helper that exists today for developers, and asked it to create an index for a real-world query in our own Postgres database:

Suffice it to say that indexing like this is not effective. You will end up with significant overhead due to indexing almost everything, including columns that are not even referenced in the query.

Whilst this ML model will certainly improve, and there is research on more purpose-built solutions for databases, the point is: ML is not the magic solution we are looking for. We need more than just machine learning to know which indexes to create.

In fact, from our own experience, knowing which index to create does not require an ML model at all. Knowing how to create the best index can be done with a deterministic approach, that takes into account production database queries and schema statistics, and has a detailed understanding of how Postgres works.

And who knows best how Postgres works? Postgres itself!

How Postgres determines when to use an index

We started out by asking ourselves the question: How does Postgres decide which index to use? We can find this logic in the Postgres planner, which takes a parsed query and turns it into an execution plan.

Specifically, we decided to look at the function create_index_paths(..), where you can see that Postgres loops over all indexes on a particular table, and decides which indexes can be used:

void
create_index_paths(PlannerInfo *root, RelOptInfo *rel)
{
	...

	/* Skip the whole mess if no indexes */
	if (rel->indexlist == NIL)
		return;

	/* Bitmap paths are collected and then dealt with at the end */
	bitindexpaths = bitjoinpaths = joinorclauses = NIL;

	/* Examine each index in turn */
	foreach(lc, rel->indexlist)
	{
		IndexOptInfo *index = (IndexOptInfo *) lfirst(lc);

		/*
		 * Ignore partial indexes that do not match the query.
		 * (generate_bitmap_or_paths() might be able to do something with
		 * them, but that's of no concern here.)
		 */
		if (index->indpred != NIL && !index->predOK)
			continue;
   ...

Going into all this logic would likely fill multiple books, and it is based on decades of academic research. Clearly, Postgres is very sophisticated about determining which indexes can be used for a given query. Amongst the core decisions it makes are:

Does the index match the columns used in the query?
Does the query’s operator match the operator class of the index?
Does the index have a sort order that can be used by the query to avoid an explicit Sort step?
Does the query condition match a partial index condition?

And many other requirements and heuristics that need extensive knowledge of Postgres’ inner workings.
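To give a feel for the shape of these checks, here is a deliberately simplified Python model of two of them: skipping partial indexes whose predicate the query does not satisfy, and requiring that the index covers at least one column the query filters on. The data structures and names are hypothetical. Postgres proves predicate implication and checks operator classes, whereas this toy compares strings.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class Index:
    name: str
    columns: Tuple[str, ...]
    predicate: Optional[str] = None  # WHERE clause of a partial index, if any


def usable_indexes(indexes, query_columns: Set[str], query_predicates: Set[str]):
    """Return the names of indexes this toy model considers usable."""
    usable = []
    for index in indexes:
        # Ignore partial indexes that do not match the query.
        if index.predicate is not None and index.predicate not in query_predicates:
            continue
        # Keep the index only if it covers a column the query filters on.
        if any(column in query_columns for column in index.columns):
            usable.append(index.name)
    return usable


indexes = [
    Index("idx_email", ("email",)),
    Index("idx_email_active", ("email",), predicate="deleted_at IS NULL"),
    Index("idx_created_at", ("created_at",)),
]

# A query filtering on email only, without a deleted_at condition.
result = usable_indexes(indexes, {"email"}, set())
print(result)
```

With a query that also includes the deleted_at IS NULL condition, the partial index would pass the first check as well.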

At pganalyze, we looked at this, and other functions, and we asked ourselves: What if we used the Postgres planner to tell us which index it would like to see, based on a given query?

That is, instead of asking “does this index match this query?”, we are asking “what’s the perfect index for this query?”. Perfect as in: ticks all the boxes in terms of operators/operator classes, columns and data types, and can be used to fulfill the query filter and join clauses of the query, if possible.

This logic based on the Postgres planner is the centerpiece of the new pganalyze Index Advisor. Our index advisor is available in the pganalyze app, but we also decided to provide a free, standalone version available to anyone.

Simply paste your query and schema data and get insights on whether existing indexes are useful, or learn why indexes you thought might help are ignored. Note that data uploaded to the standalone pganalyze Index Advisor stays local within your browser, unless you explicitly use the share functionality.

Going forward in this article, when you see examples and screenshots of the index advisor for Postgres, we are showing the public, standalone tool.

Creating the best Postgres index for your query

Let’s go back to our earlier example, and run it through the pganalyze Index Advisor:

As you can see, we get a recommendation for a single multi-column index that covers all columns that are in the WHERE clause, except for the column that’s inside the OR condition. This is the best index that we can create to ensure the query runs fast.

At launch the index advisor is focused on recommending B-tree indexes, with support for other index types coming soon.

Note that the index advisor also understands common query patterns like filtering out records based on a deleted_at column, and recommends partial indexes for these queries:

Review existing indexes with the Index Advisor

The pganalyze Index Advisor is also able to determine how different existing indexes perform, to help you understand which index Postgres will most likely use.

For example, imagine a schema and index definition like this:

CREATE TABLE events(
  id bigserial PRIMARY KEY,
  created_at timestamptz,
  severity smallint,
  organization_id bigint,
  description text,
  details jsonb
);
CREATE INDEX ON events(organization_id, severity);

We want to understand how effective this index is for queries that only query the “severity” column, without looking up a particular organization.

With the index advisor, we can see the cost difference between the indexes, and that Postgres prefers using the single-column index in most situations:

This can be explained by the fact that single-column indexes are usually smaller, and it’s more efficient, especially in older Postgres releases, to find index records when the queried column is listed first in the column list. You may still choose to use a multi-column index, but this helps you understand the trade-off.

Try out the Index Advisor for free with the standalone tool

Want to try out the index advisor yourself? As mentioned above, we developed a standalone version of the index advisor that runs fully in your web browser, powered by our self-contained Postgres planner compiled to WebAssembly.

You can simply go to https://pganalyze.com/index-advisor, paste your query and schema, and get your recommendations. If you don’t have a query and schema ready, for example because you are reading this on your mobile phone, you can take a look at how it works with a set of examples we added for your convenience.

We’ve also ensured that the standalone tool is ready for collaboration. If you want to share index recommendations with your team, simply click the [Share] button. After you confirm, this uploads the result of the index advisor to the pganalyze servers for sharing, and gives you a unique URL to share. Note that unless you share, all data stays local within your web browser.

Of course, copying query texts can be tedious and a lot of work. But, if you are a pganalyze customer, we already have your query information in our app. The second part of today’s launch is about the new in-app pganalyze Index Advisor.

Automatic index advisor for your production queries in pganalyze

With the new Index Advisor in pganalyze, you can now see at a glance what index recommendations exist for each of your queries. You can simply go to the query details page for your queries, and see what the Index Advisor recommends:

This is really nice, but we already have work underway to help you get an even better assessment of index usage summarized across your whole database. But more on that soon (sign up for the newsletter if you want to get updates about this).

pganalyze Index Advisor and new pricing plans

The pganalyze Index Advisor represents a significant improvement to the core functionality of pganalyze, and introduces additional sophisticated processing for each query received by pganalyze. We are therefore taking this moment to introduce both a new Production and a new Scale plan. In addition to the Index Advisor, the new Scale plan also features SAML-based Single Sign On in early access, to integrate with identity providers such as Okta.

If you are an existing pganalyze customer on (what is now) a legacy plan, you can try out the Index Advisor until the end of October 2021. Trying out the Index Advisor requires no changes to your existing pganalyze integration.

If you do not have an account with us at the moment but sign up for a new trial, the pganalyze Index Advisor will be activated for your 14-day trial. Try it out today in the pganalyze app, or start a new trial.

Conclusion

All of us at pganalyze are excited to share the new pganalyze Index Advisor with you. Try out the standalone tool or explore the new in-app functionality today. We hope the standalone tool is a service you will come back to time and again and get value out of. Feel free to bookmark it!

You can provide feedback through our dedicated discussion board on GitHub, or send us a support request for in-app functionality. We look forward to hearing from you.

If you want to share this article with your peers, feel free to tweet it.


Using Postgres CREATE INDEX: Understanding operator classes, index types & more

Lukas Fittl — Thu, 12 Aug 2021 12:00:00 GMT

Most developers working with databases know the challenge: New code gets deployed to production, and suddenly the application is slow. We investigate, look at our APM tools and our database monitoring, and we find out that the new code caused a new query to be issued. We investigate further, and discover the query is not able to use an index.

But what makes an index usable by a query, and how can we add the right index in Postgres?

In this post we’ll look at the practical aspects of using the CREATE INDEX command, as well as how you can analyze a PostgreSQL query for its operators and data types, so you can choose the best index definition.

How do you create an index in Postgres?
- Parse analysis: How Postgres interprets your query
- Looking behind the scenes: Operators and data types
Finding the right index type
Specifying operator classes during CREATE INDEX
Specifying multiple columns when adding a Postgres index
Using functions and expressions in an index definition
Specifying a WHERE clause to create partial PostgreSQL indexes
Using INCLUDE to create a covering index for Index-Only Scans
Adding and dropping PostgreSQL indexes safely on production
Conclusion

How do you create an index in Postgres?

Before we dive into the internals, let’s set the stage and look at the most basic way of creating an index in Postgres. The essence of adding an index is this:

CREATE INDEX ON [table] ([column1]);

For an actual example, let’s say we have a query on our users table that looks for a particular email address:

SELECT * FROM users WHERE users.email = 'test@example.com';

We can see this query is searching for values in the “email” column - so the index we should create is on that particular column:

CREATE INDEX ON users (email);

When we run this command, Postgres will create an index for us.

It's important to remember that indexes are redundant data structures. If you drop an index you don't lose any data. The primary benefit of an index is to allow faster searching of particular rows in a table. The alternative to having an index is to have Postgres scan each row individually ("Sequential Scan"), which is of course very slow for large tables.
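To make the difference concrete, here is a minimal Python sketch (not Postgres code) that models a table as a list of rows and an index as a separate, sorted copy of the indexed values. The names `rows`, `index`, `seq_scan` and `index_lookup` are made up for this illustration, and a sorted list is only a rough stand-in for a real B-tree.

```python
import bisect

# A toy "table": each row is (id, email). Purely illustrative data.
rows = [(i, f"user{i}@example.com") for i in range(1000)]

# A toy "index": a redundant, sorted copy of (email, row position).
index = sorted((email, pos) for pos, (_id, email) in enumerate(rows))

def seq_scan(email):
    """Check every row, like a Sequential Scan. Returns (matches, rows examined)."""
    matches = [row for row in rows if row[1] == email]
    return matches, len(rows)

def index_lookup(email):
    """Binary-search the sorted index, then fetch only the matching rows."""
    start = bisect.bisect_left(index, (email,))
    matches = []
    examined = 0
    for key, pos in index[start:]:
        examined += 1
        if key != email:
            break  # past the last matching entry
        matches.append(rows[pos])
    return matches, examined

print(seq_scan("user500@example.com"))
print(index_lookup("user500@example.com"))
```

Both functions return the same row, but the sequential scan examines all 1,000 rows while the index lookup only touches a couple of index entries. Deleting `index` would lose no data, since `rows` still holds everything.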

Let's take a look behind the scenes of how Postgres determines whether to use an index.

Parse analysis: How Postgres interprets your query

When Postgres runs our query, it steps through multiple stages. At a high level, they are:

Parsing (see our blog post on the Postgres parser)
Parse analysis
Planning
Execution

Throughout these stages the query is no longer just text - it's represented as a tree. Each stage modifies and annotates the tree structure, until it's finally executed. For understanding Postgres index usage, we need to first understand what parse analysis does.

Let's pick a slightly more complex example:

SELECT * FROM users WHERE users.email = 'test@example.com' AND users.deleted_at IS NULL;

We can look at the result of parse analysis by turning on the debug_print_parse setting, and then looking at the Postgres logs (not recommended on production databases):

LOG:  parse tree:
DETAIL:     {QUERY 
	   ...
	      :quals 
	         {BOOLEXPR 
	         :boolop and 
	         :args (
	            {OPEXPR 
	            :opno 98 
	            :opfuncid 67 
	            :opresulttype 16 
	            :opretset false 
	            :opcollid 0 
	            :inputcollid 100 
	            :args (
	               ...
	            )
	            :location 38
	            }
	            {NULLTEST 
	            :arg 
	               ...
	            :nulltesttype 0 
	            :argisrow false 
	            :location 80
	            }
...

This format is a bit hard to read - let’s look at it in a more visual way, and with names instead of OIDs:

We can see two important parse nodes here, one for each expression in the WHERE clause. The OpExpr node, and the NullTest node. For now, let's focus on the OpExpr node.

Looking behind the scenes: Operators and data types

It's important to remember that Postgres is an object-relational database system. That is, it's designed from the ground up to be extensible. Many of the references that are added in parse analysis are not hard-coded logic, but instead reference actual database objects in the Postgres catalog tables.

The two most important objects to know about are data types and operators. You are most likely familiar with data types in Postgres, for example you have used them when specifying the schema for your table. Operators in Postgres define how particular comparisons between one or two values, for example in a WHERE clause, are implemented.

The OpExpr node represents an expression that uses an operator to compare one or two values of a given type. In this case you can see we are using the =(text, text) operator. This operator utilizes the = symbol as its name, and has a text data type on the left and right of the operator.

We can query the pg_operator table to see details about it, including which function implements the operator:

SELECT oid, oid::regoperator, oprcode, oprnegate::regoperator
  FROM pg_operator
 WHERE oprname = '=' AND oprleft = 'text'::regtype AND oprright = 'text'::regtype;

 oid |     oid      | oprcode |   oprnegate   
-----+--------------+---------+---------------
  98 | =(text,text) | texteq  | <>(text,text)
(1 row)

And if you really want to know what’s happening, you can look up the operator's underlying texteq function in the Postgres source:

/*
 * Comparison functions for text strings.
 */
Datum
texteq(PG_FUNCTION_ARGS)
{
    ...
    if (lc_collate_is_c(collid) ||
		collid == DEFAULT_COLLATION_OID ||
		pg_newlocale_from_collation(collid)->deterministic)
    {
        ...
    	result = (memcmp(VARDATA_ANY(targ1), VARDATA_ANY(targ2),
				  len1 - VARHDRSZ) == 0);
        ...
    }
    else
	{
        ...
        result = (text_cmp(arg1, arg2, collid) == 0);
        ...
    }
    ...
}

That function illustrates nicely how Postgres considers the collation to determine whether it can do a fast comparison that simply compares bytes, or whether it has to do a more expensive full text comparison. As we can see from the source, using a C locale for your collation can yield performance benefits.

Of course you can also define your own custom operators that work on your own custom data types. Postgres is extensible like that, and that’s actually pretty neat.

Operators are essential for creating the right index. The operator that is used by an expression is the most important detail, besides the column name, that indicates whether a particular index can be used.

You can think of operators as the "how" we want to search the table for values. For example, we may use a simple = operator to match values for equality against an input value. Or we may utilize a more complex operator, such as @@ to perform a text search on a tsvector column.

Finding the right index type

When you think of an index type, it's important to remember that it's ultimately a specific data structure that supports a specific, limited set of search operators. For example, the most common index type in Postgres, the B-tree index, supports the = operator as well as the range comparison operators (<, <=, >=, >), and the ~ and ~* operators in some cases. It does not support any other operators.

Let's say we have a tsvector column on our users table, and we use the @@ operator to search the column:

SELECT * FROM users WHERE about_text_search @@ to_tsquery('index');

Even if we create an index with the default settings, which gives us a B-tree index, Postgres keeps doing a sequential scan:

CREATE INDEX ON users(about_text_search);

pgaweb=# EXPLAIN SELECT * FROM users WHERE about_text_search @@ to_tsquery('index');
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Seq Scan on users  (cost=10000000000.00..10000000006.51 rows=1 width=4463)
   Filter: (about_text_search @@ to_tsquery('index'::text))
(2 rows)

This is because a B-tree index does not have the correct data structure to support text searches. There is no operator class that connects B-tree indexes to the @@(tsvector,tsquery) operator.

Like earlier, thanks to Postgres extensibility, we can introspect the system to understand operator classes. Which index type can support the @@ operator on a tsvector column?

We can query the internal tables to answer this question:

SELECT am.amname AS index_method,
       opf.opfname AS opfamily_name,
       amop.amopopr::regoperator AS opfamily_operator
  FROM pg_am am,
       pg_opfamily opf,
       pg_amop amop
 WHERE opf.opfmethod = am.oid AND amop.amopfamily = opf.oid
       AND amop.amopopr = '@@(tsvector,tsquery)'::regoperator;

 index_method | opfamily_name |  opfamily_operator   
--------------+---------------+----------------------
 gist         | tsvector_ops  | @@(tsvector,tsquery)
 gin          | tsvector_ops  | @@(tsvector,tsquery)
(2 rows)

Looks like we need either a GIN or GiST index! We can create a GIN index like this:

CREATE INDEX ON users USING gin (about_text_search);

And voilà, it can be used by the query:

=# EXPLAIN SELECT * FROM users WHERE about_text_search @@ to_tsquery('index');
                                        QUERY PLAN                                         
-------------------------------------------------------------------------------------------
 Bitmap Heap Scan on users  (cost=8.25..12.51 rows=1 width=4463)
   Recheck Cond: (about_text_search @@ to_tsquery('index'::text))
   ->  Bitmap Index Scan on users_about_text_search_idx1  (cost=0.00..8.25 rows=1 width=0)
         Index Cond: (about_text_search @@ to_tsquery('index'::text))
(4 rows)

What's that tsvector_ops name we saw in the internal Postgres table?

That's how index types are linked to operators, using operator families and operator classes. For a given operator, there can be multiple different operator classes - an operator class defines how data is represented for a particular index type, and how the search operation for that index works to implement the operator used in a query.

Specifying operator classes during CREATE INDEX

Let’s look at =(text,text), the operator used in our earlier query on the email column:

SELECT am.amname AS index_method,
       opf.opfname AS opfamily_name,
       amop.amopopr::regoperator AS opfamily_operator
  FROM pg_am am,
       pg_opfamily opf,
       pg_amop amop
 WHERE opf.opfmethod = am.oid AND amop.amopfamily = opf.oid
       AND amop.amopopr = '=(text,text)'::regoperator;

 index_method |  opfamily_name   | opfamily_operator 
--------------+------------------+-------------------
 btree        | text_ops         | =(text,text)
 hash         | text_ops         | =(text,text)
 btree        | text_pattern_ops | =(text,text)
 hash         | text_pattern_ops | =(text,text)
 spgist       | text_ops         | =(text,text)
 brin         | text_minmax_ops  | =(text,text)
 gist         | gist_text_ops    | =(text,text)
(7 rows)

You can see there is a default operator class (text_ops) that gets used when you don’t explicitly specify it - for text columns the default operator class is often all you need.
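The two catalog results above can be read as a lookup table from index method and operator family to the operators they support. As a rough sketch, the following Python snippet copies just those rows into a dictionary and answers the question "which index methods could serve this operator?". The names `OPFAMILIES` and `index_methods_for` are invented here, and the dictionary is only a small subset of what pg_amop actually contains.

```python
# (index_method, operator_family) -> operators, copied from the two catalog
# query results shown above. This is only a subset of what pg_amop contains.
OPFAMILIES = {
    ("gist", "tsvector_ops"): {"@@(tsvector,tsquery)"},
    ("gin", "tsvector_ops"): {"@@(tsvector,tsquery)"},
    ("btree", "text_ops"): {"=(text,text)"},
    ("hash", "text_ops"): {"=(text,text)"},
    ("btree", "text_pattern_ops"): {"=(text,text)"},
    ("hash", "text_pattern_ops"): {"=(text,text)"},
    ("spgist", "text_ops"): {"=(text,text)"},
    ("brin", "text_minmax_ops"): {"=(text,text)"},
    ("gist", "gist_text_ops"): {"=(text,text)"},
}

def index_methods_for(operator):
    """Return the sorted index methods that have an operator family for `operator`."""
    return sorted({method for (method, _family), ops in OPFAMILIES.items()
                   if operator in ops})

print(index_methods_for("@@(tsvector,tsquery)"))
print(index_methods_for("=(text,text)"))
```

With this data, B-tree shows up for =(text,text) but not for @@(tsvector,tsquery), which mirrors why the default index could not be used for the text search query.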

But there are cases where we want to set a particular operator class. For example, let's say we run a LIKE query on our database, and our database happens to use the en_US.UTF-8 collation - in that case, you will see the LIKE query is not actually able to use an index:

CREATE INDEX ON users (email);

pgaweb=# EXPLAIN SELECT * FROM users WHERE email LIKE 'lukas@%';
                                 QUERY PLAN                                 
----------------------------------------------------------------------------
 Seq Scan on users  (cost=10000000000.00..10000000001.26 rows=1 width=4463)
   Filter: ((email)::text ~~ 'lukas@%'::text)
(2 rows)

Generally, LIKE queries are challenging to index. If the pattern does not have a leading wildcard, an index can be created that works for it, but you need to either (1) use the C locale on your database (effectively saying you don’t want language-specific text sorting/comparison), or (2) use the text_pattern_ops operator class.

Let’s create the same index, but this time specify the text_pattern_ops operator class:

CREATE INDEX ON users (email text_pattern_ops);

pgaweb=# EXPLAIN SELECT * FROM users WHERE email LIKE 'lukas@%';
                                         QUERY PLAN                                         
--------------------------------------------------------------------------------------------
 Index Scan using users_email_idx on users  (cost=0.14..8.16 rows=1 width=4463)
   Index Cond: (((email)::text ~>=~ 'lukas@'::text) AND ((email)::text ~<~ 'lukasA'::text))
   Filter: ((email)::text ~~ 'lukas@%'::text)
(3 rows)

As you can see, the same LIKE query can now use the index.
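The Index Cond in the plan above shows the prefix pattern being turned into a range: everything from 'lukas@' up to, but not including, 'lukasA'. The following Python sketch imitates that idea on a sorted list. It is a simplification under stated assumptions: the helper names `prefix_range` and `like_candidates` are made up, escape characters in the pattern are not handled, and Python's code point comparison stands in for the character-by-character comparison that text_pattern_ops provides.

```python
import bisect

def prefix_range(pattern):
    """Turn a LIKE pattern like 'lukas@%' into a (lower, upper) string range.

    Returns None if the pattern starts with a wildcard, since there is then
    no fixed prefix to search for.
    """
    prefix = pattern
    for wildcard in "%_":
        prefix = prefix.split(wildcard, 1)[0]
    if not prefix:
        return None
    # Upper bound: the same prefix with its last character incremented by one.
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, upper

def like_candidates(sorted_values, pattern):
    """Binary-search the range for the pattern's prefix in a sorted list."""
    bounds = prefix_range(pattern)
    if bounds is None:
        return list(sorted_values)  # no usable prefix: every value is a candidate
    lower, upper = bounds
    lo = bisect.bisect_left(sorted_values, lower)
    hi = bisect.bisect_left(sorted_values, upper)
    return sorted_values[lo:hi]

emails = sorted([
    "anna@example.com", "lukas", "lukas@example.com",
    "lukas@pganalyze.com", "lukasz@example.com",
])

print(prefix_range("lukas@%"))
print(like_candidates(emails, "lukas@%"))
```

The candidates still need to be rechecked against the full pattern, which is what the Filter line in the plan does.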

Now that we know our index type and operator class for our columns, let's look at a few other aspects of creating an index.

Specifying multiple columns when adding a Postgres index

One essential feature is the option to add multiple columns to an index definition.

You can do that like this:

CREATE INDEX ON [table] ([column_a], [column_b]);

But what does that actually do? It turns out this depends on the index type. Each index type has a different representation for multiple columns in its data structure. And some index types, like Hash or SP-GiST, do not support multiple key columns at all.

However, with the most common index type, B-tree, multi-column indexes work well, and they are commonly used. The most important thing to know for multi-column B-tree indexes: Column order matters. If you have some queries that only utilize column_a, but all queries utilize column_b, you should put column_b first in your index definition. If you don’t follow this rule, you will end up with queries doing a lot more work because they have to skip over all the earlier columns that they can’t filter on. With GIN indexes, on the other hand, this does not matter, and you can specify columns in any order. GiST indexes can use conditions on any subset of their columns, but the condition on the first column has the largest effect on how much of the index is scanned.
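Why column order matters for B-tree indexes can be sketched with sorted lists of tuples. In the simplified model below (made-up data and a hypothetical helper, `entries_scanned_for_b`), a lookup on the leading column can binary-search to one contiguous range, while a lookup on a non-leading column has to visit every entry. Real Postgres has more strategies than this model, so treat the numbers as an illustration only.

```python
import bisect

# Toy rows of (column_a, column_b); the values are made up for illustration.
rows = [(a, b) for a in range(100) for b in range(10)]

# Two B-tree-like indexes: sorted lists of keys in different column orders.
index_a_b = sorted((a, b) for a, b in rows)
index_b_a = sorted((b, a) for a, b in rows)

def entries_scanned_for_b(index, b_position, b_value):
    """Count index entries visited to find all rows with column_b = b_value."""
    if b_position == 0:
        # column_b leads: binary search to one contiguous range of entries.
        lo = bisect.bisect_left(index, (b_value,))
        hi = bisect.bisect_left(index, (b_value + 1,))
        return hi - lo
    # column_b is not the leading column: matches are scattered across the
    # index, so in this simplified model every entry has to be visited.
    return len(index)

print(entries_scanned_for_b(index_b_a, 0, 3))
print(entries_scanned_for_b(index_a_b, 1, 3))
```

With column_b first, only the 100 matching entries are visited. With column_a first, all 1,000 entries are visited to find the same 100 rows.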

Another decision to make is: Should I create multiple indexes, one for each column I’m querying by, or should I create a single multi-column index?

CREATE INDEX ON [table] ([column_a]);
CREATE INDEX ON [table] ([column_b]);
--- or
CREATE INDEX ON [table] ([column_a], [column_b]);

When looking at an individual query, the answer will almost always be: Create a single multi-column index that matches the query. It will be faster than having multiple indexes.

But if you have a larger workload, it may make sense to create multiple single-column indexes. Be aware that Postgres will have to do more work in that case, and you should verify what indexes actually get chosen by looking at your EXPLAIN plans.

Using functions and expressions in an index definition

Stepping back from specific index types for a moment: Postgres has a pretty useful feature that applies to all index types. Instead of indexing a particular column's value, you can index an expression that references the column's data.

For example, we might typically compare our user email addresses with the lower(..) function:

SELECT * FROM users WHERE lower(email) = $1

If you were to run EXPLAIN on this, you would notice that Postgres is not able to use a simple index on email here - since it doesn’t match the expression.

But since lower(..) is what’s called an "immutable" function, we can use it to create an expression index that indexes all values of email in their lower-case form:

CREATE INDEX ON users (lower(email));

Now our query will be able to use the index. Note that this does not work for all functions. For example, if you were to create an index on now(), it would fail:

CREATE INDEX ON users (now());

ERROR:  functions in index expression must be marked IMMUTABLE

Additionally, remember that expression indexes only work when they match the query. If we only have an index on lower(email), a query that simply references email won’t be able to use the index.
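As a rough mental model, an expression index stores the result of the expression as the lookup key, not the raw column value. The Python sketch below uses a dictionary for this, with made-up rows and a hypothetical `build_expression_index` helper. It also hints at why the function has to be immutable: the stored keys are only useful if the function returns the same result for the same input every time.

```python
from collections import defaultdict

# Toy table rows: (id, email). Made-up data.
users = [(1, "Test@Example.com"), (2, "other@example.com"), (3, "TEST@example.com")]

def build_expression_index(rows, expression):
    """Index rows by the result of `expression` applied to the email column."""
    index = defaultdict(list)
    for row in rows:
        index[expression(row[1])].append(row)
    return index

# Comparable to CREATE INDEX ON users (lower(email)): keys are lower-cased emails.
lower_email_idx = build_expression_index(users, str.lower)

# A query on lower(email) = $1 finds its rows directly by key...
print(lower_email_idx["test@example.com"])
# ...but a lookup by the raw email value does not line up with the stored keys.
print(lower_email_idx.get("Test@Example.com", []))
```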

Specifying a WHERE clause to create partial PostgreSQL indexes

Let’s return to an example we saw at the beginning of the post - but now let’s look at the NullTest expression:

Here we are making sure we only get rows that are not yet marked as deleted by our application. Depending on your workload, this may be a very large number of rows that needs to be skipped over.

Whilst you could create an index that includes the deleted_at column, it would be quite wasteful to have all these index entries that you don’t actually want to ever look at.

Postgres has a better way: With partial indexes, you can restrict for which rows the index has index entries. When the restriction does not apply, the row won’t be saved to the index, saving space. And during query execution, this also acts as a significant time saver in many cases, since the planner can do a simple check to determine which partial indexes match, and ignore all that don't match.

In practice, all you need to do is add a WHERE clause to your index definition:

CREATE INDEX ON users(email) WHERE deleted_at IS NULL;

There are reasons why you may not want to do that though:

First, adding this restriction means that only queries that contain deleted_at IS NULL will be able to use the index. That means you may need two indexes, one with that restriction and the other without.

Second, adding hundreds or thousands of partial indexes causes overhead in the Postgres planner, as it has to do a more expensive analysis to determine which indexes can be used.
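Both the space saving and the first restriction can be sketched in a few lines of Python. This is a deliberately crude model: the rows are made up, and `can_use_partial_index` is a hypothetical stand-in that only checks whether the query contains the exact index predicate, whereas the real planner performs a more general check.

```python
# Toy rows: (id, email, deleted_at). deleted_at is None for live rows. Made-up data.
users = [
    (1, "a@example.com", None),
    (2, "b@example.com", "2021-08-01"),
    (3, "c@example.com", None),
]

# Comparable to CREATE INDEX ON users (email) WHERE deleted_at IS NULL:
# only rows that satisfy the predicate get an index entry.
partial_index = {email: row_id
                 for row_id, email, deleted_at in users
                 if deleted_at is None}

def can_use_partial_index(query_conditions):
    """Crude stand-in for the planner check: the query must contain the index predicate."""
    return "deleted_at IS NULL" in query_conditions

print(len(partial_index), "index entries for", len(users), "rows")
print(can_use_partial_index({"email = $1", "deleted_at IS NULL"}))
print(can_use_partial_index({"email = $1"}))
```

The deleted row never makes it into the index, and a query without the deleted_at IS NULL condition cannot rely on the index, because it might need rows the index does not contain.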

Using INCLUDE to create a covering index for Index-Only Scans

Last but not least, let’s talk about a more recent addition to Postgres: The INCLUDE keyword that can be added to CREATE INDEX.

Before we look at what this keyword does, let’s understand the difference between an Index Scan and an Index-Only Scan. An Index-Only Scan is possible when all data that the query needs can be retrieved from the index itself, instead of having to also fetch it from the table.

Note that Index-Only scans only work when the table has been recently VACUUMed - otherwise Postgres will need to check visibility too often for each index entry, and therefore does not opt to use Index-Only Scans, preferring an Index Scan instead in most cases.

Let's look at two examples - one query that matches an index fully, and one that does not (because of the target list):

CREATE INDEX ON users (email, id);

=# EXPLAIN SELECT id FROM users WHERE email = 'test@example.com';
                                     QUERY PLAN                                      
-------------------------------------------------------------------------------------
 Index Only Scan using users_email_id_idx on users  (cost=0.14..4.16 rows=1 width=4)
   Index Cond: (email = 'test@example.com'::text)
(2 rows)

=# EXPLAIN SELECT id, fullname FROM users WHERE email = 'test@example.com';
                                    QUERY PLAN                                    
----------------------------------------------------------------------------------
 Index Scan using users_email_id_idx on users  (cost=0.14..8.15 rows=1 width=520)
   Index Cond: ((email)::text = 'test@example.com'::text)
(2 rows)

Now, to get an Index Only Scan for the second query we can create an index that includes that column at the end - and that makes Postgres use an Index Only Scan:

CREATE INDEX ON users (email, id, fullname);

=# EXPLAIN SELECT id, fullname FROM users WHERE email = 'test@example.com';
                                           QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 Index Only Scan using users_email_id_fullname_idx on users  (cost=0.14..4.16 rows=1 width=520)
   Index Cond: (email = 'test@example.com'::text)
(2 rows)

However, doing this has a few restrictions: It doesn’t work if you have unique indexes (since any column would modify what’s being checked for being unique), and it bloats the data stored in the index for searching.

For B-tree indexes the new INCLUDE keyword is the better approach:

CREATE INDEX ON users (email, id) INCLUDE (fullname);

This keeps the overhead for such additional columns slightly lower, works without problems with UNIQUE constraint indexes, and clearly communicates the intent: That you only added a column in order to support Index Only Scans.

This is a feature best used sparingly: Adding more data to the index means larger index values, which on its own can be a problem - it’s usually not a good idea to just add a lot of columns to the INCLUDE clause for an index.
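The rule for when an Index Only Scan is possible at all can be written down as a simple set check, sketched here in Python. The function name `can_use_index_only_scan` is made up, and the check ignores everything else the planner considers, such as whether the table has been vacuumed recently.

```python
def can_use_index_only_scan(key_columns, include_columns, needed_columns):
    """True if every column the query needs is stored in the index itself."""
    return set(needed_columns) <= set(key_columns) | set(include_columns)

# CREATE INDEX ON users (email, id);
print(can_use_index_only_scan(["email", "id"], [], ["id", "email"]))
print(can_use_index_only_scan(["email", "id"], [], ["id", "fullname", "email"]))

# CREATE INDEX ON users (email, id) INCLUDE (fullname);
print(can_use_index_only_scan(["email", "id"], ["fullname"], ["id", "fullname", "email"]))
```

The needed columns are everything the query references, in both the target list and the WHERE clause. Only the second call fails the check, which matches the plain Index Scan seen in the EXPLAIN output for the query that selects fullname.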

Adding and dropping PostgreSQL indexes safely on production

I’ll end with a warning: Creating indexes on production databases requires a bit of thought. Not just which index definition to use, but also how to create them, and when to take the I/O impact of the new index being built.

The most important thing: Remember that when you simply run CREATE INDEX, Postgres takes a lock on the table that blocks all writes (inserts, updates and deletes) until the index build has finished, although reads can continue. That’s why Postgres has the special CONCURRENTLY keyword. When you create an index on a table on production that already has data, always specify this keyword:

CREATE INDEX CONCURRENTLY ON users (email) WHERE deleted_at IS NULL;

The same applies when dropping an index with DROP INDEX: a plain DROP INDEX takes a lock that blocks both reads and writes on the table, while adding CONCURRENTLY reduces the locking requirements, making the operation safer to use on production.

Conclusion

In this post you should have gotten a fundamental understanding of how operators and operator classes relate to indexing, and why knowing these concepts is essential to creating the best index for complex queries. We also looked at a few complementary features of the CREATE INDEX command that are typically needed when reasoning about which index to create.

There are actually a few things we didn’t talk about: Adding indexes to specific tablespaces, using index storage parameters (especially useful for GIN index types!) and specifying the sort order for a particular column. I encourage you to take a further look at the Postgres documentation for these topics.


A look at Postgres 14: Performance and Monitoring Improvements

Lukas Fittl — Fri, 21 May 2021 12:00:00 GMT

The first beta release of the upcoming Postgres 14 release was made available yesterday. In this article we'll take a first look at what's in the beta, with an emphasis on one major performance improvement, as well as three monitoring improvements that caught our attention.

Before we get started, I wanted to highlight what always strikes me as an important unique aspect of Postgres: Compared to most other open-source database systems, Postgres is not the project of a single company, but rather many individuals coming together to work on a new release, year after year. And that includes everyone who tries out the beta releases, and reports bugs to the Postgres project. We hope this post inspires you to do your own testing and benchmarking.

Now, I'm personally most excited about better connection scaling in Postgres 14. For this post we ran a detailed benchmark comparing Postgres 13.3 to 14 beta1 (note that the connection count is log scale):

Improved Active and Idle Connection Scaling in Postgres 14
Dive into memory use with pg_backend_memory_contexts
Track WAL activity with pg_stat_wal
Monitor queries with the built-in Postgres query_id
And 200+ other improvements in the Postgres 14 release!
Conclusion

Improved Active and Idle Connection Scaling in Postgres 14

Postgres 14 brings significant improvements for those of us that need a high number of database connections. The Postgres connection model relies on processes instead of threads. This has some important benefits, but it also has overhead at large connection counts. With this new release, scaling active and idle connections has gotten significantly better, and will be a major improvement for the most demanding applications.

For our test, we've used two 96 vCPU AWS instances (c5.24xlarge), one running Postgres 13.3, and one running Postgres 14 beta1. Both of these use Ubuntu 20.04, with the default system settings, but the Postgres connection limit has been increased to 11,000 connections.

We use pgbench to test connection scaling of active connections. To start, we initialize the database with pgbench scale factor 200:

# Postgres 13.3
$ pgbench -i -s 200
...
done in 127.71 s (drop tables 0.02 s, create tables 0.02 s, client-side generate 81.74 s, vacuum 2.63 s, primary keys 43.30 s).

# Postgres 14 beta1
$ pgbench -i -s 200
...
done in 77.33 s (drop tables 0.02 s, create tables 0.02 s, client-side generate 48.19 s, vacuum 2.70 s, primary keys 26.40 s).

Already here we can see that Postgres 14 does much better in the initial data load.

We now launch read-only pgbench with a varying set of active connections, showing 5,000 concurrent connections as an example of a very active workload:

# Postgres 13.3
$ pgbench -S -c 5000 -j 96 -M prepared -T30
...
tps = 417847.658491 (excluding connections establishing)

# Postgres 14 beta1
$ pgbench -S -c 5000 -j 96 -M prepared -T30
...
tps = 495108.316805 (without initial connection time)

As you can see, the throughput of Postgres 14 at 5,000 active connections is about 18% higher (roughly 495,000 vs. 418,000 TPS). At 10,000 active connections the improvement is 50% over Postgres 13, and at lower connection counts you can also see consistent improvements.
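If you want to check the relative change between the two runs yourself, the arithmetic is short. The Python snippet below uses only the numbers from the pgbench output above. The `percent_change` helper is just a convenience name for this example.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100

# Numbers copied from the pgbench output above.
tps_13, tps_14 = 417847.658491, 495108.316805
init_13, init_14 = 127.71, 77.33

print(f"TPS at 5,000 connections: {percent_change(tps_13, tps_14):+.1f}%")
print(f"Initial data load time:   {percent_change(init_13, init_14):+.1f}%")
```

Keep in mind that these are single runs on one specific instance type, so the exact percentages will vary on other hardware.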

Note that you will usually see a noticeable TPS drop when the number of connections exceeds the number of CPUs. This is most likely due to CPU scheduling overhead, and not a limitation in Postgres itself. Now, most workloads don't actually have this many active connections, but rather a high number of idle connections.

The original author of this work, Andres Freund, ran a benchmark on the throughput of a single active query, whilst also running 10,000 idle connections. The query went from 15,000 TPS to almost 35,000 TPS - that's over 2x better than in Postgres 13. You can find all the details in Andres Freund's original post introducing these improvements.

Dive into memory use with pg_backend_memory_contexts

Have you ever been curious why a certain Postgres connection is taking up a higher amount of memory? With the new pg_backend_memory_contexts view you can take a close look at what exactly is allocated for a given Postgres process.

To start, we can calculate how much memory is used by our current connection in total:

SELECT pg_size_pretty(SUM(used_bytes)) FROM pg_backend_memory_contexts;

 pg_size_pretty 
----------------
 939 kB
(1 row)

Now, let's dive a bit deeper. When we query the view for the top 5 entries by memory usage, you will notice there is actually a lot of detailed information:

SELECT * FROM pg_backend_memory_contexts ORDER BY used_bytes DESC LIMIT 5;

          name           | ident |      parent      | level | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes 
-------------------------+-------+------------------+-------+-------------+---------------+------------+-------------+------------
 CacheMemoryContext      |       | TopMemoryContext |     1 |      524288 |             7 |      64176 |           0 |     460112
 Timezones               |       | TopMemoryContext |     1 |      104120 |             2 |       2616 |           0 |     101504
 TopMemoryContext        |       |                  |     0 |       68704 |             5 |      13952 |          12 |      54752
 WAL record construction |       | TopMemoryContext |     1 |       49768 |             2 |       6360 |           0 |      43408
 MessageContext          |       | TopMemoryContext |     1 |       65536 |             4 |      22824 |           0 |      42712
(5 rows)

A memory context in Postgres is a memory region that is used for allocations to support activities such as query planning or query execution. Once Postgres completes work in a context, the whole context can be freed, simplifying memory handling. Through the use of memory contexts the Postgres source actually avoids doing manual free calls for the most part (even though it's written in C), instead relying on memory contexts to clean up memory in groups. The top memory context here, CacheMemoryContext, is used for many long-lived caches in Postgres.

We can illustrate the impact of loading additional tables into a connection by running a query on a new table, and then querying the view again:

SELECT * FROM test3;
SELECT * FROM pg_backend_memory_contexts ORDER BY used_bytes DESC LIMIT 5;

          name           | ident |      parent      | level | total_bytes | total_nblocks | free_bytes | free_chunks | used_bytes 
-------------------------+-------+------------------+-------+-------------+---------------+------------+-------------+------------
 CacheMemoryContext      |       | TopMemoryContext |     1 |      524288 |             7 |      61680 |           1 |     462608
...

As you can see, the new view shows that simply having queried a table on this connection retains about 2.4 kB of memory (2,496 bytes in CacheMemoryContext), even after the query has finished. This caching of table information is done to speed up future queries, but can sometimes cause surprising amounts of memory usage for multi-tenant databases with many different schemas. You can now investigate such issues easily through this new monitoring view.

If you'd like to access this information for processes other than the current one, you can use the new pg_log_backend_memory_contexts function which will cause the specified process to output its own memory consumption to the Postgres log:

SELECT pg_log_backend_memory_contexts(10377);

LOG:  logging memory contexts of PID 10377
STATEMENT:  SELECT pg_log_backend_memory_contexts(pg_backend_pid());
LOG:  level: 0; TopMemoryContext: 80800 total in 6 blocks; 14432 free (5 chunks); 66368 used
LOG:  level: 1; pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 chunks); 6784 used
LOG:  level: 1; TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 chunks); 472 used
LOG:  level: 1; RowDescriptionContext: 8192 total in 1 blocks; 6880 free (0 chunks); 1312 used
LOG:  level: 1; MessageContext: 16384 total in 2 blocks; 5152 free (0 chunks); 11232 used
LOG:  level: 1; Operator class cache: 8192 total in 1 blocks; 512 free (0 chunks); 7680 used
LOG:  level: 1; smgr relation table: 16384 total in 2 blocks; 4544 free (3 chunks); 11840 used
LOG:  level: 1; TransactionAbortContext: 32768 total in 1 blocks; 32504 free (0 chunks); 264 used
...
LOG:  level: 1; ErrorContext: 8192 total in 1 blocks; 7928 free (3 chunks); 264 used
LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560 used
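Since this output lands in the log rather than in a query result, you may want to post-process it. Here is a small Python sketch that extracts the per-context numbers and sorts them by used bytes. The regular expression is written against the sample lines shown above only. It is an assumption that your log lines look the same, for example with a different log_line_prefix or Postgres version, and the name `parse_memory_contexts` is made up for this example.

```python
import re

# Matches lines such as:
#   LOG:  level: 1; MessageContext: 16384 total in 2 blocks; 5152 free (0 chunks); 11232 used
CONTEXT_LINE = re.compile(
    r"level: (?P<level>\d+); (?P<name>.+?): (?P<total>\d+) total in (?P<blocks>\d+) blocks; "
    r"(?P<free>\d+) free \((?P<chunks>\d+) chunks\); (?P<used>\d+) used"
)

def parse_memory_contexts(log_text):
    """Return a list of (name, level, used_bytes) tuples, largest first."""
    contexts = []
    for line in log_text.splitlines():
        match = CONTEXT_LINE.search(line)
        if match:
            contexts.append((match["name"], int(match["level"]), int(match["used"])))
    return sorted(contexts, key=lambda c: c[2], reverse=True)

# A few lines copied from the log output above.
sample = """\
LOG:  level: 0; TopMemoryContext: 80800 total in 6 blocks; 14432 free (5 chunks); 66368 used
LOG:  level: 1; pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1408 free (0 chunks); 6784 used
LOG:  level: 1; MessageContext: 16384 total in 2 blocks; 5152 free (0 chunks); 11232 used
LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560 used
"""

for name, level, used in parse_memory_contexts(sample):
    print(f"{used:>8} bytes  level {level}  {name}")
```

The "Grand total" line is skipped on purpose, since it does not describe an individual context.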

Track WAL activity with pg_stat_wal

Building on the WAL monitoring capabilities in Postgres 13, the new release brings a new server-wide summary view for WAL information, called pg_stat_wal.

You can use this to monitor WAL writes over time more easily:

SELECT * FROM pg_stat_wal;

-[ RECORD 1 ]----+------------------------------
wal_records      | 3334645
wal_fpi          | 8480
wal_bytes        | 282414530
wal_buffers_full | 799
wal_write        | 429769
wal_sync         | 428912
wal_write_time   | 0
wal_sync_time    | 0
stats_reset      | 2021-05-21 07:33:22.941452+00

With this new view we can get summary information such as how many Full Page Images (FPI) were written to the WAL, which can give you insights on when Postgres generated a lot of WAL records due to a checkpoint. Secondly, you can use the new wal_buffers_full counter to quickly see when the wal_buffers setting is set too low, which can cause unnecessary I/O that can be prevented by raising wal_buffers to a higher value.
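Because these counters are cumulative, monitoring them over time means comparing two snapshots. The Python sketch below shows one way to turn two snapshots into per-second rates. The first snapshot uses the values shown above, while the second snapshot and the 60 second interval are hypothetical values invented purely for this example, as is the `wal_rates` helper.

```python
def wal_rates(earlier, later, elapsed_seconds):
    """Per-second rates for the counters that appear in both snapshots."""
    return {
        name: (later[name] - earlier[name]) / elapsed_seconds
        for name in earlier
        if name in later
    }

# First snapshot: values copied from the pg_stat_wal output above.
snapshot_1 = {"wal_records": 3334645, "wal_fpi": 8480,
              "wal_bytes": 282414530, "wal_buffers_full": 799}
# Second snapshot: hypothetical values, invented for this example,
# assumed to be taken 60 seconds later.
snapshot_2 = {"wal_records": 3394645, "wal_fpi": 8540,
              "wal_bytes": 288414530, "wal_buffers_full": 799}

for name, rate in wal_rates(snapshot_1, snapshot_2, 60).items():
    print(f"{name}: {rate:.1f}/s")
```

If stats_reset changes between two snapshots, the counters have been reset and the difference is not meaningful, so a real collector should compare that column first.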

You can also get more details of the I/O impact of WAL writes by enabling the optional track_wal_io_timing setting, which then gives you the exact I/O times for WAL writes, and WAL file syncs to disk. Note this setting can have noticeable overhead, so it's best turned off (the default) unless needed.

Monitor queries with the built-in Postgres query_id

In a recent survey done by TimescaleDB in March and April 2021, the pg_stat_statements extension was named one of the top three extensions the surveyed user base uses with Postgres. pg_stat_statements is bundled with Postgres, and with Postgres 14 one of the important features of the extension got merged into core Postgres:

The calculation of the query_id, which uniquely identifies a query, whilst ignoring constant values. Thus, if you run the same query again it will have the same query_id, enabling you to identify workload patterns on the database. Previously this information was only available with pg_stat_statements, which shows aggregate statistics about queries that have finished executing, but now this is available with pg_stat_activity as well as in log files.

First we have to enable the new compute_query_id setting and restart Postgres afterwards:

ALTER SYSTEM SET compute_query_id = 'on';

If you use pg_stat_statements, query IDs will be calculated automatically, through the default compute_query_id setting of auto.

With query IDs enabled, we can look at pg_stat_activity during a pgbench run and see why this is helpful as compared to just looking at query text:

SELECT query, query_id FROM pg_stat_activity WHERE backend_type = 'client backend' LIMIT 5;

                                 query                                  |      query_id      
------------------------------------------------------------------------+--------------------
 UPDATE pgbench_tellers SET tbalance = tbalance + -4416 WHERE tid = 3;  | 885704527939071629
 UPDATE pgbench_tellers SET tbalance = tbalance + -2979 WHERE tid = 10; | 885704527939071629
 UPDATE pgbench_tellers SET tbalance = tbalance + 2560 WHERE tid = 6;   | 885704527939071629
 UPDATE pgbench_tellers SET tbalance = tbalance + -65 WHERE tid = 7;    | 885704527939071629
 UPDATE pgbench_tellers SET tbalance = tbalance + -136 WHERE tid = 9;   | 885704527939071629
(5 rows)

All of these queries are the same from an application perspective, but their text is slightly different, making it hard to find patterns in the workload. With the query ID however we can clearly identify the number of certain kinds of queries, and assess performance problems more easily. For example, we can group by the query ID to see what's keeping the database busy:

SELECT COUNT(*), state, query_id FROM pg_stat_activity WHERE backend_type = 'client backend' GROUP BY 2, 3;

 count | state  |       query_id       
-------+--------+----------------------
    40 | active |   885704527939071629
     9 | active |  7660508830961861980
     1 | active | -7810315603562552972
     1 | active | -3907106720789821134
(4 rows)

When you run this on your own system you may find that the query ID is different from the one shown here. This is because query IDs depend on the internal representation of a Postgres query, which can be architecture dependent, and which refers to tables by their internal IDs instead of their names.
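To make the "same query, different text" point concrete, here is a small Ruby sketch that groups the five sample statements from the pg_stat_activity output by their shape. It is a toy approximation only: Postgres derives the query ID from its internal parse tree, not from a regular expression over the text, and the helper names `normalize_constants` and `group_by_shape` are made up for this illustration.

```ruby
# Toy illustration only: Postgres computes query_id from its internal
# representation of the query, not from the query text. This regex merely
# mimics the "ignore constant values" behavior for simple integer literals.
SAMPLE_QUERIES = [
  "UPDATE pgbench_tellers SET tbalance = tbalance + -4416 WHERE tid = 3;",
  "UPDATE pgbench_tellers SET tbalance = tbalance + -2979 WHERE tid = 10;",
  "UPDATE pgbench_tellers SET tbalance = tbalance + 2560 WHERE tid = 6;",
  "UPDATE pgbench_tellers SET tbalance = tbalance + -65 WHERE tid = 7;",
  "UPDATE pgbench_tellers SET tbalance = tbalance + -136 WHERE tid = 9;"
].freeze

# Replace every standalone (optionally negative) integer literal with "?".
def normalize_constants(sql)
  sql.gsub(/(?<![\w.])-?\d+(?![\w.])/, "?")
end

# Count how many statements share each normalized shape.
def group_by_shape(queries)
  queries.group_by { |q| normalize_constants(q) }.transform_values(&:size)
end

group_by_shape(SAMPLE_QUERIES).each do |shape, count|
  puts "#{count} x #{shape}"
end
```

All five statements collapse into a single shape, which is the same grouping effect the query ID gives you inside the database, without the fragility of text matching.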

The query ID information is also available in log_line_prefix through the new %Q option, making it easier to get auto_explain output that's linked to a query:

2021-05-21 08:18:02.949 UTC [7176] [user=postgres,db=postgres,app=pgbench,query=885704527939071629] LOG:  duration: 59.827 ms  plan:
	Query Text: UPDATE pgbench_tellers SET tbalance = tbalance + -1902 WHERE tid = 6;
	Update on pgbench_tellers  (cost=4.14..8.16 rows=0 width=0) (actual time=59.825..59.826 rows=0 loops=1)
	  ->  Bitmap Heap Scan on pgbench_tellers  (cost=4.14..8.16 rows=1 width=10) (actual time=0.009..0.011 rows=1 loops=1)
	        Recheck Cond: (tid = 6)
	        Heap Blocks: exact=1
	        ->  Bitmap Index Scan on pgbench_tellers_pkey  (cost=0.00..4.14 rows=1 width=0) (actual time=0.003..0.004 rows=1 loops=1)
	              Index Cond: (tid = 6)
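
If you process such log lines in your own scripts, the query ID can be pulled out of the prefix with a regular expression. The following Ruby sketch assumes a log_line_prefix of '%m [%p] [user=%u,db=%d,app=%a,query=%Q] ', which is inferred from the sample line above rather than stated in this post; the pattern and the helper name `parse_auto_explain_header` are illustrative.

```ruby
# Assumed log_line_prefix (inferred from the sample log line, not stated in the post):
#   '%m [%p] [user=%u,db=%d,app=%a,query=%Q] '
LOG_LINE = "2021-05-21 08:18:02.949 UTC [7176] [user=postgres,db=postgres,app=pgbench,query=885704527939071629] LOG:  duration: 59.827 ms  plan:"

# Query IDs are signed 64-bit integers, so allow a leading minus sign.
HEADER_PATTERN = /\[(?<pid>\d+)\] \[user=(?<user>[^,]*),db=(?<db>[^,]*),app=(?<app>[^,]*),query=(?<query_id>-?\d+)\] LOG:\s+duration: (?<duration_ms>[\d.]+) ms/

# Returns a hash with the extracted fields, or nil if the line does not match.
def parse_auto_explain_header(line)
  m = HEADER_PATTERN.match(line)
  return nil unless m

  {
    pid: m[:pid].to_i,
    db: m[:db],
    app: m[:app],
    query_id: m[:query_id].to_i,
    duration_ms: m[:duration_ms].to_f
  }
end

p parse_auto_explain_header(LOG_LINE)
```

With the query ID extracted, a plan from the log can be matched against the query_id column in pg_stat_activity or the queryid column in pg_stat_statements.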

Want to link auto_explain and pg_stat_statements, and can't wait for Postgres 14?

We built our own open-source query fingerprint mechanism that uniquely identifies queries based on their text. This is used in pganalyze for matching EXPLAIN plans to queries, and you can also use this in your own scripts, with any Postgres version.

And 200+ other improvements in the Postgres 14 release!

These are just some of the many improvements in the new Postgres release. You can find more on what's new in the release notes, such as:

The new predefined roles pg_read_all_data/pg_write_all_data give global read or write access
Automatic cancellation of long-running queries if the client disconnects
Vacuum now skips index vacuuming when the number of removable index entries is insignificant
Per-index information is now included in autovacuum logging output
Partitions can now be detached in a non-blocking manner with ALTER TABLE ... DETACH PARTITION ... CONCURRENTLY

And many more. Now is the time to help test!

Download beta1 from the official package repositories, or build it from source. We can all contribute to making Postgres 14 a stable release in a few months from now.

Conclusion

At pganalyze, we're excited about Postgres 14, and hope this post got you interested as well! Postgres shows again how many small improvements make it a stable, trustworthy database, that is built by the community, for the community.



Introducing pg_query 2.0: The easiest way to parse Postgres queries

Lukas Fittl — Thu, 18 Mar 2021 12:00:00 GMT

The query parser is a core component of Postgres: the database needs to understand what data you're asking for in order to return the right results. But this functionality is also useful for all sorts of other tools that work with Postgres queries. A few years ago, we released pg_query to support this functionality in a standalone C library.

pganalyze uses pg_query to parse and analyze every SQL query that runs on your Postgres database. Our initial motivation was to create pg_query for checking which tables a query references, or what kind of statement it is. Since then we've expanded its use in pganalyze itself. pganalyze now truncates query text in a smart manner in the query overview. The pganalyze-collector supports collecting EXPLAIN plans, and uses pg_query to support log-based EXPLAIN. And we link together pg_stat_statements and auto_explain data in pganalyze using query fingerprints (another pg_query feature we'll discuss in detail in a later section).

Postgres community tools build on pg_query

What we didn't expect at the time was the tremendous interest we've seen from the community. The Ruby library alone has received over 3.5 million downloads in its lifetime.

Thanks to many contributors, pg_query now has bindings for other languages beyond Ruby and Go, such as Python (pglast, maintained by Lele Gaifax), Node.js (pgsql-parser, maintained by Dan Lynch) and even OCaml. There are also many notable third-party projects that use pg_query to parse Postgres queries. Here are some of our favorites:

sqlc provides type-safe, SQL-based database access in Go
pgdebug lets you debug complex CTEs and execute parts as a standalone query
Google's HarbourBridge uses pg_query for helping customers trial Spanner from Postgres sources
DuckDB uses a forked version of pg_query for their parsing layer
GitLab uses pg_query for normalizing queries in their internal error reporting
Splitgraph uses pg_query via the pglast Python binding to parse the SQL statements in Splitfiles
sqlint lints your SQL files for correctness

Today, it's time to bring pg_query to the next level.

Announcing pg_query 2.0: Better & faster parsing, with Postgres 13 support

We're excited to announce the next major version of pg_query, pg_query 2.0.

In this version, you'll find support for:

Parsing the PostgreSQL 13 query syntax
Deparser as part of the core C library, to turn modified parse trees back into SQL
New parse tree format based on Protocol Buffers (Protobuf)
Improved, faster query fingerprinting mechanism
And much more!

Postgres community tools build on pg_query
Announcing pg_query 2.0: Better & faster parsing, with Postgres 13 support
How pg_query turns a Postgres statement into a parse tree
Turning parse trees back into SQL using a deparser
- The pg_query deparser with coverage for all Postgres regression tests
Fingerprints in pg_query: A better way to check if two queries are identical
- Why did we create our own query fingerprint concept?
Additional changes for pg_query 2.0
Conclusion

To start, let's revisit how pg_query actually works.

How pg_query turns a Postgres statement into a parse tree

There are many ways to parse SQL, but the scope for pg_query is very specific. That is, to be able to parse the full Postgres query syntax, the same way as Postgres does. The only reliable way to do this, is to use the Postgres parser itself.

pg_query isn't the first project to do this; pgpool, for example, has a copy of the Postgres parser as well. But we needed an easily maintainable, self-contained version of the parser in a standalone C library. This would let us, and the Postgres community, use the parser from almost any language by writing a simple wrapper.

How did we do this? We started by looking at the Postgres source. Looking at the source, you will find the function called raw_parser:

/*
 * raw_parser
 *		Given a query in string form, do lexical and grammatical analysis.
 *
 * Returns a list of raw (un-analyzed) parse trees.  The immediate elements
 * of the list are always RawStmt nodes.
 */
List *
raw_parser(const char *str)
{
    ...

After raw parsing, Postgres goes into parse analysis. In that phase Postgres identifies the types of columns, maps table names to the schema and more. After that, Postgres does planning (see our introduction to Postgres query planning), and then executes the query based on the query plan.

For pg_query, all we need is the raw parser. Looking at the code, we discovered a problem. The parser code still depends on a lot of Postgres code, such as for memory management or error handling. We needed a repeatable way to extract just enough source code to compile and run the parser.

Thus the idea was born to automatically extract the Postgres parser code and its dependencies.

Using LibClang to extract C source code from Postgres

Our goal: A set of self-contained C files that represent a copy of Postgres' raw_parser function. But we don't want to copy the code manually. Luckily we can use LibClang to parse C code, and understand its dependencies.

The details of this could fill many pages, but here is a simplified version of how this works:

1. Each translation unit (.c file) in the source is analyzed via LibClang's Ruby binding:

require 'ffi/clang'

index = FFI::Clang::Index.new(true, true)
translation_unit = index.parse_translation_unit(file, ['... CFLAGS ...'])

2. The analysis walks through the file and marks each C method, as well as the symbols it references:

translation_unit.cursor.visit_children do |cursor, parent|
  @file_to_symbol_positions[cursor.location.file] ||= {}
  @file_to_symbol_positions[cursor.location.file][cursor.spelling] = [cursor.extent.start.offset, cursor.extent.end.offset]
  cursor.visit_children do |child_cursor, parent|
    if child_cursor.kind == :cursor_decl_ref_expr || child_cursor.kind == :cursor_call_expr
      @references[cursor.spelling] ||= []
      (@references[cursor.spelling] << child_cursor.spelling).uniq!
    end
    :recurse
  end
end

3. We resolve required C methods and their code, based on the top-level method we are looking for:

def deep_resolve(method_name, depth: 0, trail: [], global_resolved_by_parent: [], static_resolved_by_parent: [], static_base_filename: nil)
  ...
  global_dependents = (@references[method_name] || [])
  global_dependents.each do |symbol|
    deep_resolve(symbol, depth: depth + 1, trail: trail + [method_name], global_resolved_by_parent: global_resolved_by_parent + global_dependents)
  end
  ...
end

deep_resolve('raw_parser')

4. We write out just the portions of the C code that are required (see details here)

With this, we have a working Postgres parser!

You can find the full details in the pg_query source.
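
Step 3 boils down to computing everything reachable from raw_parser in the symbol reference graph. Here is a minimal, self-contained Ruby sketch of that idea. The `REFERENCES` hash is a hand-written, heavily simplified stand-in for the data collected in step 2 (it is not the real dependency graph), and `resolve_dependencies` is a hypothetical helper, not the `deep_resolve` method from pg_query.

```ruby
require 'set'

# Hand-written, simplified reference graph: symbol => symbols it references.
# Illustrative only; the real graph extracted via LibClang is far larger.
REFERENCES = {
  'raw_parser'   => %w[scanner_init base_yyparse],
  'scanner_init' => %w[palloc],
  'base_yyparse' => %w[palloc ereport],
  'ereport'      => %w[errstart],
  'planner'      => %w[palloc]
}.freeze

# Depth-first walk that collects every symbol reachable from `root`.
# The `resolved` set doubles as a visited list, so cycles terminate.
def resolve_dependencies(references, root, resolved = Set.new)
  return resolved if resolved.include?(root)

  resolved << root
  (references[root] || []).each do |symbol|
    resolve_dependencies(references, symbol, resolved)
  end
  resolved
end

needed = resolve_dependencies(REFERENCES, 'raw_parser')
puts needed.to_a.sort.join(', ')
```

Symbols that raw_parser never reaches (planner in this made-up graph) are left out, which is what keeps the extracted source small.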

Once we can call the Postgres parser in our standalone library, we can get the result as a parse tree, represented as Postgres parser C structs. But now we needed to make this useful in other languages, such as Ruby or Go.

Turning Postgres parser C structs into JSON and Protobufs

It's a little known fact, but Postgres actually has a text representation of a query parse tree. It's rarely used directly, being reserved for internal communication and debugging. The easiest way to see an example is by looking at the adbin field in pg_attrdef, which shows the internal representation for the expression of a column default value (to contrast, pg_get_expr shows the expression in SQL):

SELECT adbin, pg_get_expr(adbin, adrelid) FROM pg_attrdef WHERE adrelid = 'mytable'::regclass AND adnum = 1;

-[ RECORD 1 ]-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
adbin       | {FUNCEXPR :funcid 480 :funcresulttype 23 :funcretset false :funcvariadic false :funcformat 2 :funccollid 0 :inputcollid 0 :args ({FUNCEXPR :funcid 1574 :funcresulttype 20 :funcretset false :funcvariadic false :funcformat 0 :funccollid 0 :inputcollid 0 :args ({CONST :consttype 2205 :consttypmod -1 :constcollid 0 :constlen 4 :constbyval true :constisnull false :location 68 :constvalue 4 [ -27 10 -122 1 0 0 0 0 ]}) :location 60}) :location -1}
pg_get_expr | nextval('mytable_id_seq'::regclass)

This text format is not useful for working with a parse tree in other languages. Thus, we needed a more portable format to export the parse tree from C, and import it in another language such as Ruby.

The initial version of pg_query used JSON for this. JSON is great, since you can parse it in pretty much any programming language. Thus, in this new pg_query release, we still support JSON.

We're also introducing support for a new schema-based format, using Protocol Buffers (Protobuf).

Why pg_query 2.0 adds support for Protocol Buffers

Whilst JSON is convenient for passing around the parse tree, it has a few problems:

JSON is slower to parse than a binary format
Memory usage can become an issue with complex parse trees
Building logic around a tree of JSON data is error-prone, as one needs to add a lot of checks to identify each node and its supported fields
It's hard to instantiate new parse tree nodes, for example to use for deparsing back into a SQL statement

In pg_query 1.0, accessing the value of a "SELECT 1" would have looked like this with the Ruby binding:

result = PgQuery.parse("SELECT 1")
result.tree[0]['RawStmt']['stmt']['SelectStmt']['targetList'][0]['ResTarget']['val']
# => {"A_Const"=>{"val"=>{"Integer"=>{"ival"=>1}}, "location"=>7}}

Here is how Protobuf improves the parse tree handling in Ruby:

result = PgQuery.parse("SELECT 1")
result.tree.stmts[0].stmt.select_stmt.target_list[0].res_target.val.a_const
# => <PgQuery::A_Const: val: <PgQuery::Node: integer: <PgQuery::Integer: ival: 1>>, location: 7>

Note how we have a full class definition for each parse tree node type, making interaction with the tree nodes significantly easier.
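
To illustrate why a tree of plain JSON data is error-prone, here is a small Ruby sketch that needs no pg_query install. `TREE` is a hand-written hash that mirrors the pg_query 1.0 shape from the first example (trimmed to the path used there), and `const_value` is a hypothetical helper.

```ruby
# Hand-written hash mirroring the pg_query 1.0 JSON shape shown above.
TREE = [
  { 'RawStmt' => { 'stmt' => { 'SelectStmt' => { 'targetList' => [
    { 'ResTarget' => { 'val' => {
      'A_Const' => { 'val' => { 'Integer' => { 'ival' => 1 } }, 'location' => 7 }
    } } }
  ] } } } }
].freeze

# Path from the first statement down to the integer constant.
PATH = ['RawStmt', 'stmt', 'SelectStmt', 'targetList', 0,
        'ResTarget', 'val', 'A_Const', 'val', 'Integer', 'ival'].freeze

# Walks the nested hashes and arrays; returns nil as soon as a key is missing.
def const_value(tree, path = PATH)
  tree[0].dig(*path)
end

puts const_value(TREE).inspect

# A single misspelled key does not raise an error, it just yields nil.
typo_path = PATH.map { |key| key == 'targetList' ? 'target_list' : key }
puts const_value(TREE, typo_path).inspect
```

With string keys, a typo silently returns nil and surfaces much later in your code, whereas generated classes with real accessor methods would typically fail right at the misspelled call.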

Now, let's say I want to change a parse tree and turn it back into a SQL statement. For this, I need a deparser.

Turning parse trees back into SQL using a deparser

Postgres itself has deparser logic in many places. For example postgres_fdw has a deparser to generate the query to send to the remote server. But, the deparser code in Postgres requires a post-parse analysis parse tree (that directly references relation OIDs, etc). That means we can't make use of it in pg_query, which works with raw parse trees.

For many years now, the Ruby pg_query library has had a deparser. Over the years we've had many community contributions to make it complete. The third-party libraries for Python and Node.js also have their own deparser. These efforts were all done in parallel, without sharing code. And the Go library is missing a deparser altogether.

How can we reduce the duplicated effort in the community? By creating a new portable deparser for raw parse trees. This avoids having duplicate efforts for every pg_query-based library.

The pg_query deparser with coverage for all Postgres regression tests

pg_query 2.0 features a new deparser, written in C. This was by far the biggest undertaking of this new release. The new deparser is able to generate all SQL queries used in the Postgres regression tests (which the pg_query parser can of course parse), and more.

Here is how it works, using the Go library as an example, which previously did not have a deparser:

package main

import (
  "fmt"
  pg_query "github.com/pganalyze/pg_query_go/v2"
)

func main() {
  // Parse a query
  result, err := pg_query.Parse("SELECT 42")
  if err != nil {
    panic(err)
  }

  // Modify the parse tree
  result.Stmts[0].Stmt.GetSelectStmt().GetTargetList()[0].GetResTarget().Val =
    pg_query.MakeAConstStrNode("Hello World", -1)

  // Deparse back into a query
  stmt, err := pg_query.Deparse(result)
  if err != nil {
    panic(err)
  }
  fmt.Printf("%s\n", stmt)
}

This will output the following:

SELECT 'Hello World'

First, the deparsing step encodes the Go structs into the new Protobuf format. Then, the C library decodes this into Postgres parse tree C structs. Last but not least, the C library's new deparser turns the C structs into the SQL query text.

Stepping away from deparsing, let's take a look at the new fingerprinting mechanism:

Fingerprints in pg_query: A better way to check if two queries are identical

Let's start with the motivation for query fingerprints. pganalyze needs to link together Postgres statistics across different data sources, for example queries from pg_stat_statements with the Postgres auto_explain logs. You can see the fingerprint in pganalyze on the query details page:

This query can be represented differently depending on which part of Postgres you look at:

pg_stat_statements: SELECT "abalance" FROM "pgbench_accounts" WHERE "aid" = ?
auto_explain: SELECT abalance FROM pgbench_accounts WHERE aid = 4674588

A simple text comparison would not be sufficient to determine that these queries are identical.

Why did we create our own query fingerprint concept?

Postgres already has the concept of a "queryid", calculated based on the post-parse analysis tree. It's used in places such as pg_stat_statements to distinguish the different query entries.

But, this queryid is not available everywhere today, e.g. you can't get it with auto_explain plans. It's also not portable between databases, as it's dependent on specific relation OIDs. Even if you have the exact same queries on your staging and production system, they will have different queryid values. And the queryid can't be generated outside the context of a Postgres server. Thus, pganalyze has its own mechanism, called a query fingerprint.

Fingerprints identify a Postgres query based on its raw parse tree alone. We've open-sourced this mechanism in pg_query:

PgQuery.fingerprint('SELECT a, b FROM c')
# => "fb1f305bea85c2f6"

PgQuery.fingerprint('SELECT b, a FROM c')
# => "fb1f305bea85c2f6"

This mechanism does not need a running server, so all you need as input is a valid Postgres query.

With pg_query 2.0, we've done a few enhancements to the fingerprint functionality:

Use the faster XXH3 hash function, instead of SHA-1. pg_query 1.0 used the outdated cryptographic hash function SHA-1. Cryptographic guarantees are not needed for this use case, and XXH3 is much faster.
Contain the fingerprint in a 64-bit value, instead of 136 bits. We've determined that 64-bit precision is good enough for query fingerprints. Postgres itself thinks so too, since it uses 64-bit for the Postgres queryid. We often use data from pg_stat_statements, so there is little benefit to more bits. Using a smaller data type also means better performance for pganalyze.
Fix edge cases where two almost identical queries had different fingerprints. Fingerprints should ignore query differences, when they result in the same query intent. We've addressed a few cases where this was not working as expected. You can look at the corresponding wiki page to understand these changes in more detail.
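
As a quick sanity check on the 64-bit size: the fingerprint from the example above has 16 hex digits, and each hex digit carries 4 bits. The Ruby sketch below also shows one way to fit such a value into a signed 64-bit integer, for example a Postgres bigint column. Storing fingerprints that way is an assumption made for this illustration, not something the post prescribes, and the helper names are made up.

```ruby
FINGERPRINT_HEX = 'fb1f305bea85c2f6' # value from the example above

# Each hex digit encodes 4 bits.
def fingerprint_bits(hex)
  hex.length * 4
end

# Reinterpret the unsigned 64-bit value as a signed 64-bit integer
# (same bit pattern, different interpretation).
def to_signed_64(hex)
  [hex.to_i(16)].pack('Q').unpack1('q')
end

# Convert a signed 64-bit integer back to the 16-digit hex form.
def from_signed_64(value)
  format('%016x', value & 0xFFFF_FFFF_FFFF_FFFF)
end

signed = to_signed_64(FINGERPRINT_HEX)
puts fingerprint_bits(FINGERPRINT_HEX)
puts signed
puts from_signed_64(signed)
```

Because the highest bit of this particular fingerprint is set, the signed value is negative, and the conversion back to hex restores the original string.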

Additional changes for pg_query 2.0

A few other things about the new release:

The pg_query library now resides in the pganalyze organization on GitHub. This makes it clear who maintains and funds the core development. We will continue to make pg_query available under the BSD 3-clause license.
pg_query has a new method for splitting queries. This can be useful when you want to split a multi-statement string into its component statements, for example SELECT ';'; SELECT 'foo' into SELECT ';' and SELECT 'foo'
There is a new function available to access the Postgres scanner. This includes the location of comments in a query text. One could envision building a syntax highlighter based on this. Or extract comments from queries whilst ignoring comment-like tokens in a constant value.
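
To see why splitting statements takes more than cutting at every semicolon, consider the example from the list above. The Ruby sketch below compares a naive split with a toy splitter that tracks single-quoted strings. `split_statements` is a hypothetical helper written for this illustration; it is not pg_query's implementation, and it ignores comments, dollar quoting and quoted identifiers, which the real Postgres scanner handles.

```ruby
MULTI_STATEMENT = "SELECT ';'; SELECT 'foo'"

# Toy splitter: only understands single-quoted strings (a doubled '' inside
# a string toggles the flag twice, so it stays inside the string).
def split_statements(sql)
  statements = []
  current = +''
  in_string = false

  sql.each_char do |ch|
    in_string = !in_string if ch == "'"
    if ch == ';' && !in_string
      statements << current.strip unless current.strip.empty?
      current = +''
    else
      current << ch
    end
  end

  statements << current.strip unless current.strip.empty?
  statements
end

# Naive approach: cuts inside the string constant and yields three fragments.
p MULTI_STATEMENT.split(';').map(&:strip)

# Quote-aware approach: yields the two statements.
p split_statements(MULTI_STATEMENT)
```

Even this small example needs state tracking, and each additional piece of Postgres syntax adds more special cases, which is why relying on the actual Postgres scanner and parser is the more robust route.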

Conclusion

The new pg_query 2.0 is available today, with bindings for Go and Ruby available to start. We are also working on a new pganalyze-maintained Rust binding that we'll have news about soon.



Postgres 11: Monitoring JIT performance, Auto Prewarm & Stored Procedures

Lukas Fittl — Thu, 04 Oct 2018 12:00:00 GMT

Everyone’s favorite database, PostgreSQL, has a new release coming out soon: Postgres 11

In this post we take a look at some of the new features that are part of the release, and in particular review the things you may need to monitor, or can utilize to increase your application and query performance.

Just-In-Time compilation (JIT) in Postgres 11

Just-In-Time compilation (JIT) for query execution was added in Postgres 11. It's not going to be enabled for queries by default, similar to parallel query in Postgres 9.6, but can be very helpful for CPU-bound workloads and analytical queries.

Specifically, JIT currently aims to optimize two essential parts of query execution: Expression evaluation and tuple deforming. To quote the Postgres documentation:

Expression evaluation is used to evaluate WHERE clauses, target lists, aggregates and projections. It can be accelerated by generating code specific to each case.

Tuple deforming is the process of transforming an on-disk tuple into its in-memory representation. It can be accelerated by creating a function specific to the table layout and the number of columns to be extracted.

Often you will have a workload that is mixed, where some queries will benefit from JIT, and some will be slowed down by the overhead.

Here is how you can monitor JIT performance using EXPLAIN and auto_explain, as well as how you can determine whether your queries are benefiting from JIT optimization.

Monitoring JIT with EXPLAIN / auto_explain

First of all, you will need to make sure that your Postgres packages are compiled with JIT support (--with-llvm configuration switch). Assuming that you have Postgres binaries compiled like that, the jit configuration parameter controls whether JIT is actually being used.

For this example, we’re working with one of our staging databases, and pick a relatively simple query that can benefit from JIT:

SELECT COUNT(*) FROM log_lines
 WHERE log_classification = 65 AND (details->>'new_dead_tuples')::integer >= 0;

For context, the table log_lines is an internal log event statistics table of pganalyze, which is typically indexed per-server, but in this case we want to run an analytical query across all servers to count interesting autovacuum completed log events.

First, if we run the query with jit = off, we will get an execution plan and runtime like this:

EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM log_lines
    WHERE log_classification = 65 AND (details->>'new_dead_tuples')::integer >= 0;

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                        QUERY PLAN                                                        │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=649724.03..649724.04 rows=1 width=8) (actual time=3498.939..3498.939 rows=1 loops=1)                    │
│   Buffers: shared hit=1538 read=386328                                                                                   │
│   I/O Timings: read=1098.036                                                                                             │
│   ->  Seq Scan on log_lines  (cost=0.00..649675.55 rows=19393 width=0) (actual time=0.028..3437.032 rows=667063 loops=1) │
│         Filter: ((log_classification = 65) AND (((details ->> 'new_dead_tuples'::text))::integer >= 0))                  │
│         Rows Removed by Filter: 14396065                                                                                 │
│         Buffers: shared hit=1538 read=386328                                                                             │
│         I/O Timings: read=1098.036                                                                                       │
│ Planning Time: 0.095 ms                                                                                                  │
│ Execution Time: 3499.089 ms                                                                                              │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(10 rows)

Time: 3499.580 ms (00:03.500)

Note the usage of EXPLAIN's BUFFERS option so we can compare whether any caching behavior affects our benchmarking. We can also see that I/O time was only 1,098 ms out of 3,499 ms (about 31%), so this query is mostly CPU bound.

For comparison, when we enable JIT, we can see the following:

SET jit = on;
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM log_lines
    WHERE log_classification = 65 AND (details->>'new_dead_tuples')::integer >= 0;

┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                        QUERY PLAN                                                         │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=649724.03..649724.04 rows=1 width=8) (actual time=2816.497..2816.498 rows=1 loops=1)                     │
│   Buffers: shared hit=1570 read=386296                                                                                    │
│   I/O Timings: read=1154.438                                                                                              │
│   ->  Seq Scan on log_lines  (cost=0.00..649675.55 rows=19393 width=0) (actual time=78.912..2759.717 rows=667063 loops=1) │
│         Filter: ((log_classification = 65) AND (((details ->> 'new_dead_tuples'::text))::integer >= 0))                   │
│         Rows Removed by Filter: 14396065                                                                                  │
│         Buffers: shared hit=1570 read=386296                                                                              │
│         I/O Timings: read=1154.438                                                                                        │
│ Planning Time: 0.095 ms                                                                                                   │
│ JIT:                                                                                                                      │
│   Functions: 4                                                                                                            │
│   Options: Inlining true, Optimization true, Expressions true, Deforming true                                             │
│   Timing: Generation 1.044 ms, Inlining 14.205 ms, Optimization 46.678 ms, Emission 17.868 ms, Total 79.795 ms            │
│ Execution Time: 2817.713 ms                                                                                               │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(14 rows)

Time: 2818.250 ms (00:02.818)

In this case, JIT yields about a 24% speed-up (2,818 ms instead of 3,499 ms, or roughly 19% less runtime), due to spending less CPU time, without any extra effort on our end. We can also see that the JIT tasks themselves added about 80 ms to the runtime.
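
The arithmetic behind these numbers can be reproduced with a few lines of Ruby. All values are copied from the two EXPLAIN outputs above; the constant and helper names are made up for this sketch.

```ruby
# Numbers copied from the two EXPLAIN (ANALYZE, BUFFERS) outputs above.
WITHOUT_JIT = { execution_ms: 3499.089, io_read_ms: 1098.036 }.freeze
WITH_JIT    = { execution_ms: 2817.713, io_read_ms: 1154.438 }.freeze
JIT_TIMING_MS = {
  generation: 1.044, inlining: 14.205, optimization: 46.678, emission: 17.868
}.freeze

# Fraction of the execution time spent waiting on reads.
def io_share(run)
  run[:io_read_ms] / run[:execution_ms]
end

# How many times faster the second run is.
def speedup(before, after)
  before[:execution_ms] / after[:execution_ms]
end

# Fraction of the runtime that was saved.
def runtime_reduction(before, after)
  1.0 - after[:execution_ms] / before[:execution_ms]
end

puts format('I/O share without JIT: %.1f%%', io_share(WITHOUT_JIT) * 100)
puts format('Speed-up with JIT:     %.2fx', speedup(WITHOUT_JIT, WITH_JIT))
puts format('Runtime reduction:     %.1f%%', runtime_reduction(WITHOUT_JIT, WITH_JIT) * 100)
puts format('JIT overhead:          %.3f ms', JIT_TIMING_MS.values.sum)
```

The four JIT phases add up to the Total reported in the plan, and the read I/O time is nearly the same in both runs, so the saving comes from the CPU side.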

You can fine tune whether JIT is used for a particular query by the jit_above_cost parameter which applies to the total cost of the query as determined by the Postgres planner. The cost is 649724 in the above EXPLAIN output, which exceeds the default jit_above_cost threshold of 100000. In a future post we'll walk through more examples of when using JIT can be beneficial.
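
The cost comparison can be sketched in Ruby as follows. Besides jit_above_cost, the sketch also uses jit_inline_above_cost and jit_optimize_above_cost (both default to 500000), which are not covered in this post but explain why the plan above reports Inlining true and Optimization true. The function is a simplified model of the planner's decision as we understand it, not Postgres code, and all names in it are illustrative.

```ruby
# Default thresholds in Postgres 11; verify on your server with
# SHOW jit_above_cost, SHOW jit_inline_above_cost, SHOW jit_optimize_above_cost.
JIT_DEFAULTS = {
  jit_above_cost: 100_000,
  jit_inline_above_cost: 500_000,
  jit_optimize_above_cost: 500_000
}.freeze

# A negative threshold disables the corresponding feature.
def above_threshold?(cost, threshold)
  threshold >= 0 && cost > threshold
end

# Simplified model: which JIT stages apply for a given total plan cost.
def jit_decisions(total_cost, settings = JIT_DEFAULTS)
  jit = above_threshold?(total_cost, settings[:jit_above_cost])
  {
    jit: jit,
    inlining: jit && above_threshold?(total_cost, settings[:jit_inline_above_cost]),
    optimization: jit && above_threshold?(total_cost, settings[:jit_optimize_above_cost])
  }
end

# Total cost of the Aggregate node from the EXPLAIN output above.
p jit_decisions(649_724.04)

# A cheaper, hypothetical plan: JIT is used, but without inlining or
# the expensive optimization passes.
p jit_decisions(200_000)
```

Raising jit_above_cost keeps JIT away from mid-sized queries where the compilation overhead may outweigh the gain, while lowering it does the opposite.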

You can gather these JIT statistics either for individual queries that you are interested in (using EXPLAIN), or automatically collect it for all of your queries using the auto_explain extension. If you want to learn more about how to enable auto_explain we recommend reviewing our guide about it: pganalyze Log Insights - Tuning Log Config Settings.

Fun fact: As part of the writing of this article we ran experiments with JIT and auto_explain, and discovered that JIT information wasn’t included with auto_explain, but only with regular EXPLAINs. Luckily, we were able to contribute a bug fix to Postgres, which has been merged and will be part of the Postgres 11 release.

Preventing cold caches: Auto prewarm in Postgres 11

A neat feature that will help you improve performance right after restarting Postgres, is the new autoprewarm background worker functionality.

If you are not familiar with pg_prewarm, it's an extension that's bundled with Postgres (much like pg_stat_statements), that you can use to preload data that’s on disk into the Postgres buffer cache.

It is often very useful to ensure that a certain table is cached before the first production query hits the database, to avoid an overly slow response due to data being loaded from disk.

Previously, you needed to manually specify which relations (i.e. tables) and which page offsets to preload, which was cumbersome, and hard to automate.

Caching tables with autoprewarm

Starting in Postgres 11, you can instead have this done automatically, by adding pg_prewarm to shared_preload_libraries like this:

shared_preload_libraries = 'pg_prewarm,pg_stat_statements'

Doing this will automatically save information on which tables/indices are in the buffer cache (and which parts of them) every 300 seconds to a file called autoprewarm.blocks, and use that information after Postgres restarts to reload the previously cached data from disk into the buffer cache, thus improving initial query performance.

Stored procedures in Postgres 11

Postgres has had database server-side functions for a long time, with a variety of supported languages. You might have used the term “procedures” before to refer to such functions, as they are similar to what’s called “Stored Procedures” in other database systems such as Oracle.

However, one detail that is sometimes missed, is that the existing functions in Postgres were always running within the same transaction. There was no way to begin, commit, or rollback a transaction within a function, as they were not allowed to run outside of a transaction context.

Starting in Postgres 11, you will have the ability to use CREATE PROCEDURE instead of CREATE FUNCTION to create procedures.

Benefits of using stored procedures

Compared to regular functions, procedures can do more than just query or modify data: They also have the ability to begin/commit/rollback transactions within the procedure.

Particularly for those moving over from Oracle to PostgreSQL, the new procedure functionality can be a significant time saver. You can find some examples of how to convert procedures between those two relational database systems in the Postgres documentation.

How to use stored procedures

First, let’s create a simple procedure that handles some tables:

CREATE PROCEDURE my_table_task() LANGUAGE plpgsql AS $$
DECLARE
BEGIN
  CREATE TABLE table_committed (id int);
  COMMIT;
  CREATE TABLE table_rolled_back (id int);
  ROLLBACK;
END $$;

We can then call this procedure like this, using the new CALL statement:

=# CALL my_table_task();
CALL
Time: 1.573 ms

Here you can see the benefit of procedures - despite the rollback the overall execution is successful, and the first table got created, but the second one was not since the transaction was rolled back.

Be careful: Transaction timestamps and xact_start for procedures

Expanding on how transactions work inside procedures, there is currently an oddity with the transaction timestamp, which for example you can see in xact_start. When we expand the procedure like this:

CREATE PROCEDURE my_table_task() LANGUAGE plpgsql AS $$
DECLARE
  clock_str TEXT;
  tx_str TEXT;
BEGIN
  CREATE TABLE table_committed (id int);
    SELECT clock_timestamp() INTO clock_str;
    SELECT transaction_timestamp() INTO tx_str;
    RAISE NOTICE 'After 1st CREATE TABLE: % clock, % xact', clock_str, tx_str;
    PERFORM pg_sleep(5);
  COMMIT;
  CREATE TABLE table_rolled_back (id int);
    SELECT clock_timestamp() INTO clock_str;
    SELECT transaction_timestamp() INTO tx_str;
    RAISE NOTICE 'After 2nd CREATE TABLE: % clock, % xact', clock_str, tx_str;
  ROLLBACK;
END $$;

And then call the procedure, we see the following:

=# CALL my_table_task();
NOTICE:  00000: After 1st CREATE TABLE: 2018-10-03 22:17:26 clock, 2018-10-03 22:17:26 xact
NOTICE:  00000: After 2nd CREATE TABLE: 2018-10-03 22:17:31 clock, 2018-10-03 22:17:26 xact
CALL
Time: 5022.598 ms

Despite there being two transactions in the procedure, the transaction start timestamp is that of when the procedure got called, not when the embedded transaction actually started.

You will see the same problem with the xact_start field in pg_stat_activity, causing monitoring scripts to potentially detect false positives for long running transactions. This issue is currently in discussion and likely to be changed before the final release.

How often does my stored procedure get called?

Now, if you want to monitor the performance of procedures, it gets a bit difficult. Whilst regular functions can be tracked by setting track_functions (to pl or all), there is no such facility for procedures. You can however track the execution of CALL statements using pg_stat_statements:

SELECT query, calls, total_time FROM pg_stat_statements WHERE query LIKE 'CALL%';

┌────────────┬───────┬────────────┐
│   query    │ calls │ total_time │
├────────────┼───────┼────────────┤
│ CALL abc() │     4 │    5.62299 │
└────────────┴───────┴────────────┘

In addition, when you enable pg_stat_statements.track = all, queries that are called from within a procedure will be tracked, and made available in Postgres query performance monitoring tools such as pganalyze.

Conclusion

Postgres 11 is going to be the best Postgres release yet, and we are excited to put it into use.

Whilst common wisdom is to not upgrade right after a release, we encourage you to try out the new release early, help the community find bugs (just like we did!), and make sure that your performance monitoring systems are ready to handle the new features that were added.


Postgres Log Monitoring 101: Deadlocks, Checkpoint Tuning & Blocked Queries

Lukas Fittl — Mon, 12 Feb 2018 00:00:00 GMT

Those of us who operate production PostgreSQL databases have many jobs to do - and often there isn't enough time to take a regular look at the Postgres log files.

However, those logs often contain critical details on how new application code is affecting the database due to locking issues, or how certain configuration parameters cause the database to produce I/O spikes.

This post highlights three common performance problems you can find by looking at, and automatically filtering, your Postgres logs.

Blocked Queries

Among the log events most relevant to performance are blocked queries, which wait for locks that another query has taken. On systems that have problems with locks you will often also see very high CPU utilization that can't be explained.

First, in order to enable logging of lock waits, set log_lock_waits = on in your Postgres config. This will emit a log event like the following if a query has been waiting for longer than deadlock_timeout (default 1s):

LOG: process 123 still waiting for ShareLock on transaction 12345678 after 1000.606 ms
STATEMENT: SELECT table WHERE id = 1 FOR UPDATE;
CONTEXT: while updating tuple (1,3) in relation "table"
DETAIL: Process holding the lock: 456. Wait queue: 123.

This tells us that we're seeing lock contention on updates for table, as another transaction holds a lock on the same row we're trying to update. You can often see this caused by complex transactions that hold locks for too long. One frequent anti-pattern in a typical web app is to:

1. Open a transaction
2. Update a timestamp field (e.g. updated_at in Ruby on Rails)
3. Make an API call to an external service
4. Commit the transaction

The lock on the row that you updated in Step 2 will be held until the commit in Step 4, which means that if the API call takes a few seconds, you will be holding a lock on that row for that entire time. If you have any concurrency in your system that affects the same rows, you will see lock contention, and the above lock notice for the queries in Step 2.

However, you often have to go back to a development or staging system with full query logging to understand the full context of the transaction that's causing the problem.
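To filter for these events automatically, a log pipeline needs to pull the relevant fields out of the "still waiting" line. The following is a minimal sketch in Python, assuming the log line has the shape shown above; the function name parse_lock_wait and the field names are made up for this illustration, and real logs will carry whatever prefix your log_line_prefix setting adds.

```python
import re

# Matches the "still waiting" line emitted when log_lock_waits = on.
LOCK_WAIT_RE = re.compile(
    r"process (?P<pid>\d+) still waiting for (?P<lock_mode>\w+) on "
    r"(?P<target>.+?) after (?P<wait_ms>[\d.]+) ms"
)


def parse_lock_wait(line):
    """Return the fields of a lock wait event, or None if the line is unrelated."""
    match = LOCK_WAIT_RE.search(line)
    if match is None:
        return None
    return {
        "pid": int(match.group("pid")),
        "lock_mode": match.group("lock_mode"),
        "target": match.group("target"),
        "wait_ms": float(match.group("wait_ms")),
    }


sample = "LOG: process 123 still waiting for ShareLock on transaction 12345678 after 1000.606 ms"
event = parse_lock_wait(sample)
print(event)
```

From here you could, for example, alert whenever wait_ms exceeds a threshold, or group events by target to find the rows that are most contended.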

Deadlocks

Related to blocked queries, but slightly different, are deadlocks, which result in a cancelled query due to it deadlocking against another query.

The easiest way to reproduce a deadlock is doing the following:

--- session 1
BEGIN;
SELECT * FROM table WHERE id = 1 FOR UPDATE;

--- session 2
BEGIN;
SELECT * FROM table WHERE id = 2 FOR UPDATE;
SELECT * FROM table WHERE id = 1 FOR UPDATE; --- this will block waiting for session 1 to finish

--- session 1
SELECT * FROM table WHERE id = 2 FOR UPDATE; --- this can never finish as it deadlocks against session 2

Again, after deadlock_timeout Postgres will notice the locking problem. In this case it determines that the wait will never finish, cancels one of the queries, and emits the following to the logs:

2018-02-12 09:24:52.176 UTC [3098] ERROR:  deadlock detected
2018-02-12 09:24:52.176 UTC [3098] DETAIL:  Process 3098 waits for ShareLock on transaction 219201; blocked by process 3099.
	Process 3099 waits for ShareLock on transaction 219200; blocked by process 3098.
	Process 3098: SELECT * FROM table WHERE id = 2 FOR UPDATE;
	Process 3099: SELECT * FROM table WHERE id = 1 FOR UPDATE;
2018-02-12 09:24:52.176 UTC [3098] HINT:  See server log for query details.
2018-02-12 09:24:52.176 UTC [3098] CONTEXT:  while locking tuple (0,1) in relation "table"
2018-02-12 09:24:52.176 UTC [3098] STATEMENT:  SELECT * FROM table WHERE id = 2 FOR UPDATE;

You might think that deadlocks never happen in production, but the unfortunate truth is that heavy use of ORM frameworks can hide the circular dependency situation that produces deadlocks, and it's certainly something to watch out for when you make use of complex transactions.
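The circular dependency is visible in the DETAIL lines of the log event: each line names a waiting process and the process blocking it. As a rough sketch (not how Postgres itself detects deadlocks), the following Python snippet rebuilds that wait-for relationship from the log text and follows it until it loops back. The helper names wait_graph and find_cycle are hypothetical.

```python
import re

# One "waits for ... blocked by" line per process involved in the deadlock.
WAITS_RE = re.compile(r"Process (\d+) waits for .*?; blocked by process (\d+)\.")


def wait_graph(detail_text):
    """Map each waiting pid to the pid that blocks it."""
    return {int(waiter): int(blocker) for waiter, blocker in WAITS_RE.findall(detail_text)}


def find_cycle(graph, start):
    """Follow blocked-by edges from start; return the cycle as a list, or None."""
    path = []
    pid = start
    while pid in graph and pid not in path:
        path.append(pid)
        pid = graph[pid]
    if pid in path:
        return path[path.index(pid):]
    return None


detail = (
    "Process 3098 waits for ShareLock on transaction 219201; blocked by process 3099.\n"
    "Process 3099 waits for ShareLock on transaction 219200; blocked by process 3098.\n"
)
graph = wait_graph(detail)
print(find_cycle(graph, 3098))  # [3098, 3099]
```

Collecting the statements attached to each process in the cycle is usually the quickest way to find the two code paths that take locks in opposite order.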

Checkpoints

Last but not least, checkpoints. For those unfamiliar, checkpointing is the mechanism by which PostgreSQL persists all changes to the data directory, which before were only in shared buffers and the WAL. It's what gives you a consistent copy of your data in one place (the data directory).

Due to the fact that checkpoints have to write out all the changes you've submitted to the database (which before were already written to the WAL), they can produce quite a lot of I/O - in particular when you are actively loading data.

The easiest way to produce a checkpoint is to call CHECKPOINT, but very few people would do that frequently in production. Instead Postgres has a mechanism that automatically triggers a checkpoint, most commonly due to either time, or xlog. After turning on log_checkpoints = 1 you can see this in the logs like this:

Feb 09 08:30:07am PST 12772 LOG: checkpoint starting: time
Feb 09 08:15:50am PST 12772 LOG: checkpoint starting: xlog
Feb 09 08:10:39am PST 12772 LOG: checkpoint starting: xlog

Or when visualized over time, it can look like this:

Occasionally Postgres will also output the following warning, which hints at the tuning you can do:

Feb 09 10:21:11am PST 5677 LOG: checkpoints are occurring too frequently (17 seconds apart)

With checkpoints you want to avoid having them occur too frequently, as each checkpoint will produce significant I/O, as well as cause all changes that are written to WAL right after to be written as a full-page write.
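A simple way to see which trigger dominates on your system is to count the "checkpoint starting" lines by reason. The sketch below does this in Python for the sample lines shown earlier; the function name is made up, and the pattern only captures the first word after the colon, which is an assumption that holds for the time and xlog cases discussed here.

```python
import re
from collections import Counter

CHECKPOINT_RE = re.compile(r"checkpoint starting: (\w+)")


def count_checkpoint_triggers(log_lines):
    """Count how many checkpoints were started per trigger reason."""
    reasons = Counter()
    for line in log_lines:
        match = CHECKPOINT_RE.search(line)
        if match:
            reasons[match.group(1)] += 1
    return reasons


log_lines = [
    "Feb 09 08:30:07am PST 12772 LOG: checkpoint starting: time",
    "Feb 09 08:15:50am PST 12772 LOG: checkpoint starting: xlog",
    "Feb 09 08:10:39am PST 12772 LOG: checkpoint starting: xlog",
]
triggers = count_checkpoint_triggers(log_lines)
print(dict(triggers))  # {'time': 1, 'xlog': 2}
```

If xlog outnumbers time over a longer period, that is a hint that the WAL size limit, rather than the timeout, is driving your checkpoints.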

Ideally you would see checkpoints spaced out evenly and usually started by time instead of xlog. You can influence this behavior by the following config settings:

checkpoint_timeout - the time after which a time checkpoint will be kicked off (defaults to every 5 minutes)
max_wal_size - the maximum amount of WAL that will be accumulated before an xlog checkpoint gets triggered (defaults to 1 GB)
checkpoint_completion_target - how quickly a checkpoint finishes (defaults to 0.5 which means it will finish in half the time of checkpoint_timeout, i.e. 2.5 minutes)
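To make the relationship between the first and the last setting concrete, here is the arithmetic as a small Python sketch. The function name is hypothetical, and the calculation only covers time-based checkpoints under the simplified description given above.

```python
def checkpoint_write_window(checkpoint_timeout_s, completion_target):
    """Seconds over which a time-based checkpoint spreads its writes."""
    return checkpoint_timeout_s * completion_target


# Defaults described above: 5 minute timeout, completion target 0.5.
print(checkpoint_write_window(300, 0.5))  # 150.0 seconds, i.e. 2.5 minutes

# A higher completion target spreads the same writes over more time.
relaxed_window = checkpoint_write_window(300, 0.9)
```

Spreading the writes over a longer window lowers the peak I/O rate of each checkpoint, since the same amount of data is written more slowly.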

On many production systems I've seen max_wal_size increased to support higher write rates, checkpoint_timeout slightly increased to avoid too frequent time-based checkpoints, and checkpoint_completion_target set to 0.9.

You should however tune all of this based on your own system, and the logs, so you can choose what's correct for your setup. Also note that less frequent checkpoints mean recovery of the server is going to take longer, as Postgres will have to replay all WAL, starting from the previous checkpoint, when booting after a crash.

Conclusion

Postgres log files contain a treasure trove of useful data you can analyze in order to make your system behave faster, as well as debug production issues. This data is readily available, but often difficult to parse.

This article tries to point the way towards which log lines are worth filtering for on production systems.

If you don't want to bother with setting up your own filters in a third party logging system, try out pganalyze Postgres Log Insights: a real-time PostgreSQL log analysis and log monitoring system built into pganalyze.


Visualizing & Tuning Postgres Autovacuum

Lukas Fittl — Tue, 28 Nov 2017 00:00:00 GMT

In this post we'll take a deep dive into one of the mysteries of PostgreSQL: VACUUM and autovacuum.

The Postgres autovacuum logic can be tricky to understand and tune - it has many moving parts, in particular for application developers who don't spend all day looking at database documentation.

But luckily there are recent improvements in Postgres, in particular the addition of pg_stat_progress_vacuum in Postgres 9.6, that make understanding autovacuum and VACUUM behavior a bit easier.

In this post we describe an approach to autovacuum tuning that is based on sampling these statistics over time, visualizing them, and then making tuning decisions based on data. The visualizations shown are all screenshots of real data, and are available for early access in pganalyze.

Why VACUUM?

First of all, a quick primer on why we need VACUUM:

When you perform UPDATE and DELETE operations on a table in Postgres, the database has to keep around the old row data for concurrently running queries and transactions, due to its MVCC model. Once all concurrent transactions that have seen these old rows have finished, they effectively become dead rows which will need to be removed.

VACUUM is the process by which PostgreSQL cleans up these dead rows, and turns the space they have occupied into usable space again, to be used for future writes.

A more detailed description can be found in the PostgreSQL documentation.

Which tables have VACUUM running?

The easiest thing you can check on a running PostgreSQL system is which VACUUM operations are running right now. In all Postgres versions this information shows up in the pg_stat_activity view. Look for query values that start with "autovacuum: ", or which contain the word "VACUUM":

SELECT pid, query FROM pg_stat_activity WHERE query LIKE 'autovacuum: %';

-------+----------------------------------------------------------------------------
 10469 | autovacuum: VACUUM ANALYZE public.schema_columns
 12848 | autovacuum: VACUUM public.replication_follower_stats (to prevent wraparound)
 28626 | autovacuum: VACUUM public.schema_index_stats (to prevent wraparound)

Based on sampling this data, we can generate a timeline view that helps us distinguish tables that are frequently vacuumed, from tables that have long running vacuums, to tables that don't get vacuumed much at all.
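Turning sampled query values into something you can plot requires extracting the table name and the reason from each string. The following Python sketch does this for the three sample values shown above; parse_autovacuum_query is a hypothetical helper, and the pattern is an assumption based on those samples rather than a complete description of what Postgres can emit.

```python
import re

AUTOVACUUM_RE = re.compile(
    r"^autovacuum: (?P<command>VACUUM(?: ANALYZE)?|ANALYZE) "
    r"(?P<table>\S+)(?P<wraparound> \(to prevent wraparound\))?$"
)


def parse_autovacuum_query(query):
    """Split an autovacuum query value into command, table and wraparound flag."""
    match = AUTOVACUUM_RE.match(query)
    if match is None:
        return None
    return {
        "command": match.group("command"),
        "table": match.group("table"),
        "wraparound": match.group("wraparound") is not None,
    }


samples = [
    "autovacuum: VACUUM ANALYZE public.schema_columns",
    "autovacuum: VACUUM public.replication_follower_stats (to prevent wraparound)",
    "autovacuum: VACUUM public.schema_index_stats (to prevent wraparound)",
]
parsed = [parse_autovacuum_query(query) for query in samples]
for entry in parsed:
    print(entry)
```

Storing one such record per sample interval, together with the pid and a timestamp, is enough to reconstruct when each VACUUM started and how long it ran.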

In the screenshot you can see the top 10 tables (by frequency) colored the same way, and in particular the table that's colored light yellow stands out as effectively having VACUUM running continuously.

We can also see that one manual VACUUM was started by the DBA user (colored in cyan), and that it ran much quicker than the same colored version started by autovacuum earlier in the day.

When does autovacuum run?

Another question that frequently comes up is, why did autovacuum decide to start VACUUMing a table?

There are essentially two major reasons:

1) To prevent Transaction ID wraparound

The number of non-frozen transaction IDs has reached "autovacuum_freeze_max_age" (default 200 million transactions), and VACUUM is required to prevent transaction ID wraparound.

We won't go too much into detail on tuning this parameter in this post, but rather reserve this as a follow-on topic.

Note that this can't be disabled, so it will cause autovacuum to start VACUUM, even if it is otherwise disabled. If you keep cancelling autovacuum processes started for this reason you will eventually have to perform a manual VACUUM, as Postgres will shut down the database otherwise.

2) To mark dead rows & enable re-use for new data

As you run UPDATEs and DELETEs, dead rows will accumulate, as described earlier in the post. Once the number of dead rows (or tuples) has exceeded the threshold, autovacuum will start a VACUUM run.

The following formula is used to decide whether vacuuming is needed:

vacuum threshold = vacuum base threshold + vacuum scale factor * number of tuples

By default the base threshold is 50 rows, and the scale factor is 20%. That means a table will be vacuumed as soon as the number of dead rows exceeds 20% of all rows in the table plus 50 rows.

In order to understand when this gets triggered, you can look at the n_live_tup and n_dead_tup values in pg_stat_user_tables:

SELECT * FROM pg_stat_user_tables WHERE relname = 'backend_states';

-[ RECORD 1 ]-------+------------------------------
relid               | 732156523
schemaname          | public
relname             | backend_states
...
n_live_tup          | 23047184
n_dead_tup          | 108373
...
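Plugging the values from this output into the formula shows how far this table is from triggering autovacuum. This is a rough sketch in Python: it uses n_live_tup as the number of tuples, which is an approximation (Postgres works from its own row count estimate), and the function name is made up for the example.

```python
def autovacuum_threshold(n_tuples, base_threshold=50, scale_factor=0.2):
    """vacuum threshold = base threshold + scale factor * number of tuples"""
    return base_threshold + scale_factor * n_tuples


n_live_tup = 23047184
n_dead_tup = 108373

threshold = autovacuum_threshold(n_live_tup)
print(round(threshold))        # 4609487 dead rows needed with the default settings
print(n_dead_tup > threshold)  # False: autovacuum would not start yet

# With a per-table scale factor of 0.1% the same table would already qualify.
lowered = autovacuum_threshold(n_live_tup, scale_factor=0.001)
print(n_dead_tup > lowered)    # True
```

On a table of this size the default settings allow roughly 4.6 million dead rows to accumulate before autovacuum starts.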

We can then take this information, together with the autovacuum settings, and visualize it:

Here you can see that as soon as the dead tuples (grey/red area) reach the threshold (grey line), a VACUUM process kicks off (red line in the lower graph).

On a table that can't keep up with VACUUM, which results in bloat due to dead rows, this would instead look like this:

How fast does autovacuum run?

A VACUUM process that was started by autovacuum is artificially throttled in the default PostgreSQL configuration, so it doesn't fully utilize the CPU and I/O available.

That is the correct way to operate for most systems, as you wouldn't want VACUUM to slow down application queries during business hours.

The system that Postgres follows for this is that every VACUUM operation accumulates cost, which you can think of as points that get added up:

vacuum_cost_page_hit (cost for vacuuming a page found in the buffer cache, default 1)
vacuum_cost_page_miss (cost for vacuuming a page retrieved from disk, default 10)
vacuum_cost_page_dirty (cost for writing back a modified page to disk, default 20)

Once the sum of costs has reached autovacuum_vacuum_cost_limit (default 200 for autovacuum, disabled for manual VACUUM), the VACUUM process will sleep and do nothing for autovacuum_vacuum_cost_delay (default 20 ms).

With the default parameters, that means that autovacuum will at most write 4MB/s to disk, and read 8MB/s from disk or the OS page cache.
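These numbers follow directly from the cost settings. The Python sketch below reproduces the arithmetic, assuming the default 8 kB block size and treating the result as an upper bound, since it ignores the time VACUUM spends doing actual work between sleeps. The function name is hypothetical.

```python
PAGE_SIZE_BYTES = 8192  # default Postgres block size


def max_throughput_mb_per_s(cost_limit, cost_delay_ms, cost_per_page):
    """Upper bound on throughput when every page incurs the same cost."""
    rounds_per_second = 1000 / cost_delay_ms
    pages_per_second = rounds_per_second * cost_limit / cost_per_page
    return pages_per_second * PAGE_SIZE_BYTES / (1024 * 1024)


# Defaults: cost limit 200, delay 20 ms, page miss cost 10, page dirty cost 20.
read_mb = max_throughput_mb_per_s(200, 20, cost_per_page=10)
write_mb = max_throughput_mb_per_s(200, 20, cost_per_page=20)
print(read_mb)   # 7.8125, roughly 8MB/s of reads
print(write_mb)  # 3.90625, roughly 4MB/s of writes
```

Doubling the cost limit, or halving the delay, doubles both bounds, which is a quick way to estimate the effect of a tuning change before applying it.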

How far has this VACUUM made progress?

VACUUM runs through three different major phases as part of its operation:

Scanning Heap
Vacuuming Indices
Vacuuming Heap

There are also a few minor phases that are usually really quick.

The "Vacuuming Indices" and "Vacuuming Heap" phases might run multiple times if the autovacuum_work_mem setting is so low that not all dead tuples can be held in memory.
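To get a feeling for when multiple passes happen, you can estimate how many dead tuples fit into the configured memory. The sketch below is written in Python and rests on two assumptions that are not stated in this post: that each remembered dead tuple takes 6 bytes, and that autovacuum_work_mem falls back to a 64 MB maintenance_work_mem. The function name is made up, and real behavior depends on your Postgres version and settings.

```python
import math

BYTES_PER_DEAD_TUPLE = 6  # assumed size of one remembered tuple identifier


def index_vacuum_passes(dead_tuples, work_mem_bytes):
    """Estimate how often the index and heap vacuuming phases have to run."""
    if dead_tuples == 0:
        return 0
    capacity = work_mem_bytes // BYTES_PER_DEAD_TUPLE
    return math.ceil(dead_tuples / capacity)


work_mem = 64 * 1024 * 1024  # assumed 64 MB

print(index_vacuum_passes(108373, work_mem))      # 1
print(index_vacuum_passes(50_000_000, work_mem))  # 5
```

Each additional pass means scanning every index on the table again, which is why a larger memory setting can noticeably shorten VACUUM on tables with many dead rows.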

Based on sampling pg_stat_progress_vacuum we can visualize in detail what goes on:

This works even whilst an autovacuum or manual VACUUM is still running, and so we can get a visual indication of how long we will roughly have to wait for it to finish.

What should I tune first?

In general, one might think that VACUUM is an expensive operation, and you'd want to only run it infrequently, maybe even as a nightly maintenance task.

That however is often the wrong way to approach it, as rarely run VACUUMs are much more expensive since they have more work to do, and it also means your system will spend more time in a sub-optimal state.

Instead, try to have VACUUM run more often, in proportion to the UPDATEs and DELETEs your application performs. Frequently run VACUUMs will be faster, as there is less work to perform.

There are two primary tunings you should consider on production Postgres databases:

1) Lower autovacuum_vacuum_scale_factor on tables with old, inactive data

For tables with a lot of old, inactive data, consider lowering the threshold by which autovacuum is triggered. Since the calculation is based on the number of total rows in the table, autovacuum will not notice if most recent rows have been modified, since the overall number of dead rows will still be way below the default threshold of 20%.

However, you will see the impact of dead rows on your query performance, as the dead rows have to be scanned over when reading data. Reducing the scale factor to keep down the total number of dead rows can make sense in such cases.

2) Adjust autovacuum_vacuum_cost_limit / autovacuum_vacuum_cost_delay for bigger machines

The default settings for throttling are quite conservative on modern systems. Unless you run on the smallest instance type, or with the cheapest storage, it often makes sense to speed up autovacuum a bit.

In addition, for small tables that have a lot of updates/deletes, it can happen that autovacuum is not able to keep up, and that you will see new VACUUMs start pretty much right after the previous one was finished. In such cases adjusting the throttling on a per-table basis might also make sense.

Note that most autovacuum configuration settings can be overridden on a per-table basis:

ALTER TABLE my_table SET (autovacuum_vacuum_scale_factor = 0.05);

It often makes sense to review that table's particular statistics, e.g. how often is the table updated and how many dead tuples does it accumulate, before modifying autovacuum settings.

The visualizations shown in this post are based on real data, and are now available for early access to all pganalyze customers on the Scale plan and higher.

Reach out to have this feature enabled for your account - we'd be happy to walk you through it, and help you tune autovacuum on your database.


What's New in Postgres 10: Monitoring Improvements

Lukas Fittl — Wed, 04 Oct 2017 00:00:00 GMT

Postgres 10 has been stamped on Monday, and will most likely be released this week, so this seems like a good time to review what this new release brings in terms of Monitoring functionality built into the database.

In this post you'll see a few things that we find exciting about the new release, as well as some tips on what to adjust, whether you use a hosted Postgres monitoring tool like pganalyze, or if you've written your own scripts.

New "pg_monitor" Monitoring Role

Most users of Postgres obviously don't want to give monitoring tools superuser access, but in the past this was often required, as many Postgres statistics views (e.g. pg_stat_statements) only show the values for the current user, unless you are superuser.

This meant that you had to work around the restriction with SECURITY DEFINER functions that query the statistics views as superuser, but can be called from a restricted user.

Now, you can use the monitoring role in Postgres 10 to instead give a user specific access to monitor statistics views, without giving out any other access.

It's as simple as:

GRANT pg_monitor TO monitoring_user;

And afterwards that user can simply access statistics views without running into <insufficient privilege> issues like before.

This also works with pganalyze out of the box, so once you upgrade to 10 you can simply grant the monitoring role to the pganalyze user, and drop the helper functions we've previously asked you to create.

A subset of often used views that the monitoring role now grants you access to:

pg_stat_statements
pg_stat_activity
pg_stat_replication
pg_stat_progress_vacuum
.. and more

Note that there are more fine-grained roles you can assign, should you want to.

Renaming of "xlog" to "wal", and "location" to "lsn"

If you've written your own monitoring scripts to check replication lag, and other statistics that have to do with WAL or LSNs, you'll need to update some function names.

In this new release, besides the WAL directory being renamed from "pg_xlog" to "pg_wal", all system administration functions have also been renamed to match this change. In addition, where previously functions had the name "location" in them, it now refers to "lsn".

You are most likely going to run into this with the often used pg_current_xlog_location (now pg_current_wal_lsn), as well as the helper method pg_xlog_location_diff (now pg_wal_lsn_diff).

Also note that the sent_location, write_location, etc fields in pg_stat_replication have been renamed to sent_lsn, write_lsn and so forth.
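If you maintain a larger set of monitoring queries, a small rewrite helper can take care of the renames mentioned above. This Python sketch only covers the names listed in this post; the mapping is deliberately incomplete, and the names RENAMES and upgrade_query are made up for the illustration.

```python
import re

# Old name -> new name, limited to the renames mentioned in this post.
RENAMES = {
    "pg_current_xlog_location": "pg_current_wal_lsn",
    "pg_xlog_location_diff": "pg_wal_lsn_diff",
    "sent_location": "sent_lsn",
    "write_location": "write_lsn",
}

RENAME_RE = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def upgrade_query(sql):
    """Replace pre-10 function and column names with their Postgres 10 names."""
    return RENAME_RE.sub(lambda match: RENAMES[match.group(1)], sql)


old_query = (
    "SELECT pg_xlog_location_diff(pg_current_xlog_location(), sent_location) "
    "FROM pg_stat_replication"
)
new_query = upgrade_query(old_query)
print(new_query)
# SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) FROM pg_stat_replication
```

Scripts that need to support both old and new servers would instead pick the query based on the server version, rather than rewriting in one direction.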

Wait Events & Non-Client Connections in pg_stat_activity

The pg_stat_activity view and underlying data structure has been thoroughly improved this release, and now shows not just client connections and autovacuum, but also other background workers that are running in the system:

SELECT pid, backend_type, backend_start FROM pg_stat_activity WHERE backend_type != 'client backend';

 pid |    backend_type     |         backend_start         
-----+---------------------+-------------------------------
  58 | autovacuum launcher | 2017-10-03 21:02:45.458053+00
  60 | background worker   | 2017-10-03 21:02:45.459172+00
  56 | background writer   | 2017-10-03 21:02:45.457657+00
  55 | checkpointer        | 2017-10-03 21:02:45.457491+00
  57 | walwriter           | 2017-10-03 21:02:45.457817+00

If you have previously written monitoring scripts that rely on counting the number of entries in pg_stat_activity, you should filter the view by backend_type = 'client backend', or switch to using numbackends from pg_stat_database.

In addition to this, the new release also brings an additional 115 wait events (visible in wait_event_type and wait_event in pg_stat_activity), in particular more than 60 new I/O related events which help you understand better what a query is busy with.

You can find the full list of wait events in the Postgres documentation.

amcheck

Last but not least, a useful feature for consistency checking got added in this release. Initially developed by Peter Geoghegan and battle-tested at Heroku Postgres, this new tool allows you to check a B-Tree index for corruption as well as verify that invariants in the structure of the index are as expected.

It first needs to be installed with CREATE EXTENSION amcheck and can then be run by a superuser like this:

SELECT bt_index_check('my_test_index');

 bt_index_check
----------------

An empty result indicates that the index is consistent, as would be expected.

Note that amcheck accesses the index through the shared buffer cache, so it might not show problems at the disk level right away. See more details on its documentation page.

This concludes a short overview of new monitoring functionality in Postgres 10.

Note that there are many other amazing new features like parallel query, logical replication and declarative partitioning that are not covered in this post.

If this article proved useful to you, you might also be interested in our Postgres Log Monitoring 101 article where we take a closer look at Deadlocks, Checkpoint Tuning, and Blocked Queries.


Introducing pg_query: Parse PostgreSQL queries in Ruby

Lukas Fittl — Tue, 17 Jun 2014 00:00:00 GMT

In this article we'll take a look at the new pg_query Ruby library.

pg_query is a Ruby library I wrote to help you parse SQL queries and work with the PostgreSQL parse tree. We use this extension inside pganalyze to provide contextual information for each query and find columns which might need an index.

At the end of this article you'll also find monitor.rb - a ready-to-use example that filters pg_stat_statements output and restricts it to only show a specific table.

Existing Solutions to Parse SQL Queries

After a longer period of research on this problem, we've come to a few realizations:

Obviously, using regular expressions for parsing any complex language is a bad idea.
None of the existing parsers work really well, or are maintained. For example sqlparse is focused on re-indenting and beautifying SQL - not for actually working with the query.
Writing and maintaining our own SQL parser is a bad idea. SQL is complex, even for simple things like SELECT. And don't get me started on Common Table Expressions, sub-queries and other fun features.

Our conclusion: The only way to correctly parse all valid SQL queries that PostgreSQL understands, now and in the future, is to use PostgreSQL itself.

And in general, PostgreSQL turns out to have a pretty good SQL parser - other SQL databases even use it as a reference implementation.

So we've pretty much determined that we wanted to use the PostgreSQL parser itself - but how do we access it?

Accessing the PostgreSQL Parser

Let's get the PostgreSQL server source, go down the rabbit hole and find what we need:

/*
 * raw_parser
 * Given a query in string form, do lexical
 * and grammatical analysis.
 *
 * Returns a list of raw (un-analyzed) parse trees.
 */
List *
raw_parser(const char *str)
{
	...
}

This is the C function that takes a query and returns a parse tree as C structs.

Luckily this function is fairly independent: it does not need pg_catalog access (tables, indices, statistics, etc.) since it runs before the query is rewritten, planned and executed.

Unfortunately raw_parser(...) is not exposed or included in any of the PostgreSQL libraries - and it's quite difficult to extract the parser from PostgreSQL without taking a whole lot of other code with you.

The pgpool project has actually done this, but they do need to update that code for every new major release. We've therefore turned to a slightly different approach:

We use the PostgreSQL server code directly - by statically linking the code into our own shared library. Through a bit of linking magic, we simply call the internal parser functions, and expose that function through a Ruby interface, to be used like this:

require 'pg_query'

pp PgQuery.parse("SELECT 1")
#<PgQuery:0x007f8cdaa8f8b8
 @parsetree=
  [{"SELECT"=>
     {"distinctClause"=>nil,
      "intoClause"=>nil,
      "targetList"=>
       [{"RESTARGET"=>
          {"name"=>nil,
           "indirection"=>nil,
           "val"=>{"A_CONST"=>{"val"=>1, "location"=>7}},
           "location"=>7}}],
      "fromClause"=>nil,
      "whereClause"=>nil,
      "groupClause"=>nil,
      "havingClause"=>nil,
      "windowClause"=>nil,
      "valuesLists"=>nil,
      "sortClause"=>nil,
      "limitOffset"=>nil,
      "limitCount"=>nil,
      "lockingClause"=>nil,
      "withClause"=>nil,
      "op"=>0,
      "all"=>false,
      "larg"=>nil,
      "rarg"=>nil}}],
 @query="SELECT 1",
 @warnings=[]>

The result is a PostgreSQL parse tree as used by PostgreSQL internally.

Parsing Normalized Queries

Now, to the interesting part. Assume we collect pg_stat_statements queries like this one:

SELECT "users".* FROM "users" WHERE "users"."id" = ?

Note that the actual value has been replaced by the ? character. Unfortunately, the PostgreSQL parser can't parse queries normalized in this manner. It would simply return a syntax error.

At first, we simply replaced all occurrences of ? with $0 (a parameter reference) before parsing, so that the query can be parsed correctly.

There are however a few problems with that kind of "dumb" string replacement - most prominently: We're breaking all operators containing ?, like for example those for JSONB in 9.4.
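The problem is easy to reproduce with a few lines of code. The following sketch uses Python purely for illustration (it is not part of pg_query), and the function name naive_normalize as well as the data column are made up.

```python
def naive_normalize(query):
    """Replace every ? with $0, without looking at the surrounding SQL."""
    return query.replace("?", "$0")


simple = naive_normalize("SELECT * FROM x WHERE y = ?")
print(simple)
# SELECT * FROM x WHERE y = $0

# The first ? here is the JSONB "key exists" operator, not a placeholder.
broken = naive_normalize("SELECT * FROM x WHERE data ? 'key' AND y = ?")
print(broken)
# SELECT * FROM x WHERE data $0 'key' AND y = $0
```

The second result is no longer valid SQL, because the operator has been turned into a parameter reference as well.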

Our improved solution to this: We've patched the PostgreSQL parser to support ? as a parameter reference (identical with $0).

require 'pg_query'

pp PgQuery.parse("SELECT * FROM x WHERE y = ?")
#<PgQuery:0x007f8cdaaaae10
 @parsetree=
  [{"SELECT"=>
     {"distinctClause"=>nil,
      "intoClause"=>nil,
      "targetList"=>
       [{"RESTARGET"=>
          {"name"=>nil,
           "indirection"=>nil,
           "val"=>{"COLUMNREF"=>{"fields"=>[{"A_STAR"=>{}}], "location"=>7}},
           "location"=>7}}],
      "fromClause"=>
       [{"RANGEVAR"=>
          {"schemaname"=>nil,
           "relname"=>"x",
           "inhOpt"=>2,
           "relpersistence"=>"p",
           "alias"=>nil,
           "location"=>14}}],
      "whereClause"=>
       {"AEXPR"=>
         {"name"=>["="],
          "lexpr"=>{"COLUMNREF"=>{"fields"=>["y"], "location"=>22}},
          "rexpr"=>{"PARAMREF"=>{"number"=>0, "location"=>26}},
          "location"=>24}},
      "groupClause"=>nil,
      "havingClause"=>nil,
      "windowClause"=>nil,
      "valuesLists"=>nil,
      "sortClause"=>nil,
      "limitOffset"=>nil,
      "limitCount"=>nil,
      "lockingClause"=>nil,
      "withClause"=>nil,
      "op"=>0,
      "all"=>false,
      "larg"=>nil,
      "rarg"=>nil}}],
 @query="SELECT * FROM x WHERE y = ?",
 @warnings=[]>

Unfortunately, right now, this parser change limits the usage of ? in operators to those in core - specifically JSONB and geometric operators. If you use third-party extensions or custom operators that contain ?, pg_query likely won't be able to parse those queries.

The Result

As a proof of concept, I wrote monitor.rb, a Ruby script that shows the current information stored inside pg_stat_statements in a top-like manner, filtered by a specific table:

monitor.rb -d sampledb -t users

AVG     | QUERY
--------------------------------------------------------------------------------
1.5ms   | SELECT "users".* FROM "users"
0.1ms   | SELECT "users".* FROM "users" WHERE "users"."id" = ? ORDER BY "users"."id" ASC LIMIT ?
0.1ms   | UPDATE "users" SET "fullname" = $1, "updated_at" = $2 WHERE "users"."id" = ?
0.0ms   | SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT 1

This could be easily extended to highlight queries accessing large tables, potentially missing indices, etc.

Going Forward

As you can see, PostgreSQL parse trees are quite useful - and there are many more analysis/grouping options that could be explored.

If you enjoyed reading this, please give pg_query a try. Simply install it using:

gem install pg_query

During installation of the library a full PostgreSQL server is compiled, so it might take 5-10 minutes. Using a gem cache is advised for deployment.

Interested in support for other languages? Drop me a line and I'd love to chat how we can add support for Python, Perl, you name it.

Furthermore, we'll try to get some of our patches upstream for PostgreSQL 9.5 - this specifically relates to our changes in outfuncs.c, supporting additional query nodes and JSON output. Your help and feedback is appreciated.

And of course, if you build something cool with this, let us know! :)


pganalyze Blog - Articles by Lukas Fittl

Introducing pg_query for Postgres 16 - Parsing SQL/JSON, Windows support, PL/pgSQL parse mode & more

pg_query, the Postgres parser as a standalone C library

Using query fingerprints to identify queries across servers

Utilizing deparsing to upgrade queries to Postgres 16 SQL/JSON syntax

Alternate parse modes to work with PL/pgSQL expressions

A shout-out to the community

In conclusion

Postgres 16: Cumulative I/O statistics with pg_stat_io

Querying system-wide I/O statistics in Postgres

Use cases for pg_stat_io

Tracking Write I/O activity in Postgres

Improve workload stability and sizing shared_buffers by monitoring shared buffer evictions

Tracking cumulative I/O activity by autovacuum and manual VACUUMs

Visibility into bulk read/write strategies (sequential scans and COPY)

Sneak peek: Visualizing pg_stat_io in pganalyze

The future of I/O observability in Postgres

How Postgres Chooses Which Index To Use For A Query

A tour of Postgres: Parse analysis and early stages of planning

Four levels of planning a query

Breaking down a query into tables being scanned (RelOptInfo and RestrictInfo structs)

Choosing different paths and scan methods

Where Index Scans are made

Creating the two types of index scans: plain vs parameterized

Understanding B-tree index cost estimates

Parameterized Index Scans, or: Why Nested Loops are sometimes a good join type

New features coming soon to pganalyze

Conclusion

Other helpful resources

Postgres in 2021: An Observer's Year In Review

Postgres Performance: Sometimes it's the small things

Does autovacuum dream of 64-bit Transaction IDs?

EXPLAIN: Nested Loops can be deceiving

Extended Statistics: Help the Postgres planner do its job better

Crustaceous Postgres: Using Rust For Extensions & more

Other highlights from Postgres development in 2021

Conclusion

Understanding Postgres GIN Indexes: The Good and the Bad

GIN Index in Postgres: What is it actually?

Indexing tsvector columns for Postgres full text search

Indexing LIKE searches with Trigrams and gin_trgm_ops

PostgreSQL, JSONB and GIN Indexes

Postgres GIN index for JSONB columns using jsonb_ops and jsonb_path_ops

Multi-Column GIN Indexes, and Combining GIN and B-tree indexes

The downside of GIN Indexes: Expensive Updates

GIN trigram indexes: A lesson from GitLab

Measuring GIN pending list overhead and size

Strategies for dealing with GIN pending list update issues

GIN index support in the pganalyze Index Advisor

Conclusion

Other helpful resources

How we deconstructed the Postgres planner to find indexing opportunities

Planning a Postgres query without a running database server

How accurate is this planning process?

Finding multiple possible plan paths, not just the best path

Making index recommendations based on restriction clauses

Understanding Postgres clause selectivity

How we incorporated Postgres community feedback

Creating the best index, vs creating “good enough” indexes

Join us for design research sessions

Conclusion

A better way to index your Postgres database: pganalyze Index Advisor

Postgres Indexing: Is machine learning the answer?

How Postgres determines when to use an index

Creating the best Postgres index for your query

Review existing indexes with the Index Advisor

Try out the Index Advisor for free with the standalone tool

Automatic index advisor for your production queries in pganalyze

pganalyze Index Advisor and new pricing plans

Conclusion

Using Postgres CREATE INDEX: Understanding operator classes, index types & more

How do you create an index in Postgres?

Parse analysis: How Postgres interprets your query

Looking behind the scenes: Operators and data types

Finding the right index type

Specifying operator classes during CREATE INDEX

Specifying multiple columns when adding a Postgres index

Using functions and expressions in an index definition

Specifying a WHERE clause to create partial PostgreSQL indexes

Using INCLUDE to create a covering index for Index-Only Scans