Introducing pg_query for Postgres 16 - Parsing SQL/JSON, Windows support, PL/pgSQL parse mode & more

Parsing SQL queries and turning them into a syntax tree is not a simple task, especially when you want to support syntax that is specific to a particular database engine, like Postgres. And when you’re working with queries day in, day out, like we do at pganalyze, understanding the actual intent of a query, which tables it scans, which columns it filters on, and so on, is essential.

Almost 10 years ago, we determined that in order to create the best product for monitoring and optimizing Postgres, we needed to parse queries the way that Postgres does. We released the first version of pg_query back in 2014, and have since seen many projects outside of pganalyze utilize it, for example to support migration use cases, create linting tools, or check which queries an application executes (see our post from 2021 for some examples). And to name just one vanity metric, the Ruby binding for pg_query has been downloaded an incredible 34 million times!

Today, we’re excited to announce the new pg_query release based on the Postgres 16 parser, which introduces support for running on Windows (a frequently requested addition), alternate query parse modes (e.g. to parse PL/pgSQL assignments), as well as parsing and deparsing new Postgres syntax, such as SQL/JSON. We’ve released updated Ruby, Rust and Go bindings, and expect bindings maintained by the community, such as for Node.js and Python, to be updated soon as well.

In this post, we showcase how to use pg_query in your application, and a few benefits of the new release. But first, let’s go back to the basics - how does pg_query work?

pg_query, the Postgres parser as a standalone C library

At its core, pg_query is all about making the “raw_parser” function from Postgres available. We’ve written about this in more detail in the original pg_query announcement, but the quick summary is:

  1. We apply a small set of patches on top of Postgres, e.g. to help with parsing $n parameter references in queries from pg_stat_statements
  2. We utilize libclang to build a tree of dependencies between functions and global variables in the Postgres source code
  3. In some cases, we apply mocks to avoid entering parts of Postgres we don’t need (e.g., functions that access the file system)
  4. We locate all the source code necessary for the functions we want to call (like “raw_parser”), and remove all other code, to make sure the compiler doesn’t do unnecessary work, or pull in functionality we don’t need
  5. From the built-in node definitions (which are C structs), we automatically create output functions for JSON and protocol buffers, to make it convenient to write bindings in other programming languages

Overall, this results in a library that can parse SQL text and return a Postgres parse tree for you to work with and modify, whilst supporting the full syntax that Postgres itself supports.

From an end-user perspective, that means you can, for example in the Ruby library, use the following code to parse a query and find out which table it's querying:

require 'pg_query'
parsed_query = PgQuery.parse("SELECT * FROM users")
puts parsed_query.tree.stmts.first.stmt.select_stmt.from_clause.first.range_var.inspect
# => <PgQuery::RangeVar: catalogname: "", schemaname: "", relname: "users", inh: true, relpersistence: "p", location: 14>

The parse tree structs are automatically generated as protocol buffer definitions based on Postgres’ internal structs located in parsenodes.h and adjacent files, and the language-specific bindings can use each language’s protobuf libraries to have properly typed structs as well.
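
As a minimal illustration using the Ruby binding, the tree returned by PgQuery.parse is a regular protobuf message, so the standard protobuf APIs, such as encoding and decoding, work on it directly:

require 'pg_query'

tree = PgQuery.parse("SELECT * FROM users").tree
puts tree.class # => PgQuery::ParseResult

# Since this is a generated protobuf message, it can be serialized to
# bytes and decoded back with the standard protobuf APIs:
encoded = PgQuery::ParseResult.encode(tree)
decoded = PgQuery::ParseResult.decode(encoded)
puts decoded.stmts.first.stmt.select_stmt.from_clause.first.range_var.relname
# => "users"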

The main change in the core parsing functionality in this release is that we’ve added support for compiling libpg_query on Windows (with either MSVC, or an MSYS2 stack using MinGW/etc), a frequently requested feature.

Using query fingerprints to identify queries across servers

Besides parsing itself, there was another major use case that we needed to solve for pganalyze: The ability to group queries together.

Postgres itself generates a “queryid” to support this. Originally part of pg_stat_statements, it has been part of Postgres core since Postgres 14, and is generated when “compute_query_id” is enabled (which happens automatically when using pg_stat_statements). However, the Postgres queryid has its flaws: besides not always grouping queries together as well as it could (e.g. in the case of IN lists), it’s not portable. If you ran the same query on two different servers, you would get two different query IDs. That’s primarily because Postgres determines which tables a query references based on the relation OIDs, and those OIDs are internal identifiers that are not stable across servers.

With the pg_query fingerprint we intentionally went another way: We utilize the name (and schema) of the table, as it is present in the raw parse tree that pg_query has access to, when generating a unique identifier for a query.

There are of course many other parts of a query we also take into consideration, such as referenced columns, expressions, and functions. To enable grouping, we intentionally exclude constant values from the fingerprint, so that two queries that differ only in their constants get the same one:

PgQuery.fingerprint("SELECT * FROM users WHERE id = 1")
# => "a0ead580058af585"
PgQuery.fingerprint("SELECT * FROM users WHERE id = 2")
# => "a0ead580058af585"
PgQuery.fingerprint("SELECT * FROM users WHERE email = $1")
# => "e213d9d32c7097d5"

What else can we use fingerprints for? One use case that we’ve heard about from pganalyze customers, is to use query fingerprints to help identify the same query on both the application side and the database.

Specifically, you can use pg_query in application-side tracing to tag a query, and then, when looking at a slow trace, use that tag in pganalyze to find more detailed information about database-side performance. This also inspired our recent integration with OpenTelemetry, which solves the same use case in a slightly different way.
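
As a rough sketch of that idea (the helper and comment format here are hypothetical, not a pganalyze convention), you could compute the fingerprint in the application and attach it to the query as a comment, so traces and database-side statistics can be joined on the same identifier:

require 'pg_query'

# Hypothetical helper: prepend the pg_query fingerprint as a SQL comment,
# so application-side traces carry the same identifier that groups the
# query on the database side.
def tag_with_fingerprint(sql)
  "/* fingerprint: #{PgQuery.fingerprint(sql)} */ #{sql}"
end

puts tag_with_fingerprint("SELECT * FROM users WHERE email = $1")
# => "/* fingerprint: e213d9d32c7097d5 */ SELECT * FROM users WHERE email = $1"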

Utilizing deparsing to upgrade queries to Postgres 16 SQL/JSON syntax

Now to something new in the Postgres 16 release! One of the bigger syntax changes in Postgres 16 was the addition of SQL/JSON, and pg_query fully supports it, both for parsing and for deparsing (which allows you to turn a syntax tree back into a SQL statement).
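
For example, here is a quick roundtrip through the Ruby binding, parsing the new syntax and deparsing it back:

require 'pg_query'

# Parse the new SQL/JSON syntax and turn the tree back into SQL
q = PgQuery.parse("SELECT JSON_OBJECT('key1': 1, 'key2': 'val')")
puts q.deparse
# => "SELECT JSON_OBJECT('key1': 1, 'key2': 'val')"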

We can use the pg_query deparser to write the equivalent of a codemod for SQL statements that rewrites the legacy syntax into the more standard SQL/JSON syntax.

For example, imagine we have many places where we build JSON objects manually in SQL using the “json_build_object” function, and want to replace that with the new JSON_OBJECT syntax:

q = PgQuery.parse("SELECT json_build_object('key1', 1, 'key2', 'val');")
q.walk! do |node|
  # Only look at function call nodes
  next unless node.is_a?(PgQuery::Node) && node.node == :func_call
  func_name = node.func_call.funcname[0].string.sval
  if func_name == 'json_build_object'
    # Turn each pair of arguments into a JSON_OBJECT key/value node
    exprs = node.func_call.args.each_slice(2).map do |key, value|
      PgQuery::Node.from(
        PgQuery::JsonKeyValue.new(
          key: key,
          value: PgQuery::JsonValueExpr.new(raw_expr: value)
        )
      )
    end
    # Replace the function call with the JSON_OBJECT constructor
    node.inner = PgQuery::JsonObjectConstructor.new(exprs: exprs)
  end
end
q.deparse
# => "SELECT JSON_OBJECT('key1': 1, 'key2': 'val')"

With each release, we test the pg_query deparser for completeness against the full set of Postgres regression tests, so whether it’s SQL/JSON or other new syntax, you can rest assured that pg_query supports it.

Alternate parse modes to work with PL/pgSQL expressions

Since Postgres 14, PL/pgSQL expressions are parsed through the regular “raw_parser” functionality, by passing a special mode flag that allows for PL/pgSQL-specific syntax.

We didn’t support this in pg_query before, but thanks to a contribution by Landan Cheruka, there is now a way to parse PL/pgSQL expressions directly with pg_query.

Let’s first utilize parse_plpgsql to parse a function definition, with the example taken from the Postgres documentation:

CREATE OR REPLACE FUNCTION cs_fmt_browser_version(v_name varchar,
                                                  v_version varchar)
RETURNS varchar AS $$
BEGIN
  IF v_version IS NULL THEN
    RETURN v_name;
  END IF;
  RETURN v_name || '/' || v_version;
END;
$$ LANGUAGE plpgsql;
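
In Ruby, assuming the statement above is stored in a sql string, a minimal sketch of the call looks like this (the Ruby binding's parse_plpgsql returns the PL/pgSQL parse tree as JSON text):

require 'pg_query'

# sql contains the CREATE FUNCTION statement shown above
puts PgQuery.parse_plpgsql(sql)

This produces the following tree (abbreviated):
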
{
  "PLpgSQL_function": {
    "datums": [
      { "PLpgSQL_var": { "refname": "v_name", "datatype": { "PLpgSQL_type": { "typname": "UNKNOWN" } } } },
      { "PLpgSQL_var": { "refname": "v_version", "datatype": { "PLpgSQL_type": { "typname": "UNKNOWN" } } } },
      { "PLpgSQL_var": { "refname": "found", "datatype": { "PLpgSQL_type": { "typname": "UNKNOWN" } } } }
    ],
    "action": {
      "PLpgSQL_stmt_block": {
        "body": [
          {
            "PLpgSQL_stmt_if": {
              "cond": {
                "PLpgSQL_expr": { "query": "v_version IS NULL", "parseMode": 2 }
              },
              "then_body": [
                {
                  "PLpgSQL_stmt_return": {
                    "expr": {
                      "PLpgSQL_expr": { "query": "v_name", "parseMode": 2 }
                    }
...
            "PLpgSQL_stmt_return": {
              "expr": {
                "PLpgSQL_expr": { "query": "v_name || '/' || v_version", "parseMode": 2 }
...

In this function parse tree, you can see the different PLpgSQL_expr expressions, but the actual expression is just text. We can now use the new pg_query_parse_opts function to turn that text into a parse tree:

#include <pg_query.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  PgQueryParseResult result;

  result = pg_query_parse_opts("v_name || '/' || v_version", PG_QUERY_PARSE_PLPGSQL_EXPR);

  if (result.error) {
    printf("error: %s at %d\n", result.error->message, result.error->cursorpos);
  } else {
    printf("%s\n", result.parse_tree);
  }

  pg_query_free_parse_result(result);

  return 0;
}

And that gives us a regular parse tree to work with:

{
  "version": 160001,
  "stmts": [
    {
      "stmt": {
        "SelectStmt": {
          "targetList": [
            {
              "ResTarget": {
                "val": {
                  "A_Expr": {
                    "kind": "AEXPR_OP",
...
We’re still in the process of updating language bindings to support optionally using these parse modes, and would be curious to hear about more use cases for working with PL/pgSQL and pg_query.

A shout-out to the community

pg_query wouldn’t be the same without the community!

We want to expressly call out the community contributions that made this release possible, such as Landan Cheruka's work on the new PL/pgSQL parse modes, and the maintainers of community bindings like those for Node.js and Python.

Looking ahead, we hope to continue conversations with the Postgres community on how we could upstream parts of pg_query, so a query parsing library could be provided directly as a core part of Postgres.

In conclusion

We’re excited about the new pg_query version, and we’re always happy to hear about new use cases you find for using it to work with Postgres queries. If you have ideas on how pg_query could be better, feel free to open an issue on GitHub.

And if you’ve benefited from pg_query in the past, and have not yet tried out pganalyze to optimize your Postgres performance, you can try out pganalyze with our free 14-day trial.

