Introducing pg_query: Parse PostgreSQL queries in Ruby

In this article we'll take a look at the new pg_query Ruby library.

pg_query is a Ruby library I wrote to help you parse SQL queries and work with the PostgreSQL parse tree. We use this extension inside pganalyze to provide contextual information for each query and find columns which might need an index.

At the end of this article you'll also find monitor.rb - a ready-to-use example that filters pg_stat_statements output and restricts it to only show a specific table.

Existing Solutions to Parse SQL Queries

xckd comic on regular expressions

After a longer period of research on this problem, we've come to a few realizations:

  • Obviously, using regular expressions for parsing any complex language is a bad idea.

  • None of the existing parsers work really well, or are maintained. For example sqlparse is focused on re-indenting and beautifying SQL - not for actually working with the query.

  • Writing and maintaining our own SQL parser is a bad idea. SQL is complex, even for simple things like SELECT. And don't get me started on Common Table Expressions, sub-queries and other fun features.

Our conclusion: The only way to correctly parse all valid SQL queries that PostgreSQL understands, now and in the future, is to use PostgreSQL itself.

And in general, PostgreSQL turns out to have a pretty good SQL parser - other SQL databases even use it as a reference implementation.

So we've pretty much determined that we wanted to use the PostgreSQL parser itself - but how do we access it?

Accessing the PostgreSQL Parser

Lets get the PostgreSQL server source, go down the rabbit hole and find what we need:

/*
 * raw_parser
 * Given a query in string form, do lexical
 * and grammatical analysis.
 *
 * Returns a list of raw (un-analyzed) parse trees.
 */
List *
raw_parser(const char *str)
{
	...
}

This is the C function that takes a query and returns a parse tree as C structs.

Luckily this function is fairly independent, it does not need pg_catalog access (tables, indices, statistics, etc) since it runs before the query is rewritten, planned and executed:

Diagram of query execution flow in Postgres

Unfortunately raw_parser(...) is not exposed or included in any of the PostgreSQL libraries - and its quite difficult to extract the parser from PostgreSQL without taking a whole lot of other code with you.

The pgpool project has actually done this, but they do need to update that code for every new major release. We've therefore turned to a slightly different approach:

We use the PostgreSQL server code directly - by statically linking the code into our own shared library. Through a bit of linking magic, we simply call the internal parser functions, and expose that function through a Ruby interface, to be used like this:

require 'pg_query'

pp PgQuery.parse("SELECT 1")
#<PgQuery:0x007f8cdaa8f8b8
 @parsetree=
  [{"SELECT"=>
     {"distinctClause"=>nil,
      "intoClause"=>nil,
      "targetList"=>
       [{"RESTARGET"=>
          {"name"=>nil,
           "indirection"=>nil,
           "val"=>{"A_CONST"=>{"val"=>1, "location"=>7}},
           "location"=>7}}],
      "fromClause"=>nil,
      "whereClause"=>nil,
      "groupClause"=>nil,
      "havingClause"=>nil,
      "windowClause"=>nil,
      "valuesLists"=>nil,
      "sortClause"=>nil,
      "limitOffset"=>nil,
      "limitCount"=>nil,
      "lockingClause"=>nil,
      "withClause"=>nil,
      "op"=>0,
      "all"=>false,
      "larg"=>nil,
      "rarg"=>nil}}],
 @query="SELECT 1",
 @warnings=[]>

The result is a PostgreSQL parse tree as used by PostgreSQL internally.

Parsing Normalized Queries

Now, to the interesting part. Assume we collect pg_stat_statements queries like this one:

SELECT "users".* FROM "users" WHERE "users"."id" = ?

Note that the actual value has been replaced by the ? character. Unfortunately, the PostgreSQL parser can't parse queries normalized in this manner. It would simply return a syntax error.

At first, we simply replaced all occurences of ? with $0 (a parameter reference) before parsing, so that the query can be parsed correctly.

There are however a few problems with that kind of "dumb" string replacement - most prominentely: We're breaking all operators containing ?, like for example those for JSONB in 9.4.

Our improved solution to this: We've patched the PostgreSQL parser to support ? as a parameter reference (identical with $0).

require 'pg_query'

pp PgQuery.parse("SELECT * FROM x WHERE y = ?")
#<PgQuery:0x007f8cdaaaae10
 @parsetree=
  [{"SELECT"=>
     {"distinctClause"=>nil,
      "intoClause"=>nil,
      "targetList"=>
       [{"RESTARGET"=>
          {"name"=>nil,
           "indirection"=>nil,
           "val"=>{"COLUMNREF"=>{"fields"=>[{"A_STAR"=>{}}], "location"=>7}},
           "location"=>7}}],
      "fromClause"=>
       [{"RANGEVAR"=>
          {"schemaname"=>nil,
           "relname"=>"x",
           "inhOpt"=>2,
           "relpersistence"=>"p",
           "alias"=>nil,
           "location"=>14}}],
      "whereClause"=>
       {"AEXPR"=>
         {"name"=>["="],
          "lexpr"=>{"COLUMNREF"=>{"fields"=>["y"], "location"=>22}},
          "rexpr"=>{"PARAMREF"=>{"number"=>0, "location"=>26}},
          "location"=>24}},
      "groupClause"=>nil,
      "havingClause"=>nil,
      "windowClause"=>nil,
      "valuesLists"=>nil,
      "sortClause"=>nil,
      "limitOffset"=>nil,
      "limitCount"=>nil,
      "lockingClause"=>nil,
      "withClause"=>nil,
      "op"=>0,
      "all"=>false,
      "larg"=>nil,
      "rarg"=>nil}}],
 @query="SELECT * FROM x WHERE y = ?",
 @warnings=[]>

Unfortunately, right now, this parser change limits the usage of ? in operators to those in core - specifically JSONB and gemetric operators. If you use third-party extensions or custom operators that contain ?, pg_query likely won't be able to parse those queries.

The Result

As a proof of concept, I wrote monitor.rb, a Ruby script that shows the current information stored inside pg_stat_statements in a top-like manner, filtered by a specific table:

monitor.rb -d sampledb -t users
AVG     | QUERY
--------------------------------------------------------------------------------
1.5ms   | SELECT "users".* FROM "users"
0.1ms   | SELECT "users".* FROM "users" WHERE "users"."id" = ? ORDER BY "users"."id" ASC LIMIT ?
0.1ms   | UPDATE "users" SET "fullname" = $1, "updated_at" = $2 WHERE "users"."id" = ?
0.0ms   | SELECT "users".* FROM "users" WHERE "users"."id" = $1 LIMIT 1

This could be easily extended to highlight queries accessing large tables, potentially missing indices, etc.

Going Forward

As you can see, PostgreSQL parse trees are quite useful - and there are many more analysis/grouping options that could be explored.

If you enjoyed reading this, please give pg_query a try. Simply install it using:

gem install pg_query

During installation of the library a full PostgreSQL server is compiled, so it might take 5-10 minutes. Using a gem cache is advised for deployment.

Interested in support for other languages? Drop me a line and I'd love to chat how we can add support for Python, Perl, you name it.

Furthermore, we'll try to get some of our patches upstream for PostgreSQL 9.5 - this specifically relates to our changes in outfuncs.c, supporting additional query nodes and JSON output. Your help and feedback is appreciated.

And of course, if you build something cool with this, let us know! :)


Enjoy blog posts like this?

Get them once a month to your inbox