5mins of Postgres E36: Optimizing Postgres query throughput in Python using pipeline mode and psycopg 3.1
Today, we're going to talk about the new psycopg driver for Python. We're also going to talk about pipeline mode in psycopg and how it works.
Let's dive in!
First of all, I want to give a quick shout-out to Daniele Varrazzo, who has been maintaining psycopg and started the new psycopg 3 project that aims to bring a modern Postgres driver to Python. He recently released psycopg 3.1. There are a couple of interesting announcements in the release, including support for CockroachDB and better cursor support. But today I want to focus on pipeline mode.
Pipeline mode is a contribution to psycopg by Denis Laxalde from the Dalibo team. In this blog post, he describes why pipeline mode matters, the performance improvements it brings, and how you can use it in Python.
If you are not familiar with pipeline mode in Postgres, we talked about it in Episode 32 of 5mins of Postgres a couple of weeks ago. The main difference between pipeline mode and regular query execution is that pipeline mode sends multiple queries to the server without waiting for their results.
The main benefit of this is that you're saving on network latency, which quickly adds up if you're in the cloud and talking across availability zones. If you have a really fast query, that query itself might execute in under one millisecond, but the network latency could be as much as two or three milliseconds!
The main benefit of pipeline mode is that you're saving on network latency.
In psycopg, with these recent changes, we simply have to call "connection.pipeline". Statements executed inside this block then run in pipeline mode instead of as regular queries.
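As a minimal sketch of that block form (the table name, column names, and connection string here are my assumptions for illustration, not from the post):

```python
def insert_rows(conn, rows):
    # Inside the pipeline() block, each execute() is queued and sent to
    # the server without waiting for its result; psycopg syncs the
    # pipeline once when the block exits.
    with conn.pipeline():
        with conn.cursor() as cur:
            for n, fact in rows:
                cur.execute(
                    "INSERT INTO factorials (n, fact) VALUES (%s, %s)",
                    (n, fact),
                )

if __name__ == "__main__":
    import psycopg  # requires psycopg 3.1+ and a reachable database

    with psycopg.connect("dbname=demo") as conn:  # hypothetical DSN
        insert_rows(conn, [(1, 1), (2, 2), (3, 6)])
        conn.commit()
```

The application code barely changes compared to regular execution; the context manager is what switches the connection into pipeline mode.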
The example we see in Denis’ blog post is using a table and using inserts. Pipeline mode isn't limited to doing writes, you can also use it with reads, but oftentimes writes are going to be more independent. You can't say: send this query, and then when I have the result, do this other thing. The idea behind pipeline mode is that you have independent statements.
The idea behind pipeline mode is that you have independent statements.
Looking at another example in the blog post: for each value that gets inserted we take the factorial. If you had all the data already available, you should just do a multi-row insert, that's going to be much more efficient. But, sometimes you have data that you're calculating as you're inserting, and so having to calculate everything upfront would be quite expensive. You want to have that step-by-step process that pipeline mode enables.
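A sketch of that step-by-step shape in plain Python (the generator name and the commented-out insert loop are mine, not from the post): each row's factorial is derived from the previous row's, so the full data set never exists upfront.

```python
def running_factorials(n):
    # Yield (k, k!) one row at a time; each value builds on the previous
    # one, which is why computing everything upfront would be wasteful.
    fact = 1
    for k in range(1, n + 1):
        fact *= k
        yield k, fact

# Each generated row would then feed one pipelined INSERT, roughly:
#   with conn.pipeline():
#       for row in running_factorials(10000):
#           cur.execute("INSERT INTO factorials VALUES (%s, %s)", row)
```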
In Denis’ example program, there are two modes: pipeline mode and regular mode. In both cases we're computing the factorial of a certain number and then executing an INSERT statement into the database.
Before we jump into how that compares performance-wise, a quick note on the background of how this works. We can look at the actual communication between a client and a server using the trace functionality in libpq. libpq is the underlying C library that most Postgres drivers, including psycopg, use to talk to the database. There is a C function, PQtrace, that psycopg can use. When you enable tracing, it will give you an output, and in this output you'll see the notion of
- "F" meaning frontend, or you, the client, and
- "B" meaning backend or the server.
For example, in one case in Denis’ article, we look at the regular query protocol interaction without pipelining. There, the frontend says "Sync", then it waits for the server to respond, and then the client sends additional data.
The idea is that you have this challenge-response interaction between the frontend and the backend. In psycopg, we can just use the trace function to enable this tracing. This lets us compare what this output looks like between pipelined and regular runs.
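A hedged sketch of enabling that tracing (psycopg exposes the raw libpq connection object as conn.pgconn; the helper name and connection string are my own, for illustration):

```python
import sys

def enable_protocol_trace(conn, stream=sys.stderr):
    # conn.pgconn is psycopg's handle on the underlying libpq PGconn;
    # its trace() corresponds to the C-level PQtrace() function.
    conn.pgconn.trace(stream.fileno())

if __name__ == "__main__":
    import psycopg  # assumes psycopg 3.1+ and a reachable database

    with psycopg.connect("dbname=demo") as conn:  # hypothetical DSN
        enable_protocol_trace(conn)
        conn.execute("SELECT 1")  # the F/B message exchange prints to stderr
```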
Without pipelining, you
- first have a "Parse" message that parses the query without actual values
- then have the "Sync" message,
- then the server tells you "Yup, I parsed this query, I'm ready for your data".
- then you send the server a "Bind" message,
- then you tell the server to execute the statement with that data.
- the server says, "Okay, got it"
- there's no result to be returned because it's just a simple insert
- the server then tells you "Command complete, I executed this command".
- then you send the next INSERT statement, the next data.
As you can see, there is a lot of waiting for the server to return data back to the client, and this obviously adds up.
If we have pipeline mode, this looks much more efficient. We have the same messages pretty much, but then, instead of doing all the "Sync" and all this waiting, we do "Parse" and we right away do a "Bind". We then execute and then do another "Bind" and execute, and we keep doing that.
Now, at some point, the server is going to say: oh, my output buffers are full, I need to start returning you some data, even though you didn't wait for it. The server will start returning you data and tell you "CommandComplete", and so you could actually work with these results.
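For reads, a sketch of how you could work with those results (function name, query, and structure are illustrative assumptions): each cursor keeps its queued query, and fetching after the pipeline block works because psycopg syncs the pipeline when the block exits.

```python
def pipelined_doubles(conn, values):
    # Queue one independent SELECT per value; nothing is fetched until
    # the pipeline block exits and the server has flushed its results.
    cursors = []
    with conn.pipeline():
        for v in values:
            cur = conn.cursor()
            cur.execute("SELECT %s * 2", (v,))
            cursors.append(cur)
    return [cur.fetchone()[0] for cur in cursors]
```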
What Denis observed was that, even on localhost, where there is no network latency, just the interprocess overhead added up enough that running with pipelining made a 2x difference, twice as fast! You can imagine that for fast queries, this could be a 10x difference in the cloud.
It's a huge difference for some workloads.
What Denis observed was that, even on localhost [...] this made a 2x difference, twice as fast, when running it with pipelining!
As Denis notes here, this required months of work and a lot of thinking and testing. I'm really thankful that folks are out there working on these hard problems for Postgres client drivers so that ultimately we can have faster applications.