5mins of Postgres E32: Benchmarking and improving query network latency using tc, COPY and pipeline mode
Today, we're going to talk about network latency, how it impacts your query performance, and how to improve it using pipeline mode or COPY in Postgres. We also take a look at traffic control with the Linux tc tool.
Let's dive in!
In this blog post, Lawrence Jones describes an issue he ran into a couple of years ago while working at a company called GoCardless. They were migrating from IBM Softlayer to Google Cloud, and as part of this migration they also had to migrate their database. Of course, when you're a payments company it's very important to have high uptime, and any migration of that magnitude is hard to do, and hard to do safely.
During the migration, part of the application would run on Google Cloud while the database was still running on IBM Softlayer. That, of course, has a lot of implications for latency. They measured around 0.5 milliseconds when accessing the database from a local application server in the same Softlayer data center. Going from GCP to IBM Softlayer instead took about 10 milliseconds, which is a lot slower. This is a challenge.
This is really a problem with what Lawrence calls a “chatty ORM”. Ruby on Rails, in this particular case, likes to treat the database as if it can just make function calls. Like the database is just sitting next to it being able to answer questions at any time.
They were trying to determine how they could tolerate this higher latency, at least for some time. They ended up testing with a Linux tool called tc, short for traffic control. tc can add artificial latency to network packets, so you can simulate a slow network like the one they would have when going across clouds, and they simulated that on the Postgres server itself.
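As a rough sketch, this is how the netem qdisc in tc can inject that kind of delay. The interface name eth0 is a placeholder, and the commands need root:

```shell
# Add 1 ms of delay to all outgoing packets on eth0 (placeholder
# interface name) to simulate a slower network path. Requires root.
tc qdisc add dev eth0 root netem delay 1ms

# Increase the delay in place, e.g. as a week-long test progresses:
tc qdisc change dev eth0 root netem delay 5ms

# Inspect the currently configured qdisc:
tc qdisc show dev eth0

# Remove the artificial latency when done:
tc qdisc del dev eth0 root netem
```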
Over a period of a week, they kept adding one millisecond of latency at a time until they got to 10 milliseconds, watching to see what falls over. And of course they saw various issues happening: long transactions causing problems, certain requests taking too long.
This is a really good practice if you're trying to simulate what would happen if you didn't have the application and database sitting on the same network.
If you want to know just the raw timing, one useful tool in psql, the Postgres client, is \timing. \timing measures time as seen by the client, not as seen by the server. You turn it on, run a query, and it reports how long that query took from the perspective of psql.
Notably, as described in this mailing list thread from about 10 years ago, \timing shows you the total time it took to execute a query and fetch the results back to the client. EXPLAIN ANALYZE, by contrast, also shows an execution time, but that number is output by the server: the server summarizes how long the query took, so EXPLAIN ANALYZE will not include the round-trip time. If you're doing performance benchmarking, be sure you know which of the two you want.
If you want the network latency included, then \timing is good. If you don't want network latency included, then use some method of collecting the server-side time.
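Here's what the client-side measurement looks like in a psql session, using a hypothetical orders table (the timing value shown is purely illustrative):

```
postgres=# \timing
Timing is on.
postgres=# SELECT count(*) FROM orders;
 count
-------
 50000
(1 row)

Time: 12.618 ms
```

The Time line is measured by psql itself, so it includes the network round trip. The Execution Time that EXPLAIN ANALYZE reports for the same query would typically be lower, because the server measures it before the results travel back to the client.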
Let's say you want to improve how your application handles network latency. The first thing to know: doing lots of individual INSERT statements can be really bad for performance. Say you're inserting new data in a loop. With individual statements, the client has to wait for each query to complete, paying a round trip as one query executes after another. That takes a long time, but, really, you're just waiting for the network.
The alternative to this is the COPY command, where you continuously send data to the server without waiting for per-row results. It's much more efficient from that perspective, and other aspects of COPY benefit as well. In this example, COPY was more than 10 times faster than individual INSERT statements.
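To make the difference concrete, here's a sketch in psql, assuming a hypothetical items table and data file (the names are placeholders, not from the original benchmark):

```sql
-- Hypothetical target table
CREATE TABLE items (id int, payload text);

-- Row-by-row inserts: one network round trip per statement, so at
-- 10 ms of latency, 1,000 rows spend ~10 seconds just on the network.
INSERT INTO items VALUES (1, 'first');
INSERT INTO items VALUES (2, 'second');
-- ... one statement per row

-- COPY streams all rows in a single command, without waiting for a
-- per-row response. From psql, \copy reads a client-side file:
\copy items FROM 'items.csv' WITH (FORMAT csv)
```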
The other thing to know about is pipeline mode in libpq. This was added in Postgres 14, on the client library side: you don't have to run Postgres 14 on the server, but you do have to make sure you use a new enough client library. If you use libpq 14 or newer, and your programming language's driver (which usually wraps libpq) supports it as well, then you can use pipeline mode.
Here's a useful blog post from Cybertec from two years ago describing how the client sends commands. Without pipeline mode, you send a query, the result comes back, and only then do you send the next query. With pipeline mode, you send the next query before the previous result comes back. There's a continuous stream of queries being sent, and you wait for the results in order. You do have to be smart on the client side, but it allows you to queue up multiple commands to be processed. This is something you can use to work around higher network latency, or just to increase system throughput in general.
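If you want to try the effect without writing libpq code directly, pgbench (which is built on libpq) gained \startpipeline and \endpipeline meta-commands in Postgres 14 for exactly this. A sketch, assuming a hypothetical table t with an integer column n and a database named mydb:

```shell
# Hypothetical pgbench script: three INSERTs are queued up and sent
# as one pipeline, paying roughly one round trip instead of three.
cat > pipeline.sql <<'EOF'
\startpipeline
INSERT INTO t (n) VALUES (:client_id);
INSERT INTO t (n) VALUES (:client_id);
INSERT INTO t (n) VALUES (:client_id);
\endpipeline
EOF

# Run 100 transactions per client against the database, skipping
# pgbench's default vacuum step (-n) since we use a custom script:
pgbench -n -f pipeline.sql -t 100 mydb
```

Comparing the runtime with and without the \startpipeline/\endpipeline lines, under an artificial tc delay as described above, is a quick way to see how much of your workload is pure round-trip waiting.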