The different trade-offs of Distributed Postgres architectures
Happy new year! We hope you are doing well and are excited to be back with our “5mins of Postgres” video series. In E97 we're going to talk about distributed Postgres architectures.
This is a blog post by Marco Slot on the Crunchy Data blog. For context, Marco is a principal author of the Citus extension for Postgres and he recently joined Crunchy Data to continue working on distributed Postgres.
Of course, ideally, we'd like to have all the benefits with none of the downsides, but with distributed systems one of the most important downsides, besides the cost of running extra hardware, is going to be latency. The best latency you'll get is if you don't distribute your system at all. Once you start distributing your system, you essentially step up a ladder of how much latency is acceptable to you. This can be quite harsh for an application using Postgres, because most applications work with Postgres in a sequential way, and Postgres itself often works sequentially with the disk it accesses.
What this means is: the more latency you have between your client and Postgres, between Postgres and another Postgres it needs to talk to, or between Postgres and its disk, the less throughput you will have, the worse your application will perform, and the more frustrated your users will be. If we build distributed architectures, we should be very careful about which latency is acceptable to us in which situations.
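To make the latency point concrete, here's a back-of-envelope sketch in Python. The numbers are illustrative assumptions, not measurements; the point is that a strictly sequential client can never do better than one query per round trip:

```python
# Back-of-envelope sketch: a client that runs queries one at a time
# can never exceed 1 / round_trip_latency queries per second.

def max_sequential_qps(round_trip_ms: float) -> float:
    """Upper bound on queries/sec for a strictly sequential client."""
    return 1000.0 / round_trip_ms

# Same application, different deployment topologies (illustrative numbers):
for label, latency_ms in [
    ("same host",    0.05),  # e.g. local Unix socket
    ("same AZ",      0.5),   # typical intra-AZ network hop
    ("cross-region", 30.0),  # e.g. across an ocean
]:
    print(f"{label:>12}: <= {max_sequential_qps(latency_ms):>8.0f} queries/sec")
```

Going from a same-AZ connection to a cross-region one cuts the sequential ceiling by roughly 60x, which is why each step up the distribution ladder deserves scrutiny.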
Marco starts by talking about the most basic form of distributed Postgres.
Most Postgres servers in the cloud are distributed, because under the hood they use the cloud providers' block storage APIs to store the data. In that case, you actually do keep multiple copies of the data on different storage nodes. This means you are already paying a latency price, because you're saying: “I want multiple copies, I want that network-attached storage”. You are already seeing the downside there: on a regular NVMe drive you'll quite easily get over a million IOPS and really low disk latency, in the tens of microseconds.
But once you go to network-attached storage like EBS, you get tens of thousands of IOPS and much higher disk latency. Still, this is something that basically anybody in the cloud relies on in some shape or form, because they want the benefit of being able to take snapshots. They want the benefit that if the database server goes down, they can easily start up another VM from the same block storage. This is a trade-off that many people accept.
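A rough sketch of what that latency gap means for dependent I/Os, where each read must finish before the next one is issued. The latencies are order-of-magnitude assumptions, not vendor benchmarks:

```python
# Illustrative comparison of local NVMe vs. network-attached block storage.
# Numbers are rough orders of magnitude, not measured values.

def time_for_dependent_reads(n_reads: int, latency_us: float) -> float:
    """Seconds to perform n dependent (one-at-a-time) random reads."""
    return n_reads * latency_us / 1_000_000

n = 100_000                                            # e.g. 100k page reads
nvme_s = time_for_dependent_reads(n, latency_us=20)    # tens of microseconds
ebs_s  = time_for_dependent_reads(n, latency_us=1000)  # around a millisecond

print(f"local NVMe: {nvme_s:.1f}s, network storage: {ebs_s:.1f}s")
# Dependent I/Os pay the full latency every time -- hence the interest
# in local caching and in issuing I/Os concurrently, discussed next.
```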
Now, there are two things worth mentioning in this context. First of all, some of the cloud providers, like Amazon, have started doing what they call optimized reads. Optimized reads make use of the NVMe drives that are attached to the servers and use them for caching. That means, in the cases where the data is already on that local NVMe drive, reads are going to be much faster than going to the EBS block storage. That's one way you can get the best of both worlds, at least in some cases.
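Conceptually, this is a read-through cache in front of slower storage. A minimal sketch, with hypothetical names (the dictionaries stand in for the storage tiers; this is not how any provider actually implements it):

```python
# Read-through cache sketch, loosely analogous to "optimized reads":
# serve a block from the fast local tier when present, otherwise fetch
# it from the slower remote tier and cache it for next time.

class CachedStorage:
    def __init__(self, remote: dict):
        self.remote = remote      # stands in for EBS-style block storage
        self.local = {}           # stands in for the local NVMe cache
        self.hits = self.misses = 0

    def read_block(self, block_no: int) -> bytes:
        if block_no in self.local:    # fast path: local NVMe
            self.hits += 1
            return self.local[block_no]
        self.misses += 1              # slow path: network-attached storage
        data = self.remote[block_no]
        self.local[block_no] = data   # populate cache for next time
        return data

store = CachedStorage(remote={1: b"page-1", 2: b"page-2"})
store.read_block(1)   # miss: goes to remote storage
store.read_block(1)   # hit: served from the local tier
print(store.hits, store.misses)  # -> 1 1
```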
The other improvement is that there is a lot of work going on in the Postgres community around asynchronous I/O. The benefit of asynchronous I/O is that it increases the concurrency of our I/O requests. This means we're able to do more things in parallel, and that way we can make better use of typical block storage in the cloud.
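The effect of concurrency on latency-bound I/O can be sketched with plain asyncio, where `asyncio.sleep` stands in for the I/O wait (this illustrates the general principle, not Postgres's actual AIO implementation):

```python
# Why concurrent I/O helps: eight dependent reads pay eight latencies;
# eight concurrent reads pay roughly one. asyncio.sleep simulates I/O.

import asyncio
import time

LATENCY = 0.05  # 50 ms per simulated I/O

async def read_page(n: int) -> int:
    await asyncio.sleep(LATENCY)   # simulated I/O wait
    return n

async def sequential() -> list[int]:
    # One read at a time: ~8 x 50 ms total.
    return [await read_page(i) for i in range(8)]

async def concurrent() -> list[int]:
    # All reads in flight at once: ~50 ms total.
    return await asyncio.gather(*(read_page(i) for i in range(8)))

t0 = time.perf_counter(); asyncio.run(sequential()); seq = time.perf_counter() - t0
t0 = time.perf_counter(); asyncio.run(concurrent()); con = time.perf_counter() - t0
print(f"sequential: {seq:.2f}s, concurrent: {con:.2f}s")
```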
Another component that most people who use Postgres at scale rely on is read replicas. There's not much sense in diving too deeply into this, but let's just say: if you can solve your problems with read replication, that's the best way to do it. Don't worry about distributing Postgres in any other way. If your bottleneck is reads, read replicas are the way to go. They are the easiest way to scale out Postgres and keep it simple, because you still have a single writer.
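The single-writer idea usually shows up in the application as read/write splitting. A minimal sketch with hypothetical stand-ins for real connections (a production router would also need to handle replication lag and read-your-writes consistency):

```python
# Read/write splitting sketch: writes go to the primary, reads go to a
# replica. The callables below are stand-ins for real database connections.

import random

class Router:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def execute(self, sql: str):
        is_read = sql.lstrip().upper().startswith("SELECT")
        target = random.choice(self.replicas) if is_read else self.primary
        return target(sql)   # writes always hit the single writer

log = []
router = Router(primary=lambda q: log.append(("primary", q)),
                replicas=[lambda q: log.append(("replica", q))])
router.execute("INSERT INTO users VALUES (1)")
router.execute("SELECT * FROM users")
# The INSERT went to the primary; the SELECT went to a replica.
```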
Now let's talk about a few special cases. Marco calls this DBMS-optimized cloud storage, but really what he's referring to here is what Aurora, AlloyDB, or Neon are doing: a more cloud-native storage architecture that is aware of the fact that the database has two streams of I/O, the WAL stream and the data directory. They rip out the part of Postgres that is old school and introduce something better suited to the cloud.
I think this is quite appealing. If you care about availability and resiliency, this is one of the best ways to get that, with some impact on your latency, though not necessarily that much. And they do other optimizations to make it work well.
Now, it is important to note that all of these are Postgres forks. You can't do this with a Postgres extension. This is really replacing an important part of Postgres. So it's not standard Postgres, but it is something that will work well from a compatibility perspective for your application.
Now, active-active. What does this mean? I have two nodes and I can write to each of them. They talk to each other: if I write to one, it sends the writes to the other. So I don't have to worry about reliability issues, because I always have two (or more) writers. And of course the problem is that most people would like that architecture, but they don't really think about how they need to design their application to make use of it.
With Postgres 16, you can actually use logical replication to build a DIY active-active setup, because logical replication can now distinguish between a write coming from the other node and a write originating on the node itself. But still, think about the problems, think about your UPDATE statements conflicting with each other. Today, conflict resolution in Postgres is practically nonexistent. It only works if your data doesn't overlap between those two nodes. I would say most of the time: don't do active-active, because it's going to be very painful and your application engineers won't be happy.
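To see why conflicting writes are painful, here is a sketch of the problem in Python. Two nodes accept a write to the same row, then replicate to each other; a common (and lossy) resolution rule, used here for illustration, is last-writer-wins by timestamp:

```python
# Active-active conflict sketch: two nodes update the same row before
# replication catches up. Last-writer-wins makes them converge, but one
# write is silently discarded.

def last_writer_wins(local, incoming):
    """Keep the version with the later timestamp; versions are (ts, value)."""
    return incoming if incoming[0] > local[0] else local

# Concurrent updates to the same row on two different nodes:
node_a = {"balance": (100, "written on A")}   # (timestamp, value)
node_b = {"balance": (101, "written on B")}

# Each node applies the other's replicated write through the same rule...
node_a["balance"] = last_writer_wins(node_a["balance"], node_b["balance"])
node_b["balance"] = last_writer_wins(node_b["balance"], node_a["balance"])

print(node_a == node_b)  # -> True: converged, but A's write was silently lost
```

Postgres 16's logical replication gives you the plumbing (filtering out writes by origin), but this resolution step is entirely on you, which is why non-overlapping data sets are the only comfortable case.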
Sharding is what Citus does. The idea is that you're partitioning across servers, splitting up your data between them, while still keeping the regular Postgres storage. If you use the Citus extension, underneath it's still regular Postgres: you could look at any of those nodes and you'd just find Postgres, with tables named with the table name, an underscore, and then a shard identifier.
It's really similar to regular Postgres partitioning in that sense, it just goes across servers. What Citus has done with the extension there is to make that really transparent and easy, so you don't have to think about it. Behind the scenes Citus will come up with a query plan that can run across different systems, pull data across these systems and make it work.
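The routing idea behind this can be sketched in a few lines of Python. This is a conceptual illustration of hash-based sharding, not Citus internals; the node names, shard count, and `events_N` naming are assumptions for the example:

```python
# Conceptual hash-sharding sketch: hash the distribution column, map the
# hash to a shard, and place each shard on some worker node.

import zlib

NODES = ["worker-1", "worker-2", "worker-3"]
SHARD_COUNT = 8  # shards per table; each node hosts several shards

def shard_for(customer_id: int) -> int:
    # Stable hash of the distribution column -> shard number.
    return zlib.crc32(str(customer_id).encode()) % SHARD_COUNT

def route(customer_id: int) -> tuple[str, str]:
    shard = shard_for(customer_id)
    node = NODES[shard % len(NODES)]   # static shard-to-node placement
    return node, f"events_{shard}"     # hypothetical physical table name

node, table = route(customer_id=42)
print(node, table)
# All rows for one customer hash to the same shard, so a query scoped to
# one customer can be answered by a single node.
```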
The biggest downside here, of course, compared to single node Postgres is that there is added latency because you have additional hops to get the data.
In my experience using Citus, if you can keep data on a single node, that's always going to perform better than if you get data from multiple nodes. With the multi-tenant use case, where you have different customers on different servers, if you inform your queries of that fact, if all your queries have a customer ID attached to them, then Citus will perform really, really well and scale really well. But things start getting a little more complicated if you're doing a lot of queries across all the servers; you start getting into connection management issues and such.
But still, this is a very scalable architecture and I've seen many systems at scale work well with this.
Finally, distributed SQL. This is the world of Google Spanner, YugabyteDB, or CockroachDB. The trade-off you make there is that the open-source options modify Postgres quite heavily, if they are even Postgres at all. CockroachDB is Postgres-compatible, but there's nothing of Postgres in CockroachDB. YugabyteDB is Postgres from a source code perspective, but heavily modified. And so you will lag behind major Postgres versions when they're released, because they have to do a lot of work to rebase their engine on top of new Postgres releases.
That said, one thing that I think distributed SQL engines like YugabyteDB can achieve better is that they have a more cloud-native distributed architecture, and they're able to handle resilience better. If one of the nodes goes down, the ability to repair those nodes is a little more built into the system. With systems like Citus, by contrast, there's a little more handholding required around the resilience aspect. But I do think that ultimately both systems can work well for scaling out.
But, if you care about keeping things standard Postgres, then I personally would choose Citus.
Overall, I would say: do your own benchmarks. Figure out what's most important to you, and choose what works well for your particular workload.