How Figma and Notion scaled Postgres
In episode 71 of “5mins of Postgres” we're going to talk about the growing pains of database architecture at Figma as well as alternate architectures, such as the sharding architecture chosen by Notion.
Let's dive in!
This is a blog post by Tim from the Figma engineering team from earlier this year. In this blog post, they describe how Figma, back in 2020, hit some growing pains where they needed to figure out how to continue scaling. Figma's infrastructure, in particular in regards to metadata like permissions, file information and comments, is all built on Postgres.
Back then, they actually had a single large Amazon RDS database that persisted that type of metadata. The problem that they encountered back then was that they observed upwards of 65% CPU utilization during peak traffic. Initially, they were not really sure how to deal with that.
They started looking into different approaches on how to solve this. First of all, back then they still had some headroom in terms of which instance type to use. They were on a very large, very expensive r5.12xlarge, but they decided to doubled that to an r5.24xlarge – just to maximize that CPU utilization runway, because that really was the big issue.
Once they had done that, they looked into creating read replicas to scale the read traffic, as well as establishing new databases for new use cases in order to not keep adding into that same central database. Lastly, they added PgBouncer as a connection pooler. All of these changes are very common and they all, in my experience, make a lot of sense as you're scaling. But in their situation, even though they made these changes, they still had limitations, because they still had everything going to that central database.
They were thinking about different options on how to deal with this. And as you decide how you can scale Postgres, one of the things you consider is if you should be going to something else: Should we be going to a Postgres compatible solution that lets you just run multiple nodes?
Back then in their assessment, none of the options really made sense for their situation. They didn't want to be the first customers that hit certain scaling issues, they didn't want to lose control with managed solutions. So they didn't really feel comfortable going with some of the options out there, like YugabyteDB or CockroachDB.
What they instead ended up doing is they decided to partition the data. Now, what they mean with partitioning here is not the same as local table partitioning on a single node. What they're referring to is that different tables are on different servers and so they split up the type of data on each database server. They compare this with horizontal sharding.
Now, horizontal sharding would be closer to what Notion did a couple of years ago, where they described their lessons learned from sharding Postgres at Notion. Notion has a similar problem where a lot of data, even more than the data in Figma's case, is stored in Postgres.
Garrett and his team at Notion were trying to figure out how to scale that. Now, Notion's approach back then was really well-described in the above linked blog post, but I think the diagram in their article summarizes it well. In Notion's case, they had one monolith and they essentially moved into, back then, 32 physical database servers. What they set up is that the same kind of schema exists in multiple locations, so you have a block, a space, and a comment, and they are on each physical database server.
In Figma's case, they decided to do something else. They decided they did not want to do horizontal sharding. Instead, what they wanted to do is partition the tables based on certain use cases.
They were trying to assess this based on two criteria.
The first was determining the impact. They wanted to make sure that the tables that they're moving off of that big monolith database into other servers were significant portions of the workload. And second, they wanted to make sure these tables are isolated. As you're doing this, one of the biggest challenges is going to be joins or any other kind of dependencies between tables.
Meaning, in this particular case, what they looked for is those two properties so that they would identify tables that are independent, yet still having a big impact. They ended up sampling
pg_stat_activity in 10 millisecond intervals to identify the CPU waits with each query and then aggregated that information by table name. They were trying to discover which queries are impacting which tables and how much of the overall load that would be, and through that understand which table has the most impact.
Understanding how isolated tables are is tricky. What they ended up doing was to add some application side telemetry. This helped them understand which logic in the application used which models together versus which ones were executed more independently.
Once they figured that out they aimed for a solution that has low downtime.
What I found interesting is that they ended up going with a combination of PgBouncer and logical replication. This is a good example of where logical replication can really shine, as it lets you move data from one server to another while staying online. You don't have to move the whole database, you can actually say: “let me pick this table or even a subset of this table in newer Postgres versions, let me logically replicate that to the other server and then do a cut over”. And so they built some tooling around that.
Ultimately, I think this is very impressive. They ended up being able to move most of these high traffic tables and split things up, and usually when they did these moves, they only saw a 30 second downtime on their availability on these tables that they were moving. 30 seconds is not zero, but for something that is essentially a bespoke tooling around a core Postgres feature such as logical replication, I think that is pretty good. And presumably if you iterated on the tooling further, you could reduce down that 30 second period.
Generally speaking, when you consider how you scale Postgres, these two choices, in the case of Figma, partitioning the data vertically on to different servers, or in the case of Notion doing horizontal sharding on the application side are two popular choices that work well. Both of them require application side changes, but I do think that they show how Postgres can scale even for the most demanding workloads.