Roaring Bitmaps and pgfaceting: Fast counting across large datasets in Postgres
In E47 of 5mins of Postgres, we talk about counting and faceting large result sets using Roaring Bitmaps in Postgres. We are looking at the pg_roaringbitmap extension and the pgfaceting extension which help us improve our query performance from 222 seconds to 155 milliseconds.
- Faster Postgres query performance with pgfaceting and Roaring Bitmaps
- Counting in Postgres
- Faceted search in Postgres
- Compressed bitmaps and the pg_roaringbitmap extension
- Use the pgfaceting PostgreSQL extension to quickly calculate facet counts
- Speed up Postgres query performance with pgfaceting
- What we have discussed in this episode of 5mins of Postgres
Let's jump right in!
In this blog post by Ants from the Cybertec team he describes what faceting in Postgres is. In principle, you can think of it as first doing a base selection, looking at a particular subset of your data, then - out of that subset - you're interested in the various attributes and how often they occur.
For example, looking at a subset of my e-commerce store I'd want to know how many scarves there are, how many gloves there are, etc. This can be used to give our users an opportunity to have a sense for which subsection of the store they want to look at.
We often talk about making counting faster in Postgres. But unfortunately, a lot of the techniques are really about counting all of the rows in a table - there are various ways of estimating that. But, they really don't work for a situation like this, where I want to count a subset of a large table, really fast.
Ants describes the different ways of doing this in Postgres to date. The documents table in his blog post has a couple of things like
category_id, and such, and he describes the different ways of using facets.
There are multiple ways to use facets in Postgres
The easiest way in Postgres to work with facets would be to do a
GROUP BY query. Imagine, that I just want to know how many documents of each type exists: I can do something really simple, like selecting a particular type and doing a
COUNT(*) and then doing a
GROUP BY. This is pretty slow.
If you're running separate queries, the easiest fix would be to use
LATERAL. That way you don't have to load the data again and again, but instead you're doing it just once.
What Ants describes here is a new extension that he created which makes this a lot better.
Before we jump into Ants’ new extension I want to give you background on an important thing to know about: Roaring Bitmaps. If you've not heard of Roaring Bitmaps before, they're essentially a compressed bitmap data structure. If you have a bitmap, you essentially have a series of zeros and ones that indicate a certain state. Roaring Bitmaps are a form of loss-less compression (so you don't lose any of the information here), but they are much faster than something simple, like run-length encoding.
If you have a bitmap, you essentially have a series of zeros and ones that indicate a certain state. Roaring Bitmaps are a form of loss-less compression [...] but they are much faster than something simple, like run-length encoding.
Roaring Bitmaps are really cool. They're an amazing data structure that is used widely. In Postgres, you can use them using the pg_roaringbitmap extension, which bundles one of the official C libraries for Roaring Bitmaps.
Ants created the pgfaceting extension, which builds on Roaring Bitmaps and uses it as the underlying store to be able to count things quickly.
To use pgfaceting, you first need to be able to actually install extensions on your server. This means that today, unfortunately, you cannot do this on services like Amazon RDS or AWS Aurora (yet), hopefully that changes in the future. But, this is something that you could run on your own server today. You need both the pg_roaringbitmap extension and the pgfaceting extension.
If you have both of these installed, you can call
faceting.add_faceting_to_table. In Ants’ example, this would be the documents table. Then you can describe which set of the facets you're interested in.
Behind the scenes Roaring Bitmaps are created, which are then used to query that data very efficiently. The other thing to note here, in this current iteration, is that this is not maintained automatically for new data that is coming in. This means you do actually have to merge changes in by running this maintenance function.
In this current iteration, this is not maintained automatically for new data that is coming in. This means you do actually have to merge changes in by running this maintenance function.
Now, I can query with this faceting data. It can do things like getting the top 10 values for each kind of facet, query other subsets of the data, and more.
Let’s have a look at how fast this is, and this is what really impressed me: the example we had earlier of using a
LATERAL query takes 18 seconds, if we use parallel query, which is pretty fast to work on this 100 million row table.
If we turn off parallel query, then it takes much longer: 222 seconds. Both of these are not something I would run in production, in a web application. It's just too slow. I would have to run this ahead of time and then show a cached value in my web application.
Now, with pgfaceting, what's amazing is that this just takes 155 milliseconds.
It only takes 155 milliseconds to get these facet counts on a hundred million row table for 60% of the dataset. This really makes a huge difference. You could actually run this in a web application and have accurate data for each of the facets, and then be able to show that in your store, for example, or use it for other purposes where you need fast counts on a subset of your data.
I think this is a great example of the Postgres extension system giving you a way to use modern data structures, like Roaring Bitmaps, effectively on your own data.