5mins of Postgres E4: Writing your own custom Postgres aggregates with Rust or raw SQL
Today we'll take a look at how to write custom aggregates in Postgres using either Rust or raw SQL.
Share this episode: Click here to post to twitter, or sign up for our newsletter and check out the newsletter archive and subscribe to our YouTube channel.
Let's get started!
The way to think about aggregates in Postgres: they're essentially functions that take input values and then combine that into a resulting value. For example a sum of integers or a JSON aggregate function, which can take individual rows and aggregates them into one resulting JSON object. Conceptually, this also exists in other languages. In Rust itself, there are iterators and those iterators can also help you collect the data into a particular data type to summarize that data. The way that aggregates work in Postgres is actually pretty straight forward. Surprisingly, a lot of people don't know about this. It shows the power of Postgres and how it allows you to customize each aspect of the database, similar to how it supports custom data types (see our articles on custom data types in Rails and Postgres and in Django and Postgres).
In the article, you can see the simplest version of a sum function using raw SQL. First, creating a function that gets executed each time the state needs to be updated. In this mentioned case, we're just adding the value itself to the input value and then we take the last state as the result.
You can run this yourself. Pretty simple.
It is similar to how you would write this in Rust. You have a variable and then inside the loop, you keep adding to a variable. Then you return the value of that variable. Postgres can do more things than just that though. There's also a way to write custom combine functions that let you handle partial aggregates which then let you use parallel processing in Postgres.
The last thing to mention here on aggregates is final functions. Instead of taking the last state value as the result they let you transform that state that was accumulated over time into a particular value. For example, sum up all the texts values in the state and at the end count the number of distinct values.
What’s cool: The pgx mechanism to write extensions in Postgres now has support for writing your own custom aggregate functions in Rust. With pgx you can now write Rust code to create your custom aggregates in Postgres. It's pretty self-explanatory to follow this along in Ana’s blog post. In it, she shows the simplest example of an aggregate function: the state that keeps track of what gets aggregated and then we also have the state function that gets called for each input value.
A lot of the work here is being done for you.
Within 20 lines of Rust code you got your first aggregate. Really fast, really easy to use. You may also be interested to look at ordered-set aggregates: here's an example implementation of percentiles. There's an existing function in Postgres that does this, but it's a great example of how flexible this API is and what you can now do in Rust.
In her article Ana also referenced two other posts, which I think are nice overviews of how to use aggregates.
In this article, Tim McNamara, a couple of years back, had described how to use raw SQL to implement the harmonic mean and geometric mean in Postgres. Specifically, he shows how he came from Python code to run such type of logic. What he wanted to do is move that into the database.
How can we run this type of code on the database side using custom aggregates? Pretty simple, same
CREATE AGGREGATE and then using raw SQL functions to implement these geometric means and harmonic means. What's nice about it is that you can simplify your queries instead of having to do things inline. Much easier to read and in some cases can be optimized better by Postgres.
Last, but not least: this is, I would say, more advanced. Although, if you're writing aggregates in Rust you are probably at a level where you know a lot. What TimescaleDB wrote about last year: how they used aggregation in Postgres to refine their hyperfunctions design. The state function and those finalize functions, they usually are hidden for you behind the scenes Postgres does that and you don't really interact with the details of the aggregate API. What TimescaleDB has done is that they've extracted that into an explicit design pattern.
Maybe a way to think about this is map and reduce. The idea is that you can have a function that simplifies your data into a certain form that is easier to work with, that you can then get a result for. The reason that they separated this: one of the motivations is they're actually able to optimize things better in Postgres. In their example, they can run things separately: you have each element in the target list calculate its own percentile and doing its own aggregation versus if you have this two step process exposed explicitly, then you can tell Postgres that you can reuse that.
In another example, they have an in-between percentile aggregation logic, and then out of that logic they extract the approximate percentiles with P10, P50 and P90 here. They also described later in the article on how they utilize this to calculate things over time. Sometimes you want to keep accumulating something, but you don't want to accumulate the full raw dataset because it's pretty expensive. With their implementation of continuous aggregates using the same Postgres aggregation API, it's possible to keep a smaller dataset over time.
If you want to learn more about aggregates these are really great things to check out. If you want to use Rust in Postgres then pgx is a great way to do that.
Thank you so much. See you next week.