5mins of Postgres E2: Using unnest(..), generate_series(), and PostGIS
Today, we're going to look at a few interesting aspects of how to work with data sets in Postgres. First of all, we'll take a look at using
UNNEST for performing bulk operations on a large number of values. Second, we'll take a look at
generate_series and how to generate sample data if you do not have large amounts of input data. Last, but not least, we'll take a look at a PostGIS community example, where somebody had a particular challenge indexing PostGIS data, and we'll see how a PostGIS core team member helped them resolve the issue.
First, let's start by looking at
UNNEST. If you're not familiar, UNNEST is a Postgres function that takes an input value and returns multiple rows based on it. For example,
UNNEST on an array column returns an individual row for each array element. The author of this article, Forbes Lindesay, chose to use
UNNEST to improve database performance. The idea is that passing a very large number of values into Postgres, as part of an
INSERT statement and its
VALUES clause, is actually pretty expensive, because Postgres has to parse each value separately and creates a separate struct on the Postgres side for each individual entry in that VALUES list.
If you have thousands of input values, then
UNNEST can actually perform noticeably better than a
VALUES clause. In the article, they're passing in just two parameters:
all the email values in the first parameter, and all the favorite colors in the second. They then use unnest to generate the rows, and that has noticeably better performance characteristics. A similar technique works with
UPDATE. They pass in the same two parameters and use them in a sub-SELECT to update everything that matches those rows. That, of course, also works with
DELETEs. Implementing this technique is obviously ORM- and application-dependent, so it may only make sense if you have a particular hot code path that runs into issues today; then it might be worth looking at
UNNEST, or, in some cases, just using an array instead of separate constants. Long IN lists are a typical problem people run into, and Postgres performs better if you send a single array that combines all the values into one value.
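As a rough sketch of what this looks like in practice (the table and column names here are hypothetical, not taken from the article), assuming a users table with email and favorite_color columns:

```sql
-- Bulk INSERT: each parameter is a single array covering all rows,
-- e.g. $1 = ARRAY['a@example.com','b@example.com'], $2 = ARRAY['blue','green']
INSERT INTO users (email, favorite_color)
SELECT * FROM unnest($1::text[], $2::text[]) AS t(email, favorite_color);

-- Bulk UPDATE via a sub-SELECT over the same unnest output
UPDATE users
SET favorite_color = v.favorite_color
FROM unnest($1::text[], $2::text[]) AS v(email, favorite_color)
WHERE users.email = v.email;

-- And instead of a long IN list, a single array parameter:
SELECT * FROM users WHERE email = ANY($1::text[]);
```

Note that multi-argument unnest, which turns several parallel arrays into a multi-column row set, is only allowed in the FROM clause.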
Now, let's say you don't actually have a large data set, but you want to come up with an example for your test cases, or to experiment with how your application behaves with different data.
generate_series in Postgres is a pretty useful function for that. Similar to how
UNNEST works, the
generate_series function that Ryan from TimescaleDB describes in his article can be used as a set-returning function.
In the place in the SELECT where the table name usually goes, you instead put the function call, and you tell the
generate_series function the start point, the end point, and how much distance there is between each point. In Ryan's example, that returns a list of five rows. That's a pretty simple data set to start with. What's nice about it is that it's a short query, and it can get much more sophisticated; Ryan goes into a lot of detail in his article.
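In its simplest form, a generate_series call with a start point, end point, and step looks like this (a minimal sketch, not Ryan's exact example):

```sql
-- Numbers: start 1, stop 9, step 2 → five rows: 1, 3, 5, 7, 9
SELECT * FROM generate_series(1, 9, 2);

-- The same idea works with timestamps and an interval step,
-- one row per day:
SELECT * FROM generate_series(
  '2022-01-01 00:00:00'::timestamptz,
  '2022-01-05 00:00:00'::timestamptz,
  '1 day'::interval
);
```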
For example, if you add a random function and you want something ever-increasing, like web traffic that grows more each day, you can use
generate_series for that too, and it's a really short function definition. It's really useful for creating test data. The same article explores a lot more of what you can do. For example, you can create a wave using cosine functions to get a cyclical pattern, so if you want to emulate week-over-week type data, then
generate_series can help you illustrate that. Really nice article, really detailed. I certainly learned something from it and want to use
generate_series more now.
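To give a flavor of those two patterns, here is a sketch with made-up formulas (these are illustrative shapes, not the article's exact definitions):

```sql
-- Ever-increasing "web traffic" with some noise: the row number drives
-- growth, random() adds jitter
SELECT day,
       round(row_number() OVER (ORDER BY day) * 100 * (1 + random())) AS page_views
FROM generate_series('2022-01-01'::timestamptz,
                     '2022-01-31'::timestamptz,
                     '1 day') AS day;

-- Weekly cyclical pattern using cosine over the day of week
SELECT day,
       1000 + 500 * cos(2 * pi() * extract(dow FROM day) / 7) AS visits
FROM generate_series('2022-01-01'::timestamptz,
                     '2022-01-31'::timestamptz,
                     '1 day') AS day;
```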
Last, but not least: What's nice with Postgres in general and PostGIS as well is that people just ask questions out in the open and then people who work on the actual software respond back.
In this case, somebody had a question about a PostGIS data set they were working with: the ground elevation of Europe.
They had a data set with all the individual elevation points across Europe at two-kilometer intervals, and they wanted to draw a line from A to B and find each elevation point between those points, to build a height profile. They were trying to emulate what you might know from Google Maps, where you enter a destination and it shows you the elevation profile along the way to that destination. Paul Ramsey from the Crunchy team, who is somebody who works a lot on PostGIS, went through and tried to reproduce the problem: with the help of the original poster he loaded the data set into his own database, built a simple table structure with the geography type as a latitude / longitude value, created a GIST index on that latitude / longitude column, and then described a few more intricacies of dealing with geographical data. Do you actually want a straight line, or do you want a line on a sphere? Earth is curved, right? It's not a straight line like on a map.
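Sketched out, the table structure and index look roughly like this (the table and column names are hypothetical, not Paul's exact schema):

```sql
-- One elevation sample roughly every 2 km across Europe
CREATE TABLE elevation (
  pt  geography(Point, 4326),  -- longitude / latitude, measured on a sphere
  ele float8                   -- elevation in meters
);

-- GIST index so spatial operators can search the points efficiently
CREATE INDEX elevation_pt_idx ON elevation USING GIST (pt);
```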
This is a really good article if you also want to understand the basic PostGIS operators, like the nearest-distance operator, for example. Those are the things that then get used in the article to arrive at the result. The final query takes 14 milliseconds. That's certainly a sizable enough data set that the index actually matters for getting fast performance here. The index definition is pretty simple, but what's most important with PostGIS is that you actually understand the different functions and operators that allow you to model your query in a way that uses the index effectively.
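For example, the nearest-distance operator <-> can drive an index-assisted nearest-neighbor search; a minimal sketch against the hypothetical elevation table above (coordinates made up for illustration):

```sql
-- Elevation at the sample point closest to a given location:
-- ORDER BY pt <-> <point> lets the GIST index return rows in
-- distance order, so LIMIT 1 stops after the nearest one
SELECT ele
FROM elevation
ORDER BY pt <-> 'SRID=4326;POINT(8.54 47.37)'::geography
LIMIT 1;
```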
That's it for today. Thank you so much for listening to 5 minutes of Postgres and talk to you next week.