5mins of Postgres E2: Using unnest(..), generate_series(), and PostGIS
Today, we're going to look at a few interesting aspects of how to work with data sets in Postgres. First of all, we'll take a look at using
UNNEST for performing bulk operations on a large number of values. Second, we'll take a look at
generate_series and how to generate sample data if you do not have large amounts of input data. Last, but not least, we'll take a look at a PostGIS community example, where somebody had a particular challenge indexing PostGIS data, and we'll see how a PostGIS core team member helped them resolve the issue.
First, let's start by looking at
UNNEST. If you're not familiar, UNNEST is a Postgres function that takes an input value and returns multiple rows based on it. For example,
UNNEST on an array column returns an individual row for each array element. The author of this article, Forbes Lindesay, chose to use
UNNEST to improve database performance. The idea is that passing a very large number of values into Postgres, as part of an
INSERT statement and its
VALUES clause, is actually pretty expensive, because Postgres has to parse each value separately and creates a separate struct on the Postgres side for each individual entry in that VALUES list.
If you have thousands of input values, then
UNNEST can actually perform noticeably better than a
VALUES clause. In the article, they're passing in just two parameters:
all the email values in the first parameter, and all the favorite colors in the second. They then use unnest to generate the rows, and that has noticeably better performance characteristics. A similar technique works with
UPDATE. They pass in the same two parameters and use them in a sub-SELECT to update everything that matches those rows. That, of course, also works with
DELETEs. Implementing this technique is obviously ORM- and application-dependent, so it may only make sense if you have a particular hot code path that runs into issues today; then it might be worth looking at
UNNEST, or, in some cases, just using an array instead of separate constants. Long IN lists are a typical problem people run into, and Postgres performs better if you send a single array that combines all the values into one value.
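As a rough sketch of what this looks like in practice (the table and column names here are hypothetical, not taken from the article), assuming a users table with email and favorite_color columns:

```sql
-- Bulk INSERT: each parameter is a single array covering all rows,
-- e.g. $1 = ARRAY['a@example.com','b@example.com'], $2 = ARRAY['blue','green']
INSERT INTO users (email, favorite_color)
SELECT * FROM unnest($1::text[], $2::text[]) AS t(email, favorite_color);

-- Bulk UPDATE via a sub-SELECT over the same unnest output
UPDATE users
SET favorite_color = v.favorite_color
FROM unnest($1::text[], $2::text[]) AS v(email, favorite_color)
WHERE users.email = v.email;

-- And instead of a long IN list, a single array parameter:
SELECT * FROM users WHERE email = ANY($1::text[]);
```

Note that multi-argument unnest, which turns several parallel arrays into a multi-column row set, is only allowed in the FROM clause.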
Now, let's say you don't actually have a large data set, but you want to come up with an example for your test cases, or to experiment with how your application behaves with different data.
generate_series in Postgres is a pretty useful function for that. Similar to how
UNNEST works, the
generate_series function that Ryan from TimescaleDB describes in his article can be used as a set-returning function.
In the place in the SELECT where the table name usually goes, you instead put the function call, and you tell the
generate_series function the start point, the end point, and how much distance there is between each point. In Ryan's example, that returns a list of five rows. That's a pretty simple data set to start with. What's nice about it is that it's a short query, and it can get much more sophisticated; Ryan goes into a lot of detail in his article.
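In its simplest form, a generate_series call with a start point, end point, and step looks like this (a minimal sketch, not Ryan's exact example):

```sql
-- Numbers: start 1, stop 9, step 2 → five rows: 1, 3, 5, 7, 9
SELECT * FROM generate_series(1, 9, 2);

-- The same idea works with timestamps and an interval step,
-- one row per day:
SELECT * FROM generate_series(
  '2022-01-01 00:00:00'::timestamptz,
  '2022-01-05 00:00:00'::timestamptz,
  '1 day'::interval
);
```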
For example, if you add a random function and you want something ever-increasing, like web traffic that grows more each day, you can use
generate_series for that too, and it's a really short function definition. It's really useful for creating test data. The same article explores a lot more of what you can do. For example, you can create a wave using cosine functions to get a cyclical pattern, so if you want to emulate week-over-week type data, then
generate_series can help you illustrate that. Really nice article, really detailed. I certainly learned something from it and want to use
generate_series more now.
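To give a flavor of those two patterns, here is a sketch with made-up formulas (these are illustrative shapes, not the article's exact definitions):

```sql
-- Ever-increasing "web traffic" with some noise: the row number drives
-- growth, random() adds jitter
SELECT day,
       round(row_number() OVER (ORDER BY day) * 100 * (1 + random())) AS page_views
FROM generate_series('2022-01-01'::timestamptz,
                     '2022-01-31'::timestamptz,
                     '1 day') AS day;

-- Weekly cyclical pattern using cosine over the day of week
SELECT day,
       1000 + 500 * cos(2 * pi() * extract(dow FROM day) / 7) AS visits
FROM generate_series('2022-01-01'::timestamptz,
                     '2022-01-31'::timestamptz,
                     '1 day') AS day;
```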
Last, but not least: What's nice with Postgres in general and PostGIS as well is that people just ask questions out in the open and then people who work on the actual software respond back.
In this case, somebody had a question about a PostGIS data set they were working with: the ground elevation of Europe.
They had a data set with all the individual elevation points across Europe at two-kilometer intervals, and they wanted to draw a line from A to B and find each elevation point between those points, to build a height profile. They were trying to emulate what you might know from Google Maps, where you enter a destination and it shows you the elevation profile along the way to that destination. Paul Ramsey from the Crunchy team, who is somebody who works a lot on PostGIS, went through and tried to reproduce the problem: with the help of the original poster he loaded the data set into his own database, built a simple table structure with the geography type as a latitude / longitude value, created a GIST index on that latitude / longitude column, and then described a few more intricacies of dealing with geographical data. Do you actually want a straight line, or do you want a line on a sphere? Earth is curved, right? It's not a straight line like on a map.
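Sketched out, the table structure and index look roughly like this (the table and column names are hypothetical, not Paul's exact schema):

```sql
-- One elevation sample roughly every 2 km across Europe
CREATE TABLE elevation (
  pt  geography(Point, 4326),  -- longitude / latitude, measured on a sphere
  ele float8                   -- elevation in meters
);

-- GIST index so spatial operators can search the points efficiently
CREATE INDEX elevation_pt_idx ON elevation USING GIST (pt);
```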
This is a really good article if you also want to understand the basic PostGIS operators, like the nearest-distance operator, for example. Those are the things that then get used in the article to arrive at the result. The final query takes 14 milliseconds. That's certainly a sizable enough data set that the index actually matters for getting fast performance here. The index definition is pretty simple, but what's most important with PostGIS is that you actually understand the different functions and operators that allow you to model your query in a way that uses the index effectively.
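For example, the nearest-distance operator <-> can drive an index-assisted nearest-neighbor search; a minimal sketch against the hypothetical elevation table above (coordinates made up for illustration):

```sql
-- Elevation at the sample point closest to a given location:
-- ORDER BY pt <-> <point> lets the GIST index return rows in
-- distance order, so LIMIT 1 stops after the nearest one
SELECT ele
FROM elevation
ORDER BY pt <-> 'SRID=4326;POINT(8.54 47.37)'::geography
LIMIT 1;
```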
That's it for today. Thank you so much for listening to 5 minutes of Postgres and talk to you next week.