Collations in Postgres
In episode 61 of “5mins of Postgres” we’re talking about collations in Postgres. We’ll specifically take a look at glibc and ICU collations, and problems that can arise when using different operating system releases.
Let's have a look!
In this blog post by Jeremy from the AWS team, he talks about a hypothetical situation where he is a linguistics researcher that works with the Balti language of northern Pakistan.
The reason that Jeremy picked this language is because it had some recent collation changes, which we'll talk more about in a moment. In order to do the linguistics research work here, Jeremy sets up a new EC2 instance with Ubuntu and installs Postgres 15.
An important aspect is that Jeremy chooses the ICU collation provider. In his article, Jeremy inserts some test data into the database and then is able to run some queries. He poses the question: “What if I want to scale this out?” So he adds a hot standby to his setup that is replicating from the primary.
An important note here is that he uses Ubuntu LTS for the hot standby and he uses the same Postgres version. Now, queries are faster, but then he discovers a problem.
He's horrified to discover that his hot standby here seems to have lost a huge amount of data. He's running COUNT(*) and then, suddenly, on the hot standby, it's not returning any data!
On the primary, everything looks fine, so he stops using the hot standby server because clearly there is some kind of problem. Luckily, the primary is still safe. This is something that you can replicate yourself if you follow Jeremy's instructions.
The main thing to understand is the reason that this data seemingly was lost is because of a difference in the Ubuntu release that Jeremy used here.
Essentially, not being super careful, he used one LTS version at one time and then used another LTS version the other time. The primary is Ubuntu 20.04, the secondary is Ubuntu 22.04. Postgres wants you to use both the same Postgres version, as well as the same OS.
Postgres wants you to use both the same Postgres version, as well as the same OS.
Oftentimes, end users don't necessarily look at the Postgres logs, but if you were to look at the Postgres log, you would have actually seen a very helpful message that says
database research_texts has a collation version mismatch.
Database was created using collation version 153.14 but the operating system provides 153.112.
So there is a newer version on the operating system side.
Rebuild all objects in the database that use the default collation and run ALTER DATABASE REFRESH COLLATION VERSION, or build Postgres with the right library version.
Collations are a complex topic. There has been a lot of writing about this over the years. The way that Jeremy summarizes this is that collation is essentially putting words in the right order. It's sorting strings so that they follow the rules of local languages, of local practices, so that the database returns the sorted data in a way that you expect it.
Postgres itself does not actually handle collation versioning. Instead, Postgres outsources this. Postgres either uses the glibc library, or the ICU library for handling collations and sort order. These libraries are part of the operating system, and this is where the problem comes in.
When you upgrade those external libraries, then the ordering rules change. The big issue you have is that there is inconsistency between the new ordering in the library logic and what's on disk. This is rare in practice, luckily, but when it happens, it can cause wrong query results, it can cause duplicate data violating unique constraints, and more. In short: It really sucks. You expect there to be one row and suddenly there are three rows! You’re wondering "which one is the right row?" I've seen this myself on a database and it's not good.
Postgres itself does not actually handle collation versioning. Instead, Postgres outsources this.
This is a problem that has existed for Postgres for many years. In a sense, here we should be in a better place as this is using ICU and is using explicit collation versioning, but you don't really escape the problem that you have to think about collation.
What Jeremy has done is to create a GitHub repo where he looks at both the glibc, as well as the ICU sort order changes over time. He's done a very thorough analysis on the different glibc versions, for example, bundled in Ubuntu and how they've changed their sort order over time. He did the same thing for ICU versions. If you're interested in this topic or if you're trying to understand better if this could have affected you when you upgraded our OS, then this is a good resource.
If you are more interested in how collaction actually works, why collations can be a problem, and more, Peter Eisentraut wrote a great article on how collations work.
Jonathan Katz from the AWS team, also talked about collation changes on AWS Aurora and RDS. What's important to note is that you don't necessarily escape this problem when you're on AWS, but one thing that they've introduced recently is that they're actually pinning the glibc collation version. They've essentially said: the collation part of glibc, "we'll freeze it to be 2.26-59", which is one very particular glibc version, and they won't upgrade that, which means that when you're using Amazon RDS and Aurora, and you're using a glibc collation in your database, you can be reasonably sure that this won't break. I think this is an interesting aspect of how to solve it, but it doesn't solve it if you're using ICU collations. It is good to have some awareness of this when you're using the more advanced ICU features in Postgres.