Waiting for Postgres 17: The new built-in C.UTF-8 locale

By Lukas Fittl

March 22, 2024

In today’s E107 of “5mins of Postgres” we discuss a recently committed change to the Postgres 17 development branch that adds a built-in collation provider to Postgres, as well as a new built-in C.UTF-8 locale that allows fast binary sorting, whilst supporting Unicode-aware operations on texts.

Share this episode: Click here to share this episode on LinkedIn or on X/Twitter. Feel free to sign up for our newsletter and subscribe to our YouTube channel.

The new builtin collation provider in Postgres 17
- What is a collation, and a locale?
- The new "C.UTF-8" locale
The benefits of using binary-sorted indexes
In conclusion
What we have discussed in this episode of 5mins of Postgres

Transcript

We'll start with this commit by Jeff Davis.

The new builtin collation provider in Postgres 17

So Jeff, last week added a new "builtin" provider to Postgres 17 that you can use instead of "glibc" or "icu". And to be clear, when we talk about Postgres 17, this is the upcoming Postgres release that's most likely going to come out in September or October this year. Even though this is committed features can still change. And so sometimes these things may not be in a final release, but our hope of course, is that that's the case.

What is a collation, and a locale?

Now maybe quickly to explain what a collation is collations or sometimes also referred to as locales are essentially ways how Postgres decides how to do sorting as well as other operations on text.

Imagine you have two texts values, and you're trying to decide which one comes before the other. One good example would be is you have a German umlaut and you have a word that starts with an umlaut, should that word sort after the letter "Z" in the English alphabet, or should it come after the corresponding version of the character without the umlaut in the alphabet.

Now, this built-in collation provider is nice, but it's not really changing that much beyond what you have today when you use the "C" locale of the "glibc" provider.

And essentially it's just matching that, it makes it very clear that this isn't really tied to glibc as much as it is a built-in provider, as part of Postgres.

The new "C.UTF-8" locale

Now, what's very exciting though, is the new "C.UTF-8" locale. And so, this is similar to the "C" locale, in the sense that this is focused on binary sorting. So you've got a lot of the performance benefits, similar to the "C" locale, but then it has a few usability improvements that are more specific to Unicode. So, for example, if you're uppercasing an umlaut it's going to correctly do this, versus the "C" locale doesn't know how to do that.

The benefits of using binary-sorted indexes

Now let's briefly talk about the trade-offs here. So Daniel Vérité, who also reviewed these patches that Jeff committed, wrote a good post about the benefits of using binary-sorted indexes. What he's comparing here is the linguistic comparison to byte-wise or binary comparison. So, ease of use, clearly linguistic is better. Makes more sense to humans. Performance on the other hand, it's going to be better on the binary sorts and we'll come back to another test on that in a moment. It's also more portable. These locales can behave differently between different glibc versions. This is part of the reason why Postgres added ICU, which is a more portable library for handling Unicode including sorting.

Why don't more databases use binary sorts?

Why don't most people actually use binary sorts? Because they are quite beneficial for many, many use cases.

Really the problem here is more of a historic one. So as Daniel illustrates here, all of this flows down from this environment setting that is set by the operating system, which usually is something like "en_US.utf8" if you're in the US for example.

And then initdb will take over that setting, it will initialize the template databases in Postgres with this, the user created databases will inherit this from template databases, your columns will inherit it from the database and your indexes will inherit it from the column. And so the problem is, you may not have intentionally chosen a certain locale, that's just to, because your system was set up in a US-centric way, for example. You might see that flow down here and you get all this overhead of a linguistic sorted locale, when really you're not needing that.

The C vs C.UTF-8 locale

And so what Daniel does call out here is that the "C" locale has a few trade-offs. If you, for example, say you want to do simple string manipulation. You want to capitalize the initial letters of each word. If you have the "C" locale, and you take a word like "élysée", it's going to capitalize it incorrectly. If on the other hand we used the "C.utf8" locale, that's going to work correctly.

Now, as Daniel points out here, "C.utf8" as he's written it here, that's actually the glibc version of this, and it's only available if you're running an operating system with glibc 2.35 or newer.

However, with Postgres 17, we do not need a new glibc, because the built-in provider is going to be available regardless of your glibc version and also on other systems. It's portable, which is a big benefit of having this built in.

The performance benefits in practice

Now. I want to briefly illustrate to you the performance benefits here. So here in my local machine, I have Postgres 17 development branch built.

Right now, I'm on a "C.UTF-8" locale, and I'm going to run a simple SELECT statement.

This one is generating a million numbers and it's converting them to text, it's ordering them as text and it's just giving me the first result. So I expect this to return "1", which it does, but it takes 200 milliseconds to do so, because it's doing a lot of unnecessary sorting. Clearly the time we're spending here is in sorting.

Now, the baseline here is on this new "C.UTF-8" locale I'm getting 193 milliseconds, if I connect to the "C" locale, I get a similar number, 196 milliseconds. But then if I connect to my original database, It gets a lot slower:

postgres=# SELECT i::text FROM generate_series(1, 1000000) x(i) ORDER BY 1 LIMIT 1;
 i 
---
 1
(1 row)

Time: 461.278 ms

So in this system, which is, this is actually quite surprising, but I think it's a little bit of edge case here, it's twice as slow to do the exact same thing.

And really the reason is because as part of sorting Postgres has to do string comparison and the string comparison in this case is expensive because it has to be Unicode aware. It has to be aware of the German umlauts and whatnot, and because of that, it's going to be much slower.

Now there's a trick here that I can do. And so inside our query results, if we see the sort having overhead what we can try, is explicitly changing the collation of a particular column reference as we're using it. Right now, I'm on the slow version, slow database. And so to my column reference, I'm adding this "COLLATE C" step here:

postgres=# SELECT i::text COLLATE "C" FROM generate_series(1, 1000000) x(i) ORDER BY 1 LIMIT 1;
 i 
---
 1
(1 row)

Time: 189.731 ms

And what this does now is, before we doing the sort, it converts it to the "C" collate. And so when Postgres decides, how do I compare these two elements in my sort, it's going to use the much faster version. And so just be aware that this is something that Postgres has to do, you can help it along, if you do not need linguistic order in your results.

Does this matter on a managed DBaaS, like Amazon RDS?

Now, one last note: you might wonder, You're using a database as a service provider, like Amazon RDS. Do you even need to worry about this stuff? And the answer is yes, you do. If I look at an RDS database here that I have just running, this RDS database by default, will have the "en_US.UTF-8" locale.

Now, if I want to run the test that I ran locally. You can see that right now, this runs in 400 milliseconds. If I make the change where I add the explicit "COLLATE C" step here, this runs in 320 milliseconds.

Even in the cloud, this makes a difference.

It will be workload dependent, but it makes sense even today ahead of Postgres 17, to think about when can you use the "C" locale? Now, I cannot use the "C.UTF-8" locale because the glibc is too old, but once Postgres 17 comes out, I'll be able to get the best of both worlds by having Unicode aware upper and lower functions, but still having performant binary sorting.

In conclusion

And so I would say, well done, Jeff, on making this change, also thank you to everyone who reviewed this. I think this was not an easy change to make, it seems a bit obscure, but this is actually a very important change in Postgres 17, and I hope that going forward, we'll see more systems adopt the new "builtin" provider as the default, because I personally think we should default to binary sorts, and only if you need the linguistic search order, you should switch it out on a case by case basis.

I hope you learned something new from E107 of 5mins of Postgres. Feel free to subscribe to our YouTube channel, sign up for our newsletter or follow us on LinkedIn and X/Twitter to get updates about new episodes!