5mins of Postgres E17: Demystifying Postgres for application developers: A mental model for tables and indexes
Today, we're going to talk about demystifying database performance for developers. This blog post by Christopher Winslett explains how to think about your database as an application developer, and which mental model helps you reason effectively about indexes and other data structures in your database. We will talk about Postgres tables, indexes, index cardinality, table scans, and more!
Let's jump in!
For many developers, databases are basically magic. This blog post tries to break that illusion. I like the notion that databases are just like any other code: they have algorithms and processes. These algorithms and processes are meant to improve performance, but you generally need to understand their limitations and how they work to make effective use of them. It's not like a database just does the work for you; you actually have to interact with it and optimize it.
Christopher uses the example of a physical library in this blog post. Look at a programmer's bookshelf: in his example there are six books, with "Learning Go" at position 1, "Learning SQL" at position 4, and so on. The most basic thing you can do to organize your bookshelf is to assign each book a number. Then, in a separate data structure, you can say:
- here are my books
- here is each book's ID
- here is its title
- here is its publish year
- here is its ISBN, and so on.
This is the most basic example of a table.
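As a minimal sketch in SQL (the table and column names here are illustrative, not from the original post):

```sql
-- A minimal "bookshelf" table: each book gets an ID (its shelf position)
-- plus the attributes we want to look up later.
CREATE TABLE books (
    id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title        text NOT NULL,
    publish_year int,
    isbn         text
);

INSERT INTO books (title, publish_year, isbn)
VALUES ('Learning Go',  2021, '978-1492077213'),
       ('Learning SQL', 2020, '978-1492057611');
```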
Oftentimes, when we talk about performance, we talk about indexes. In this books example, a good index would be one on publish year. The publish year 2008 points to shelf position 6, instead of the index itself holding a full copy of the book, which would be pretty expensive; if you ran a library, you wouldn't want to keep copying your books. An index on programming language is another good example here, and ISBN indexes of course are a great one: if I want to purchase a certain book and I have its ISBN, I can tell that to a bookshop and they can find it.
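In Postgres terms, that lookup structure is simply an index on the column. A sketch, again assuming the hypothetical books table from above:

```sql
-- The index stores publish_year (or isbn) values plus pointers back
-- to the table rows, not copies of the rows themselves.
CREATE INDEX books_publish_year_idx ON books (publish_year);
CREATE UNIQUE INDEX books_isbn_idx ON books (isbn);
```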
Let's say we want to add "Programming Rust" to our library. If I add "Programming Rust" in the middle of my index, that can be more expensive, because I have to move values around to make space.
A great example of this is B-tree indexes in Postgres. B-tree indexes are sorted: if you add something in the middle of a sorted index, Postgres may have to split a page, or split the parent pages; there is all kinds of effort involved in adding a value in the middle of a B-tree index. If you add something at the end instead, it's much cheaper, because you're just extending the index. This is part of the reason why random UUIDs are bad in Postgres: you should use either a ULID or a sequence value if the value is indexed (you can read more about this in our most recent newsletter issue here).
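A sketch of the two approaches (table names are illustrative; gen_random_uuid() is built in as of Postgres 13):

```sql
-- Sequential keys always append at the right-hand edge of the B-tree:
CREATE TABLE events_seq (
    id      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload text
);

-- Random UUIDs land at arbitrary positions in the sorted index,
-- causing page splits in the middle of the B-tree over time:
CREATE TABLE events_uuid (
    id      uuid DEFAULT gen_random_uuid() PRIMARY KEY,
    payload text
);
```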
Next we're going to talk about index cardinality.
- High cardinality means that each index value points to a very small number of table records.
- Low cardinality means that each index value points to a lot of records.
Generally speaking, efficient indexes are high cardinality indexes.
A good index will have a small number of table rows per index value and return a very specific result. A typical example of a high cardinality index is a primary key. An example of a low cardinality index would be an index on a tag: let's say you have 10 different tags in your system; that means you have a lot of duplicate values in your index.
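One way to check cardinality in practice is Postgres' pg_stats view (the table and column names below refer to the hypothetical books table, not anything from the original post):

```sql
-- n_distinct is the estimated number of distinct values in the column.
-- A negative value means a fraction of the row count (-1 = all distinct,
-- i.e. very high cardinality).
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'books'
  AND attname IN ('isbn', 'publish_year');
```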
Important to note: Postgres 13 added B-tree deduplication (as we talked about in our Lessons Learned from Running Postgres 13 blog post). That makes some of these best practices less important. But even with B-tree deduplication, you want to avoid having a lot of pointers back to your table; it's still much better to have a more specific, high cardinality index.
When you think about indexes, you also have to think about table scans. An example here: we have a publish year index, so we can look up which books were published in 2021. But if you also ask how many of those were first editions, the publish year index only gives us the positions of all those books. To answer which of those books is a first edition, we have to go to our bookshelf, pull the books, and check: is this a first edition or not? Then we find that, out of those three, only one is the first edition, because we had to do the filtering on the books themselves.
In Postgres, the fact that you see an index scan might not be enough: if there is a filter, a lot of work may still have been done after the index was used.
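You can spot this in an EXPLAIN plan. A sketch, assuming the hypothetical books table plus an edition column that is not part of the index (the plan shown is illustrative, not actual output from the post):

```sql
EXPLAIN SELECT * FROM books
 WHERE publish_year = 2021 AND edition = 1;

-- The "Index Cond" line is work the index did for us; the "Filter" line
-- is work that still happens row by row against the table:
--   Index Scan using books_publish_year_idx on books
--     Index Cond: (publish_year = 2021)
--     Filter: (edition = 1)
```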
When you think about databases, it's important to think about scale. If I have a library of a hundred books, everything's going to be fast and efficient. I can probably just look through them. If somebody asks me, do you have "Programming Go", I can just go look at my bookshelf until I find "Programming Go". Versus if I'm the Library of Congress and I have 40 million books, then I really need indexes. I need an efficient lookup system. I need to understand what things people are searching for, so that I have my indexes ready. Then I can just do a query right there and answer that question.
Last, but not least, it's important to experiment. The same way you optimize your application code, when you optimize your database, don't just assume things; try them out. Use tools like EXPLAIN, or try out different indexes. That's how to make the most of your database system.
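A minimal experiment loop might look like this (table and index names are the hypothetical ones used throughout):

```sql
-- Measure the actual execution, not just the estimated plan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT title FROM books WHERE publish_year = 2021;

-- Try an index, then run the same EXPLAIN again and compare:
CREATE INDEX IF NOT EXISTS books_publish_year_idx ON books (publish_year);
EXPLAIN (ANALYZE, BUFFERS)
SELECT title FROM books WHERE publish_year = 2021;
```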