Storing and querying vector data in Postgres with pgvector
In today’s episode 74 of “5mins of Postgres” we're going to talk about vectors in Postgres. We take a look at the claim that vectors are the new JSON in Postgres, talk about Andrew Kane’s pgvector, and discuss what work is needed in the Postgres core to better support vectors and ML and AI workloads in Postgres.
Let's jump right in!
This is a blog post by Jonathan Katz from the AWS team, also a prominent Postgres community member, who talks about the popularity of vector data types in Postgres and what that means for the community.
If you're not familiar, vectors are a very frequently used way of storing and transmitting data in the context of AI and ML based systems – for example for embeddings and other use cases. There are specialized vector databases built just for working with vector data, and Postgres is probably not going to compete with them on raw speed, but that's not really the goal here.
The goal here is that you often want to work with embeddings within the context of your existing application, so it's very convenient to have Postgres help you with that.
Let's take a little bit of a look back to understand why we're comparing vectors and JSON. About 10 years ago, JSON support was first added to Postgres. Before that, you could store JSON in a text field, but it wasn't being validated, so it wasn't really useful. That validated JSON type arrived in Postgres 9.2. Postgres 9.4 then added the JSONB data type.
What was added in this release with JSONB was a way for Postgres to see structure in the data, and to be able to index the data. The indexing has helped make Postgres competitive with the popular JSON focused databases at the time. For example, MongoDB would've been your popular comparison back 10 years ago.
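That indexing story is easy to see in practice. Here is a minimal sketch (table and column names are hypothetical) of the kind of JSONB indexing that made Postgres competitive for document workloads:

```sql
-- Hypothetical table storing schema-less event payloads as JSONB
CREATE TABLE events (
    id      bigserial PRIMARY KEY,
    payload jsonb NOT NULL
);

-- A GIN index lets Postgres see structure inside the JSONB data
CREATE INDEX ON events USING gin (payload);

-- The index accelerates containment queries like this one
SELECT id
FROM events
WHERE payload @> '{"type": "signup"}';
```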
I think what we've seen in the 10 years since then is that in a lot of cases where people might have gone and used MongoDB, they ended up using Postgres instead: for the part of the data with that extensibility requirement, that schema-less requirement, they were able to use JSONB, and for the other parts of the data they were able to use relational tables, all within one system with Postgres.
That's really the power of Postgres, to support extensible data types like JSONB.
The argument that Jonathan is making is that this is happening again with vectors in Postgres. Instead of having to utilize a specialized vector database, you can actually store vector data in Postgres.
There are two existing ways in core Postgres to store this type of data, because ultimately vectors are a set of floats.
- The array data type
- The cube data type
But really, what matters is how you work with that set of floats. For example, the problem with the array data type is that even though it supports multidimensional data, it doesn't really support calculating the distance between two arrays.

You could write a bunch of helper functions, but querying that data is really inconvenient, and you can't define an efficient distance operator for it. The cube data type, which is also in core Postgres, could be used to handle this and actually does support GiST indexes and KNN-style queries. However, it has a limitation of 100 dimensions, which in practice today is too limiting for a lot of use cases. The question is: With those limitations in place, how can we do better?
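To make the cube option concrete, here is a sketch of a KNN query using the core cube extension (table and column names are hypothetical), which works as long as you stay under the dimension limit:

```sql
-- cube ships with core Postgres as a contrib extension
CREATE EXTENSION IF NOT EXISTS cube;

-- Hypothetical table storing low-dimensional embeddings as cube values
CREATE TABLE items (
    id        bigserial PRIMARY KEY,
    embedding cube NOT NULL
);

-- A GiST index enables index-assisted KNN ordering
CREATE INDEX ON items USING gist (embedding);

-- Find the 5 nearest neighbors to a query point,
-- using cube's <-> Euclidean distance operator
SELECT id
FROM items
ORDER BY embedding <-> cube(array[0.1, 0.2, 0.3])
LIMIT 5;
```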
About two years ago, Andrew Kane started an extension called pgvector, and I think this is a great example where Postgres is extensible without requiring changes to the core database system.
You are actually able to add a specialized vector data type with a specialized index format that is just focused on vector data. Kudos to Andrew Kane for doing all the hard work! Andrew worked on this for a long time before it got popular, and in the last couple of months we've seen a lot of providers add support for it – everybody's talking about pgvector in the context of Postgres and AI and ML. I think it shows that Andrew did a good job investing early in this and doing the hard parts.
pgvector also supports an index type called IVFFlat, which is a particular method of vector indexing.
What it helps you do is perform efficient nearest-neighbor searches. There are a couple of intricacies about how you index these, and when you're actually using the index, there is a particular value that you have to set when querying. The idea is that you set it as part of your query process with a local SET command in your connection.
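Putting those pieces together, here is a sketch of the pgvector workflow (table and column names are hypothetical; `ivfflat.probes` is the pgvector setting that controls how many index lists are searched at query time):

```sql
-- pgvector must be installed; the extension is named "vector"
CREATE EXTENSION IF NOT EXISTS vector;

-- Hypothetical table with a 3-dimensional embedding
-- (real embeddings are typically hundreds or thousands of dimensions)
CREATE TABLE documents (
    id        bigserial PRIMARY KEY,
    embedding vector(3) NOT NULL
);

-- IVFFlat index for approximate nearest-neighbor search;
-- "lists" controls how many clusters the index partitions the data into
CREATE INDEX ON documents
    USING ivfflat (embedding vector_l2_ops)
    WITH (lists = 100);

-- At query time, ivfflat.probes controls how many of those lists are
-- searched; SET LOCAL scopes the setting to the current transaction
BEGIN;
SET LOCAL ivfflat.probes = 10;
SELECT id
FROM documents
ORDER BY embedding <-> '[0.1, 0.2, 0.3]'
LIMIT 5;
COMMIT;
```

Raising `ivfflat.probes` trades query speed for better recall, which is why it makes sense to tune it per query rather than globally.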
Now, Jonathan gave a lightning talk at PGCon, which luckily I was in the room for, and I'm happy to see that Jonathan wrote about this in this post, and the lightning talk had a couple of other points that I think are worth mentioning.
One of the main points Jonathan talked about were the current limitations around vectors in Postgres. One of the challenges is the Postgres page format. Postgres has 8 kB pages in most situations. These pages can store vector data.
The problem is that your data, in some cases, cannot exceed that page size. For example, a 1,536-dimensional vector fits in a page: at four bytes per float, that's about 6 kB worth of data. However, a 2,048-dimensional vector is 8 kB on its own and does not fit into a page.
Now generally speaking, Postgres solves this using TOAST, The Oversized-Attribute Storage Technique, which splits the data into multiple chunks stored across pages and that way is able to exceed the 8 kB limit.
However, the problem is that you cannot TOAST an index. You might have seen this before: if you have a B-Tree index on a text column and you try to insert a very large text value, Postgres will actually give you an error, because there is no way to keep that index entry on a single index page.
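You can reproduce that error yourself. Here is a sketch (table and column names are hypothetical) that inserts a value far too large for a B-Tree entry to fit on one index page:

```sql
-- Hypothetical table with a plain B-Tree index on a text column
CREATE TABLE notes (body text);
CREATE INDEX ON notes (body);

-- Build a long, hard-to-compress string (~32,000 characters) and
-- try to insert it; the heap tuple could be TOASTed, but the index
-- entry cannot, so this fails with an "index row size ... exceeds
-- maximum" error
INSERT INTO notes
SELECT string_agg(md5(i::text), '')
FROM generate_series(1, 1000) AS i;
```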
What can Postgres do better here? Jonathan talked about adding native support directly in Postgres. There is a benefit to having some of these things in core, like JSONB is. There could be ways of managing that data beyond a single page. There's more work to do around indexing methods, query plan costing, and things like that.
What is really nice to see is that there's a healthy discourse in the community about how this could be improved. I would expect that in the next Postgres development cycle, for Postgres 17, we'll see more potential core improvements being discussed, as well as contributions back to pgvector, and a clearer picture of what pgvector needs from the core community.