5mins of Postgres E42: A surprising case of very large integers causing a Sequential Scan

Today, we talk about a surprising case of a very large integer causing a Sequential Scan in Postgres. There are cases when Postgres decides not to use an index and instead opts for a Sequential Scan, which can cause some very real performance problems. Let's have a look at why it does this and how to resolve the issue!





Transcript

Let's have a look!

Circumvent bigints being cast to numeric values to avoid slow sequential scans in Postgres

This blog post by Jeremy Evans, who is the author of the Sequel library for Ruby, talks about how, when you pass in an integer that doesn't fit into 64 bits, Postgres suddenly decides not to use an index and instead uses a Sequential Scan.

When you're passing an integer inside a SQL query and that integer exceeds the range of the bigint type in Postgres, Postgres treats that value as a numeric value, because it would not fit into a bigint. So far so good.
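You can check this typing behavior yourself with pg_typeof. As a quick sketch, using the bigint maximum (2^63 - 1) and the value one above it:

    -- The bigint maximum still fits into 64 bits
    SELECT pg_typeof(9223372036854775807);  -- bigint

    -- One more than that no longer fits, so Postgres types it as numeric
    SELECT pg_typeof(9223372036854775808);  -- numeric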

Now, the problem is: when Postgres compares a bigint value with a numeric value, it casts the bigint to numeric, not the other way around. Since it doesn't know whether the numeric would fit into an integer or bigint, it can't safely cast the numeric down. Instead, it casts the column you're comparing against to numeric and then does a numeric comparison.

The practical problem this creates surfaces when you have an integer or bigint column in your table. For example, say a column is named id, you have a regular index on the id column, and you issue a SQL query of the form SELECT * FROM table WHERE id = <very large number>. Postgres will not use any index on the table, because the index is on a bigint column and can't be used for a numeric comparison!
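As a minimal sketch, assuming a hypothetical table named my_table with an indexed bigint id column:

    -- 9223372036854775808 doesn't fit into a bigint, so it is parsed as numeric.
    -- The comparison is then performed as numeric, and the btree index on the
    -- bigint id column can no longer be used.
    SELECT * FROM my_table WHERE id = 9223372036854775808;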

This might seem like an edge case, but we'll get to a very real-world situation in a moment.

Using the generate_series function in Postgres

Jeremy illustrates this using the generate_series function, which is always a great tool to have in your tool belt for creating test data. He then runs EXPLAIN ANALYZE on queries against that data.

Note: He doesn't use an empty table, because with an empty table you would of course get a Sequential Scan anyway, simply because that's faster, and that's not what's happening here.
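A minimal setup to reproduce this might look like the following; the table name and row count here are illustrative, not Jeremy's exact test case:

    -- Create a test table with an indexed bigint column and fill it with test data
    CREATE TABLE test_table (id bigint PRIMARY KEY);
    INSERT INTO test_table SELECT generate_series(1, 1000000);
    ANALYZE test_table;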


First, he uses a value that is just inside the bigint range, the largest value a bigint can hold. Then he uses a value that is exactly outside of the range, that maximum plus one.
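With the illustrative test_table from above, the two queries look roughly like this:

    -- Just inside the bigint range: Postgres can use the index on id
    EXPLAIN ANALYZE SELECT * FROM test_table WHERE id = 9223372036854775807;

    -- Just outside the bigint range: the literal becomes numeric and the index is skipped
    EXPLAIN ANALYZE SELECT * FROM test_table WHERE id = 9223372036854775808;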

The query with the number that is inside the bigint range uses an Index Only Scan. This is very fast, around 0.1 milliseconds.

Then, he runs the query with the larger number. This results in a Parallel Sequential Scan, which means Postgres scans the whole table, casts the column to numeric, and then filters with a numeric comparison.

This results in performance that's 1000x worse! We’re still looking at a pretty small table in Jeremy’s example, so imagine this in a case where you have tens or even hundreds of gigabytes of data that gets scanned in the Sequential Scan. This could be a huge problem! And, as Jeremy mentions, this could also be a potential denial of service vector.

Now, how can you mitigate this problem?

Avoiding these slow Sequential Scans

One way of solving this is using bind parameters.

Bind parameters in Postgres

Instead of having Postgres do the conversion on the server side, you're explicitly telling Postgres that you're passing it a bigint, and that if a value doesn't fit into a bigint, you want it to raise an error.
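You can see the same effect at the SQL level with a prepared statement that declares its parameter type, sketched here against the illustrative test_table from earlier:

    -- The parameter is declared as bigint, so its type is fixed up front
    PREPARE lookup (bigint) AS SELECT * FROM test_table WHERE id = $1;

    -- Uses the index as expected
    EXECUTE lookup(9223372036854775807);

    -- Fails with "bigint out of range" instead of silently
    -- falling back to a Sequential Scan
    EXECUTE lookup(9223372036854775808);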

Explicit casts in Postgres

You could do an explicit cast. For example, you can cast your number, as you're passing it in, to a bigint. Then, if it's too large, Postgres will simply complain instead of doing that Sequential Scan.
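For example, again against the illustrative test_table:

    -- The explicit cast raises "bigint out of range" for values that are
    -- too large, rather than triggering a Sequential Scan
    SELECT * FROM test_table WHERE id = 9223372036854775808::bigint;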

Oftentimes that's challenging though, as Jeremy points out: if you have an ORM (like the one that Jeremy himself maintains), it's actually hard to enforce this in the ORM, because there are certain cases where you do want the implicit casting behavior.

From an ORM perspective, Jeremy addressed this in two ways.

  1. He changed the Sequel ORM in Ruby to explicitly check the sizes, preventing bad data from being passed in.
  2. He switched to bind parameters.

Using the pg_auto_parameterize extension in Sequel

The second thing mentioned above is to switch to using bind parameters, and in Sequel that's opt-in: you have to use the pg_auto_parameterize extension. I would recommend you take a look at this if you're using Sequel, because it will help you prevent these types of problems in the future.

Now, you may think that a lot of ORMs already use bind parameters, but unfortunately that is not necessarily the case. It definitely helps to double check whether your ORM directly uses values in your queries or uses them as bind parameters.


This might seem like an edge case. Is it actually a problem?

The team at Reddit also uses Postgres. They describe a case from many years ago where URLs contained IDs, and these IDs were base36 encoded; they would translate them back into an integer and then look them up in the database.

Bots would come in with random strings that decoded to very, very large integers. Reddit was running into this exact problem on their production site! They were getting Sequential Scans, which took a long time and ate their disk bandwidth, all these problems! Their fix was, of course, the simple validation that they should have been doing the whole time. But again, this was surprising behavior: it wasn't obvious why Postgres didn't handle that for them.

In their case, they were using the SQLAlchemy ORM in Python.

To conclude: be careful with this. It can be an actual real-world problem if you have user-supplied data that is then used directly in an integer-type comparison.

Thanks so much for listening to this episode of 5mins of Postgres. Subscribe to our YouTube channel and follow us on Twitter to hear about upcoming episodes!


