Check out our free resources, like our pganalyze eBooks and sign up to our newsletter.

Indexing Engine: "Good Enough" Index Selection Algorithm

Overall, the pganalyze Indexing Engine looks at index selection as a set covering problem, with the goal of covering all scans on the table with an index that is within 50% of the best possible index.

The "Good Enough" index selection algorithm takes the set of potential indexes, as well as the set of scans, and their potential cost improvements (as determined by What If? analyis), and produces zero or more index recommendations for a given table.

Elimination of worst candidates

Before we run through the set covering algorithm that selects the indexes to use, we first eliminate all indexes from consideration for a scan that are not within 50% of the best possible result.

This early reduction enables later stages of the algorithm to view this as a greedy set covering problem - any index that remains after this initial elimination would be acceptable and "good enough".

The index/scan matrix

The starting point to index selection, looks like this for a simple table with 4 existing indexes, 14 index candidates and 15 scans (extracted from queries):

I1 (0.11)6.1
I2 (0.14)68.1
I3 (0.15)68.1
I4 (0.10)7.2
I5 (0.08)33.6
I6 (0.09)43.3
I7 (0.13)77.6
I8 (0.12)77.6
I9 (0.14)31.142.731.131.131.1
I10 (0.15)46.144.346.146.146.1
I11 (0.08)63.346.
I12 (0.09)63.346.
I13 (0.16)63.346.
I14 (0.17)63.346.

You can think of X.. and I.. being mapped to an index like CREATE INDEX possible_index_1 ON table USING btree (some_column).

You can think of S.. being mapped to a scan like SELECT * FROM table WHERE some_column = $1.

Each cell for a potential index (I..) represents the cost improvement over the current baseline (either a SeqScan or existing sub-optimal Index Scan).

Each cell for an existing index (X..) shows "XXX" for indexes that are the best overall, and "(x)" for other existing indexes that match the scan.

Some columns have no cost improvements, typically you will see this when the existing indexes are covering the scan sufficiently.

You can see the calculated Index Write Overhead for each potential index in parenthesis (e.g. I11 has 0.08 Index Write Overhead).

Note that cells that are not within 50% of the best possible improvement (e.g. S1/I2, S1/I5 and I8/I6) have already been marked as not being "good enough" and are shown crossed out.

Set covering algorithm

Based on this input, the Indexing Engine runs a greedy Set Covering Algorithm to determine one or more "good enough" indexes.

It stats with the potential index that matches the most scans, and continuining this until all scans are covered. If two indexes match the same number of scans, the index with the lower Index Write Overhead is chosen.

In this example, the algorithm picks as following:

  1. I11 (covers S2, S5, S10, S11, S15, and has lower write overhead compared to I9,I10,I12,I13,I14)
  2. I9 (covers S6, S12, S14 -- note that S5 and S11 are no longer counted since I11 covers them)
  3. I6 (covers S1)
  4. I2 (covers S8)

This result is deterministic. When the algorithm runs again with the same inputs, it would produce the same outcome.

These recommendations are then shown in the Index Advisor for your assessment, benchmarking and, if they are a good fit, for applying them to production

Human review is encouraged. For example, you may chose not to create I2 and I6 in this example, if they represent infrequent queries where you find a sequential scan acceptable.

If you'd like to see this kind of debug output for your own recommendations, you can reach out to us by email (or through the in-app support function) and we'd be happy to share and explain how the algorithm made its decision.

Couldn't find what you were looking for or want to talk about something specific?
Start a conversation with us →