Constraint Programming in Action: Optimizing Postgres Index Selection

By Philippe Olivier

August 01, 2024

In a previous article we introduced constraint programming (CP), a declarative paradigm to solve discrete optimization problems. We discussed some theory, and considered a realistic example by constructing a work schedule for a small store. In this follow-up article, we’ll now take our understanding of constraint programming, and apply it to a real-world example: Selecting optimal indexes in Postgres.

The model presented in this article is similar to the CP model that was presented at PGDay Chicago 2024, and represents a subset of the model in production use with the pganalyze Indexing Engine.

Our aim is to help the broader software engineering community use constraint programming effectively, as well as further the discussion of automating index selection for Postgres.

The Index Selection Problem
The Dilemma of Many Objectives
The Hierarchical Method
Additional Features
Learn more about Constraint Programming in pganalyze
In Conclusion
References

The Index Selection Problem

We are trying to find the optimal selection of indexes according to some given criteria. We consider the workload not as discrete queries, but rather as a set of scans. A scan represents an activity on a table, and the cost of a scan is the estimated effort for finding the matching rows. For example, the SQL statement SELECT * FROM my_table WHERE x = $1 would require one scan on my_table with the filter x = $1. You can learn more about the concept of scans here.

Indexes reduce the costs of various subsets of scans. The cost of a scan is the lowest cost afforded to it among the selected indexes, or its sequential read cost if the scan is not included in any of the selected indexes. The sequential read costs are generally high, and should be avoided. In the example below, we want to minimize the scan costs, as well as the number of indexes used.

Let's start by creating some scans and indexes:

cost_read = [35, 22, 37, 42]

cost_indexes = [[15,   None, 12,   27  ],
                [None, 14,   None, 14  ],
                [13,   11,   21,   None],
                [21,   11,   18,   None]]

num_scans = len(cost_read)
num_indexes = len(cost_indexes)

The sequential read cost of scan s is defined by cost_read[s]. The cost offered by index i to scan s is defined by cost_indexes[i][s] (a value of None means that index i does not cover scan s).

Let's create a basic model using CP-SAT, as in our previous article:

from ortools.sat.python import cp_model

model = cp_model.CpModel()

Among the indexes we can choose from, which ones should be selected?

This can be described with one boolean variable per index, indicating if that index is selected in the solution:

model.x = [model.new_bool_var(f"x_{i}")
           for i in range(num_indexes)]

If model.x[i] is equal to 1, this means that index i is selected (and conversely, a value of 0 means that the index is not selected).

We also need to track the costs of the scans, which we can do using one integer variable per scan. In the previous article, we always described the domains of our integer variables as a range of values (e.g., from 0 to 20). Here, however, the cost of a given scan is already known to be either the cost offered to it by one of the indexes, or its sequential read cost.

For example, the possible costs of the first scan above are [35, 15, 13, 21]. We can take advantage of this to define the domains as tightly as possible. First, we make lists of the possible costs that can be assigned to each scan:

possible_costs = [[cost_read[s]] + [cost_indexes[i][s]
                                    for i in range(num_indexes)
                                    if cost_indexes[i][s] is not None]
                  for s in range(num_scans)]

Then, we use these to define the domains of the variables tracking the costs of the scans:

model.scan_cost = [model.new_int_var_from_domain(cp_model.Domain.from_values(possible_costs[s]),
                                                 f"scan_cost_{s}")
                   for s in range(num_scans)]

Here, from_values(list) creates a domain from the values in list, and new_int_var_from_domain(domain) creates a new integer variable with the domain domain. We want the value of model.scan_cost[s] to indicate the cost of scan s.

We will recall that the actual cost of a scan is the best cost offered to it by any selected index, or its sequential read cost if no selected index covers the scan. Accordingly, we can constrain the values of model.scan_cost using add_min_equality(target, list). This constraint assigns to integer variable target the lowest value among the integer variables found in list:

for s in range(num_scans):
    model.add_min_equality(model.scan_cost[s],
                           [cost_indexes[i][s] * model.x[i] +
                            (1 - model.x[i]) * cost_read[s]
                            for i in range(num_indexes)
                            if cost_indexes[i][s] is not None])

The way the above constraint is constructed might appear a bit enigmatic, so let's take a closer look. If index i is selected in the solution, then model.x[i] is equal to 1. This means that the expression cost_indexes[i][s] * model.x[i] + (1 - model.x[i]) * cost_read[s] becomes cost_indexes[i][s] * 1 + (1 - 1) * cost_read[s], which evaluates to cost_indexes[i][s] (the cost of scan s offered by index i).

Conversely, if index i is not selected in the solution, then model.x[i] is equal to 0. This means that the same expression instead becomes cost_indexes[i][s] * 0 + (1 - 0) * cost_read[s], which evaluates to cost_read[s] (the sequential reading cost of scan s). Hence every index offers one of two costs for each scan: either the index cost for that scan (if the index is selected), or no cost at all (which is equivalent to the sequential read cost).

And finally, as we mentioned before, the constraint add_min_equality(model.scan_cost[s], [...]) assigns to model.scan_costs[s] the lowest cost among the list [...] that we have thus constructed.

With the main components of the model now in place, we move on to the objective.

The Dilemma of Many Objectives

We mentioned earlier that our two optimization criteria were to minimize the scan costs, and to minimize the number of indexes. We can optimize the first criterion by minimizing the sum of model.scan_cost, and the second by minimizing the sum of model.x. Before incorporating these objectives into our model, let's first wrap it in a function in order to build it whenever we need it:

def build_model():
    model = cp_model.CpModel()
    ...
    model.x = ...
    ...
    model.scan_cost = ...
    ...
    return model

We also create a function that takes a model as input, and returns some meaningful information about the solution:

def solve_model(model):
    solver = cp_model.CpSolver()
    status = solver.solve(model)
    print(f"{'Status':12}{solver.status_name(status)}")
    print(f"{'Objective':12}{solver.objective_value}")
    print(f"{'Costs':12}{sum([solver.value(model.scan_cost[s]) for s in range(num_scans)])}")
    print(f"{'Indexes':12}{sum([solver.value(model.x[i]) for i in range(num_indexes)])}")

We now try each objective in turn, independently. First, we minimize the scan costs:

model = build_model()
model.minimize(sum(model.scan_cost))
solve_model(model)

Status      OPTIMAL
Objective   50.0
Costs       50
Indexes     4

And now, we minimize the number of indexes:

model = build_model()
model.minimize(sum(model.x))
solve_model(model)

Status      OPTIMAL
Objective   0.0
Costs       136
Indexes     0

We run into an obvious problem. If we want to minimize the scan costs, the best we can do is to use all the indexes available to us. And if we want to minimize the number of indexes used, the best we can do is to use no indexes at all. These two objectives are directly conflicting, since the former pushes for a solution with many indexes, and the latter for a solution with few indexes. What should we do?

One way to resolve this issue would be to combine these two objectives:

model = build_model()
model.minimize(sum(model.scan_cost) + sum(model.x))
solve_model(model)

Status      OPTIMAL
Objective   53.0
Costs       50
Indexes     3

This is slightly better than before, because we were able to get rid of a redundant index while keeping the costs as low as possible. This approach, however, is prone to breaking down in many situations. In the case where there would be many scans, or if the average scan cost was higher, for example, the number of selected indexes would easily be dwarfed by the combined costs of the scans.

The result would end up being similar to minimizing the scan costs without regard for the number of indexes. Weighting the index count in the objective more heavily (e.g., model.minimize(sum(model.scan_cost) + sum(model.x) * 100)) would introduce other problems related to variations in the scan costs, or various difficulties if we were to introduce additional optimization criteria in the future.

After careful consideration, we settled on using the so-called hierarchical method.

The Hierarchical Method

This method allows the user to define the relevance of each objective, and gives them the flexibility to decide on the interaction between these objectives. For example, one could say:

"I don't know what the best possible costs are, but I'm willing to have costs that are not more than 10% worse than those, using as few indexes as possible."

So, how does this work?

First, the user ranks the objectives by importance. In this example, this would be:

Minimize the scan costs
Minimize the number of indexes

Second, the user defines the interaction between the objectives (we call this the tolerance parameter). Tolerance is a measure of the leeway allowed between the best solution found for an objective, and the value imposed for that objective when the next one in line is optimized. In the example above, the tolerance between the two objectives has been set to 10%. Let's see how this works in practice.

The first thing to do is to solve the problem while only considering the most important objective (minimizing the scan costs):

model = build_model()
model.minimize(sum(model.scan_cost))
solve_model(model)

Status      OPTIMAL
Objective   50.0
Costs       50
Indexes     4

The best possible scan costs in this case have a value of 50. We have chosen a tolerance of 10% for this objective. This means that any subsequent solution is not allowed to be more than 10% worse than the value of 50 found here (in other words, not worse than 55). We build a fresh model and add a constraint enforcing this, before solving the problem while only considering the next objective in line (minimizing the number of indexes):

model = build_model()

# Constraint enforcing the tolerance between the two objectives
model.add(sum(model.scan_cost) <= 55)

# Optimize for the current objective
model.minimize(sum(model.x))

# Solve this new model
solve_model(model)

Status      OPTIMAL
Objective   2.0
Costs       55
Indexes     2

This new solution is interesting, as our costs are not too far off from the best that they could be, and we only use two out of the four indexes.

Additional Features

The hierarchical method of the previous section can easily scale to additional optimization criteria. All it would take would be to add new objectives to the list, and define a tolerance parameter between these new objectives. These criteria don't necessarily have to be viewed as objectives, but can directly be used as constraints. For instance, suppose that we have a tight resource budget (storage space, I/O, etc) that can accommodate a single index, no matter what. We could do:

model = build_model()

# We can only use one index
model.add(sum(model.x) == 1)

# Minimize the scan costs
model.minimize(sum(model.scan_cost))

# Solve the model
solve_model(model)

Status      OPTIMAL
Objective   76.0
Costs       76
Indexes     1

There are many other ways to expand this model to fit particular use cases.

Learn more about Constraint Programming in pganalyze

At pganalyze, we apply constraint programming to index selection, and have expanded the model by adding goals for Postgres-specific concepts such as Update Overhead that can be introduced by indexes, since an index blocks the Postgres HOT optimization.

If you want to learn more about the CP model employed by pganalyze to optimize for the selection of indexes, we recommend watching this webinar recording (56min) that goes into details on how it works:

In Conclusion

Index selection is a real-world example of a problem well-suited for constraint programming. Thinking through hundreds of possible index combinations on a given table makes manual analysis an impossible task. A custom CP model presents a better way, by stating the intent of what should be optimized, and letting a solver pick the optimal solution.

In this article we’ve shown what we’ve learned when building such a model, like the difficulties stemming from multi-objective optimization, and described the approach that we use to mitigate these, as well as discussed the possibility of adding certain optimization criteria directly as constraints.

If you want to use this approach in practice, we recommend taking a look at the pganalyze Index Advisor, and our free 14-day trial.