PostgreSQL Partitioning in Django
PostgreSQL 10 introduced declarative partitioning to improve performance for very large database tables. You will typically start to see performance benefits with tables of 1 million or more records, but the added complexity usually doesn’t pay off unless you’re dealing with hundreds of gigabytes of data.
Though there are several advantages to partitioning, it requires more tables, which can become cumbersome to work with, especially if you change your data structure in the future. Please note: If you are just starting out with a small database, you probably don't need partitioning.
That said, if you think you have a legitimate reason to partition your Postgres database and you want to manage it with Django, this article is for you.
What is Database Partitioning?
You may hear partitioning and think it is similar to sharding, where a database or table is spread across several different nodes. In fact, partitioning in PostgreSQL splits a single table into several smaller tables, all on the same node. Partitioning lets you organize the data into subsets that are easier for the query planner to traverse, which can vastly increase the speed of lookups, deletes, and inserts.
To that end, the way you choose to partition the data should reflect the way you want to access your data. In other words, if you are frequently accessing records based on when they were created, you should probably partition based on creation date. If you're regularly grabbing subsets of data based on a region or country, you may want to partition based on the records’ locations.
There are three types of partitioning supported by PostgreSQL:
List Partitioning in Postgres
List Partitioning allows you to explicitly state which values you would like to put into each partition. For example, you could partition a table of North American climate data by country with a United States, Canada, and Mexico partition. Since you can create partitions of a partition, you could further split these tables up by state or province.
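As a sketch, the climate table described above might look like this (the table and column names are invented for illustration):

CREATE TABLE climate_readings (
    country text NOT NULL,
    reading numeric NOT NULL
) PARTITION BY LIST (country);

CREATE TABLE climate_readings_us PARTITION OF climate_readings
    FOR VALUES IN ('United States');
CREATE TABLE climate_readings_ca PARTITION OF climate_readings
    FOR VALUES IN ('Canada');
CREATE TABLE climate_readings_mx PARTITION OF climate_readings
    FOR VALUES IN ('Mexico');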
Range Partitioning in Postgres
Range Partitions are the most useful and the kind I’ll use most in this tutorial. They allow you to specify partitions based on a range of numbers or dates. A table for storing measurements on an hourly basis might be partitioned by date and time. This would make looking up new measurements or deleting older measurements much faster.
Hash Partitions in Postgres
Hash Partitions split data by specifying a modulus and a remainder for each partition. Each partition will hold the rows for which the hash value of the partition key divided by the specified modulus will produce the specified remainder. This comes in handy if there isn't a clear way to organize the data, or you want a pseudo-random breakdown of your data.
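For example, a table of measurements could be split pseudo-randomly into four buckets by hashing its id (the names here are invented for illustration):

CREATE TABLE measurements (
    id BIGSERIAL PRIMARY KEY,
    value numeric NOT NULL
) PARTITION BY HASH (id);

CREATE TABLE measurements_p0 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE measurements_p1 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE measurements_p2 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE measurements_p3 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);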
How does PostgreSQL partitioning work?
One important thing to understand about a partitioned table is that the partitions themselves are also tables. They are created individually, and you can query them separately, though you will rarely need to.
The other thing to understand is that the partitioned table - the table you will split into smaller tables - doesn't hold any data. It exists as a parent to the partitions and a blueprint for the table schema.
Creating a partitioned table in PostgreSQL
The best way to understand partitions and see some of their benefits is to consider an example. Start by setting up a new people table, which you’ll compare to the partitioned table:
CREATE TABLE people (
    id BIGSERIAL PRIMARY KEY,
    full_name text NOT NULL,
    birth_date date NOT NULL
);
Now create another table with the same columns but partitioned:
CREATE TABLE people_partitioned (
    id BIGSERIAL,
    full_name text NOT NULL,
    birth_date date NOT NULL,
    PRIMARY KEY (id, birth_date)
) PARTITION BY RANGE (birth_date);
Here, we've created a RANGE partition that uses birth dates to delimit the records in each partition. You could just as easily do this with created_on timestamps or an integer column like a measurement value or record ID. Note that we had to define the primary key to include both the id column and the birth_date column we are partitioning by, since primary keys always need to include the partition column(s).
Remember, a partitioned table on its own doesn't contain any data. You need to create the tables that represent the partitions themselves. In this case, split the data up into chunks of fifty years:
CREATE TABLE people_partitioned_birthdays_1800_to_1850 PARTITION OF people_partitioned
    FOR VALUES FROM ('1800-01-01') TO ('1850-01-01');
CREATE TABLE people_partitioned_birthdays_1850_to_1900 PARTITION OF people_partitioned
    FOR VALUES FROM ('1850-01-01') TO ('1900-01-01');
CREATE TABLE people_partitioned_birthdays_1900_to_1950 PARTITION OF people_partitioned
    FOR VALUES FROM ('1900-01-01') TO ('1950-01-01');
CREATE TABLE people_partitioned_birthdays_1950_to_2000 PARTITION OF people_partitioned
    FOR VALUES FROM ('1950-01-01') TO ('2000-01-01');
CREATE TABLE people_partitioned_birthdays_2000_to_2050 PARTITION OF people_partitioned
    FOR VALUES FROM ('2000-01-01') TO ('2050-01-01');
Each partition table is declared a PARTITION OF the people_partitioned table and specifies the range of values it should hold. Keep in mind that the FROM bound is inclusive while the TO bound is exclusive, so a birth date of exactly 1850-01-01 lands in the 1850-to-1900 partition and the ranges never overlap. It's best to give the tables descriptive names.
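With only these five partitions, inserting a row whose birth_date falls outside 1800 through 2049 will fail with an error. Since Postgres 11, you can also add a DEFAULT partition to catch any rows that don't fit an existing partition (the table name here is just a convention):

CREATE TABLE people_partitioned_default PARTITION OF people_partitioned DEFAULT;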
Now, you can insert data into the people_partitioned table:
INSERT INTO people_partitioned (full_name, birth_date) VALUES ('Bob Sponge', '2000-08-21');
If you query the people_partitioned table, you’ll get the data you just inserted:
SELECT full_name, birth_date FROM people_partitioned;

 full_name  | birth_date
------------+------------
 Bob Sponge | 2000-08-21
To ensure the record went into the right partition, query the partition table directly:
SELECT full_name, birth_date FROM people_partitioned_birthdays_2000_to_2050;

 full_name  | birth_date
------------+------------
 Bob Sponge | 2000-08-21
As you can see, records are stored in the individual tables but accessible through the top-level partitioned table as well. This makes accessing the data relatively straightforward as you don’t have to keep track of which data is in each partition.
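If you're ever unsure which partition a row ended up in, the tableoid system column will tell you without querying each partition by hand:

SELECT tableoid::regclass AS partition, full_name, birth_date
FROM people_partitioned;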
Comparing Partitioned Postgres Table Performance with Python and Faker
Next, I used Python and Faker to populate each table with ten million rows of random data. To compare the performance of the two tables, run a SELECT query for everyone born after 1901 and before 1920:
SELECT * FROM people WHERE EXTRACT(year FROM birth_date) > 1901 AND EXTRACT(year FROM birth_date) < 1920;
The query returned 1,313,997 rows of data. The unpartitioned table ran the query in 4.109 seconds, while the partitioned table returned the exact same rows in 2.878 seconds, a difference of 1.23 seconds.
This dataset is relatively small compared to what you would typically see in a partitioned database, but if you had to run a similar query on 500 million rows, you can see how partitioning could make a big difference.
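One caveat: because the query wraps birth_date in EXTRACT, the planner can't match the condition against the partition bounds, so every partition is still scanned. Filtering on the column directly lets Postgres prune partitions that can't contain matches, which you can verify with EXPLAIN. An equivalent, prunable version of the query:

EXPLAIN SELECT * FROM people_partitioned
WHERE birth_date >= '1902-01-01' AND birth_date < '1920-01-01';
-- The plan should only scan people_partitioned_birthdays_1900_to_1950.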
To continue our tutorial, next, delete everybody born in 1990 from each table:

DELETE FROM people WHERE EXTRACT(year FROM birth_date) = 1990;
DELETE FROM people_partitioned WHERE EXTRACT(year FROM birth_date) = 1990;
In this case, each table deleted 73,015 rows. The non-partitioned table did it in 5.431 seconds, and the partitioned table finished deleting the same rows in 3.688 seconds, 1.74 seconds faster.
A great use case for partitioning is data that accumulates quickly in large quantities, like up-to-the-minute weather data. This data is very relevant near the time it is collected but becomes much less useful a week later. Because the partitions are just tables, you can simply drop the ones that are no longer relevant, making deletes even faster.
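For instance, retiring the oldest partition from the example above is a single, nearly instant DDL statement instead of a row-by-row DELETE:

-- Remove an entire partition and all of its rows in one step:
DROP TABLE people_partitioned_birthdays_1800_to_1850;

-- Or detach it first to keep the data around as a standalone table:
ALTER TABLE people_partitioned DETACH PARTITION people_partitioned_birthdays_1800_to_1850;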
Postgres Data Partitioning in Django
Django's ORM doesn't have built-in support for partitioned tables, so if you want to use partitions in your application, it's going to take a little extra work.
One way to use partitions is to roll your own migrations that run raw SQL. This will work, but it means you're going to have to manually manage the migrations for all changes you make to the table in the future.
Another option is to use a package called django-postgres-extra, which offers support for several PostgreSQL features that are not built into Django’s ORM, such as TRUNCATE TABLE and table partitioning.
After you install the package, add it to your installed apps:
INSTALLED_APPS = [
    ...
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'psqlextra',
    ...
]
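django-postgres-extra also requires swapping in its database backend, which wraps Django's regular Postgres backend so the package can manage partitioned tables in migrations. A minimal sketch, with placeholder connection details:

DATABASES = {
    'default': {
        'ENGINE': 'psqlextra.backend',  # wraps the standard Postgres backend
        'NAME': 'mydatabase',           # placeholder connection settings
        'USER': 'myuser',
        'PASSWORD': 'mypassword',
        'HOST': 'localhost',
        'PORT': '5432',
    }
}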
Next, set your partitioned model to inherit from PostgresPartitionedModel, provided by psqlextra. You'll also need to add a PartitioningMeta class to define what kind of partitioning you would like to use (range, list, or hash) and the column you'd like to partition by:
from django.db import models
from psqlextra.types import PostgresPartitioningMethod
from psqlextra.models import PostgresPartitionedModel


class Person(PostgresPartitionedModel):
    class PartitioningMeta:
        method = PostgresPartitioningMethod.RANGE
        key = ["birth_date"]

    full_name = models.TextField()
    birth_date = models.DateField()
Create the migration with python manage.py pgmakemigrations. You should get a file that looks something like this:
# Generated by Django 3.1.2 on 2020-10-13 23:34

from django.db import migrations, models
import psqlextra.backend.migrations.operations.add_default_partition
import psqlextra.backend.migrations.operations.create_partitioned_model
import psqlextra.manager.manager
import psqlextra.models.partitioned
import psqlextra.types


class Migration(migrations.Migration):

    initial = True

    dependencies = [
    ]

    operations = [
        psqlextra.backend.migrations.operations.create_partitioned_model.PostgresCreatePartitionedModel(
            name='Person',
            fields=[
                ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
                ('full_name', models.TextField()),
                ('birth_date', models.DateField()),
            ],
            options={
                'abstract': False,
                'base_manager_name': 'objects',
            },
            partitioning_options={
                'method': psqlextra.types.PostgresPartitioningMethod['RANGE'],
                'key': ['birth_date'],
            },
            bases=(psqlextra.models.partitioned.PostgresPartitionedModel,),
            managers=[
                ('objects', psqlextra.manager.manager.PostgresManager()),
            ],
        ),
        psqlextra.backend.migrations.operations.add_default_partition.PostgresAddDefaultPartition(
            model_name='Person',
            name='default',
        ),
    ]
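Apply it as you would any other migration with python manage.py migrate.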
Next, create some empty migration files, one for each partition. You can create an empty migration with python manage.py makemigrations --empty yourappname. Then, use django-postgres-extra to set up the migrations:
from django.db import migrations

from psqlextra.backend.migrations.operations import PostgresAddRangePartition


class Migration(migrations.Migration):

    dependencies = [
        ('people', '0001_initial'),
    ]

    operations = [
        PostgresAddRangePartition(
            model_name="person",
            name="people_partitioned_birthdays_1800_to_1850",
            from_values='1800-01-01',
            to_values='1850-01-01',
        ),
    ]
Again, you'll need to create one of these for every partition you need. To get our example from the section above to work, I would need five migrations in addition to the one created for the model.
Creating a migration to delete one of your partitions is basically the same; since these are range partitions, use the matching range delete operation:

from django.db import migrations

from psqlextra.backend.migrations.operations import PostgresDeleteRangePartition


class Migration(migrations.Migration):

    operations = [
        PostgresDeleteRangePartition(
            model_name="person",
            name="people_partitioned_birthdays_1800_to_1850",
        ),
    ]
Now, when you use the Django model, the data will be stored across the partitions, but the ORM will work as you expect it to for any Django application.
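For example, assuming the app is named people as in the migrations above, everyday ORM calls need nothing partition-specific:

from people.models import Person

# Postgres routes the insert to the partition whose range contains the birth_date.
Person.objects.create(full_name="Bob Sponge", birth_date="2000-08-21")

# Reads go through the parent table; range filters on the partition key
# let Postgres prune partitions that can't match.
born_in_2000 = Person.objects.filter(
    birth_date__gte="2000-01-01",
    birth_date__lt="2001-01-01",
)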
Conclusion
In this article, you learned about the different types of partitions available in PostgreSQL. You saw that a partition is just a table that links to a parent table and helps organize data so that it can be accessed faster. Finally, you saw that even though Django's ORM does not natively support partitioning, it is possible to use the feature with the help of the django-postgres-extra package. It is also possible to create your own migrations and set it up that way.
No matter how you decide to go about it, it’s important to remember that you shouldn't use partitions unless you are sure it's the right move for your project. Partitions make individual tables smaller, but they also give you more tables to manage and more partitions for the query planner to consider. They are typically best used for tables larger than 100 GB.
About the Author
Josh is a former educator turned developer with a proven ability to learn quickly and adapt to different roles. In 2018 he changed careers from education to tech and has been excited to find that his communication and presentation skills have transferred over to his new technical career. He's always looking for a new challenge and a dedicated team to collaborate with.