When you have a production system with CI/CD letting engineers deploy multiple times a day, you do not want to have downtime every time you do so, yet many running Django applications let this happen whenever they run migrations. In this piece we delve into why, and then into how it can be mitigated.
A Django Deploy Is Not Atomic
Django is a web framework that encompasses many things, but for this piece it is most important to know that it includes an ORM (object-relational mapping) that links the code (Python) and the relational database (e.g. SQLite or Postgres).
This presents an interesting conundrum when releasing a new version of your Django app if it includes migrations (i.e. database schema updates). We cannot change over the Django app, i.e. the source code, and apply the migration to the relational database at the exact same time, unless we do some significant infrastructure shenanigans that probably will not scale.
So we’re going to have a time when the Django app and the database are running on different “versions”, which I call the transition state, as opposed to the more common steady state where both are on the same version. There are two ways to sequence the transition assuming we are moving from version A to version B:
- Migrate the database before the app, meaning database version B will have to run against app version A.
- Migrate the database after the app, meaning database version A will have to run against app version B.
It is of course possible to not have strict sequencing (e.g. run the database migration in parallel, neither before nor after); however, as we will see later on, this makes life very difficult.
Why Is This A Problem?
Imagine we have version A, which has a model `Widget`, and in version B we wish to add a field/column `widget_description` which is not allowed to be `None`/`NULL` and has a default of `unknown`. Assuming `Widget`s are being read, updated and inserted all the time, this is going to lead to server errors, perceived as downtime, regardless of what we do.
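To make this concrete, here is a minimal sketch of what the version B model might look like; the field types and lengths are assumptions for illustration, not from a real codebase.

```python
# models.py -- an illustrative sketch, not taken from a real project.
from django.db import models


class Widget(models.Model):
    name = models.CharField(max_length=100)  # present in both version A and B

    # Added in version B: non-nullable (null=False is Django's default for
    # fields) with a default supplied only at the code level.
    widget_description = models.CharField(max_length=200, default="unknown")
```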
Let’s consider if we migrate the database before. This means database version B has to run against app version A. If a row is inserted, this will fail because `widget_description` will be unset and so default to `NULL`, but this is not a valid value for the field. You might counter that a default is set; however, this is implemented at the code level, not the database level (see open ticket 470 for the Django project, opened 17 years ago), and so this does not help you.
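As a sketch, using the illustrative model above, this is the failure mode when database version B runs against app version A (the `myapp` module name is hypothetical):

```python
# App version A (model without widget_description) talking to database
# version B (widget_description is NOT NULL and has no database-level default).
from myapp.models import Widget  # hypothetical app name

# The INSERT Django generates only lists the columns the version-A model knows
# about, so widget_description falls back to the database default -- which is
# NULL -- and the NOT NULL constraint rejects the row.
Widget.objects.create(name="sprocket")  # raises django.db.utils.IntegrityError
```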
Let’s consider if we migrate the database after. This means database version A will have to run against app version B. This runs into a less subtle problem: when version B tries to interact with the table for `Widget`, it will almost always fail, because it will reference the column `widget_description`, which does not exist, since Django by default fetches all the fields/columns of a model.
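Again as a sketch, the corresponding failure when app version B runs against database version A:

```python
# App version B (model with widget_description) talking to database version A
# (the column has not been added yet).
from myapp.models import Widget  # hypothetical app name

# By default Django selects every field on the model, so the generated SQL
# references widget_description, a column that does not yet exist.
Widget.objects.first()  # raises django.db.utils.ProgrammingError on Postgres
```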
What To Do Instead
To make migrating before work, one can do a three-step process:
- Deploy app and database with the `widget_description` field as nullable in version B.
- Back-fill all the `NULL` data.
- Deploy app and database with `widget_description` non-nullable in version C.
(The back-filling could be a migration in step 3.)
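As an illustration of step 2 (or of folding the back-fill into step 3), here is a minimal sketch written as a Django data migration; the app label `myapp` and the migration name it depends on are made up for the example.

```python
# A data migration that back-fills the NULLs left by the nullable version-B field.
from django.db import migrations


def backfill_widget_description(apps, schema_editor):
    # Use the historical model so the migration does not depend on current code.
    Widget = apps.get_model("myapp", "Widget")
    Widget.objects.filter(widget_description__isnull=True).update(
        widget_description="unknown"
    )


class Migration(migrations.Migration):
    dependencies = [("myapp", "0002_widget_widget_description")]  # illustrative
    operations = [
        migrations.RunPython(backfill_widget_description, migrations.RunPython.noop),
    ]
```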
Making migrating after work can be done, but would be quite disruptive:
- Deploy app and database with the `widget_description` field in version B, but do shenanigans to work around the problem.
- Deploy app and database with the `widget_description` field in version C with no shenanigans.
For the shenanigans you might try deferring loading of the field (`defer("widget_description")`) and specifying the fields when updating (`.save(update_fields=(...))`); however, I am unclear how you can get around this problem for creation, and it seems hard to get right. There is, however, a generic way suggested by the Django documentation, using an unmanaged model:
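The snippet from the documentation is not reproduced here, but as a rough sketch of the idea (under the assumption that the unmanaged model is a second model mapped onto the same table, with illustrative names throughout), version B might look something like this:

```python
# models.py at version B -- a rough sketch, not the documentation's exact example.
from django.db import models


class Widget(models.Model):
    # Gains the new field so the schema migration gets generated, but nothing
    # should query this model until database version B is actually in place.
    name = models.CharField(max_length=100)
    widget_description = models.CharField(max_length=200, null=True)


class WidgetDuringTransition(models.Model):
    # An unmanaged twin mapped onto the same table, without the new column.
    # Application code uses this model while database version A is still live,
    # so no query references widget_description before the column exists.
    name = models.CharField(max_length=100)

    class Meta:
        managed = False
        db_table = "myapp_widget"  # the table Widget maps to (illustrative)
```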
As before, this would require backfilling the `NULL`s and having a version C where one starts using the field in earnest.
For some operations migrating before naturally makes more sense (adding things), and for others migrating after (removing/reducing things), and so we support both.
The Solution: The Hammer
Sometimes these problems are superficial: if you are building a feature across many deploys and the models involved have no objects created yet, or if you are making an MVP, the problems above disappear. However, if you are working on a live production system that is serving a number of requests measured per second, not per minute, hour or day, you probably should care about this.
The problem here is hard to generalize, and so our solution was not to just write documentation for developers on how to get around this, but to provide a way to have your CI system check for problems. We call this The Hammer.
The Hammer works as follows: if you are migrating the database before the app, it tests database version B against app version A, and vice versa if you are migrating the database after the app.
Technically Niggly Bits
Conceptually this is easy to do, however there are some practical annoyances to work out. Let’s assume in this section that we are migrating the database before the app.
To create the correct test databases, what we have to do is to run `pytest` on the Django app version B:
- Specify a single test that is decorated with `@pytest.mark.django_db`.
- `--reuse-db` so that the database is not removed at the end of the test (`--create-db` should be added if the test database might already exist).
There might be an easier way to achieve this by introspecting how `pytest` creates the database, but the advantage is that this is robust to internal `pytest` changes.
Now all that needs doing is to run `pytest` on the Django app version A with `--reuse-db` so that it does not change the database.
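Putting the two runs together, here is a sketch of what this might look like; the test module path is made up, and the `pytest` invocations are shown as comments.

```python
# tests/test_hammer_smoke.py -- a minimal test whose only purpose is to make
# pytest-django create (and, with --reuse-db, keep) the migrated test database.
import pytest


@pytest.mark.django_db
def test_database_exists():
    # Any trivial assertion will do; collecting and running this test is what
    # forces the test database to be created and migrated.
    assert True


# On app version B, create the migrated test database and keep it around
# (add --create-db if the test database might already exist):
#   pytest tests/test_hammer_smoke.py --reuse-db
#
# On app version A, run the test suite against that database without
# recreating or destroying it:
#   pytest --reuse-db
```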
So far so good, but how do we achieve this when the app is containerized and this is running in CI/CD, say CircleCI? You will need to run two containers and coordinate between them; CircleCI has a suggestion to use `dockerize`, which I combined with `netcat`.
The developer should choose between migrating before or after. To do so, one wants to pick something that is immutable and, for the intended workflow, does not change between the squashed commit and the last commit on the PR. The thing I settled on was the branch name, and for a commit on the main branch `HEAD~2` (or `HEAD~1` if that does not exist, and `HEAD` if that does not either, i.e. it is the initial commit). Yes, this breaks if you have multi-branch merges, if branches were merged the other way around, et cetera, but those are not in our workflow.
We were already using dynamic configuration and so it was relatively easy to get these SHA1s and inject them as parameters (we actually use `pipeline.app_revision` and `pipeline.db_revision` to keep the complexity out of the resulting workflow). It is unclear to me whether `pipeline.git.base_revision` would work for `HEAD~2`/`HEAD~1`, and people online have reported many problems, so I did not use it.
Downsides of The Hammer
The Hammer, as one might suspect from the name, is pretty crude, and so it has some pitfalls:
- If the migration(s) are not atomic, it does not test any transition state within them
- It does not give diagnostic messages to the developer on what they should do instead, so documentation is still necessary
- It will not catch problems if you do not actually deploy on every commit in your main branch
Final Thoughts
Alternatively, you can trust your software engineers. Personally, I think this is important in any organization, but on the other hand, I think automating what can be automated is important, so even if you don’t make this failure a blocker for merging, it is a great tool to make the developer (and reviewer) aware of the problem.
Apparently there is a tool that can, conceptually, do some of the things we need at the database level (in that case, Postgres I think), but unfortunately I do not know the name! Please write in if you know. That seems cool; however, for Django the big selling point is meant to be the integration of database management and the code, and I think adding something like this would be the worst of both worlds, but I am happy to hear thoughts.
This is also why it is so hard! It is because the Django application and the relational database are (too?) tightly coupled. If you compare this to the situation of static files in Django and other web frameworks, they have a simple way around this: append unique identifiers (normally a hash) to each version of a file. But that does not work for something like a relational database where you have to read, write and update. It could potentially work with an ORM fundamentally, but I think one would have to rethink how it is architected from the ground up to support this.