Enforcing Zero Downtime Django Migrations

When you have a production system with CI/CD letting engineers deploy multiple times a day, you do not want to have downtime everytime you do so, yet many running Django applications let this happen whenever they run migrations. In this piece we delve into why, then how it can be mitigated.

A Django Deploy Is Not Atomic

Django is a web framework that encompasses many things, but for this piece it is most important to know that it includes an ORM (object-relational mapping) that links the code (Python) and the relational database (e.g. SQLite or Postgres).

This presents an interesting conundrum when releasing a new version of your Django app if it includes migration (e.g. database schema updates). We cannot change over the Django app, i.e. the source code, and apply the migration to the relational database at the exact same time unless you do some significant infrastructure shenanigans that probably will not scale.

So we’re going to have a time when the Django app and the database are running on different “versions”, which I call the transition state, as opposed to the more common steady state where both are on the same version. There are two ways to sequence the transition assuming we are moving from version A to version B:

Migrate the database before the app, meaning database version B will have to run against app version A.
Migrate the database after the app, meaning database version A will have to run against app version B.

It is of course possible to not have strict sequencing (e.g. run database migration in parallel, not before or after), however as we will see later on, this makes life very difficult.

Why Is This A Problem?

Imagine we have version A which has a model Widget and in version B we wish to add a field/column widget_description which is not allowed to be None/NULL and with default unknown. Assuming Widgets are being read, update and inserted all the time, this is going to lead to server errors, perceived as downtime, regardless of what we do.

Let’s consider if we migrate the database before. This means database version B has to run against app version A. If a row is inserted, this will fail because widget_description will be unset and so default to NULL but this is not a valid value for the field. You might counter that a default is set, however this is implemented at code level, not the database level (see open ticket 470 for the Django project, opened 17 years ago) and so this does not help you.

Let’s consider if we migrate the database after. This means database version A will have to run against app version B. This runs into a less subtle problem. When version B tries to interact with the table for Widget, this will almost always fail, because it will reference the column widget_description that does not exist as Django for example by default gets all the fields/columns of a model.

What To Do Instead

To make migrating before work, one can do a three-step process:

Deploy app and database with widget_description field as nullable in version B.
Back-fill all the NULL data.
Deploy app and database with widget_description non-nullable in version C.

(The back-filling could be a migration in step 3.)

Making migrating after work can be done, but would be quite disruptive:

Deploy app and database with widget_description field in version B but do shenanigans to work around the problem
Deploy app and database with widget_description field in version C with no shenanigans

For the shenanigans you might try deferring to load the field (defer("widget_description")), when updating specify the fields (.save(update_fields=(...))), however I am unclear how you can get around this problem for creation, and it seems hard to get right. There is however a generic way suggested by the Django documentation, using an unmanaged model:

class TemporaryWidget(models.Model):
    ...

    class Meta:
        ...
        
        managed = False
        db_table = 'app_widget'


class Widget(TemporaryWidget):
    widget_description = models.TextField(max_length=10)

    class Meta:
        db_table = 'app_widget'


TemporaryWidget = Widget

As before, this would require to backfill the NULLs and have a version C where one starts using the field in earnest.

For some operations migrating before naturally makes more sense (adding things), and for others migrating after (removing/reducing things), and so we support both.

The Solution: The Hammer

Sometimes these problems are superficial, if you are working to build a feature in many deploys and the models within them have no objects created then the problems above disappear, or if you are making an MVP. However, if you are working on a live production system that is serving a number of requests measured per second, not minute, hours or days you probably should care about this.

The problem here is hard to generalize, and so our solution was not to just write documentation for developers on how to get around this, but provide a way to have your CI system check for problems. We call this The Hammer.

The Hammer, would, if you are migrating the database before the app, it would test database version B against app version A and vice versa if migrating the database after the app.

Technically Niggly Bits

Conceptually this is easy to do, however there are some practical annoyances to work out. Let’s assume in this section that we are migrating the database before the app.

To create the correct test databases, what we have to do is to run pytest on the Django app version B:

Specify a single test that is decorated with @pytest.mark.django_db
--reuse-db so that the database is not removed at the end of the test (--create-db should be added if the test database might already exist) There might be an easier way to achieve this by introspecting how pytest creates the database, but the advantage is that this is robust to internal pytest changes.

Now all that needs doing is to run pytest on the Django app version A with --reuse-db so that it does not change the database.

So far so good, but how do we achieve this when the app is containerized and this is running in CI/CD, say CircleCI? You will need to run two containers, and coordinate between them, but CircleCI has a suggestion to use dockerize which I combined with netcat:

jobs:
  test-transition-state:
    docker:
      - image: my_django_app:<< pipeline.git.revision >>
      - image: my_django_app:<< pipeline.git.base_revision >>
        command: pytest --reuse-db my_special_test && nc -l 14058
      - image: postgres
    parallelism: 4
    steps:
      - run:
          name: install dockerize
          command: wget https://github.com/jwilder/dockerize/releases/download/$DOCKERIZE_VERSION/dockerize-linux-amd64-$DOCKERIZE_VERSION.tar.gz && sudo tar -C /usr/local/bin -xzvf dockerize-linux-amd64-$DOCKERIZE_VERSION.tar.gz && rm dockerize-linux-amd64-$DOCKERIZE_VERSION.tar.gz
          environment:
            DOCKERIZE_VERSION: v0.6.1
      - run:
          name: Wait for test databases to be created
          command: dockerize -wait tcp://localhost:14058 -timeout 5m
      - run:
          name: Run tests
          command: |
            set -e
            TEST_FILES=$(circleci tests glob "widget_server/**/test_*.py" | circleci tests split --split-by=timings)
            mkdir -p test-results
            pytest --verbose --junitxml=test-results/junit.xml $TEST_FILES
      - store_test_results:
          path: test-results
      - store_artifacts:
          path: test-results

The developer should choose between migrating before or after, to do so, one wants to pick something that is immutable and for the workflow intended, does not change between the squashed commit and the last commit on the PR. The thing I settled on was the branch name, and for a commit on the main branch HEAD~2 (or HEAD~1 if it does not exist, and HEAD if that does not (it is the initial commit)). Yes, this breaks if you have multi-branch merges, branches were merged the other way around et cetera, but they are not in our workflow.

We were already using dynamic configuration and so it was relatively easy to get these SHA1s and inject them as parameters (we actually use pipeline.app_revision and pipeline.db_revision to keep the complexity out of the resulting workflow). It is unclear to me whether pipeline.git.base_revision would work for HEAD~2/HEAD~1, and people online have reported many problems, so I did not use it.

Downsides of The Hammer

The Hammer, as one might suspect from the name, is pretty crude, and so it has some pitfalls:

If the migration(s) are not atomic, it does not test any transition state within them
It does not give diagnostic messages to the developer on what they should do instead, so documentation still necessary
It will not catch problems if you do not actually deploy on every commit in your main branch

Final Thoughts

Alternatively, one can trust your software engineers. Personally, I think this is important in any organization, but on the other hand, I think automating what can be automated is important, so even if you don’t make this failure a blocker for merger, it is a great tool to make the developer (and reviewer) aware of the problem.

Apparently there is a tool that can, conceptually, do some of the things we need at the database level (in that case, Postgres I think), but unfortunately I do not know the name! Please write in if you know. That seems cool, however for Django the big selling point is meant to be the integration of database management and the code, and I think adding something like this will be like worst of both worlds, but I am happy to hear thoughts.

This is also why it is so hard! It is because the Django application and the relational database are (too?) tightly coupled. If you compare this to the situation of static files in Django and other web frameworks, they have a simple way around this: append unique identifiers (hash normally) to each instance of a file, but that does not work for something like a relational database where you have to read, write and update. It could potentially work with an ORM fundamentally, but I think one would have to rethink how it is architected from the ground up to support this.