
# The Case for a Metadata Table for Database Migrations

### Motivation

As features are added, changed or deleted, the data model used by an application usually changes as well. For most database-backed applications, this means migrations are needed.

With this in mind, the fundamental goal of database migrations can be stated simply:

> The current database schema should match the currently running code.

However, achieving this goal is not so easy. A typical database migration might be adding an index to improve performance of a common query:

```sql
CREATE INDEX
  idx_books_publish_date
ON
  books (publish_date DESC, title ASC);
```


or adding a new column to support a new feature set:

```sql
ALTER TABLE
  books
ADD COLUMN
  latest_edition TIMESTAMP;
```


In PostgreSQL, migrations like these are easy to write in an idempotent way:

```sql
CREATE INDEX IF NOT EXISTS
  idx_books_publish_date
ON
  books (publish_date DESC, title ASC);

ALTER TABLE
  books
ADD COLUMN IF NOT EXISTS
  latest_edition TIMESTAMP;
```


But not all migrations are so simple. Some migrations involve backfilling information:

```sql
UPDATE
  books
SET
  latest_edition = publish_date;
```


and running a migration like that multiple times would be problematic (e.g. if the data has already been backfilled and the new code allows publish_date and latest_edition to diverge).

This is where a metadata table for migrations comes in. By keeping a clear record of which migrations were run (and when), we can ensure each migration runs exactly once during development and, more crucially, in production databases.
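To make the idea concrete, here is a minimal sketch of such a runner in Python with sqlite3. This is not golembic's actual implementation (golembic is written in Go), and the revisions and statements are illustrative only; the point is just the exactly-once mechanic: consult the metadata table, skip anything recorded, and record each migration as it runs.

```python
import sqlite3

# Ordered list of (revision, SQL) pairs; the revisions and statements here
# are illustrative, not golembic's actual migrations.
MIGRATIONS = [
    ("0001", "CREATE TABLE books (title TEXT, publish_date TEXT)"),
    ("0002", "ALTER TABLE books ADD COLUMN latest_edition TEXT"),
    ("0003", "UPDATE books SET latest_edition = publish_date"),
]

def run_migrations(conn):
    # The metadata table: one row per migration that has been applied.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS migrations ("
        " revision TEXT PRIMARY KEY,"
        " created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    applied = {row[0] for row in conn.execute("SELECT revision FROM migrations")}
    for revision, sql in MIGRATIONS:
        if revision in applied:
            continue  # recorded in the metadata table; never run it again
        conn.execute(sql)
        conn.execute("INSERT INTO migrations (revision) VALUES (?)", (revision,))
    conn.commit()

conn = sqlite3.connect(":memory:")
run_migrations(conn)
run_migrations(conn)  # a second invocation is a no-op
```

Note that the backfill (revision 0003) runs exactly once even though nothing about the UPDATE statement itself is idempotent; the metadata table carries that guarantee for every migration uniformly.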

### Guards

One alternative to a migrations table is to introduce queries that check whether a migration has already been performed and guard against running it twice.

Some arguments against this strategy:

• Doing this for every migration introduced in the codebase is a lot of extra work
• This introduces an opportunity to make mistakes by writing buggy guards; for teams with a mix of experienced and inexperienced database users, writing guards correctly represents a real challenge
• In order to do "exactly once" checks on some migrations (e.g. the backfill one above), it may be necessary to introduce phantom columns to the database as breadcrumbs indicating the migration occurred:

  ```sql
  UPDATE
    books
  SET
    latest_edition = publish_date,
    latest_edition_backfill = TRUE
  WHERE
    latest_edition_backfill = FALSE;
  ```

• Application invariants may change over time and invalidate guards that were historically correct
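For contrast, here is a sketch of what a hand-written guard might look like (Python with sqlite3; the helper name and check are hypothetical, not from any migration tool). The schema migration can be guarded by inspecting the live schema, but the backfill cannot:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, publish_date TEXT)")

def column_exists(conn, table, column):
    # Hand-written guard: inspect the live schema before migrating.
    return any(row[1] == column for row in conn.execute(f"PRAGMA table_info({table})"))

# Guarded schema migration: safe to run any number of times.
if not column_exists(conn, "books", "latest_edition"):
    conn.execute("ALTER TABLE books ADD COLUMN latest_edition TEXT")

# The backfill has no natural guard: the schema looks identical before and
# after it runs, so a schema check alone cannot tell whether it happened.
conn.execute("UPDATE books SET latest_edition = publish_date")
```

Every migration needs its own bespoke check like column_exists above, and each one is an opportunity for the bugs and invalidated invariants described in the list.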

### In Practice

It's useful to walk through an example to understand how a migrations table is constructed and how it fits into development and deployment. I'll use the example migrations from my golembic tool.

To get started, the development database needs to be running. In this stage, there should be no tables yet because no migrations have run:

```
$ make start-postgres
Network dev-network-golembic created.
Container dev-postgres-golembic started on port 18426.
...
$ docker ps
CONTAINER ID   IMAGE                  COMMAND                  CREATED          STATUS          PORTS                     NAMES
480e5fce3e10   postgres:13.1-alpine   "docker-entrypoint.s…"   54 seconds ago   Up 52 seconds   0.0.0.0:18426->5432/tcp   dev-postgres-golembic
$ make psql
Running psql against port 18426
...
golembic=> \dt
Did not find any relations.
golembic=> \q
```


Using golembic, we can run the migrations in the development database:

```
$ make run-postgres-cmd
Applying c9b52448285b: Create users table
Applying f1be62155239: Seed data in users table
Applying dce8812d7b6f: Add city column to users table
Applying 0430566018cc: Rename the root user [MILESTONE]
Applying 0501ccd1d98c: Add index on user emails (concurrently)
Applying e2d4eecb1841: Create books table
Applying 432f690fcbda: Create movies table
```

At this point, the migrations will have created the three tables needed by the application: books, movies and users. Observe via make psql:

```
golembic=> \dt
                 List of relations
 Schema |        Name         | Type  |     Owner
--------+---------------------+-------+----------------
 public | books               | table | golembic_admin
 public | golembic_migrations | table | golembic_admin
 public | movies              | table | golembic_admin
 public | users               | table | golembic_admin
(4 rows)
```

In addition to the application tables from the migrations, there is also a new metadata table golembic_migrations. This contains a record of the migrations that have already been run:

```
golembic=> SELECT * FROM golembic_migrations;
 serial_id |   revision   |   previous   |          created_at
-----------+--------------+--------------+-------------------------------
         0 | c9b52448285b |              | 2021-10-29 03:57:57.92535+00
         1 | f1be62155239 | c9b52448285b | 2021-10-29 03:57:57.936784+00
         2 | dce8812d7b6f | f1be62155239 | 2021-10-29 03:57:57.945764+00
         3 | 0430566018cc | dce8812d7b6f | 2021-10-29 03:57:57.954536+00
         4 | 0501ccd1d98c | 0430566018cc | 2021-10-29 03:57:57.963776+00
         5 | e2d4eecb1841 | 0501ccd1d98c | 2021-10-29 03:57:57.986174+00
         6 | 432f690fcbda | e2d4eecb1841 | 2021-10-29 03:57:57.9968+00
(7 rows)

golembic=> \q
```

Using this record, the next time the migrations command runs, the records in golembic_migrations can be used to determine that no new migrations need to run:

```
$ make run-postgres-cmd
No migrations to run; latest revision: 432f690fcbda
```
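The revision and previous columns form a linked list, so a runner can compare the latest applied revision against the full sequence known to the code to decide what is still pending. A rough sketch of that logic (Python; golembic itself is written in Go, so this is illustrative only):

```python
# The full migration sequence known to the application code, in order;
# revisions copied from the golembic example output above.
sequence = [
    "c9b52448285b", "f1be62155239", "dce8812d7b6f", "0430566018cc",
    "0501ccd1d98c", "e2d4eecb1841", "432f690fcbda",
]

def pending(sequence, applied):
    """Return the migrations not yet recorded in the metadata table."""
    if not applied:
        return sequence  # fresh database: everything needs to run
    latest = applied[-1]  # e.g. the row with the highest serial_id
    # Everything after the latest applied revision still needs to run.
    return sequence[sequence.index(latest) + 1:]

print(pending(sequence, []))        # a fresh database needs all seven
print(pending(sequence, sequence))  # fully migrated: nothing to run
```

When pending() returns an empty list, the runner can report "No migrations to run" exactly as in the output above.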


### Prior Art

Although it may seem like an advertisement for golembic, the point of this post is to make the case for a migrations table. To see that golembic isn't alone, note that TypeORM takes a similar approach. For example:

```
dunder_mifflin=> SELECT * FROM migrations;
 id |   timestamp   |             name
----+---------------+-------------------------------
  1 | 1568674788742 | initial1568674788742
  2 | 1571783618350 | recipient1571783618350
  3 | 1572313444993 | recipientEnum1572313444993
  4 | 1572479127851 | ticket1572479127851
  5 | 1572990972108 | ticketIndex1572990972108
  6 | 1573594420208 | uniqueConstraint1573594420208
(6 rows)

dunder_mifflin=> \q
```


(There are lots of ORMs and migration tools other than TypeORM that take the same approach; I just had this example handy.)