Database Migration Rules Inside a Technical Spec

Database Migration Rules Inside a Technical Spec
Spec Coding Editorial Team · Spec-first engineering notes

Most production outages I have watched came from a migration that looked fine in review and melted the database at 2pm on a Tuesday. The fix is a technical spec that forces specific answers about expand-contract, rollback, and deploy order before anyone writes a single ALTER TABLE.

Published on 2026-03-01 · Updated 2026-05-06 · 8 min read · Author: Spec Coding Editorial Team · Review policy: Editorial Policy

Field note: the migration line I want before SQL exists

Before a migration script exists, I want one sentence about the rollback limit. Some migrations can be rolled back mechanically. Some can only be forward-fixed after data changes. That distinction belongs in the spec.

Migration rollback note:
- Expand phase: reversible by dropping nullable column.
- Backfill phase: not fully reversible after rows are transformed.
- Contract phase: old code remains compatible for 7 days.
- Stop trigger: lock wait above 5 seconds for 3 consecutive checks.

Every Migration Spec Must Declare a Phase Model

My default rule: no migration ships as a single step if the column, table, or constraint is touched by live production code. The spec must state the phase model explicitly: phase 1 adds new structure, phase 2 dual-writes or backfills, phase 3 removes the old structure. This is the expand-contract pattern, sometimes called parallel-change, and I reject any spec that tries to compress it into one deploy.

Concrete example: we renamed users.email to users.email_address last quarter. The naive spec was one migration and one code change. The shipped spec was three releases over nine days. Phase 1 added the new column and a trigger copying writes. Phase 2 switched reads to the new column behind a flag and backfilled historical rows. Phase 3 dropped the trigger, the old column, and the flag. Each phase was independently revertible.

Backward Compatibility Windows Must Be Named in Days

A migration spec that says "old code will continue to work" is not a spec. It is a wish. I require every spec to state how many days the old application code must remain compatible with the new schema, and the reverse. The answer is almost never zero.

The floor is one full deploy cycle plus one rollback window. If deploys take an hour and rollback is held for 24 hours, the window is at least 25 hours. If canaries run a week, the window is a week. The spec names this number so reviewers can check whether the phase plan respects it. A spec that drops a column the same day code stops writing to it gets sent back.

Online DDL Is a Decision, Not a Default

The spec must tell me, per statement, whether the DDL is safe online. On Postgres, adding a nullable column is free. Adding a NOT NULL column with a non-constant default to a 50M-row table rewrites the entire table under ACCESS EXCLUSIVE and takes the site down. The spec distinguishes these.

Large Tables Force Batched Backfills

If the table has more than roughly one million rows, the spec must describe the backfill strategy, not just mention one exists. I want batch size, cursor pagination key, expected runtime, lock behavior per batch, and whether the job is resumable after a crash.

Example from a recent spec: the 50M-row orders backfill used id-cursor pagination with 5,000-row batches, a 100ms sleep between batches, a checkpoint table recording the last processed id, and a kill switch readable from Redis. Expected runtime was 14 hours. The spec listed the replica lag threshold (10 seconds) that would auto-pause the job. Without this, the backfill hits replica lag and someone stops it by hand at 3am.

The Rollback Plan Is the Spec, Not an Appendix

I reject the migration spec if the rollback plan is a single line that says "revert the migration." Rollback is the primary artifact. For every phase, the spec tells me what reverting means, whether data written in the new shape is recoverable, and how long rollback remains available before the next phase makes it irreversible.

Some migrations are genuinely irreversible once data is written: a type change that loses precision, a dedup that deletes rows, a denormalization that overwrites. The spec must say so and name the point of no return. For these, I require a snapshot captured immediately before the irreversible step, with the restore procedure tested against the staging replica.

Deploy Order Dependency Is a First-Class Section

Half the migration incidents I have debugged came from the wrong deploy order. The spec must answer: does code deploy before the migration, after, or both across phases? For expand-contract the answer is always "both," but the spec lists each deploy individually and considers the failure mode of each order. If code deploys first and the migration fails, what breaks? If the migration deploys first and the code rollout stalls, what breaks? If both answers are missing, the plan is not ready.

Acceptance Criteria in Given/When/Then

I want migration acceptance criteria written in the same Given/When/Then format we use for features, because they are testable the same way.

- Given the phase 1 migration has run on staging with a prod-sized snapshot
  When the existing application code is deployed against the new schema
  Then all existing read and write paths succeed with zero error-rate delta

- Given the backfill job has processed all rows
  When we run the verification query comparing old and new columns
  Then zero mismatched rows are returned and the result is logged to the release ticket

- Given we are inside the backward compatibility window
  When we execute the rollback procedure for the current phase
  Then the application returns to the prior schema state within 10 minutes with no data loss

Tests Require Prod-Sized Data and Dry Runs

A migration tested only against an empty staging database is not tested. My rule: every spec names the dataset used for the dry run, and that dataset is within an order of magnitude of production row counts for the affected tables. The dry run is executed twice: once to measure duration and lock behavior, once to execute the documented rollback. Both runs are timestamped and attached to the release ticket. If the rollback dry run was skipped, the spec has not been tested, no matter how many times the forward migration ran.

Red Flags That Block Spec Review

These are the patterns I refuse to approve without a rewrite:

None of these are stylistic preferences. Each maps to a postmortem I have written. The spec is the cheapest place to catch them.

Spec Writing Block to Copy

Use this when a ticket sounds clear but still needs acceptance language. It forces the author to name the actor, trigger, result, and evidence.

Spec writing review block: Database Migration Rules Inside a Technical Spec

Decision to make:
- Write database migration rules into your spec, including expand-contract steps, compatibility windows, online DDL, and rollback plans.

Owner check:
- Product owner:
- Engineering owner:
- QA or operations reviewer:

Scope boundary:
- In scope:
- Out of scope:
- Assumption that still needs approval:

Acceptance evidence:
- Test or fixture:
- Log, metric, or screenshot:
- Manual review step:

Writing boundary: avoid vague verbs; every criterion needs a visible pass or fail signal.

Reviewer prompt:
- What would still be ambiguous to someone who missed the planning meeting?
- What evidence would make this safe enough to ship?

Flagship Use Path

This is one of the primary Spec Coding references for Database migration spec. Use it with a real ticket, pull request, or release review instead of treating it as background reading.

Flagship review path:
- Open this page during planning or review.
- Copy the relevant artifact into the work item.
- Replace example values with your system, owner, and failure mode.
- Block implementation if the evidence line is still blank.

Second-pass reviewer note: data safety needs timing

I checked that the article names time, lock risk, and reversibility. Migration advice becomes generic when it ignores how long the change runs and which phase cannot be undone.

Migration review:
- Estimate table size and backfill duration.
- Name the longest expected lock.
- Split expand, backfill, contract.
- State exactly when rollback stops being realistic.

Editorial Review Note

Reviewed Apr 29, 2026. This update added a reusable artifact, checked the article against the related topic hub, and tightened the next-step links so the page works as a practical reference rather than a standalone essay.

Keywords: expand-contract migration · online DDL · backward compatibility window · migration rollback plan · batched backfill · gh-ost · pt-online-schema-change

Topic Path

This article belongs to the AI Coding Governance track. Start with the hub, then use the checklist, template, or tool below on a real project.

Editorial Note

Last reviewed Apr 29, 2026: examples, internal links, and reusable review blocks were checked for practical specificity.