Database schema change: from risky ticket to migration packet

This case turns a vague "add account status" request into a phased schema change with backfill evidence, dual-read behavior, rollback limits, and clear AI coding boundaries.

RiskLocks, backfill, stale reads
BoundaryExpand, backfill, switch, cleanup
EvidenceCounts, plans, rollback notes

The request before the spec

Weak ticket

Add account_status to accounts.

We need active, paused, and closed accounts.
Update the API and make sure reports still work.

Spec-first rewrite

Feature: Account status schema migration
Owner: Platform Data
Status: Ready for review

Context:
- accounts has 18M rows.
- status is currently inferred from closed_at and billing_pause_until.
- reporting reads accounts through nightly export.

Goal:
- Add account_status without blocking account writes.
- Backfill status from existing fields.
- Switch reads after parity checks pass.

Non-goals:
- No billing state redesign.
- No report schema change in this release.
- No removal of closed_at or billing_pause_until yet.

The migration phases

1. Expand

Add nullable account_status and supporting index without changing application reads.

2. Backfill

Fill status in batches and record row counts, mismatch counts, and runtime.

3. Switch

Read from account_status only after parity checks pass in staging and production sample.

4. Cleanup

Remove old inference later, after reporting and rollback windows close.

The packet reviewers actually need

spec.md

Status rules:
- closed_at not null -> closed
- billing_pause_until future -> paused
- otherwise -> active

Compatibility:
- Existing reads keep old inference during expand and backfill.
- API response may include account_status only after switch flag is enabled.
- Reports continue using nightly export fields in this release.

tasks.md

- [ ] Add nullable account_status column.
- [ ] Add non-blocking index if query plan requires it.
- [ ] Write idempotent batched backfill.
- [ ] Add parity check comparing old and new status.
- [ ] Switch API read behind feature flag.
- [ ] Document cleanup follow-up.

acceptance-criteria.md

- Given an account with closed_at
  When backfill runs
  Then account_status is closed.

- Given the switch flag is off
  When API reads account status
  Then it uses existing inference.

- Given parity mismatch > 0.1%
  When release review runs
  Then switch is blocked.

evidence.md

Required:
- migration dry-run output
- batch size and lock notes
- query plan before/after
- parity check result
- API flag test
- rollback boundary after switch

Reviewer walkthrough

The first review question is whether the migration can run without blocking production writes. A code generator can add a column and update models, but it will not know the table size, lock behavior, index strategy, or batch limits unless the spec names them. That context belongs in the packet before any migration file is generated.

The second review question is whether the new field really matches current behavior. The old status may be implicit, messy, and spread across fields. The spec turns that implicit rule into testable mapping logic, then asks for a parity check before reads switch to the new column.

The third review question is rollback. Before the switch, rollback usually means disabling the new reader and rerunning or pausing the backfill. After the switch, rollback may require a different playbook because downstream consumers may have observed account_status. The packet should name that boundary clearly.

Scope check

Reject billing redesign, report schema changes, and removal of old fields. They are follow-up work after parity and consumer review.

Evidence check

Require backfill dry-run output, query plan checks, mismatch count, and tests that prove the feature flag controls reads.

Release check

The stop signal is elevated write latency, lock wait, backfill error rate, or parity mismatch above the threshold named in the spec.

AI coding guardrail

Implement only the account_status migration packet.

Allowed:
- migration file for nullable column
- batched backfill script
- status mapping tests
- parity check
- API read flag
- documentation for rollback boundary

Do not:
- remove closed_at
- remove billing_pause_until
- redesign billing status
- change reporting schema
- add admin status editor
- rewrite account model unrelated to status reads

The guardrail prevents a common AI-coding failure: taking a schema request as permission to "clean up" the domain model. A safer migration deliberately keeps old fields alive until the new field proves parity and downstream consumers have had time to adjust.

How to adapt this case

Use this case for schema changes where the new column represents behavior that already exists implicitly: account status, subscription tier, permission state, usage category, onboarding phase, or payment risk. The pattern is to expand first, backfill safely, compare old and new logic, switch reads behind a flag, and delay cleanup until rollback risk is lower.

Before copying the packet, replace the row count, lock-risk note, status mapping, and consumer list. The row count matters because a migration that is safe for 50,000 rows may be unsafe for 18 million. The consumer list matters because reporting, exports, search indexes, and customer support tools often read data differently from the API.

When using AI, do not ask for "the migration" until the packet states allowed files and stop signals. Ask the assistant to generate the migration and tests in separate tasks. If it changes old inference logic before the switch flag exists, send it back to the spec instead of accepting the diff.

Anti-patterns to reject

Required column on day one

A non-null column with immediate writes may lock or fail existing paths. Expand with nullable first unless the table is proven small and isolated.

No parity threshold

"Looks right" is not enough. The spec should name a mismatch threshold that blocks the switch.

Cleanup in the same release

Removing old fields before downstream consumers settle makes rollback more expensive and incident response slower.

What to watch after launch

A schema migration should keep producing evidence after merge. The safest teams treat the first release window as part of the spec, not as an informal monitoring chore.

Lock and latency

Watch write latency, lock waits, and backfill error rate while batches run. These are stronger signals than "migration completed."

Parity drift

Run the old-versus-new mapping query after switch, especially if old fields can still change during the release window.

Use this pattern before your next schema migration

Generate a packet, add the row count and stop signals, then keep implementation tasks small enough for review.

Editorial note

This case is a teaching scenario based on common migration failures: locking risk, missing backfill counts, premature read switches, and cleanup that removes rollback options too early.