Quality Gates for AI-Assisted Development Specs

Four quality gates stop AI-assisted code from shipping unreviewed: a pre-prompt spec check, a post-generation diff review, test-evidence verification, and a human sign-off rule. I have watched teams skip all four and then spend a weekend rolling back a one-line "harmless" change that was not harmless at all.

Published on 2026-02-25 · Updated 2026-06-02 · 9 min read · Author: Spec Coding Editorial Team · Review policy: Editorial Policy

Field note: the gate that stops a pretty but unsafe diff

The common failure is not ugly generated code. It is polished code with no rollback story. A quality gate should ask for evidence that survives beyond the local test run.

Quality gate row:
Risk class: medium
Required before merge: unit test, integration test, rollback note
Required before release: log query and alert threshold
Release owner: on-call lead
Stop signal: error_rate_refund_worker > 1% for 10 minutes

Four Gates, In Order, Non-Negotiable

When I run AI-assisted work, I enforce four gates and I enforce them in order. Gate 1 fires before the model sees a prompt: the spec has to be complete. Gate 2 fires right after generation: the diff has to match what the prompt said it would touch. Gate 3 fires after the code is wired up: tests have to exist and they have to actually fail when the behavior breaks. Gate 4 fires before merge: a named human has to have read the thing, not just clicked approve.

Any team that skips one of these is betting that the other three catch it. In my experience, the gaps line up more often than people expect.

Gate 1: Pre-Prompt Spec Check

Before I let the model generate a single token, an automated check runs over the spec file. It looks for five sections: problem statement, scope, non-goals, acceptance criteria, risks. If any of them is empty, missing, or reduced to a single sentence like "make it work," the runner refuses to call the model. No spec, no generation.

This sounds bureaucratic until you watch it work. The number of "quick tickets" that dissolve the moment someone has to write down a non-goal is astonishing. If a feature cannot survive five headings, it is not ready for AI, and honestly it is not ready for a human either.

The trap is letting Gate 1 become a checkbox. If every spec passes, the gate is not measuring anything. I audit a random sample of passing specs every two weeks and look for the ones that used filler text. Those get kicked back and the author finds out in public.
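The five-section check described above can be sketched roughly as follows. This is a minimal illustration, not the actual runner: it assumes the spec is a Markdown file with one `## ` heading per section, and the `MIN_WORDS` threshold and the function name `missing_or_thin_sections` are hypothetical.

```python
import re

REQUIRED_SECTIONS = [
    "problem statement",
    "scope",
    "non-goals",
    "acceptance criteria",
    "risks",
]
MIN_WORDS = 12  # assumed threshold for "more than a throwaway sentence"


def missing_or_thin_sections(spec_text: str) -> list[str]:
    """Return required sections that are absent or too short to count."""
    # Splitting on level-2 headings with a capture group yields
    # [preamble, heading, body, heading, body, ...].
    parts = re.split(r"^##[ \t]+(.+?)[ \t]*$", spec_text, flags=re.MULTILINE)
    bodies = {
        heading.strip().lower(): body
        for heading, body in zip(parts[1::2], parts[2::2])
    }
    failures = []
    for name in REQUIRED_SECTIONS:
        # A missing section has an empty body, so it fails the same test.
        if len(bodies.get(name, "").split()) < MIN_WORDS:
            failures.append(name)
    return failures
```

A runner would refuse to call the model whenever this returns a non-empty list. A word count does not catch filler text, which is why the periodic audit of passing specs still matters.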

Gate 2: Post-Generation Diff Review

Right after the model finishes, before any tests run, I check the diff against the prompt. Three questions: Did the diff touch only the files I told it to touch? Did it add any dependencies I did not authorize? Did it create files outside the declared output paths?

I automate all three. A script reads the prompt's declared file list, diffs it against the actual changed files, and flags anything unexpected. New entries in package.json, pyproject.toml, or go.mod trigger a hard stop. Generated files in weird directories trigger a hard stop.
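One way such a script could look, as a sketch under stated assumptions: the declared file list and the allowed output directories are passed in as plain Python collections, and the function name `diff_scope_violations` is hypothetical. It is also coarser than the rule above, because it flags any change to a manifest rather than only new entries.

```python
from pathlib import PurePosixPath

# Dependency manifests named in the text; any change to one is a hard stop here.
MANIFESTS = {"package.json", "pyproject.toml", "go.mod"}


def diff_scope_violations(
    declared_files: set[str],
    changed_files: set[str],
    allowed_dirs: tuple[str, ...],
) -> list[str]:
    """Compare the files a diff touched against what the prompt declared."""
    violations = []
    for path in sorted(changed_files):
        p = PurePosixPath(path)
        if p.name in MANIFESTS:
            violations.append(f"dependency manifest changed: {path}")
        elif path not in declared_files:
            if any(p.is_relative_to(d) for d in allowed_dirs):
                violations.append(f"undeclared file changed: {path}")
            else:
                violations.append(f"file outside declared output paths: {path}")
    return violations
```

In CI the changed set could come from `git diff --name-only` against the merge base. Any non-empty result would stop the pipeline before tests run.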

The failure mode I see most often is the model quietly adding a utility dependency because "it would be cleaner." Sometimes it would. But I want that decision surfaced, not smuggled.

Gate 3: Test Evidence, Including the Mutation Check

Tests passing is not evidence. Tests passing and then failing when you intentionally break the code is evidence. I require both.

For every acceptance criterion, there has to be at least one test that maps to it. I write acceptance criteria in Given/When/Then so the mapping is mechanical:

- Given a user with an expired session token
  When they request a protected resource
  Then the response is 401 and no row is written to the audit log

- Given a payment webhook with a replayed event ID
  When the handler receives it
  Then the handler returns 200 and does not re-charge the card

After the tests pass, I run a mutation check on the critical paths. If I can break the function body and the test still passes, the test is a liar. I would rather find that out now than during an incident. Gate 3 fails more often than people expect, almost always because the AI wrote a test that mocks the thing it was supposed to verify.
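To make the "test is a liar" idea concrete, here is a toy sketch with a hand-written mutant. Everything in it is hypothetical (the `is_expired` function, both tests, the `survives_mutation` helper); a real mutation tool would generate the mutants automatically rather than relying on one written by hand.

```python
from typing import Callable

ExpiryFn = Callable[[int, int], bool]


def is_expired(expires_at: int, now: int) -> bool:
    """Original behavior: a token is expired at or after its expiry time."""
    return now >= expires_at


def is_expired_mutant(expires_at: int, now: int) -> bool:
    """Hand-written mutant: the comparison is flipped."""
    return now < expires_at


def weak_test(fn: ExpiryFn) -> bool:
    # Only checks the return type, so it cannot tell the two versions apart.
    return isinstance(fn(100, 200), bool)


def strong_test(fn: ExpiryFn) -> bool:
    # Pins the behavior on both sides of the expiry boundary.
    return fn(100, 100) is True and fn(100, 99) is False


def survives_mutation(
    test: Callable[[ExpiryFn], bool], original: ExpiryFn, mutant: ExpiryFn
) -> bool:
    """True if the test passes on the original and still passes on the mutant."""
    return test(original) and test(mutant)
```

Here `weak_test` survives the mutation, so it would fail Gate 3, while `strong_test` does not survive it and counts as evidence.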

Gate 4: The "I Read This" Button

Gate 4 is the one that teams rubber-stamp, and it is the one that saves you. A named human has to read the diff end to end and attest to it. Not "approved," not a thumbs-up emoji. A comment that references specific lines, specific choices, or specific risks. Something that proves they were there.

The reason I am strict about this is that Gates 1 through 3 are machine-checkable, which means they are also machine-foolable. A reasonable-looking spec, a diff that stays in its lane, and a test suite that goes green will sail through. None of that catches a silent behavior change where the code is technically correct but no longer does what the business wants.

A PR That Passed 1-3 And Got Caught At Gate 4

Last quarter a PR came through that refactored our pricing rounding logic. Spec was complete. Diff was bounded. Tests passed and survived mutation. It looked perfect.

The reviewer on Gate 4 noticed that one branch now used banker's rounding where the old code used half-up. The tests did not catch it because the test inputs happened to round the same way under both rules. On the numbers that actually flowed through production, about one invoice in two hundred would have moved by a cent. Over a month that would have been a four-figure accounting reconciliation headache and probably a regulatory finding.

Gates 1, 2, and 3 had no opinion about rounding policy. A human who had priced these products for three years read the diff and said "wait, why did this change." That is the gate.
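The rounding difference is easy to reproduce with Python's `decimal` module. The amounts below are illustrative, not the values from that PR; the point is that the two rules only disagree on exact half-cent ties, so a test suite can pass under both without ever exercising the difference.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

CENT = Decimal("0.01")


def round_half_up(amount: str) -> Decimal:
    """Round to cents; ties always go away from zero."""
    return Decimal(amount).quantize(CENT, rounding=ROUND_HALF_UP)


def round_bankers(amount: str) -> Decimal:
    """Round to cents; ties go to the nearest even cent."""
    return Decimal(amount).quantize(CENT, rounding=ROUND_HALF_EVEN)


# Not a tie: both rules agree.
assert round_half_up("2.674") == round_bankers("2.674") == Decimal("2.67")

# A tie whose neighbor digit is odd: both rules still agree.
assert round_half_up("2.675") == round_bankers("2.675") == Decimal("2.68")

# A tie whose neighbor digit is even: the rules diverge by one cent.
assert round_half_up("2.665") == Decimal("2.67")
assert round_bankers("2.665") == Decimal("2.66")
```

A fixture set containing only inputs like the first two would stay green across the refactor, which matches what happened here.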

The Rubber-Stamp Problem

Gate 4 dies quietly. Reviewers see green CI, skim the diff, click approve, and move on. After a few months the gate is a vanity stamp and nobody notices until something gets through.

My countermeasures: rotate reviewers so the same person is not rubber-stamping the same author every week; require the review comment to cite a specific line or decision; audit a sample of approvals monthly and ask the reviewer to walk through what they looked at. If they cannot, that is data.

Measuring gate health is worth doing. I track how many PRs get rejected at each gate, and more importantly, which rejections were valuable versus which were noise. A gate that rejects nothing is either perfect or broken, and it is never perfect.

What CI Enforces Versus What Humans Enforce

Gates 1, 2, and 3 are CI's job. Spec schema validation, diff scope checks, test-to-criterion mapping, mutation sampling - all of it runs without a human in the loop. If a human has to remember to do these, they will forget.

Gate 4 cannot be automated. The whole point is that a human with context looked at the change. What I do automate is the evidence trail: the reviewer has to leave a comment that references the diff, and a bot records it against the PR. If the audit later asks "who signed off on this," there is an answer.
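A bot for that evidence trail might look something like this. It is only a sketch: the accepted citation formats ("line 42" or "L42", or the path of a changed file), the in-memory `log` list, and both function names are assumptions for illustration. It checks that a comment points at the diff, not that the reviewer understood it.

```python
import re

# Assumed citation formats: "line 42" or "L42". Review tools that attach
# comments directly to diff lines would make this check simpler.
LINE_REF = re.compile(r"\b(?:line\s+|L)\d+\b", re.IGNORECASE)


def cites_the_diff(comment: str, changed_files: set[str]) -> bool:
    """True if the comment names a changed file or a specific line."""
    if any(path in comment for path in changed_files):
        return True
    return LINE_REF.search(comment) is not None


def record_sign_off(
    log: list[dict],
    pr_id: str,
    reviewer: str,
    comment: str,
    changed_files: set[str],
) -> None:
    """Append an evidence record, or refuse a blank-style approval."""
    if not cites_the_diff(comment, changed_files):
        raise ValueError("sign-off must cite a changed file or line")
    log.append({"pr": pr_id, "reviewer": reviewer, "comment": comment})
```

With this in place, a bare "LGTM" is rejected, while "Why did rounding change on line 88?" is recorded against the PR.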

Bypass Policy

Sometimes you need to override a gate. Production is on fire, the spec is a one-liner, and you need the fix out now. I accept that, but with rules: only two named people can bypass, every bypass writes a row to an audit table with the reason, and every bypass generates a follow-up ticket that has to be closed within a week. Without that, "emergency" becomes the default path and the gates stop existing.
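Those three rules fit in a few lines. The sketch below uses SQLite from the standard library; the table name `gate_bypass`, its columns, and the approver names are all hypothetical, and creating the follow-up ticket is left out because it depends on the tracker in use.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical names; the rule is simply that the list has exactly two entries.
BYPASS_APPROVERS = {"alice", "bob"}


def init_bypass_table(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS gate_bypass (
               pr_id TEXT NOT NULL,
               approver TEXT NOT NULL,
               reason TEXT NOT NULL,
               created_on TEXT NOT NULL,
               followup_due TEXT NOT NULL
           )"""
    )


def record_bypass(
    conn: sqlite3.Connection, approver: str, pr_id: str, reason: str, today: date
) -> date:
    """Write the audit row and return the follow-up ticket's due date."""
    if approver not in BYPASS_APPROVERS:
        raise PermissionError(f"{approver} may not bypass gates")
    if not reason.strip():
        raise ValueError("a bypass needs a written reason")
    due = today + timedelta(days=7)  # follow-up must close within a week
    conn.execute(
        "INSERT INTO gate_bypass VALUES (?, ?, ?, ?, ?)",
        (pr_id, approver, reason, today.isoformat(), due.isoformat()),
    )
    conn.commit()
    return due
```

Returning the due date makes it easy for the caller to open the follow-up ticket with the right deadline.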

The Gate That Probably Should Exist

I do not yet run a post-merge regression gate and I think I am wrong about that. A 24-hour smoke check that compares key metrics against the previous day - error rates, latency percentiles, business metrics like conversion - would catch the silent regressions that tests cannot. I have seen two incidents in the last year that this gate would have caught in hours instead of days. It is on my list.
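Since this gate does not exist yet, the following is only a sketch of what the comparison might look like. The metric names, the tolerances, and the idea of using relative change against the previous day are all assumptions; real thresholds would need tuning against normal day-to-day variance.

```python
# metric name -> (direction that counts as worse, allowed relative change)
THRESHOLDS = {
    "error_rate": ("up", 0.20),
    "latency_p95_ms": ("up", 0.15),
    "conversion_rate": ("down", 0.05),
}


def find_regressions(
    yesterday: dict[str, float], today: dict[str, float]
) -> list[str]:
    """Return the metrics that moved in the bad direction beyond tolerance."""
    flagged = []
    for name, (bad_direction, tolerance) in THRESHOLDS.items():
        before, after = yesterday[name], today[name]
        if before == 0:
            # Relative change is undefined; a real check needs an absolute floor.
            continue
        change = (after - before) / before
        worse_by = change if bad_direction == "up" else -change
        if worse_by > tolerance:
            flagged.append(name)
    return flagged
```

For example, a p95 latency moving from 400 ms to 500 ms is a 25% increase and would be flagged, while an error rate moving from 1.0% to 1.1% stays inside the assumed 20% tolerance.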

What To Actually Do Monday

Pick one gate and make it real this week. If you have nothing, start with Gate 1 because it is the cheapest to automate and the highest leverage. If you have Gate 1, harden Gate 3 with a mutation check on your top ten critical files. If you have all four on paper but they are soft, audit last month's approvals and see how many reviewers can still tell you what they looked at. The answer will tell you which gate to fix first.

AI Review Packet to Copy

Use this when the spec is ready but the AI implementation still needs a gate. The packet defines which proof has to appear before a generated diff can be trusted.

AI coding review packet: Quality Gates for AI-Assisted Development Specs

Decision to make:
- The quality gates that stop AI-assisted code from shipping unreviewed: pre-prompt spec check, post-generation diff review, test-evidence verification, and the human sign-off rule.

Owner check:
- Product owner:
- Engineering owner:
- QA or operations reviewer:

Scope boundary:
- In scope:
- Out of scope:
- Assumption that still needs approval:

Acceptance evidence:
- Test or fixture:
- Log, metric, or screenshot:
- Manual review step:

AI boundary: generated changes must stay inside the written scope and attach evidence for each acceptance criterion.

Reviewer prompt:
- What would still be ambiguous to someone who missed the planning meeting?
- What evidence would make this safe enough to ship?

Case study: a quality gate with thresholds

A team had a quality gate that said "run tests and review carefully." It became useful only after the spec named thresholds, bypass rules, and evidence owners.

AI change gate:
- Applies to: auth, billing, permissions, data migration, public API
- Required before merge:
  - acceptance criteria mapped to tests
  - changed files match allowed_files list
  - reviewer checks forbidden_changes list
  - production metric named for first-hour monitoring
- Bypass:
  - only incident commander can bypass
  - bypass expires in 24 hours
  - follow-up spec review is required

Keywords: AI code review · pre-merge gates · acceptance criteria · mutation testing · diff scope check · human sign-off

Gate Rollout Checklist

A quality gate is useful only when it blocks the right change for a clear reason. Roll it out with owners, evidence, and a bypass rule before asking every team to follow it.

- Assign one owner for each gate: spec completeness, diff scope, test evidence, and human review.
- Make machine-checkable gates run in CI so reviewers are not responsible for remembering them.
- Require human review comments to cite a line, decision, or risk instead of using a blank approval.
- Log every bypass with reason, approver, expiry date, and follow-up ticket.
- Audit rejected and bypassed changes monthly to remove noisy checks and strengthen useful ones.

Evidence to Track

Track rejection reason, bypass count, review latency, post-merge defect escapes, and whether a blocked PR later produced a clearer spec. A gate that never rejects anything is not automatically healthy; it may simply be invisible.
