AI Coding with Test-Evidence Gates

Spec Coding Editorial Team · Spec-first engineering notes

AI writes code that reads like it works. Sometimes it even passes the tests the AI wrote for itself. My rule now: I do not trust the code, I do not trust the tests, I trust the evidence that the tests would have caught a real bug. Everything below is how I make that evidence concrete before anything merges.

Published on 2026-03-10 · Updated 2026-05-06 · Author: Spec Coding Editorial Team

The hallucination that passes review

The scariest AI-generated PR I ever approved was a payment decline handler. Every test was green. The diff read cleanly. The assistant had added a full test file alongside the implementation. Two weeks later we found that the decline path had never run in production because the handler returned early on a misread status code, and the tests mocked the gateway to always return approval. The assertions executed. They asserted nothing that mattered.

That is the hallucination problem: AI code passes by eye because it pattern-matches to code that works, and its tests pass because they confirm what the code already does. Review cannot be the gate. Tests are the gate, and the tests themselves need evidence.
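A minimal sketch of that failure mode, not the original handler: the function names, status strings, and order shape below are all hypothetical. The first check mocks the gateway to approve and passes against a handler whose decline branch can never run; the second feeds in a decline and exposes the bug.

```python
from unittest.mock import Mock


def handle_charge(gateway, order):
    """Hypothetical handler with the bug from the story: it misreads the status."""
    result = gateway.charge(order["amount"])
    # Bug: declines arrive as "declined", but this only looks for "error",
    # so every decline returns early on the success path.
    if result["status"] != "error":
        order["state"] = "paid"
        return order
    order["state"] = "payment_failed"
    return order


def test_approval_only():
    """Green, and vacuous for the decline path: the mock always approves."""
    gateway = Mock()
    gateway.charge.return_value = {"status": "approved"}
    order = handle_charge(gateway, {"amount": 100, "state": "pending"})
    assert order["state"] == "paid"


def check_decline_path():
    """Fails against the buggy handler. Named without the test_ prefix so a
    test runner does not collect this deliberately failing check."""
    gateway = Mock()
    gateway.charge.return_value = {"status": "declined", "code": "05"}
    order = handle_charge(gateway, {"amount": 100, "state": "pending"})
    assert order["state"] == "payment_failed"
```

Only the second check carries evidence about the decline path, because it is the only one that can fail when that path is broken.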

The evidence trio I require on every PR

Together they answer the only question that matters: do the tests distinguish correct from incorrect behavior on this diff? A passing build alone does not.
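One way to make that question mechanical is to run the same suite against a known-good implementation and a deliberately broken one. The sketch below is a hand-rolled illustration with hypothetical names; mutation-testing tools generate the broken variants automatically, and this does not reproduce any particular tool.

```python
def order_state(status: str) -> str:
    """Reference implementation: map a gateway status to an order state."""
    return "payment_failed" if status == "declined" else "paid"


def order_state_mutant(status: str) -> str:
    """Deliberately broken variant: the decline branch has been removed."""
    return "paid"


def weak_suite(impl) -> bool:
    """Only exercises the approval path."""
    return impl("approved") == "paid"


def strong_suite(impl) -> bool:
    """Exercises both the approval and the decline path."""
    return impl("approved") == "paid" and impl("declined") == "payment_failed"


def distinguishes(suite, good, bad) -> bool:
    """A suite carries evidence only if it passes on good and fails on bad."""
    return suite(good) and not suite(bad)
```

Here `distinguishes(weak_suite, order_state, order_state_mutant)` is false and the same call with `strong_suite` is true, even though both suites are green on the real implementation.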

Coverage theater and how to spot it

Line coverage is the most misused metric in software. I have reviewed suites with 97 percent line coverage that would not catch a division by zero. The patterns repeat:
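To make the division-by-zero claim concrete, here is a small sketch with a hypothetical function. The single test executes every line of the function, so a line-coverage report shows it fully covered, yet the zero-order case is never exercised.

```python
def average_order_value(total_cents: int, order_count: int) -> float:
    """Hypothetical helper: average value per order, in cents."""
    return total_cents / order_count


def test_average_order_value():
    # Every line of average_order_value runs here, so line coverage is 100 percent.
    assert average_order_value(1000, 4) == 250.0


def zero_orders_crashes() -> bool:
    """The input the covered suite never tried."""
    try:
        average_order_value(0, 0)
    except ZeroDivisionError:
        return True
    return False
```

Coverage reports which lines ran, not which inputs were tried, which is why the number alone cannot stand in for evidence.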

The four test types that need to exist

Given/When/Then that a machine can check

I keep acceptance criteria in the PR description in this shape, and I refuse to review until they are there:

Given a customer with a valid card that the gateway will decline
  And an order in the pending state
When the payment handler processes the charge
Then the order moves to payment_failed
  And a decline email is queued
  And no capture request is recorded in the ledger
  And the response body contains the gateway decline code

Four Thens, four assertions, one to one. If the acceptance test file has fewer assertions than the criteria have Thens, one of them is lying.
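A sketch of what the one-to-one mapping can look like in a test file. The handler, the fake gateway, and the `World` container are hypothetical stand-ins written only so the example runs; the point is that each Then has exactly one assertion with the criterion quoted above it.

```python
from dataclasses import dataclass, field


@dataclass
class FakeGateway:
    """Hypothetical test double that always declines with a fixed code."""
    decline_code: str = "insufficient_funds"

    def charge(self, amount_cents: int) -> dict:
        return {"status": "declined", "code": self.decline_code}


@dataclass
class World:
    """Hypothetical container for the state the criteria talk about."""
    order_state: str = "pending"
    email_queue: list = field(default_factory=list)
    ledger: list = field(default_factory=list)


def process_charge(gateway, world: World, amount_cents: int) -> dict:
    """Hypothetical handler under test."""
    result = gateway.charge(amount_cents)
    if result["status"] == "declined":
        world.order_state = "payment_failed"
        world.email_queue.append("decline_email")
        return {"ok": False, "decline_code": result["code"]}
    world.ledger.append(("capture", amount_cents))
    world.order_state = "paid"
    return {"ok": True}


def test_declined_card_fails_the_order():
    # Given a customer with a valid card that the gateway will decline
    gateway = FakeGateway(decline_code="insufficient_funds")
    # And an order in the pending state
    world = World(order_state="pending")
    # When the payment handler processes the charge
    response = process_charge(gateway, world, 1999)
    # Then the order moves to payment_failed
    assert world.order_state == "payment_failed"
    # And a decline email is queued
    assert "decline_email" in world.email_queue
    # And no capture request is recorded in the ledger
    assert not any(entry[0] == "capture" for entry in world.ledger)
    # And the response body contains the gateway decline code
    assert response["decline_code"] == "insufficient_funds"
```

Keeping the criterion text as a comment directly above its assertion makes a missing Then visible in review without running anything.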

When the AI writes the tests too

The assistant has every incentive to write tests that confirm the code it just wrote. The only defenses that actually work:

Review checklist and CI gates

My reviewer checklist is short and I will not approve without hitting each item:

Human discipline decays; CI does not. My pipelines fail the build if branch coverage drops, if any new source line is uncovered, if a source file grew by 50 lines without the test file growing, if the mutation score on changed files falls below the floor, or if the PR description is missing a Test-Evidence: tag linking to a CI run ID. That last gate is the highest-leverage one I have; a bot disables merge when the tag points at a stale run from three force-pushes ago.
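The Test-Evidence gate can be sketched as a small check, under assumptions the article does not specify: that the tag sits on its own line with a bare run ID as its value, and that the bot can look up which commit each CI run was built from. The `run_sha_by_id` mapping below is a hypothetical stand-in for that lookup.

```python
import re
from typing import Dict, Tuple

# Assumed tag format: "Test-Evidence: <run-id>" on a line of its own.
TAG_PATTERN = re.compile(r"^Test-Evidence:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def check_test_evidence(
    pr_description: str,
    head_sha: str,
    run_sha_by_id: Dict[str, str],
) -> Tuple[bool, str]:
    """Return (mergeable, reason) for a PR description.

    run_sha_by_id maps a CI run ID to the commit it ran against; in a real
    pipeline this would come from the CI system rather than a dict.
    """
    match = TAG_PATTERN.search(pr_description)
    if match is None:
        return False, "missing Test-Evidence: tag"
    run_id = match.group(1)
    run_sha = run_sha_by_id.get(run_id)
    if run_sha is None:
        return False, f"unknown CI run {run_id}"
    if run_sha != head_sha:
        # The run exists but was built from an older commit, e.g. before a force-push.
        return False, f"stale CI run {run_id}"
    return True, "ok"
```

Comparing the run's commit against the current head is what catches the stale-run case: the tag is present and the run is green, but it proves nothing about the code that would actually merge.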

Snapshots, flakes, and other lies

Snapshot tests are the most common coverage theater in AI-generated code. The assistant calls toMatchSnapshot, the first run writes a snapshot that becomes the oracle, and the test forever confirms whatever the component produced that first time, bug included. Snapshots are useful only when a human reviews every changed line, so I ban them for behavior and allow them only for pure format fixtures.
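The mechanism is easy to see with a toy stand-in for a snapshot matcher. This is not how any real snapshot library is implemented; the store, the helper, and the buggy formatter are hypothetical and exist only to show the first output becoming the oracle.

```python
def format_price(cents: int) -> str:
    """Buggy formatter: drops the leading zero in the cents part."""
    return f"${cents // 100}.{cents % 100}"


def assert_matches_snapshot(store: dict, name: str, value: str) -> None:
    """Toy snapshot matcher: the first run records, later runs compare."""
    if name not in store:
        store[name] = value  # the first output becomes the oracle, right or wrong
        return
    assert store[name] == value


def snapshot_check(store: dict) -> None:
    """Passes on every run, because it compares the bug against itself."""
    assert_matches_snapshot(store, "price-1905", format_price(1905))


def behavior_check() -> None:
    """Fails against the buggy formatter. Named without the test_ prefix so a
    test runner does not collect this deliberately failing check."""
    assert format_price(1905) == "$19.05"
```

The snapshot check stays green while storing "$19.5" as the expected value; the explicit assertion is the one that states what correct output is.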

Flaky tests get the same blunt policy: a flaky test is a failing test with good publicity. Fix within one business day or delete. Never skip. Skipped tests rot into documentation of a feature nobody tests, invisible on the green dashboard. If the thing is genuinely probabilistic, make it deterministic in test mode or move it to a nightly soak suite that reports separately.
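One common way to make probabilistic code deterministic in test mode is to inject the random source instead of reaching for a global one. The backoff function below is a hypothetical example chosen for illustration, not something from the codebase described here.

```python
import random


def retry_delay_ms(attempt: int, rng: random.Random) -> int:
    """Exponential backoff with jitter; the RNG is injected so tests can seed it."""
    base = 100 * (2 ** attempt)
    return base + rng.randint(0, base)


def test_retry_delay_is_reproducible():
    # Two generators with the same seed produce the same jitter.
    first = retry_delay_ms(3, random.Random(42))
    second = retry_delay_ms(3, random.Random(42))
    assert first == second
    # The bounds hold for any seed: base is 800, jitter is between 0 and 800.
    assert 800 <= first <= 1600
```

Production code would pass an unseeded `random.Random()`, while tests pass a seeded one, so the same function is exercised in both modes.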

What changed after we enforced this

The first month was painful. PR throughput dropped because assistants kept opening PRs that failed the mutation gate. By month three the assistant had learned, through our prompts and pinned examples, to produce tests that passed the mutation gate on the first try. The decline-path bug has not recurred, because the failing-run log requirement made it structurally impossible to merge a test suite that had never seen a failure. Evidence is cheap once you ask for it. Trust is expensive once you lose it.

AI Review Packet to Copy

Use this before an AI-generated diff reaches code review. It turns the prompt, the allowed scope, and the required proof into one reviewable artifact.

AI coding review packet: AI Coding with Test-Evidence Gates

Decision to make:
- Test-evidence gates for AI-generated code: what tests must exist before merge, how to detect coverage theater, and the evidence structure that catches hallucinated implementations.

Owner check:
- Product owner:
- Engineering owner:
- QA or operations reviewer:

Scope boundary:
- In scope:
- Out of scope:
- Assumption that still needs approval:

Acceptance evidence:
- Test or fixture:
- Log, metric, or screenshot:
- Manual review step:

AI boundary: generated changes must stay inside the written scope and attach evidence for each acceptance criterion.

Reviewer prompt:
- What would still be ambiguous to someone who missed the planning meeting?
- What evidence would make this safe enough to ship?

Editorial Review Note

Reviewed Apr 28, 2026. This update added a reusable artifact, checked the article against the related topic hub, and tightened the next-step links so the page works as a practical reference rather than a standalone essay.

Keywords: test evidence gates · AI-generated code review · mutation testing · coverage theater · CI merge gates · property-based testing · acceptance criteria

Topic Path

This article belongs to the AI Coding Governance track.
