AI Coding with Test-Evidence Gates

AI writes code that reads like it works. Sometimes it even passes the tests the AI wrote for itself. My rule now: I do not trust the code, I do not trust the tests, I trust the evidence that the tests would have caught a real bug. Everything below is how I make that evidence concrete before anything merges.

Published on 2026-03-10 · Updated 2026-06-02 · Author: Spec Coding Editorial Team

The hallucination that passes review

The scariest AI-generated PR I ever approved was a payment decline handler. Every test was green. The diff read cleanly. The assistant had added a full test file alongside the implementation. Two weeks later we found that the decline path had never run in production because the handler returned early on a misread status code, and the tests mocked the gateway to always return approval. The assertions executed. They asserted nothing that mattered.

That is the hallucination problem: AI code passes by eye because it pattern-matches to code that works, and its tests pass because they confirm what the code already does. Review cannot be the gate. Tests are the gate, and the tests themselves need evidence.

The evidence trio I require on every PR

A passing run log. The CI run ID, linked in the PR description, showing green on the final commit. A URL, not a screenshot.
A failing run log. Proof the same tests actually fail when the implementation is wrong: an earlier commit on the branch with a bug, or a run where a line was commented out. "I ran it locally" is not evidence.
A mutation witness. One deliberate mutation applied to the new code and a CI run where the suite catches it. If flipping >= to > leaves every test green, the tests are not testing.

Together they answer the only question that matters: do the tests distinguish correct from incorrect behavior on this diff? A passing build alone does not.
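A minimal sketch of what a mutation witness proves, in TypeScript. The shipping rule, the threshold, and every name below are made up for illustration; in a real pipeline a mutation tool would generate the mutant and CI would run the suite. Here a suite is modeled as a plain function so the example has no dependencies.

```typescript
// Hypothetical business rule: orders at or above the threshold ship free.
type ShippingRule = (totalCents: number) => boolean;

const FREE_SHIPPING_THRESHOLD_CENTS = 5000;

// The implementation under review.
const shipsFree: ShippingRule = (totalCents) =>
  totalCents >= FREE_SHIPPING_THRESHOLD_CENTS;

// The deliberate mutant: >= flipped to >.
const shipsFreeMutant: ShippingRule = (totalCents) =>
  totalCents > FREE_SHIPPING_THRESHOLD_CENTS;

// A suite returns true when every check in it passes.
type Suite = (rule: ShippingRule) => boolean;

// Weak suite: only inputs far away from the boundary.
const weakSuite: Suite = (rule) => rule(10000) === true && rule(100) === false;

// Boundary suite: adds the exact threshold, the one input where >= and > differ.
const boundarySuite: Suite = (rule) =>
  weakSuite(rule) && rule(FREE_SHIPPING_THRESHOLD_CENTS) === true;

// A mutant is killed when the suite passes on the original and fails on the mutant.
function killsMutant(suite: Suite, original: ShippingRule, mutant: ShippingRule): boolean {
  return suite(original) && !suite(mutant);
}

const weakKills = killsMutant(weakSuite, shipsFree, shipsFreeMutant);         // false
const boundaryKills = killsMutant(boundarySuite, shipsFree, shipsFreeMutant); // true
```

The weak suite is green on both versions, so it cannot serve as a witness. Only the suite that pins the boundary turns red on the mutant.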

Coverage theater and how to spot it

Line coverage is the most misused metric in software. I have reviewed 97 percent line coverage suites that would not catch a division by zero. The patterns repeat:

Line coverage without branch coverage. The test walks through the function but never enters the else.
Assertions that compare a value to itself or to a constant the code also returns.
Tests that instantiate an object and never call the method they claim to test.
Mocks configured so broadly the code under test never runs its logic.
expect(result).toBeDefined(). The function returned something. What was it?
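To make the toBeDefined pattern concrete, here is a small TypeScript sketch under stated assumptions: the averaging function and both checks are hypothetical, and the checks are plain boolean functions rather than a real test runner. The buggy version has no zero guard, which is the kind of division-by-zero gap a high line-coverage number can hide.

```typescript
type Average = (totalCents: number, orderCount: number) => number;

// Buggy: no guard, so zero orders yields NaN (0 / 0) in JavaScript.
const averageBuggy: Average = (totalCents, orderCount) => totalCents / orderCount;

// Intended behavior (an assumption for this sketch): zero orders averages to 0.
const averageFixed: Average = (totalCents, orderCount) =>
  orderCount === 0 ? 0 : totalCents / orderCount;

// Theater: equivalent to expect(result).toBeDefined(). NaN is defined, so this passes.
const theaterCheck = (avg: Average): boolean => avg(0, 0) !== undefined;

// Evidence: pins the exact value on the ugly branch and on a normal input.
const valueCheck = (avg: Average): boolean => avg(0, 0) === 0 && avg(900, 3) === 300;

const theaterOnBuggy = theaterCheck(averageBuggy); // true: the bug slips through
const valueOnBuggy = valueCheck(averageBuggy);     // false: the bug is caught
const valueOnFixed = valueCheck(averageFixed);     // true
```

Both checks execute every line of the function. Only the second one can tell the two implementations apart.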

The four test types that need to exist

Unit tests for pure logic, including the ugly branches nobody wants to think about.
Integration tests crossing at least one real boundary: database, queue, file system, HTTP server. Not a mock of that boundary.
Failure-path tests that deliberately break the dependency. Timeout, 500, malformed payload, disk full. If the code has a catch, something has to throw into it.
Acceptance tests expressed as Given/When/Then, driven from the outside, covering the user-visible behavior the ticket was written for.
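A failure-path test, sketched in TypeScript. The inventory client, the result shape, and the retry_later status are all assumptions invented for this example. The point is only that the test double throws, so the catch block actually runs and its output gets asserted.

```typescript
// Hypothetical dependency boundary. reserve() throws when the call fails.
interface InventoryClient {
  reserve(sku: string, qty: number): void;
}

type ReserveResult =
  | { status: "reserved" }
  | { status: "retry_later"; reason: string };

// Code under test: converts a dependency failure into an explicit result.
function reserveStock(client: InventoryClient, sku: string, qty: number): ReserveResult {
  try {
    client.reserve(sku, qty);
    return { status: "reserved" };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return { status: "retry_later", reason };
  }
}

// Test double that deliberately breaks the dependency.
const timingOutClient: InventoryClient = {
  reserve() {
    throw new Error("timeout after 3000ms");
  },
};

// Test double for the happy path, so both branches are exercised.
const healthyClient: InventoryClient = {
  reserve() {},
};

const failedReservation = reserveStock(timingOutClient, "SKU-1", 2);
const okReservation = reserveStock(healthyClient, "SKU-1", 2);
```

A test named "handles errors" that never makes the client throw would leave the catch branch unexecuted. Here the failure result carries the reason, so the assertion can check what the code did with the error, not just that it survived.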

Given/When/Then that a machine can check

I keep acceptance criteria in the PR description in this shape, and I refuse to review until they are there:

Given a customer with a valid card that the gateway will decline
  And an order in the pending state
When the payment handler processes the charge
Then the order moves to payment_failed
  And a decline email is queued
  And no capture request is recorded in the ledger
  And the response body contains the gateway decline code

Four Thens, four assertions, one to one. If the acceptance test file has fewer assertions than the criteria have Thens, one of them is lying.
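The decline scenario above can be sketched as a dependency-free TypeScript check. Everything about the handler here is hypothetical: the World shape, processCharge, and the insufficient_funds code are stand-ins, not the real payment system. What carries over is the structure, one named check per Then.

```typescript
type OrderState = "pending" | "paid" | "payment_failed";

interface Order { id: string; state: OrderState; }
interface GatewayResult { approved: boolean; declineCode?: string; }
interface World {
  order: Order;
  emailQueue: string[]; // names of queued email templates
  ledger: string[];     // recorded ledger operations
}

// Hypothetical handler, kept in memory so the sketch is self-contained.
function processCharge(
  world: World,
  gateway: () => GatewayResult
): { body: { declineCode?: string } } {
  const result = gateway();
  if (result.approved) {
    world.ledger.push(`capture:${world.order.id}`);
    world.order.state = "paid";
    return { body: {} };
  }
  world.order.state = "payment_failed";
  world.emailQueue.push("decline");
  return { body: { declineCode: result.declineCode } };
}

// Given a customer with a valid card that the gateway will decline
const decliningGateway = (): GatewayResult => ({
  approved: false,
  declineCode: "insufficient_funds",
});
//   And an order in the pending state
const world: World = { order: { id: "ord_1", state: "pending" }, emailQueue: [], ledger: [] };

// When the payment handler processes the charge
const response = processCharge(world, decliningGateway);

// Then: one entry per Then, in the same order as the criteria.
const thens: Array<[string, boolean]> = [
  ["order moves to payment_failed", world.order.state === "payment_failed"],
  ["decline email is queued", world.emailQueue.indexOf("decline") !== -1],
  ["no capture request in the ledger",
    world.ledger.filter((entry) => entry.indexOf("capture:") === 0).length === 0],
  ["response body contains the decline code",
    response.body.declineCode === "insufficient_funds"],
];

const failedThens = thens.filter(([, ok]) => !ok).map(([label]) => label);
```

Keeping the checks in a labeled list makes the count auditable: the reviewer can compare thens.length against the number of Thens in the PR description without reading the test body.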

When the AI writes the tests too

The assistant has every incentive to write tests that confirm the code it just wrote. The only defenses that actually work:

Mutation testing on changed files. Stryker, mutmut, PIT, whatever your language has. Run it on the diff and fail the build if the mutation score on new code drops below a floor. A surviving mutant is a gap the AI missed.
Property-based tests for anything that takes structured input. The AI will write an example test with three rows; a property test with a thousand generated inputs finds the off-by-one.
Adversarial case review. I write down three inputs the AI probably did not consider: empty, enormous, malformed. Then I check whether those inputs appear in the test file. They almost never do on the first pass.
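A rough TypeScript sketch of the property-based idea. In practice a library such as fast-check would handle generation and shrinking; this version hand-rolls a seeded generator only to stay dependency-free. The chunk function, its off-by-one variant, and the input ranges are all invented for illustration.

```typescript
type Chunker = (items: number[], size: number) => number[][];

// Reference behavior: split into chunks of at most `size`, keeping every item.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Off-by-one variant: silently drops the trailing partial chunk.
function chunkBuggy<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i + size <= items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Small linear congruential generator so every run is reproducible from a seed.
function makeRng(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Property: flattening the chunks gives back the input, and no chunk is
// empty or longer than `size`. Returns the first failing input, or null.
function findCounterexample(
  impl: Chunker,
  runs: number,
  seed: number
): { items: number[]; size: number } | null {
  const rng = makeRng(seed);
  for (let run = 0; run < runs; run++) {
    const length = Math.floor(rng() * 20);  // 0..19 items
    const size = 1 + Math.floor(rng() * 5); // chunk size 1..5
    const items: number[] = [];
    for (let i = 0; i < length; i++) items.push(Math.floor(rng() * 100));

    const chunks = impl(items, size);
    const flattened: number[] = [];
    for (const c of chunks) for (const v of c) flattened.push(v);

    const sizesOk = chunks.every((c) => c.length >= 1 && c.length <= size);
    const sameItems =
      flattened.length === items.length && flattened.every((v, i) => v === items[i]);
    if (!sizesOk || !sameItems) return { items, size };
  }
  return null;
}

const correctResult = findCounterexample(chunk, 1000, 42);    // null
const buggyResult = findCounterexample(chunkBuggy, 1000, 42); // first failing input
```

A three-row example test that happens to use lengths divisible by the chunk size passes on both versions. The generated inputs hit a length that is not a multiple of the size and expose the dropped tail.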

Review checklist and CI gates

My reviewer checklist is short and I will not approve without hitting each item:

For each new assertion, can I name the implementation line it would fail on if I deleted it?
If I rename an internal variable, does the test still pass? It should.
If I invert a business rule, does at least one test turn red? It had better.
Is there a real failure-path test, not one named "handles errors" that only checks a try/catch exists?
Did the test file grow in roughly the shape of the implementation, or is the implementation 300 lines and the tests 40?

Human discipline decays; CI does not. My pipelines fail the build if branch coverage drops, if any new source line is uncovered, if a source file grew by 50 lines without the test file growing, if the mutation score on changed files falls below the floor, or if the PR description is missing a Test-Evidence: tag linking to a CI run ID. That last gate is the highest-leverage one I have: a bot also blocks the merge when the tag points at a stale run from three force-pushes ago.
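One way the Test-Evidence gate could be checked, as a TypeScript sketch. The tag format (a URL ending in /runs/ plus a numeric ID), the CiRun shape, and the lookupRun callback are assumptions made for this example; a real bot would read these from the CI provider's API, which is not modeled here.

```typescript
// Assumed tag format (not a standard): a line in the PR description such as
//   Test-Evidence: https://ci.example.com/runs/12345
interface CiRun {
  id: string;
  headSha: string;
  conclusion: "success" | "failure";
}

type GateResult = { allowed: true } | { allowed: false; reason: string };

// Pulls the numeric run ID out of the tag line, or null if there is no tag.
function extractRunId(prDescription: string): string | null {
  const match = /^Test-Evidence:\s*\S*\/runs\/(\d+)\s*$/m.exec(prDescription);
  return match ? match[1] : null;
}

// lookupRun is a hypothetical callback standing in for a CI API client.
function checkEvidenceGate(
  prDescription: string,
  prHeadSha: string,
  lookupRun: (id: string) => CiRun | undefined
): GateResult {
  const runId = extractRunId(prDescription);
  if (runId === null) return { allowed: false, reason: "missing Test-Evidence tag" };

  const run = lookupRun(runId);
  if (!run) return { allowed: false, reason: `unknown CI run ${runId}` };

  // A run on an older commit proves nothing about the code being merged.
  if (run.headSha !== prHeadSha) {
    return {
      allowed: false,
      reason: `stale run: evidence is for ${run.headSha}, head is ${prHeadSha}`,
    };
  }
  if (run.conclusion !== "success") {
    return { allowed: false, reason: `run ${runId} did not pass` };
  }
  return { allowed: true };
}

// Usage with an in-memory table of runs.
const knownRuns: { [id: string]: CiRun } = {
  "101": { id: "101", headSha: "aaa111", conclusion: "success" }, // before a force-push
  "102": { id: "102", headSha: "bbb222", conclusion: "success" }, // current head
};
const lookupRun = (id: string): CiRun | undefined => knownRuns[id];

const staleResult = checkEvidenceGate(
  "Test-Evidence: https://ci.example.com/runs/101", "bbb222", lookupRun);
const freshResult = checkEvidenceGate(
  "Fix decline path\n\nTest-Evidence: https://ci.example.com/runs/102\n", "bbb222", lookupRun);
const missingResult = checkEvidenceGate("Fix decline path", "bbb222", lookupRun);
```

Comparing the run's commit against the PR head is what defeats the force-push case: the link can be perfectly valid and green and still be rejected.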

Snapshots, flakes, and other lies

Snapshot tests are the most common coverage theater in AI-generated code. The assistant calls toMatchSnapshot, the first run writes a snapshot that becomes the oracle, and the test forever confirms whatever the component produced that first time, bug included. Snapshots are useful only when a human reviews every changed line, so I ban them for behavior and allow them only for pure format fixtures.

Flaky tests get the same blunt policy: a flaky test is a failing test with good publicity. Fix within one business day or delete. Never skip. Skipped tests rot into documentation of a feature nobody tests, invisible on the green dashboard. If the thing is genuinely probabilistic, make it deterministic in test mode or move it to a nightly soak suite that reports separately.
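"Deterministic in test mode" usually means injecting the source of randomness. A small TypeScript sketch, assuming a hypothetical retry backoff with jitter; the function name, base delay, and cap are illustrative values, not taken from any real system.

```typescript
// A source of randomness returning a value in [0, 1). Production code would
// pass Math.random; tests pass a fixed function instead.
type Random = () => number;

// Full-jitter backoff: a random delay between 0 and an exponential ceiling.
function backoffDelayMs(
  attempt: number,
  random: Random,
  baseMs = 100,
  capMs = 5000
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// With the random source pinned, every run produces the same delays.
const lowestDelay = backoffDelayMs(3, () => 0);      // 0
const highestDelay = backoffDelayMs(3, () => 0.999); // 799: ceiling is 100 * 2^3 = 800
const cappedDelay = backoffDelayMs(10, () => 0.5);   // 2500: ceiling capped at 5000
```

The statistical behavior of the real jitter still deserves coverage, but that belongs in the separate soak suite, not in a test that blocks merges.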

What changed after we enforced this

The first month was painful. PR throughput dropped because assistants kept opening PRs that failed the mutation gate. By month three the assistant had learned, through our prompts and pinned examples, to produce tests that survived mutations on the first try. The decline-path bug has not recurred, because the failing-run log requirement made it structurally impossible to merge a test suite that had never seen a failure. Evidence is cheap once you ask for it. Trust is expensive once you lose it.

AI Review Packet to Copy

Use this when passing tests are not enough. The packet asks what defect the test would catch, which fixture proves it, and where the reviewer can see that evidence.

AI coding review packet: AI Coding with Test-Evidence Gates

Decision to make:
- Test-evidence gates for AI-generated code: what tests must exist before merge, how to detect coverage theater, and the evidence structure that catches hallucinated implementations.

Owner check:
- Product owner:
- Engineering owner:
- QA or operations reviewer:

Scope boundary:
- In scope:
- Out of scope:
- Assumption that still needs approval:

Acceptance evidence:
- Test or fixture:
- Log, metric, or screenshot:
- Manual review step:

AI boundary: generated changes must stay inside the written scope and attach evidence for each acceptance criterion.

Reviewer prompt:
- What would still be ambiguous to someone who missed the planning meeting?
- What evidence would make this safe enough to ship?

Case study: the AI fix that passed tests but failed evidence

An AI-generated refund fix passed the unit suite because the test asserted the HTTP response only. The evidence gate required state, audit, and retry proof before the PR could merge.

Acceptance criterion	Required evidence	Reject if
Duplicate refund request is safe.	Test shows same idempotency key returns original refund_id.	Only the 200 response is asserted.
Provider timeout keeps refund pending.	Fixture shows state=pending_provider and no ledger entry.	The test mocks success after timeout.
Support can trace the action.	Audit row includes actor_id, order_id, provider_status.	Evidence stops at UI screenshot.
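The three rows above can be sketched as checks against a toy in-memory service. This TypeScript example is an illustration under assumptions: RefundService, its fields, and the provider callback are hypothetical, and the service returns 200 in every case on purpose, to show why asserting the response alone proves nothing.

```typescript
interface Refund {
  refundId: string;
  orderId: string;
  state: "pending_provider" | "completed";
}
interface AuditRow { actorId: string; orderId: string; providerStatus: string; }
type ProviderStatus = "ok" | "timeout";

// Hypothetical in-memory stand-in for the refund service.
class RefundService {
  readonly ledger: string[] = [];
  readonly audit: AuditRow[] = [];
  private byKey: { [key: string]: Refund } = {};
  private nextId = 1;
  private provider: (orderId: string) => ProviderStatus;

  constructor(provider: (orderId: string) => ProviderStatus) {
    this.provider = provider;
  }

  refund(
    idempotencyKey: string,
    orderId: string,
    actorId: string
  ): { status: number; refund: Refund } {
    // A repeated key returns the original refund instead of creating a new one.
    const existing = this.byKey[idempotencyKey];
    if (existing) return { status: 200, refund: existing };

    const providerStatus = this.provider(orderId);
    const refund: Refund = {
      refundId: `rf_${this.nextId++}`,
      orderId,
      state: providerStatus === "ok" ? "completed" : "pending_provider",
    };
    // Money only moves in the ledger once the provider confirms.
    if (providerStatus === "ok") this.ledger.push(refund.refundId);
    this.audit.push({ actorId, orderId, providerStatus });
    this.byKey[idempotencyKey] = refund;
    return { status: 200, refund };
  }
}

// The provider is forced to time out: the failure is injected, not mocked away.
const refundService = new RefundService(() => "timeout");
const firstCall = refundService.refund("key-1", "ord_9", "agent_7");
const secondCall = refundService.refund("key-1", "ord_9", "agent_7");

// Row 1: same idempotency key returns the original refund_id.
const sameRefundId = secondCall.refund.refundId === firstCall.refund.refundId;
// Row 2: timeout leaves state=pending_provider and no ledger entry.
const staysPending =
  firstCall.refund.state === "pending_provider" && refundService.ledger.length === 0;
// Row 3: the audit row carries actor_id, order_id, provider_status.
const auditRow = refundService.audit[0];
const traceable =
  auditRow.actorId === "agent_7" &&
  auditRow.orderId === "ord_9" &&
  auditRow.providerStatus === "timeout";
```

Both calls return 200, so a test that stops at the status code passes whether or not any of the three criteria hold. The evidence lives in the refund ID, the state, the ledger, and the audit row.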

Keywords: test evidence gates · AI-generated code review · mutation testing · coverage theater · CI merge gates · property-based testing · acceptance criteria
