AI Coding with Test-Evidence Gates
AI writes code that reads like it works. Sometimes it even passes the tests the AI wrote for itself. My rule now: I do not trust the code, I do not trust the tests, I trust the evidence that the tests would have caught a real bug. Everything below is how I make that evidence concrete before anything merges.
The hallucination that passes review
The scariest AI-generated PR I ever approved was a payment decline handler. Every test was green. The diff read cleanly. The assistant had added a full test file alongside the implementation. Two weeks later we found that the decline path had never run in production because the handler returned early on a misread status code, and the tests mocked the gateway to always return approval. The assertions executed. They asserted nothing that mattered.
That is the hallucination problem: AI code passes by eye because it pattern-matches to code that works, and its tests pass because they confirm what the code already does. Review cannot be the gate. Tests are the gate, and the tests themselves need evidence.
The evidence trio I require on every PR
- A passing run log. The CI run ID, linked in the PR description, showing green on the final commit. A URL, not a screenshot.
- A failing run log. Proof the same tests actually fail when the implementation is wrong: an earlier commit on the branch with a bug, or a run where a line was commented out. "I ran it locally" is not evidence.
- A mutation witness. One deliberate mutation applied to the new code and a CI run where the suite catches it. If flipping >= to > leaves every test green, the tests are not testing.
Together they answer the only question that matters: do the tests distinguish correct from incorrect behavior on this diff. A passing build alone does not.
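A minimal sketch of what a mutation witness demonstrates, in TypeScript. The free-shipping rule and every name below are hypothetical, and each suite is modeled as a plain function rather than a real test runner; the point is only that the boundary input is what separates >= from >.

```typescript
// Hypothetical rule for illustration: orders of 5000 cents or more ship free.
type Rule = (totalCents: number) => boolean;

const originalRule: Rule = (totalCents) => totalCents >= 5000;
// The deliberate mutation: >= flipped to >.
const mutantRule: Rule = (totalCents) => totalCents > 5000;

// A suite is modeled as a function that returns true when every check passes.
type Suite = (rule: Rule) => boolean;

// Weak suite: both inputs sit far from the boundary, so the mutant looks identical.
const weakSuite: Suite = (rule) => rule(9000) === true && rule(100) === false;

// Strong suite: adds the boundary value, the one input where >= and > disagree.
const strongSuite: Suite = (rule) =>
  rule(9000) === true && rule(100) === false && rule(5000) === true;

const mutationReport = {
  weakPassesOriginal: weakSuite(originalRule),
  weakCatchesMutant: !weakSuite(mutantRule),
  strongPassesOriginal: strongSuite(originalRule),
  strongCatchesMutant: !strongSuite(mutantRule),
};
```

Both suites are green on the original rule, so a passing run log cannot tell them apart. Only the run against the mutant shows that the weak suite never looked at the boundary.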
Coverage theater and how to spot it
Line coverage is the most misused metric in software. I have reviewed suites with 97 percent line coverage that would not catch a division by zero. The patterns repeat:
- Line coverage without branch coverage. The test walks through the function but never enters the else.
- Assertions that compare a value to itself or to a constant the code also returns.
- Tests that instantiate an object and never call the method they claim to test.
- Mocks configured so broadly the code under test never runs its logic.
- expect(result).toBeDefined(). The function returned something. What was it?
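Two of these patterns can be sketched in a few lines of TypeScript. The discountedPrice function and its planted bug are hypothetical, and the checks are plain booleans standing in for test assertions.

```typescript
// Hypothetical function with a planted bug in the else branch.
function discountedPrice(priceCents: number, percent: number): number {
  if (percent > 0) {
    return Math.round((priceCents * (100 - percent)) / 100);
  } else {
    return 0; // planted bug: a zero-percent discount should return priceCents unchanged
  }
}

// Theater: both branches execute, so line and branch coverage look complete,
// yet each check passes with the bug in place.
const theaterChecks: boolean[] = [
  discountedPrice(1000, 10) !== undefined, // "is defined" style check
  discountedPrice(1000, 0) === discountedPrice(1000, 0), // value compared to itself
];

// Real checks: each one names the value the caller actually depends on.
const realChecks: boolean[] = [
  discountedPrice(1000, 10) === 900,
  discountedPrice(1000, 0) === 1000, // fails, exposing the planted bug
];

const theaterPasses = theaterChecks.every((ok) => ok);
const realPasses = realChecks.every((ok) => ok);
```

The theater checks stay green while the function returns 0 for an undiscounted order. The second real check is the only one here that turns red.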
The four test types that need to exist
- Unit tests for pure logic, including the ugly branches nobody wants to think about.
- Integration tests crossing at least one real boundary: database, queue, file system, HTTP server. Not a mock of that boundary.
- Failure-path tests that deliberately break the dependency. Timeout, 500, malformed payload, disk full. If the code has a catch, something has to throw into it.
- Acceptance tests expressed as Given/When/Then, driven from the outside, covering the user-visible behavior the ticket was written for.
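As a sketch of the failure-path item, the TypeScript below injects a dependency that throws so the catch branch actually runs. displayName, FetchUser, and the fallback string are hypothetical names chosen for illustration.

```typescript
// Hypothetical dependency: looks up a user record by id.
type FetchUser = (id: string) => { name: string };

// Code under test: falls back to a placeholder when the lookup fails.
function displayName(fetchUser: FetchUser, id: string): string {
  try {
    return fetchUser(id).name;
  } catch {
    return "Unknown user";
  }
}

// Happy-path stub: never throws, so it never reaches the catch.
const workingLookup: FetchUser = () => ({ name: "Ada" });

// Failure-path stub: deliberately breaks the dependency.
const timingOutLookup: FetchUser = () => {
  throw new Error("timeout after 3000 ms");
};

const happyResult = displayName(workingLookup, "u1");
const failureResult = displayName(timingOutLookup, "u1");
```

A suite that only ever passes workingLookup reports the catch as present but never executes it, which is the "handles errors" test in name only.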
Given/When/Then that a machine can check
I keep acceptance criteria in the PR description in this shape, and I refuse to review until they are there:
Given a customer with a valid card that the gateway will decline
And an order in the pending state
When the payment handler processes the charge
Then the order moves to payment_failed
And a decline email is queued
And no capture request is recorded in the ledger
And the response body contains the gateway decline code
Four Thens, four assertions, one to one. If the acceptance test file has fewer assertions than the criteria have Thens, one of the two is lying.
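One way to keep that one-to-one mapping visible is to label each check with the Then it covers. The sketch below is an in-memory stand-in, not the real payment handler: processCharge, World, and the decline code "card_declined" are all assumptions made for illustration.

```typescript
type OrderState = "pending" | "paid" | "payment_failed";
type World = { orderState: OrderState; emailQueue: string[]; ledger: string[] };
type ChargeResponse = { body: { declineCode?: string } };
type StubGateway = () => { approved: boolean; code: string };

// Hypothetical handler, reduced to the behavior the criteria describe.
function processCharge(world: World, gateway: StubGateway): ChargeResponse {
  const result = gateway();
  if (!result.approved) {
    world.orderState = "payment_failed";
    world.emailQueue.push("decline_email");
    return { body: { declineCode: result.code } };
  }
  world.ledger.push("capture");
  world.orderState = "paid";
  return { body: {} };
}

function newPendingWorld(): World {
  return { orderState: "pending", emailQueue: [], ledger: [] };
}

// Given a customer with a valid card that the gateway will decline
const decliningGateway: StubGateway = () => ({ approved: false, code: "card_declined" });
// And an order in the pending state
const testWorld = newPendingWorld();
// When the payment handler processes the charge
const chargeResponse = processCharge(testWorld, decliningGateway);

// One labeled check per Then, in the same order as the criteria.
const thenChecks: Array<[string, boolean]> = [
  ["the order moves to payment_failed", testWorld.orderState === "payment_failed"],
  ["a decline email is queued", testWorld.emailQueue.indexOf("decline_email") !== -1],
  ["no capture request is recorded in the ledger", testWorld.ledger.indexOf("capture") === -1],
  ["the response body contains the gateway decline code", chargeResponse.body.declineCode === "card_declined"],
];

const failedThens = thenChecks.filter(([, ok]) => !ok).map(([label]) => label);
```

Because the labels are copied from the criteria, a reviewer can count four Thens against four entries without reading the handler.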
When the AI writes the tests too
The assistant has every incentive to write tests that confirm the code it just wrote. The only defenses that actually work:
- Mutation testing on changed files. Stryker, mutmut, PIT, whatever your language has. Run it on the diff and fail the build if the mutation score on new code drops below a floor. A surviving mutant is a gap the AI missed.
- Property-based tests for anything that takes structured input. The AI will write an example test with three rows; a property test with a thousand generated inputs finds the off-by-one.
- Adversarial case review. I write down three inputs the AI probably did not consider: empty, enormous, malformed. Then I check whether those inputs appear in the test file. They almost never do on the first pass.
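The property-based item can be illustrated with a hand-rolled sketch in TypeScript. This is not a property-testing library (in a real project something like fast-check would handle generation and shrinking), and the pagination functions, the seed, and the input ranges are hypothetical. It shows three example rows passing on a function with a planted off-by-one, while a thousand generated inputs checked against a property expose it.

```typescript
type PageCount = (totalItems: number, pageSize: number) => number;

// Planted off-by-one: wrong whenever pageSize divides totalItems evenly.
const buggyPageCount: PageCount = (totalItems, pageSize) =>
  Math.floor(totalItems / pageSize) + 1;
const fixedPageCount: PageCount = (totalItems, pageSize) =>
  Math.ceil(totalItems / pageSize);

// Three example rows [totalItems, pageSize, expected]; the buggy version passes all of them.
const exampleRows: Array<[number, number, number]> = [[10, 3, 4], [1, 10, 1], [25, 10, 3]];
const passesExamples = (f: PageCount): boolean =>
  exampleRows.every(([total, size, expected]) => f(total, size) === expected);

// Small seeded multiplicative generator so every run sees the same inputs.
function makeRandom(seed: number): () => number {
  let state = seed;
  return () => {
    state = (state * 48271) % 2147483647;
    return state / 2147483647;
  };
}

// Property: the pages hold every item, and the last page is not empty.
function findCounterexample(
  f: PageCount,
  runs: number,
  seed: number
): { totalItems: number; pageSize: number } | null {
  const random = makeRandom(seed);
  for (let i = 0; i < runs; i += 1) {
    const totalItems = 1 + Math.floor(random() * 200);
    const pageSize = 1 + Math.floor(random() * 20);
    const pages = f(totalItems, pageSize);
    const holds = pages * pageSize >= totalItems && (pages - 1) * pageSize < totalItems;
    if (!holds) return { totalItems, pageSize };
  }
  return null;
}

const buggyCounterexample = findCounterexample(buggyPageCount, 1000, 42);
const fixedCounterexample = findCounterexample(fixedPageCount, 1000, 42);
```

The property never mentions an expected page count, so it cannot simply restate what the implementation already does. That is what makes it harder for an assistant to write a confirming test by accident.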
Review checklist and CI gates
My reviewer checklist is short and I will not approve without hitting each item:
- For each new assertion, can I name the implementation line it would fail on if I deleted it?
- If I rename an internal variable, does the test still pass? It should.
- If I invert a business rule, does at least one test turn red? It had better.
- Is there a real failure-path test, not one named "handles errors" that only checks a try/catch exists?
- Did the test file grow in roughly the shape of the implementation, or is the implementation 300 lines and the tests 40?
Human discipline decays; CI does not. My pipelines fail the build if branch coverage drops, if any new source line is uncovered, if a source file grew by 50 lines without the test file growing, if the mutation score on changed files falls below the floor, or if the PR description is missing a Test-Evidence: tag linking to a CI run ID. That last gate is the highest-leverage one I have: a bot also disables merge when the tag points at a stale run, such as one from three force-pushes ago.
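A sketch of how the Test-Evidence gate might be checked, in TypeScript. The exact tag syntax (one line of the form "Test-Evidence: <run id>"), the run IDs, and the runCommits lookup table are assumptions; a real bot would fetch the run's commit from the CI system instead of a map.

```typescript
type GateResult = "ok" | "missing_tag" | "unknown_run" | "stale_run";

// runCommits maps a CI run ID to the commit that run was executed against.
function checkEvidenceTag(
  prDescription: string,
  headCommit: string,
  runCommits: { [runId: string]: string }
): GateResult {
  // Assumed tag format: a line of its own, "Test-Evidence: <run id>".
  const match = /^Test-Evidence:[ \t]*(\S+)[ \t]*$/m.exec(prDescription);
  if (match === null) return "missing_tag";
  const runId = match[1];
  if (runId === undefined) return "missing_tag";
  if (!Object.prototype.hasOwnProperty.call(runCommits, runId)) return "unknown_run";
  // A run against an older commit is stale: it proves nothing about the current head.
  return runCommits[runId] === headCommit ? "ok" : "stale_run";
}

// Hypothetical data: run-101 ran before a force-push, run-107 ran on the current head.
const knownRuns = { "run-101": "abc123", "run-107": "def456" };

const freshTag = checkEvidenceTag("Fixes decline path.\nTest-Evidence: run-107", "def456", knownRuns);
const staleTag = checkEvidenceTag("Fixes decline path.\nTest-Evidence: run-101", "def456", knownRuns);
const noTag = checkEvidenceTag("Fixes decline path.", "def456", knownRuns);
```

Comparing the run's commit to the branch head is what catches the force-push case: the tag is present and the run was green, but it ran against code that no longer exists on the branch.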
Snapshots, flakes, and other lies
Snapshot tests are the most common coverage theater in AI-generated code. The assistant calls toMatchSnapshot, the first run writes a snapshot that becomes the oracle, and the test forever confirms whatever the component produced that first time, bug included. Snapshots are useful only when a human reviews every changed line, so I ban them for behavior and allow them only for pure format fixtures.
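The mechanism is easy to see in a simplified model. The TypeScript below is not Jest's implementation; matchesSnapshot, the in-memory store, and renderTotal with its planted bug are hypothetical stand-ins for the write-on-first-run behavior described above.

```typescript
// Simplified snapshot store: test name -> recorded output.
const snapshotStore: { [testName: string]: string } = {};

function matchesSnapshot(testName: string, actual: string): boolean {
  if (!Object.prototype.hasOwnProperty.call(snapshotStore, testName)) {
    snapshotStore[testName] = actual; // first run: whatever came out becomes the oracle
    return true;
  }
  return snapshotStore[testName] === actual;
}

// Hypothetical formatter with a planted bug: it should divide by 100.
function renderTotal(cents: number): string {
  return "$" + (cents / 10).toFixed(2);
}

const firstRunPasses = matchesSnapshot("order total", renderTotal(1999));
const secondRunPasses = matchesSnapshot("order total", renderTotal(1999));

// An explicit expectation written by someone who knows the intended output.
const explicitCheckPasses = renderTotal(1999) === "$19.99";
```

Both snapshot runs are green because the recorded oracle is the buggy output itself. The explicit check is red because it encodes the intended value independently of the code.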
Flaky tests get the same blunt policy: a flaky test is a failing test with good public relations. Fix within one business day or delete. Never skip. Skipped tests rot into documentation of a feature nobody tests, invisible on the green dashboard. If the thing is genuinely probabilistic, make it deterministic in test mode or move it to a nightly soak suite that reports separately.
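One common way to make probabilistic code deterministic in test mode is to inject the source of randomness. The sketch below assumes a hypothetical retryDelayMs helper with jitter; production uses Math.random by default, and the test passes a fixed source.

```typescript
type RandomSource = () => number;

// Exponential backoff plus jitter; the random source is injectable for tests.
function retryDelayMs(
  baseMs: number,
  attempt: number,
  random: RandomSource = Math.random
): number {
  const backoff = baseMs * Math.pow(2, attempt);
  const jitter = Math.floor(random() * baseMs);
  return backoff + jitter;
}

// Test mode: a fixed source makes the result exact and repeatable.
const fixedRandom: RandomSource = () => 0.5;
const delayInTest = retryDelayMs(100, 2, fixedRandom); // 400 backoff + 50 jitter

// Production mode: only the range can be asserted, not the exact value.
const delayInProduction = retryDelayMs(100, 1);
```

With the fixed source the test can assert an exact value on every run, so a red result means a real regression instead of an unlucky draw.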
What changed after we enforced this
The first month was painful. PR throughput dropped because the assistant kept opening PRs that failed the mutation gate. By month three the assistant, steered by our prompts and pinned examples, was producing tests that killed the mutants on the first try. The decline-path bug has not recurred, because the failing-run log requirement blocks any merge whose test suite has never been seen to fail. Evidence is cheap once you ask for it. Trust is expensive once you lose it.
AI Review Packet to Copy
Use this before an AI-generated diff reaches code review. It turns the prompt, the allowed scope, and the required proof into one reviewable artifact.
AI coding review packet: AI Coding with Test-Evidence Gates
Decision to make:
- Test-evidence gates for AI-generated code: what tests must exist before merge, how to detect coverage theater, and the evidence structure that catches hallucinated implementations.
Owner check:
- Product owner:
- Engineering owner:
- QA or operations reviewer:
Scope boundary:
- In scope:
- Out of scope:
- Assumption that still needs approval:
Acceptance evidence:
- Test or fixture:
- Log, metric, or screenshot:
- Manual review step:
AI boundary: generated changes must stay inside the written scope and attach evidence for each acceptance criterion.
Reviewer prompt:
- What would still be ambiguous to someone who missed the planning meeting?
- What evidence would make this safe enough to ship?
Editorial Review Note
Reviewed Apr 28, 2026. This update added a reusable artifact, checked the article against the related topic hub, and tightened the next-step links so the page works as a practical reference rather than a standalone essay.
Editorial Note
Last reviewed Apr 28, 2026: examples, internal links, and reusable review blocks were checked for practical specificity.
- Author details: Spec Coding Editorial Team