AI Coding PR Review with Acceptance Criteria

How to review an AI-generated pull request against acceptance criteria: what to check by eye, what to run, and the failure modes that LLM-authored code slips past a quick skim.


Published on 2026-03-10 · Updated 2026-06-02 · 8 min read · Author: Spec Coding Editorial Team · Review policy: Editorial Policy

Field note: mutate one line before approving

When I do not trust an AI-generated test, I break the implementation on purpose. If the test still passes, the PR is not ready. That tiny ritual catches more fake confidence than a long review comment thread.

Mutation check:
1. Comment out the retry branch.
2. Run the retry test.
3. If it still passes, reject the PR.
4. Ask for a test that fails when the required behavior is removed.
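
A minimal sketch of a test that survives this check. The names `FlakySender` and `deliver_with_retry` are hypothetical and stand in for whatever the PR actually adds; the point is only that the test asserts on the number of attempts, so it fails once the retry branch is gone.

```python
class FlakySender:
    """Hypothetical fake: fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.failures_left = failures_before_success
        self.calls = 0

    def send(self, payload):
        self.calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise ConnectionError("simulated transient failure")
        return "delivered"


def deliver_with_retry(sender, payload, max_attempts=3):
    """Assumed behavior under review: retry transient failures up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return sender.send(payload)
        except ConnectionError:
            if attempt == max_attempts:
                raise


def test_retries_until_success():
    sender = FlakySender(failures_before_success=2)
    assert deliver_with_retry(sender, {"id": 1}) == "delivered"
    # Asserting on the attempt count is what makes this test fail
    # if the retry branch is commented out.
    assert sender.calls == 3
```

If the loop is replaced with a single `sender.send(payload)` call, the first simulated failure propagates and the test goes red, which is the outcome step 3 is looking for.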

The Mental Shift That Changes Everything

When I review a pull request an engineer wrote by hand, I read for style, idiom, and judgment. When I review one an LLM wrote, I read for one thing only: did it satisfy the acceptance criteria in the spec, without doing anything the spec did not authorize? That is the whole job. The spec is the rubric. If the spec does not cover it, I cannot meaningfully review it, and I send the author back to write a better spec instead of approving something nobody has a way to judge.

I stopped asking "does this code look reasonable?" and started asking "which acceptance criterion does this line exist to satisfy?" Any line that cannot answer is suspect.

Run the Tests, Do Not Trust the Checkmark

CI passes. The green tick is there. That tells me almost nothing. I pull the branch and run the tests locally, or at minimum I open the CI log and read what actually executed. I have lost count of AI-generated PRs with clean green checks because the test file imported the module, constructed the class, and asserted almost nothing — or mocked the one function that would have caught the bug. Watch also for tests that pass while the production path they claim to cover sits behind a feature flag the test never enables. Green. Useless.

Map Every Criterion to a Test

Now I open the spec. For each acceptance criterion, I search the diff for the test that demonstrates it. If I cannot point at a specific test and say "this covers criterion three," criterion three is not covered, regardless of how thorough the PR looks. I reject at this step more than any other. The fix is not to add tests myself; it is to return the PR with the missing criteria listed and let the author try again. This is also where I insist PR descriptions list the criteria closed. If the author cannot write "closes AC-1, AC-2, AC-4" at the top of the PR, they do not know what they built. Neither do I.
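
The mapping can be partly mechanized if the team adopts a naming convention. The sketch below assumes criteria are written as `AC-<number>` in the spec and that tests carry the number in their name, for example `test_ac3_rejects_expired_token`; both conventions are assumptions for illustration, not something every project follows.

```python
import re

AC_PATTERN = re.compile(r"\bAC-(\d+)\b")
TEST_AC_PATTERN = re.compile(r"_ac(\d+)(?:_|$)")


def criteria_in(text):
    """Extract criterion IDs such as 'AC-3' from free text."""
    return {f"AC-{n}" for n in AC_PATTERN.findall(text)}


def uncovered_criteria(spec_text, test_names):
    """Return spec criteria that no test name refers to, in numeric order."""
    covered = set()
    for name in test_names:
        for n in TEST_AC_PATTERN.findall(name.lower()):
            covered.add(f"AC-{n}")
    missing = criteria_in(spec_text) - covered
    return sorted(missing, key=lambda ac: int(ac.split("-")[1]))
```

The same `criteria_in` helper can be pointed at the PR description to compare what the author claims to have closed against what the spec lists. A name match only shows that a test claims a criterion; whether it actually demonstrates it still needs a reviewer.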

Smuggled Scope and Missing Error Paths

LLMs love to be helpful. A spec that asked for a date-parsing function comes back as a date-parsing function plus a new utility file, plus a dependency on a date library the project did not use, plus a refactor of an unrelated helper because the model decided it "could be cleaner." None of that was authorized. I walk the diff for any file, import, or abstraction the spec did not call for, and I cut it or reject the PR. Unauthorized scope is how codebases rot one helpful suggestion at a time.
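
The file-level part of that walk is easy to sketch, assuming the spec (or the reviewer) can name the paths the change is allowed to touch. The list of changed paths would typically come from something like `git diff --name-only`; the patterns and paths below are invented for illustration.

```python
from fnmatch import fnmatch


def out_of_scope_files(changed_files, allowed_patterns):
    """Return changed paths that match none of the allowed patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical example: the spec authorized the date parser and its test only.
changed = [
    "src/dates/parse.py",
    "tests/test_parse.py",
    "src/utils/strings.py",
    "requirements.txt",
]
allowed = ["src/dates/*", "tests/test_parse.py"]
flagged = out_of_scope_files(changed, allowed)
```

Here `flagged` holds the unrelated helper and the dependency file. New imports and abstractions inside an allowed file still have to be caught by reading the diff.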

Then the other direction. LLMs are happy-path machines. I look specifically for: input null, network fails, file missing, user unauthorized, queue full. If the spec lists those cases, I expect tests. Often I find a try/except that swallows everything, logs nothing, and returns a default. That is worse than a crash; it is a crash with the evidence deleted.
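
To make the contrast concrete, here is a small sketch of the swallowing pattern next to an explicit error path, plus the kind of test I expect for the "file missing" case. `ConfigError`, both loader functions, and the path are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


class ConfigError(Exception):
    """Raised when the config file cannot be loaded."""


def load_config_swallowing(path):
    # The pattern to reject: every failure becomes an empty default,
    # and nothing is logged.
    try:
        with open(path) as handle:
            return handle.read()
    except Exception:
        return ""


def load_config(path):
    # The alternative: the named failure is caught, logged,
    # and surfaced to the caller.
    try:
        with open(path) as handle:
            return handle.read()
    except FileNotFoundError as exc:
        logger.error("config file missing: %s", path)
        raise ConfigError(f"missing config: {path}") from exc


def test_missing_file_is_reported():
    try:
        load_config("/nonexistent/config.toml")
    except ConfigError:
        return
    raise AssertionError("expected ConfigError for a missing file")
```

The swallowing version would pass a happy-path test and return an empty string for the missing file, which is exactly the evidence-free failure described above.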

Hunt Hallucinations, Then Hand-Mutate

The model confidently calls a method that does not exist, imports a function the module never exported, or passes three arguments to something that takes two. Type checkers catch most of these. Some slip through on code paths the tests never trigger. I grep every imported symbol and method call against the real library, by hand. Ten minutes of suspicion beats a 2 a.m. page.
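
Part of that check can be scripted for Python dependencies. The sketch below assumes the reviewer has collected `(module, attribute)` pairs from the diff by hand and runs it inside the project's own environment, since importing a module executes it. The helper name is made up for illustration.

```python
import importlib


def missing_symbols(references):
    """Return the 'module.attr' references that do not resolve.

    `references` is a list of (module_name, attribute_name) pairs
    collected from the diff.
    """
    missing = []
    for module_name, attr in references:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            missing.append(f"{module_name}.{attr}")
            continue
        if not hasattr(module, attr):
            missing.append(f"{module_name}.{attr}")
    return missing
```

For example, `missing_symbols([("json", "loads"), ("json", "parse")])` flags `json.parse`, because the standard `json` module has `loads` but no `parse`. This does not check argument counts or methods on instances, so it narrows the manual pass rather than replacing it.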

Then the step most reviewers skip and I trust most. I pick one line in the new code and change it to something wrong. Flip a boolean. Return null. Comment out the line. Run the tests. If nothing fails, the tests are decorative. They exercised the code; they did not test it. Two minutes, and it catches the most common AI test failure: tests that assert what the code does rather than what the spec requires.
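
A toy illustration of the difference between a decorative test and one that kills the mutant. Everything here is hypothetical: `is_expired` stands in for the code under review, and the mutant is the hand edit.

```python
def is_expired(expires_at, now):
    """Code under review: a token is expired once `now` reaches `expires_at`."""
    return now >= expires_at


def is_expired_mutant(expires_at, now):
    """Hand mutation: the comparison is flipped."""
    return now < expires_at


def decorative_test(fn):
    # Exercises the code but only checks the return type.
    assert isinstance(fn(100, 50), bool)


def real_test(fn):
    # Checks the behavior the spec requires.
    assert fn(100, 150) is True
    assert fn(100, 50) is False


def survives(test, fn):
    """True if the test passes against fn, meaning a mutant went unnoticed."""
    try:
        test(fn)
    except AssertionError:
        return False
    return True
```

With these definitions, `survives(decorative_test, is_expired_mutant)` is true and `survives(real_test, is_expired_mutant)` is false. In a real review the "harness" is just an editor and the test runner; the sketch only shows what the two outcomes mean.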

Acceptance Criteria Belong in Given/When/Then

The review is only as good as the criteria. When a spec uses Given/When/Then, review becomes mechanical:

Given a logged-in user with an expired session token
When they submit the checkout form
Then the request is rejected with a 401
And the client shows the re-authentication modal
And no partial order is written to the database

Three assertions, three tests. If the PR has two, it is not done. Had the criterion been "handle expired sessions gracefully," the PR could be anything and I would have no grounds to push back. Vague criteria produce vague reviews, and vague reviews produce shipped bugs.
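
As a sketch of what "mechanical" looks like, each Then or And line in the criterion above becomes one test. `FakeCheckout` is a hypothetical stand-in for the real endpoint, database, and client; in an actual PR the tests would drive the production code instead.

```python
class FakeCheckout:
    """Hypothetical stand-in for the checkout endpoint and its collaborators."""

    def __init__(self):
        self.orders = []          # stands in for the database
        self.modal_shown = False  # stands in for client UI state

    def submit(self, token_expired):
        if token_expired:
            self.modal_shown = True
            return 401
        self.orders.append({"status": "placed"})
        return 200


def submit_with_expired_token():
    # Given a logged-in user with an expired session token,
    # when they submit the checkout form.
    checkout = FakeCheckout()
    status = checkout.submit(token_expired=True)
    return checkout, status


def test_request_is_rejected_with_401():
    _, status = submit_with_expired_token()
    assert status == 401


def test_reauthentication_modal_is_shown():
    checkout, _ = submit_with_expired_token()
    assert checkout.modal_shown


def test_no_partial_order_is_written():
    checkout, _ = submit_with_expired_token()
    assert checkout.orders == []
```

Keeping one clause per test means a missing clause shows up as a missing test name, which is easy to see in the diff.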

A PR That Looked Fine

Last month I almost approved a PR adding retry logic to a payment webhook handler. Clean diff. Green CI. Two new tests, both passing. The agent even wrote a tidy PR description. I ran the mutation check out of habit and commented out the retry loop entirely. Both tests still passed. The tests asserted on the shape of the request object, not on whether retries occurred — the happy-path test succeeded on the first attempt, and the failure-path test mocked the HTTP client in a way that silently short-circuited the retry. The webhook would have failed once in production and given up forever. I sent it back with the retry criterion quoted and asked for a test that fails when retries are removed. The second PR took fifteen minutes. The bug would have taken a weekend.

Send It Back, Do Not Fix It Yourself

The tempting move is to patch the small stuff and merge. Every time I do, I teach the team that reviewers clean up after the model, and the next PR arrives with more holes. My default: list the failures against the spec, close the PR, and ask for another attempt, tightening the spec first if the original was ambiguous. The agent can try again in thirty seconds. I also grep for the recurring offenders on every AI PR: TODOs left in, silent fallbacks to defaults when a config key is missing, bare except blocks that log nothing, hardcoded constants where the spec named an env var. None are style issues. All violate the spec.
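
The grep pass can be captured in a few patterns. This is a heuristic sketch for Python diffs: the patterns are my own approximations, every hit still needs a human look, and hardcoded constants that should have been env vars are not covered because they depend on what the spec names.

```python
import re

# Heuristic patterns only; a match is a prompt to read the line, not a verdict.
OFFENDERS = {
    "leftover TODO": re.compile(r"#\s*(TODO|FIXME)\b"),
    "bare except": re.compile(r"^\s*except\s*:"),
    "silent config default": re.compile(r"\.get\(\s*['\"][\w.]+['\"]\s*,"),
}


def scan_added_lines(added_lines):
    """Return (line_number, label, text) for each added line matching a pattern."""
    hits = []
    for number, text in enumerate(added_lines, start=1):
        for label, pattern in OFFENDERS.items():
            if pattern.search(text):
                hits.append((number, label, text.strip()))
    return hits
```

Feeding it only the lines the PR adds keeps the output focused on what the agent wrote rather than on pre-existing debt.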

Rubber-Stamping Is the Real Risk

Teams drift. The first AI PR gets a careful review. The tenth gets a skim. By the fiftieth, someone is approving on the green check. That is how LLM-authored code quietly takes over a codebase. The discipline is boring: pull the branch, run the tests, check the criteria, mutate a line. Every time. When the spec is sharp, review takes fifteen minutes and I can defend every approval. When it is vague, I refuse to review until it is fixed. Acceptance criteria are not a formality — they are the only reason I can tell signal from confident-sounding noise.

AI Review Packet to Copy

Use this when an AI-assisted PR looks tidy but the reviewer still needs to compare it against written acceptance criteria. The packet keeps the review anchored to behavior, not code style.

AI coding review packet: AI Coding PR Review with Acceptance Criteria

Decision to make:
- Approve or return this AI-generated pull request, based on whether every acceptance criterion has attached evidence and the diff stays inside the written scope.

Owner check:
- Product owner:
- Engineering owner:
- QA or operations reviewer:

Scope boundary:
- In scope:
- Out of scope:
- Assumption that still needs approval:

Acceptance evidence:
- Test or fixture:
- Log, metric, or screenshot:
- Manual review step:

AI boundary: generated changes must stay inside the written scope and attach evidence for each acceptance criterion.

Reviewer prompt:
- What would still be ambiguous to someone who missed the planning meeting?
- What evidence would make this safe enough to ship?

Case study: acceptance criteria that stopped scope drift

An AI-generated PR added a broad database query while implementing a narrow account filter. The acceptance criteria gave reviewers a precise reason to reject the diff without rewriting it themselves.

Criterion	Diff finding	Reviewer action
Only current account results are visible.	Query filtered by email but not account_id.	Reject with cross-account fixture requirement.
No new admin permissions.	Role check widened to support_admin.	Reject as forbidden scope.
Search remains paginated.	Generated query removed limit.	Require performance evidence before merge.
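
A sketch of the cross-account fixture requirement from the first row. The fixture deliberately reuses one email across two accounts, so a query that filters by email alone returns both rows and the test fails. `search_users` and the data are hypothetical.

```python
ACCOUNTS_FIXTURE = [
    {"account_id": 1, "email": "pat@example.com", "name": "Pat (account 1)"},
    {"account_id": 2, "email": "pat@example.com", "name": "Pat (account 2)"},
]


def search_users(rows, account_id, email, limit=50):
    """Assumed correct behavior: filter by account_id as well as email, with a limit."""
    matches = [
        row for row in rows
        if row["account_id"] == account_id and row["email"] == email
    ]
    return matches[:limit]


def test_search_never_crosses_accounts():
    results = search_users(ACCOUNTS_FIXTURE, account_id=1, email="pat@example.com")
    assert [row["account_id"] for row in results] == [1]
```

The reviewer does not write this test; the rejection comment asks the author for one like it, with the criterion quoted.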