AI Coding PR Review with Acceptance Criteria
How to review an AI-generated pull request against acceptance criteria: what to check by eye, what to run, and the failure modes that LLM-authored code slips past a quick skim.
Field note: mutate one line before approving
When I do not trust an AI-generated test, I break the implementation on purpose. If the test still passes, the PR is not ready. That tiny ritual catches more fake confidence than a long review comment thread.
Mutation check:
1. Comment out the retry branch.
2. Run the retry test.
3. If it still passes, reject the PR.
4. Ask for a test that fails when the required behavior is removed.
The Mental Shift That Changes Everything
When I review a pull request an engineer wrote by hand, I read for style, idiom, and judgment. When I review one an LLM wrote, I read for one thing only: did it satisfy the acceptance criteria in the spec, without doing anything the spec did not authorize? That is the whole job. The spec is the rubric. If the spec does not cover it, I cannot meaningfully review it, and I send the author back to write a better spec instead of approving something nobody has a way to judge.
I stopped asking "does this code look reasonable?" and started asking "which acceptance criterion does this line exist to satisfy?" Any line that cannot answer is suspect.
Run the Tests, Do Not Trust the Checkmark
CI passes. The green tick is there. That tells me almost nothing. I pull the branch and run the tests locally, or at minimum I open the CI log and read what actually executed. I have lost count of AI-generated PRs with clean green checks because the test file imported the module, constructed the class, and asserted almost nothing — or mocked the one function that would have caught the bug. Watch also for tests that pass while the production path they claim to cover sits behind a feature flag the test never enables. Green. Useless.
Map Every Criterion to a Test
Now I open the spec. For each acceptance criterion, I search the diff for the test that demonstrates it. If I cannot point at a specific test and say "this covers criterion three," criterion three is not covered, regardless of how thorough the PR looks. I reject at this step more than any other. The fix is not to add tests myself; it is to return the PR with the missing criteria listed and let the author try again. This is also where I insist PR descriptions list the criteria closed. If the author cannot write "closes AC-1, AC-2, AC-4" at the top of the PR, they do not know what they built. Neither do I.
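The criterion-to-test mapping can be partly mechanized if the team adopts a naming convention. The sketch below assumes one that the article does not prescribe: test names embed the criterion number, as in `test_ac_3_...`. All function names are hypothetical. It reports which spec criteria have no matching test and which criteria the PR description claims without a test to back them.

```python
import re


def closed_criteria(pr_description: str) -> set[str]:
    """Extract IDs such as AC-1 from a PR description."""
    found = re.findall(r"\bAC-\d+\b", pr_description, flags=re.IGNORECASE)
    return {item.upper() for item in found}


def criteria_in_test_name(test_name: str) -> set[str]:
    """Map a name like test_ac_3_rejects_expired_token to {'AC-3'}."""
    numbers = re.findall(r"(?:^|_)ac_?(\d+)(?=_|$)", test_name.lower())
    return {f"AC-{n}" for n in numbers}


def covered_criteria(test_names: list[str]) -> set[str]:
    covered: set[str] = set()
    for name in test_names:
        covered |= criteria_in_test_name(name)
    return covered


def uncovered(spec_ids: list[str], test_names: list[str]) -> list[str]:
    """Spec criteria that no test name points at, in numeric order."""
    missing = set(spec_ids) - covered_criteria(test_names)
    return sorted(missing, key=lambda s: int(s.split("-")[1]))


spec = ["AC-1", "AC-2", "AC-3", "AC-4"]
tests = [
    "test_ac_1_parses_iso_date",
    "test_ac_2_rejects_empty_string",
    "test_ac_4_handles_timezone",
]
print(uncovered(spec, tests))  # ['AC-3']
```

A name match only shows that someone labeled a test with the criterion. Whether the test actually demonstrates the criterion still takes a human read.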
Smuggled Scope and Missing Error Paths
LLMs love to be helpful. A spec that asked for a date-parsing function comes back as a date-parsing function plus a new utility file, plus a dependency on a date library the project did not use, plus a refactor of an unrelated helper because the model decided it "could be cleaner." None of that was authorized. I walk the diff for any file, import, or abstraction the spec did not call for, and I cut it or reject the PR. Unauthorized scope is how codebases rot one helpful suggestion at a time.
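The scope walk can start from a list of changed paths (for example, the output of `git diff --name-only`) checked against the paths the spec authorized. This is a minimal sketch: the allow-list patterns and file names are invented for illustration, and it only catches new or touched files, not new imports inside an authorized file.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed paths that match none of the authorized patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]


# Hypothetical spec: only the date parser and its test file may change.
allowed = ["src/dates/*", "tests/test_parse.py"]
changed = [
    "src/dates/parse.py",
    "tests/test_parse.py",
    "src/utils/strings.py",   # unrelated refactor
    "requirements.txt",       # new dependency
]
print(out_of_scope(changed, allowed))
# ['src/utils/strings.py', 'requirements.txt']
```

Anything the function returns is a question for the author: which acceptance criterion required this file to change?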
Then the other direction. LLMs are happy-path machines. I look specifically for: input null, network fails, file missing, user unauthorized, queue full. If the spec lists those cases, I expect tests. Often I find a try/except that swallows everything, logs nothing, and returns a default. That is worse than a crash; it is a crash with the evidence deleted.
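To make the swallowed-exception pattern concrete, here is a hypothetical config loader written both ways. The function names, the `limit` key, and the `ConfigError` type are assumptions for illustration only. The first version turns every failure into a default with no trace; the second catches only the failures it expects, logs them, and raises.

```python
import json
import logging

logger = logging.getLogger("config")


def load_limit_swallowing(path: str) -> int:
    """Anti-pattern: any failure silently becomes the default value."""
    try:
        with open(path) as fh:
            return json.load(fh)["limit"]
    except Exception:
        return 100  # missing file, bad JSON, and missing key all look the same


class ConfigError(Exception):
    """Raised when the limit cannot be read from the config file."""


def load_limit_strict(path: str) -> int:
    """Catch only expected failures, leave evidence, and let the caller decide."""
    try:
        with open(path) as fh:
            return json.load(fh)["limit"]
    except (OSError, json.JSONDecodeError, KeyError) as exc:
        logger.error("cannot read limit from %s: %r", path, exc)
        raise ConfigError(f"cannot read limit from {path}") from exc
```

If the spec does call for a default when the file is missing, the strict shape still helps: the fallback becomes one explicit, logged branch that a test can target, instead of a side effect of catching everything.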
Hunt Hallucinations, Then Hand-Mutate
The model confidently calls a method that does not exist, imports a function the module never exported, passes three arguments to something that takes two. Type checkers catch most. Some slip through on code paths the tests never trigger. I grep every imported symbol and method call against the real library, by hand. Ten minutes of suspicion beats a 2am page.
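Part of the symbol check can be scripted. The sketch below, with the hypothetical helper `missing_imports`, parses a source string and reports every `from module import name` that does not resolve in the current environment. It assumes it runs in the same environment as the PR, it really imports the modules (so module-level side effects will run), and it skips relative imports. It does not check method calls or argument counts.

```python
import ast
import importlib


def _resolves(module_name: str, attr: str) -> bool:
    """True if `attr` is an attribute or an importable submodule of the module."""
    module = importlib.import_module(module_name)
    if hasattr(module, attr):
        return True
    try:
        importlib.import_module(f"{module_name}.{attr}")
        return True
    except ImportError:
        return False


def missing_imports(source: str) -> list[str]:
    """Return 'module.name' for each absolute from-import that does not resolve."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ImportFrom) or node.level or node.module is None:
            continue  # not a from-import, or a relative import
        try:
            importlib.import_module(node.module)
        except ImportError:
            missing.append(node.module)
            continue
        for alias in node.names:
            if alias.name != "*" and not _resolves(node.module, alias.name):
                missing.append(f"{node.module}.{alias.name}")
    return missing


# `json.parse` does not exist in the standard library; `json.loads` does.
sample = "from json import loads, parse\nfrom datetime import datetime\n"
print(missing_imports(sample))  # ['json.parse']
```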
Then the step most reviewers skip and I trust most. I pick one line in the new code and change it to something wrong. Flip a boolean. Return null. Comment out the line. Run the tests. If nothing fails, the tests are decorative. They exercised the code; they did not test it. Two minutes, and it catches the most common AI test failure: tests that assert what the code does rather than what the spec requires.
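The difference between exercising code and testing it shows up in a small example. Everything below is hypothetical: a one-line shipping rule, a hand-written mutant with `>=` flipped to `>`, a decorative test that stays far from the boundary, and a test that pins the value the spec names. The tests are written as functions that take the implementation as an argument so the same check can run against the original and the mutant.

```python
def free_shipping(total_cents: int) -> bool:
    """Hypothetical spec: orders of 5000 cents or more ship free."""
    return total_cents >= 5000


def free_shipping_mutant(total_cents: int) -> bool:
    """The same function after a hand mutation: >= flipped to >."""
    return total_cents > 5000


def decorative_test(fn) -> bool:
    # Exercises the code, but only with values far from the boundary.
    return fn(9000) is True and fn(100) is False


def spec_test(fn) -> bool:
    # Pins the boundary the spec actually names.
    return fn(5000) is True and fn(4999) is False


def survives(test, mutant) -> bool:
    """A mutant survives when the test still passes against it."""
    return test(mutant)


print(survives(decorative_test, free_shipping_mutant))  # True: test is decorative
print(survives(spec_test, free_shipping_mutant))        # False: mutation caught
```

Both tests pass against the original function, which is exactly why a green run cannot tell them apart. Only the mutation does.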
Acceptance Criteria Belong in Given/When/Then
The review is only as good as the criteria. When a spec uses Given/When/Then, review becomes mechanical:
Given a logged-in user with an expired session token
When they submit the checkout form
Then the request is rejected with a 401
And the client shows the re-authentication modal
And no partial order is written to the database
The Then clause makes three assertions, so I expect three tests. If the PR has two, it is not done. Had the criterion been "handle expired sessions gracefully," the PR could be anything and I would have no grounds to push back. Vague criteria produce vague reviews, and vague reviews produce shipped bugs.
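As a sketch of what "mechanical" can look like, the tests below mirror the expired-session scenario with one test per outcome line in the Then clause. The handler, the in-memory store, and the `reauth_required` flag are all stand-ins invented for illustration. In particular, the modal lives in the client, so this sketch assumes the server signals it with a flag that the client acts on.

```python
from dataclasses import dataclass, field


@dataclass
class FakeOrderStore:
    """In-memory stand-in for the orders table."""
    rows: list = field(default_factory=list)


@dataclass
class Response:
    status: int
    reauth_required: bool = False


def submit_checkout(session_expired: bool, cart: list, store: FakeOrderStore) -> Response:
    """Hypothetical handler: the session is checked before anything is written."""
    if session_expired:
        return Response(status=401, reauth_required=True)
    store.rows.append({"items": list(cart)})
    return Response(status=201)


# Then the request is rejected with a 401
def test_expired_session_is_rejected_with_401():
    assert submit_checkout(True, ["book"], FakeOrderStore()).status == 401


# And the client shows the re-authentication modal
def test_expired_session_asks_client_to_reauthenticate():
    assert submit_checkout(True, ["book"], FakeOrderStore()).reauth_required is True


# And no partial order is written to the database
def test_expired_session_writes_no_partial_order():
    store = FakeOrderStore()
    submit_checkout(True, ["book"], store)
    assert store.rows == []
```

Keeping each outcome in its own test means a failure names the exact line of the criterion that broke, which is what the reviewer needs to quote back to the author.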
A PR That Looked Fine
Last month I almost approved a PR adding retry logic to a payment webhook handler. Clean diff. Green CI. Two new tests, both passing. The agent even wrote a tidy PR description. I ran the mutation check out of habit and commented out the retry loop entirely. Both tests still passed. The tests asserted on the shape of the request object, not on whether retries occurred — the happy-path test succeeded on the first attempt, and the failure-path test mocked the HTTP client in a way that silently short-circuited the retry. The webhook would have failed once in production and given up forever. I sent it back with the retry criterion quoted and asked for a test that fails when retries are removed. The second PR took fifteen minutes. The bug would have taken a weekend.
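The kind of test I asked for can be sketched as follows. This is not the code from that PR: the transport double, the `deliver` function, and the three-attempt limit are assumptions for illustration, and backoff is omitted to keep the example short. The point is that the check counts attempts, so it cannot pass once the retry loop is gone.

```python
class FlakyTransport:
    """Test double that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def send(self, payload: dict) -> int:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated outage")
        return 200


def deliver(transport, payload: dict, max_attempts: int = 3) -> int:
    """Retry on connection errors, up to max_attempts tries in total."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return transport.send(payload)
        except ConnectionError as exc:
            last_error = exc
    raise last_error


def deliver_without_retry(transport, payload: dict, max_attempts: int = 3) -> int:
    """What is left after the retry loop is commented out."""
    return transport.send(payload)


def retry_test_passes(deliver_fn) -> bool:
    """Passes only if delivery succeeds after exactly two failed attempts."""
    transport = FlakyTransport(failures=2)
    try:
        status = deliver_fn(transport, {"event": "payment.succeeded"})
    except ConnectionError:
        return False
    return status == 200 and transport.calls == 3


print(retry_test_passes(deliver))                # True
print(retry_test_passes(deliver_without_retry))  # False
```

A test that only inspected the payload passed to `send` would pass for both versions, which is the trap the original tests fell into.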
Send It Back, Do Not Fix It Yourself
The tempting move is to patch the small stuff and merge. Every time I do, I teach the team that reviewers clean up after the model, and the next PR arrives with more holes. My default: list the failures against the spec, close the PR, ask for another attempt with a tightened spec if the original was ambiguous. The agent can try again in thirty seconds. I also grep for the recurring offenders on every AI PR — TODOs left in, silent fallbacks to defaults when a config key is missing, bare except blocks that log nothing, hardcoded constants where the spec named an env var. None are style issues. All violate the spec.
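The recurring-offender grep can be sketched as a scan over the lines a diff adds. The patterns below are rough assumptions, not a complete linter: they flag TODO and FIXME markers, bare `except:` clauses, and dictionary-style `.get("key", default)` calls that may hide a missing config key. Hardcoded constants that should be env vars are not covered, because spotting them requires knowing what the spec named.

```python
import re

# Hypothetical patterns; tune them to the codebase under review.
PATTERNS = {
    "todo": re.compile(r"\b(TODO|FIXME)\b"),
    "bare-except": re.compile(r"^\s*except\s*:"),
    "silent-default": re.compile(r"\.get\(\s*['\"][^'\"]+['\"]\s*,"),
}


def scan_added_lines(diff_text: str) -> list[tuple[str, str]]:
    """Return (label, code) pairs for suspicious lines added by a unified diff."""
    findings = []
    for line in diff_text.splitlines():
        # Added lines start with '+'; '+++' is the file header, not code.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        code = line[1:]
        for label, pattern in PATTERNS.items():
            if pattern.search(code):
                findings.append((label, code.strip()))
    return findings


sample_diff = "\n".join([
    "+++ b/app/settings.py",
    '+    timeout = config.get("timeout", 30)',
    "+    try:",
    "+        connect()",
    "+    except:",
    "+        pass  # TODO handle this",
    " unchanged = True",
])

for label, code in scan_added_lines(sample_diff):
    print(label, "|", code)
```

Each hit is a prompt to check the spec, not a verdict. A default may be exactly what the criterion asked for, in which case there should be a test that says so.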
Rubber-Stamping Is the Real Risk
Teams drift. The first AI PR gets a careful review. The tenth gets a skim. By the fiftieth, someone is approving on the green check. That is how LLM-authored code quietly takes over a codebase. The discipline is boring: pull the branch, run the tests, check the criteria, mutate a line. Every time. When the spec is sharp, review takes fifteen minutes and I can defend every approval. When it is vague, I refuse to review until it is fixed. Acceptance criteria are not a formality — they are the only reason I can tell signal from confident-sounding noise.
AI Review Packet to Copy
Use this before an AI-generated diff reaches code review. It turns the prompt, the allowed scope, and the required proof into one reviewable artifact.
AI coding review packet: AI Coding PR Review with Acceptance Criteria
Decision to make:
- How to review an AI-generated pull request against acceptance criteria: what to check by eye, what to run, and the failure modes that LLM-authored code slips past a quick skim.
Owner check:
- Product owner:
- Engineering owner:
- QA or operations reviewer:
Scope boundary:
- In scope:
- Out of scope:
- Assumption that still needs approval:
Acceptance evidence:
- Test or fixture:
- Log, metric, or screenshot:
- Manual review step:
AI boundary: generated changes must stay inside the written scope and attach evidence for each acceptance criterion.
Reviewer prompt:
- What would still be ambiguous to someone who missed the planning meeting?
- What evidence would make this safe enough to ship?
Flagship Use Path
This is one of the primary Spec Coding references for AI PR review. Use it with a real ticket, pull request, or release review instead of treating it as background reading.
- Start here when: a pull request contains AI-authored code and the reviewer needs a stable review surface.
- Copy this: the acceptance-criteria-to-diff review table.
- Evidence to attach: tests, screenshots, logs, or manual notes attached to each criterion.
- Pair it with: AI Coding Governance Hub and Acceptance Criteria Hub.
Flagship review path:
- Open this page during planning or review.
- Copy the relevant artifact into the work item.
- Replace example values with your system, owner, and failure mode.
- Block implementation if the evidence line is still blank.
Second-pass reviewer note: criteria are the review surface
I checked that the article keeps returning to the same standard: every AI-generated change should map to an acceptance criterion and one piece of evidence. Without that map, the reviewer is grading style.
PR review row:
- Criterion:
- Files changed to satisfy it:
- Test or evidence:
- Manual check if not automated:
- Out-of-scope changes to remove before merge:
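If the review row is kept as data rather than free text, the "no evidence, no merge" rule can be checked automatically. The sketch below is one possible shape, with field names adapted from the row above. The `ReviewRow` class and its blocking rule are assumptions, not an established format.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewRow:
    criterion: str
    files_changed: list = field(default_factory=list)
    evidence: str = ""          # test name, log, metric, or screenshot reference
    manual_check: str = ""      # used when the criterion is not automated
    out_of_scope: list = field(default_factory=list)

    def blockers(self) -> list:
        """Reasons this row should stop the merge; an empty list means clear."""
        problems = []
        if not self.evidence.strip() and not self.manual_check.strip():
            problems.append(f"{self.criterion}: no test, evidence, or manual check")
        if self.out_of_scope:
            listed = ", ".join(self.out_of_scope)
            problems.append(f"{self.criterion}: out-of-scope changes remain: {listed}")
        return problems


rows = [
    ReviewRow("AC-1", ["src/dates/parse.py"], evidence="test_ac_1_parses_iso_date"),
    ReviewRow("AC-2", ["src/dates/parse.py"], out_of_scope=["src/utils/strings.py"]),
]
for row in rows:
    for problem in row.blockers():
        print(problem)
```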
Editorial Review Note
Reviewed Apr 29, 2026. This update added a reusable artifact, checked the article against the related topic hub, and tightened the next-step links so the page works as a practical reference rather than a standalone essay.
Editorial Note
Last reviewed Apr 29, 2026: examples, internal links, and reusable review blocks were checked for practical specificity.
- Author details: Spec Coding Editorial Team