Structured Prompts for Acceptance Criteria Generation
Prompt structures that generate usable acceptance criteria: the five-slot template, failure-mode seeding, and the review loop that stops hallucinated scenarios from reaching QA.
The Prompt I See Most Often Is the One That Fails
Nine times out of ten, the prompt I see an engineer paste into an assistant is "Given this feature spec, generate 5 bullet points of acceptance criteria." The output is confident, well-formatted, and almost entirely useless. You get bullets like "system handles errors gracefully" and "UI is responsive," which overlap and cannot be translated into a test. I want to walk through the prompt shape that actually produces testable criteria, the failure seeding that stops the happy-path default, and the review loop that catches hallucinated scenarios before QA wastes a week on them.
The Five-Slot Template Is the Whole Game
Every criterion worth shipping fills five slots: actor (who initiates), preconditions (state before the trigger), trigger (the specific event), expected outcome (the observable result), and observable side-effect (what changed elsewhere that proves the outcome happened). I force the prompt to produce all five per criterion. The side-effect slot catches laziness: if the model cannot name an audit log, a database row, an emitted event, or a changed metric, the behavior is not verifiable and the criterion is not ready.
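One way to make the five slots enforceable rather than aspirational is to parse each generated criterion into a record and refuse any record with an empty slot. The sketch below is a minimal illustration under that assumption; the `Criterion` class and `missing_slots` helper are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Criterion:
    """One acceptance criterion, split into the five slots (hypothetical shape)."""
    actor: str
    preconditions: str
    trigger: str
    expected_outcome: str
    side_effect: str


def missing_slots(criterion: Criterion) -> list[str]:
    """Return the names of slots that are empty or whitespace-only."""
    return [
        f.name
        for f in fields(criterion)
        if not getattr(criterion, f.name).strip()
    ]


# A draft where the model could not name anything observable elsewhere.
draft = Criterion(
    actor="unauthenticated user",
    preconditions="email matches an active account",
    trigger="submits the reset form",
    expected_outcome="one reset email is sent",
    side_effect="",
)

print(missing_slots(draft))  # prints ['side_effect']
```

A non-empty result is the signal that the criterion is not ready: here the missing side-effect means nothing outside the response proves the email was actually queued.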
Failure-Mode Seeding Beats the Happy Path
Left alone, LLMs default to the sunny route, and the free-form bullet prompt lets them hide behind fluency. I append a required section to every criteria-gen prompt: "List three ways this feature breaks in production. For each, generate the criterion that would have caught it." In my experience, this single instruction roughly doubles the criteria count and shifts the distribution toward the cases QA actually cares about. I pair it with boundary seeding: every prompt lists null, empty, max, min, duplicate, and out-of-order as default input shapes the model must either consider or justify skipping. I also force each criterion through an example pair: "Given input X, outcome Y. Given input not-X, outcome Y-prime." This flushes out criteria that only work in one direction. "A user with a valid token can reset their password" is half a criterion; the other half is "a user without a valid token cannot, and the system responds with a specific non-leaking error."
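The three seeding instructions above can be assembled mechanically so that no prompt goes out without them. The following is a rough sketch, not a canonical prompt: the failure-mode and example-pair sentences are quoted from the text, while `build_prompt` and the wording of the other sections are assumptions made for illustration.

```python
# Boundary shapes the model must consider or justify skipping.
BOUNDARY_SHAPES = ("null", "empty", "max", "min", "duplicate", "out-of-order")

FAILURE_SEED = (
    "List three ways this feature breaks in production. "
    "For each, generate the criterion that would have caught it."
)

EXAMPLE_PAIR = (
    "For each criterion, give an example pair: "
    "Given input X, outcome Y. Given input not-X, outcome Y-prime."
)


def build_prompt(spec: str) -> str:
    """Assemble a criteria-gen prompt with all seeding sections attached."""
    shapes = ", ".join(BOUNDARY_SHAPES)
    sections = [
        "Feature spec:\n" + spec.strip(),
        # Section wording below is illustrative, not a tested prompt.
        "Write acceptance criteria. Each criterion must fill five slots: "
        "actor, preconditions, trigger, expected outcome, "
        "observable side-effect.",
        FAILURE_SEED,
        f"Consider these input shapes, or justify skipping each: {shapes}.",
        EXAMPLE_PAIR,
    ]
    return "\n\n".join(sections)


if __name__ == "__main__":
    print(build_prompt("Users can request a password reset link via email."))
```

Keeping the seeds as named constants means a change to the failure-mode wording shows up as a one-line diff instead of being buried in a pasted prompt.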
A Concrete Before/After: Password Reset
Here is the spec: "Users who forget their password can request a reset link via email. Link expires after 15 minutes. Used links cannot be reused."
First attempt, with the naive "generate 5 bullet points" prompt:
- User can request a password reset.
- Email is sent with a reset link.
- Link expires after 15 minutes.
- Used links cannot be reused.
- System handles errors gracefully.
Every bullet is a restatement of the spec. None are testable as written. "Gracefully" is meaningless. There is no actor, no precondition, no side-effect. Now the same spec with the five-slot template plus failure-mode seeding:
Given an unauthenticated user whose email matches an active account
When they submit the reset form
Then the system sends exactly one reset email within 30 seconds
And writes one row to password_reset_requests with a 15-minute expiry

Given a reset link issued 16 minutes ago
When the user clicks it
Then the system shows "link expired" and increments expired_link_attempts
And does not reveal whether the account exists

Given a reset link already consumed once
When the user clicks it a second time
Then the system rejects the request and logs a reuse_attempt event
And the original password remains unchanged

Given an email that does not match any account
When a reset is requested
Then the system returns the same success message as the valid case
And writes nothing to password_reset_requests
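The second version maps one-to-one onto tests. Each block has an actor, a precondition, a trigger, an outcome, and a side-effect. The fourth block only appeared because failure-mode seeding forced the question "how does this leak account existence?" The naive prompt never surfaced it. One caveat: the 30-second delivery window in the first block is not in the spec, so under trace-to-source review it has to be confirmed with the spec author or dropped.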
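To show what "testable" means in practice, here is a sketch of the expired-link criterion written as an executable check. `FakeResetService` is a hypothetical in-memory stand-in for the system under test; a real suite would call the actual service and read the real counter. Only the 15-minute expiry, the "link expired" message, and the `expired_link_attempts` name come from the text above.

```python
from datetime import datetime, timedelta

LINK_TTL = timedelta(minutes=15)  # expiry stated in the spec


class FakeResetService:
    """Hypothetical in-memory stand-in for the real reset service."""

    def __init__(self):
        self.issued_at = {}  # token -> time the link was issued
        self.expired_link_attempts = 0

    def issue(self, token, now):
        self.issued_at[token] = now

    def open_link(self, token, now):
        issued = self.issued_at.get(token)
        # Unknown tokens get the same response as expired ones.
        if issued is None or now - issued > LINK_TTL:
            self.expired_link_attempts += 1
            return "link expired"
        return "reset form"


def test_expired_link():
    service = FakeResetService()
    issued = datetime(2026, 1, 1, 12, 0)
    service.issue("tok-1", issued)

    # Given a reset link issued 16 minutes ago, when the user clicks it
    response = service.open_link("tok-1", issued + timedelta(minutes=16))

    # Then the system shows "link expired"
    assert response == "link expired"
    # And increments expired_link_attempts (the observable side-effect)
    assert service.expired_link_attempts == 1
```

Each Given/When/Then line becomes one setup step or one assertion, which is the property the naive bullets lack. The function can be collected by a test runner such as pytest or called directly.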
The Review Loop That Catches Hallucinations
AI drafts criteria. A human scores each one testable or not-testable in under ten seconds. For every not-testable, the AI rewrites with: "rewrite this so a tester can pass or fail it without asking a question." Three iterations is the ceiling; if a criterion cannot survive three rewrites, the spec is broken and you stop rewriting criteria and go fix the spec. The other half of the loop is hallucination detection: map each criterion to a specific clause in the source spec. If nothing matches, reject it. Models happily invent rate limits, retention policies, and SLAs that were never in the document. Trace-to-source is the only defense.
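The loop above has two exits besides acceptance: the criterion traces to nothing in the spec, or it fails to become testable within three rewrites. A minimal sketch of that control flow follows. The `review` function and its verdict strings are hypothetical; the human judgment, the model rewrite, and the clause lookup are passed in as callables because none of them can be automated away here.

```python
from typing import Callable, Optional

MAX_REWRITES = 3  # the ceiling from the review loop


def review(
    criterion: str,
    is_testable: Callable[[str], bool],              # the human's quick call
    rewrite: Callable[[str], str],                   # the model's rewrite
    source_clause: Callable[[str], Optional[str]],   # spec clause, or None
) -> tuple[str, str]:
    """Return (verdict, final_text).

    verdict is 'accept', 'reject-untraceable', or 'fix-spec'.
    """
    current = criterion
    for attempt in range(MAX_REWRITES + 1):
        # Re-check traceability every round: a rewrite can invent detail.
        if source_clause(current) is None:
            return ("reject-untraceable", current)
        if is_testable(current):
            return ("accept", current)
        if attempt < MAX_REWRITES:
            current = rewrite(current)
    # Three rewrites did not help: stop and go fix the spec.
    return ("fix-spec", current)


if __name__ == "__main__":
    verdict, text = review(
        "System handles errors gracefully",
        is_testable=lambda t: "gracefully" not in t,
        rewrite=lambda t: t.replace("gracefully", "with a 400 response"),
        source_clause=lambda t: "clause 3",
    )
    print(verdict, "|", text)
```

Checking traceability inside the loop rather than once up front is a design choice in this sketch: it assumes a rewrite is just as capable of inventing an SLA as the first draft was.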
Storing Prompts Like Code and Knowing What to Never Ask
I keep a versioned prompt library, one file per feature type: CRUD, workflow, integration, auth, migration, batch job. Each prompt declares the slot template, the failure-mode seeds specific to that type, and the boundary shapes that matter. Integration prompts include "network partition, timeout, partial success" by default. CRUD prompts include "concurrent write, soft delete, orphan reference." When a prompt produces a criterion that slips to production, the prompt gets updated, not just the criterion. A few things never go in any prompt. Do not ask the model to estimate business value, impact, or priority; it will make up numbers that are confidently wrong, which is the kind of wrong that is hardest to catch. Do not ask for criteria and implementation in the same prompt; the model will optimize criteria to match the easier implementation.
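As a rough illustration of "the prompt gets updated, not just the criterion," the sketch below models two library entries as an in-memory dictionary and bumps an entry's version whenever a production escape adds a new failure seed. The real library described above is one file per feature type; the dictionary layout, the `version` field, and `add_seed` are assumptions made to keep the example self-contained.

```python
import copy

# Seeds for two feature types, taken from the defaults named in the text.
PROMPT_LIBRARY = {
    "integration": {
        "version": 1,
        "failure_seeds": ["network partition", "timeout", "partial success"],
    },
    "crud": {
        "version": 1,
        "failure_seeds": ["concurrent write", "soft delete", "orphan reference"],
    },
}


def add_seed(library: dict, feature_type: str, seed: str) -> dict:
    """Record a production escape as a new failure seed on the prompt entry.

    The version is bumped only when the seed is actually new.
    """
    entry = library[feature_type]
    if seed not in entry["failure_seeds"]:
        entry["failure_seeds"].append(seed)
        entry["version"] += 1
    return entry


if __name__ == "__main__":
    # Work on a copy so the shared defaults stay untouched in this demo.
    library = copy.deepcopy(PROMPT_LIBRARY)
    # "stale cache read" is a made-up escape used only for illustration.
    print(add_seed(library, "crud", "stale cache read"))
```

With files under version control, the same idea reduces to a commit that edits the prompt file and references the escaped defect.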
When the Spec Fights the Criteria
The most useful signal from this process is when generated criteria contradict the spec. If the AI, constrained to the five slots and mapped back to source, produces a criterion the spec author disagrees with, the spec is underspecified. The contradiction is the artifact. I bring it back with one question: "which side is correct?" Nine times out of ten, the answer reveals a decision nobody had made yet.
The Meta Criterion
If the template works, it should work on itself:
Given a feature spec and the five-slot criteria-gen prompt
When an engineer runs the prompt and applies the review loop
Then every output criterion has a filled actor, preconditions, trigger, outcome, and side-effect
And every criterion maps to a specific clause in the source spec
And any criterion failing either check is rejected or sent back for rewrite
That is the bar. Anything below it is theater dressed up as process. Ship the template, ship the seeding, run the loop, and your QA team stops chasing hallucinations that were never in the spec to begin with.
Review drill
Use structured prompts to draft acceptance criteria, then review the output like any other spec artifact. The prompt should force context and evidence into the answer instead of asking for generic scenarios.
- Context: include role, workflow, state, constraints, and known non-goals before asking for criteria.
- Shape: ask for Given/When/Then, API examples, error cases, or release checks depending on the artifact being tested.
- Review: delete duplicated criteria, merge overlapping cases, and add missing edge cases from the actual system.
Keep the final criteria in the spec, not in the chat transcript. The prompt is a drafting aid; the reviewed criteria are the source of truth.
AI Review Packet to Copy
Use this before an AI-generated diff reaches code review. It turns the prompt, the allowed scope, and the required proof into one reviewable artifact.
AI coding review packet: Structured Prompts for Acceptance Criteria Generation

Decision to make:
- Prompt structures that generate usable acceptance criteria: the five-slot template, failure-mode seeding, and the review loop that stops hallucinated scenarios from reaching QA.

Owner check:
- Product owner:
- Engineering owner:
- QA or operations reviewer:

Scope boundary:
- In scope:
- Out of scope:
- Assumption that still needs approval:

Acceptance evidence:
- Test or fixture:
- Log, metric, or screenshot:
- Manual review step:

AI boundary: generated changes must stay inside the written scope and attach evidence for each acceptance criterion.

Reviewer prompt:
- What would still be ambiguous to someone who missed the planning meeting?
- What evidence would make this safe enough to ship?
Editorial Review Note
Reviewed Apr 28, 2026. This update added a reusable artifact, checked the article against the related topic hub, and tightened the next-step links so the page works as a practical reference rather than a standalone essay.
- Author details: Spec Coding Editorial Team