Structured Prompts for Acceptance Criteria Generation

Spec Coding Editorial Team · Spec-first engineering notes

Prompt structures that generate usable acceptance criteria: the five-slot template, failure-mode seeding, and the review loop that stops hallucinated scenarios from reaching QA.

Published on 2026-03-10 · Updated 2026-05-06 · 7 min read · Author: Spec Coding Editorial Team · Review policy: Editorial Policy

The Prompt I See Most Often Is the One That Fails

Nine times out of ten, the prompt I see an engineer paste into an assistant is "Given this feature spec, generate 5 bullet points of acceptance criteria." The output is confident, well-formatted, and almost entirely useless. You get bullets like "system handles errors gracefully" and "UI is responsive," which overlap and cannot be translated into a test. I want to walk through the prompt shape that actually produces testable criteria, the failure seeding that stops the happy-path default, and the review loop that catches hallucinated scenarios before QA wastes a week on them.

The Five-Slot Template Is the Whole Game

Every criterion worth shipping fills five slots: actor (who initiates), preconditions (state before the trigger), trigger (the specific event), expected outcome (the observable result), and observable side-effect (what changed elsewhere that proves the outcome happened). I force the prompt to produce all five per criterion. The side-effect slot catches laziness: if the model cannot name an audit log, a database row, an emitted event, or a changed metric, the behavior is not verifiable and the criterion is not ready.
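One way to make the five slots enforceable rather than aspirational is to give them a type and reject any criterion with an empty slot. The sketch below is a minimal illustration, not a prescribed implementation: the `Criterion` class and the `missing_slots` helper are hypothetical names introduced here, and "filled" is assumed to mean "non-blank string."

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Criterion:
    """One acceptance criterion with the five slots from the template."""
    actor: str
    preconditions: str
    trigger: str
    expected_outcome: str
    side_effect: str


def missing_slots(criterion: Criterion) -> list[str]:
    """Return the names of slots that are empty or whitespace-only."""
    return [
        f.name for f in fields(criterion)
        if not getattr(criterion, f.name).strip()
    ]


# A criterion taken from the password-reset example: all five slots filled.
reset = Criterion(
    actor="unauthenticated user",
    preconditions="email matches an active account",
    trigger="submits the reset form",
    expected_outcome="exactly one reset email is sent within 30 seconds",
    side_effect="one row written to password_reset_requests",
)

# A free-form bullet forced into the same shape: two slots stay empty.
vague = Criterion(
    actor="user",
    preconditions="",
    trigger="uses the feature",
    expected_outcome="system handles errors gracefully",
    side_effect="",
)

print(missing_slots(reset))  # []
print(missing_slots(vague))  # ['preconditions', 'side_effect']
```

A blank check like this only proves the slot was filled, not that the content is verifiable; the human review pass still has to judge that.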

Failure-Mode Seeding Beats the Happy Path

Left alone, LLMs default to the sunny route, and the free-form bullet prompt lets them hide behind fluency. I append a required section to every criteria-gen prompt: "List three ways this feature breaks in production. For each, generate the criterion that would have caught it." In my experience, this single instruction roughly doubles the criteria count and shifts the distribution toward the cases QA actually cares about. I pair it with boundary seeding: every prompt specifies null, empty, max, min, duplicate, and out-of-order as default input shapes the model must either consider or justify skipping.

I also force each criterion through an example pair: "Given input X, outcome Y. Given input not-X, outcome Y-prime." This flushes out criteria that only work in one direction. "A user with a valid token can reset their password" is half a criterion; the other half is "a user without a valid token cannot, and the system responds with a specific non-leaking error."
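The three seeding instructions can be appended mechanically, so nobody has to remember them. The following is a rough sketch under stated assumptions: `build_criteria_prompt` and the constant names are hypothetical, and the wording of the slot and boundary instructions is my own paraphrase. Only the failure-mode and example-pair sentences are quoted from the text above.

```python
# Default input shapes the model must consider or justify skipping.
BOUNDARY_SHAPES = ("null", "empty", "max", "min", "duplicate", "out-of-order")

FAILURE_SEED = (
    "List three ways this feature breaks in production. "
    "For each, generate the criterion that would have caught it."
)

EXAMPLE_PAIR = (
    "For every criterion, give an example pair: "
    "Given input X, outcome Y. Given input not-X, outcome Y-prime."
)


def build_criteria_prompt(spec: str, boundary_shapes=BOUNDARY_SHAPES) -> str:
    """Assemble a criteria-gen prompt with the seeding sections appended."""
    boundary_line = (
        "For each input, consider these shapes or justify skipping them: "
        + ", ".join(boundary_shapes)
        + "."
    )
    sections = [
        "Feature spec:\n" + spec.strip(),
        "Write each criterion with five slots: actor, preconditions, "
        "trigger, expected outcome, observable side-effect.",
        FAILURE_SEED,
        boundary_line,
        EXAMPLE_PAIR,
    ]
    return "\n\n".join(sections)


prompt = build_criteria_prompt(
    "Users who forget their password can request a reset link via email. "
    "Link expires after 15 minutes. Used links cannot be reused."
)
print(prompt)
```

Keeping the seeds as named constants means a change to the failure-mode wording is a one-line diff that can be reviewed like any other code change.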

A Concrete Before/After: Password Reset

Here is the spec: "Users who forget their password can request a reset link via email. Link expires after 15 minutes. Used links cannot be reused."

First attempt, with the naive "generate 5 bullet points" prompt:

Every bullet is a restatement of the spec. None are testable as written. "Gracefully" is meaningless. There is no actor, no precondition, no side-effect. Now the same spec with the five-slot template plus failure-mode seeding:

Given an unauthenticated user whose email matches an active account
When they submit the reset form
Then the system sends exactly one reset email within 30 seconds
And writes one row to password_reset_requests with a 15-minute expiry

Given a reset link issued 16 minutes ago
When the user clicks it
Then the system shows "link expired" and increments expired_link_attempts
And does not reveal whether the account exists

Given a reset link already consumed once
When the user clicks it a second time
Then the system rejects the request and logs a reuse_attempt event
And the original password remains unchanged

Given an email that does not match any account
When a reset is requested
Then the system returns the same success message as the valid case
And writes nothing to password_reset_requests

The second version is testable 1:1. Each block has an actor, a precondition, a trigger, an outcome, and a side-effect. The fourth block only appeared because failure-mode seeding forced the question "how does this leak account existence?" The naive prompt never surfaced it.
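To show what "testable 1:1" can look like, here is the expired-link block translated into a test against an in-memory stand-in. Everything in it is an assumption made for illustration: `FakeResetService`, its methods, and the token format are hypothetical and do not describe any real reset implementation. The point is only that each Given, When, and Then line maps to one statement.

```python
from datetime import datetime, timedelta

LINK_TTL = timedelta(minutes=15)  # from the spec: link expires after 15 minutes


class FakeResetService:
    """In-memory stand-in for a reset flow (hypothetical, for illustration)."""

    def __init__(self):
        self.issued_at = {}  # token -> time the link was issued
        self.expired_link_attempts = 0

    def issue(self, token, now):
        self.issued_at[token] = now

    def click(self, token, now):
        issued = self.issued_at.get(token)
        if issued is None:
            # Unknown tokens get the same message, so nothing leaks.
            return "link expired"
        if now - issued > LINK_TTL:
            self.expired_link_attempts += 1
            return "link expired"
        return "show reset form"


def test_expired_link_is_rejected_and_counted():
    start = datetime(2026, 1, 1, 12, 0)
    service = FakeResetService()
    # Given a reset link issued 16 minutes ago
    service.issue("tok-1", now=start)
    # When the user clicks it
    message = service.click("tok-1", now=start + timedelta(minutes=16))
    # Then the system shows "link expired" and increments the counter
    assert message == "link expired"
    assert service.expired_link_attempts == 1


def test_fresh_link_is_accepted():
    start = datetime(2026, 1, 1, 12, 0)
    service = FakeResetService()
    service.issue("tok-2", now=start)
    message = service.click("tok-2", now=start + timedelta(minutes=14))
    assert message == "show reset form"
    assert service.expired_link_attempts == 0
```

The second test is the "not-X" half of the example pair: without it, a service that rejects every link would pass.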

The Review Loop That Catches Hallucinations

AI drafts criteria. A human scores each one testable or not-testable in under ten seconds. For every not-testable, the AI rewrites with: "rewrite this so a tester can pass or fail it without asking a question." Three iterations is the ceiling; if a criterion cannot survive three rewrites, the spec is broken and you stop rewriting criteria and go fix the spec. The other half of the loop is hallucination detection: map each criterion to a specific clause in the source spec. If nothing matches, reject it. Models happily invent rate limits, retention policies, and SLAs that were never in the document. Trace-to-source is the only defense.
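The loop is simple enough to sketch in a few lines. This is an illustrative outline, not a finished tool: `Draft`, `run_review_loop`, and the two callbacks are hypothetical names, the human scoring step is represented by an `is_testable` callback, the model call by a `rewrite` callback, and trace-to-source is assumed to mean that each draft names the spec clause it came from.

```python
from typing import Callable, NamedTuple, Optional

MAX_REWRITES = 3
REWRITE_INSTRUCTION = (
    "rewrite this so a tester can pass or fail it without asking a question."
)


class Draft(NamedTuple):
    text: str
    source_clause: Optional[str]  # spec clause this criterion claims to trace to


def run_review_loop(
    drafts: list[Draft],
    spec_clauses: set[str],
    is_testable: Callable[[str], bool],
    rewrite: Callable[[str, str], str],
):
    """Sort drafts into accepted, hallucinated, and fix-the-spec buckets."""
    accepted, hallucinated, fix_spec = [], [], []
    for draft in drafts:
        # Trace-to-source: no matching clause means the criterion is rejected.
        if draft.source_clause not in spec_clauses:
            hallucinated.append(draft)
            continue
        text, ok = draft.text, is_testable(draft.text)
        for _ in range(MAX_REWRITES):
            if ok:
                break
            text = rewrite(text, REWRITE_INSTRUCTION)
            ok = is_testable(text)
        if ok:
            accepted.append(Draft(text, draft.source_clause))
        else:
            # Three rewrites failed: stop rewriting and go fix the spec.
            fix_spec.append(draft)
    return accepted, hallucinated, fix_spec


# Stub callbacks for a dry run; a real setup would ask a person and a model.
spec_clauses = {"expires after 15 minutes", "used links cannot be reused"}
drafts = [
    Draft("A link clicked 16 minutes after issue shows 'link expired'",
          "expires after 15 minutes"),
    Draft("Reset requests are limited to 5 per hour", None),  # invented limit
    Draft("Reuse is handled gracefully", "used links cannot be reused"),
]
accepted, hallucinated, fix_spec = run_review_loop(
    drafts,
    spec_clauses,
    is_testable=lambda text: "gracefully" not in text,
    rewrite=lambda text, instruction: text,  # stub that never improves the text
)
print(len(accepted), len(hallucinated), len(fix_spec))  # 1 1 1
```

The trace check runs before any rewriting on purpose: there is no reason to spend rewrite iterations polishing a criterion the spec never asked for.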

Storing Prompts Like Code and Knowing What to Never Ask

I keep a versioned prompt library, one file per feature type: CRUD, workflow, integration, auth, migration, batch job. Each prompt declares the slot template, the failure-mode seeds specific to that type, and the boundary shapes that matter. Integration prompts include "network partition, timeout, partial success" by default. CRUD prompts include "concurrent write, soft delete, orphan reference." When a prompt produces a criterion that slips to production, the prompt gets updated, not just the criterion.

A few things never go in any prompt. Do not ask the model to estimate business value, impact, or priority: it will make up numbers that are confidently wrong, which is the kind of error that is hardest to catch. Do not ask for criteria and implementation in the same prompt; the model will optimize the criteria to match the easier implementation.
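A library like this can start as plain data plus a small guard. The sketch below is a simplified stand-in that uses one in-memory dictionary instead of one file per feature type. The structure, the `version` field, and the function names are assumptions for illustration; only the integration and CRUD seeds come from the text, and the other feature types are left out rather than guessed. The forbidden-ask check is a crude keyword match, not a real classifier.

```python
# Hypothetical in-memory stand-in for a one-file-per-type prompt library.
PROMPT_LIBRARY = {
    "integration": {
        "version": 1,
        "failure_seeds": ["network partition", "timeout", "partial success"],
    },
    "crud": {
        "version": 1,
        "failure_seeds": ["concurrent write", "soft delete", "orphan reference"],
    },
}

# Keywords for requests that should never appear in a criteria-gen prompt.
FORBIDDEN_ASKS = ("business value", "impact", "priority", "implementation")


def failure_seeds_for(feature_type: str) -> list[str]:
    """Look up the default failure-mode seeds for a feature type."""
    entry = PROMPT_LIBRARY.get(feature_type.lower())
    if entry is None:
        raise KeyError(f"no prompt entry for feature type: {feature_type!r}")
    return list(entry["failure_seeds"])


def forbidden_asks_in(prompt: str) -> list[str]:
    """Return forbidden keywords found in a prompt (simple substring match)."""
    lowered = prompt.lower()
    return [word for word in FORBIDDEN_ASKS if word in lowered]


print(failure_seeds_for("CRUD"))
print(forbidden_asks_in("Generate criteria and estimate priority for each."))
```

Bumping `version` whenever a seed list changes gives a way to tell which prompt revision produced a given set of criteria.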

When the Spec Fights the Criteria

The most useful signal from this process is when generated criteria contradict the spec. If the AI, constrained to the five slots and mapped back to source, produces a criterion the spec author disagrees with, the spec is underspecified. The contradiction is the artifact. I bring it back with one question: "which side is correct?" Nine times out of ten, the answer reveals a decision nobody had made yet.

The Meta Criterion

If the template works, it should work on itself:

Given a feature spec and the five-slot criteria-gen prompt
When an engineer runs the prompt and applies the review loop
Then every output criterion has a filled actor, preconditions, trigger, outcome, and side-effect
And every criterion maps to a specific clause in the source spec
And any criterion failing either check is rejected or sent back for rewrite

That is the bar. Anything below it is theater dressed up as process. Ship the template, ship the seeding, run the loop, and your QA team stops chasing hallucinations that were never in the spec to begin with.

Review drill

Use structured prompts to draft acceptance criteria, then review the output like any other spec artifact. The prompt should force context and evidence into the answer instead of asking for generic scenarios.

Keep the final criteria in the spec, not in the chat transcript. The prompt is a drafting aid; the reviewed criteria are the source of truth.

AI Review Packet to Copy

Use this before an AI-generated diff reaches code review. It turns the prompt, the allowed scope, and the required proof into one reviewable artifact.

AI coding review packet: Structured Prompts for Acceptance Criteria Generation

Decision to make:
- Whether to adopt the five-slot template, failure-mode seeding, and the review loop for generating acceptance criteria.

Owner check:
- Product owner:
- Engineering owner:
- QA or operations reviewer:

Scope boundary:
- In scope:
- Out of scope:
- Assumption that still needs approval:

Acceptance evidence:
- Test or fixture:
- Log, metric, or screenshot:
- Manual review step:

AI boundary: generated changes must stay inside the written scope and attach evidence for each acceptance criterion.

Reviewer prompt:
- What would still be ambiguous to someone who missed the planning meeting?
- What evidence would make this safe enough to ship?

Editorial Review Note

Reviewed Apr 28, 2026. This update added a reusable artifact, checked the article against the related topic hub, and tightened the next-step links so the page works as a practical reference rather than a standalone essay.

Keywords: acceptance criteria generation · LLM prompts · five-slot template · failure-mode seeding · Given-When-Then

