Designing Idempotent Workflows with Specs
Idempotency is one of those properties that looks like a one-line detail in the spec and then eats a week of debugging when it's wrong. This is how I write the idempotency section so retries, crashes, and double-clicks all behave the same way in production.
Field note: idempotency is a support policy too
Most teams write idempotency as an API rule and forget the support desk. The spec should say what a human can do while a request is pending, especially when money or inventory is involved.
Support constraint:
    Given refund R123 is pending provider confirmation
    When support opens the order admin page
    Then the refund button is disabled
    And the page shows the existing refund id
    And support can add a note but cannot create a second refund
What "idempotent" has to mean in your spec
The textbook definition is "safe to repeat." That's not enough for a spec — it leaves too many choices to the person writing the code. I force the spec to answer three questions before anything else:
- What counts as the same request? A client-supplied key? The hash of a normalized payload? The tuple of (user, resource, timestamp-bucket)?
- For how long is the result cached? 24 hours? Until the entity's state transitions? Forever?
- What does the second call see? The original response body byte-for-byte? A new response with the same effect? An error?
If those three aren't written down, two engineers will implement the same endpoint differently, and the client team will discover it during a PagerDuty alert.
The idempotency key contract
My default pattern is a client-supplied Idempotency-Key header, UUIDv4 recommended, scoped to the authenticated principal. The spec needs to nail down six things about this key:
- Required or optional. For payment-like operations I make it required and reject the request with 400 if absent. For idempotent-by-nature operations (PUT on a full resource), optional is fine.
- Max length and charset. 255 chars, ASCII printable. Reject anything else early.
- Uniqueness scope. Per API key? Per user? Global? I pick "per API credential" by default — it isolates tenants without forcing the client to coordinate.
- Retention window. I write down an exact number. "At least 24 hours" is the smallest I'd go; "7 days" is what I use for money movement.
- Conflict semantics. If the same key arrives with a different request body, return 422 with conflict_with_stored_request. Never silently succeed.
- Concurrent duplicate handling. If two requests with the same key arrive in parallel, one wins; the other gets 409 or blocks on a row lock and then returns the cached response.
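The contract above can be sketched as a small in-memory handler. This is a minimal illustration under stated assumptions, not a production design: the class name `IdempotencyStore`, the `handle` signature, the error codes `missing_idempotency_key` and `invalid_idempotency_key`, and the choice of 400 for a malformed key are all hypothetical. Retention and concurrent duplicates are left out on purpose.

```python
import hashlib
import string
from dataclasses import dataclass

MAX_KEY_LENGTH = 255
# Assumption: "ASCII printable" means letters, digits, punctuation, and space.
ALLOWED_CHARS = set(string.ascii_letters + string.digits + string.punctuation + " ")


@dataclass
class StoredResult:
    body_hash: str
    status: int
    response_body: bytes


class IdempotencyStore:
    """In-memory stand-in for whatever backend the team picks (hypothetical)."""

    def __init__(self):
        self._entries = {}  # (credential_id, key) -> StoredResult

    def handle(self, credential_id, key, body, operation):
        # Required key: reject early when it is absent or malformed.
        if key is None:
            return 400, b'{"error": "missing_idempotency_key"}'
        if not key or len(key) > MAX_KEY_LENGTH or not set(key) <= ALLOWED_CHARS:
            return 400, b'{"error": "invalid_idempotency_key"}'

        body_hash = hashlib.sha256(body).hexdigest()
        scoped_key = (credential_id, key)  # uniqueness scope: per API credential
        stored = self._entries.get(scoped_key)
        if stored is not None:
            if stored.body_hash != body_hash:
                # Same key, different body: never silently succeed.
                return 422, b'{"error": "conflict_with_stored_request"}'
            # Same key, same body: replay the stored response unchanged.
            return stored.status, stored.response_body

        status, response_body = operation(body)
        self._entries[scoped_key] = StoredResult(body_hash, status, response_body)
        return status, response_body
```

The point of the sketch is that every branch maps to one bullet in the contract, so a reviewer can check the spec against the code path by path.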
State-machine transitions, not boolean flags
Most idempotency bugs I've debugged come from thinking of an operation as "done/not done" instead of as a state machine. A payment doesn't flip from false to true — it moves through pending → authorized → captured → settled, or from pending → failed. Each of those transitions needs to be idempotent individually.
Write the state machine in the spec. Name each state, name the allowed transitions, and for each transition name what a retry does:
    pending → authorized : retry returns the same auth_id
    authorized → captured : retry checks existing capture_id, returns it
    authorized → voided : retry is a no-op if already voided
    pending → failed : retry returns the original failure_reason
Reviewers can then ask "what happens if a retry arrives while the state is authorized but the capture worker is mid-flight?" and the answer has to be in the diagram, not in the implementer's head.
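One way to make the transition table executable is a guard function that replays the stored result when a transition has already happened. This is a sketch under assumptions: `Payment`, `transition`, `InvalidTransition`, and the result dictionaries are hypothetical names, and the mid-flight case (a retry arriving while the capture worker is still running) is deliberately not handled here, since that answer belongs in the spec first.

```python
from dataclasses import dataclass, field

# Allowed transitions, taken from the state machine described in the text.
ALLOWED = {
    ("pending", "authorized"),
    ("authorized", "captured"),
    ("authorized", "voided"),
    ("captured", "settled"),
    ("pending", "failed"),
}


class InvalidTransition(Exception):
    pass


@dataclass
class Payment:
    state: str = "pending"
    results: dict = field(default_factory=dict)  # target state -> stored result


def transition(payment, target, perform):
    """Apply a transition once; a retry returns the stored result."""
    if target in payment.results:
        # The transition already happened: replay, do not perform again.
        return payment.results[target]
    if (payment.state, target) not in ALLOWED:
        raise InvalidTransition(f"{payment.state} -> {target}")
    result = perform()
    payment.results[target] = result
    payment.state = target
    return result
```

Keying the stored result by target state is what makes each transition idempotent individually rather than the payment as a whole.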
Deduplication window vs. idempotency guarantee
These two often get conflated and it's worth being explicit in the spec. A deduplication window is a time-bounded cache: "we remember this key for 24 hours." An idempotency guarantee is a correctness property: "within the window, the operation happens exactly once."
Outside the window, the behavior changes. A fresh request with the same key becomes a new operation. The spec has to say what that means for downstream effects — will the user get charged twice if they retry after 25 hours? Sometimes that's acceptable (tokenized card save). Sometimes it's a compliance problem (payment capture). Name it explicitly.
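The difference shows up clearly in a few lines of code. The sketch below assumes a hypothetical `WindowedKeys` class with an injected clock so the window can be exercised without waiting; it illustrates the semantics only and says nothing about storage.

```python
class WindowedKeys:
    """Remembers a key for a fixed retention window (hypothetical sketch)."""

    def __init__(self, retention_seconds, clock):
        self._retention = retention_seconds
        self._clock = clock  # injected so tests can control time
        self._seen = {}  # key -> (stored_at, result)

    def run_once(self, key, operation):
        now = self._clock()
        entry = self._seen.get(key)
        if entry is not None:
            stored_at, result = entry
            if now - stored_at < self._retention:
                # Inside the window: the guarantee holds, replay the result.
                return result, "replayed"
        # First use, or the window has closed: this is a NEW operation.
        result = operation()
        self._seen[key] = (now, result)
        return result, "executed"


# Usage: a retry at 23 hours replays; a retry at 25 hours runs again.
fake_now = [0.0]
keys = WindowedKeys(24 * 3600, lambda: fake_now[0])
charge_log = []

def charge():
    charge_log.append("charged")
    return f"charge_{len(charge_log)}"

keys.run_once("abc", charge)
fake_now[0] = 25 * 3600
keys.run_once("abc", charge)  # executes a second charge
```

The second call at 25 hours is exactly the behavior the spec has to name: acceptable for some operations, a compliance problem for others.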
The tricky cases I force the spec to cover
- Partial failure during side effects. The DB write succeeded, the webhook didn't fire. What does retry do? My default: idempotency layer returns the cached success, a separate outbox retries the webhook.
- Clock skew and out-of-order retries. Request B with key K arrives before Request A with the same K, because of network reordering. The spec must say which one "wins" (first to commit, not first to send).
- Request mutation by middleware. If a proxy adds headers or strips fields, the body hash changes. Specify whether sameness is computed over the raw body, over a canonical form, or from the key string alone.
- Expiration of a cached response. After the window closes, does the key become reusable immediately, or is there a grace period where it returns 410 Gone?
- Cross-region replication lag. If a retry lands in a region that hasn't seen the original yet, does it block, fail, or run the operation again? Name the answer.
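For the middleware-mutation case, here is a sketch of what "canonical form" could mean for a JSON body. It assumes the body is JSON and that sorted keys with compact separators are an acceptable canonicalization; the function name `canonical_body_hash` is hypothetical, and this is not a full canonicalization scheme (for example, `100` and `100.0` still hash differently).

```python
import hashlib
import json


def canonical_body_hash(raw_body: bytes) -> str:
    """Hash a JSON body so key order and whitespace do not affect the result."""
    parsed = json.loads(raw_body)
    canonical = json.dumps(
        parsed,
        sort_keys=True,          # key order no longer matters
        separators=(",", ":"),   # whitespace no longer matters
        ensure_ascii=True,       # stable escaping for non-ASCII text
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Usage: a proxy that reorders keys or reformats whitespace yields the same hash.
original = canonical_body_hash(b'{"amount": 100, "currency": "USD"}')
reordered = canonical_body_hash(b'{"currency":"USD",   "amount":100}')
assert original == reordered
```

If the spec chooses the raw body instead, the same two requests would conflict, which is why the choice needs to be written down rather than discovered.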
Acceptance criteria that catch the real bugs
Generic "it's idempotent" acceptance criteria don't catch anything. These do:
- Given a POST /payments request with Idempotency-Key "abc"
  When the client retries with the same key and identical body
  Then the response is byte-for-byte identical to the original
  And no second charge appears in the payment processor
- Given a POST /payments request with Idempotency-Key "abc"
  When the client retries with the same key but a different amount
  Then the response is 422 with code "idempotency_key_conflict"
  And the body contains the stored request summary
- Given two concurrent POST /payments requests with the same key
  When both reach the server within 100ms
  Then exactly one charge is created
  And both requests receive the same response body
These translate directly into integration tests. If the implementer can't make these pass, the design is wrong — not the test.
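As an illustration of the concurrent-duplicate criterion, the sketch below releases several threads at the same moment against a toy service. `PaymentService` and `run_concurrent_duplicates` are hypothetical stand-ins; a real integration test would hit the actual endpoint, and a single process-wide lock is only an assumption that keeps the example small.

```python
import json
import threading


class PaymentService:
    """Toy service: one lock guards the check-then-charge sequence."""

    def __init__(self):
        self._lock = threading.Lock()
        self._responses = {}  # idempotency key -> response bytes
        self.charges = []     # side effects actually performed

    def create_payment(self, key, amount):
        with self._lock:
            if key in self._responses:
                return self._responses[key]
            charge_id = f"ch_{len(self.charges) + 1}"
            self.charges.append((charge_id, amount))
            response = json.dumps({"charge_id": charge_id, "amount": amount}).encode()
            self._responses[key] = response
            return response


def run_concurrent_duplicates(service, key, amount, workers=2):
    """Start all workers, release them together, and collect the responses."""
    barrier = threading.Barrier(workers)
    results = [None] * workers

    def worker(index):
        barrier.wait()  # line the threads up so they race for the lock
        results[index] = service.create_payment(key, amount)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


service = PaymentService()
responses = run_concurrent_duplicates(service, "abc", 100, workers=8)
assert len(service.charges) == 1   # exactly one charge is created
assert len(set(responses)) == 1    # every request sees the same body
```

Removing the lock from `create_payment` makes the check-then-charge sequence racy, which is the failure mode this criterion exists to catch.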
Observability the spec needs to require
Idempotency bugs are invisible in normal logs. You charged the customer twice and the request logs show two successful 200s. The spec has to require telemetry that makes duplicates visible:
- A metric for "idempotent replay served from cache" vs. "new operation."
- A counter for idempotency-key conflicts (same key, different body).
- Structured logs that include the key on every side effect, so you can grep for double-writes.
- An alert on unusual replay rates per client — it usually means their retry logic is broken.
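A minimal sketch of the telemetry above, using only the standard library. The class `IdempotencyMetrics`, the outcome names, and the log field names are assumptions; a real system would emit to its own metrics backend, and the alert threshold on the replay rate is left to the team.

```python
import logging
from collections import Counter

logger = logging.getLogger("idempotency")

OUTCOMES = {"new", "replay", "conflict"}


class IdempotencyMetrics:
    """Counts outcomes per client and logs the key on every decision."""

    def __init__(self):
        self.counts = Counter()  # (client_id, outcome) -> count

    def record(self, outcome, key, client_id):
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        self.counts[(client_id, outcome)] += 1
        # Structured fields so double-writes can be found by key.
        logger.info(
            "idempotency outcome",
            extra={"idempotency_key": key, "client_id": client_id, "outcome": outcome},
        )

    def replay_rate(self, client_id):
        """Share of non-conflicting requests served from cache for one client."""
        replays = self.counts[(client_id, "replay")]
        total = replays + self.counts[(client_id, "new")]
        return replays / total if total else 0.0
```

An alert rule can then compare `replay_rate` per client against a baseline; a sudden jump is the signal that a client's retry logic has gone wrong.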
What I cut from the spec
I don't specify the storage backend. "Redis with a 24-hour TTL" is an implementation note, not a contract. The spec says "the system MUST return the original response for at least 24 hours" — whether that's Redis, a DB table, or a distributed cache is the team's choice and can change without a contract revision.
I also don't specify the hashing algorithm for body comparison unless the hash itself is exposed to the client. If it's purely internal, the team can pick SHA-256 or xxhash as they please.
One-sentence test for the section
Before I sign off on an idempotency section, I read it and ask: could a new engineer on another team implement this without asking me a single question? If the answer is no, the section isn't done yet.
Production Evidence to Require
Do not approve an idempotency design until the spec says what production evidence proves it is working. For a payment endpoint, I want a dashboard that shows replay count, conflict count, duplicate side-effect count, cache retention age, and downstream outbox retry depth. A release can pass tests and still be unsafe if nobody can see duplicate attempts after launch.
Idempotency review table to attach to the spec
I use this table when a workflow can be retried by a browser, job runner, webhook provider, support agent, or AI tool. If any row is blank, the implementation is not ready for review.
| Retry source | Same key behavior | New key behavior | Evidence |
|---|---|---|---|
| User double-clicks submit | Return original result and do not create a second record. | Reject if the first operation is still pending. | UI test plus audit log query. |
| Webhook provider replays event | Store event_id and skip already-applied side effects. | Treat as a new event only when provider id differs. | Replay test with duplicate delivery. |
| Support retries manually | Show existing operation state before allowing action. | Require manager override for irreversible actions. | Manual runbook screenshot and permission test. |
Second-pass reviewer note: idempotency must cover retries and humans
This review adds a missing operational angle. Duplicate prevention is not complete until automated retries and manual actions follow the same state model.
Idempotency review:
- What key defines sameness?
- How long is the key retained?
- What does a replay return?
- What can support do while the original request is unresolved?
Editorial Note
- Author details: Spec Coding Editorial Team