Designing Idempotent Workflows with Specs

Spec Coding Editorial Team · Spec-first engineering notes

Idempotency is one of those properties that looks like a one-line detail in the spec and then eats a week of debugging when it's wrong. This is how I write the idempotency section so retries, crashes, and double-clicks all behave the same way in production.

Published on 2026-03-01 · Updated 2026-05-11 · 7 min read · Author: Spec Coding Editorial Team · Review policy: Editorial Policy

Field note: idempotency is a support policy too

Most teams write idempotency as an API rule and forget the support desk. The spec should say what a human can do while a request is pending, especially when money or inventory is involved.

Support constraint:
Given refund R123 is pending provider confirmation
When support opens the order admin page
Then the refund button is disabled
And the page shows the existing refund id
And support can add a note but cannot create a second refund
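The support constraint above can be expressed as a small decision function that the admin page calls before rendering. This is a minimal sketch, not a prescribed design: the `Refund` and `RefundPanel` types, the status names, and `build_refund_panel` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Refund:
    refund_id: str
    status: str  # assumed status names: "pending", "confirmed", "failed"


@dataclass
class RefundPanel:
    refund_button_enabled: bool
    existing_refund_id: Optional[str]
    can_add_note: bool


def build_refund_panel(existing: Optional[Refund]) -> RefundPanel:
    """Decide what support sees on the order admin page."""
    if existing is not None and existing.status == "pending":
        # Pending provider confirmation: show the refund, block a second one.
        return RefundPanel(
            refund_button_enabled=False,
            existing_refund_id=existing.refund_id,
            can_add_note=True,
        )
    # Assumption: states other than "pending" are governed by other rules
    # in the spec; this sketch simply leaves the button enabled.
    return RefundPanel(
        refund_button_enabled=True,
        existing_refund_id=existing.refund_id if existing else None,
        can_add_note=True,
    )


panel = build_refund_panel(Refund("R123", "pending"))
```

Keeping the rule in one function means the UI and any server-side check on the refund endpoint can share it, so a disabled button is never the only thing preventing a second refund.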

What "idempotent" has to mean in your spec

The textbook definition is "safe to repeat." That's not enough for a spec — it leaves too many choices to the person writing the code. I force the spec to answer three questions before anything else:

If those three aren't written down, two engineers will implement the same endpoint differently, and the client team will discover it during a PagerDuty alert.

The idempotency key contract

My default pattern is a client-supplied Idempotency-Key header, UUIDv4 recommended, scoped to the authenticated principal. The spec needs to nail down six things about this key:

State-machine transitions, not boolean flags

Most idempotency bugs I've debugged come from thinking of an operation as "done/not done" instead of as a state machine. A payment doesn't flip from false to true — it moves through pending → authorized → captured → settled, or from pending → failed. Each of those transitions needs to be idempotent individually.

Write the state machine in the spec. Name each state, name the allowed transitions, and for each transition name what a retry does:

pending → authorized  : retry returns the same auth_id
authorized → captured : retry checks existing capture_id, returns it
authorized → voided   : retry is a no-op if already voided
pending → failed      : retry returns the original failure_reason

Reviewers can then ask "what happens if a retry arrives while the state is authorized but the capture worker is mid-flight?" and the answer has to be in the diagram, not in the implementer's head.
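One way to make the transition list executable is a table of allowed transitions plus a record of the result produced when each state was entered. The sketch below is single-threaded and illustrative only; `Payment`, `transition`, and `IllegalTransition` are hypothetical names, and it does not answer the mid-flight worker question, which needs locking or a compare-and-set in the real store.

```python
from dataclasses import dataclass, field
from typing import Dict

# Allowed transitions, copied from the list above.
ALLOWED = {
    ("pending", "authorized"),
    ("authorized", "captured"),
    ("authorized", "voided"),
    ("pending", "failed"),
}


class IllegalTransition(Exception):
    pass


@dataclass
class Payment:
    state: str = "pending"
    # Result recorded when each state was entered (auth_id, capture_id, ...).
    results: Dict[str, str] = field(default_factory=dict)


def transition(payment: Payment, target: str, result: str) -> str:
    """Move to `target`, or replay the stored result if it was already reached."""
    if target in payment.results:
        # Retry: return the original result and perform no new side effect.
        return payment.results[target]
    if (payment.state, target) not in ALLOWED:
        raise IllegalTransition(f"{payment.state} -> {target}")
    payment.state = target
    payment.results[target] = result
    return result


p = Payment()
first_auth = transition(p, "authorized", "auth_1")
retry_auth = transition(p, "authorized", "auth_2")  # replays auth_1
first_capture = transition(p, "captured", "cap_1")
late_auth_retry = transition(p, "authorized", "auth_3")  # still auth_1
```

Because the lookup is keyed on the target state rather than the current one, a late retry of an earlier step still gets its original result instead of an error.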

Deduplication window vs. idempotency guarantee

These two often get conflated, so the spec should define each one separately. A deduplication window is a time-bounded cache: "we remember this key for 24 hours." An idempotency guarantee is a correctness property: "within the window, the operation's side effects happen exactly once, no matter how many times the request arrives."

Outside the window, the behavior changes. A fresh request with the same key becomes a new operation. The spec has to say what that means for downstream effects — will the user get charged twice if they retry after 25 hours? Sometimes that's acceptable (tokenized card save). Sometimes it's a compliance problem (payment capture). Name it explicitly.
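The difference between the window and the guarantee is easy to see with a fake clock. The following sketch assumes a 24-hour window and an in-memory store; `WindowedStore`, `execute`, and the injected `clock` are hypothetical names, not part of any particular library.

```python
from typing import Callable, Dict, List, Tuple

WINDOW_SECONDS = 24 * 60 * 60  # the 24-hour example from the text


class WindowedStore:
    def __init__(self, clock: Callable[[], float]) -> None:
        self._clock = clock
        # key -> (time the response was stored, stored response)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def execute(self, key: str, operation: Callable[[], str]) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < WINDOW_SECONDS:
            return entry[1]  # inside the window: replay, no side effect
        # Outside the window (or first use): this is a NEW operation.
        response = operation()
        self._entries[key] = (now, response)
        return response


now = [0.0]
charges: List[str] = []


def charge() -> str:
    charges.append("charge")
    return f"charge_{len(charges)}"


store = WindowedStore(lambda: now[0])
r1 = store.execute("abc", charge)  # first request
now[0] = 23 * 3600
r2 = store.execute("abc", charge)  # retry at 23 hours: replayed
now[0] = 25 * 3600
r3 = store.execute("abc", charge)  # retry at 25 hours: charged again
```

The third call is the case the spec has to name: the same key produces a second charge once the window has expired.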

The tricky cases I force the spec to cover

Acceptance criteria that catch the real bugs

Generic "it's idempotent" acceptance criteria don't catch anything. These do:

- Given a POST /payments request with Idempotency-Key "abc"
  When the client retries with the same key and identical body
  Then the response is byte-for-byte identical to the original
  And no second charge appears in the payment processor

- Given a POST /payments request with Idempotency-Key "abc"
  When the client retries with the same key but a different amount
  Then the response is 422 with code "idempotency_key_conflict"
  And the body contains the stored request summary

- Given two concurrent POST /payments requests with the same key
  When both reach the server within 100ms
  Then exactly one charge is created
  And both requests receive the same response body

These translate directly into integration tests. If the implementer can't make these pass, the design is wrong — not the test.
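As a rough illustration of how the three criteria become tests, here is an in-memory handler they can run against. Everything in it is an assumption made for the sketch: the `IdempotentPayments` class, the 201 status for a created payment, the `stored_request` field name, and the single coarse lock, which a real service would replace with a per-key lock or a unique constraint in its store.

```python
import threading
from typing import Callable, Dict, List, Tuple

Response = Tuple[int, dict]


class IdempotentPayments:
    def __init__(self, charge: Callable[[int], str]) -> None:
        self._charge = charge
        self._lock = threading.Lock()
        # (principal, key) -> (original request body, original response)
        self._stored: Dict[Tuple[str, str], Tuple[dict, Response]] = {}

    def post_payment(self, principal: str, key: str, body: dict) -> Response:
        # The lock is held across the charge so concurrent requests with
        # the same key cannot both reach the payment processor.
        with self._lock:
            stored = self._stored.get((principal, key))
            if stored is not None:
                stored_body, response = stored
                if stored_body != body:
                    return 422, {
                        "code": "idempotency_key_conflict",
                        "stored_request": stored_body,
                    }
                return response  # replay the original response
            charge_id = self._charge(body["amount"])
            response = (201, {"charge_id": charge_id, "amount": body["amount"]})
            self._stored[(principal, key)] = (dict(body), response)
            return response


charges: List[int] = []


def fake_charge(amount: int) -> str:
    charges.append(amount)
    return f"ch_{len(charges)}"


api = IdempotentPayments(fake_charge)
first = api.post_payment("user_1", "abc", {"amount": 500})
replay = api.post_payment("user_1", "abc", {"amount": 500})
conflict = api.post_payment("user_1", "abc", {"amount": 900})

# Two concurrent requests with the same key.
results: List[Response] = []


def worker() -> None:
    results.append(api.post_payment("user_1", "xyz", {"amount": 700}))


threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The fake charge function stands in for the payment processor, so "no second charge" becomes an assertion on the length of `charges`.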

Observability the spec needs to require

Idempotency bugs are invisible in normal logs. You charged the customer twice and the request logs show two successful 200s. The spec has to require telemetry that makes duplicates visible:

What I cut from the spec

I don't specify the storage backend. "Redis with a 24-hour TTL" is an implementation note, not a contract. The spec says "the system MUST return the original response for at least 24 hours" — whether that's Redis, a DB table, or a distributed cache is the team's choice and can change without a contract revision.

I also don't specify the hashing algorithm for body comparison unless the hash itself is exposed to the client. If it's purely internal, the team can pick SHA-256 or xxhash as they please.
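To show why the algorithm can stay out of the contract, here is a sketch of an internal body fingerprint with a swappable digest. The `canonical_bytes` and `fingerprint` helpers are hypothetical, and canonicalizing through sorted-key JSON is an assumption of this example rather than a requirement from the spec.

```python
import hashlib
import json
from typing import Any, Callable


def canonical_bytes(body: Any) -> bytes:
    """Serialize the body so that key order and whitespace do not matter."""
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def blake2b_hex(data: bytes) -> str:
    return hashlib.blake2b(data).hexdigest()


def fingerprint(body: Any, digest: Callable[[bytes], str] = sha256_hex) -> str:
    """Internal-only fingerprint used to compare a retry against the original."""
    return digest(canonical_bytes(body))


a = fingerprint({"amount": 500, "currency": "USD"})
b = fingerprint({"currency": "USD", "amount": 500})  # same body, different key order
c = fingerprint({"amount": 900, "currency": "USD"})  # different amount
d = fingerprint({"amount": 500, "currency": "USD"}, digest=blake2b_hex)
```

Swapping the digest changes the stored value but not the observable behavior, which is why it belongs in an implementation note. The one operational caveat is that fingerprints stored under the old digest will not match during a switchover.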

One-sentence test for the section

Before I sign off on an idempotency section, I read it and ask: could a new engineer on another team implement this without asking me a single question? If the answer is no, the section isn't done yet.

Production Evidence to Require

Do not approve an idempotency design until the spec says what production evidence proves it is working. For a payment endpoint, I want a dashboard that shows replay count, conflict count, duplicate side-effect count, cache retention age, and downstream outbox retry depth. A release can pass tests and still be unsafe if nobody can see duplicate attempts after launch.
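Three of those dashboard numbers can be derived from a log of request attempts. The sketch below assumes each attempt records its key, a body fingerprint, and the charge it resulted in; the `Attempt` record and `summarize` function are hypothetical, and a production system would more likely emit counters at request time than scan a log.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Attempt:
    key: str
    body_fingerprint: str
    charge_id: Optional[str]  # None when the attempt created no charge


def summarize(attempts: Iterable[Attempt]) -> Dict[str, int]:
    by_key: Dict[str, List[Attempt]] = defaultdict(list)
    for attempt in attempts:
        by_key[attempt.key].append(attempt)

    replays = conflicts = duplicate_side_effects = 0
    for group in by_key.values():
        first = group[0]
        for later in group[1:]:
            if later.body_fingerprint == first.body_fingerprint:
                replays += 1  # same key, same body
            else:
                conflicts += 1  # same key, different body
        # More than one distinct charge under one key is a real duplicate.
        charge_ids = {a.charge_id for a in group if a.charge_id is not None}
        duplicate_side_effects += max(0, len(charge_ids) - 1)

    return {
        "replay_count": replays,
        "conflict_count": conflicts,
        "duplicate_side_effect_count": duplicate_side_effects,
    }


summary = summarize([
    Attempt("abc", "f1", "ch_1"),
    Attempt("abc", "f1", "ch_1"),  # healthy replay
    Attempt("abc", "f2", None),    # conflict, rejected
    Attempt("xyz", "f3", "ch_2"),
    Attempt("xyz", "f3", "ch_3"),  # bug: second charge under one key
])
```

A healthy system shows replays and conflicts above zero with a duplicate side-effect count that stays at zero; the last pair of attempts is the failure that plain request logs would report as two successes.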

Idempotency review table to attach to the spec

I use this table when a workflow can be retried by a browser, job runner, webhook provider, support agent, or AI tool. If any row is blank, the implementation is not ready for review.

| Retry source | Same key behavior | New key behavior | Evidence |
| --- | --- | --- | --- |
| User double-clicks submit | Return original result and do not create a second record. | Reject if the first operation is still pending. | UI test plus audit log query. |
| Webhook provider replays event | Store event_id and skip already-applied side effects. | Treat as a new event only when provider id differs. | Replay test with duplicate delivery. |
| Support retries manually | Show existing operation state before allowing action. | Require manager override for irreversible actions. | Manual runbook screenshot and permission test. |


Second-pass reviewer note: idempotency must cover retries and humans

This review adds a missing operational angle. Duplicate prevention is not complete until automated retries and manual actions follow the same state model.

Idempotency review:
- What key defines sameness?
- How long is the key retained?
- What does a replay return?
- What can support do while the original request is unresolved?
