Spec-First Error Handling Patterns for APIs
Error handling is the part of an API that most specs hand-wave. Everyone agrees it matters, nobody writes it down, and then integration week turns into a debate about whether a timeout should be a 500 or a 503 and whether clients should retry. The spec is supposed to answer those questions before anyone writes a handler.
One Envelope, Every Endpoint
The first decision I fight for is a single error envelope across the whole API. Not a shape per endpoint, not a shape per team. One object, with fields that every client can rely on: a stable error code string, a human message, a retryable boolean, an optional field path for validation failures, and a request id. If one endpoint returns {"error": "..."} and another returns {"message": "..."} and a third returns {"errors": [...]}, every client builds three adapters and drops them into a switch statement.
Writing the envelope into the spec is cheap. Retrofitting it after six endpoints have shipped with different shapes is a quarter of work nobody wants to own.
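As a rough illustration, the envelope could be pinned down in code along these lines. This is a sketch, not a prescribed schema: the class name ErrorEnvelope, the nesting under an "error" key, and the top-level request_id are assumptions chosen to line up with the fields listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ErrorEnvelope:
    """One error shape for every endpoint (names here are illustrative)."""
    code: str                    # stable, machine-readable error code
    message: str                 # human-readable explanation
    retryable: bool              # explicit retry signal for clients
    request_id: str              # correlation id for logs and support
    field: Optional[str] = None  # path of the offending field, if any

    def to_body(self) -> dict:
        error = {
            "code": self.code,
            "message": self.message,
            "retryable": self.retryable,
        }
        # The field path is optional, so it is omitted when unset.
        if self.field is not None:
            error["field"] = self.field
        return {"request_id": self.request_id, "error": error}


body = ErrorEnvelope(
    code="validation_failed",
    message="price must be non-negative",
    retryable=False,
    request_id="req_example",  # placeholder id for the example
    field="price",
).to_body()
```

Clients can then branch on body["error"]["code"] and body["error"]["retryable"] without knowing which endpoint produced the error.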
The 4xx vs 5xx Rule I Actually Use
The litmus test: if the client sends the exact same bytes again and the call succeeds, it was a 5xx. If the client must change something to succeed, it was a 4xx. That is it. A transient database blip is not a 400. A malformed JSON body is not a 503. I put this sentence verbatim in every API spec I write, because the line gets blurry the moment an engineer is tired.
The interesting edge cases get explicit treatment. Conflict on a unique constraint is 409, not 500, because the client caused it. A downstream vendor timing out is 5xx, because retrying the same request can succeed. A rate limit is 429, because the client can succeed later with the same bytes, just not right now.
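One way to keep the rule from drifting is to encode it as a single lookup table that handlers share. The sketch below is hypothetical: the Failure names are invented for illustration, and choosing 503 specifically for the database blip and the vendor timeout is an assumption (the rule above only says 5xx).

```python
from enum import Enum


class Failure(Enum):
    MALFORMED_BODY = "malformed_body"
    UNIQUE_CONFLICT = "unique_conflict"
    RATE_LIMITED = "rate_limited"
    DATABASE_BLIP = "database_blip"
    VENDOR_TIMEOUT = "vendor_timeout"


# failure -> (HTTP status, would the same bytes succeed on a later attempt?)
STATUS_FOR = {
    Failure.MALFORMED_BODY: (400, False),   # client must fix the body
    Failure.UNIQUE_CONFLICT: (409, False),  # client caused the conflict
    Failure.RATE_LIMITED: (429, True),      # same bytes work later
    Failure.DATABASE_BLIP: (503, True),     # assumption: 503 rather than 500
    Failure.VENDOR_TIMEOUT: (503, True),    # assumption: 503 rather than 500
}


def client_must_change_request(status: int) -> bool:
    """True for 4xx responses other than 429, per the litmus test."""
    return 400 <= status < 500 and status != 429
```

A handler would look up STATUS_FOR[failure] instead of picking a code by hand, which keeps the tired-engineer case out of the picture.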
Validation: Pick 400 or 422 and Stop Arguing
I do not care which one you pick. I care that the spec picks. Teams lose weeks to 400-versus-422 debates that would be resolved in one line of a spec. My default is 422 for semantic validation failures (field is wrong, values conflict, business rule violated) and 400 for syntactic failures (malformed JSON, missing required field, wrong content type). But any consistent rule works. Inconsistency is the bug.
When you document it, write the rule next to the envelope definition so it is impossible to miss. Then every handler points at the same page during review.
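The default split described above could be sketched as one validation function that every handler calls. The schema (name and price), the error code strings, and the function name are assumptions made up for the example; only the 400-for-syntactic, 422-for-semantic split comes from the rule itself.

```python
import json

REQUIRED_FIELDS = ("name", "price")  # hypothetical schema for illustration


def validate_item(raw_body: bytes, content_type: str):
    """Return (status, code, field); status is None when the item is valid."""
    # Syntactic failures: wrong content type, unparseable body, missing field.
    if content_type != "application/json":
        return 400, "unsupported_content_type", None
    try:
        item = json.loads(raw_body)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return 400, "malformed_json", None
    if not isinstance(item, dict):
        return 400, "malformed_json", None
    for name in REQUIRED_FIELDS:
        if name not in item:
            return 400, "missing_field", name
    # Semantic failures: the body parses but a value breaks a rule.
    price = item["price"]
    is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
    if not is_number or price < 0:
        return 422, "validation_failed", "price"
    return None, None, None
```

Because the split lives in one place, a reviewer can check a handler against the spec by checking that it calls this function and nothing else.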
Partial Success in Bulk Endpoints
Bulk endpoints are where specs quietly fall apart. You POST 10 items to /bulk-update, three fail validation, seven succeed. The server has three reasonable options, and the spec has to commit to one:
- 207 Multi-Status with a per-item status list. Honest but verbose.
- 200 OK with a results array containing per-item outcomes. Pragmatic, pairs well with the single envelope.
- Atomic 4xx. If any item fails, the whole batch rejects and nothing is persisted. Strongest integrity guarantee, worst ergonomics.
All three are defensible. Picking in the spec means clients build one code path. Not picking means every client builds all three, then files tickets when they guessed wrong. For a POST /items/bulk with three validation failures out of ten, my default response body looks like:
HTTP/1.1 200 OK
{
  "request_id": "req_7fK2...",
  "results": [
    {"index": 0, "status": "ok", "id": "itm_01"},
    {"index": 1, "status": "ok", "id": "itm_02"},
    {"index": 2, "status": "error", "error": {
      "code": "validation_failed",
      "message": "price must be non-negative",
      "field": "price",
      "retryable": false
    }},
    {"index": 3, "status": "ok", "id": "itm_04"},
    ...
  ],
  "summary": {"ok": 7, "error": 3}
}
Note the retryable: false on the per-item error. That flag is the second thing I always fight for.
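A server-side sketch of how that 200-with-results body might be assembled, assuming a per-item validator that returns either None or an error object. The function names and the placeholder id scheme (itm_ plus a one-based index) are hypothetical; a real service would return whatever id its storage layer assigns.

```python
from typing import Callable, Optional


def check_price(item: dict) -> Optional[dict]:
    """Hypothetical validator: returns an error object or None."""
    if item["price"] < 0:
        return {
            "code": "validation_failed",
            "message": "price must be non-negative",
            "field": "price",
            "retryable": False,
        }
    return None


def build_bulk_response(items: list, validate: Callable[[dict], Optional[dict]],
                        request_id: str) -> dict:
    results = []
    for index, item in enumerate(items):
        error = validate(item)
        if error is None:
            # Placeholder id; stands in for the id assigned on persist.
            results.append({"index": index, "status": "ok",
                            "id": f"itm_{index + 1:02d}"})
        else:
            results.append({"index": index, "status": "error", "error": error})
    ok = sum(1 for r in results if r["status"] == "ok")
    # The summary is derived from results so the two can never disagree.
    return {"request_id": request_id, "results": results,
            "summary": {"ok": ok, "error": len(results) - ok}}


response = build_bulk_response(
    [{"price": 1}, {"price": 2}, {"price": -1}, {"price": 4}],
    check_price,
    request_id="req_example",
)
```

Deriving the summary from the results list, rather than counting separately, is a small design choice that makes the totals match by construction.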
Make Retryable Explicit
Every error in my envelope carries a retryable boolean. Not implied by status code, not inferred from the error message, not guessed from the phase of the moon. If the server says retryable: true, clients retry with backoff. If false, clients surface the error and stop. This one field saves every SDK author from writing a status-code-to-retry-policy table, and it lets the server change its mind later without breaking clients.
The retryable flag also forces a design conversation. If an endpoint returns retryable: true, the spec must also say whether it is idempotent. Retries without idempotency keys are how you charge a customer twice.
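On the client side, honoring the flag can be as small as the loop below. It is a sketch under assumptions: send is a hypothetical callable returning (status, body), the envelope sits under an "error" key, and the backoff uses exponential delays with full jitter, which is one common choice rather than something the contract above mandates.

```python
import random
import time


def call_with_retry(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it succeeds, is terminal, or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if status < 400:
            return status, body
        error = body.get("error", {})
        # Trust the server's flag; never infer retryability from the status.
        if not error.get("retryable", False) or attempt == max_attempts:
            return status, body
        # Exponential backoff with full jitter: 0..base_delay * 2^(attempt-1).
        sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


# Usage with a stubbed transport: one retryable failure, then success.
_responses = iter([
    (503, {"error": {"code": "payment_provider_unavailable", "retryable": True}}),
    (200, {"id": "itm_01"}),
])
result = call_with_retry(lambda: next(_responses), sleep=lambda _: None)
```

Any request retried this way should also carry an idempotency key, for the double-charge reason given above.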
Idempotency and Rate Limits Go in the Spec
Any endpoint with side effects that advertises retryable failures needs an Idempotency-Key header contract. The spec states: the key is a client-generated string, honored for 24 hours, and a replay returns the original response with an X-Idempotent-Replay: true header. Write this once, and every POST, PATCH, and DELETE inherits the pattern.
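A minimal in-memory sketch of that contract follows. The class and method names are hypothetical, a production version would need shared storage and locking, and skipping storage for 5xx responses is an assumption made so that a retry after a transient failure actually re-executes.

```python
import time

TTL_SECONDS = 24 * 60 * 60  # keys are honored for 24 hours


class IdempotencyStore:
    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}  # key -> (stored_at, status, body)

    def handle(self, key, perform):
        """Run perform() once per key; replay the stored response afterwards."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < TTL_SECONDS:
            _, status, body = entry
            return status, body, {"X-Idempotent-Replay": "true"}
        status, body = perform()
        # Assumption: 5xx responses are not stored, so retries can succeed.
        if status < 500:
            self._entries[key] = (now, status, body)
        return status, body, {}


store = IdempotencyStore()
first = store.handle("key-123", lambda: (201, {"id": "itm_01"}))
replay = store.handle("key-123", lambda: (201, {"id": "itm_99"}))
```

Here replay carries the original body ({"id": "itm_01"}) and the replay header, and the second perform callable never runs.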
Rate limits get the same treatment. 429 with Retry-After is table stakes. What the spec must also document is the dimension of the limit: per API key, per IP, per organization, per route, per minute, per hour. A client that knows it has 1000 requests per minute per key will build a token bucket. A client that only sees 429s will guess badly.
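For a client that knows the documented limit, a token bucket might look like the sketch below. The class name, the default capacity equal to the per-minute rate, and the injectable clock are assumptions for illustration; the limit itself would come from the spec.

```python
import time


class TokenBucket:
    """Client-side limiter for a documented requests-per-minute limit."""

    def __init__(self, rate_per_minute, capacity=None, clock=time.monotonic):
        # Assumption: burst capacity defaults to one minute's allowance.
        self.capacity = float(rate_per_minute if capacity is None else capacity)
        self.refill_per_second = rate_per_minute / 60.0
        self._clock = clock
        self._tokens = self.capacity
        self._updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False when the caller should wait."""
        now = self._clock()
        elapsed = now - self._updated
        self._updated = now
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.refill_per_second)
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_minute=1000)  # e.g. 1000 requests/minute per key
allowed = bucket.try_acquire()
```

A client would call try_acquire() before each request and wait when it returns False, which keeps it under the limit instead of discovering the limit through 429s.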
Auth, Downstream, and Webhook Signals
Three places where the spec must commit and usually does not:
- 401 vs 403: 401 means the caller has not authenticated (missing or invalid token). 403 means the caller authenticated but is not allowed. Getting this wrong trains clients to retry auth flows on permission errors, which is a support nightmare.
- Downstream failures: a payment processor timeout is 503, a payment processor rejecting a card is a 4xx mapped to a domain error code. Never leak the vendor name in the code (stripe_timeout becomes payment_provider_unavailable). The spec should list every downstream dependency and its failure mapping.
- Webhook acknowledgements: receivers return 2xx to ack, 4xx to dead-letter, 5xx to retry. The spec names the exact codes and the retry schedule (e.g. exponential backoff over 24 hours, then dead-letter). Receivers that return 200 on failure because "the framework did it" are why webhooks silently break.
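The webhook rule can be sketched from the sender's side as a status-to-disposition function plus a retry schedule. The function names, the 60-second base delay, and treating statuses outside 2xx/4xx/5xx as retryable are all assumptions; a real spec would state the exact values.

```python
def disposition(status: int) -> str:
    """Map a receiver's response status to what the sender does next."""
    if 200 <= status < 300:
        return "ack"
    if 400 <= status < 500:
        return "dead_letter"
    # 5xx means retry; anything else (e.g. 3xx) is treated the same way
    # here, which is an assumption the spec should confirm or override.
    return "retry"


def retry_schedule(base_seconds=60, window_seconds=24 * 60 * 60):
    """Doubling delays whose total stays within the retry window."""
    delays, elapsed, delay = [], 0, base_seconds
    while elapsed + delay <= window_seconds:
        delays.append(delay)
        elapsed += delay
        delay *= 2
    return delays


schedule = retry_schedule()
```

Once the schedule is exhausted without a 2xx, the delivery goes to the dead-letter queue, matching the "backoff over 24 hours, then dead-letter" example above.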
Acceptance Criteria in Given/When/Then
The rules above are enforceable only if the spec writes them as acceptance criteria. Here is the block I use for the bulk endpoint above:
- Given a POST /items/bulk with 10 items
  And 3 items fail server-side validation
  When the request is processed
  Then the response status is 200
  And the body contains results[] with status "ok" or "error" per item
  And the summary totals match the results array
  And each error has code, message, field, and retryable=false
- Given a POST /items/bulk during a downstream outage
  When any item fails due to the outage
  Then the entire response status is 503
  And the envelope error.retryable is true
  And Retry-After is present
- Given a repeated POST with the same Idempotency-Key within 24h
  When the original request succeeded
  Then the response body matches the original
  And X-Idempotent-Replay is true
Three blocks, and the implementation, the tests, and the client SDK all have the same reference.
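Criteria written this way translate almost line for line into an automated check. The sketch below covers only the first scenario (partial validation failure) and is an assumption about how a team might encode it; the function name and the violation messages are invented for the example.

```python
REQUIRED_ERROR_KEYS = {"code", "message", "field", "retryable"}


def bulk_violations(status: int, body: dict) -> list:
    """Return a list of ways the response breaks the first scenario."""
    problems = []
    if status != 200:
        problems.append(f"expected status 200, got {status}")
    counts = {"ok": 0, "error": 0}
    for result in body.get("results", []):
        item_status = result.get("status")
        if item_status not in counts:
            problems.append(f"item {result.get('index')}: bad status {item_status!r}")
            continue
        counts[item_status] += 1
        if item_status == "error":
            error = result.get("error", {})
            missing = REQUIRED_ERROR_KEYS - error.keys()
            if missing:
                problems.append(f"item {result.get('index')}: missing {sorted(missing)}")
            elif error["retryable"] is not False:
                problems.append(f"item {result.get('index')}: retryable must be false")
    # The summary totals must match what the results array actually contains.
    if body.get("summary") != counts:
        problems.append(f"summary {body.get('summary')} does not match {counts}")
    return problems
```

The same function can run in the server's test suite and in an SDK's contract tests, so both sides check against one reading of the criteria.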
What I Refuse to Leave Out
When I review an API spec, I scan for six things and reject the draft if any are missing: the error envelope, the 4xx/5xx rule, the validation status choice, the partial-success model, the retryable flag, and the idempotency contract. Everything else can be clarified later. Those six cannot, because they shape every client that will ever integrate. Writing them down takes an hour. Not writing them down costs the team a quarter.
Error Contract Checklist
Error handling becomes reliable when clients can branch on stable fields instead of parsing prose. Keep the contract small, explicit, and tested from both server and client sides.
- Define one response envelope with stable fields for code, category, retryability, and correlation ID.
- Separate validation failures, authorization failures, rate limits, dependency failures, and idempotent replays.
- Specify retry behavior for each retryable class, including Retry-After when the server can estimate it.
- Add client examples for at least one terminal error and one retryable error.
- Test idempotency keys, partial failure behavior, and repeated requests before publishing the contract.
Evidence to Track
Track confusing client tickets, uncategorized 5xx responses, retry storms, missing correlation IDs, and SDK branches that cannot distinguish terminal from retryable failures. These are the signals that the error spec is either working or drifting.
Editorial Note
Last reviewed Apr 28, 2026: examples, internal links, and reusable review blocks were checked for practical specificity.
- Author details: Spec Coding Editorial Team