Spec-First Error Handling Patterns for APIs

Spec Coding Editorial Team · Spec-first engineering notes

Error handling is the part of an API that most specs hand-wave. Everyone agrees it matters, nobody writes it down, and then integration week turns into a debate about whether a timeout should be a 500 or a 503 and whether clients should retry. The spec is supposed to answer those questions before anyone writes a handler.

Published on 2026-03-01 · Updated 2026-05-06 · 7 min read · Author: Spec Coding Editorial Team · Review policy: Editorial Policy

One Envelope, Every Endpoint

The first decision I fight for is a single error envelope across the whole API. Not a shape per endpoint, not a shape per team. One object, with fields that every client can rely on: a stable error code string, a human message, a retryable boolean, an optional field path for validation failures, and a request id. If one endpoint returns {"error": "..."} and another returns {"message": "..."} and a third returns {"errors": [...]}, every client builds three adapters and drops them into a switch statement.

Writing the envelope into the spec is cheap. Retrofitting it after six endpoints have shipped with different shapes is a quarter of work nobody wants to own.
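As a rough illustration, the envelope described above could be modeled in Python like this. The class name `ErrorEnvelope`, the top-level `request_id`, and the `error` wrapper key are assumptions chosen for the sketch; the spec only needs to fix the five fields.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ErrorEnvelope:
    code: str                    # stable, machine-readable error code
    message: str                 # human-readable explanation
    retryable: bool              # may the client resend the same request?
    request_id: str              # correlation id for logs and support
    field: Optional[str] = None  # path to the offending field, validation only

    def to_dict(self) -> dict:
        """Serialize to a JSON-ready dict, omitting `field` when it is unset."""
        error = {
            "code": self.code,
            "message": self.message,
            "retryable": self.retryable,
        }
        if self.field is not None:
            error["field"] = self.field
        # Assumed layout: request_id at the top level, details under "error".
        return {"request_id": self.request_id, "error": error}
```

Every handler that fails would return `ErrorEnvelope(...).to_dict()`, so clients parse one shape regardless of endpoint.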

The 4xx vs 5xx Rule I Actually Use

The litmus test: if the client sends the exact same bytes again and the call succeeds, it was a 5xx. If the client must change something to succeed, it was a 4xx. That is it. A transient database blip is not a 400. A malformed JSON body is not a 503. I put this sentence verbatim in every API spec I write, because the line gets blurry the moment an engineer is tired.

The interesting edge cases get explicit treatment. Conflict on a unique constraint is 409, not 500, because the client caused it. A downstream vendor timing out is 5xx, because retrying the same request can succeed. A rate limit is the one deliberate exception to the litmus test: the same bytes can succeed later, but the response is still 429, because what the client has to change is its pacing, not its payload.
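One way to keep that rule from drifting is to write the mapping down as data. The sketch below is illustrative only: the `Failure` names are hypothetical, and using 503 for both server-side cases is an assumption (the rule itself only says "5xx").

```python
from enum import Enum


class Failure(Enum):
    MALFORMED_JSON = "malformed_json"
    UNIQUE_CONFLICT = "unique_conflict"
    RATE_LIMITED = "rate_limited"
    DATABASE_BLIP = "database_blip"
    VENDOR_TIMEOUT = "vendor_timeout"


# Single source of truth for failure -> HTTP status.
STATUS_FOR = {
    Failure.MALFORMED_JSON: 400,   # client must fix the body
    Failure.UNIQUE_CONFLICT: 409,  # client caused the conflict
    Failure.RATE_LIMITED: 429,     # same bytes can work, but only later
    Failure.DATABASE_BLIP: 503,    # assumed 503; transient server fault
    Failure.VENDOR_TIMEOUT: 503,   # assumed 503; downstream timed out
}


def status_for(failure: Failure) -> int:
    """Look up the status code the spec assigns to a failure kind."""
    return STATUS_FOR[failure]


def client_must_change_request(status: int) -> bool:
    """Litmus test: True when resending the same bytes cannot succeed.

    429 is excluded because the same request can succeed after waiting.
    """
    return 400 <= status < 500 and status != 429
```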

Validation: Pick 400 or 422 and Stop Arguing

I do not care which one you pick. I care that the spec picks. Teams lose weeks to 400-versus-422 debates that would be resolved in one line of a spec. My default is 422 for semantic validation failures (field is wrong, values conflict, business rule violated) and 400 for syntactic failures (malformed JSON, missing required field, wrong content type). But any consistent rule works. Inconsistency is the bug.

When you document it, write the rule next to the envelope definition so it is impossible to miss. Then every handler points at the same page during review.
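The 400/422 split described above might look like this in a handler. This is a sketch under assumptions: the required fields (`name`, `price`), the error code strings, and the function name `classify_body` are all hypothetical, and the non-negative price rule is borrowed from the bulk example later in this article.

```python
import json
from typing import Optional, Tuple

REQUIRED_FIELDS = ("name", "price")  # hypothetical schema for illustration


def classify_body(raw_body: str) -> Tuple[int, Optional[str]]:
    """Return (status, error_code) for a request body; (200, None) if valid."""
    # Syntactic failures: the body cannot be read as the expected shape -> 400.
    try:
        item = json.loads(raw_body)
    except json.JSONDecodeError:
        return 400, "malformed_body"
    if not isinstance(item, dict):
        return 400, "malformed_body"  # top-level value must be a JSON object
    for name in REQUIRED_FIELDS:
        if name not in item:
            return 400, "missing_field"

    # Semantic failures: well-formed, but a value breaks a rule -> 422.
    price = item["price"]
    is_number = isinstance(price, (int, float)) and not isinstance(price, bool)
    if not is_number or price < 0:
        return 422, "validation_failed"
    return 200, None
```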

Partial Success in Bulk Endpoints

Bulk endpoints are where specs quietly fall apart. You POST 10 items to /items/bulk, three fail validation, seven succeed. The server has three reasonable options, and the spec has to commit to one:

All three are defensible. Picking in the spec means clients build one code path. Not picking means every client builds all three, then files tickets when they guessed wrong. For a POST /items/bulk with three validation failures out of ten, my default response body looks like:

HTTP/1.1 200 OK
{
  "request_id": "req_7fK2...",
  "results": [
    {"index": 0, "status": "ok", "id": "itm_01"},
    {"index": 1, "status": "ok", "id": "itm_02"},
    {"index": 2, "status": "error", "error": {
      "code": "validation_failed",
      "message": "price must be non-negative",
      "field": "price",
      "retryable": false
    }},
    {"index": 3, "status": "ok", "id": "itm_04"},
    ...
  ],
  "summary": {"ok": 7, "error": 3}
}

Note the retryable: false on the per-item error. That flag is the second thing I always fight for.
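A server-side sketch that assembles a body of that shape could look like the following. The function name, the price check, and the `itm_NN` id scheme are assumptions made to mirror the sample response; a real handler would persist items and use whatever ids the datastore assigns.

```python
from typing import Any, Dict, List


def build_bulk_response(request_id: str, items: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Validate each item independently and report a per-item result."""
    results: List[Dict[str, Any]] = []
    for index, item in enumerate(items):
        price = item.get("price")
        valid = (
            isinstance(price, (int, float))
            and not isinstance(price, bool)
            and price >= 0
        )
        if valid:
            # Hypothetical id scheme, chosen only to match the sample body.
            results.append({"index": index, "status": "ok", "id": f"itm_{index + 1:02d}"})
        else:
            results.append({
                "index": index,
                "status": "error",
                "error": {
                    "code": "validation_failed",
                    "message": "price must be non-negative",
                    "field": "price",
                    "retryable": False,  # resending the same item cannot succeed
                },
            })
    ok_count = sum(1 for r in results if r["status"] == "ok")
    return {
        "request_id": request_id,
        "results": results,
        "summary": {"ok": ok_count, "error": len(results) - ok_count},
    }
```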

Make Retryable Explicit

Every error in my envelope carries a retryable boolean. Not implied by status code, not inferred from the error message, not guessed from the phase of the moon. If the server says retryable: true, clients retry with backoff. If false, clients surface the error and stop. This one field saves every SDK author from writing a status-code-to-retry-policy table, and it lets the server change its mind later without breaking clients.

The retryable flag also forces a design conversation. If an endpoint returns retryable: true, the spec must also say whether it is idempotent. Retries without idempotency keys are how you charge a customer twice.
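On the client side, honoring the flag can be as small as the loop below. It is a sketch, not a prescribed SDK design: `send` is a hypothetical callable returning the parsed response body, the error is assumed to live under an `error` key, and the backoff parameters are placeholder defaults.

```python
import random
import time
from typing import Any, Callable, Dict


def call_with_retries(
    send: Callable[[], Dict[str, Any]],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Retry only when the server says retryable: true, with backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    response: Dict[str, Any] = {}
    for attempt in range(1, max_attempts + 1):
        response = send()
        error = response.get("error")
        if error is None:
            return response  # success
        if not error.get("retryable", False) or attempt == max_attempts:
            return response  # terminal error, or out of attempts: surface it
        # Exponential backoff: base, 2x base, 4x base, ... plus random jitter.
        delay = base_delay * (2 ** (attempt - 1))
        sleep(delay + random.uniform(0, delay))
    return response
```

The client never inspects the status code to decide whether to retry. For requests with side effects, `send` should reuse the same idempotency key on every attempt.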

Idempotency and Rate Limits Go in the Spec

Any endpoint with side effects that advertises retryable failures needs an Idempotency-Key header contract. The spec states: the key is a client-generated string, honored for 24 hours, and a replay returns the original response with an X-Idempotent-Replay: true header. Write this once, and every POST, PATCH, and DELETE inherits the pattern.
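A minimal in-memory sketch of that contract follows. The class and method names are hypothetical, a production version would use a shared store rather than a dict, and the choice to skip caching 5xx outcomes (so that a retry actually re-executes) is an assumption the spec would need to confirm.

```python
import time
from typing import Any, Callable, Dict, Tuple

TTL_SECONDS = 24 * 60 * 60  # keys are honored for 24 hours

Response = Tuple[int, Dict[str, str], Any]  # (status, headers, body)


class IdempotencyStore:
    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, int, Any]] = {}

    def handle(self, key: str, execute: Callable[[], Tuple[int, Any]]) -> Response:
        """Run `execute` once per key; replay the stored response afterwards."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < TTL_SECONDS:
            _, status, body = entry
            return status, {"X-Idempotent-Replay": "true"}, body
        status, body = execute()
        if status < 500:
            # Assumption: server faults are not cached, so retries re-execute.
            self._entries[key] = (now, status, body)
        return status, {}, body
```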

Rate limits get the same treatment. 429 with Retry-After is table stakes. What the spec must also document is the shape of the limit: its scope (per API key, per IP, per organization, per route) and its window (per minute, per hour). A client that knows it has 1000 requests per minute per key will build a token bucket. A client that only sees 429s will guess badly.
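For reference, a client-side token bucket sized for a documented limit of 1000 requests per minute per key might be sketched as follows. The class name and the injectable `clock` parameter are illustrative choices, and allowing a full-capacity burst at startup is an assumption that depends on how the server actually counts.

```python
import time
from typing import Callable


class TokenBucket:
    def __init__(
        self,
        capacity: float = 1000,
        refill_per_second: float = 1000 / 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)  # start full: permits an initial burst
        self._updated = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; return False if the caller should wait."""
        now = self._clock()
        elapsed = now - self._updated
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_second)
        self._updated = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False
```

The client calls `try_acquire()` before each request and waits when it returns `False`, instead of discovering the limit through 429s.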

Auth, Downstream, and Webhook Signals

Three places where the spec must commit and usually does not:

Acceptance Criteria in Given/When/Then

The rules above are enforceable only if the spec writes them as acceptance criteria. Here is the block I use for the bulk endpoint above:

- Given a POST /items/bulk with 10 items
  And 3 items fail server-side validation
  When the request is processed
  Then the response status is 200
  And the body contains results[] with status "ok" or "error" per item
  And the summary totals match the results array
  And each error has code, message, field, and retryable=false

- Given a POST /items/bulk during a downstream outage
  When any item fails due to the outage
  Then the entire response status is 503
  And the envelope error.retryable is true
  And Retry-After is present

- Given a repeated POST with the same Idempotency-Key within 24h
  When the original request succeeded
  Then the response body matches the original
  And X-Idempotent-Replay is true

Three blocks, and the implementation, the tests, and the client SDK all have the same reference.
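The first block translates fairly directly into a contract check that both server tests and SDK tests can share. The helper below is a hypothetical sketch: it returns a list of human-readable violations rather than raising, which is a design choice for the example and not something the criteria require.

```python
from typing import Any, Dict, List

REQUIRED_ERROR_KEYS = ("code", "message", "field", "retryable")


def bulk_contract_violations(status: int, body: Dict[str, Any], item_count: int) -> List[str]:
    """Check a bulk response against the first Given/When/Then block."""
    problems: List[str] = []
    if status != 200:
        problems.append(f"status must be 200, got {status}")
    results = body.get("results", [])
    if len(results) != item_count:
        problems.append(f"expected {item_count} results, got {len(results)}")

    counts = {"ok": 0, "error": 0}
    for result in results:
        state = result.get("status")
        if state not in counts:
            problems.append(f"index {result.get('index')}: bad status {state!r}")
            continue
        counts[state] += 1
        if state == "error":
            error = result.get("error", {})
            missing = [k for k in REQUIRED_ERROR_KEYS if k not in error]
            if missing:
                problems.append(f"index {result.get('index')}: missing {missing}")
            elif error["retryable"] is not False:
                problems.append(f"index {result.get('index')}: retryable must be false")

    # The summary totals must match what the results array actually contains.
    if body.get("summary") != counts:
        problems.append(f"summary {body.get('summary')} does not match {counts}")
    return problems
```

An empty list means the response satisfies the block; anything else is a message a test can print verbatim.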

What I Refuse to Leave Out

When I review an API spec, I scan for six things and reject the draft if any are missing: the error envelope, the 4xx/5xx rule, the validation status choice, the partial-success model, the retryable flag, and the idempotency contract. Everything else can be clarified later. Those six cannot, because they shape every client that will ever integrate. Writing them down takes an hour. Not writing them down costs the team a quarter.

Error Contract Checklist

Error handling becomes reliable when clients can branch on stable fields instead of parsing prose. Keep the contract small, explicit, and tested from both server and client sides.

Evidence to Track

Track confusing client tickets, uncategorized 5xx responses, retry storms, missing correlation IDs, and SDK branches that cannot distinguish terminal from retryable failures. These are the signals that the error spec is either working or drifting.

Editorial Note

Last reviewed Apr 28, 2026: examples, internal links, and reusable review blocks were checked for practical specificity.