Webhook Consumer Spec: Signature, Retry, and Order
Most webhook consumer specs I review cover the endpoint and almost nothing else: the URL, the JSON shape, and a vague sentence about "verifying the signature." That is not a spec. A consumer spec is a receiver-side contract about what the sender can do under failure, and what your handler must guarantee when the network misbehaves. This is how I write one.
Assume the sender is hostile, flaky, and occasionally duplicates itself
My default mental model is that the sender is a reasonable system on a bad day. It will retry a request I already processed. It will deliver event 42 before event 41 because two workers dequeued out of order. And somewhere on the public internet, a bored attacker will try to replay yesterday's payment.succeeded event.
If the spec does not force the team to decide how each case is handled, the team will decide it accidentally, in a panicky hotfix after a finance report disagrees with the ledger. I reject the spec if it does not answer: who is the signing authority, how long is a signature valid, what happens on a duplicate event ID, and what happens when events arrive out of order.
Signature verification is the first gate, not a checkbox
I write signature verification as an explicit algorithm. For a Stripe-style webhook the spec reads: the sender concatenates timestamp + "." + raw_body, computes HMAC-SHA256 with the current signing secret, and sends it in the Stripe-Signature header. The receiver must read the raw body before any JSON parsing, recompute the HMAC with every active secret, and use a constant-time compare. If every candidate fails, return 400.
Three things junior specs skip: the raw body requirement (a JSON reserializer changes the bytes and breaks the signature), timing-safe comparison (an ordinary == returns at the first mismatched byte, so response timing leaks how much of the expected signature an attacker has guessed), and secret rotation. My rule is two active secrets at all times, 30-day rotation, receiver MUST accept either during overlap. Without that, rotation is a scheduled outage.
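A minimal sketch of the verification steps above, in Python. The function names (sign, verify_signature) and the secrets are illustrative and do not come from any SDK, and parsing the signature header into a timestamp and a hex digest is assumed to have happened already.

```python
import hashlib
import hmac
from typing import Sequence


def sign(secret: bytes, timestamp: str, raw_body: bytes) -> str:
    """Hex HMAC-SHA256 over timestamp + "." + raw_body."""
    signed_payload = timestamp.encode("utf-8") + b"." + raw_body
    return hmac.new(secret, signed_payload, hashlib.sha256).hexdigest()


def verify_signature(
    raw_body: bytes,
    timestamp: str,
    received_sig: str,
    active_secrets: Sequence[bytes],
) -> bool:
    """Return True if any active secret reproduces the received signature.

    raw_body must be the exact bytes read off the wire, before JSON parsing.
    """
    matched = False
    for secret in active_secrets:
        expected = sign(secret, timestamp, raw_body)
        # compare_digest takes time independent of where the inputs differ.
        if hmac.compare_digest(expected.encode("utf-8"), received_sig.encode("utf-8")):
            matched = True
    return matched


# During rotation both secrets are active, so either one verifies.
old_secret, new_secret = b"old-secret", b"new-secret"
body = b'{"id":"evt_1"}'
sig = sign(old_secret, "1700000000", body)
print(verify_signature(body, "1700000000", sig, [new_secret, old_secret]))
```

A False result maps to the 400 response. The loop deliberately checks every secret rather than returning early, which keeps the work per request roughly constant during the overlap period.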
The five-minute rule: replay windows belong in the spec
A valid signature alone does not mean the request is fresh. The spec must define a replay window. My default is 5 minutes: if abs(now - signed_timestamp) > 300 seconds, reject with 400 even if the HMAC matches. Shorter windows break when clocks skew on the sender side; longer windows widen the replay surface. If the business requires exactly-once across the window, add a nonce table keyed by signature or event ID and reject duplicates for the retention period. I usually spec a 24-hour nonce retention on top of the 5-minute timestamp window.
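The window check and the nonce table can be sketched as follows. This is an illustration under stated assumptions: NonceStore is a hypothetical in-memory stand-in for a real table, and the key may be the event ID or the signature, whichever the spec picks.

```python
import time
from typing import Dict, Optional

REPLAY_WINDOW_SECONDS = 300  # the 5-minute default


def is_fresh(signed_timestamp: float, now: Optional[float] = None) -> bool:
    """True if the signed timestamp is within the replay window of now."""
    if now is None:
        now = time.time()
    return abs(now - signed_timestamp) <= REPLAY_WINDOW_SECONDS


class NonceStore:
    """In-memory stand-in for the nonce table (hypothetical helper)."""

    def __init__(self, retention_seconds: int = 24 * 3600) -> None:
        self.retention_seconds = retention_seconds
        self._seen: Dict[str, float] = {}

    def check_and_record(self, key: str, now: float) -> bool:
        """Return True the first time a key is seen within retention."""
        # Drop entries older than the retention period.
        cutoff = now - self.retention_seconds
        self._seen = {k: t for k, t in self._seen.items() if t >= cutoff}
        if key in self._seen:
            return False  # replay or duplicate inside the retention period
        self._seen[key] = now
        return True
```

Both checks run only after the HMAC has matched, since the timestamp is trustworthy only once the signature covering it has been verified.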
Idempotency is a handler property, not a hope
Every spec I write includes a non-negotiable line: the handler MUST be idempotent with respect to the sender's event ID. The sender will retry; the spec's job is to make the second delivery a no-op.
Concrete design: a webhook_events table with (source, event_id) as the primary key. The handler opens a transaction, inserts the event row (conflict = already seen), performs the business effect, and commits. If the insert conflicts, return 200 without re-running the effect.
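One way to sketch that design, using SQLite from the Python standard library as a stand-in for the production database. The handle_once name and the effect callback are illustrative; the table and key follow the design above.

```python
import sqlite3
from typing import Callable


def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS webhook_events ("
        " source TEXT NOT NULL,"
        " event_id TEXT NOT NULL,"
        " PRIMARY KEY (source, event_id))"
    )


def handle_once(
    conn: sqlite3.Connection,
    source: str,
    event_id: str,
    effect: Callable[[sqlite3.Connection], None],
) -> bool:
    """Run effect(conn) at most once per (source, event_id).

    Returns True if the effect ran, False if the event was already seen.
    The caller returns 200 in both cases.
    """
    with conn:  # commits on success, rolls back if effect raises
        cur = conn.execute(
            "INSERT OR IGNORE INTO webhook_events (source, event_id) VALUES (?, ?)",
            (source, event_id),
        )
        if cur.rowcount == 0:
            return False  # primary-key conflict: duplicate delivery
        effect(conn)
    return True
```

Because the event row and the business effect share one transaction, a failed effect also rolls back the event row, so the sender's retry gets a fresh attempt rather than being mistaken for a duplicate.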
For effects that cannot be rolled back with the database (charging a card, sending an email), add a downstream state check. Before creating a Stripe refund in response to charge.dispute.created, the handler queries the refund API for an idempotency key derived from the event ID. If no refund exists, create one. If one exists, log and return 200.
Retry semantics belong to the sender; tolerance belongs to you
The spec must write down the sender's retry schedule, and schedules differ widely. Stripe retries with exponential backoff for up to 3 days. GitHub does not automatically redeliver failed deliveries; redelivery is a manual or API-driven step. For an internal bus I spec: immediate, 30s, 2m, 10m, 1h, 6h, 24h, dead-letter. What the receiver must tolerate falls out of that schedule: if retries span roughly a day, the dedup table keeps rows for at least 30 days, and the handler must be cheap enough that a duplicate delivery costs no more than the original.
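A small sketch for checking dedup retention against the internal-bus schedule. It reads each entry as the wait before the next attempt, which is an assumption (the schedule could also be read as offsets from the first attempt), and the safety_factor parameter is hypothetical.

```python
from datetime import timedelta
from typing import Sequence

# Internal-bus schedule: immediate, 30s, 2m, 10m, 1h, 6h, 24h, then dead-letter.
RETRY_DELAYS = [
    timedelta(0),
    timedelta(seconds=30),
    timedelta(minutes=2),
    timedelta(minutes=10),
    timedelta(hours=1),
    timedelta(hours=6),
    timedelta(hours=24),
]
DEDUP_RETENTION = timedelta(days=30)


def retry_span(delays: Sequence[timedelta]) -> timedelta:
    """Total time between the first attempt and the last retry."""
    return sum(delays, timedelta(0))


def retention_covers_retries(
    retention: timedelta,
    delays: Sequence[timedelta],
    safety_factor: float = 2.0,
) -> bool:
    """True if dedup rows outlive the retry span by the given margin."""
    return retention >= retry_span(delays) * safety_factor


print(retry_span(RETRY_DELAYS))  # 1 day, 7:12:30
print(retention_covers_retries(DEDUP_RETENTION, RETRY_DELAYS))  # True
```

Under that reading the retries span a little over 31 hours, which a 30-day retention covers comfortably. A check like this is cheap to put in a config test so the two numbers cannot drift apart.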
Status codes are a contract with the sender's retry loop
This is the part nobody writes down, and it causes more outages than signature bugs. My spec always includes this table:
- 200 / 204: Processed. Do not retry.
- 200 with {"ignored": true}: Received but not acted on (unknown event type, filtered customer, stale version). Do not retry.
- 400: Signature invalid, timestamp outside window, or malformed body. Permanent. Do not retry.
- 422: Semantic rejection (references a deleted resource). Permanent. Do not retry.
- 500 / 502 / 503: Handler blew up or a downstream is down. Retry per the sender's schedule.
- 429: Overloaded. Retry with backoff, honor Retry-After if set.
The one I see misused most: returning 500 when the signature fails. That pins the sender in a retry loop against a request that will never succeed. Signature failure is 400.
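The table can be pinned down in code so that no handler path picks a status ad hoc. A minimal sketch; the Outcome names are illustrative, and the mapping simply restates the table.

```python
from enum import Enum


class Outcome(Enum):
    PROCESSED = "processed"
    IGNORED = "ignored"
    BAD_SIGNATURE = "bad_signature"
    STALE_TIMESTAMP = "stale_timestamp"
    MALFORMED_BODY = "malformed_body"
    SEMANTIC_REJECTION = "semantic_rejection"
    DOWNSTREAM_UNAVAILABLE = "downstream_unavailable"
    OVERLOADED = "overloaded"


STATUS_BY_OUTCOME = {
    Outcome.PROCESSED: 200,
    Outcome.IGNORED: 200,             # body carries the "ignored" marker
    Outcome.BAD_SIGNATURE: 400,       # permanent: never 500
    Outcome.STALE_TIMESTAMP: 400,
    Outcome.MALFORMED_BODY: 400,
    Outcome.SEMANTIC_REJECTION: 422,
    Outcome.DOWNSTREAM_UNAVAILABLE: 503,
    Outcome.OVERLOADED: 429,
}


def status_for(outcome: Outcome) -> int:
    """HTTP status the handler returns for a given outcome."""
    return STATUS_BY_OUTCOME[outcome]


def sender_should_retry(status: int) -> bool:
    """The retry contract from the table: only 429 and 5xx are retryable."""
    return status == 429 or 500 <= status <= 599
```

Keeping the mapping in one place also makes the misuse above easy to test for: a signature failure must never resolve to a retryable status.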
Ordering: assume none, carry a version field
I spec every consumer as order-agnostic unless the sender makes a hard guarantee, and almost no sender does. The mitigation is a monotonic version on the underlying resource. When subscription.updated arrives, the handler compares the event's data.version (or updated_at) against the last version persisted. Older, return 200 and ignore. Newer, apply it. Equal, duplicate.
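Without that check, a pair sent as "set plan to free" and then "set plan to pro" can arrive in reverse order, and a handler that applies whatever arrived last silently downgrades a paying customer. With the version gate, the stale event is ignored on arrival, provided the version is truly monotonic and the compare-and-write happens atomically.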
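A minimal sketch of the version gate, using a plain dict as a stand-in for the persisted resource state. The apply_if_newer name and the return labels are illustrative; in a real handler the comparison and the write would sit in one transaction.

```python
from typing import Any, Dict


def apply_if_newer(
    store: Dict[str, Dict[str, Any]],
    resource_id: str,
    version: int,
    state: Dict[str, Any],
) -> str:
    """Apply an update only if its version is newer than the persisted one.

    Returns "applied", "stale", or "duplicate". The handler returns 200
    for all three.
    """
    current = store.get(resource_id)
    if current is not None:
        if version < current["version"]:
            return "stale"      # older than what is persisted: ignore
        if version == current["version"]:
            return "duplicate"  # same version already applied
    store[resource_id] = {"version": version, "state": state}
    return "applied"


# The upgrade (version 2) arrives before the older event (version 1).
subscriptions: Dict[str, Dict[str, Any]] = {}
print(apply_if_newer(subscriptions, "sub_1", 2, {"plan": "pro"}))   # applied
print(apply_if_newer(subscriptions, "sub_1", 1, {"plan": "free"}))  # stale
print(subscriptions["sub_1"]["state"]["plan"])                      # pro
```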
Acceptance criteria in Given/When/Then
- Given a POST with a valid HMAC-SHA256 signature and a timestamp within 300 seconds
When the event_id has not been seen before
Then the handler inserts (source, event_id) into webhook_events and returns 200
- Given a POST with a valid signature and a known event_id
When the handler runs
Then it performs no downstream writes and returns 200 within 50 ms
- Given a POST whose signed timestamp is more than 300 seconds old
When the handler runs
Then it returns 400 with body {"error": "timestamp_outside_window"}
- Given a subscription.updated event whose data.version is less than persisted
When the handler runs
Then it returns 200 with body {"ignored": "stale_version"} and makes no state change
- Given the downstream database is unreachable
When the handler runs
Then it returns 503 and the sender retries per its schedule
Dead-letter, logging, and the one metric I actually watch
The receiver needs its own dead-letter, independent of the sender's. When the handler returns 5xx three times for the same event ID, I move it to a webhook_dlq table with the raw body, headers, and last error. The sender will eventually stop retrying, and I do not want to lose the payload when it does.
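A sketch of the receiver-side dead-letter, again with SQLite as a stand-in. The webhook_failures counter table is an assumption added for illustration, since counting failures per event ID needs somewhere to live; only webhook_dlq comes from the design above.

```python
import json
import sqlite3
from typing import Dict

MAX_FAILURES = 3


def init_dlq(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS webhook_failures ("
        " source TEXT NOT NULL, event_id TEXT NOT NULL,"
        " failures INTEGER NOT NULL,"
        " PRIMARY KEY (source, event_id))"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS webhook_dlq ("
        " source TEXT NOT NULL, event_id TEXT NOT NULL,"
        " raw_body BLOB NOT NULL, headers TEXT NOT NULL,"
        " last_error TEXT NOT NULL,"
        " PRIMARY KEY (source, event_id))"
    )


def record_failure(
    conn: sqlite3.Connection,
    source: str,
    event_id: str,
    raw_body: bytes,
    headers: Dict[str, str],
    error: str,
) -> bool:
    """Count a 5xx for this event; dead-letter it on the third failure.

    Returns True if the event was written to webhook_dlq.
    """
    key = (source, event_id)
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO webhook_failures (source, event_id, failures)"
            " VALUES (?, ?, 0)",
            key,
        )
        conn.execute(
            "UPDATE webhook_failures SET failures = failures + 1"
            " WHERE source = ? AND event_id = ?",
            key,
        )
        (failures,) = conn.execute(
            "SELECT failures FROM webhook_failures WHERE source = ? AND event_id = ?",
            key,
        ).fetchone()
        if failures < MAX_FAILURES:
            return False
        # Keep the exact bytes so the event can be replayed and re-verified.
        conn.execute(
            "INSERT OR REPLACE INTO webhook_dlq"
            " (source, event_id, raw_body, headers, last_error)"
            " VALUES (?, ?, ?, ?, ?)",
            (source, event_id, raw_body, json.dumps(headers), error),
        )
    return True
```

Storing the raw body rather than the parsed JSON matters for the same reason it does at the signature gate: a replay from the dead-letter table can be re-verified only against the original bytes.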
Every log line carries event_id, source, and event_type. The single metric I put on the dashboard is handler completion latency measured from the signed event timestamp, not from HTTP receipt. That captures sender queue lag, network transit, retry delay, and my own processing time in one line. If it crosses a few minutes at p99, something upstream is on fire and I want to know before a customer tells me.
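The metric itself is simple to compute. A minimal sketch, assuming the signed timestamp and the completion time are both Unix seconds and using a nearest-rank percentile; the function names are illustrative, and a real deployment would hand the samples to whatever metrics library is already in place.

```python
from typing import Sequence


def completion_latency_seconds(signed_timestamp: float, completed_at: float) -> float:
    """Latency from the sender's signed timestamp to handler completion.

    Clamped at zero so modest clock skew cannot produce negative samples.
    """
    return max(0.0, completed_at - signed_timestamp)


def p99(samples: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of the samples."""
    if not samples:
        raise ValueError("p99 needs at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.99 * n), avoiding float rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


latencies = [
    completion_latency_seconds(signed, done)
    for signed, done in [(1000.0, 1000.4), (1000.0, 1002.0), (1000.0, 1190.0)]
]
print(p99(latencies))  # 190.0
```

Because the clock starts at the signed timestamp, a retried delivery reports its full age, which is exactly what makes upstream trouble visible in this one number.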
Contract Review Packet to Copy
Use this when the work touches API behavior, schema, events, retries, or consumer expectations. The packet makes compatibility and release evidence explicit.
API contract review packet: Webhook Consumer Spec: Signature, Retry, and Order
Decision to make:
- How to write a webhook consumer spec: signature verification, replay protection, retry and backoff rules, ordering assumptions, and idempotent handler design.
Owner check:
- Product owner:
- Engineering owner:
- QA or operations reviewer:
Scope boundary:
- In scope:
- Out of scope:
- Assumption that still needs approval:
Acceptance evidence:
- Test or fixture:
- Log, metric, or screenshot:
- Manual review step:
Contract boundary: no release without compatibility classification, consumer impact, retry behavior, and rollback notes.
Reviewer prompt:
- What would still be ambiguous to someone who missed the planning meeting?
- What evidence would make this safe enough to ship?
Editorial Review Note
Reviewed Apr 28, 2026. This update added a reusable artifact, checked the article against the related topic hub, and tightened the next-step links so the page works as a practical reference rather than a standalone essay.
- Author details: Spec Coding Editorial Team