Queue Processing Spec: Exactly-Once vs At-Least-Once
When someone on my team says "exactly-once queue," I stop the conversation. The phrase is a marketing abstraction, not a delivery guarantee, and a spec written against it ships bugs that only surface at 3 a.m. This is how I force the spec to name the real guarantee — at-least-once with consumer-side deduplication — and pin down the idempotency, dead-letter, and replay rules that decide whether the consumer behaves.
The lie of "exactly-once" and what the spec should say instead
Brokers marketed as exactly-once (Kafka transactions, SQS FIFO, RabbitMQ with publisher confirms plus consumer acks) still deliver the same message twice under perfectly ordinary, recoverable failures. A consumer crashes after side effects but before ack. A broker rebalances mid-batch. A network partition replays an in-flight commit. The "once" applies only to broker bookkeeping, not to observable effects downstream. I write every queue spec with one sentence near the top: delivery is at-least-once; exactly-once behavior is the consumer's responsibility via idempotency keys. That line ends the ambiguity and tells the consumer author exactly what they owe the contract.
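The gap is easiest to see as code. Here is a minimal sketch of the window the broker cannot observe, assuming a generic consumer loop; queue, gateway, receive, charge_card, and ack are hypothetical stand-ins, not any particular client library:

```python
# Sketch of the duplicate-delivery window. `queue` and `gateway` are
# hypothetical stand-ins for a broker client and a downstream dependency.
def consume_once(queue, gateway):
    msg = queue.receive()          # broker marks the message in-flight
    gateway.charge_card(msg.body)  # observable side effect happens here
    # Crash window: if the process dies on this line, the side effect is
    # already real, but the broker never saw an ack and will redeliver.
    queue.ack(msg)                 # only now does broker bookkeeping say "once"
```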
Which semantic is worth paying for
I pick a semantic per stream, not per system, because the cost curve is different for each payload type; a sketch of the resulting per-stream contract table follows the list.
- Money moves (charge a card, transfer funds, issue a refund): spec at-least-once delivery with strict consumer idempotency keyed on the charge intent ID. Paying for broker-side exactly-once features is justified here because the downside of duplication is a chargeback.
- Email and notification sends: at-least-once is fine, with a dedup window of 24 hours keyed on (user_id, template_id, trigger_event_id). One duplicate welcome email is an annoyance; a missed one is a support ticket.
- Analytics and metrics: at-most-once is usually the right call. Dropping 0.01% of clickstream events is cheaper than paying for the durability and dedup machinery to guarantee every one. Spec the acceptable loss rate explicitly.
- Audit logs: at-least-once, no dedup, append-only sink. Duplicates are tolerable; gaps are a compliance problem.
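One way to make the per-stream choice reviewable is to write it down as a contract table in code. This is a sketch under my own naming; the stream names, key fields, and windows are illustrative, not a canonical schema:

```python
# Illustrative per-stream delivery contracts. Names and numbers mirror the
# list above; none of this is a standard schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class StreamContract:
    delivery: str                    # "at-least-once" or "at-most-once"
    dedup_key: tuple[str, ...]       # fields forming the idempotency key
    dedup_window_hours: int | None   # None = dedup records never expire
    acceptable_loss: float           # spelled out even when it is 0.0

CONTRACTS = {
    "payments.capture": StreamContract("at-least-once", ("payment_intent_id",), None, 0.0),
    "email.send":       StreamContract("at-least-once", ("user_id", "template_id", "trigger_event_id"), 24, 0.0),
    "analytics.click":  StreamContract("at-most-once", (), None, 0.0001),
    "audit.append":     StreamContract("at-least-once", (), None, 0.0),  # dupes ok, gaps are not
}
```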
Idempotent consumer design: dedup table, natural keys, state-machine guards
My default pattern is a dedup table keyed on a natural message identifier, written in the same database transaction as the side effect. For a payment queue, the idempotency key is the upstream payment intent ID, not the broker message ID — because the broker might republish the same intent under a new message ID after a producer retry. The consumer's first action is INSERT INTO processed_messages (key, processed_at) with a unique constraint; if that insert conflicts, the message is a duplicate and the consumer acks without touching the side effect.
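A minimal sketch of that pattern, using sqlite3 so it runs standalone; the table names follow the prose above and ack stands in for whatever the broker client provides. The property that matters is that the dedup insert and the side effect commit in the same transaction:

```python
# Dedup table keyed on the upstream intent ID, written in the same
# transaction as the side effect. sqlite3 keeps the sketch self-contained.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_messages (key TEXT PRIMARY KEY, processed_at TEXT)")
conn.execute("CREATE TABLE ledger (payment_intent_id TEXT, amount_cents INTEGER)")
conn.commit()

def ack(msg):
    pass  # stand-in; a real consumer calls the broker client here

def handle(msg):
    try:
        with conn:  # one transaction: both rows commit together or not at all
            conn.execute(
                "INSERT INTO processed_messages VALUES (?, datetime('now'))",
                (msg["payment_intent_id"],),  # upstream intent ID, not broker message ID
            )
            conn.execute(
                "INSERT INTO ledger VALUES (?, ?)",
                (msg["payment_intent_id"], msg["amount_cents"]),
            )
    except sqlite3.IntegrityError:
        pass  # unique-constraint conflict: duplicate delivery, side effect untouched
    ack(msg)  # safe to ack on both first delivery and duplicates
```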
Where a dedup table is too expensive, I use state-machine guards instead. An order transition from pending to shipped is a conditional update: UPDATE orders SET status='shipped' WHERE id=? AND status='pending'. Zero rows updated means the transition already happened and the consumer treats it as a successful no-op. It only works when the side effect maps cleanly to a state transition.
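The guard as a sketch, continuing the sqlite3 style and assuming an orders table with id and status columns; zero rows updated is the duplicate signal:

```python
# State-machine guard: the WHERE clause carries the idempotency check,
# so no dedup table is needed.
def mark_shipped(conn, order_id):
    with conn:
        cur = conn.execute(
            "UPDATE orders SET status = 'shipped' WHERE id = ? AND status = 'pending'",
            (order_id,),
        )
    if cur.rowcount == 0:
        return "noop"      # transition already happened; treat as success and ack
    return "transitioned"  # this delivery won the conditional update
```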
Ordering guarantees and what breaks them
Most brokers offer per-partition or per-key ordering, not global ordering. If the spec says "process events in order," I ask: ordered by what key? For account events, the partition key is user_id; events for different users interleave, while events for the same user are strictly ordered. This is the only ordering guarantee that survives scale.
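The mechanics behind that guarantee fit in a few lines. This is a toy model of keyed partitioning, not any broker's real partitioner: the same user_id always hashes to the same partition, and a single consumer per partition sees that user's events in append order.

```python
# Toy model of per-key ordering: same user_id -> same partition -> one
# consumer -> strict order for that user. Real brokers ship their own
# partitioners; crc32 here just gives a stable, process-independent hash.
import zlib

NUM_PARTITIONS = 12

def partition_for(user_id: str) -> int:
    return zlib.crc32(user_id.encode()) % NUM_PARTITIONS

# u_1's events never leave their partition; u_1 and u_2 may interleave.
assert partition_for("u_1") == partition_for("u_1")
```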
What breaks ordering in practice: parallel consumer threads on the same partition, a retry that resubmits after later messages have been processed, and consumer-side batching that commits out of order. I call each of these out in the spec as a forbidden pattern.
Dead-letter rules: retry, DLQ, or drop
The spec has to answer three questions for every failure class: does the consumer retry, does it dead-letter, or does it drop? My rule of thumb, with a classification sketch after the list:
- Transient infra failure (timeout, 503 from a dependency, connection reset): retry in place with exponential backoff starting at 1s, doubling to a 5-minute cap, for up to 6 attempts. Then DLQ.
- Poison message (malformed payload, schema violation, missing required field): do not retry. DLQ on first failure with the raw payload and the parse error attached as metadata.
- Business rejection (user deleted, account suspended, idempotency conflict): ack and drop with a structured log line. These are not failures; they are expected outcomes.
- Unknown exception: retry up to 3 times, then DLQ. Unknown means I could not classify the failure during spec review, which is itself a flag.
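Here is that taxonomy as code. The exception classes are hypothetical stand-ins for whatever the consumer's dependencies actually raise; the attempt counts and backoff numbers mirror the list above:

```python
# Retry taxonomy as code. The exception classes below are hypothetical
# stand-ins, not a real library's hierarchy.
class DependencyUnavailable(Exception): ...  # e.g. a 503 or timeout downstream
class MalformedPayload(Exception): ...
class SchemaViolation(Exception): ...
class UserDeleted(Exception): ...
class AccountSuspended(Exception): ...
class IdempotencyConflict(Exception): ...

RETRY, DLQ, DROP = "retry", "dlq", "drop"

def classify(exc, attempt):
    if isinstance(exc, (TimeoutError, ConnectionResetError, DependencyUnavailable)):
        return RETRY if attempt < 6 else DLQ  # transient infra failure
    if isinstance(exc, (MalformedPayload, SchemaViolation)):
        return DLQ                            # poison: never retry
    if isinstance(exc, (UserDeleted, AccountSuspended, IdempotencyConflict)):
        return DROP                           # business rejection: ack and log
    return RETRY if attempt < 3 else DLQ      # unknown: a flag in spec review

def backoff_seconds(attempt):
    return min(2 ** attempt, 300)  # 1s doubling, capped at 5 minutes
```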
The DLQ is not a graveyard. The spec names an on-call rotation, an alert threshold (I usually start at 10 messages in 15 minutes), and an explicit owner of DLQ triage. A DLQ without triage ownership silently becomes the drop queue.
A worked example: a payment-capture queue
Here is the contract I wrote for a recent payment-capture consumer. The producer is an order service; the consumer calls a payment gateway and writes the result to a ledger.
- Given an order-captured event with payment_intent_id=pi_abc
When the consumer receives it for the first time
Then it inserts pi_abc into processed_payments with status='in_flight',
calls the gateway, writes the capture result to the ledger,
updates processed_payments.status, and acks
- Given the same event redelivered after a consumer crash
When the consumer attempts to insert pi_abc
Then the unique constraint rejects the insert,
the consumer reads the existing row,
and if status='in_flight' it reconciles with the gateway before acking
- Given a gateway 503 response
When the consumer has retried fewer than 6 times
Then it nacks the message for redelivery with exponential backoff
- Given a gateway response of "card_declined"
When the consumer receives it
Then it writes a declined record to the ledger, acks the message,
and does not DLQ (declines are a business outcome, not a failure)
The reconcile-with-gateway step is the part teams skip. Without it, a consumer that crashed mid-capture can't tell whether the charge landed, and the safest default (retry) becomes the most dangerous (double-charge).
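A sketch of the reconcile branch, reusing the sqlite3 style from earlier and the table names from the contract above; gateway.lookup_capture is a hypothetical query-by-intent call, and the one rule the code encodes is never re-charge blind:

```python
# Reconcile on redelivery: an 'in_flight' row means the crash happened
# somewhere between the gateway call and the status update, so ask the
# gateway what actually landed. `gateway.lookup_capture` is hypothetical.
def handle_redelivery(conn, gateway, intent_id):
    row = conn.execute(
        "SELECT status FROM processed_payments WHERE key = ?", (intent_id,)
    ).fetchone()
    if row is None or row[0] != "in_flight":
        return "noop"  # fully processed or never seen: ordinary duplicate, just ack
    capture = gateway.lookup_capture(intent_id)  # query by intent; never re-charge blind
    with conn:
        if capture is not None:  # the charge landed before the crash
            conn.execute("INSERT INTO ledger VALUES (?, ?)", (intent_id, capture.amount_cents))
            conn.execute("UPDATE processed_payments SET status = 'captured' WHERE key = ?", (intent_id,))
        else:                    # the charge never happened; capture is safe to retry
            conn.execute("UPDATE processed_payments SET status = 'pending_retry' WHERE key = ?", (intent_id,))
    return "reconciled"
```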
Replay and reprocess procedures
Replay is where spec quality gets tested. The spec has to answer: can an operator safely replay the last 24 hours without causing duplicate side effects? If the answer is no, the consumer is not actually idempotent and the spec is lying. I require a documented replay runbook with the exact command, the blast radius, and the expected no-op rate — typically 100%. Any non-zero side-effect rate during replay is a bug in the dedup layer.
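The runbook's acceptance check is small enough to write down. A sketch, assuming a handle that reports no-ops the way the guards above do:

```python
# Replay verification: feed the window back through the consumer and
# measure the no-op rate. Any side effect during replay is a dedup bug.
def verify_replay(messages, handle):
    noops = sum(1 for msg in messages if handle(msg) == "noop")
    rate = noops / len(messages) if messages else 1.0
    assert rate == 1.0, f"replay caused side effects: no-op rate {rate:.2%}"
    return rate
```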
Lag SLOs and alerting thresholds
A queue spec without a lag SLO is incomplete. I write two numbers: steady-state p95 consumer lag and the alert threshold. For a payment queue, p95 under 30 seconds with an alert at 5 minutes of lag sustained for 2 minutes. For an email queue, p95 under 2 minutes with an alert at 15 minutes. The alert threshold sits roughly an order of magnitude above the SLO so routine jitter doesn't page the on-call. The spec also names what the alert means: a single lag alert means investigate; a second inside an hour means roll back the last deploy.
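The word sustained deserves code, because it is what keeps a single spike from paging anyone. A minimal sketch, assuming lag is sampled on a fixed interval:

```python
# "Sustained" alert check: page only when every recent sample exceeds the
# threshold. Numbers mirror the payment-queue example: 300s lag threshold,
# 12 samples at a 10s scrape interval = 2 minutes sustained.
def should_page(lag_samples_seconds, threshold=300, sustain_samples=12):
    recent = lag_samples_seconds[-sustain_samples:]
    return len(recent) == sustain_samples and all(s > threshold for s in recent)
```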
Consumer Review Checklist
The queue consumer is where the promise becomes real. I review it with this checklist, and I expect the spec to name the evidence for each line.
- Idempotency storage: the unique key, retention window, and behavior when the existing record is still in flight.
- Ack timing: exactly when the consumer acknowledges, and which side effects must complete first.
- Retry taxonomy: which failures are retryable, terminal, poison, or business outcomes.
- Replay proof: the command used for replay, the no-op rate expected, and who approves it in production.
- Backpressure rule: what pauses producers when lag crosses the threshold.
If the checklist cannot be answered from the spec, the implementation will invent the answers. That is exactly how duplicate emails, duplicate charges, and silent drops sneak into otherwise ordinary queue work.