Event-Driven Systems: Specification Patterns
Most of the event-driven systems I have reviewed did not fail because the broker was slow or the handlers were buggy. They failed because the spec treated every message as if it were the same kind of thing, and left consumers to guess whether a message was a request, an announcement, or a nudge. The patterns below are the ones I now insist on before any producer is allowed to publish to a shared topic.
Name Command Events and Fact Events Differently, Always
The single biggest source of integration pain I see is specs that mash commands and facts into one stream. A command event says "please do this" and expects exactly one owner to act. A fact event says "this already happened" and assumes zero or more subscribers will react. Different ownership, different retry semantics, different failure modes.
My specs require command events to use imperative verbs (ShipOrder, ChargePayment) and fact events to use past-tense verbs (OrderShipped, PaymentCharged). If a proposed event cannot cleanly pick a side, the boundary is wrong; the fix is to redraw it, not to invent a third category.
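A minimal sketch of how that naming rule can be checked mechanically in a spec repo; the past-tense suffix heuristic and the example names are assumptions for illustration, not a real registry API.

```python
# A minimal lint for event declarations, assuming each event names its kind
# explicitly. The past-tense suffix heuristic and example names are illustrative.
PAST_TENSE_SUFFIX = "ed"   # OrderShipped, PaymentCharged, PaymentFailed

def check_event_name(name: str, kind: str) -> list[str]:
    """Return problems with an event declaration; an empty list means it passes."""
    problems = []
    if kind not in ("command", "fact"):
        problems.append(f"{name}: kind must be 'command' or 'fact', got {kind!r}")
    elif kind == "fact" and not name.endswith(PAST_TENSE_SUFFIX):
        problems.append(f"{name}: fact events use past-tense names (e.g. OrderShipped)")
    elif kind == "command" and name.endswith(PAST_TENSE_SUFFIX):
        problems.append(f"{name}: command events use imperative names (e.g. ShipOrder)")
    return problems

# check_event_name("ShipOrder", "command")    -> []
# check_event_name("OrderShipped", "command") -> one problem: wrong verb form
```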
Pick Choreography or Orchestration Per Flow, Not Per System
Teams argue choreography versus orchestration as if it were a platform-wide religion. I pick per flow. A flow with branching logic, compensations, and one owner belongs in an orchestrator. A flow where many independent services react to a business fact belongs in choreography.
The spec must say which one and why. For orchestration, name the orchestrator, the state machine, and the per-step timeout. For choreography, name the fact event, the subscribers, and each subscriber's independent deadline. I do not accept "we will emit an event and see what happens" as a design.
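A sketch of what "say which one and why" can look like as a checked artifact rather than prose; the dataclass fields, flow names, team names, and timeouts below are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class OrchestratedStep:
    name: str
    timeout_seconds: int          # per-step timeout, owned by the orchestrator
    compensation: str | None = None

@dataclass
class FlowSpec:
    """Per-flow coordination choice, written down instead of implied."""
    flow: str
    style: str                    # "orchestration" or "choreography"
    owner: str
    steps: list[OrchestratedStep] = field(default_factory=list)   # orchestration only
    fact_event: str | None = None                                 # choreography only
    subscribers: dict[str, int] = field(default_factory=dict)     # subscriber -> deadline (s)

# Orchestration: branching logic, compensations, one owner.
refund_flow = FlowSpec(
    flow="refund",
    style="orchestration",
    owner="payments-team",
    steps=[
        OrchestratedStep("reverse_charge", timeout_seconds=30,
                         compensation="flag_for_manual_review"),
        OrchestratedStep("restock_items", timeout_seconds=60,
                         compensation="emit_StockAdjustmentNeeded"),
    ],
)

# Choreography: many independent services react to one business fact.
order_placed_flow = FlowSpec(
    flow="order_placed",
    style="choreography",
    owner="orders-team",
    fact_event="OrderPlaced",
    subscribers={"payments": 30, "inventory": 10, "notifications": 300},
)
```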
Schema Versioning: Additive by Default, Breaking by Policy
Every event schema lives in a registry with a version in the envelope. The policy: additive changes (new optional fields, new enum values behind a flag) ship without a version bump. Breaking changes (removed fields, renamed fields, tightened required constraints) get a new major version and run parallel with the old for a named deprecation window, usually two release cycles.
The spec states the deprecation window, the metric that proves no consumer is still on the old version, and the date the old topic is deleted. Without that, "v2" becomes a permanent parallel universe and the registry fills with zombies nobody dares remove.
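One way to encode the additive-by-default rule as a registry gate; each schema is reduced here to a map of field name to required-or-not, so treat this as an illustration of the policy, not any particular registry's compatibility model.

```python
# A sketch of the additive-by-default rule. Real registries have richer schema
# models; the field-map shape and the breaking conditions below are assumptions.
def is_additive(old_fields: dict[str, bool], new_fields: dict[str, bool]) -> bool:
    """True when the change ships without a major version bump."""
    for name, was_required in old_fields.items():
        if name not in new_fields:
            return False          # removed or renamed field: breaking
        if new_fields[name] and not was_required:
            return False          # optional field tightened to required: breaking
    for name, is_required in new_fields.items():
        if name not in old_fields and is_required:
            return False          # new fields must be optional to stay additive
    return True

# is_additive({"order_id": True}, {"order_id": True, "gift_note": False}) -> True
# is_additive({"order_id": True, "sku": True}, {"order_id": True})        -> False
```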
State the Eventual-Consistency Boundary Out Loud
Every event-driven spec I write has a section titled "Consistency Boundary" that answers three questions in plain English. Where does the transactional boundary end? How stale can a downstream read be before a user notices? What does the UI show during the gap?
"The order service commits the row and publishes OrderPlaced in the same transaction using an outbox. The inventory projection lags by up to 2 seconds at p99. During that window the order detail page shows a 'reserving stock' placeholder, not a spinner, and not a missing row." That sentence has saved me more arguments than any architecture diagram.
Correlation and Causation IDs, and a Worked Example
Every envelope carries three IDs: event_id (unique per event), correlation_id (shared across the business transaction, stamped by the first producer), and causation_id (the event_id that triggered this one). Handlers propagate correlation_id unchanged and set causation_id to the event they reacted to. These belong in the envelope, not the payload, so tracing survives schema evolution. With all three, a dead-letter investigation is one query.
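A minimal sketch of the envelope and the stamping rules above; the Envelope fields mirror the list in the text, and the emit helper is an illustrative name, not a real library call.

```python
import uuid
from dataclasses import dataclass

@dataclass(frozen=True)
class Envelope:
    event_id: str               # unique per event
    correlation_id: str         # shared across the business transaction
    causation_id: str | None    # event_id of the event that triggered this one
    event_type: str
    payload: dict

def emit(event_type: str, payload: dict, caused_by: Envelope | None = None) -> Envelope:
    return Envelope(
        event_id=str(uuid.uuid4()),
        # correlation_id is stamped once by the first producer and never changes.
        correlation_id=caused_by.correlation_id if caused_by else str(uuid.uuid4()),
        # causation_id always points at the event this one reacted to.
        causation_id=caused_by.event_id if caused_by else None,
        event_type=event_type,
        payload=payload,
    )

order_placed = emit("OrderPlaced", {"order_id": "o-1"})
stock_reserved = emit("StockReserved", {"order_id": "o-1"}, caused_by=order_placed)
assert stock_reserved.correlation_id == order_placed.correlation_id
assert stock_reserved.causation_id == order_placed.event_id
```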
Here is the flow I use as the concrete example in nearly every spec review. A customer submits an order. The order service writes the row and publishes OrderPlaced (fact) via an outbox. Three services subscribe independently: payments, inventory, and notifications.
Payments charges the card and publishes PaymentCharged or PaymentFailed. Inventory reserves stock and publishes StockReserved or StockUnavailable. Notifications sends a confirmation email after it sees both PaymentCharged and StockReserved with the same correlation_id.
If PaymentFailed arrives, inventory publishes StockReleased as a compensating action, and the order service transitions the order to payment_failed. No distributed transaction, no two-phase commit, just fact events and compensations. The spec names every topic, every handler, every compensation, and the timeout at which the order service marks the order stuck for manual review.
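A sketch of the notifications subscriber's join on correlation_id; the in-memory seen dict is an assumption for brevity, and a real handler would persist that state alongside its idempotency records.

```python
from collections import defaultdict

REQUIRED_FACTS = {"PaymentCharged", "StockReserved"}
seen: dict[str, set[str]] = defaultdict(set)

def handle(event_type: str, correlation_id: str, send_confirmation) -> None:
    """Send the confirmation email only once both facts have arrived for one order."""
    if event_type in ("PaymentFailed", "StockUnavailable"):
        seen.pop(correlation_id, None)   # flow failed: no email; compensations run elsewhere
        return
    if event_type in REQUIRED_FACTS:
        seen[correlation_id].add(event_type)
        if seen[correlation_id] == REQUIRED_FACTS:
            send_confirmation(correlation_id)
            del seen[correlation_id]
```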
Acceptance Criteria in Given/When/Then
I write acceptance criteria for each handler in Given/When/Then form so QA and SRE read the same source of truth.
- Given OrderPlaced with correlation_id C1 has been published
  When the inventory handler receives it for the first time
  Then StockReserved is published with causation_id = event_id of OrderPlaced
  And the inventory row for the SKU is decremented exactly once
- Given the inventory handler has already processed OrderPlaced with event_id E1
  When the same event is redelivered (at-least-once retry)
  Then the handler detects the duplicate via the processed_events table
  And no additional StockReserved is published
  And the inventory row is unchanged
- Given PaymentFailed arrives with correlation_id C1 after StockReserved C1
  When the inventory handler processes PaymentFailed
  Then StockReleased is published within 5 seconds
  And the SKU count returns to its pre-reservation value
The idempotency row is non-negotiable. At-least-once delivery is the default in every real broker I have used, and "we will make the handler idempotent later" is how duplicate charges reach production.
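A sketch of the idempotency check those criteria assume, again with sqlite3 standing in for the handler's database; the processed_events table, the outbox row, and the column names are illustrative.

```python
import json
import sqlite3
import uuid

def handle_order_placed(conn: sqlite3.Connection, event: dict) -> None:
    """Reserve stock exactly once per event_id, even under at-least-once redelivery."""
    with conn:  # one transaction: duplicate check, decrement, outbox row, and marker
        if conn.execute("SELECT 1 FROM processed_events WHERE event_id = ?",
                        (event["event_id"],)).fetchone():
            return  # duplicate delivery: no decrement, no second StockReserved
        conn.execute("UPDATE inventory SET available = available - ? WHERE sku = ?",
                     (event["payload"]["qty"], event["payload"]["sku"]))
        stock_reserved = {
            "event_id": str(uuid.uuid4()),
            "event_type": "StockReserved",
            "correlation_id": event["correlation_id"],  # propagated unchanged
            "causation_id": event["event_id"],          # points back at OrderPlaced
            "payload": {"sku": event["payload"]["sku"], "qty": event["payload"]["qty"]},
        }
        conn.execute("INSERT INTO outbox (event_id, body, published) VALUES (?, ?, 0)",
                     (stock_reserved["event_id"], json.dumps(stock_reserved)))
        conn.execute("INSERT INTO processed_events (event_id) VALUES (?)",
                     (event["event_id"],))
```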
Dead Letters, Poison Events, and the Replay Policy
A dead-letter queue is a policy question, not a feature. How many retries before a message goes to DLQ? What backoff? Who gets paged when DLQ depth crosses a threshold? What tool lets an operator inspect, fix, and replay a poisoned event?
My default: three retries with exponential backoff, then DLQ. DLQ depth greater than zero pages the owning team within fifteen minutes. Replay tooling requires a written reason, writes an audit log entry, and refuses to replay events older than the schema deprecation window. Without these rules written down, DLQs become graveyards.
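The default policy above as a sketch: three retries with exponential backoff, then dead-letter. The handler and publish_to_dlq callables are assumptions, not any specific broker's API.

```python
import time

MAX_RETRIES = 3
BASE_DELAY_SECONDS = 1.0

def process_with_retry(handler, event: dict, publish_to_dlq) -> None:
    for attempt in range(MAX_RETRIES + 1):   # first attempt plus three retries
        try:
            handler(event)
            return
        except Exception as exc:
            if attempt == MAX_RETRIES:
                # Retries exhausted: park the event with its error for operator inspection.
                publish_to_dlq({"event": event, "error": repr(exc), "attempts": attempt + 1})
                return
            time.sleep(BASE_DELAY_SECONDS * (2 ** attempt))   # 1s, 2s, 4s
```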
Event Sourcing Is Not Event-Driven, and the Spec Differs
Specs conflate these constantly. Event-driven means services communicate via events. Event sourcing means a service stores its state as a log of events and rebuilds it by replaying them. If a spec says "event sourcing" I expect answers on snapshotting cadence, event upcasters for old versions, and rebuild-from-zero time budget. If it says "event-driven" I expect the command/fact split, consistency boundary, and idempotency story. Mixing the vocabulary hides which concerns were actually thought through.
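For contrast, a minimal event-sourcing rebuild: state is a left-fold over the aggregate's own log, and a snapshot only shortens the replay. The event names and apply rules below are illustrative, and upcasting of old versions is deliberately not shown.

```python
def apply(state: dict, event: dict) -> dict:
    if event["event_type"] == "OrderPlaced":
        return {"status": "placed", "items": event["payload"]["items"]}
    if event["event_type"] == "PaymentCharged":
        return {**state, "status": "paid"}
    if event["event_type"] == "OrderShipped":
        return {**state, "status": "shipped"}
    return state   # older versions would go through an upcaster before reaching here

def rebuild(events: list[dict], snapshot: dict | None = None) -> dict:
    """Replay the log (from a snapshot if one exists) to reconstruct current state."""
    state = dict(snapshot) if snapshot else {}
    for event in events:
        state = apply(state, event)
    return state
```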
Observability Metrics That Prove the Flow Is Healthy
Three metrics I require in every event-driven spec: per-event-type consumer lag, per-handler success rate (excluding idempotent duplicates), and end-to-end business latency from first producer publish to last subscriber commit per correlation_id.
Broker throughput and CPU are fine for capacity planning but say nothing about whether the business flow works. End-to-end latency per correlation_id does. When it drifts, a real user is waiting longer than the spec promised, even if every service is green.
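A sketch of how the end-to-end business latency metric can be computed from trace records; the record shape (correlation_id, role, timestamp as a datetime) is an assumption for illustration, not a real observability API.

```python
def business_latency_seconds(records: list[dict]) -> dict[str, float]:
    """Time from the first producer publish to the last subscriber commit, per flow."""
    flows: dict[str, list[dict]] = {}
    for record in records:
        flows.setdefault(record["correlation_id"], []).append(record)
    latencies: dict[str, float] = {}
    for correlation_id, events in flows.items():
        publishes = [r["timestamp"] for r in events if r["role"] == "publish"]
        commits = [r["timestamp"] for r in events if r["role"] == "commit"]
        if not publishes or not commits:
            continue  # flow still in progress: no end-to-end number yet
        latencies[correlation_id] = (max(commits) - min(publishes)).total_seconds()
    return latencies
```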
Producer Readiness Gate
Before a new producer publishes to a shared topic, I want one page of evidence. The page is boring, which is the point. It keeps event-driven work from becoming a pile of undocumented side effects.
Producer readiness
- Topic:
- Event kind: command / fact
- Owner:
- Schema registry link:
- Breaking-change policy:
- Consumer list:
- Replay allowed: yes / no
- Replay owner:
- DLQ owner:
- Correlation fields:
- Idempotency key:
- Deprecation date for old schema:
- Dashboards: consumer lag, handler success rate, business latency by correlation_id
If the producer cannot name its consumers, it is not ready. If replay ownership is vague, it is not ready. If the old schema has no retirement date, the new schema is probably permanent debt. A spec should make those facts visible before the topic exists.
I also ask for one failure trace before launch. Pick a real event, force one consumer to fail, and show the correlation_id through publish, retry, DLQ, operator inspection, and replay. If that trace cannot be followed in logs, the event design is not observable enough for production.