Data Sync Spec Between Services
Every cross-service sync spec I've seen fail had the same shape: it documented the happy path and called itself done. The failures come from the parts the author didn't think to write down — ordering, late arrivals, partial failures, and what happens on day two when you need to backfill.
Review Note
Reviewed May 3, 2026. This article is maintained as a focused companion to the API Contracts Hub. It has been expanded with review drills, acceptance criteria, and operator evidence for teams designing cross-service sync.
The first decision: push, pull, or log
The spec has to pick a sync model before anything else, because almost every other answer depends on it. I force authors to commit to one of three:
- Event push. Source emits events; target subscribes. Good when the target needs freshness and the source doesn't want to know who's listening.
- Target pull. Target polls an endpoint or queries a change feed. Good when the target controls its own load and can tolerate staleness.
- Shared log. Source writes to Kafka/Kinesis/etc., multiple targets replay independently. Good when you have three or more consumers or need replay.
Mixing these accidentally is where grief starts. "We emit events AND expose a GET endpoint AND have a CDC stream" often means three partial implementations that agree on nothing. Pick one primary path in the spec and label any secondary path as fallback-only.
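One way to make the decision reviewable is to state it as data in the spec instead of prose. A minimal sketch, with entirely hypothetical field names, of declaring one primary path and labeling the secondary path as fallback-only:

```python
# Hypothetical spec fragment: one primary sync path, and any secondary
# path explicitly labeled fallback-only so reviewers see a decision,
# not three partial implementations.
SYNC_SPEC = {
    "primary_path": "event_push",   # one of: event_push, target_pull, shared_log
    "fallback": {
        "path": "target_pull",      # poll the source API
        "use_only_when": "event bus outage exceeds 5 minutes",
    },
}
```

A reviewer can now reject a design that quietly adds a third path, because the spec names exactly one primary.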
Ordering: the part that everyone skips
"Events arrive in order" is a promise almost nobody actually keeps. The spec needs to be specific about the ordering guarantee:
- Per-key ordering. All events for the same entity ID arrive in commit order. This is what you usually want and what partitioned logs give you.
- Global ordering. Very expensive. Only promise it if you truly need it.
- No ordering. Targets must be able to reconcile out-of-order events, usually via a monotonic `version` or `updated_at`.
Whatever the spec picks, it must also say what the target does when it receives an event with a version older than what it has. My default answer: log and drop. The spec should make that explicit so reviewers don't assume it "overwrites with the latest" (a bug factory).
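The log-and-drop rule is small enough to sketch. This is an illustrative version, assuming an in-memory `store` keyed by entity ID and an integer `version` field; the names are mine, not a prescribed schema:

```python
import logging

log = logging.getLogger("sync")

# "Log and drop" for out-of-order events: never overwrite newer state
# with an older event. `store` maps entity_id -> {"version": int, ...}.
def apply_event(store: dict, event: dict) -> bool:
    current = store.get(event["entity_id"])
    if current is not None and event["version"] <= current["version"]:
        # Stale event: record it for telemetry, then discard.
        log.info("dropping stale event for %s: v%d <= v%d",
                 event["entity_id"], event["version"], current["version"])
        return False
    store[event["entity_id"]] = {"version": event["version"], **event["data"]}
    return True
```

The return value matters: a handler that silently "overwrites with the latest" would return `True` on the stale path, and that difference is exactly the bug factory the spec should rule out.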
Event payload: thin vs. fat
Two valid patterns. The spec needs to pick one and say why:
- Thin events. Event says "entity X changed," target fetches via API. Cheaper queue, always-fresh reads, but requires the source API to be available for the sync to work.
- Fat events. Event includes the full entity snapshot. Target can apply it directly. Sync survives source downtime, but you now ship PII through the event bus and must version the payload.
I lean toward fat events for anything with strict latency SLAs or where the source is a legacy system that falls over. I lean toward thin events when PII or data minimization matters. Whichever you pick, write the rationale in the spec — the next person to change this will appreciate knowing why.
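The difference is easiest to see side by side. Illustrative payloads only; the field names are assumptions, not a schema the spec has to use:

```python
# Thin event: announces a change, carries no state. The target must call
# back to the source API, so sync availability depends on that API.
THIN_EVENT = {
    "type": "customer.changed",
    "entity_id": "cus_123",
    "version": 7,
}

# Fat event: carries the full snapshot, so sync survives source downtime,
# but the payload must now be versioned and PII transits the event bus.
FAT_EVENT = {
    "type": "customer.changed",
    "entity_id": "cus_123",
    "version": 7,
    "schema_version": 2,
    "data": {
        "email": "a@example.com",   # PII now lives in the bus and its retention
        "status": "active",
    },
}
```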
Conflict resolution when both sides can write
If the target is read-only, skip this section. If both sides can write the same field, the spec has to answer: who wins?
- Last-writer-wins with a timestamp. Simple, but breaks on clock skew.
- Source of truth. One side is authoritative; the other is a cache that gets overwritten. Write it down explicitly.
- CRDTs or merge functions. If you actually need this, you already know. Otherwise don't.
- Reject and escalate. Conflicts surface to an operator queue. Slow but safe for money or compliance data.
The test for the section: ask the reviewer to describe what happens when Alice updates the customer's email in Service A at 10:00:00 and Bob updates it in Service B at 10:00:01 and the events cross in flight. If you can't answer from the spec, it isn't done.
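The Alice-and-Bob scenario under last-writer-wins can be sketched in a few lines. This is a deliberately naive resolver; the timestamps come from the writers' own clocks, which is exactly why LWW breaks under clock skew:

```python
from datetime import datetime

# Naive last-writer-wins: higher timestamp wins. If Service B's clock
# runs behind Service A's, Bob's genuinely later write can lose.
def resolve_lww(a: dict, b: dict) -> dict:
    return a if a["updated_at"] >= b["updated_at"] else b

alice = {"email": "alice@a.example", "updated_at": datetime(2026, 5, 3, 10, 0, 0)}
bob   = {"email": "bob@b.example",   "updated_at": datetime(2026, 5, 3, 10, 0, 1)}
winner = resolve_lww(alice, bob)
```

With honest clocks, Bob wins by one second. Skew Service B's clock back two seconds and Alice wins instead, with no error anywhere — which is why the spec has to name the strategy and its failure mode.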
The backfill plan is part of the contract
Day one goes live. Day seven, someone notices that 50,000 records from before the sync started are missing in the target. The spec should have already answered: how do we catch up?
- Does the source expose a historical change feed? From when?
- Is there a bulk export endpoint with pagination and cursor semantics?
- How does the target distinguish a backfill event from a live event? (Usually an `origin: backfill` flag so telemetry doesn't fire double-write alerts.)
- What's the rate limit for backfill so it doesn't starve live traffic?
- How do we verify the backfill succeeded — row count, checksum sample, full reconciliation?
Reconciliation: the unsexy critical section
Every long-running sync drifts. Always. The spec must define a reconciliation job:
- Frequency. Nightly? Hourly? On demand?
- Scope. Full table comparison, or a sample?
- Action on mismatch. Auto-repair by re-syncing the row, or alert and require human decision?
- Success metric. "Less than 0.01% of rows differ" — pick an actual number.
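Those four answers fit in one small job. A sketch, assuming both sides can be read as primary-key-to-row mappings; the 0.01% threshold is the article's example number, not a universal constant:

```python
# Reconciliation pass: compare source and target rows, report the drift
# rate, and enforce the spec's mismatch threshold.
def reconcile(source_rows: dict, target_rows: dict, threshold: float = 0.0001):
    keys = set(source_rows) | set(target_rows)
    diffs = [k for k in keys if source_rows.get(k) != target_rows.get(k)]
    rate = len(diffs) / len(keys) if keys else 0.0
    return {
        "diff_keys": diffs,       # feed these to auto-repair or an operator queue
        "diff_rate": rate,
        "within_slo": rate < threshold,   # the "pick an actual number" number
    }
```

Whether `diff_keys` triggers auto-repair or a human alert is the "action on mismatch" decision the spec still has to make explicitly.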
Acceptance criteria that catch real failures
- Given a source emits events for entity X, When the target is offline for 30 minutes, Then on recovery the target catches up within 5 minutes, And no events in that window are permanently lost.
- Given two events for the same entity arrive out of order, When the older event is processed after the newer one, Then the target state reflects the newer event, And the older event is logged as stale.
- Given the sync has been running for 24 hours, When the reconciliation job runs, Then fewer than 0.01% of rows show a diff, And all diffs are auto-repaired or surfaced as alerts.
The signal I look for in review
The quality signal I use: does the spec describe what the operator sees on day 30 when something is off? If it only describes the happy path on day one, it isn't a sync spec — it's a handoff note that will turn into a 3am page.
Review drill
Review a sync spec by following one record from the source service to every consumer. The weak spots are usually ownership, retries, and what happens when two systems temporarily disagree.
- Source of truth: name the system that wins when values conflict, including any fields that are intentionally derived or cached.
- Recovery: define retry limits, dead-letter handling, reconciliation jobs, and the alert that proves the sync is stuck.
- Consumer impact: show what each downstream service sees during delay, duplicate delivery, deletion, and schema change.
Put the sync contract, replay procedure, and reconciliation owner in the spec. Without those, the first incident becomes the real design document.
Example: for an account email update, the spec should say whether CRM, billing, or identity owns the value, how duplicates are ignored, and which reconciliation job fixes a missed event.
Worked Review Example
For a customer status sync, write the whole lifecycle. The billing system emits status.changed with an idempotency key, CRM stores the latest sequence number, and support tools show a stale-data badge until reconciliation completes. Deletes need the same treatment: soft delete, tombstone event, retention window, and the job that removes orphaned records after consumers acknowledge the change.
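The CRM side of that lifecycle can be sketched as a small state holder. Names and shapes are illustrative; the two checks are the ones the worked example calls for, duplicates by idempotency key and staleness by sequence number:

```python
# Sketch of the CRM consumer from the worked example: drop duplicates by
# idempotency key, drop stale events by sequence number, apply the rest.
class CrmSyncState:
    def __init__(self):
        self.records = {}        # account_id -> {"seq": int, "status": str}
        self.seen_keys = set()   # processed idempotency keys

    def apply(self, event: dict) -> str:
        if event["idempotency_key"] in self.seen_keys:
            return "duplicate"                      # redelivery; ignore
        self.seen_keys.add(event["idempotency_key"])
        current = self.records.get(event["account_id"])
        if current is not None and event["seq"] <= current["seq"]:
            return "stale"                          # log-and-drop territory
        self.records[event["account_id"]] = {
            "seq": event["seq"],
            "status": event["status"],
        }
        return "applied"
```

The string return value is the telemetry hook: "duplicate" and "stale" counts are exactly what the reconciliation job and the stale-data badge need to observe.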
Copy This Sync Contract Block
When a spec feels abstract, I ask the author to fill in this block. It forces the decisions that usually stay hidden until the first reconciliation failure.
Sync contract

Source of truth:
- System:
- Fields owned:
- Fields derived by consumers:

Delivery model:
- Primary path: event push / target pull / shared log
- Ordering guarantee:
- Duplicate handling:
- Stale event behavior:

Recovery:
- Backfill entrypoint:
- Replay owner:
- Reconciliation frequency:
- Mismatch threshold:
- Operator alert:

Consumer impact:
- What users see during delay:
- What support sees during drift:
- What pauses the rollout:
The block is short enough to fit in a pull request description, but it changes the review. Instead of debating architecture labels, reviewers can point at a blank line and ask for the missing behavior.
One practical review move: run the block against a deleted record, not only an updated record. Deletes expose the weakest assumptions in sync systems. If the source emits a tombstone, the target needs to know whether to hide the record, mark it archived, keep it for compliance, or purge it after a retention window. If the spec cannot answer that, the sync is not finished.
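The four delete outcomes named above can be written as one explicit function, which is a useful way to force the spec to pick. A sketch under an assumed policy vocabulary; the policy names and retention default are illustrative:

```python
from datetime import datetime, timedelta

# Tombstone handling for a deleted source record. The target's behavior
# depends entirely on which policy the spec picked; "unknown" must fail
# loudly rather than guess.
def handle_tombstone(record: dict, policy: str, now: datetime,
                     retention_days: int = 30):
    if policy == "purge":
        return None                                   # remove immediately
    if policy == "archive":
        return {**record, "status": "archived"}       # keep, mark archived
    if policy == "retain":
        return {**record, "hidden": True,             # hide now, purge later
                "purge_after": now + timedelta(days=retention_days)}
    raise ValueError(f"unknown tombstone policy: {policy}")
```

Running the review drill means asking which branch this sync takes; if the spec cannot name one, the delete path is unfinished.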
Editorial Note
- Author details: Spec Coding Editorial Team