Data Sync Spec Between Services
Every cross-service sync spec I've seen fail had the same shape: it documented the happy path and called itself done. The failures come from the parts the author didn't think to write down — ordering, late arrivals, partial failures, and what happens on day two when you need to backfill.
Review Note
Reviewed May 3, 2026. This article is maintained as a focused companion to the API Contracts Hub. It has been expanded with review drills, acceptance criteria, and operator evidence for teams designing cross-service sync.
The first decision: push, pull, or log
The spec has to pick a sync model before anything else, because almost every other answer depends on it. I force authors to commit to one of three:
- Event push. Source emits events; target subscribes. Good when the target needs freshness and the source doesn't want to know who's listening.
- Target pull. Target polls an endpoint or queries a change feed. Good when the target controls its own load and can tolerate staleness.
- Shared log. Source writes to Kafka/Kinesis/etc., multiple targets replay independently. Good when you have three or more consumers or need replay.
Mixing these accidentally is where grief starts. "We emit events AND expose a GET endpoint AND have a CDC stream" often means three partial implementations that agree on nothing. Pick one primary path in the spec and label any secondary path as fallback-only.
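One way to make the decision reviewable is to state it as data in the spec instead of prose. A minimal sketch, with entirely hypothetical field names, of declaring one primary path and labeling the secondary path as fallback-only:

```python
# Hypothetical spec fragment: one primary sync path, and any secondary
# path explicitly labeled fallback-only so reviewers see a decision,
# not three partial implementations.
SYNC_SPEC = {
    "primary_path": "event_push",   # one of: event_push, target_pull, shared_log
    "fallback": {
        "path": "target_pull",      # poll the source API
        "use_only_when": "event bus outage exceeds 5 minutes",
    },
}
```

A reviewer can now reject a design that quietly adds a third path, because the spec names exactly one primary.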
Ordering: the part that everyone skips
"Events arrive in order" is a promise almost nobody actually keeps. The spec needs to be specific about the ordering guarantee:
- Per-key ordering. All events for the same entity ID arrive in commit order. This is what you usually want and what partitioned logs give you.
- Global ordering. Very expensive. Only promise it if you truly need it.
- No ordering. Targets must be able to reconcile out-of-order events, usually via a monotonic `version` or `updated_at`.
Whatever the spec picks, it must also say what the target does when it receives an event with a version older than what it has. My default answer: log and drop. The spec should make that explicit so reviewers don't assume it "overwrites with the latest" (a bug factory).
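The log-and-drop rule is small enough to sketch. This is an illustrative version, assuming an in-memory `store` keyed by entity ID and an integer `version` field; the names are mine, not a prescribed schema:

```python
import logging

log = logging.getLogger("sync")

# "Log and drop" for out-of-order events: never overwrite newer state
# with an older event. `store` maps entity_id -> {"version": int, ...}.
def apply_event(store: dict, event: dict) -> bool:
    current = store.get(event["entity_id"])
    if current is not None and event["version"] <= current["version"]:
        # Stale event: record it for telemetry, then discard.
        log.info("dropping stale event for %s: v%d <= v%d",
                 event["entity_id"], event["version"], current["version"])
        return False
    store[event["entity_id"]] = {"version": event["version"], **event["data"]}
    return True
```

The return value matters: a handler that silently "overwrites with the latest" would return `True` on the stale path, and that difference is exactly the bug factory the spec should rule out.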
Event payload: thin vs. fat
Two valid patterns. The spec needs to pick one and say why:
- Thin events. Event says "entity X changed," target fetches via API. Cheaper queue, always-fresh reads, but requires the source API to be available for the sync to work.
- Fat events. Event includes the full entity snapshot. Target can apply it directly. Sync survives source downtime, but you now ship PII through the event bus and must version the payload.
I lean toward fat events for anything with strict latency SLAs or where the source is a legacy system that falls over. I lean toward thin events when PII or data minimization matters. Whichever you pick, write the rationale in the spec — the next person to change this will appreciate knowing why.
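The difference is easiest to see side by side. Illustrative payloads only; the field names are assumptions, not a schema the spec has to use:

```python
# Thin event: announces a change, carries no state. The target must call
# back to the source API, so sync availability depends on that API.
THIN_EVENT = {
    "type": "customer.changed",
    "entity_id": "cus_123",
    "version": 7,
}

# Fat event: carries the full snapshot, so sync survives source downtime,
# but the payload must now be versioned and PII transits the event bus.
FAT_EVENT = {
    "type": "customer.changed",
    "entity_id": "cus_123",
    "version": 7,
    "schema_version": 2,
    "data": {
        "email": "a@example.com",   # PII now lives in the bus and its retention
        "status": "active",
    },
}
```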
Conflict resolution when both sides can write
If the target is read-only, skip this section. If both sides can write the same field, the spec has to answer: who wins?
- Last-writer-wins with a timestamp. Simple, but breaks on clock skew.
- Source of truth. One side is authoritative; the other is a cache that gets overwritten. Write it down explicitly.
- CRDTs or merge functions. If you actually need this, you already know. Otherwise don't.
- Reject and escalate. Conflicts surface to an operator queue. Slow but safe for money or compliance data.
The test for the section: ask the reviewer to describe what happens when Alice updates the customer's email in Service A at 10:00:00 and Bob updates it in Service B at 10:00:01 and the events cross in flight. If you can't answer from the spec, it isn't done.
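The Alice-and-Bob scenario under last-writer-wins can be sketched in a few lines. This is a deliberately naive resolver; the timestamps come from the writers' own clocks, which is exactly why LWW breaks under clock skew:

```python
from datetime import datetime

# Naive last-writer-wins: higher timestamp wins. If Service B's clock
# runs behind Service A's, Bob's genuinely later write can lose.
def resolve_lww(a: dict, b: dict) -> dict:
    return a if a["updated_at"] >= b["updated_at"] else b

alice = {"email": "alice@a.example", "updated_at": datetime(2026, 5, 3, 10, 0, 0)}
bob   = {"email": "bob@b.example",   "updated_at": datetime(2026, 5, 3, 10, 0, 1)}
winner = resolve_lww(alice, bob)
```

With honest clocks, Bob wins by one second. Skew Service B's clock back two seconds and Alice wins instead, with no error anywhere — which is why the spec has to name the strategy and its failure mode.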
The backfill plan is part of the contract
Day one goes live. Day seven, someone notices that 50,000 records from before the sync started are missing in the target. The spec should have already answered: how do we catch up?
- Does the source expose a historical change feed? From when?
- Is there a bulk export endpoint with pagination and cursor semantics?
- How does the target distinguish a backfill event from a live event? (Usually an `origin: backfill` flag so telemetry doesn't fire double-write alerts.)
- What's the rate limit for backfill so it doesn't starve live traffic?
- How do we verify the backfill succeeded — row count, checksum sample, full reconciliation?
Reconciliation: the unsexy critical section
Every long-running sync drifts. Always. The spec must define a reconciliation job:
- Frequency. Nightly? Hourly? On demand?
- Scope. Full table comparison, or a sample?
- Action on mismatch. Auto-repair by re-syncing the row, or alert and require human decision?
- Success metric. "Less than 0.01% of rows differ" — pick an actual number.
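Those four answers fit in one small job. A sketch, assuming both sides can be read as primary-key-to-row mappings; the 0.01% threshold is the article's example number, not a universal constant:

```python
# Reconciliation pass: compare source and target rows, report the drift
# rate, and enforce the spec's mismatch threshold.
def reconcile(source_rows: dict, target_rows: dict, threshold: float = 0.0001):
    keys = set(source_rows) | set(target_rows)
    diffs = [k for k in keys if source_rows.get(k) != target_rows.get(k)]
    rate = len(diffs) / len(keys) if keys else 0.0
    return {
        "diff_keys": diffs,       # feed these to auto-repair or an operator queue
        "diff_rate": rate,
        "within_slo": rate < threshold,   # the "pick an actual number" number
    }
```

Whether `diff_keys` triggers auto-repair or a human alert is the "action on mismatch" decision the spec still has to make explicitly.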
Acceptance criteria that catch real failures
- Given a source emits events for entity X, When the target is offline for 30 minutes, Then on recovery the target catches up within 5 minutes, And no events in that window are permanently lost.
- Given two events for the same entity arrive out of order, When the older event is processed after the newer one, Then the target state reflects the newer event, And the older event is logged as stale.
- Given the sync has been running for 24 hours, When the reconciliation job runs, Then fewer than 0.01% of rows show a diff, And all diffs are auto-repaired or surfaced as alerts.
The signal I look for in review
The quality signal I use: does the spec describe what the operator sees on day 30 when something is off? If it only describes the happy path on day one, it isn't a sync spec — it's a handoff note that will turn into a 3am page.
Review drill
Review a sync spec by following one record from the source service to every consumer. The weak spots are usually ownership, retries, and what happens when two systems temporarily disagree.
- Source of truth: name the system that wins when values conflict, including any fields that are intentionally derived or cached.
- Recovery: define retry limits, dead-letter handling, reconciliation jobs, and the alert that proves the sync is stuck.
- Consumer impact: show what each downstream service sees during delay, duplicate delivery, deletion, and schema change.
Put the sync contract, replay procedure, and reconciliation owner in the spec. Without those, the first incident becomes the real design document.
Example: for an account email update, the spec should say whether CRM, billing, or identity owns the value, how duplicates are ignored, and which reconciliation job fixes a missed event.
Worked Review Example
For a customer status sync, write the whole lifecycle. The billing system emits status.changed with an idempotency key, CRM stores the latest sequence number, and support tools show a stale-data badge until reconciliation completes. Deletes need the same treatment: soft delete, tombstone event, retention window, and the job that removes orphaned records after consumers acknowledge the change.
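The CRM side of that lifecycle can be sketched as a small state holder. Names and shapes are illustrative; the two checks are the ones the worked example calls for, duplicates by idempotency key and staleness by sequence number:

```python
# Sketch of the CRM consumer from the worked example: drop duplicates by
# idempotency key, drop stale events by sequence number, apply the rest.
class CrmSyncState:
    def __init__(self):
        self.records = {}        # account_id -> {"seq": int, "status": str}
        self.seen_keys = set()   # processed idempotency keys

    def apply(self, event: dict) -> str:
        if event["idempotency_key"] in self.seen_keys:
            return "duplicate"                      # redelivery; ignore
        self.seen_keys.add(event["idempotency_key"])
        current = self.records.get(event["account_id"])
        if current is not None and event["seq"] <= current["seq"]:
            return "stale"                          # log-and-drop territory
        self.records[event["account_id"]] = {
            "seq": event["seq"],
            "status": event["status"],
        }
        return "applied"
```

The string return value is the telemetry hook: "duplicate" and "stale" counts are exactly what the reconciliation job and the stale-data badge need to observe.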
Copy This Sync Contract Block
When a spec feels abstract, I ask the author to fill in this block. It forces the decisions that usually stay hidden until the first reconciliation failure.
Sync contract

Source of truth:
- System:
- Fields owned:
- Fields derived by consumers:

Delivery model:
- Primary path: event push / target pull / shared log
- Ordering guarantee:
- Duplicate handling:
- Stale event behavior:

Recovery:
- Backfill entrypoint:
- Replay owner:
- Reconciliation frequency:
- Mismatch threshold:
- Operator alert:

Consumer impact:
- What users see during delay:
- What support sees during drift:
- What pauses the rollout:
The block is short enough to fit in a pull request description, but it changes the review. Instead of debating architecture labels, reviewers can point at a blank line and ask for the missing behavior.
One practical review move: run the block against a deleted record, not only an updated record. Deletes expose the weakest assumptions in sync systems. If the source emits a tombstone, the target needs to know whether to hide the record, mark it archived, keep it for compliance, or purge it after a retention window. If the spec cannot answer that, the sync is not finished.
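The four delete outcomes named above can be written as one explicit function, which is a useful way to force the spec to pick. A sketch under an assumed policy vocabulary; the policy names and retention default are illustrative:

```python
from datetime import datetime, timedelta

# Tombstone handling for a deleted source record. The target's behavior
# depends entirely on which policy the spec picked; "unknown" must fail
# loudly rather than guess.
def handle_tombstone(record: dict, policy: str, now: datetime,
                     retention_days: int = 30):
    if policy == "purge":
        return None                                   # remove immediately
    if policy == "archive":
        return {**record, "status": "archived"}       # keep, mark archived
    if policy == "retain":
        return {**record, "hidden": True,             # hide now, purge later
                "purge_after": now + timedelta(days=retention_days)}
    raise ValueError(f"unknown tombstone policy: {policy}")
```

Running the review drill means asking which branch this sync takes; if the spec cannot name one, the delete path is unfinished.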
Editorial Note
- Author details: Spec Coding Editorial Team