Real-Time Collaboration Spec: Conflict Resolution

Every real-time collaboration spec I have reviewed gets the demo right and the edge cases wrong. Two users typing in the same doc looks magical on a laptop plugged into gigabit ethernet. The spec earns its pay on a flight with flaky wifi, a revoked permission mid-edit, and an undo key pressed by the wrong person at the wrong time.

Published on 2026-03-03 · Updated 2026-06-02 · 7 min read · Author: Spec Coding Editorial Team · Review policy: Editorial Policy

Pick Your Conflict Model Before You Pick Your Database

The decision that shapes the rest of the spec is how concurrent edits converge. Pick one of four and write the tradeoff down:

Operational Transform (OT). Server-authoritative, transforms each op against concurrent ops. Smaller wire payloads, but transform functions are where bugs hide. Google Docs still runs on OT because it was first and rewriting a working engine is a career-ending move.
CRDT (Yjs, Automerge, Loro). Peer-friendly, convergence proven by math. You pay for it with metadata overhead (tombstones, vector clocks) and a less intuitive cursor model.
Server-side locks at paragraph or cell granularity. Trivially correct, feels terrible for prose, fine for spreadsheets. Figma gets by with something almost as simple (server-ordered last-writer-wins per object property rather than locks) because users rarely fight over the same rectangle.
Three-way merge on save. Git-style. Honest for long-form writing with few collaborators, dishonest if you marketed "real-time."

The unpopular truth: for a new product with a small team, CRDT via an existing library beats hand-rolled OT every time. The OT debugging tax is real and you do not have the years Google had to pay it down.

Presence Is a Different System, Spec It Separately

Presence (who is online, cursor, selection, avatar color) looks like document state but behaves nothing like it. Ephemeral, lossy, high-frequency, privacy-sensitive. Keep it on its own channel:

Never persisted. A presence record older than 30 seconds is garbage.
Throttled at the client. Cursor updates are capped at 20 per second per client, then coalesced server-side to 10 per second before rebroadcast.
Stripped of PII in logs. A user id is fine; a name plus selection text is a leak waiting to happen.
Degrades independently. If presence fails the document channel keeps editing, and vice versa.
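
The throttle and expiry rules above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `CursorThrottle` and `PresenceStore` are hypothetical names, time is passed in explicitly so the behavior is testable, and the server-side coalescing step is left out.

```python
from dataclasses import dataclass, field

PRESENCE_TTL_S = 30.0            # spec: a record older than 30 seconds is garbage
CLIENT_MIN_INTERVAL_S = 1 / 20   # spec: at most 20 cursor updates/sec per client


@dataclass
class CursorThrottle:
    """Client-side limiter (hypothetical): drops updates that come too soon."""
    min_interval: float = CLIENT_MIN_INTERVAL_S
    _last_sent: float = float("-inf")

    def should_send(self, now: float) -> bool:
        if now - self._last_sent >= self.min_interval:
            self._last_sent = now
            return True
        return False  # dropped on purpose; presence is lossy by design


@dataclass
class PresenceStore:
    """In-memory only (hypothetical): nothing here is ever persisted."""
    ttl: float = PRESENCE_TTL_S
    _records: dict = field(default_factory=dict)  # user_id -> (cursor, seen_at)

    def update(self, user_id: str, cursor: int, now: float) -> None:
        self._records[user_id] = (cursor, now)  # last write wins

    def live(self, now: float) -> dict:
        # Expire stale records lazily on read.
        self._records = {u: (c, t) for u, (c, t) in self._records.items()
                         if now - t <= self.ttl}
        return {u: c for u, (c, _) in self._records.items()}
```

Note that only a user id and a cursor offset are stored, which keeps names and selection text out of anything that might end up in a log.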

Snapshots, Op Logs, and Compaction

The canonical form is the snapshot plus the ops since it. Write the schedule into the spec, not a wiki, because ops teams need it under pressure:

Snapshot every 500 ops or every 10 minutes of sustained editing, whichever comes first.
Retain the last 30 days of ops uncompacted for audit and undo depth. Older ops compact into the next snapshot.
A document idle for 24 hours gets snapshotted once and its op log compacted to zero. This alone cut our hot storage by 40 percent.
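
Written as code, the schedule is a small pure function that a background job could call per document. The sketch below assumes the three counters are tracked somewhere upstream; `DocActivity` and `snapshot_action` are illustrative names, and the 30-day retention rule is deliberately not modeled here.

```python
from dataclasses import dataclass

SNAPSHOT_EVERY_OPS = 500
SNAPSHOT_EVERY_SECONDS = 10 * 60
IDLE_COMPACT_SECONDS = 24 * 60 * 60


@dataclass
class DocActivity:
    ops_since_snapshot: int
    seconds_editing_since_snapshot: float  # sustained editing time
    seconds_since_last_op: float


def snapshot_action(doc: DocActivity) -> str:
    """Return 'idle_compact', 'snapshot', or 'none' for one document."""
    if doc.seconds_since_last_op >= IDLE_COMPACT_SECONDS:
        return "idle_compact"  # snapshot once, then compact the op log
    if (doc.ops_since_snapshot >= SNAPSHOT_EVERY_OPS
            or doc.seconds_editing_since_snapshot >= SNAPSHOT_EVERY_SECONDS):
        return "snapshot"      # whichever threshold is hit first
    return "none"
```

The function only decides; the caller is assumed to record that an idle document has already been compacted so the job does not repeat the work on every pass.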

Offline: Minutes, Hours, and Days Are Three Different Problems

"The client went offline" is not one requirement. It is three, and the spec should answer each:

Minutes. Buffer ops in memory, reconnect, replay against the server's current version vector. The server accepts or transforms. This is the easy case and the only one most demos cover.
Hours. Persist the op buffer to IndexedDB. On reconnect, fetch the new server snapshot, rebase the local ops, and show the user which of their edits survived the rebase and which were rejected because their target no longer exists.
Days. The local base snapshot is older than the server's oldest retained op. You cannot rebase; you can only three-way merge. Show the user a diff view and make them the merge author. Do not silently drop their work and do not silently clobber the server.
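
One way to make the three cases testable is to turn the choice into a function of commit ids rather than wall-clock time. The sketch below is an assumption about how that could look: it treats "the server snapshot has advanced past the client's base" as the boundary between replay and rebase, and all names are hypothetical.

```python
from enum import Enum


class ReconnectStrategy(Enum):
    REPLAY = "replay"                    # the "minutes" case
    REBASE = "rebase"                    # the "hours" case
    THREE_WAY_MERGE = "three_way_merge"  # the "days" case


def choose_strategy(client_base_commit: int,
                    server_snapshot_commit: int,
                    oldest_retained_commit: int) -> ReconnectStrategy:
    """Pick a reconnect path from what the server can still reconstruct."""
    if client_base_commit < oldest_retained_commit:
        # The ops between the client's base and now are gone; no rebase possible.
        return ReconnectStrategy.THREE_WAY_MERGE
    if client_base_commit < server_snapshot_commit:
        # The server has snapshotted past the client's base: fetch and rebase.
        return ReconnectStrategy.REBASE
    return ReconnectStrategy.REPLAY
```

Deciding on commit ids means a client that was offline for two days on an untouched document still takes the cheap replay path.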

The Concrete Paragraph Example

Alice and Bob both have "The quick brown fox jumps over the lazy dog" open. Alice's cursor is after "brown" and she types " and fast". Bob selects "lazy" and replaces it with "sleeping". Both ops hit the server within 40ms of each other.

Under OT, the server orders by receipt; say Alice's op lands first. The server transforms Bob's op against Alice's insertion, shifting his replacement range right by 9 characters (the length of " and fast"), and both clients converge to "The quick brown and fast fox jumps over the sleeping dog". Under CRDT, each character has a stable id: the insertion anchors after the 'n' in "brown", the replacement targets the specific characters of "lazy", and convergence is automatic. Under locking, whoever grabbed the lock wins. Write this exact example into the spec so reviewers argue about behavior, not diagrams.
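
The OT path of this example is small enough to run. The sketch below is a toy, assuming Alice's op is received first; `Insert`, `Replace`, and the two transform functions are illustrative and only handle the non-overlapping cases this example needs.

```python
from dataclasses import dataclass

DOC = "The quick brown fox jumps over the lazy dog"


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Replace:
    start: int  # inclusive
    end: int    # exclusive
    text: str


def apply_op(doc: str, op) -> str:
    if isinstance(op, Insert):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.start] + op.text + doc[op.end:]


def transform_replace(rep: Replace, ins: Insert) -> Replace:
    """Rewrite rep so it applies after ins has already been applied."""
    if ins.pos <= rep.start:
        shift = len(ins.text)
        return Replace(rep.start + shift, rep.end + shift, rep.text)
    if ins.pos >= rep.end:
        return rep
    raise NotImplementedError("insert inside the replaced range needs a policy")


def transform_insert(ins: Insert, rep: Replace) -> Insert:
    """Rewrite ins so it applies after rep has already been applied."""
    if ins.pos <= rep.start:
        return ins
    if ins.pos >= rep.end:
        delta = len(rep.text) - (rep.end - rep.start)
        return Insert(ins.pos + delta, ins.text)
    raise NotImplementedError("insert inside the replaced range needs a policy")


alice = Insert(DOC.index("brown") + len("brown"), " and fast")
lazy_at = DOC.index("lazy")
bob = Replace(lazy_at, lazy_at + len("lazy"), "sleeping")

# Server: Alice's op arrived first, so Bob's op is transformed against it.
server_doc = apply_op(apply_op(DOC, alice), transform_replace(bob, alice))
# Bob's client: his own op is already applied, Alice's arrives transformed.
bob_doc = apply_op(apply_op(DOC, bob), transform_insert(alice, bob))

print(server_doc)  # The quick brown and fast fox jumps over the sleeping dog
```

The two `NotImplementedError` branches are the interesting part for a spec review: an insert landing inside a range someone else is replacing is exactly the case the spec has to decide.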

Undo Is the Hardest UX Decision in the Product

I have never seen a team get undo right on the first try. The question is whose stack you pop.

Local undo. Ctrl+Z only undoes your own ops. Matches user intuition. Requires that each op be individually invertible against the current document, which is hard when ops were already transformed.
Global undo. Undoes the last op on the document regardless of author. Easy to implement. Destroys trust the first time Alice undoes Bob's paragraph.
Session-scoped undo with ownership. What Google Docs does. Local by default, but ops built on a now-undone op must be rebased or dropped. Spec the drop policy explicitly.

Pick local undo. Write down what happens when a local undo targets content another user has since modified. That paragraph is what reviewers should argue about.
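
To make that paragraph concrete, here is a minimal sketch of local undo over whole-paragraph edits. The conflict policy shown (refuse the undo and report a conflict when someone else has changed the paragraph since) is one assumed option among several, and `Doc` and `SetText` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class SetText:
    author: str
    para_id: str
    before: str
    after: str


@dataclass
class Doc:
    paras: dict = field(default_factory=dict)        # para_id -> text
    undo_stacks: dict = field(default_factory=dict)  # author -> list of SetText

    def edit(self, author: str, para_id: str, text: str) -> None:
        op = SetText(author, para_id, self.paras.get(para_id, ""), text)
        self.paras[para_id] = text
        self.undo_stacks.setdefault(author, []).append(op)

    def undo(self, author: str) -> str:
        """Pop only the caller's own stack; never touch another user's ops."""
        stack = self.undo_stacks.get(author, [])
        if not stack:
            return "nothing_to_undo"
        op = stack[-1]
        if self.paras.get(op.para_id) != op.after:
            # Someone else modified the target since; assumed policy: refuse.
            return "conflict"
        stack.pop()
        self.paras[op.para_id] = op.before
        return "undone"
```

With this policy, Alice pressing Ctrl+Z after Bob rewrote her paragraph gets a visible conflict instead of silently reverting Bob's text.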

Mid-Session Access Control and Wire Protocol

Permission checks at connection time are not enough. What happens when an admin revokes Bob's edit access while Bob has three unsent ops buffered? My default: server rejects with a typed error, client shows a non-dismissible banner, local copy becomes read-only, unsent ops export to a downloadable file so work is not silently lost.
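
A rough sketch of the client side of that default, assuming the typed error arrives as the string code `PERMISSION_REVOKED` and that buffered ops are JSON-serializable. `EditorClient` and its fields are hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EditorClient:
    read_only: bool = False
    banner: Optional[str] = None
    unsent_ops: list = field(default_factory=list)

    def on_server_error(self, code: str) -> Optional[str]:
        """Handle a typed server error; return export data if work is at risk."""
        if code != "PERMISSION_REVOKED":
            return None
        self.read_only = True  # the local copy stops accepting edits
        self.banner = "Your edit access was revoked. Unsent changes can be exported."
        # Keep the ops in memory until the user has actually saved the file.
        return json.dumps({"unsent_ops": self.unsent_ops}, indent=2)
```

The returned string is what the UI would offer as a downloadable file; the banner stays until the session ends.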

For transport, WebSocket with a binary frame format (CBOR or MessagePack) is the right default in 2026. SSE is fine for read-only viewers. Long polling exists for corporate proxies that block WebSocket upgrades; test it quarterly or it will rot. One non-negotiable: every client message carries a monotonic client sequence number, every server ack carries a server-assigned commit id, and the two are reconciled on reconnect. Without those two numbers you cannot debug a desync at 2am.
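
The two numbers can be sketched as a tiny client-side session object. This is an illustration under assumptions: the field names, the `base_commit` field (the last commit id the client has seen), and the resend-on-reconnect behavior are mine, and the binary encoding is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class ClientMsg:
    client_id: str
    client_seq: int   # monotonic per client
    base_commit: int  # last server commit id this client has seen (assumed field)
    payload: bytes


@dataclass
class ServerAck:
    client_id: str
    client_seq: int
    commit_id: int    # assigned by the server


@dataclass
class ClientSession:
    client_id: str
    next_seq: int = 1
    last_commit: int = 0
    unacked: dict = field(default_factory=dict)  # client_seq -> ClientMsg

    def send(self, payload: bytes) -> ClientMsg:
        msg = ClientMsg(self.client_id, self.next_seq, self.last_commit, payload)
        self.unacked[msg.client_seq] = msg
        self.next_seq += 1
        return msg

    def on_ack(self, ack: ServerAck) -> None:
        self.unacked.pop(ack.client_seq, None)
        self.last_commit = max(self.last_commit, ack.commit_id)

    def to_resend(self) -> list:
        """On reconnect: everything the server never acknowledged, in order."""
        return [self.unacked[seq] for seq in sorted(self.unacked)]
```

Logging `(client_id, client_seq, commit_id)` for every ack is what lets someone line up a client trace against the server op log later.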

Observability and Testability

The metrics that actually tell you the system is healthy:

Ops per second per document, p50 and p99. A doc at p99 above 200 ops/sec is almost always a runaway script.
Transform conflict rate. Rising conflict rate precedes user-visible corruption.
Apply latency p99 from client send to client ack, end-to-end, not server-only.
Rebase-on-reconnect count per session. Spikes mean your channel split is leaking.
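
As a small illustration of the first metric, the sketch below computes a nearest-rank percentile and flags documents whose p99 ops/sec exceeds the 200 threshold. The function names and the input shape (a dict of per-document samples) are assumptions; a real system would more likely read these from its metrics backend.

```python
RUNAWAY_OPS_PER_SEC = 200


def percentile(samples: list, pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling, avoids float error
    return ordered[max(rank, 1) - 1]


def flag_runaway_docs(ops_per_sec_by_doc: dict) -> list:
    """Return ids of documents whose p99 ops/sec is above the threshold."""
    return sorted(doc_id for doc_id, samples in ops_per_sec_by_doc.items()
                  if percentile(samples, 99) > RUNAWAY_OPS_PER_SEC)
```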

For tests, I insist on a deterministic simulator: a seeded random schedule of ops from N virtual clients with scripted network partitions. Every production bug came back as a seed, got a failing test, and never shipped again. If you cannot replay a concurrent-edit bug from a seed, you do not have a testable system.
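
A stripped-down version of such a simulator fits in a few lines. The sketch below is a toy under stated assumptions: the "document" is a single last-writer-wins register rather than text, and network trouble is modeled only as seeded reordering of delivery, not as scripted partitions.

```python
import random


def simulate(seed: int, n_clients: int = 3, ops_per_client: int = 5) -> list:
    """Run one seeded schedule and return each replica's final state."""
    rng = random.Random(seed)
    # Each op is (logical_clock, client_id, value); the first two fields
    # give a total order, so ties between clients break deterministically.
    ops = [(clock, cid, f"c{cid}-v{clock}")
           for cid in range(n_clients)
           for clock in range(1, ops_per_client + 1)]
    replicas = []
    for _ in range(n_clients):
        delivery = ops[:]
        rng.shuffle(delivery)  # reordering is reproducible from the seed
        state = None
        for op in delivery:
            if state is None or op[:2] > state[:2]:  # last-writer-wins merge
                state = op
        replicas.append(state)
    return replicas


def converged(replicas: list) -> bool:
    return len(set(replicas)) == 1
```

A failing seed from production becomes a one-line regression test: `assert converged(simulate(seed))`.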

Acceptance Criteria

- Given Alice and Bob are editing the same paragraph
  When both submit overlapping ops within 50ms
  Then both clients converge to identical document state within 200ms
  And the server op log records both ops with monotonic commit ids

- Given Bob has been offline for 90 minutes with 12 buffered ops
  When Bob reconnects and the server snapshot has advanced
  Then the client rebases Bob's ops onto the new base
  And ops whose targets no longer exist are shown to Bob for review, not dropped

- Given an admin revokes Bob's edit permission while Bob has unsent ops
  When Bob's next op reaches the server
  Then the server responds with PERMISSION_REVOKED
  And Bob's client locks to read-only and offers to export buffered ops as a file

Desync Incident Packet

Real-time collaboration bugs are miserable to debug unless the spec already defines the evidence. A good incident packet should let an engineer reconstruct the document state without asking users to describe what happened.

Document id:
Client ids:
Server commit range:
Last acknowledged commit per client:
Buffered local ops:
Permission changes during session:
Network partition window:
Deterministic simulator seed:
Expected convergence rule:
User-visible recovery:
- silent rebase
- review conflicted ops
- export unsent work
- block editing until reload

The packet is also a design test. If you cannot fill in the expected convergence rule before launch, you will not be able to explain the desync after launch.

Conflict Resolution Fixture for Reviewers

The easiest way to expose a weak collaboration spec is to ask for one reproducible conflict. This fixture forces the decision into the open.

Fixture:
- User A edits paragraph P from "ship today" to "ship after QA".
- User B goes offline, edits the same paragraph to "ship after legal review".
- User A saves at version 42.
- User B reconnects 4 minutes later with base version 41.

Expected behavior:
- System detects divergent base version.
- User B sees both edits and chooses keep mine, keep latest, or merge manually.
- Audit log records base_version, resolved_version, resolver_id, and chosen action.
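
The fixture translates almost directly into a test helper. In this sketch the action names and the choice to record `resolved_version` as the server version plus one are assumptions, as are the function and class names.

```python
from dataclasses import dataclass
from typing import Optional

ACTIONS = ("keep_mine", "keep_latest", "merge_manually")


@dataclass
class AuditRecord:
    base_version: int
    resolved_version: int
    resolver_id: str
    action: str


def is_divergent(client_base_version: int, server_version: int) -> bool:
    return client_base_version < server_version


def resolve_conflict(server_version: int, server_text: str,
                     client_base_version: int, client_text: str,
                     resolver_id: str, action: str,
                     merged_text: Optional[str] = None):
    """Apply the user's choice and return (resolved_text, audit_record)."""
    if not is_divergent(client_base_version, server_version):
        raise ValueError("base is current; nothing to resolve")
    if action == "keep_mine":
        text = client_text
    elif action == "keep_latest":
        text = server_text
    elif action == "merge_manually" and merged_text is not None:
        text = merged_text
    else:
        raise ValueError(f"unsupported action or missing merge text: {action}")
    # Assumed: every resolution, including keep_latest, writes a new version.
    record = AuditRecord(client_base_version, server_version + 1,
                         resolver_id, action)
    return text, record
```

Run against the fixture, User B choosing to keep their own edit looks like this:

```python
text, record = resolve_conflict(42, "ship after QA", 41,
                                "ship after legal review", "user_b", "keep_mine")
```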

Case Study: Paragraph Desync During Live Editing

A collaboration feature worked in a demo but failed when two users edited the same paragraph while one lost connectivity. The case made the merge policy reviewable.

Condition	Risk	Evidence before release
User A offline for 30 seconds.	Local cursor applies to old document state.	Replay fixture with server revision gap.
User B deletes paragraph.	A's edit resurrects deleted content.	Tombstone test and user-visible conflict marker.
Undo after reconnect.	Undo removes another user's change.	Undo scope test tied to operation ownership.

Keywords: operational transform · CRDT · real-time collaboration spec · presence protocol · offline sync · collaborative undo