Real-Time Collaboration Spec: Conflict Resolution

Spec Coding Editorial Team · Spec-first engineering notes

Every real-time collaboration spec I have reviewed gets the demo right and the edge cases wrong. Two users typing in the same doc looks magical on a laptop plugged into gigabit ethernet. The spec earns its pay on a flight with flaky wifi, a revoked permission mid-edit, and an undo key pressed by the wrong person at the wrong time.

Published on 2026-03-09 · Updated 2026-05-06 · 7 min read · Author: Spec Coding Editorial Team · Review policy: Editorial Policy

Review Note

Reviewed May 6, 2026. This article is now part of the public topic path for the Spec-First Development Hub. It was rechecked for concrete examples, internal links, and indexable metadata before returning to the sitemap and feed.

Pick Your Conflict Model Before You Pick Your Database

The decision that shapes the rest of the spec is how concurrent edits converge. Pick one of four and write the tradeoff down:

The unpopular truth: for a new product with a small team, a CRDT (conflict-free replicated data type) from an existing library beats hand-rolled OT (operational transform) every time. The OT debugging tax is real, and you do not have the years Google had to pay it down.

Presence Is a Different System, Spec It Separately

Presence (who is online, cursor, selection, avatar color) looks like document state but behaves nothing like it. Ephemeral, lossy, high-frequency, privacy-sensitive. Keep it on its own channel:
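
A minimal sketch of a presence store that treats updates as ephemeral and lossy, kept apart from document ops. The class name `PresenceChannel`, the `ttl_seconds` parameter, and the payload shape are all hypothetical, chosen for illustration; the clock is passed in so the behavior is easy to test.

```python
class PresenceChannel:
    """Last-write-wins presence with expiry. Nothing here is persisted."""

    def __init__(self, ttl_seconds=10.0):
        self.ttl = ttl_seconds          # assumed expiry window, not from the article
        self._state = {}                # user_id -> (timestamp, payload)

    def publish(self, user_id, payload, now):
        # Lossy by design: a late or out-of-order update is simply ignored.
        previous = self._state.get(user_id)
        if previous is None or now >= previous[0]:
            self._state[user_id] = (now, payload)

    def snapshot(self, now):
        # Drop users whose last heartbeat is older than the TTL.
        self._state = {
            user: (ts, payload)
            for user, (ts, payload) in self._state.items()
            if now - ts <= self.ttl
        }
        return {user: payload for user, (ts, payload) in self._state.items()}


channel = PresenceChannel(ttl_seconds=10.0)
channel.publish("alice", {"cursor": 5}, now=0.0)
channel.publish("alice", {"cursor": 9}, now=1.0)
channel.publish("alice", {"cursor": 7}, now=0.5)   # stale, ignored
```

Because presence never enters the op log, losing an update costs a slightly stale cursor rather than a document desync.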

Snapshots, Op Logs, and Compaction

The canonical form is the snapshot plus the ops since it. Write the schedule into the spec, not a wiki, because ops teams need it under pressure:
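
As a rough illustration of "snapshot plus the ops since it", the sketch below rebuilds a document from a snapshot and an op log, then compacts older ops into a new snapshot. The `Op` and `Snapshot` shapes and the `keep_last` parameter are assumptions made for the example, not a prescribed storage format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    commit_id: int   # server-assigned, strictly increasing
    pos: int         # index where the op applies
    delete: int      # number of characters removed at pos
    insert: str      # text inserted at pos after the deletion


@dataclass(frozen=True)
class Snapshot:
    commit_id: int   # last commit already folded into `text`
    text: str


def apply(text, op):
    return text[:op.pos] + op.insert + text[op.pos + op.delete:]


def ops_since(snapshot, ops):
    newer = [op for op in ops if op.commit_id > snapshot.commit_id]
    return sorted(newer, key=lambda op: op.commit_id)


def materialize(snapshot, ops):
    """Canonical document = snapshot text + every op committed after it."""
    text = snapshot.text
    for op in ops_since(snapshot, ops):
        text = apply(text, op)
    return text


def compact(snapshot, ops, keep_last):
    """Fold all but the newest `keep_last` ops into a fresh snapshot."""
    ordered = ops_since(snapshot, ops)
    cut = max(len(ordered) - keep_last, 0)
    folded, tail = ordered[:cut], ordered[cut:]
    text = snapshot.text
    for op in folded:
        text = apply(text, op)
    new_commit = folded[-1].commit_id if folded else snapshot.commit_id
    return Snapshot(new_commit, text), tail


base = Snapshot(0, "The quick brown fox")
log = [Op(1, 15, 0, " and fast"), Op(2, 0, 3, "A")]
compacted, remaining = compact(base, log, keep_last=1)
```

The invariant worth writing into the spec is that materializing before and after compaction yields the same text.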

Offline: Minutes, Hours, and Days Are Three Different Problems

"The client went offline" is not one requirement. It is three, and the spec should answer each:

The Concrete Paragraph Example

Alice and Bob both have "The quick brown fox jumps over the lazy dog" open. Alice's cursor is after "brown" and she types " and fast". Bob selects "lazy" and replaces it with "sleeping". Both ops hit the server within 40ms of each other.

Under OT, the server orders by receipt. If Alice's insertion arrives first, the server transforms Bob's op against it (shifting his delete range by 9 characters, the length of " and fast"), and both clients converge to "The quick brown and fast fox jumps over the sleeping dog". Under CRDT, each character has a stable id: the insertion anchors after the 'n' in "brown", the replacement targets the specific "lazy" characters, and convergence is automatic. Under locking, whoever grabbed the lock wins. Write this exact example into the spec so reviewers argue about behavior, not diagrams.
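
The OT case can be sketched in a few lines, assuming Alice's insertion is received first. The `Insert` and `Replace` types and the function `transform_replace_against_insert` are hypothetical names for this example, and the transform handles only insertions outside the replaced range; an insertion inside the range is exactly the kind of case the spec has to decide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Replace:
    start: int
    end: int      # exclusive
    text: str


def transform_replace_against_insert(op, prior):
    """Rewrite `op` so it applies after `prior` has already been applied."""
    if prior.pos <= op.start:
        shift = len(prior.text)
        return Replace(op.start + shift, op.end + shift, op.text)
    if prior.pos >= op.end:
        return op
    raise NotImplementedError("insert inside the replaced range: spec must define this")


def apply_insert(doc, op):
    return doc[:op.pos] + op.text + doc[op.pos:]


def apply_replace(doc, op):
    return doc[:op.start] + op.text + doc[op.end:]


doc = "The quick brown fox jumps over the lazy dog"
alice = Insert(doc.index("brown") + len("brown"), " and fast")
lazy = doc.index("lazy")
bob = Replace(lazy, lazy + len("lazy"), "sleeping")

# Server order: Alice first, then Bob's op transformed against hers.
bob_transformed = transform_replace_against_insert(bob, alice)
converged = apply_replace(apply_insert(doc, alice), bob_transformed)
```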

Undo Is the Hardest UX Decision in the Product

I have never seen a team get undo right on the first try. The question is whose stack you pop.

Pick local undo. Write down what happens when a local undo targets content another user has since modified. That paragraph is what reviewers should argue about.
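
One possible shape for local undo, sketched under simplifying assumptions: the document is a map of block ids to text, and each user has their own undo stack. The `Document` and `Edit` names and the `"conflict"` result are hypothetical; what the product does on a conflict is the decision the spec paragraph has to make.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    author: str
    block_id: str
    before: str
    after: str


class Document:
    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.undo_stacks = {}   # author -> list of that author's own edits

    def edit(self, author, block_id, new_text):
        record = Edit(author, block_id, self.blocks[block_id], new_text)
        self.blocks[block_id] = new_text
        self.undo_stacks.setdefault(author, []).append(record)

    def undo(self, author):
        """Pop only the requesting user's stack, never anyone else's."""
        stack = self.undo_stacks.get(author, [])
        if not stack:
            return "nothing_to_undo"
        record = stack.pop()
        if self.blocks[record.block_id] != record.after:
            # Someone else changed this content since; do not clobber it.
            # The entry is dropped here; keeping it is another spec choice.
            return "conflict"
        self.blocks[record.block_id] = record.before
        return "undone"


shared = Document({"p1": "hello", "p2": "world"})
shared.edit("alice", "p1", "hello!")
shared.edit("bob", "p2", "world?")
first_undo = shared.undo("alice")       # reverts Alice's edit, leaves Bob's alone

shared.edit("alice", "p1", "hi")
shared.edit("bob", "p1", "hi there")
second_undo = shared.undo("alice")      # Bob has since modified the same block
```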

Mid-Session Access Control and Wire Protocol

Permission checks at connection time are not enough. What happens when an admin revokes Bob's edit access while Bob has three unsent ops buffered? My default: server rejects with a typed error, client shows a non-dismissible banner, local copy becomes read-only, unsent ops export to a downloadable file so work is not silently lost.
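
A client-side sketch of that default, assuming the server signals revocation with the typed error `PERMISSION_REVOKED`. The `Client` class, the `Mode` enum, the banner text, and the JSON export format are illustrative assumptions.

```python
import json
from enum import Enum


class Mode(Enum):
    EDITING = "editing"
    READ_ONLY = "read_only"


class Client:
    def __init__(self):
        self.mode = Mode.EDITING
        self.unsent = []        # ops buffered locally, not yet acknowledged
        self.banner = None      # non-dismissible message shown to the user
        self.export = None      # serialized unsent ops offered as a download

    def local_edit(self, op):
        if self.mode is Mode.READ_ONLY:
            return False        # local copy no longer accepts edits
        self.unsent.append(op)
        return True

    def on_server_error(self, code):
        if code == "PERMISSION_REVOKED":
            self.mode = Mode.READ_ONLY
            self.banner = "Your edit access was revoked. Unsent changes can be exported."
            # Keep the buffer and export it so work is not silently lost.
            self.export = json.dumps(self.unsent, indent=2)


bob_client = Client()
for text in ("a", "b", "c"):
    bob_client.local_edit({"insert": text})
bob_client.on_server_error("PERMISSION_REVOKED")
accepted_after_revoke = bob_client.local_edit({"insert": "d"})
```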

For transport, WebSocket with a binary frame format (CBOR or MessagePack) is the right default in 2026. SSE is fine for read-only viewers. Long polling exists for corporate proxies that block WebSocket upgrades; test it quarterly or it will rot. One non-negotiable: every op carries a monotonic client sequence number, every committed op gets a server-assigned commit id, and the two are reconciled on reconnect. Without those two numbers you cannot debug a desync at 2am.
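
To make the two numbers concrete, here is a sketch of client-side bookkeeping. It assumes client messages carry `client_seq` plus the last commit id the client has seen, and that on reconnect the server reports the highest `client_seq` it committed for this client. The `ClientLog` class and the field names are hypothetical, and framing (CBOR, MessagePack) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class ClientLog:
    next_seq: int = 1
    last_commit_seen: int = 0
    pending: dict = field(default_factory=dict)   # client_seq -> payload

    def _message(self, seq):
        return {
            "client_seq": seq,
            "base_commit": self.last_commit_seen,
            "payload": self.pending[seq],
        }

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1              # monotonic, never reused
        self.pending[seq] = payload
        return self._message(seq)

    def on_ack(self, client_seq, commit_id):
        self.pending.pop(client_seq, None)
        self.last_commit_seen = max(self.last_commit_seen, commit_id)

    def on_reconnect(self, server_last_seq):
        """Drop ops the server already committed, resend the rest in order."""
        for seq in [s for s in self.pending if s <= server_last_seq]:
            del self.pending[seq]
        return [self._message(seq) for seq in sorted(self.pending)]


log_state = ClientLog()
for text in ("a", "b", "c"):
    log_state.send({"insert": text})
log_state.on_ack(client_seq=1, commit_id=10)
# Connection drops; the ack for seq 2 was lost, but the server did commit it.
resend = log_state.on_reconnect(server_last_seq=2)
```

Resending by sequence number keeps a lost acknowledgment from turning into a duplicated op.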

Observability and Testability

The metrics that actually tell you the system is healthy:

For tests, I insist on a deterministic simulator: a seeded random schedule of ops from N virtual clients with scripted network partitions. Every production bug came back as a seed, got a failing test, and never shipped again. If you cannot replay a concurrent-edit bug from a seed, you do not have a testable system.
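
A toy version of such a simulator, to show the shape rather than a production harness. It assumes a grow-only set as the replicated state and plain set union as the merge; a real system would plug in its own document type and merge or transform function. The function names and the action mix are illustrative.

```python
import random
from functools import reduce


def union_merge(a, b):
    return a | b


def simulate(seed, n_clients=3, steps=200, merge=union_merge):
    """Run a seeded schedule of edits, syncs, and partitions; return (trace, replicas)."""
    rng = random.Random(seed)            # every random choice flows from the seed
    replicas = [frozenset() for _ in range(n_clients)]
    cut_off = set()                      # clients currently partitioned away
    trace = []
    for step in range(steps):
        action = rng.choice(("edit", "sync", "partition", "heal"))
        a = rng.randrange(n_clients)
        b = None
        if action == "edit":
            replicas[a] = replicas[a] | {(a, step)}   # edits work while offline too
        elif action == "sync":
            b = rng.randrange(n_clients)
            if a == b or a in cut_off or b in cut_off:
                action = "sync_dropped"
            else:
                replicas[a] = replicas[b] = merge(replicas[a], replicas[b])
        elif action == "partition":
            cut_off.add(a)
        else:
            cut_off.discard(a)
        trace.append((step, action, a, b))
    # Heal every partition and let all replicas exchange state once.
    merged = reduce(merge, replicas)
    replicas = [merge(replica, merged) for replica in replicas]
    return trace, replicas


trace_one, final_one = simulate(seed=7)
trace_two, final_two = simulate(seed=7)
```

A failing run is then fully described by its seed and parameters, which is what makes it storable as a regression test.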

Acceptance Criteria

- Given Alice and Bob are editing the same paragraph
  When both submit overlapping ops within 50ms
  Then both clients converge to identical document state within 200ms
  And the server op log records both ops with monotonic commit ids

- Given Bob has been offline for 90 minutes with 12 buffered ops
  When Bob reconnects and the server snapshot has advanced
  Then the client rebases Bob's ops onto the new base
  And ops whose targets no longer exist are shown to Bob for review, not dropped

- Given an admin revokes Bob's edit permission while Bob has unsent ops
  When Bob's next op reaches the server
  Then the server responds with PERMISSION_REVOKED
  And Bob's client locks to read-only and offers to export buffered ops as a file
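
The second criterion can be sketched as a rebase that never drops work. This assumes ops target blocks by id and that the client knows which block ids still exist on the new base; `BufferedOp` and `rebase` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferedOp:
    client_seq: int
    block_id: str
    new_text: str


def rebase(buffered, live_block_ids):
    """Split buffered ops into those that still apply and those needing review."""
    applicable, needs_review = [], []
    for op in sorted(buffered, key=lambda o: o.client_seq):
        if op.block_id in live_block_ids:
            applicable.append(op)
        else:
            needs_review.append(op)   # target is gone: show to the user, do not drop
    return applicable, needs_review


offline_ops = [
    BufferedOp(1, "p1", "edited intro"),
    BufferedOp(2, "p2", "edited middle"),
    BufferedOp(3, "p3", "edited ending"),
]
to_apply, to_review = rebase(offline_ops, live_block_ids={"p1", "p3"})
```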

Desync Incident Packet

Real-time collaboration bugs are miserable to debug unless the spec already defines the evidence. A good incident packet should let an engineer reconstruct the document state without asking users to describe what happened.

Desync incident packet

Document id:
Client ids:
Server commit range:
Last acknowledged commit per client:
Buffered local ops:
Permission changes during session:
Network partition window:
Deterministic simulator seed:
Expected convergence rule:
User-visible recovery:
- silent rebase
- review conflicted ops
- export unsent work
- block editing until reload

The packet is also a design test. If you cannot fill in the expected convergence rule before launch, you will not be able to explain the desync after launch.
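
If the packet lives in code rather than a document, a typed record makes the gaps visible. The sketch below mirrors the fields of the template; which fields count as required before launch is an assumption made here, as are the class and function names.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Recovery(Enum):
    SILENT_REBASE = "silent rebase"
    REVIEW_CONFLICTED_OPS = "review conflicted ops"
    EXPORT_UNSENT_WORK = "export unsent work"
    BLOCK_EDITING_UNTIL_RELOAD = "block editing until reload"


@dataclass
class DesyncIncidentPacket:
    document_id: Optional[str] = None
    client_ids: list = field(default_factory=list)
    server_commit_range: Optional[tuple] = None          # (first, last) commit id
    last_acked_commit_per_client: dict = field(default_factory=dict)
    buffered_local_ops: list = field(default_factory=list)
    permission_changes: list = field(default_factory=list)
    network_partition_window: Optional[tuple] = None     # (start, end) timestamps
    simulator_seed: Optional[int] = None
    expected_convergence_rule: Optional[str] = None
    user_visible_recovery: Optional[Recovery] = None


# Assumed minimum for a usable packet; adjust to the team's own bar.
REQUIRED = (
    "document_id",
    "client_ids",
    "server_commit_range",
    "expected_convergence_rule",
    "user_visible_recovery",
)


def missing_fields(packet):
    return [name for name in REQUIRED if not getattr(packet, name)]


draft = DesyncIncidentPacket(document_id="doc-1", client_ids=["alice", "bob"])
gaps = missing_fields(draft)
```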

Keywords: operational transform · CRDT · real-time collaboration spec · presence protocol · offline sync · collaborative undo
