Real-Time Collaboration Spec: Conflict Resolution
Every real-time collaboration spec I have reviewed gets the demo right and the edge cases wrong. Two users typing in the same doc looks magical on a laptop plugged into gigabit ethernet. The spec earns its pay on a flight with flaky wifi, a revoked permission mid-edit, and an undo key pressed by the wrong person at the wrong time.
Review Note
Reviewed May 6, 2026. This article is now part of the public topic path for the Spec-First Development Hub. It was rechecked for concrete examples, internal links, and indexable metadata before returning to the sitemap and feed.
Pick Your Conflict Model Before You Pick Your Database
The decision that shapes the rest of the spec is how concurrent edits converge. Pick one of four and write the tradeoff down:
- Operational Transform (OT). Server-authoritative, transforms each op against concurrent ops. Smaller wire payloads, but transform functions are where bugs hide. Google Docs still runs on OT because it was first and rewriting a working engine is a career-ending move.
- CRDT (Yjs, Automerge, Loro). Peer-friendly, convergence proven by math. You pay for it with metadata overhead (tombstones, vector clocks) and a less intuitive cursor model.
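- Server-side locks at paragraph or cell granularity. Trivially correct, feels terrible for prose, fine for spreadsheets. Object-granular schemes suit canvas tools for a similar reason: users rarely fight over the same rectangle. Figma is the usual example, though it has described resolving conflicts per object property with last-writer-wins on a central server rather than with locks.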
- Three-way merge on save. Git-style. Honest for long-form writing with few collaborators, dishonest if you marketed "real-time."
The unpopular truth: for a new product with a small team, CRDT via an existing library beats hand-rolled OT every time. The OT debugging tax is real and you do not have the years Google had to pay it down.
Presence Is a Different System, Spec It Separately
Presence (who is online, cursor, selection, avatar color) looks like document state but behaves nothing like it. Ephemeral, lossy, high-frequency, privacy-sensitive. Keep it on its own channel:
- Never persisted. A presence record older than 30 seconds is garbage.
- Throttled at the client. Cursor at most 20 updates/sec, coalesced server-side to 10 before rebroadcast.
- Stripped of PII in logs. A user id is fine; a name plus selection text is a leak waiting to happen.
- Degrades independently. If presence fails the document channel keeps editing, and vice versa.
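The four rules above fit in a very small in-memory structure. What follows is a minimal sketch, not a reference implementation: `PresenceHub`, its method names, and the choice to pass `now` explicitly are all assumptions made for illustration. The 30-second expiry and the 10-per-second rebroadcast ceiling come from the list above.

```python
from dataclasses import dataclass, field

PRESENCE_TTL_SECONDS = 30.0       # from the text: older than 30 s is garbage
MIN_BROADCAST_INTERVAL = 1 / 10   # from the text: rebroadcast at most 10/sec


@dataclass
class PresenceHub:
    """Hypothetical server-side presence store. In memory only, never persisted."""
    latest: dict = field(default_factory=dict)          # user_id -> (cursor, seen_at)
    last_broadcast: dict = field(default_factory=dict)  # user_id -> timestamp

    def update(self, user_id, cursor, now):
        """Keep the newest cursor; return True if it should be rebroadcast now."""
        self.latest[user_id] = (cursor, now)
        last = self.last_broadcast.get(user_id)
        if last is not None and now - last < MIN_BROADCAST_INTERVAL:
            return False  # too soon: the newest value is stored, this send is skipped
        self.last_broadcast[user_id] = now
        return True

    def active(self, now):
        """Drop records older than the TTL and return the user ids still present."""
        self.latest = {u: (c, t) for u, (c, t) in self.latest.items()
                       if now - t <= PRESENCE_TTL_SECONDS}
        return sorted(self.latest)
```

Only user ids and cursor positions pass through this structure, which keeps names and selection text out of anything that might be logged. A real hub would also need a timer to flush the last coalesced value; that is left out here.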
Snapshots, Op Logs, and Compaction
The canonical form is the snapshot plus the ops since it. Write the schedule into the spec, not a wiki, because ops teams need it under pressure:
- Snapshot every 500 ops or every 10 minutes of sustained editing, whichever comes first.
- Retain the last 30 days of ops uncompacted for audit and undo depth. Older ops compact into the next snapshot.
- A document idle for 24 hours gets snapshotted once and its op log compacted to zero. This alone cut our hot storage by 40 percent.
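One way to keep that schedule unambiguous is to write it as a single decision function. This is a sketch under assumptions: the function name, its three counters, and the returned labels are hypothetical, and it deliberately leaves out the 30-day retention rule, whose interaction with idle compaction the spec still has to settle.

```python
SNAPSHOT_EVERY_OPS = 500                 # from the text
SNAPSHOT_EVERY_SECONDS = 10 * 60         # from the text: 10 minutes of sustained editing
IDLE_COMPACT_SECONDS = 24 * 60 * 60      # from the text: idle for 24 hours


def snapshot_action(ops_since_snapshot, editing_seconds, idle_seconds):
    """Return "none", "snapshot", or "snapshot_and_compact" for one document."""
    if ops_since_snapshot == 0:
        return "none"  # nothing new to fold into a snapshot
    if idle_seconds >= IDLE_COMPACT_SECONDS:
        return "snapshot_and_compact"
    if (ops_since_snapshot >= SNAPSHOT_EVERY_OPS
            or editing_seconds >= SNAPSHOT_EVERY_SECONDS):
        return "snapshot"  # whichever threshold is hit first
    return "none"
```

For example, `snapshot_action(12, 600, 0)` returns `"snapshot"` because the time threshold fired before the op-count threshold did.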
Offline: Minutes, Hours, and Days Are Three Different Problems
"The client went offline" is not one requirement. It is three, and the spec should answer each:
- Minutes. Buffer ops in memory, reconnect, replay against the server's current vector. The server accepts or transforms. This is the easy case and the only one most demos cover.
- Hours. Persist the op buffer to IndexedDB. On reconnect, fetch the new server snapshot, rebase the local ops, show the user which of their edits survived rebase and which were rejected because their target no longer exists.
- Days. The local base snapshot is older than the server's oldest retained op. You cannot rebase; you can only three-way merge. Show the user a diff view and make them the merge author. Do not silently drop their work and do not silently clobber the server.
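The three cases can be told apart from commit ids alone, which is worth making explicit in the spec. The sketch below assumes commit ids are increasing integers and that the client remembers the last commit its local state includes; the function names, the returned labels, and the dictionary shape of an op are illustrative assumptions, not part of any library.

```python
def reconnect_strategy(client_base_commit, latest_snapshot_commit, oldest_retained_commit):
    """Pick a recovery path for a reconnecting client."""
    if client_base_commit + 1 < oldest_retained_commit:
        # Days: the ops that follow the client's base are no longer retained.
        return "three_way_merge"
    if client_base_commit < latest_snapshot_commit:
        # Hours: the server snapshot advanced past the client's base.
        return "rebase_on_snapshot"
    # Minutes: the client's base is still current enough to replay against.
    return "replay_buffer"


def split_after_rebase(buffered_ops, live_target_ids):
    """Separate ops whose target still exists from ops the user must review."""
    survived = [op for op in buffered_ops if op["target"] in live_target_ids]
    needs_review = [op for op in buffered_ops if op["target"] not in live_target_ids]
    return survived, needs_review
```

The second function returns the rejected ops instead of discarding them, so the client always has something to show the user.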
The Concrete Paragraph Example
Alice and Bob both have "The quick brown fox jumps over the lazy dog" open. Alice's cursor is after "brown" and she types " and fast". Bob selects "lazy" and replaces it with "sleeping". Both ops hit the server within 40ms of each other.
Under OT, the server orders by receipt. Say Alice's op lands first: the server transforms Bob's op against her insertion, shifting his replace range right by 9 characters, and both clients converge to "The quick brown and fast fox jumps over the sleeping dog". Under CRDT, each character has a stable id: the insertion anchors after the 'n' in "brown", the replacement targets the specific characters of "lazy", and convergence needs no transform step. Under locking, whoever grabbed the lock first wins, and the spec must say whether the other edit is rejected or queued. Write this exact example into the spec so reviewers argue about behavior, not diagrams.
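The OT half of the example is small enough to execute. This is a toy sketch covering only the two op shapes in the example, with character offsets as positions; the `Insert` and `Replace` types and both transform functions are hypothetical names, and an insert landing inside a replaced range is left as an explicit policy gap, not guessed at.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insert:
    pos: int
    text: str


@dataclass(frozen=True)
class Replace:
    start: int
    end: int   # exclusive
    text: str


def apply(doc, op):
    if isinstance(op, Insert):
        return doc[:op.pos] + op.text + doc[op.pos:]
    return doc[:op.start] + op.text + doc[op.end:]


def transform_replace_after_insert(rep, ins):
    """Rewrite `rep` for a document that already contains `ins`."""
    if ins.pos <= rep.start:
        shift = len(ins.text)
        return Replace(rep.start + shift, rep.end + shift, rep.text)
    if ins.pos >= rep.end:
        return rep
    raise NotImplementedError("insert inside the replaced range needs a policy")


def transform_insert_after_replace(ins, rep):
    """Rewrite `ins` for a document that already contains `rep`."""
    if ins.pos <= rep.start:
        return ins
    if ins.pos >= rep.end:
        delta = len(rep.text) - (rep.end - rep.start)
        return Insert(ins.pos + delta, ins.text)
    raise NotImplementedError("insert inside the replaced range needs a policy")


doc = "The quick brown fox jumps over the lazy dog"
alice = Insert(doc.index("brown") + len("brown"), " and fast")
bob = Replace(doc.index("lazy"), doc.index("lazy") + len("lazy"), "sleeping")

# Server order: Alice first, then Bob's op transformed against hers.
server = apply(apply(doc, alice), transform_replace_after_insert(bob, alice))
# Bob's client: his own op first, then Alice's op transformed against his.
bob_client = apply(apply(doc, bob), transform_insert_after_replace(alice, bob))
print(server)  # The quick brown and fast fox jumps over the sleeping dog
```

Both paths produce the same string, which is the convergence property the spec should pin down. The two `NotImplementedError` branches are where reviewers should spend their time.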
Undo Is the Hardest UX Decision in the Product
I have never seen a team get undo right on the first try. The question is whose stack you pop.
- Local undo. Ctrl+Z only undoes your own ops. Matches user intuition. Requires that each op be individually invertible against the current document, which is hard when ops were already transformed.
- Global undo. Undoes the last op on the document regardless of author. Easy to implement. Destroys trust the first time Alice undoes Bob's paragraph.
- Session-scoped undo with ownership. What Google Docs does. Local by default, but ops built on a now-undone op must be rebased or dropped. Spec the drop policy explicitly.
Pick local undo. Write down what happens when a local undo targets content another user has since modified. That paragraph is what reviewers should argue about.
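To make that paragraph concrete, here is a sketch of the selection step only: which op a local undo targets, and whether someone else has touched the same content since. Inverting the op against the current document is the hard part and is not attempted. The `LoggedOp` shape, the `target` field, and `pick_local_undo` are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoggedOp:
    commit_id: int
    author: str
    target: str          # hypothetical: id of the paragraph or block the op touched
    undone: bool = False


def pick_local_undo(log, author):
    """Return (op, contested) for the author's latest live op, or (None, False)."""
    for i in range(len(log) - 1, -1, -1):
        op = log[i]
        if op.author == author and not op.undone:
            # Contested means another author edited the same target afterwards.
            contested = any(
                later.author != author
                and later.target == op.target
                and not later.undone
                for later in log[i + 1:]
            )
            return op, contested
    return None, False
```

When `contested` is true, the spec has to say what the user sees. That branch is the one worth arguing about in review.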
Mid-Session Access Control and Wire Protocol
Permission checks at connection time are not enough. What happens when an admin revokes Bob's edit access while Bob has three unsent ops buffered? My default: server rejects with a typed error, client shows a non-dismissible banner, local copy becomes read-only, unsent ops export to a downloadable file so work is not silently lost.
For transport, WebSocket with a binary frame format (CBOR or MessagePack) is the right default in 2026. SSE is fine for read-only viewers. Long polling exists for corporate proxies that block WebSocket upgrades; test it quarterly or it will rot. One non-negotiable: every message carries a monotonic client sequence number and a server-assigned commit id, reconciled on reconnect. Without those two numbers you cannot debug a desync at 2am.
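Those two numbers and the revocation default can be sketched without committing to a wire format. The envelope fields, the function names, and the shape of the returned client state below are hypothetical; the `PERMISSION_REVOKED` code is the typed error used in the acceptance criteria, and encoding to CBOR or MessagePack is out of scope here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Envelope:
    client_id: str
    client_seq: int                   # monotonic per client, assigned on send
    commit_id: Optional[int] = None   # assigned by the server once committed
    payload: bytes = b""


def ops_to_resend(outbox, last_acked_seq):
    """On reconnect, resend only what the server has not committed yet."""
    return [m for m in outbox if m.client_seq > last_acked_seq]


def handle_reply(code, outbox):
    """Map a typed server reply to client state (assumed shape)."""
    if code == "PERMISSION_REVOKED":
        # Read-only, persistent banner, and unsent work offered for export.
        return {"read_only": True, "show_banner": True, "export": list(outbox)}
    return {"read_only": False, "show_banner": False, "export": []}
```

With an outbox holding sequence numbers 7, 8, and 9 and a server that reports 7 as the last committed sequence, `ops_to_resend` returns the envelopes for 8 and 9.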
Observability and Testability
The metrics that actually tell you the system is healthy:
- Ops per second per document, p50 and p99. A doc at p99 above 200 ops/sec is almost always a runaway script.
- Transform conflict rate. Rising conflict rate precedes user-visible corruption.
- Apply latency p99 from client send to client ack, end-to-end, not server-only.
- Rebase-on-reconnect count per session. Spikes mean your channel split is leaking.
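Percentile definitions vary between monitoring tools, so it helps to state which one the thresholds assume. A minimal sketch using the nearest-rank definition follows; the function names are hypothetical and the 200 ops/sec ceiling is the one from the list above.

```python
import math

RUNAWAY_P99_OPS_PER_SEC = 200  # from the text


def percentile(samples, p):
    """Nearest-rank percentile of a non-empty sample list, p in 1..100."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def looks_like_runaway(ops_per_sec_samples):
    """Flag a document whose p99 op rate exceeds the ceiling."""
    return percentile(ops_per_sec_samples, 99) > RUNAWAY_P99_OPS_PER_SEC
```

With 100 one-second samples, a single spike to 300 does not trip the check, but two samples above 200 do.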
For tests, I insist on a deterministic simulator: a seeded random schedule of ops from N virtual clients with scripted network partitions. Every production bug came back as a seed, got a failing test, and never shipped again. If you cannot replay a concurrent-edit bug from a seed, you do not have a testable system.
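A simulator of that shape does not need to be large. The sketch below is an assumption-heavy stand-in: the convergence engine is a last-writer-wins map and not a real OT or CRDT engine, `simulate` and its parameters are hypothetical, and the partition is a single scripted window during which one client neither sends nor receives. What it does show is the property that matters: one seeded random source drives every choice, so a seed is a complete bug report.

```python
import random


def simulate(seed, n_clients=3, n_ops=60, partition=(1, 10, 30)):
    """Run one seeded schedule and return every replica's final state."""
    rng = random.Random(seed)                  # the only source of randomness
    cut_client, cut_start, cut_end = partition
    replicas = [{} for _ in range(n_clients)]  # key -> (stamp, value)
    inboxes = [[] for _ in range(n_clients)]

    def apply_op(replica, op):
        stamp, key, value = op
        if key not in replica or replica[key][0] < stamp:
            replica[key] = (stamp, value)      # last writer wins; stamps are unique

    for step in range(n_ops):
        author = rng.randrange(n_clients)
        op = ((step, author), rng.choice("abc"), rng.randrange(1000))
        apply_op(replicas[author], op)
        for peer in range(n_clients):
            if peer != author:
                inboxes[peer].append(op)

        partitioned = cut_start <= step < cut_end
        for peer in range(n_clients):
            rng.shuffle(inboxes[peer])         # seeded reordering of deliveries
            blocked = []
            for queued in inboxes[peer]:
                sender = queued[0][1]
                if partitioned and cut_client in (peer, sender):
                    blocked.append(queued)     # held until the partition heals
                else:
                    apply_op(replicas[peer], queued)
            inboxes[peer] = blocked

    for peer in range(n_clients):              # final flush after the run
        for queued in inboxes[peer]:
            apply_op(replicas[peer], queued)
    return replicas


replicas = simulate(seed=42)
assert all(r == replicas[0] for r in replicas)   # every replica converged
assert simulate(seed=42) == replicas             # same seed, same run
```

Swapping the last-writer-wins map for the production engine keeps the harness unchanged, and a failing seed can then be checked into the test suite as is.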
Acceptance Criteria
- Given Alice and Bob are editing the same paragraph
  When both submit overlapping ops within 50ms
  Then both clients converge to identical document state within 200ms
  And the server op log records both ops with monotonic commit ids
- Given Bob has been offline for 90 minutes with 12 buffered ops
  When Bob reconnects and the server snapshot has advanced
  Then the client rebases Bob's ops onto the new base
  And ops whose targets no longer exist are shown to Bob for review, not dropped
- Given an admin revokes Bob's edit permission while Bob has unsent ops
  When Bob's next op reaches the server
  Then the server responds with PERMISSION_REVOKED
  And Bob's client locks to read-only and offers to export buffered ops as a file
Desync Incident Packet
Real-time collaboration bugs are miserable to debug unless the spec already defines the evidence. A good incident packet should let an engineer reconstruct the document state without asking users to describe what happened.
Desync incident packet
Document id:
Client ids:
Server commit range:
Last acknowledged commit per client:
Buffered local ops:
Permission changes during session:
Network partition window:
Deterministic simulator seed:
Expected convergence rule:
User-visible recovery:
- silent rebase
- review conflicted ops
- export unsent work
- block editing until reload
The packet is also a design test. If you cannot fill in the expected convergence rule before launch, you will not be able to explain the desync after launch.
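If the packet lives in code as well as in the spec, the "can you fill it in" check becomes mechanical. The sketch below mirrors the packet's fields in a dataclass; the class name, the field names and types, and the `missing` helper are assumptions for illustration. Every field defaults to `None` so that an honest empty answer, such as no permission changes, is distinguishable from an unanswered one.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class DesyncIncidentPacket:
    document_id: Optional[str] = None
    client_ids: Optional[list] = None
    server_commit_range: Optional[tuple] = None
    last_acked_commit_per_client: Optional[dict] = None
    buffered_local_ops: Optional[list] = None
    permission_changes: Optional[list] = None
    network_partition_window: Optional[tuple] = None
    simulator_seed: Optional[int] = None
    expected_convergence_rule: Optional[str] = None
    user_visible_recovery: Optional[str] = None

    def missing(self):
        """Names of the fields nobody has answered yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]
```

A packet built before launch with only `document_id` set reports `expected_convergence_rule` as missing, which is exactly the gap the design test is meant to expose.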
Editorial Note
- Author details: Spec Coding Editorial Team