WebSocket Real-Time Collaboration Design Guide

Design real-time collaborative features with WebSockets, presence, conflict handling, offline recovery, scaling, observability, and security safeguards.

Prompt Template

You are a senior backend and real-time systems architect. Design a WebSocket-based real-time collaboration feature for [application/product].

System context:
- Collaboration feature: [comments, cursors, document editing, whiteboard, chat, dashboard updates, multiplayer workflow]
- Users and scale: [concurrent users, rooms/workspaces, peak traffic]
- Stack: [frontend framework, backend language, database, cache/pub-sub, hosting]
- Consistency needs: [eventual consistency, strong ordering, conflict-free editing, audit trail]
- Offline/reconnect needs: [mobile clients, flaky networks, resumable sessions]
- Auth model: [sessions, JWT, SSO, tenant roles]
- Data sensitivity: [PII, enterprise data, healthcare, financial, public chat]
- Existing APIs/events: [REST, GraphQL, queues, webhooks]
- Operational constraints: [cost, latency region, compliance, small team, vendor preference]

Deliver:
1. **Recommended architecture**: WebSocket gateway, app services, pub/sub, persistence, and client state
2. **Message protocol** with event names, payload examples, versioning, and validation rules
3. **Presence model** for online users, cursors, typing, room membership, and idle states
4. **Conflict strategy**: optimistic updates, locks, operational transform, CRDT, or last-write rules with rationale
5. **Reconnect and offline recovery flow** with sequence IDs, replay, and stale-session handling
6. **Authorization and tenant isolation checks** for every connection and room join
7. **Scaling plan** for multiple nodes, sticky sessions, Redis/NATS/Kafka, backpressure, and rate limits
8. **Observability plan**: metrics, logs, traces, synthetic tests, and alert thresholds
9. **Security checklist** for abuse, payload limits, origin checks, token expiry, and data leakage
10. **Implementation roadmap** from MVP to production hardening

Flag risky assumptions and include test cases for race conditions and reconnect bugs.

Example Output

Real-Time Design — Collaborative Roadmap Board

Architecture

Use a WebSocket gateway for room connections, Redis pub/sub for cross-node fanout, PostgreSQL for durable board events, and a client-side optimistic store. Each board is a room scoped by tenant_id and board_id.
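As a minimal sketch of the room scoping described above, the snippet below derives a pub/sub channel name from tenant_id and board_id and checks a join request against the caller's tenant. The `RoomKey` class, the channel naming scheme, the `can_join` helper, and the `allowed_board_ids` parameter are illustrative assumptions, not part of the original design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoomKey:
    """Identifies one board room by the two fields named in the design."""

    tenant_id: str
    board_id: str

    def __post_init__(self) -> None:
        # Assumption: IDs never contain ":"; rejecting it keeps channel names unambiguous.
        for value in (self.tenant_id, self.board_id):
            if not value or ":" in value:
                raise ValueError(f"invalid id: {value!r}")

    def channel(self) -> str:
        # Hypothetical naming scheme for the cross-node fanout channel.
        return f"tenant:{self.tenant_id}:board:{self.board_id}"


def can_join(room: RoomKey, user_tenant_id: str, allowed_board_ids: set[str]) -> bool:
    """Allow a join only if the user's tenant matches and the board is permitted."""
    return user_tenant_id == room.tenant_id and room.board_id in allowed_board_ids
```

Putting the tenant in the channel name means a subscriber on one tenant's channel cannot receive another tenant's events through a board ID collision alone; the authorization check is still needed on every join.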

Event protocol

| Event | Direction | Purpose |
|---|---|---|
| board.join | client → server | Authorize and enter room |
| card.move.requested | client → server | Request optimistic move |
| card.move.applied | server → clients | Broadcast validated move |
| presence.updated | server → clients | Cursor and active user state |
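One way to enforce the direction column and basic validation rules for these events is sketched below. The envelope shape (`v`, `event`, `payload`), the required payload fields, the 16 KiB limit, and the `ProtocolError` and `parse_client_message` names are assumptions for illustration; only the event names and directions come from the table.

```python
import json
from typing import Any

MAX_PAYLOAD_BYTES = 16_384  # assumed limit; tune per product

# Events a client may send, per the direction column above.
CLIENT_EVENTS = {"board.join", "card.move.requested"}

# Hypothetical required payload fields for each client event.
REQUIRED_FIELDS = {
    "board.join": {"tenant_id", "board_id"},
    "card.move.requested": {"card_id", "to_column", "position"},
}


class ProtocolError(ValueError):
    """Raised when an inbound message fails validation."""


def parse_client_message(raw: str) -> dict[str, Any]:
    """Validate size, JSON shape, version, direction, and required fields."""
    if len(raw.encode("utf-8")) > MAX_PAYLOAD_BYTES:
        raise ProtocolError("payload too large")
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ProtocolError("invalid JSON") from exc
    if not isinstance(msg, dict):
        raise ProtocolError("message must be an object")
    if msg.get("v") != 1:
        raise ProtocolError("unsupported protocol version")
    event = msg.get("event")
    if not isinstance(event, str) or event not in CLIENT_EVENTS:
        raise ProtocolError(f"event not accepted from clients: {event!r}")
    payload = msg.get("payload")
    if not isinstance(payload, dict):
        raise ProtocolError("payload must be an object")
    missing = REQUIRED_FIELDS[event] - set(payload)
    if missing:
        raise ProtocolError(f"missing fields: {sorted(missing)}")
    return msg
```

Rejecting server-only events such as card.move.applied at the parser keeps a client from spoofing a broadcast.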

Reconnect flow

Clients include last_seen_sequence on reconnect. If the missed events fall within a 5-minute replay window, the server replays them from the durable event table; otherwise the client receives board.snapshot.required and must reload the full board state.
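The reconnect decision can be sketched as a pure function over the stored events. This reads the 5-minute rule as "the oldest missed event is at most 5 minutes old", which is one possible interpretation; the `BoardEvent` type, the `plan_reconnect` function, and the `board.replay` event name are hypothetical, and the in-memory list stands in for the durable event table.

```python
from dataclasses import dataclass

REPLAY_WINDOW_SECONDS = 5 * 60  # the 5-minute window from the text


@dataclass(frozen=True)
class BoardEvent:
    sequence: int      # assumed to increase by 1 per board event
    created_at: float  # Unix timestamp in seconds
    name: str
    payload: dict


def plan_reconnect(events: list[BoardEvent], last_seen_sequence: int, now: float) -> dict:
    """Return the events to replay, or signal that a full snapshot is required.

    `events` must be sorted by sequence.
    """
    # A client claiming a sequence the server never issued cannot be trusted.
    if events and last_seen_sequence > events[-1].sequence:
        return {"event": "board.snapshot.required"}
    missed = [e for e in events if e.sequence > last_seen_sequence]
    if not missed:
        return {"event": "board.replay", "events": []}
    oldest = missed[0]
    has_gap = oldest.sequence != last_seen_sequence + 1  # earlier events were pruned
    too_old = now - oldest.created_at > REPLAY_WINDOW_SECONDS
    if has_gap or too_old:
        return {"event": "board.snapshot.required"}
    return {"event": "board.replay", "events": missed}
```

The gap check matters as much as the time check: if the event table has been pruned past the client's position, a partial replay would silently leave the client with wrong state.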

Test cases

- Two users move the same card at the same time.
- User loses network after optimistic update but before server ack.
- User is removed from tenant while socket remains connected.
- Payload exceeds limit or comes from a disallowed origin.
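The first test case above could be exercised with something like the following sketch. It assumes a per-card version check (first write wins, stale move rejected), which is only one of the conflict strategies the prompt lists; the `Board` class, the `StaleMove` exception, and the payload fields other than the event name are hypothetical stand-ins for the real server state.

```python
class StaleMove(Exception):
    """Raised when a move was based on an outdated card version."""


class Board:
    """In-memory stand-in for server-side board state."""

    def __init__(self) -> None:
        self.cards: dict[str, dict] = {}
        self.sequence = 0  # board-wide event sequence

    def add_card(self, card_id: str, column: str) -> None:
        self.cards[card_id] = {"column": column, "version": 0}

    def move_card(self, card_id: str, to_column: str, base_version: int) -> dict:
        """Apply a move only if the client saw the card's latest version."""
        card = self.cards[card_id]
        if card["version"] != base_version:
            raise StaleMove(card_id)
        card["column"] = to_column
        card["version"] += 1
        self.sequence += 1
        return {
            "event": "card.move.applied",
            "sequence": self.sequence,
            "card_id": card_id,
            "to_column": to_column,
            "version": card["version"],
        }


def test_two_users_move_same_card() -> None:
    board = Board()
    board.add_card("c1", "todo")
    # User A and user B both saw version 0; A's request arrives first.
    applied = board.move_card("c1", "doing", base_version=0)
    assert applied["version"] == 1
    try:
        board.move_card("c1", "done", base_version=0)
    except StaleMove:
        pass
    else:
        raise AssertionError("second move should have been rejected")
    # The board reflects only the first move; B's client must roll back.
    assert board.cards["c1"]["column"] == "doing"
```

In a real system the rejected client would revert its optimistic update and re-render from the broadcast card.move.applied event.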

Tips for Best Results

- Specify whether you need collaborative editing or simpler real-time status updates; the conflict model changes everything.
- Ask for reconnect flows early: real-time features fail in the messy middle, not the happy path.
- Include tenant isolation and room authorization in every design review.
- Plan observability before launch; debugging ghost sockets in production is pure Gremlins-after-midnight energy.