Event Sourcing Replay and Backfill Runbook Builder

Plan a safe event replay or projection backfill with idempotency rules, batching, validation, rollback triggers, and operator checklists.

Prompt Template

You are a senior backend engineer who has operated event-sourced systems in production. Build a replay and backfill runbook for the system below.

System architecture: [event-sourced service, CDC pipeline, Kafka consumers, Postgres outbox, etc.]
Event store or broker: [Kafka / EventStoreDB / DynamoDB streams / Postgres / other]
Projection or read model to rebuild: [name and purpose]
Reason for replay: [bug fix, new projection, schema change, missed events, corrupted read model]
Event volume and date range: [approximate count, time window, tenant scope]
Current consumer behavior: [idempotent / non-idempotent / unknown]
Ordering requirements: [per account, per aggregate, global, relaxed]
Side effects to avoid: [emails, webhooks, billing, notifications, external API calls]
Downtime tolerance: [none / read-only window / maintenance window]
Observability available: [logs, traces, dashboards, dead letter queues, checksums]
Operational constraints: [rate limits, storage, compliance, customer impact, on-call coverage]

Produce:
1. Replay goal, scope, assumptions, and explicit non-goals.
2. Pre-flight checklist covering backups, snapshots, feature flags, dry runs, and side-effect suppression.
3. Event selection strategy with filters, ordering rules, schema-version handling, and tenant boundaries.
4. Idempotency and deduplication plan for consumers, projections, and external side effects.
5. Batch sizing and throttling plan with pause/resume behavior.
6. Validation strategy using counts, checksums, sample records, business invariants, and customer-facing checks.
7. Monitoring dashboard and alert thresholds during replay.
8. Rollback, abort, and recovery steps if validation fails midway.
9. Operator timeline with owners, commands to prepare, and go/no-go decision points.
10. Post-replay cleanup and lessons-learned checklist.

Include pseudocode for the replay loop and make the plan cautious enough for production data.

Example Output

Replay Summary

Rebuild the invoice_summary projection for tenant group A from events emitted between 2026-01-01 and 2026-05-31. The replay must not resend invoice emails or billing webhooks.

Pre-Flight

- Snapshot current projection table and record row count by tenant.
- Enable replay_mode=true so notification and webhook handlers ignore replayed events.
- Dry-run 10,000 events in staging using a copy of production metadata.
- Confirm projection writes are idempotent on event_id plus projection_name.
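The last two pre-flight items can be made concrete with a small sketch. It assumes a SQLite-backed projection purely for illustration; the table `processed_events`, the columns on `invoice_summary`, and the helpers `create_schema`, `apply_event`, and `notify` are hypothetical names, not part of any real system described above.

```python
import sqlite3
from typing import Callable


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical dedup ledger and projection table."""
    with conn:
        conn.execute(
            "CREATE TABLE processed_events ("
            " event_id TEXT NOT NULL, projection_name TEXT NOT NULL,"
            " PRIMARY KEY (event_id, projection_name))"
        )
        conn.execute(
            "CREATE TABLE invoice_summary ("
            " tenant_id TEXT PRIMARY KEY,"
            " invoice_count INTEGER NOT NULL DEFAULT 0,"
            " total_cents INTEGER NOT NULL DEFAULT 0)"
        )


def apply_event(conn: sqlite3.Connection, projection_name: str, event: dict) -> bool:
    """Apply an event at most once per (event_id, projection_name)."""
    with conn:  # one transaction: the ledger row and the projection update commit together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id, projection_name)"
            " VALUES (?, ?)",
            (event["event_id"], projection_name),
        )
        if cur.rowcount == 0:
            return False  # already applied, so a replayed duplicate is a no-op
        conn.execute(
            "INSERT OR IGNORE INTO invoice_summary (tenant_id) VALUES (?)",
            (event["tenant_id"],),
        )
        conn.execute(
            "UPDATE invoice_summary SET invoice_count = invoice_count + 1,"
            " total_cents = total_cents + ? WHERE tenant_id = ?",
            (event["amount_cents"], event["tenant_id"]),
        )
    return True


def notify(event: dict, replay_mode: bool, send: Callable[[dict], None]) -> bool:
    """Suppress the side effect entirely while replay_mode is on."""
    if replay_mode:
        return False
    send(event)
    return True
```

Keeping the dedup row and the projection update in one transaction means a crash between them cannot leave an event marked as applied without its effect, or the reverse.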

Replay Loop Sketch

1. Read events in ascending sequence-number order within each aggregate_id.
2. Skip events whose event_id is already recorded in replay_checkpoint.
3. Transform legacy v1 events through the compatibility adapter.
4. Write the projection update inside a transaction.
5. Store a checkpoint with event_id, aggregate_id, and checksum.
6. Pause if the error rate exceeds 0.5% or projection lag exceeds 15 minutes.
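The six steps above can be sketched in Python as follows. This is an in-memory illustration under stated assumptions, not a production implementation: `ReplayState`, `upgrade`, `checksum`, and `replay` are hypothetical names, the v1 event shape (an `amount` field in whole currency units) is invented for the example, and a real loop would persist the checkpoint in the same transaction as the projection write.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class ReplayState:
    checkpoint: dict = field(default_factory=dict)  # event_id -> checkpoint record
    processed: int = 0
    failed: int = 0
    paused: bool = False


def upgrade(event: dict) -> dict:
    """Hypothetical compatibility adapter: lift v1 events to the v2 shape."""
    if event.get("version", 1) >= 2:
        return event
    upgraded = {k: v for k, v in event.items() if k != "amount"}
    upgraded["version"] = 2
    upgraded["amount_cents"] = round(event["amount"] * 100)
    return upgraded


def checksum(event: dict) -> str:
    return hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()


def replay(
    events: Iterable[dict],
    apply: Callable[[dict], None],
    state: ReplayState,
    batch_size: int = 500,
    max_error_rate: float = 0.005,          # 0.5%
    lag_seconds: Callable[[], float] = lambda: 0.0,
    max_lag_seconds: float = 15 * 60,       # 15 minutes
) -> ReplayState:
    state.paused = False  # calling replay() again resumes from the checkpoint
    ordered = sorted(events, key=lambda e: (e["aggregate_id"], e["sequence"]))
    blocked = set()  # aggregates with a failed event; later events must wait
    for start in range(0, len(ordered), batch_size):
        for event in ordered[start:start + batch_size]:
            if event["event_id"] in state.checkpoint:
                continue  # step 2: already replayed
            if event["aggregate_id"] in blocked:
                continue  # preserve per-aggregate ordering after a failure
            try:
                current = upgrade(event)  # step 3
                apply(current)            # step 4 (caller wraps in a transaction)
                state.checkpoint[event["event_id"]] = {  # step 5
                    "aggregate_id": event["aggregate_id"],
                    "checksum": checksum(current),
                }
                state.processed += 1
            except Exception:
                state.failed += 1
                blocked.add(event["aggregate_id"])
        attempted = state.processed + state.failed
        error_rate = state.failed / attempted if attempted else 0.0
        if error_rate > max_error_rate or lag_seconds() > max_lag_seconds:
            state.paused = True  # step 6: stop at a batch boundary
            break
    return state
```

Checking the pause conditions only at batch boundaries keeps the loop simple and makes resume behavior predictable: a rerun with the same `ReplayState` skips everything already checkpointed.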

Validation

| Check | Expected | Abort If |
|---|---:|---:|
| Invoice count parity | 100% | below 99.95% |
| Total invoiced amount variance | under 0.1% | over 0.25% |
| Missing customer links | 0 | any P0 customer affected |
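The abort column of the table can be encoded as a small check that runs after each batch. The sketch below is one possible reading of those thresholds: `Totals` and `abort_reasons` are hypothetical names, amounts are assumed to be stored as integer cents, and count parity is treated symmetrically so that duplicated rows trip the check as well as missing ones.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Totals:
    invoice_count: int
    total_cents: int


def abort_reasons(
    source: Totals, rebuilt: Totals, missing_link_tiers: Iterable[str]
) -> List[str]:
    """Return the abort conditions that fired; an empty list means continue."""
    reasons = []

    # Count parity below 99.95%, using integer math to avoid float rounding.
    low, high = sorted((source.invoice_count, rebuilt.invoice_count))
    if high > 0 and low * 10_000 < high * 9_995:
        reasons.append("invoice count parity below 99.95%")

    # Amount variance over 0.25% of the source total.
    diff = abs(rebuilt.total_cents - source.total_cents)
    if source.total_cents == 0:
        if diff != 0:
            reasons.append("total invoiced amount variance over 0.25%")
    elif diff * 10_000 > abs(source.total_cents) * 25:
        reasons.append("total invoiced amount variance over 0.25%")

    # Any missing customer link that belongs to a P0 customer.
    if "P0" in set(missing_link_tiers):
        reasons.append("missing customer links affect a P0 customer")

    return reasons
```

Results that land between the expected value and the abort threshold (for example, a variance of 0.2%) return no abort reason here; a runbook would normally flag them for manual review rather than ignore them.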

Rollback Trigger

If checksum variance persists after one retry batch, pause replay, restore the snapshot for affected tenants, and keep reads on the old projection until the transform bug is fixed.
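The trigger above amounts to a three-way decision after each checksum comparison. A minimal sketch, assuming a single retry is allowed; `Action`, `next_action`, and `rollback_steps` are hypothetical names, and the returned steps are descriptions for the operator rather than executable commands.

```python
from enum import Enum
from typing import Iterable, List


class Action(Enum):
    CONTINUE = "continue"
    RETRY_BATCH = "retry_batch"
    ROLLBACK = "rollback"


def next_action(checksums_match: bool, retries_used: int, max_retries: int = 1) -> Action:
    """Decide what to do after comparing checksums for the latest batch."""
    if checksums_match:
        return Action.CONTINUE
    if retries_used < max_retries:
        return Action.RETRY_BATCH
    return Action.ROLLBACK  # variance persisted after the retry batch


def rollback_steps(affected_tenants: Iterable[str]) -> List[str]:
    """Ordered operator steps, limited to the tenants that showed variance."""
    steps = ["pause replay"]
    steps += [f"restore snapshot for tenant {t}" for t in sorted(affected_tenants)]
    steps.append("keep reads on the old projection until the transform bug is fixed")
    return steps
```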

Tips for Best Results

💡 List side effects explicitly; replay bugs often happen when old events trigger new emails, invoices, or webhooks.
💡 Provide event volume and ordering rules so the model can choose realistic batch sizes and checkpoints.
💡 Ask for validation checks tied to business invariants, not only technical row counts.
💡 Run the prompt once for the runbook and a second time with your real metrics to pressure-test the abort thresholds.
