Webhook Retry and Idempotency Design Guide
Design resilient webhook delivery and consumer handling with idempotency keys, retry policies, signature verification, and dead-letter recovery workflows.
Prompt Template
You are a senior distributed systems engineer specializing in webhook reliability. Help me design a production-ready webhook delivery and consumption system.

**Use case:** [e.g., SaaS billing events, marketplace order updates, CRM sync]
**Webhook producer or consumer?:** [producer / consumer / both]
**Expected volume:** [events per minute/day]
**Payload shape:** [briefly describe fields and approximate size]
**Current pain points:** [duplicate deliveries, missing retries, out-of-order events, slow downstream services]
**Security requirements:** [HMAC signature, IP allowlist, mTLS, none yet]
**Infrastructure:** [e.g., Node.js + Postgres + SQS, Laravel + Redis, serverless]
**Downstream dependencies:** [APIs, database writes, third-party services]
**Failure tolerance:** [how long events can wait, acceptable data loss = none/low/medium]

Please provide:

1. **Reference Architecture** for reliable webhook publishing and/or consumption
2. **Idempotency Strategy** including unique event IDs, storage design, TTL, and replay behavior
3. **Retry Policy** with backoff schedule, max attempts, jitter, and terminal failure handling
4. **Security Layer** covering signature verification, timestamp tolerance, replay attack prevention, and secret rotation
5. **Ordering and Concurrency Rules** for events that may arrive out of order or be processed in parallel
6. **Dead-Letter and Replay Workflow** with operator runbook steps
7. **Observability Plan** including logs, metrics, alerts, and dashboard widgets
8. **Implementation Checklist** with common mistakes to avoid

Include pseudocode or code snippets for the stack I specify, plus a sample event table schema.
Example Output
# Webhook Reliability Blueprint
**Use case:** Subscription billing events
**Stack:** Node.js + Postgres + SQS
Architecture
The producer writes each event to an `outbox_events` table inside the same DB transaction as the business action. A relay worker reads unpublished rows and publishes them to the delivery queue. Consumers verify the signature, persist the `event_id`, process side effects, and mark the event complete.
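The outbox flow can be sketched in TypeScript as follows. This is a minimal in-memory model, not a Postgres or SQS integration: the `Db` shape, `transaction`, `cancelSubscription`, and `relayOnce` are hypothetical names introduced for illustration, and the "transaction" is simulated by committing a copy only when the callback succeeds.

```typescript
// Hypothetical in-memory stand-ins for the Postgres tables and the delivery queue.
type OutboxEvent = {
  eventId: string;
  eventType: string;
  payload: string;
  publishedAt: number | null; // null until the relay has enqueued the event
};

interface Db {
  subscriptions: Map<string, string>;
  outboxEvents: OutboxEvent[];
}

// Simulated transaction: changes are made on a copy and committed only if fn returns.
function transaction(db: Db, fn: (tx: Db) => void): void {
  const tx: Db = {
    subscriptions: new Map(db.subscriptions),
    outboxEvents: db.outboxEvents.map((e) => ({ ...e })),
  };
  fn(tx); // a throw here leaves `db` untouched, which models a rollback
  db.subscriptions = tx.subscriptions;
  db.outboxEvents = tx.outboxEvents;
}

// Business action and outbox row are written together or not at all.
function cancelSubscription(db: Db, subscriptionId: string, eventId: string): void {
  transaction(db, (tx) => {
    tx.subscriptions.set(subscriptionId, "canceled");
    tx.outboxEvents.push({
      eventId,
      eventType: "subscription.canceled",
      payload: JSON.stringify({ subscriptionId }),
      publishedAt: null,
    });
  });
}

// Relay worker pass: enqueue every unpublished row, then mark it published.
function relayOnce(db: Db, enqueue: (e: OutboxEvent) => void, now: number): number {
  let published = 0;
  for (const row of db.outboxEvents) {
    if (row.publishedAt !== null) continue;
    enqueue(row); // if this throws, the row stays unpublished for the next pass
    row.publishedAt = now;
    published++;
  }
  return published;
}
```

Because a row is marked only after `enqueue` returns, a crash between the two steps would publish the same event twice on the next pass. That is the at-least-once behavior the consumer side has to tolerate.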
Retry Policy
| Attempt | Delay | Notes |
|---|---:|---|
| 1 | immediate | initial delivery |
| 2 | 30s | transient failure |
| 3 | 2m | add jitter ±20% |
| 4 | 10m | alert if failure rate spikes |
| 5 | 1h | final automatic retry |
| 6 | manual replay | move to DLQ and page ops |
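The schedule above can be expressed as a small lookup function. This is a sketch under two assumptions that go beyond the table: the ±20% jitter is applied to every delay rather than only attempt 3, and the names `nextRetry`, `BASE_DELAYS_SECONDS`, and `RetryDecision` are hypothetical.

```typescript
// Base delays in seconds for automatic attempts 1-5, copied from the table.
const BASE_DELAYS_SECONDS: readonly number[] = [0, 30, 120, 600, 3600];
const JITTER_RATIO = 0.2; // assumption: +/-20% jitter on every delay

type RetryDecision =
  | { action: "retry"; delaySeconds: number }
  | { action: "dead_letter" };

// `attempt` is the 1-based number of the attempt about to be scheduled.
// `random` is injectable so the jitter can be tested deterministically.
function nextRetry(attempt: number, random: () => number = Math.random): RetryDecision {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  if (attempt > BASE_DELAYS_SECONDS.length) {
    return { action: "dead_letter" }; // attempt 6 and beyond: manual replay from the DLQ
  }
  const base = BASE_DELAYS_SECONDS[attempt - 1] ?? 0;
  const jitter = base * JITTER_RATIO * (random() * 2 - 1); // uniform in [-20%, +20%)
  return { action: "retry", delaySeconds: Math.round(base + jitter) };
}
```

Jitter spreads retries from many failed deliveries over a window, so a recovering endpoint is not hit by all of them at the same instant.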
Idempotency Table
```sql
CREATE TABLE processed_webhooks (
  event_id TEXT PRIMARY KEY,
  event_type TEXT NOT NULL,
  received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  status TEXT NOT NULL,
  response_code INT,
  idempotency_expires_at TIMESTAMPTZ
);
```
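To illustrate how the primary key and `idempotency_expires_at` might work together, here is an in-memory TypeScript model of the table. It is a sketch, not a database client: `ProcessedWebhookStore` and its methods are hypothetical, and the 7-day TTL is an assumed value. In Postgres, the `claim` step would typically correspond to an `INSERT ... ON CONFLICT (event_id) DO NOTHING` and a check of the affected row count.

```typescript
type WebhookStatus = "processing" | "completed" | "failed";

interface ProcessedWebhook {
  eventId: string;
  eventType: string;
  receivedAt: number; // epoch milliseconds
  status: WebhookStatus;
  responseCode: number | null;
  idempotencyExpiresAt: number; // epoch milliseconds
}

const IDEMPOTENCY_TTL_MS = 7 * 24 * 60 * 60 * 1000; // assumption: 7-day dedup window

class ProcessedWebhookStore {
  private rows = new Map<string, ProcessedWebhook>();

  // Returns true if the caller claimed the event, false if a live row already exists.
  claim(eventId: string, eventType: string, now: number): boolean {
    const existing = this.rows.get(eventId);
    if (existing && existing.idempotencyExpiresAt > now) return false;
    this.rows.set(eventId, {
      eventId,
      eventType,
      receivedAt: now,
      status: "processing",
      responseCode: null,
      idempotencyExpiresAt: now + IDEMPOTENCY_TTL_MS,
    });
    return true;
  }

  // Marks a claimed event complete; returns false if the event is unknown.
  complete(eventId: string, responseCode: number): boolean {
    const row = this.rows.get(eventId);
    if (!row) return false;
    row.status = "completed";
    row.responseCode = responseCode;
    return true;
  }

  get(eventId: string): ProcessedWebhook | undefined {
    return this.rows.get(eventId);
  }

  // Deletes rows whose idempotency window has passed; returns how many were removed.
  purgeExpired(now: number): number {
    let removed = 0;
    this.rows.forEach((row, id) => {
      if (row.idempotencyExpiresAt <= now) {
        this.rows.delete(id);
        removed++;
      }
    });
    return removed;
  }
}
```

The TTL is a trade-off: once a row expires, a late redelivery of the same `event_id` is treated as new, so the window should exceed the sender's longest retry horizon.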
Consumer Flow
1. Verify the HMAC signature and reject requests whose signed timestamp is more than 5 minutes old.
2. Look up `event_id` in `processed_webhooks`.
3. If the event is already marked complete, return 200 with `duplicate=true`.
4. If the event has not been seen, insert the row, process side effects inside a transaction, then mark it complete.
5. On failure, keep the row with a failed status and return a non-2xx response so the sender retries. Reprocessing is safe because writes are keyed by `event_id`.
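The five steps above can be sketched as a single handler. Several details here are assumptions rather than part of the flow itself: the signing scheme (hex HMAC-SHA256 over `<timestamp>.<rawBody>`), the 401 and 500 status codes, and the hypothetical names `signPayload`, `verifySignature`, and `handleWebhook`. A `Map` stands in for the `processed_webhooks` table.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const TOLERANCE_SECONDS = 5 * 60; // the 5-minute window from step 1

// Assumed scheme: hex(HMAC-SHA256(secret, "<timestamp>.<rawBody>")).
function signPayload(secret: string, timestamp: number, rawBody: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest("hex");
}

function verifySignature(
  secret: string, timestamp: number, rawBody: string, signature: string, nowSeconds: number,
): boolean {
  if (Math.abs(nowSeconds - timestamp) > TOLERANCE_SECONDS) return false;
  const expected = Buffer.from(signPayload(secret, timestamp, rawBody), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}

type EventState = "processing" | "completed" | "failed";

interface WebhookRequest {
  eventId: string;
  timestamp: number; // seconds since epoch, as signed by the sender
  rawBody: string;
  signature: string;
}

interface HandlerResult {
  status: number;
  duplicate: boolean;
}

function handleWebhook(
  req: WebhookRequest,
  secret: string,
  nowSeconds: number,
  states: Map<string, EventState>, // stand-in for processed_webhooks
  sideEffect: () => void,
): HandlerResult {
  // Step 1: signature and freshness.
  if (!verifySignature(secret, req.timestamp, req.rawBody, req.signature, nowSeconds)) {
    return { status: 401, duplicate: false };
  }
  // Steps 2-3: completed events are acknowledged without reprocessing.
  if (states.get(req.eventId) === "completed") {
    return { status: 200, duplicate: true };
  }
  // Step 4: record the event, run side effects, mark complete.
  states.set(req.eventId, "processing");
  try {
    sideEffect();
    states.set(req.eventId, "completed");
    return { status: 200, duplicate: false };
  } catch {
    // Step 5: keep the row so a sender retry can reprocess it.
    states.set(req.eventId, "failed");
    return { status: 500, duplicate: false };
  }
}
```

With a real database and concurrent workers, the lookup and insert would need to be a single atomic statement so that two parallel deliveries of the same event cannot both pass the check.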
Alerts
- Retry queue depth > 500 for 10 minutes
- Signature failures > 2% of requests
- DLQ count > 0 in production
- P95 processing time > 5s
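As a rough illustration, the four thresholds can be checked by one pure function over a metrics snapshot. The `WebhookMetrics` shape and `firingAlerts` are hypothetical, and two interpretations are assumed: the queue-depth rule is modeled as the minimum depth over the last 10 minutes, and the signature-failure percentage is taken relative to total requests.

```typescript
interface WebhookMetrics {
  minRetryQueueDepthLast10m: number; // lowest depth seen in the last 10 minutes
  signatureFailures: number;
  totalRequests: number;
  dlqCount: number;
  p95ProcessingSeconds: number;
}

// Returns the names of the alerts that should fire for this snapshot.
function firingAlerts(m: WebhookMetrics): string[] {
  const alerts: string[] = [];
  // If even the minimum depth exceeded 500, the queue stayed above 500 for the whole window.
  if (m.minRetryQueueDepthLast10m > 500) alerts.push("retry_queue_depth");
  // Guard against division by zero when no requests arrived.
  if (m.totalRequests > 0 && m.signatureFailures / m.totalRequests > 0.02) {
    alerts.push("signature_failures");
  }
  if (m.dlqCount > 0) alerts.push("dlq_not_empty");
  if (m.p95ProcessingSeconds > 5) alerts.push("p95_processing_time");
  return alerts;
}
```

In practice these rules usually live in the monitoring system's own configuration; the function only makes the threshold semantics explicit.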
Tips for Best Results
- 💡 Use the outbox pattern when you publish webhooks from your own system. It prevents the classic bug where the DB commit succeeds but the webhook send never happens.
- 💡 Never treat a webhook as exactly-once delivery. Design for at-least-once and make the consumer idempotent by default.
- 💡 Store the raw payload and signature headers for failed events so support can replay them without guessing what the sender actually sent.
- 💡 Return 2xx quickly and offload heavy work to a queue when possible. Slow synchronous handlers push the sender past its timeout, which triggers duplicate deliveries and timeout storms.
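The last two tips can be combined in one small sketch: accept the delivery, store the raw body and headers, acknowledge immediately, and let a worker do the heavy processing later. Everything here is illustrative and in-memory; `WebhookInbox`, its methods, and the 202 status code are assumptions, and a real system would persist deliveries to a durable queue or table.

```typescript
interface StoredDelivery {
  eventId: string;
  rawBody: string; // exactly what the sender sent, kept for replay
  headers: Record<string, string>; // includes the signature headers
  state: "queued" | "done" | "failed";
}

class WebhookInbox {
  private deliveries: StoredDelivery[] = [];

  // HTTP handler path: store and acknowledge, no heavy work here.
  accept(eventId: string, rawBody: string, headers: Record<string, string>): number {
    this.deliveries.push({ eventId, rawBody, headers: { ...headers }, state: "queued" });
    return 202;
  }

  // Worker path: process everything queued; failures keep their raw payload.
  drain(handler: (d: StoredDelivery) => void): void {
    for (const delivery of this.deliveries) {
      if (delivery.state !== "queued") continue;
      try {
        handler(delivery);
        delivery.state = "done";
      } catch {
        delivery.state = "failed";
      }
    }
  }

  failed(): StoredDelivery[] {
    return this.deliveries.filter((d) => d.state === "failed");
  }

  // Operator replay: put a failed delivery back in the queue unchanged.
  replay(eventId: string): boolean {
    const target = this.deliveries.find((d) => d.eventId === eventId && d.state === "failed");
    if (!target) return false;
    target.state = "queued";
    return true;
  }
}
```

Because the stored body and headers are untouched, a replayed delivery can go through the same signature check and idempotency lookup as the original request.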