Webhook Retry and Idempotency Design Guide

Design resilient webhook delivery and consumer handling with idempotency keys, retry policies, signature verification, and dead-letter recovery workflows.

Prompt Template

You are a senior distributed systems engineer specializing in webhook reliability. Help me design a production-ready webhook delivery and consumption system.

**Use case:** [e.g., SaaS billing events, marketplace order updates, CRM sync]
**Webhook producer or consumer:** [producer / consumer / both]
**Expected volume:** [events per minute/day]
**Payload shape:** [briefly describe fields and approximate size]
**Current pain points:** [duplicate deliveries, missing retries, out-of-order events, slow downstream services]
**Security requirements:** [HMAC signature, IP allowlist, mTLS, none yet]
**Infrastructure:** [e.g., Node.js + Postgres + SQS, Laravel + Redis, serverless]
**Downstream dependencies:** [APIs, database writes, third-party services]
**Failure tolerance:** [how long events can wait, acceptable data loss = none/low/medium]

Please provide:

1. **Reference Architecture** for reliable webhook publishing and/or consumption
2. **Idempotency Strategy** including unique event IDs, storage design, TTL, and replay behavior
3. **Retry Policy** with backoff schedule, max attempts, jitter, and terminal failure handling
4. **Security Layer** covering signature verification, timestamp tolerance, replay attack prevention, and secret rotation
5. **Ordering and Concurrency Rules** for events that may arrive out of order or be processed in parallel
6. **Dead-Letter and Replay Workflow** with operator runbook steps
7. **Observability Plan** including logs, metrics, alerts, and dashboard widgets
8. **Implementation Checklist** with common mistakes to avoid

Include pseudocode or code snippets for the stack I specify, plus a sample event table schema.

Example Output

# Webhook Reliability Blueprint

**Use case:** Subscription billing events

**Stack:** Node.js + Postgres + SQS

Architecture

The producer writes each event to an `outbox_events` table inside the same DB transaction as the business action. A relay worker reads unpublished rows and publishes them to the delivery queue. Consumers verify the signature, persist `event_id`, process side effects, and mark the event complete.
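
A minimal sketch of the relay step, assuming an in-memory array stands in for the `outbox_events` table. The `OutboxEvent` shape, the `publishedAt` field, and the `Publish` callback are hypothetical names for illustration, not part of any queue client's API. The point it illustrates: a row is marked published only after the publish call succeeds, so a failed publish is picked up again on the next pass.

```typescript
// Hypothetical shape of a row in the `outbox_events` table.
interface OutboxEvent {
  eventId: string;
  payload: string;
  publishedAt: Date | null; // null means "not yet published"
}

// Hypothetical publisher callback, standing in for a real queue client.
type Publish = (event: OutboxEvent) => Promise<void>;

// Publish every unpublished row. A row is marked published only after
// publish() resolves, so an error leaves it eligible for the next pass.
async function relayOutbox(rows: OutboxEvent[], publish: Publish): Promise<number> {
  let published = 0;
  for (const row of rows) {
    if (row.publishedAt !== null) continue; // already delivered to the queue
    try {
      await publish(row);
      row.publishedAt = new Date();
      published += 1;
    } catch {
      // Leave publishedAt as null; the next relay pass retries this row.
    }
  }
  return published;
}
```

Because the relay can crash after publishing but before marking the row, the same event may be published twice. That is why the consumer side still needs the idempotency check described later.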

Retry Policy

| Attempt | Delay | Notes |
|---|---:|---|
| 1 | immediate | initial delivery |
| 2 | 30s | transient failure |
| 3 | 2m | add jitter ±20% |
| 4 | 10m | alert if failure rate spikes |
| 5 | 1h | final automatic retry |
| 6 | manual replay | move to DLQ and page ops |
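
The schedule above could be encoded roughly as follows. This is a sketch under assumptions: the function name `delayBeforeAttempt` is hypothetical, and it applies the ±20% jitter to every retry delay, whereas the table mentions jitter only on attempt 3. The `random` parameter is injectable so the jitter can be made deterministic in tests.

```typescript
// Base delays in milliseconds before attempts 2..5, taken from the table.
// Attempt 1 is the immediate initial delivery.
const RETRY_DELAYS_MS: readonly number[] = [
  30 * 1000,      // attempt 2: 30s
  2 * 60 * 1000,  // attempt 3: 2m
  10 * 60 * 1000, // attempt 4: 10m
  60 * 60 * 1000, // attempt 5: 1h
];
const JITTER_RATIO = 0.2; // ±20% (assumed to apply to all retries)

// Returns the delay in ms before `attempt` (1-based), or null when automatic
// retries are exhausted and the event should move to the DLQ.
function delayBeforeAttempt(
  attempt: number,
  random: () => number = Math.random,
): number | null {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  if (attempt === 1) return 0;
  const base = RETRY_DELAYS_MS[attempt - 2];
  if (base === undefined) return null; // attempt 6 and beyond: manual replay
  // Scale by a factor drawn uniformly from [1 - JITTER_RATIO, 1 + JITTER_RATIO).
  const factor = 1 + JITTER_RATIO * (2 * random() - 1);
  return Math.round(base * factor);
}
```

A caller would treat a `null` result as the signal to stop retrying automatically and hand the event to the dead-letter workflow.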

Idempotency Table

CREATE TABLE processed_webhooks (
    event_id TEXT PRIMARY KEY,
    event_type TEXT NOT NULL,
    received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    status TEXT NOT NULL,
    response_code INT,
    idempotency_expires_at TIMESTAMPTZ
);
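
As a rough illustration of the duplicate check this table supports, here is an in-memory sketch where a `Map` keyed by `event_id` stands in for `processed_webhooks`. The status values (`processing`, `completed`, `failed`), the `handleOnce` function, and the `Outcome` type are assumptions made for the example; the table itself only says `status TEXT`.

```typescript
// Assumed status values; the schema does not enumerate them.
type Status = "processing" | "completed" | "failed";

interface ProcessedWebhook {
  eventId: string;
  eventType: string;
  receivedAt: Date;
  status: Status;
}

type Outcome = "processed" | "duplicate" | "failed";

// In-memory stand-in for `processed_webhooks`, keyed like its primary key.
const processed = new Map<string, ProcessedWebhook>();

// Run `sideEffect` at most once per completed event_id.
function handleOnce(eventId: string, eventType: string, sideEffect: () => void): Outcome {
  const existing = processed.get(eventId);
  if (existing?.status === "completed") return "duplicate";

  // First sighting, or a retry of an event that previously failed.
  const row: ProcessedWebhook = existing ?? {
    eventId,
    eventType,
    receivedAt: new Date(),
    status: "processing",
  };
  row.status = "processing";
  processed.set(eventId, row);

  try {
    sideEffect();
    row.status = "completed";
    return "processed";
  } catch {
    // Keep the row so a later delivery attempt can retry.
    row.status = "failed";
    return "failed";
  }
}
```

With Postgres, the lookup-then-insert shown here would need to be atomic to be safe under concurrent deliveries, for example by inserting with `ON CONFLICT (event_id) DO NOTHING` and checking whether a row was actually written.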

Consumer Flow

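1. Verify the HMAC signature over the raw request body and reject requests whose signed timestamp is more than 5 minutes old.

2. Look up `event_id` in `processed_webhooks`.

3. If the event is already marked complete, return 200 with `duplicate=true` and skip processing.

4. If the event has not been seen, insert the row, process side effects inside a transaction, then mark the event complete.

5. On failure, keep the row with a non-complete status so the next delivery attempt retries it. The retry is safe because writes are keyed by `event_id`.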
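
Step 1 of the flow above could look roughly like this in Node.js. The signing scheme is an assumption: it supposes the sender computes HMAC-SHA256 over `timestamp.rawBody` and sends the hex digest, which is a common convention but not something this guide specifies. The `sign` and `verifySignature` names are hypothetical. `createHmac` and `timingSafeEqual` are real functions from Node's built-in `crypto` module.

```typescript
import { createHmac, timingSafeEqual } from "crypto";

const TOLERANCE_SECONDS = 5 * 60; // reject requests older than 5 minutes

// Assumed scheme: HMAC-SHA256 over `${timestamp}.${rawBody}`, hex-encoded.
function sign(secret: string, timestamp: number, rawBody: string): string {
  return createHmac("sha256", secret).update(`${timestamp}.${rawBody}`).digest("hex");
}

// Returns true only if the timestamp is within tolerance and the signature matches.
function verifySignature(
  secret: string,
  timestamp: number, // Unix seconds, as sent by the producer
  rawBody: string,   // the exact bytes received, before JSON parsing
  signatureHex: string,
  nowSeconds: number = Math.floor(Date.now() / 1000),
): boolean {
  // Also rejects timestamps too far in the future, not only stale ones.
  if (Math.abs(nowSeconds - timestamp) > TOLERANCE_SECONDS) return false;

  const expected = Buffer.from(sign(secret, timestamp, rawBody), "hex");
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  if (received.length !== expected.length) return false;
  return timingSafeEqual(expected, received);
}
```

Including the timestamp in the signed string matters: otherwise an attacker could replay an old body with a fresh timestamp header and pass the age check.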

Alerts

- Retry queue depth > 500 for 10 minutes

- Signature failures > 2% of incoming requests

- DLQ count > 0 in production

- P95 processing time > 5s

Tips for Best Results

  • 💡 Use the outbox pattern when you publish webhooks from your own system. It prevents the classic bug where the DB commit succeeds but the webhook is never sent.
  • 💡 Never assume a webhook is delivered exactly once. Design for at-least-once delivery and make the consumer idempotent by default.
  • 💡 Store the raw payload and signature headers for failed events so support can replay them without guessing what the sender actually sent.
  • 💡 Return 2xx quickly and offload heavy work to a queue when possible. Slow synchronous handlers cause the sender to time out and retry, which produces duplicate deliveries and timeout storms.