Background Job Queue Design Guide

Design a reliable background job queue system with retries, idempotency, scheduling, observability, and failure handling.

Prompt Template

Act as a senior backend engineer. Design a background job queue architecture for [application/product] that needs to process [job types].

Application context: [SaaS app, marketplace, internal tool, ecommerce platform, AI workflow, etc.]
Job examples: [emails, webhooks, imports, exports, billing sync, media processing, notifications]
Expected volume: [jobs per minute/hour/day, burst patterns]
Latency requirements: [real-time, near-real-time, overnight batch, SLA]
Current stack: [language, framework, database, queue/broker, hosting]
Failure risks: [duplicate work, partial completion, external API limits, long-running jobs, poison messages]
Operational constraints: [team size, budget, compliance, observability tooling]

Deliver:
1. **Queue architecture recommendation** — broker, worker model, storage, and deployment topology
2. **Job taxonomy** — priority classes, payload shape, ownership, and timeout rules
3. **Retry and backoff policy** — transient vs permanent failures, max attempts, jitter, and dead-letter queue
4. **Idempotency design** — dedupe keys, state transitions, locks, and safe replays
5. **Scheduling model** — delayed jobs, recurring jobs, cron replacement, and timezone handling
6. **Scaling plan** — worker concurrency, autoscaling triggers, rate limits, and backpressure
7. **Observability** — metrics, logs, tracing, alerts, dashboards, and runbooks
8. **Testing strategy** — unit, integration, chaos, and replay tests
9. **Migration plan** — how to introduce the queue without breaking existing synchronous flows

Include concrete implementation notes for [preferred framework or queue tool].

Example Output

Background Job Queue Design: B2B Billing Sync

**Recommendation:** Use Sidekiq with Redis for near-real-time billing sync jobs, backed by PostgreSQL idempotency records. Separate queues: `critical`, `default`, `billing-sync`, and `bulk-import`.

Retry policy

- External API timeout: retry 8 times with exponential backoff and jitter.

- 4xx validation error (for example 400 or 422): mark as a permanent failure after one attempt. A 429 is handled as a rate limit instead.

- Rate limit: respect `Retry-After`, pause the vendor-specific queue, and emit an alert if the queue stays paused for more than 10 minutes.
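
The retry rules above can be sketched as a small decision function. This is a minimal, framework-agnostic illustration in Python, not Sidekiq's API. The names `classify`, `backoff_delay`, and `next_action` are hypothetical, as are the base delay of 2 seconds and the 600-second cap. Treating 5xx responses as transient is also an assumption; the policy above only names timeouts.

```python
import random
from typing import Optional, Tuple

MAX_RETRIES = 8  # "retry 8 times" from the policy above


def classify(status_code: Optional[int]) -> str:
    """Map an attempt's outcome to a retry class. None means the call timed out."""
    if status_code is None or status_code >= 500:  # 5xx as transient is an assumption
        return "transient"
    if status_code == 429:
        return "rate_limited"
    if 400 <= status_code < 500:
        return "permanent"
    return "ok"


def backoff_delay(retries_done: int, base: float = 2.0, cap: float = 600.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**n)]."""
    ceiling = min(cap, base * (2 ** retries_done))
    return random.uniform(0, ceiling)


def next_action(
    status_code: Optional[int],
    retries_done: int,
    retry_after: Optional[float] = None,
) -> Tuple[str, float]:
    """Return (action, delay_seconds) after an attempt finishes."""
    kind = classify(status_code)
    if kind == "ok":
        return ("done", 0.0)
    if kind == "permanent":
        return ("fail_permanent", 0.0)
    if kind == "rate_limited":
        # Prefer the vendor's Retry-After value; fall back to backoff if it is absent.
        delay = float(retry_after) if retry_after is not None else backoff_delay(retries_done)
        return ("pause_queue", delay)
    if retries_done >= MAX_RETRIES:
        return ("dead_letter", 0.0)
    return ("retry", backoff_delay(retries_done))
```

For example, `next_action(None, 3)` returns `"retry"` with a delay somewhere between 0 and 16 seconds, while `next_action(None, 8)` sends the job to the dead-letter queue.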

Idempotency

Use `job_type + account_id + external_invoice_id` as the dedupe key. Record states: `queued`, `processing`, `succeeded`, `failed_permanent`, `dead_lettered`. Workers must check whether the invoice has already been synced before making external calls.
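
A rough sketch of the dedupe key and state checks, assuming an in-memory dictionary as a stand-in for the PostgreSQL idempotency table. The class `IdempotencyStore`, the function `run_sync`, and the allowed-transition map are hypothetical; the map is one plausible reading of the states listed above, including the assumption that a dead-lettered job can be re-queued for replay.

```python
from typing import Callable, Dict, Optional

# Allowed transitions between the record states named above (an assumed mapping).
TRANSITIONS = {
    "queued": {"processing"},
    "processing": {"succeeded", "queued", "failed_permanent", "dead_lettered"},
    "succeeded": set(),
    "failed_permanent": set(),
    "dead_lettered": {"queued"},  # assumption: a manual replay re-queues the job
}


def dedupe_key(job_type: str, account_id: str, external_invoice_id: str) -> str:
    """Build the dedupe key from the three fields in the design."""
    return f"{job_type}:{account_id}:{external_invoice_id}"


class IdempotencyStore:
    """In-memory stand-in for a table with a unique index on the dedupe key."""

    def __init__(self) -> None:
        self._states: Dict[str, str] = {}

    def enqueue(self, key: str) -> bool:
        """Return True if the job is new, False if a record already exists."""
        if key in self._states:
            return False
        self._states[key] = "queued"
        return True

    def state(self, key: str) -> Optional[str]:
        return self._states.get(key)

    def transition(self, key: str, new_state: str) -> None:
        current = self._states[key]
        if new_state not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current} -> {new_state}")
        self._states[key] = new_state


def run_sync(store: IdempotencyStore, key: str, call_vendor: Callable[[], None]) -> str:
    """Skip the external call when the invoice has already been synced."""
    if store.state(key) == "succeeded":
        return "skipped"
    store.transition(key, "processing")  # raises if another worker holds the job
    try:
        call_vendor()
    except Exception:
        store.transition(key, "queued")  # hand the job back for a later retry
        raise
    store.transition(key, "succeeded")
    return "synced"
```

A replay of the same key returns `"skipped"` without calling the vendor again. A crash between the vendor call and the `succeeded` write can still cause a second call, so this check does not replace an idempotency key on the vendor side where one is available.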

Observability

Dashboard: queue depth, job age p95, success rate, retry rate, dead-letter count, vendor API latency, and worker saturation. Page on-call when critical jobs are older than 10 minutes or dead-letter count increases by more than 20/hour.
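
The two paging conditions can be expressed as a simple rule check. This is an illustrative sketch only: `QueueSnapshot` and `paging_reasons` are hypothetical names, and in practice the rule would more likely live in the alerting tool's own configuration.

```python
from dataclasses import dataclass
from typing import List

CRITICAL_MAX_AGE_SECONDS = 10 * 60   # "older than 10 minutes"
DLQ_MAX_INCREASE_PER_HOUR = 20       # "more than 20/hour"


@dataclass
class QueueSnapshot:
    """Values a metrics system would supply (field names are assumptions)."""
    oldest_critical_job_age_seconds: float
    dead_letter_count_now: int
    dead_letter_count_one_hour_ago: int


def paging_reasons(snapshot: QueueSnapshot) -> List[str]:
    """Return the reasons to page on-call; an empty list means no page."""
    reasons = []
    if snapshot.oldest_critical_job_age_seconds > CRITICAL_MAX_AGE_SECONDS:
        reasons.append("critical_job_age")
    growth = snapshot.dead_letter_count_now - snapshot.dead_letter_count_one_hour_ago
    if growth > DLQ_MAX_INCREASE_PER_HOUR:
        reasons.append("dead_letter_growth")
    return reasons
```

Returning the reasons rather than a bare boolean lets the page say which threshold was crossed.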

Migration

Start by moving invoice emails to async jobs, then billing sync, then bulk imports after replay tests pass.

Tips for Best Results

  • Separate long-running bulk jobs from user-facing jobs so one queue cannot starve the other.
  • Design idempotency before retries; retries without dedupe create expensive ghosts in the machine.
  • Include a dead-letter review workflow, not just a dead-letter queue.
  • Use job age and queue depth together; depth alone can hide stuck high-priority work.