Background Job Queue Design Guide
Design a reliable background job queue system with retries, idempotency, scheduling, observability, and failure handling.
Prompt Template
Act as a senior backend engineer. Design a background job queue architecture for [application/product] that needs to process [job types].
Application context: [SaaS app, marketplace, internal tool, ecommerce platform, AI workflow, etc.]
Job examples: [emails, webhooks, imports, exports, billing sync, media processing, notifications]
Expected volume: [jobs per minute/hour/day, burst patterns]
Latency requirements: [real-time, near-real-time, overnight batch, SLA]
Current stack: [language, framework, database, queue/broker, hosting]
Failure risks: [duplicate work, partial completion, external API limits, long-running jobs, poison messages]
Operational constraints: [team size, budget, compliance, observability tooling]
Deliver:
1. **Queue architecture recommendation**: broker, worker model, storage, and deployment topology
2. **Job taxonomy**: priority classes, payload shape, ownership, and timeout rules
3. **Retry and backoff policy**: transient vs permanent failures, max attempts, jitter, and dead-letter queue
4. **Idempotency design**: dedupe keys, state transitions, locks, and safe replays
5. **Scheduling model**: delayed jobs, recurring jobs, cron replacement, and timezone handling
6. **Scaling plan**: worker concurrency, autoscaling triggers, rate limits, and backpressure
7. **Observability**: metrics, logs, tracing, alerts, dashboards, and runbooks
8. **Testing strategy**: unit, integration, chaos, and replay tests
9. **Migration plan**: how to introduce the queue without breaking existing synchronous flows
Include concrete implementation notes for [preferred framework or queue tool].
Example Output
Background Job Queue Design: B2B Billing Sync
**Recommendation:** Use Sidekiq with Redis for near-real-time billing sync jobs, backed by PostgreSQL idempotency records. Separate queues: `critical`, `default`, `billing-sync`, and `bulk-import`.
Retry policy
- External API timeout: retry 8 times with exponential backoff and jitter.
- 4xx validation error: mark permanent failure after one attempt.
- Rate limit: respect `Retry-After`, pause the vendor-specific queue, and emit an alert if the queue stays paused for more than 10 minutes.
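The retry rules above could be sketched roughly as follows. This is plain Python for illustration, not Sidekiq configuration. The failure labels, the 2-second base delay, the 10-minute cap, the choice of "full jitter", and dead-lettering after the eighth retry are all assumptions that the policy itself does not specify.

```python
import random
from typing import Optional, Tuple

MAX_TIMEOUT_RETRIES = 8      # "retry 8 times" from the policy
BASE_DELAY_SECONDS = 2.0     # assumed base delay
MAX_DELAY_SECONDS = 600.0    # assumed upper bound on any single delay


def backoff_delay(retries_done: int, rng: random.Random) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**retries_done)]."""
    ceiling = min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * (2 ** retries_done))
    return rng.uniform(0.0, ceiling)


def next_action(
    failure: str,
    retries_done: int,
    retry_after: Optional[float] = None,
    rng: Optional[random.Random] = None,
) -> Tuple[str, Optional[float]]:
    """Map a failure to (action, delay_seconds). Failure labels are hypothetical."""
    if rng is None:
        rng = random.Random()
    if failure == "validation_4xx":
        # Permanent: retrying the same payload cannot succeed.
        return ("fail_permanent", None)
    if failure == "rate_limited":
        # Prefer the vendor's Retry-After value; fall back to backoff if absent.
        delay = retry_after if retry_after is not None else backoff_delay(retries_done, rng)
        return ("pause_queue", delay)
    if failure == "timeout":
        if retries_done >= MAX_TIMEOUT_RETRIES:
            return ("dead_letter", None)
        return ("retry", backoff_delay(retries_done, rng))
    # Unknown failures go to the dead-letter queue for review (assumption).
    return ("dead_letter", None)
```

For example, `next_action("timeout", 3)` returns `"retry"` with a delay somewhere between 0 and 16 seconds, while `next_action("rate_limited", 0, retry_after=30.0)` pauses the queue for exactly the 30 seconds the vendor asked for.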
Idempotency
Use `job_type + account_id + external_invoice_id` as the dedupe key. Record states: `queued`, `processing`, `succeeded`, `failed_permanent`, `dead_lettered`. Workers must check whether the invoice has already synced before making external calls.
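A minimal sketch of the dedupe key and state check, assuming an in-memory dictionary as a stand-in for the PostgreSQL idempotency table. The class and function names, the `:` separator, and the allowed-transition map are hypothetical; a real table would more likely enforce uniqueness with a database constraint.

```python
from typing import Callable, Dict, Optional, Set

# Assumed transition map; the source lists the states but not the edges.
ALLOWED: Dict[str, Set[str]] = {
    "queued": {"processing"},
    "processing": {"queued", "succeeded", "failed_permanent", "dead_lettered"},
    "succeeded": set(),
    "failed_permanent": set(),
    "dead_lettered": {"queued"},  # manual replay after review (assumption)
}


def dedupe_key(job_type: str, account_id: int, external_invoice_id: str) -> str:
    """Build the key from job_type + account_id + external_invoice_id."""
    return f"{job_type}:{account_id}:{external_invoice_id}"


class IdempotencyStore:
    """In-memory stand-in for a table with a unique index on the dedupe key."""

    def __init__(self) -> None:
        self._states: Dict[str, str] = {}

    def enqueue(self, key: str) -> bool:
        """Record a new job; return False if the key is already known."""
        if key in self._states:
            return False
        self._states[key] = "queued"
        return True

    def state(self, key: str) -> Optional[str]:
        return self._states.get(key)

    def transition(self, key: str, new_state: str) -> None:
        current = self._states[key]
        if new_state not in ALLOWED[current]:
            raise ValueError(f"illegal transition {current} -> {new_state}")
        self._states[key] = new_state


def run_sync(store: IdempotencyStore, key: str, call_vendor: Callable[[], None]) -> str:
    """Check the recorded state before making the external call."""
    state = store.state(key)
    if state == "succeeded":
        return "already_synced"
    if state != "queued":
        return "not_runnable"
    store.transition(key, "processing")
    try:
        call_vendor()
    except Exception:
        store.transition(key, "queued")  # hand the job back for a later retry
        raise
    store.transition(key, "succeeded")
    return "synced"
```

Replaying the same job after success returns `"already_synced"` without calling the vendor again, which is the property that makes retries safe.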
Observability
Dashboard: queue depth, job age p95, success rate, retry rate, dead-letter count, vendor API latency, and worker saturation. Page on-call when critical jobs are older than 10 minutes or dead-letter count increases by more than 20/hour.
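The paging rule above can be expressed as a small check. This sketch assumes the metrics are already collected elsewhere; the function names, argument names, and the nearest-rank method for p95 are illustrative choices, not part of any monitoring tool's API.

```python
from typing import List, Sequence

CRITICAL_MAX_AGE_SECONDS = 10 * 60        # "older than 10 minutes"
DEAD_LETTER_MAX_INCREASE_PER_HOUR = 20    # "more than 20/hour"


def p95(values: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integers
    return ordered[rank - 1]


def page_reasons(
    oldest_critical_job_age_seconds: float,
    dead_letter_count_now: int,
    dead_letter_count_hour_ago: int,
) -> List[str]:
    """Return the reasons to page on-call; an empty list means no page."""
    reasons: List[str] = []
    if oldest_critical_job_age_seconds > CRITICAL_MAX_AGE_SECONDS:
        reasons.append("critical_job_age")
    increase = dead_letter_count_now - dead_letter_count_hour_ago
    if increase > DEAD_LETTER_MAX_INCREASE_PER_HOUR:
        reasons.append("dead_letter_growth")
    return reasons
```

Checking the oldest critical job rather than queue depth catches a single stuck job even when the queue looks nearly empty.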
Migration
Start by moving invoice emails to async jobs, then billing sync, then bulk imports after replay tests pass.
Tips for Best Results
- Separate long-running bulk jobs from user-facing jobs so one queue cannot starve the other.
- Design idempotency before retries; retries without dedupe repeat side effects such as duplicate charges or duplicate emails.
- Include a dead-letter review workflow, not just a dead-letter queue.
- Use job age and queue depth together; depth alone can hide stuck high-priority work.