Scheduled Job Observability Runbook Builder

Design observability, alerting, retries, ownership, and incident runbooks for cron jobs, scheduled tasks, and recurring background workflows.

Prompt Template

You are a senior SRE helping an engineering team make scheduled jobs observable and reliable. Build a runbook for the scheduled workflows below.

Application or system: [what the app does]
Scheduled jobs: [job names, purpose, frequency, expected duration]
Scheduler or platform: [cron, Kubernetes CronJob, GitHub Actions, Airflow, Temporal, Sidekiq, Celery beat, serverless scheduler]
Stack and hosting: [language, framework, cloud, containers, database, queue]
Business impact if late or failed: [billing delay, stale reports, missed emails, compliance risk, data sync, customer impact]
Current visibility: [logs, metrics, traces, dashboard, alerts, none]
Failure modes seen: [silent failure, partial success, duplicate runs, timeout, stuck lock, bad data, missed schedule]
Retry and idempotency state: [manual retry, automatic retry, idempotent, unknown, dangerous side effects]
Dependencies: [database, APIs, files, warehouses, email/SMS, payment providers]
Data volume and runtime pattern: [rows, tenants, batches, peak days, seasonal spikes]
Ownership model: [team, on-call, escalation, product owner, compliance owner]
SLO or freshness target: [deadline, max lateness, acceptable miss rate]
Incident constraints: [customer comms, data repair, audit evidence, rollback, rate limits]

Produce:
1. Job inventory table with owner, schedule, deadline, dependencies, impact, and risk level.
2. Success, lateness, partial failure, and duplicate-run definitions for each job type.
3. Metrics, logs, traces, heartbeat events, and correlation IDs to add.
4. Alert thresholds that avoid both silent failures and noisy false positives.
5. Dashboard layout for scheduled job health and freshness.
6. Retry, replay, lock, and idempotency checklist.
7. Step-by-step incident runbook for missed, failed, stuck, duplicate, and slow jobs.
8. Backfill and data repair procedure with validation checks.
9. Test plan for scheduler changes, daylight saving time, time zones, dependency failures, and deploy overlap.
10. Rollout plan with owners and priority order.

Make the runbook calm, operational, and specific enough for an on-call engineer to use at 3 AM.

Example Output

Job Inventory

|---|---|---|---|---|

Signals to Add

- job_started, job_completed, job_failed, job_partial, and job_skipped events with job_id and scheduled_at.

- Runtime histogram, processed count, failed count, retry count, and freshness lag.

- Lock acquisition and lock age metrics for singleton jobs.
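
As a rough illustration of the signals above, the following Python sketch wraps a job run and emits the lifecycle events as JSON lines. The helper names `emit` and `run_job`, the field names, and the rule for choosing between `job_completed`, `job_partial`, `job_failed`, and `job_skipped` are all assumptions made for this example, not part of any particular scheduler or logging library. Retry counts and histograms are left out to keep it short.

```python
import json
import time
from datetime import datetime, timezone


def emit(event, job_id, scheduled_at, **fields):
    """Print one structured event as a JSON line and return it (hypothetical schema)."""
    record = {
        "event": event,
        "job_id": job_id,
        "scheduled_at": scheduled_at.isoformat(),
        **fields,
    }
    print(json.dumps(record, sort_keys=True))
    return record


def run_job(job_id, scheduled_at, items, process):
    """Run `process` over `items` and emit start and outcome events.

    `scheduled_at` is assumed to be a timezone-aware datetime.
    """
    items = list(items)
    emit("job_started", job_id, scheduled_at)
    started = time.monotonic()
    processed = failed = 0
    for item in items:
        try:
            process(item)
            processed += 1
        except Exception:
            # A real job would also log the error with the same job_id.
            failed += 1

    # Assumed outcome rules: nothing to do -> skipped, all ok -> completed,
    # all failed -> failed, otherwise partial.
    if not items:
        event = "job_skipped"
    elif failed == 0:
        event = "job_completed"
    elif processed == 0:
        event = "job_failed"
    else:
        event = "job_partial"

    lag = (datetime.now(timezone.utc) - scheduled_at).total_seconds()
    return emit(
        event,
        job_id,
        scheduled_at,
        processed_count=processed,
        failed_count=failed,
        runtime_seconds=round(time.monotonic() - started, 3),
        freshness_lag_seconds=round(lag, 1),
    )
```

Carrying `job_id` and `scheduled_at` on every event is what lets a dashboard or alert join the start and outcome of the same run.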

Missed Job Runbook

1. Confirm whether the scheduler fired.

2. Check lock age and last successful heartbeat.

3. Inspect dependency errors and recent deploys.

4. Run a dry run, or a retry limited to a few tenants, only once idempotency is confirmed.

5. Validate row counts and freshness before closing the incident.
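
The first two steps of this runbook can be sketched as a small triage function. This is a minimal Python example under stated assumptions: the function name `triage`, its parameters, and the returned labels are hypothetical, and the timestamps are assumed to come from the heartbeat events and lock metrics the team has already added.

```python
from datetime import datetime, timedelta, timezone


def triage(now, expected_at, last_started_at, last_success_at,
           lock_age, max_runtime, grace):
    """Classify the state of the run that was due at `expected_at`.

    All datetimes are assumed timezone-aware; `lock_age`, `max_runtime`,
    and `grace` are timedeltas (`lock_age` may be None if no lock is held).
    """
    # A success heartbeat at or after the expected time means the run is done.
    if last_success_at is not None and last_success_at >= expected_at:
        return "ok"
    # Still inside the grace window: too early to call it late.
    if now <= expected_at + grace:
        return "pending"
    # No start heartbeat for this run: the scheduler probably never fired.
    if last_started_at is None or last_started_at < expected_at:
        return "missed"
    # The run started, but the lock has outlived the maximum expected runtime.
    if lock_age is not None and lock_age > max_runtime:
        return "stuck"
    # Started, no success, no stale lock: check job and dependency errors next.
    return "failed_or_running"
```

For instance, a job due at 02:00 with no start heartbeat by 02:30 and a 10-minute grace window would be classified as `missed`, which points the on-call engineer at the scheduler rather than at the job code.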

Tips for Best Results

💡 Define what late means for each job; a five-minute delay and a six-hour stale report are different incidents.
💡 Add heartbeats for success, not only failure logs, so silent scheduler failures are visible.
💡 Confirm idempotency before telling operators to retry or backfill.
💡 Test daylight saving time and deploy overlap if schedules use local time or singleton locks.
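
To make the daylight saving time tip concrete, here is a small Python sketch that checks whether a local wall-clock schedule time is skipped or repeated on a given date. The function name `classify_local_time` is hypothetical, and the example assumes Python 3.9+ with the standard `zoneinfo` module and an installed time zone database; the `America/New_York` dates in the usage note are only illustrative.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def classify_local_time(naive, tz_name):
    """Return "normal", "skipped", or "ambiguous" for a naive local time."""
    tz = ZoneInfo(tz_name)
    # fold=0 and fold=1 give different UTC offsets only around a transition.
    offsets = {naive.replace(tzinfo=tz, fold=f).utcoffset() for f in (0, 1)}
    if len(offsets) == 1:
        return "normal"
    # Round-trip through UTC: a wall time that falls in a spring-forward gap
    # does not survive the trip, while a repeated (fall-back) time does.
    aware = naive.replace(tzinfo=tz, fold=0)
    round_trip = aware.astimezone(timezone.utc).astimezone(tz)
    if round_trip.replace(tzinfo=None) == naive:
        return "ambiguous"
    return "skipped"
```

With a 2024 time zone database, `classify_local_time(datetime(2024, 3, 10, 2, 30), "America/New_York")` falls in the spring-forward gap, so a job scheduled for 02:30 local time may not run that day, while `datetime(2024, 11, 3, 1, 30)` occurs twice and may run twice. A test plan can loop this check over each job's local schedule time for the year's transition dates.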
