Kubernetes Deployment Troubleshooting Runbook Builder
Generate a practical Kubernetes runbook for diagnosing failed rollouts, crash loops, image pulls, probes, config errors, and safe rollback paths.
Prompt Template
Act as a senior SRE helping an engineering team troubleshoot a Kubernetes deployment issue. Build a step-by-step runbook for [symptom, e.g., CrashLoopBackOff, ImagePullBackOff, rollout stuck, 503 errors] in [environment/cluster]. Service/app: [service name] Namespace: [namespace] Deployment method: [Helm/Kustomize/Argo CD/GitHub Actions/manual kubectl] Recent changes: [image/config/secret/ingress/resource change] Observed errors/log snippets: [errors] Blast radius and urgency: [impact] Access constraints: [what commands/tools are available] Structure the runbook with: 1. **Triage summary** — likely failure classes and first checks 2. **Safe read-only commands** — kubectl commands to inspect rollout, pods, events, probes, images, resources, secrets/config refs, ingress/service endpoints 3. **Decision tree** — if you see X, check Y next 4. **Root-cause hypotheses** — ranked by probability and evidence needed 5. **Rollback plan** — safest rollback options for the deployment method 6. **Fix-forward options** — config, image, resources, probes, or dependency fixes 7. **Communication template** — stakeholder update during incident 8. **Prevention checklist** — CI/CD, probes, alerts, manifests, and release gates Flag destructive commands clearly and ask for confirmation before suggesting any delete, scale-down, or production mutation.
Example Output
1. Triage Summary
Symptom: `CrashLoopBackOff` after image `api:2026.05.11-1432` rolled out to `payments-api` in `prod`. Most likely causes: missing env var, failed DB migration compatibility, memory limit too low, or startup probe timeout.
2. Safe Read-Only Commands
kubectl -n prod rollout status deploy/payments-api
kubectl -n prod describe deploy payments-api
kubectl -n prod get pods -l app=payments-api -o wide
kubectl -n prod describe pod <pod-name>
kubectl -n prod logs <pod-name> --previous --tail=120
kubectl -n prod get events --sort-by=.lastTimestamp | tail -40
Decision Tree
- If logs show `Missing PAYMENT_GATEWAY_KEY`, verify Secret name and envFrom references.
- If pod exits with code 137, inspect memory requests/limits and recent traffic.
- If readiness probe fails only, compare startup time against probe thresholds.
Rollback
For Helm: `helm -n prod history payments-api`, then prepare `helm rollback payments-api <revision>` after incident lead approval.
Tips for Best Results
- 💡Paste the exact Kubernetes status, events, and recent deployment diff for a sharper runbook.
- 💡Ask for read-only commands first when you are in production incident mode.
- 💡Include your deployment tool because Helm, Argo CD, and raw kubectl have different rollback paths.
- 💡Have the model separate diagnosis, rollback, and prevention so responders do not mix them under stress.
Related Prompts
CI/CD Pipeline Configuration Generator
Generate production-ready CI/CD pipeline configurations for GitHub Actions, GitLab CI, or other platforms with testing, linting, building, and deployment stages.
Docker Containerization and Deployment Guide
Generate a complete Docker setup for any application including Dockerfile, docker-compose configuration, multi-stage builds, and production deployment best practices.
Incident Postmortem Template Builder
Generate a blameless engineering postmortem with timeline, root cause analysis, and follow-up actions after a production incident.