Kubernetes Deployment Troubleshooting Runbook Builder

Generate a practical Kubernetes runbook for diagnosing failed rollouts, crash loops, image pulls, probes, config errors, and safe rollback paths.

Prompt Template

Act as a senior SRE helping an engineering team troubleshoot a Kubernetes deployment issue. Build a step-by-step runbook for [symptom, e.g., CrashLoopBackOff, ImagePullBackOff, rollout stuck, 503 errors] in [environment/cluster].

Service/app: [service name]
Namespace: [namespace]
Deployment method: [Helm/Kustomize/Argo CD/GitHub Actions/manual kubectl]
Recent changes: [image/config/secret/ingress/resource change]
Observed errors/log snippets: [errors]
Blast radius and urgency: [impact]
Access constraints: [what commands/tools are available]

Structure the runbook with:
1. **Triage summary** — likely failure classes and first checks
2. **Safe read-only commands** — kubectl commands to inspect rollout, pods, events, probes, images, resources, secrets/config refs, ingress/service endpoints
3. **Decision tree** — if you see X, check Y next
4. **Root-cause hypotheses** — ranked by probability and evidence needed
5. **Rollback plan** — safest rollback options for the deployment method
6. **Fix-forward options** — config, image, resources, probes, or dependency fixes
7. **Communication template** — stakeholder update during incident
8. **Prevention checklist** — CI/CD, probes, alerts, manifests, and release gates

Flag destructive commands clearly and ask for confirmation before suggesting any delete, scale-down, or production mutation.

Example Output

1. Triage Summary

Symptom: `CrashLoopBackOff` after image `api:2026.05.11-1432` rolled out to `payments-api` in `prod`. Most likely causes: missing env var, failed DB migration compatibility, memory limit too low, or startup probe timeout.

2. Safe Read-Only Commands

kubectl -n prod rollout status deploy/payments-api

kubectl -n prod describe deploy payments-api

kubectl -n prod get pods -l app=payments-api -o wide

kubectl -n prod describe pod <pod-name>

kubectl -n prod logs <pod-name> --previous --tail=120

kubectl -n prod get events --sort-by=.lastTimestamp | tail -40

Decision Tree

- If logs show `Missing PAYMENT_GATEWAY_KEY`, verify Secret name and envFrom references.

- If pod exits with code 137, inspect memory requests/limits and recent traffic.

- If readiness probe fails only, compare startup time against probe thresholds.

Rollback

For Helm: `helm -n prod history payments-api`, then prepare `helm rollback payments-api <revision>` after incident lead approval.

Tips for Best Results

  • 💡Paste the exact Kubernetes status, events, and recent deployment diff for a sharper runbook.
  • 💡Ask for read-only commands first when you are in production incident mode.
  • 💡Include your deployment tool because Helm, Argo CD, and raw kubectl have different rollback paths.
  • 💡Have the model separate diagnosis, rollback, and prevention so responders do not mix them under stress.