SLO Error Budget Policy Builder

Design service level objectives, error budgets, burn-rate alerts, and release policies that balance reliability with product velocity.

Prompt Template

You are a senior site reliability engineer. Create an SLO and error budget policy for a production service.

**Service:** [service name and purpose]
**Users:** [internal users / customers / enterprise customers / public API consumers]
**Critical user journeys:** [login, checkout, search, API request, file upload, etc.]
**Current reliability data:** [uptime, latency, error rate, incidents, support tickets]
**Architecture:** [monolith / microservices / serverless / Kubernetes / edge]
**Monitoring stack:** [Datadog, Prometheus, Grafana, CloudWatch, etc.]
**Release cadence:** [daily / weekly / continuous deployment]
**Business tolerance:** [strict enterprise SLA / balanced / startup speed]

Produce:
1. **SLI selection** for availability, latency, correctness, and freshness where relevant.
2. **SLO targets** with rationale and measurement windows.
3. **Error budget calculation** showing allowed bad events or downtime per window.
4. **Burn-rate alerting rules** for fast burn and slow burn scenarios.
5. **Release policy**: what happens at 100%, 50%, 25%, and 0% remaining budget.
6. **Incident review loop** connecting budget spend to postmortems and roadmap items.
7. **Dashboard spec** with charts, thresholds, owners, and weekly review routine.
8. **Executive explanation** in plain language so product and leadership can understand the tradeoff.

Make the policy actionable for engineering teams, not just theoretical SRE language.

Example Output

SLO Policy: Checkout API

Critical User Journey

A shopper can submit payment and receive an order confirmation within 4 seconds.

Recommended SLIs and SLOs

| SLI | Definition | Target | Window |
|---|---|---:|---|
| Availability | Successful checkout requests / total valid checkout requests | 99.9% | 30 days |
| Latency | p95 checkout confirmation time | < 4s for 99% | 30 days |
| Correctness | Orders created without duplicate charge or missing confirmation | 99.99% | 30 days |
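
As a rough illustration of how the availability and latency SLIs could be computed from raw request records, here is a minimal Python sketch. The `Request` record and its fields are hypothetical, and the latency SLI is read here as the share of valid requests confirmed in under 4 seconds, which is only one possible interpretation of the latency target; adapt both to whatever your monitoring stack actually exports.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions for illustration."""
    valid: bool        # False for requests excluded from the SLI (e.g. malformed input)
    success: bool      # True if the checkout returned a confirmation
    latency_s: float   # time to confirmation, in seconds


def availability_sli(requests: Sequence[Request]) -> Optional[float]:
    """Successful valid requests / total valid requests (None if no valid traffic)."""
    valid = [r for r in requests if r.valid]
    if not valid:
        return None
    return sum(r.success for r in valid) / len(valid)


def latency_sli(requests: Sequence[Request], threshold_s: float = 4.0) -> Optional[float]:
    """Share of valid requests faster than threshold_s (assumed request-based reading)."""
    valid = [r for r in requests if r.valid]
    if not valid:
        return None
    return sum(r.latency_s < threshold_s for r in valid) / len(valid)


sample = [
    Request(valid=True, success=True, latency_s=1.2),
    Request(valid=True, success=True, latency_s=5.0),
    Request(valid=True, success=False, latency_s=0.3),
    Request(valid=False, success=False, latency_s=0.1),  # excluded from both SLIs
]
print(availability_sli(sample), latency_sli(sample))
```

Invalid requests are dropped before either ratio is taken, so client mistakes do not spend the error budget.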

Error Budget

At 99.9% availability, the service can have 0.1% bad checkout requests in 30 days. If monthly checkout volume is 2,000,000 requests, the budget is 2,000 failed valid requests.
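
The arithmetic above can be sketched in a few lines of Python. The function name and the time-based equivalent are illustrative additions; the downtime figure assumes traffic is spread evenly across the window, which real checkout traffic rarely is.

```python
def error_budget(slo_target: float, total_events: int, window_days: int = 30) -> dict:
    """Return the allowed bad events and a time-based equivalent for an SLO target."""
    budget_fraction = 1.0 - slo_target  # e.g. 0.001 for a 99.9% target
    window_minutes = window_days * 24 * 60
    return {
        # Rounded to a whole event to absorb floating-point noise.
        "allowed_bad_events": round(total_events * budget_fraction),
        # Equivalent full-outage minutes, assuming evenly distributed traffic.
        "allowed_downtime_minutes": round(window_minutes * budget_fraction, 1),
    }


budget = error_budget(slo_target=0.999, total_events=2_000_000)
print(budget)
```

For the 99.9% target and 2,000,000 monthly requests this yields 2,000 allowed bad requests, or roughly 43.2 minutes of full outage over 30 days.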

Burn-Rate Alerts

- Fast burn: 2% of monthly budget consumed in 1 hour, page on-call.

- Slow burn: 10% consumed in 24 hours, Slack alert plus owner review.
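
To turn these two rules into alert thresholds, it helps to convert "share of budget consumed in a given time" into a burn rate and then into an error-rate threshold. The sketch below assumes a 30-day window and the 99.9% availability target; the function names are illustrative, not part of any monitoring tool.

```python
def burn_rate(budget_consumed: float, alert_hours: float, window_days: int = 30) -> float:
    """Burn rate = how many times faster than 'exactly on budget' the budget is spent.

    A burn rate of 1 spends the whole budget in exactly one SLO window.
    """
    window_hours = window_days * 24
    return budget_consumed * window_hours / alert_hours


def error_rate_threshold(rate: float, slo_target: float) -> float:
    """Error rate over the alert window that corresponds to the given burn rate."""
    return rate * (1.0 - slo_target)


fast = burn_rate(budget_consumed=0.02, alert_hours=1)    # fast-burn rule
slow = burn_rate(budget_consumed=0.10, alert_hours=24)   # slow-burn rule

for name, rate in (("fast", fast), ("slow", slow)):
    threshold = error_rate_threshold(rate, slo_target=0.999)
    print(f"{name} burn: rate {rate:.1f}, alert when error rate exceeds {threshold:.2%}")
```

Under these assumptions the fast-burn rule corresponds to a burn rate of 14.4 (about a 1.44% error rate sustained over 1 hour) and the slow-burn rule to a burn rate of 3 (about 0.3% over 24 hours).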

Release Policy

- 100-50% of budget remaining: normal releases.

- 50-25% remaining: every release requires a rollback plan and enhanced monitoring.

- 25-0% remaining: freeze risky changes; reliability work takes priority.

- 0% remaining: only emergency fixes until root causes are mitigated.
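
A policy like this is easier to enforce when it is encoded as a simple lookup that a deploy pipeline or dashboard can call. The following is a minimal sketch; the function names are hypothetical, and the handling of exact boundaries (50% counts as normal, 25% as the guarded tier) is an assumption, since the tiers above do not say which side a boundary falls on.

```python
def budget_remaining(bad_events: int, allowed_bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    return 1.0 - bad_events / allowed_bad_events


def release_policy(remaining: float) -> str:
    """Map remaining budget (1.0 = untouched, 0.0 = exhausted) to a release rule."""
    if remaining <= 0.0:
        return "emergency fixes only"
    if remaining < 0.25:
        return "freeze risky changes"
    if remaining < 0.50:
        return "rollback plan and enhanced monitoring"
    return "normal releases"


# Example: 1,200 failed requests against a budget of 2,000 leaves 40% remaining.
remaining = budget_remaining(bad_events=1_200, allowed_bad_events=2_000)
print(f"{remaining:.0%} remaining -> {release_policy(remaining)}")
```

Keeping the thresholds in one function means the policy can be reviewed and changed in a single place when the team renegotiates it.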

Plain-English Leadership Summary

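The error budget is our reliability allowance: the amount of failure we agree customers can tolerate in a 30-day window. If we spend it too quickly, customers are feeling pain, so the team shifts from shipping features to stability work until the budget recovers and trust is restored.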

Tips for Best Results

- Start with user journeys, not infrastructure metrics; customers care whether checkout works, not whether pod CPU looks tidy.
- Avoid unrealistic 99.999% targets unless the business is ready to fund that level of reliability.
- Pair every alert with an owner and action; dashboards without decision rules become wallpaper.
- Use error budgets to negotiate product velocity calmly instead of debating reliability by vibes.