Runbook: Error Budget Exhaustion and SLO Alerts
Trigger
- Prometheus alerts StudyAppErrorBudgetExhausted, StudyAppErrorBudgetCritical, StudyAppFast5xxBurn, StudyAppHighHttp5xxRate, StudyAppLatencySLOViolation, or StudyAppReadyProbeFailing are firing.
- Grafana shows error budget burn over 1, a high 5xx rate, or readiness at 0.
- Users see an outage or you breach an external SLA at the same time.
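To confirm which of the trigger alerts are currently firing without opening the UI, the Prometheus HTTP API endpoint /api/v1/alerts can be queried. The following is a minimal sketch, assuming Prometheus is reachable at the local default http://127.0.0.1:9090 listed under Useful links; the helper names firing_triggers and fetch_alerts are illustrative and not part of the project.

```python
import json
import urllib.request

# Alert names copied from the Trigger list in this runbook.
TRIGGER_ALERTS = {
    "StudyAppErrorBudgetExhausted",
    "StudyAppErrorBudgetCritical",
    "StudyAppFast5xxBurn",
    "StudyAppHighHttp5xxRate",
    "StudyAppLatencySLOViolation",
    "StudyAppReadyProbeFailing",
}


def firing_triggers(payload: dict) -> list[str]:
    """Return the sorted names of runbook alerts whose state is 'firing'.

    `payload` is the decoded JSON body of GET /api/v1/alerts.
    """
    alerts = payload.get("data", {}).get("alerts", [])
    return sorted({
        alert["labels"]["alertname"]
        for alert in alerts
        if alert.get("state") == "firing"
        and alert.get("labels", {}).get("alertname") in TRIGGER_ALERTS
    })


def fetch_alerts(base_url: str = "http://127.0.0.1:9090", timeout: float = 5.0) -> dict:
    """Fetch active alerts from Prometheus (base_url is the assumed local default)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=timeout) as resp:
        return json.load(resp)
```

With Prometheus running, firing_triggers(fetch_alerts()) returns the subset of the six alerts that are firing; pending alerts are ignored on purpose.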
What “error budget exhausted” means
For this API we allow at most 0.1% of requests to return HTTP 5xx in a rolling 30-day window
(99.9% success). The rule job:study_app:http_5xx:burn30d compares the observed 5xx share to that allowance.
When burn is above 1, the service has returned more 5xx responses than the budget allows for that window.
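A small worked example may make the threshold concrete. The sketch below assumes burn is the observed 5xx ratio divided by the allowed ratio (0.001); the actual expression behind job:study_app:http_5xx:burn30d lives in the Prometheus rules and may differ in detail. The function names and request counts are illustrative.

```python
ALLOWED_5XX_RATIO = 0.001  # 0.1% of requests, per the SLO described above


def burn(total_requests: int, failed_5xx: int,
         allowed_ratio: float = ALLOWED_5XX_RATIO) -> float:
    """Observed 5xx ratio divided by the allowed ratio (assumed definition)."""
    if total_requests <= 0:
        return 0.0  # no traffic, nothing spent
    return (failed_5xx / total_requests) / allowed_ratio


def budget_remaining(total_requests: int, failed_5xx: int,
                     allowed_ratio: float = ALLOWED_5XX_RATIO) -> float:
    """Further 5xx responses the window can absorb; negative when overspent."""
    return total_requests * allowed_ratio - failed_5xx


# Illustrative numbers: 2,000,000 requests in the window.
# 1,500 failures gives a burn of about 0.75 (within budget).
# 2,600 failures gives a burn of about 1.3 (budget exhausted).
within = burn(2_000_000, 1_500)
exhausted = burn(2_000_000, 2_600)
```

Under this definition a burn of exactly 1 means the whole budget for the window is spent, which is why the alerts and the Grafana panel use 1 as the threshold.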
Full policy and SLI text: ADR 0011: SLOs, SLAs, error budget, and monitoring alerts.
Fast triage (order)
- Readiness / dependencies — If StudyAppReadyProbeFailing fires, follow the steps in Observability scrape failing: confirm the API is up, the DB is reachable, and curl -i http://127.0.0.1:8000/ready returns 200.
- 5xx and latency — Open Grafana dashboard Study App Observability and Prometheus Alerts / Graph.
- Recent change — Look for deploys, config edits, migrations, or traffic spikes in that window.
- Rollback — If a new release is likely the cause, roll back to the last good version (your usual process).
- Saturation — If latency SLO fires but 5xx are low, check CPU, memory, pools, and rate limits.
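The readiness step above can also be scripted, which is handy when the check has to be repeated during an incident. This is a minimal sketch that mirrors the curl command, assuming the local default URL http://127.0.0.1:8000/ready; the function names and the injectable opener parameter are illustrative choices, not part of the application.

```python
import urllib.error
import urllib.request


def ready_status(url: str = "http://127.0.0.1:8000/ready",
                 timeout: float = 3.0,
                 opener=urllib.request.urlopen):
    """Return the HTTP status of the readiness endpoint, or None if unreachable."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout (URLError is an OSError).
        return None


def is_ready(status) -> bool:
    """The runbook treats only HTTP 200 as ready."""
    return status == 200
```

A None result points at the API process or network path, while a non-200 status means the API is up but reports itself (or a dependency such as the DB) as not ready.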
Response when budget is exhausted or critical
- Incident — Sustained burn or user impact: name a lead, use one channel, write a short timeline.
- Feature freeze — Until things are stable, pause non-urgent releases; ship only fixes for the incident.
- Root cause — After mitigation, record cause and follow-ups (tests, alerts, ADR if needed).
- External SLA — If you have a published SLA, follow its customer comms and escalation steps.
Useful links (local defaults)
- Grafana: Study App Observability
- Prometheus alerts: http://127.0.0.1:9090/alerts
- ADR 0011: SLOs, SLAs, error budget
- Observability baseline: ADR 0009
Exit criteria
- Firing alerts are resolved or acknowledged with a documented exception.
- 5xx rate and latency panels are back within SLO targets on the Grafana dashboard.
- Error budget burn is going down, or the team has agreed in writing on a new window rule.
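The runbook does not define "going down" precisely. One possible reading, sketched below as an assumption, is that each of the last few burn samples is lower than the one before it; the function name and the default of three consecutive drops are illustrative.

```python
def burn_is_falling(samples: list[float], min_drops: int = 3) -> bool:
    """True when the last `min_drops` steps in `samples` are all decreases.

    `samples` is an oldest-to-newest list of burn values, for example
    readings of the burn panel taken at a fixed interval.
    """
    if len(samples) < min_drops + 1:
        return False  # not enough data to judge a trend
    recent = samples[-(min_drops + 1):]
    return all(later < earlier for earlier, later in zip(recent, recent[1:]))
```

Because the burn is computed over a 30-day window it moves slowly, so samples spaced hours apart are likely to be more informative than samples spaced minutes apart.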
Page history
| Date | Change | Author |
|---|---|---|
| | Added Page history section (repository baseline). | Ivan Boyarkin |