Runbook: Error Budget Exhaustion and SLO Alerts

Trigger

  • Prometheus alerts StudyAppErrorBudgetExhausted, StudyAppErrorBudgetCritical, StudyAppFast5xxBurn, StudyAppHighHttp5xxRate, StudyAppLatencySLOViolation, or StudyAppReadyProbeFailing are firing.
  • Grafana shows an error budget burn above 1, a high 5xx rate, or readiness at 0.
  • Users report an outage, or an external SLA is being breached at the same time.

What “error budget exhausted” means

For this API we allow at most 0.1% of requests to return HTTP 5xx in a rolling 30-day window (99.9% success). The recording rule job:study_app:http_5xx:burn30d compares the observed 5xx share to that allowance. When burn is above 1, the service has consumed more than its error budget for that window.
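The burn figure can be reproduced by hand. A minimal sketch of the arithmetic (the 0.1% allowance comes from this runbook; the function name and inputs are illustrative, not the actual recording rule):

```python
# Error budget burn for a 99.9% success SLO over a rolling window.
# burn = observed 5xx share / allowed 5xx share (0.1% here).
# A burn above 1.0 means the window's budget is spent.

ALLOWED_5XX_SHARE = 0.001  # 0.1% of requests may be 5xx (99.9% SLO)

def error_budget_burn(total_requests: int, error_5xx: int) -> float:
    """Return the fraction of the window's error budget consumed."""
    if total_requests == 0:
        return 0.0
    observed_share = error_5xx / total_requests
    return observed_share / ALLOWED_5XX_SHARE

# Example: 10,000,000 requests in the window, 12,000 of them 5xx.
# Observed share is 0.12%, so the budget is exhausted.
print(f"{error_budget_burn(10_000_000, 12_000):.2f}")  # → 1.20
```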

Full policy and SLI text: ADR 0011: SLOs, SLAs, error budget, and monitoring alerts.

Fast triage (order)

  1. Readiness / dependencies — If StudyAppReadyProbeFailing fires, start with the Observability scrape failing runbook: confirm the API is up, the DB is reachable, and curl -i http://127.0.0.1:8000/ready returns 200.
  2. 5xx and latency — Open the Grafana dashboard Study App Observability and the Prometheus Alerts / Graph pages.
  3. Recent change — Look for deploys, config edits, migrations, or traffic spikes in that window.
  4. Rollback — If a new release is likely the cause, roll back to the last good version (your usual process).
  5. Saturation — If the latency SLO fires but 5xx are low, check CPU, memory, connection pools, and rate limits.
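The triage order above can be sketched as a pure decision function. This is an illustration of the runbook's ordering only; the signal names and the burn threshold are assumptions, not values taken from the actual alert rules:

```python
# Map observed signals to the first matching triage step, in runbook order:
# readiness first, then 5xx/latency burn, then recent change and rollback,
# and finally saturation when latency fires with low 5xx.

def next_triage_step(ready: bool, burn: float, latency_slo_ok: bool,
                     recent_deploy: bool) -> str:
    """Return the next action per the runbook's fast-triage order."""
    if not ready:
        return "check dependencies: API up, DB reachable, /ready returns 200"
    if burn > 1 or not latency_slo_ok:
        if recent_deploy:
            return "roll back the recent release"
        if burn <= 1 and not latency_slo_ok:
            return "check saturation: CPU, memory, connection pools, rate limits"
        return "inspect 5xx and latency dashboards for the failing route"
    return "monitor: no action required"

print(next_triage_step(ready=True, burn=1.4, latency_slo_ok=True,
                       recent_deploy=True))  # → roll back the recent release
```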

Response when budget is exhausted or critical

  • Incident — For sustained burn or user impact: name an incident lead, coordinate in a single channel, and keep a short timeline.
  • Feature freeze — Pause non-urgent releases until the service is stable; ship only fixes for the incident.
  • Root cause — After mitigation, record cause and follow-ups (tests, alerts, ADR if needed).
  • External SLA — If you have a published SLA, follow its customer comms and escalation steps.

Useful links (local defaults)

Grafana: Study App Observability
Prometheus alerts: http://127.0.0.1:9090/alerts
ADR 0011: SLOs, SLAs, error budget
Observability baseline: ADR 0009

Exit criteria

  • Firing alerts are resolved or acknowledged with a documented exception.
  • 5xx rate and latency panels are back within SLO targets on the Grafana dashboard.
  • Error budget burn is trending down, or the team has agreed a revised window rule in writing.

Page history

Date Change Author
Added Page history section (repository baseline). Ivan Boyarkin