Runbook: Error Budget Exhaustion and SLO Alerts

Trigger

  • Prometheus alerts StudyAppErrorBudgetExhausted, StudyAppErrorBudgetCritical, StudyAppFast5xxBurn, StudyAppHighHttp5xxRate, StudyAppLatencySLOViolation, or StudyAppReadyProbeFailing are firing.
  • Grafana shows an error budget burn above 1, a high 5xx rate, or readiness at 0.
  • Users report an outage, or an external SLA is being breached at the same time.

What “error budget exhausted” means

For this API we allow at most 0.1% of requests to return HTTP 5xx in a rolling 30-day window (99.9% success). The recording rule job:study_app:http_5xx:burn30d compares the observed 5xx share to that allowance. When burn is above 1, the service has consumed more than its error budget for that window.
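The burn figure can be reproduced by hand. A minimal sketch of the arithmetic (the 0.1% allowance comes from this runbook; the function name and inputs are illustrative, not the actual recording rule):

```python
# Error budget burn for a 99.9% success SLO over a rolling window.
# burn = observed 5xx share / allowed 5xx share (0.1% here).
# A burn above 1.0 means the window's budget is spent.

ALLOWED_5XX_SHARE = 0.001  # 0.1% of requests may be 5xx (99.9% SLO)

def error_budget_burn(total_requests: int, error_5xx: int) -> float:
    """Return the fraction of the window's error budget consumed."""
    if total_requests == 0:
        return 0.0
    observed_share = error_5xx / total_requests
    return observed_share / ALLOWED_5XX_SHARE

# Example: 10,000,000 requests in the window, 12,000 of them 5xx.
# Observed share is 0.12%, so the budget is exhausted.
print(f"{error_budget_burn(10_000_000, 12_000):.2f}")  # → 1.20
```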

Full policy and SLI text: ADR 0011: SLOs, SLAs, error budget, and monitoring alerts.

Fast triage (order)

  1. Readiness / dependencies — If StudyAppReadyProbeFailing fires, start with the Observability scrape failing runbook: confirm the API is up, the DB is reachable, and curl -i http://127.0.0.1:8000/ready returns 200.
  2. 5xx and latency — Open the Grafana dashboard Study App Observability and the Prometheus Alerts / Graph pages.
  3. Recent change — Look for deploys, config edits, migrations, or traffic spikes in that window.
  4. Rollback — If a new release is likely the cause, roll back to the last good version (your usual process).
  5. Saturation — If the latency SLO fires but 5xx are low, check CPU, memory, connection pools, and rate limits.
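The triage order above can be sketched as a pure decision function. This is an illustration of the runbook's ordering only; the signal names and the burn threshold are assumptions, not values taken from the actual alert rules:

```python
# Map observed signals to the first matching triage step, in runbook order:
# readiness first, then 5xx/latency burn, then recent change and rollback,
# and finally saturation when latency fires with low 5xx.

def next_triage_step(ready: bool, burn: float, latency_slo_ok: bool,
                     recent_deploy: bool) -> str:
    """Return the next action per the runbook's fast-triage order."""
    if not ready:
        return "check dependencies: API up, DB reachable, /ready returns 200"
    if burn > 1 or not latency_slo_ok:
        if recent_deploy:
            return "roll back the recent release"
        if burn <= 1 and not latency_slo_ok:
            return "check saturation: CPU, memory, connection pools, rate limits"
        return "inspect 5xx and latency dashboards for the failing route"
    return "monitor: no action required"

print(next_triage_step(ready=True, burn=1.4, latency_slo_ok=True,
                       recent_deploy=True))  # → roll back the recent release
```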

Response when budget is exhausted or critical

  • Incident — For sustained burn or user impact: name an incident lead, coordinate in a single channel, and keep a short timeline.
  • Feature freeze — Pause non-urgent releases until the service is stable; ship only fixes for the incident.
  • Root cause — After mitigation, record cause and follow-ups (tests, alerts, ADR if needed).
  • External SLA — If you have a published SLA, follow its customer comms and escalation steps.

Useful links (local defaults)

Grafana: Study App Observability
Prometheus alerts: http://127.0.0.1:9090/alerts
ADR 0011: SLOs, SLAs, error budget
Observability baseline: ADR 0009

Exit criteria

  • Firing alerts are resolved or acknowledged with a documented exception.
  • 5xx rate and latency panels are back within SLO targets on the Grafana dashboard.
  • Error budget burn is trending down, or the team has agreed a revised window rule in writing.

Page history

Date Change Author
Added Page history section (repository baseline). Ivan Boyarkin