ADR 0011: SLOs, SLAs, Error Budget, and Monitoring Alerts
Ratification
Adopted before ADR 0018. There was no separate ratification process. Git history for this file on main is the record.
- Discussion Issue: not recorded (before ADR 0018)
- Merge PR: see git history for this file
- Accepted: as merged to `main`
Context
Why this matters: Dashboards alone do not answer “are we allowed to ship today?” Service level objectives (SLOs) turn metrics into targets (for example “99.9% of requests succeed without 5xx”). An error budget is the allowed failure slice implied by that target; when it is exhausted, the team slows feature work and fixes reliability—an idea popularized by Google SRE and widely adopted in the industry.
ADR 0009 added metrics and probes. This ADR adds explicit SLIs/SLOs, budget-style burn signals, and alerts so on-call responds to customer-impacting risk, not arbitrary thresholds.
Decision
Service level indicators (SLIs)
- HTTP availability (5xx): fraction of requests that do not return a 5xx status, measured from `http_requests_total` (Prometheus job `study-app-api`).
- API latency: 95th percentile of `http_request_duration_seconds` for routes whose `path_template` is not `/metrics`, `/live`, or `/ready`, so health and scrape traffic do not dominate the SLI.
- Readiness: success of an HTTP GET to `/ready` expecting status 200, measured via the Prometheus Blackbox exporter (job `study-app-blackbox`, metric `probe_success`). This reflects dependency readiness; it is distinct from the HTTP error SLI on application routes.
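The three indicators above reduce to simple ratios and a route filter. The following is a minimal Python sketch of that arithmetic, not the repository's recording rules. The function names are hypothetical, and treating an empty sample as fully healthy is an assumption made only for this illustration.

```python
from typing import Iterable

# Routes excluded from the latency SLI, as listed above.
EXCLUDED_PATHS = frozenset({"/metrics", "/live", "/ready"})


def availability_sli(total_requests: float, responses_5xx: float) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if total_requests <= 0:
        return 1.0  # assumption: no traffic counts as fully available
    return 1.0 - responses_5xx / total_requests


def in_latency_scope(path_template: str) -> bool:
    """True when a route counts toward the latency SLI."""
    return path_template not in EXCLUDED_PATHS


def readiness_sli(probe_results: Iterable[int]) -> float:
    """Share of probes that succeeded (probe_success == 1)."""
    results = list(probe_results)
    if not results:
        return 1.0  # assumption: no samples counts as ready
    return sum(results) / len(results)
```

In Prometheus itself these values would come from rate-based queries over the metrics named above; the sketch only shows what each ratio means.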
Service level objectives (SLOs)
Targets are evaluated over a rolling 30-day window unless stated otherwise.
| SLI | Objective | Notes |
|---|---|---|
| HTTP availability (non-5xx) | 99.9% of requests succeed without 5xx | Complementary view: 5xx ratio must stay within 0.1% of traffic over the window. |
| API latency (p95) | 95% of in-scope requests complete in under 500 ms | Measured as histogram p95 over 5 min rate windows for alerting; SLO is a monthly/rolling intent. |
| Readiness probe | `/ready` returns HTTP 200 in at least 99.9% of probe intervals | Blackbox-based; indicates DB and dependency health from outside the process. |
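To make the 99.9% readiness target concrete, it can be translated into allowed downtime and allowed failed probes over the 30-day window. This is a rough sketch of that arithmetic. The 15-second probe interval is an assumption, not a value taken from this repository, and the function names are hypothetical.

```python
WINDOW_DAYS = 30
SLO_TARGET = 0.999
# Assumption: replace with the Blackbox job's actual scrape_interval.
PROBE_INTERVAL_SECONDS = 15


def allowed_downtime_minutes(window_days: int, target: float) -> float:
    """Minutes of failed readiness permitted by the target over the window."""
    return window_days * 24 * 60 * (1.0 - target)


def allowed_failed_probes(window_days: int, target: float, interval_s: float) -> float:
    """Number of failed probes permitted by the target over the window."""
    total_probes = window_days * 24 * 60 * 60 / interval_s
    return total_probes * (1.0 - target)


if __name__ == "__main__":
    print(round(allowed_downtime_minutes(WINDOW_DAYS, SLO_TARGET), 1))
    print(round(allowed_failed_probes(WINDOW_DAYS, SLO_TARGET, PROBE_INTERVAL_SECONDS), 1))
```

Under these assumptions the budget is about 43.2 minutes of failed readiness, or roughly 173 failed probes, per 30 days.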
Error budget
For HTTP availability, the allowed bad event ratio is 1 − 0.999 = 0.001 (0.1% of requests may fail with 5xx over the rolling window). Prometheus recording rules expose `job:study_app:http_5xx:burn30d`, the ratio of the actual 5xx share to the allowed 0.001 share. Values above 1 mean the error budget for the window is exhausted.
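The burn ratio described above can be illustrated in a few lines of Python. This is a sketch of the arithmetic only, not the recording rule itself. The function names and the sample request counts are hypothetical.

```python
ALLOWED_5XX_RATIO = 0.001  # 1 - 0.999, from the availability SLO


def burn_ratio(requests_total: float, requests_5xx: float,
               allowed_ratio: float = ALLOWED_5XX_RATIO) -> float:
    """Actual 5xx share divided by the allowed share over the same window."""
    if requests_total <= 0:
        return 0.0  # assumption: no traffic burns no budget
    return (requests_5xx / requests_total) / allowed_ratio


def budget_exhausted(ratio: float) -> bool:
    """Values above 1 mean the error budget for the window is used up."""
    return ratio > 1.0


if __name__ == "__main__":
    # Hypothetical 30-day totals: 2,000,000 requests, 1,500 of them 5xx.
    ratio = burn_ratio(2_000_000, 1_500)
    print(round(ratio, 2), budget_exhausted(ratio))
```

With these sample numbers the 5xx share is 0.075%, so three quarters of the budget is consumed and the window is still within its objective.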
Fast burn (short windows) and sustained high 5xx rates are alerted separately so on-call can act before the full 30-day budget is consumed.
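One common way to reason about fast burn is to express the short-window error ratio as a multiple of the allowed ratio and then ask how quickly that rate would consume the 30-day budget. The sketch below assumes a threshold of 14.4 and a long-plus-short window pairing, which follow the widely cited multiwindow pattern from Google's SRE material. They are not necessarily the thresholds used in `study_app_slo.yml`, and the function names are hypothetical.

```python
import math

ALLOWED_5XX_RATIO = 0.001
SLO_WINDOW_HOURS = 30 * 24  # rolling 30-day window


def burn_rate(error_ratio: float, allowed_ratio: float = ALLOWED_5XX_RATIO) -> float:
    """How many times faster than allowed the budget is being spent."""
    return error_ratio / allowed_ratio


def budget_consumed_fraction(rate: float, window_hours: float) -> float:
    """Share of the 30-day budget spent if `rate` holds for `window_hours`."""
    return rate * window_hours / SLO_WINDOW_HOURS


def hours_to_exhaustion(rate: float) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return math.inf if rate <= 0 else SLO_WINDOW_HOURS / rate


def is_fast_burn(long_window_ratio: float, short_window_ratio: float,
                 threshold: float = 14.4) -> bool:
    """Fire only when both windows exceed the threshold (assumed value)."""
    return (burn_rate(long_window_ratio) >= threshold
            and burn_rate(short_window_ratio) >= threshold)
```

At a burn rate of 14.4, one hour consumes about 2% of the 30-day budget and the whole budget would last roughly 50 hours. Requiring the short window to agree with the long one lets the alert clear soon after the error rate recovers.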
Optional external SLA
An SLA is a commitment to customers or internal consumers (support tickets, credits, page). This ADR defines internal SLOs only. If a public SLA is published, it should be no stricter than these SLOs (equal or looser), so that meeting the internal objectives implies meeting the external commitment. Its measurement clauses should reference Prometheus/Grafana, and breach handling should reference the error budget exhaustion runbook.
Alerting
Prometheus alerting rules live under `ops/prometheus/rules/`. Alertmanager is not required for evaluation: firing alerts appear in the Prometheus UI (`/alerts`). Wiring Alertmanager for notifications is a follow-up.
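Until Alertmanager is wired up, firing alerts can also be read from the Prometheus HTTP API endpoint `/api/v1/alerts`, which returns the same data the UI shows. The sketch below assumes Prometheus is reachable at `http://localhost:9090`. The helper names and the alert names in the usage example are hypothetical.

```python
import json
from typing import List
from urllib.request import urlopen


def firing_alerts(payload: dict) -> List[str]:
    """Extract sorted, de-duplicated names of alerts in the 'firing' state."""
    alerts = payload.get("data", {}).get("alerts", [])
    names = {
        alert.get("labels", {}).get("alertname", "<unnamed>")
        for alert in alerts
        if alert.get("state") == "firing"
    }
    return sorted(names)


def fetch_firing_alerts(base_url: str = "http://localhost:9090",
                        timeout: float = 5.0) -> List[str]:
    """Query Prometheus (assumed base URL) and return firing alert names."""
    with urlopen(f"{base_url}/api/v1/alerts", timeout=timeout) as resp:
        return firing_alerts(json.load(resp))


if __name__ == "__main__":
    # Hypothetical payload in the shape returned by /api/v1/alerts.
    sample = {
        "status": "success",
        "data": {"alerts": [
            {"labels": {"alertname": "ExampleFastBurn"}, "state": "firing"},
            {"labels": {"alertname": "ExampleHighLatency"}, "state": "pending"},
        ]},
    }
    print(firing_alerts(sample))
```

Keeping the parsing separate from the network call makes the filter easy to check against a saved API response.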
Implementation
- `ops/prometheus/rules/study_app_slo.yml`: recording and alert rules.
- `ops/prometheus/blackbox.yml`: Blackbox module `http_2xx` for `/ready`.
- `ops/prometheus/prometheus.tpl.yml`: `rule_files`, API scrape, Blackbox scrape.
- `scripts/render_prometheus_config.py`: substitutes scrape target and ready-probe URL.
- `docker-compose.observability.yml`: Blackbox service; Prometheus mounts rules directory.
- `ops/grafana/dashboards/study-app-observability.json`: SLO panels (5xx, readiness, budget, p95).
- Runbook (Error budget exhaustion): policy and triage when budget is burned or alerts fire.
Consequences
Positive
- Shared language for reliability and explicit trade-offs between velocity and stability.
- Alerts grounded in SLOs rather than ad hoc thresholds alone.
Trade-offs
- Low-traffic environments can make 30-day windows noisy or flat; tune evaluation windows for non-prod if needed.
- Blackbox adds a container and requires the API to be reachable from Docker (same defaults as the Prometheus scrape).
Page history
| Date | Change | Author |
|---|---|---|
| | Added Page history section (repository baseline). | Ivan Boyarkin |