ADR 0011: SLOs, SLAs, Error Budget, and Monitoring Alerts

Ratification

Adopted before ADR 0018. There was no separate ratification process. Git history for this file on main is the record.

Discussion Issue: not recorded (before ADR 0018)
Merge PR: see git history for this file
Accepted: as merged to main

Context

Why this matters: Dashboards alone do not answer “are we allowed to ship today?” Service level objectives (SLOs) turn metrics into targets (for example “99.9% of requests succeed without 5xx”). An error budget is the share of failures that the target permits; when it is exhausted, the team slows feature work and fixes reliability. The idea was popularized by Google SRE and is widely adopted in the industry.

ADR 0009 added metrics and probes. This ADR adds explicit SLIs/SLOs, budget-style burn signals, and alerts so on-call responds to customer-impacting risk, not arbitrary thresholds.

Decision

Service level indicators (SLIs)

HTTP availability (5xx) — fraction of requests that do not return a 5xx status, measured from http_requests_total (Prometheus job study-app-api).
API latency — 95th percentile of http_request_duration_seconds for routes whose path_template is not /metrics, /live, or /ready, so health and scrape traffic do not dominate the SLI.
Readiness — success of an HTTP GET to /ready expecting status 200, measured via the Prometheus Blackbox exporter (job study-app-blackbox, metric probe_success). This reflects dependency readiness; it is distinct from the HTTP error SLI on application routes.
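As a minimal sketch of the availability SLI described above, the following computes the non-5xx fraction from per-status request counts. The `status` keys and the sample numbers are illustrative assumptions; the real label name on http_requests_total is not stated in this ADR.

```python
def availability_sli(requests_by_status: dict[str, float]) -> float:
    """Return the fraction of requests that did not end in a 5xx status.

    `requests_by_status` maps an HTTP status code (as a string) to the number
    of requests observed in the window. The mapping shape is an assumption
    made for this sketch, not the layout of the real metric.
    """
    total = sum(requests_by_status.values())
    if total == 0:
        # No traffic in the window: report full availability instead of
        # dividing by zero.
        return 1.0
    bad = sum(n for status, n in requests_by_status.items() if status.startswith("5"))
    return (total - bad) / total


# Hypothetical window: 10,000 requests, 10 of them 5xx.
counts = {"200": 9950, "404": 40, "500": 7, "503": 3}
print(f"availability = {availability_sli(counts):.4f}")
```

Note that 4xx responses count as good events here, matching the "do not return a 5xx status" wording of the SLI.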

Service level objectives (SLOs)

Targets are evaluated over a rolling 30-day window unless stated otherwise.

SLI	Objective	Notes
HTTP availability (non-5xx)	99.9% of requests succeed without 5xx	Complementary view: the 5xx ratio must stay at or below 0.1% of traffic over the window.
API latency (p95)	95% of in-scope requests complete in under 500 ms	Alerting uses the histogram p95 over 5-minute rate windows; the objective itself is assessed over the rolling window.
Readiness probe	`/ready` returns HTTP 200 at least 99.9% of probe intervals	Blackbox-based; indicates DB and dependency health from outside the process.
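The latency objective relies on a p95 estimated from histogram buckets. The sketch below mirrors, in simplified form, how a quantile is commonly estimated from cumulative buckets (linear interpolation inside the bucket that contains the target rank). The bucket bounds and counts are made up for illustration and are not the application's real histogram configuration.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound_seconds, cumulative_count) pairs and must
    include a final +Inf bucket. The lowest bucket is assumed to start at 0.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Target rank falls in the +Inf bucket: the best available
                # answer is the highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Linear interpolation within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts for in-scope routes.
sample = [(0.1, 600), (0.25, 900), (0.5, 960), (1.0, 990), (math.inf, 1000)]
p95 = histogram_quantile(0.95, sample)
print(f"p95 is about {p95:.3f} s; objective met: {p95 < 0.5}")
```

Because the estimate is interpolated, its accuracy depends on having a bucket boundary near the 500 ms target.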

Error budget

For HTTP availability, the allowed bad event ratio is 1 − 0.999 = 0.001 (0.1% of requests may fail with 5xx over the rolling window). Prometheus recording rules expose job:study_app:http_5xx:burn30d: the ratio of actual 5xx share to the allowed 0.001 share. Values above 1 mean the error budget for the window is exhausted.
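The arithmetic above can be illustrated with a short sketch. It computes the allowed number of failed requests and a burn ratio in the same sense as the recording rule (actual 5xx share divided by allowed share). The function names and request counts are hypothetical; the actual rule expression lives in the rules file and is not reproduced here.

```python
SLO_TARGET = 0.999  # availability objective from this ADR


def allowed_bad_requests(total_requests: int, slo_target: float = SLO_TARGET) -> int:
    """Number of 5xx responses the budget permits for the given traffic."""
    return round(total_requests * (1.0 - slo_target))


def burn_ratio(bad_requests: int, total_requests: int, slo_target: float = SLO_TARGET) -> float:
    """Actual 5xx share divided by the allowed share; above 1 means exhausted."""
    if total_requests == 0:
        return 0.0
    actual_share = bad_requests / total_requests
    return actual_share / (1.0 - slo_target)


# Hypothetical 30-day window: 1,000,000 requests, 1,500 of them 5xx.
total, bad = 1_000_000, 1_500
print("allowed 5xx:", allowed_bad_requests(total))
print(f"burn ratio: {burn_ratio(bad, total):.2f}")
```

In this example the budget allows 1,000 failed requests, so 1,500 failures give a ratio of about 1.5 and the budget is exhausted.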

Fast burn (short windows) and sustained high 5xx rates are alerted separately so on-call can act before the full 30-day budget is consumed.
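To show why short windows matter, the sketch below converts a short-window 5xx ratio into a burn rate and estimates how long a full 30-day budget would last at that rate. The 1.44% sample ratio is an illustrative assumption; the thresholds actually used by the alert rules are defined in the rules file, not here.

```python
WINDOW_HOURS = 30 * 24  # rolling 30-day SLO window


def burn_rate(bad_ratio: float, slo_target: float = 0.999) -> float:
    """How many times faster than 'exactly on budget' failures are occurring."""
    return bad_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate


def budget_consumed(rate: float, hours: float, window_hours: float = WINDOW_HOURS) -> float:
    """Fraction of the window's budget spent after `hours` at the given rate."""
    return rate * hours / window_hours


# Hypothetical observation: 1.44% of requests returned 5xx over the last hour.
rate = burn_rate(0.0144)
print(f"burn rate: {rate:.1f}x")
print(f"budget lasts about {hours_to_exhaustion(rate):.0f} h at this rate")
print(f"budget spent in one hour: {budget_consumed(rate, 1):.1%}")
```

A burn rate of 1 spends the budget exactly over 30 days, so a sustained rate well above 1 is what lets on-call react before the long-window ratio crosses 1.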

Optional external SLA

An SLA is a commitment to customers or internal consumers, with consequences attached to a breach (support tickets, credits, a page). This ADR defines internal SLOs only. If a public SLA is published, it should be equal to or looser than these SLOs, so that an internal SLO breach gives warning before the external commitment is at risk. Its measurement clauses should reference Prometheus/Grafana and the error budget exhaustion runbook for breach handling.

Alerting

Prometheus alerting rules live under ops/prometheus/rules/. Alertmanager is not required for evaluation: firing alerts appear in the Prometheus UI (/alerts). Wiring Alertmanager for notifications is a follow-up.
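Until Alertmanager is wired up, firing alerts can also be read from the Prometheus HTTP API endpoint /api/v1/alerts, which returns the same information as the UI. The sketch below only parses a response body of that shape; the alert names in the sample payload are hypothetical, and fetching the body (for example with urllib) is left out so the example runs offline.

```python
import json


def firing_alerts(response_body: str) -> list[str]:
    """Return the sorted names of alerts in state 'firing'.

    Expects the JSON envelope returned by GET /api/v1/alerts:
    {"status": "success", "data": {"alerts": [...]}}.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"unexpected API status: {payload.get('status')!r}")
    alerts = payload.get("data", {}).get("alerts", [])
    return sorted(
        alert.get("labels", {}).get("alertname", "<unnamed>")
        for alert in alerts
        if alert.get("state") == "firing"
    )


# Sample body; alert names are made up for illustration.
SAMPLE = """
{
  "status": "success",
  "data": {
    "alerts": [
      {"labels": {"alertname": "StudyAppHigh5xxRate"}, "annotations": {},
       "state": "firing", "activeAt": "2026-04-21T10:00:00Z", "value": "2e-02"},
      {"labels": {"alertname": "StudyAppLatencyP95High"}, "annotations": {},
       "state": "pending", "activeAt": "2026-04-21T10:05:00Z", "value": "6e-01"}
    ]
  }
}
"""
print(firing_alerts(SAMPLE))
```

Alerts in state pending have matched their expression but not yet for the rule's full duration, so the sketch ignores them.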

Implementation

ops/prometheus/rules/study_app_slo.yml — recording and alert rules.
ops/prometheus/blackbox.yml — Blackbox module http_2xx for /ready.
ops/prometheus/prometheus.tpl.yml — rule_files, API scrape, Blackbox scrape.
scripts/render_prometheus_config.py — substitutes scrape target and ready-probe URL.
docker-compose.observability.yml — Blackbox service; Prometheus mounts rules directory.
ops/grafana/dashboards/study-app-observability.json — SLO panels (5xx, readiness, budget, p95).
Runbook: Error budget exhaustion — policy and triage when budget is burned or alerts fire.

Consequences

Positive

Shared language for reliability and explicit trade-offs between velocity and stability.
Alerts grounded in SLOs rather than ad hoc thresholds alone.

Trade-offs

Low-traffic environments can make 30-day windows noisy or flat; tune evaluation windows for non-prod if needed.
Blackbox adds a container and requires the API to be reachable from Docker (same defaults as Prometheus scrape).
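The low-traffic trade-off can be made concrete with a small calculation. It shows how few failures it takes to exhaust a 99.9% budget when the window contains little traffic. The request counts are hypothetical, and exact fractions are used to avoid floating-point rounding in the threshold.

```python
import math
from fractions import Fraction

ALLOWED_SHARE = 1 - Fraction("0.999")  # exactly 1/1000


def min_requests_to_absorb(failures: int) -> int:
    """Smallest request count at which `failures` 5xx responses fit in budget."""
    return math.ceil(failures / ALLOWED_SHARE)


def burn_ratio(failures: int, total_requests: int) -> Fraction:
    """Actual 5xx share divided by the allowed share."""
    return Fraction(failures, total_requests) / ALLOWED_SHARE


# A single 5xx needs at least this much traffic to stay within budget.
print("requests needed to absorb one 5xx:", min_requests_to_absorb(1))

# Hypothetical quiet non-prod environment: 200 requests in the whole window.
print("burn ratio after one 5xx in 200 requests:", float(burn_ratio(1, 200)))
```

With only a few hundred requests the ratio is either 0 or far above 1, which is the "noisy or flat" behaviour noted above.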

Page history

Date	Change	Author
2026-04-21	Added Page history section (repository baseline).	Ivan Boyarkin