ADR 0011: SLOs, SLAs, Error Budget, and Monitoring Alerts

Ratification

Adopted before ADR 0018. There was no separate ratification process. Git history for this file on main is the record.

Context

Why this matters: Dashboards alone do not answer “are we allowed to ship today?” Service level objectives (SLOs) turn metrics into targets (for example “99.9% of requests succeed without 5xx”). An error budget is the allowed failure slice implied by that target; when it is exhausted, the team slows feature work and fixes reliability. The idea was popularized by Google SRE and is widely adopted in the industry.

ADR 0009 added metrics and probes. This ADR adds explicit SLIs/SLOs, budget-style burn signals, and alerts so on-call responds to customer-impacting risk, not arbitrary thresholds.

Decision

Service level indicators (SLIs)

Service level objectives (SLOs)

Targets are evaluated over a rolling 30-day window unless stated otherwise.

| SLI | Objective | Notes |
| --- | --- | --- |
| HTTP availability (non-5xx) | 99.9% of requests succeed without 5xx | Complementary view: 5xx ratio must stay within 0.1% of traffic over the window. |
| API latency (p95) | 95% of in-scope requests complete in under 500 ms | Measured as histogram p95 over 5 min rate windows for alerting; SLO is a monthly/rolling intent. |
| Readiness probe | /ready returns HTTP 200 at least 99.9% of probe intervals | Blackbox-based; indicates DB and dependency health from outside the process. |
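
The latency objective above can be read as a good-event ratio: the share of in-scope requests that finish in under 500 ms must be at least 95%. The sketch below illustrates that reading in Python, assuming raw request durations are available as a list. This ADR measures latency from a Prometheus histogram instead, and the function and constant names here are illustrative, not part of the repository.

```python
from typing import Sequence

LATENCY_THRESHOLD_S = 0.5  # 500 ms, from the SLO table
LATENCY_TARGET = 0.95      # 95% of in-scope requests


def latency_sli(durations_s: Sequence[float],
                threshold_s: float = LATENCY_THRESHOLD_S) -> float:
    """Return the share of requests that completed strictly under the threshold."""
    if not durations_s:
        raise ValueError("no requests observed; the SLI is undefined")
    good = sum(1 for d in durations_s if d < threshold_s)
    return good / len(durations_s)


def meets_latency_slo(durations_s: Sequence[float],
                      target: float = LATENCY_TARGET) -> bool:
    """True when the good-event share reaches the target."""
    return latency_sli(durations_s) >= target
```

With 96 fast requests and 4 slow ones the SLI is 0.96 and the objective holds; with 9 fast and 1 slow it is 0.9 and the objective is missed.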

Error budget

For HTTP availability, the allowed bad event ratio is 1 − 0.999 = 0.001 (0.1% of requests may fail with 5xx over the rolling window). Prometheus recording rules expose job:study_app:http_5xx:burn30d, the ratio of the actual 5xx share to the allowed 0.001 share over that window. A value of 1 means the budget is exactly spent; values above 1 mean the error budget for the window is exhausted.
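
The arithmetic behind the burn ratio can be sketched in a few lines of Python. This is only an illustration of the definition above, not the recording rule itself; the function name and the request counts are hypothetical.

```python
ALLOWED_BAD_RATIO = 0.001  # 1 - 0.999, the allowed 5xx share from this ADR


def burn_ratio(total_requests: int, bad_requests: int,
               allowed: float = ALLOWED_BAD_RATIO) -> float:
    """Ratio of the observed 5xx share to the allowed share over one window.

    Below 1: budget remains. Above 1: the budget for the window is exhausted.
    """
    if total_requests == 0:
        return 0.0  # no traffic means no budget was consumed
    return (bad_requests / total_requests) / allowed


# Hypothetical 30-day totals:
half_spent = burn_ratio(1_000_000, 500)    # about 0.5, half the budget used
exhausted = burn_ratio(1_000_000, 2_000)   # about 2.0, budget spent twice over
```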

Fast burn (short windows) and sustained high 5xx rates are alerted separately so on-call can act before the full 30-day budget is consumed.
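
To see why short windows matter, it helps to convert a burn rate into time until the 30-day budget runs out. The sketch below assumes traffic stays roughly constant, so a burn rate B sustained indefinitely spends the whole budget in 30 days / B. The 14.4 threshold is a commonly cited example from SRE practice (2% of a 30-day budget spent in one hour), not a value taken from this ADR's rules; all names are illustrative.

```python
import math

WINDOW_HOURS = 30 * 24        # rolling 30-day window from this ADR
ALLOWED_BAD_RATIO = 0.001     # allowed 5xx share from this ADR
EXAMPLE_FAST_BURN = 14.4      # assumption: illustrative threshold, not from the rules


def burn_rate(bad_ratio_short_window: float,
              allowed: float = ALLOWED_BAD_RATIO) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return bad_ratio_short_window / allowed


def hours_to_exhaustion(rate: float, window_hours: float = WINDOW_HOURS) -> float:
    """Hours until the full window's budget is spent at a constant burn rate."""
    if rate <= 0:
        return math.inf  # no errors: the budget is never consumed
    return window_hours / rate


def is_fast_burn(rate: float, threshold: float = EXAMPLE_FAST_BURN) -> bool:
    """True when the short-window burn rate reaches the example threshold."""
    return rate >= threshold
```

At a burn rate of 1 the budget lasts the full 720 hours; at 14.4 it is gone in roughly 50 hours, which is why a fast-burn signal should reach on-call well before the 30-day ratio crosses 1.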

Optional external SLA

An SLA is a commitment to customers or internal consumers (support tickets, credits, page). This ADR defines internal SLOs only. If a public SLA is published, it should be equal to or looser than these SLOs, never stricter, so that the internal targets are breached before the external commitment is. Its measurement clauses should reference Prometheus/Grafana and the error budget exhaustion runbook for breach handling.

Alerting

Prometheus alerting rules live under ops/prometheus/rules/. Alertmanager is not required for evaluation: firing alerts appear in the Prometheus UI (/alerts). Wiring Alertmanager for notifications is a follow-up.
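
Until Alertmanager is wired, firing alerts can also be read programmatically from the Prometheus HTTP API endpoint /api/v1/alerts, which returns the same data the /alerts page shows. The sketch below is a minimal example using only the standard library. The base URL is an assumption (a local Prometheus on its default port), and the helper names are hypothetical.

```python
import json
from typing import List
from urllib.request import urlopen

PROMETHEUS_URL = "http://localhost:9090"  # assumption: local default address


def firing_alerts(payload: dict) -> List[str]:
    """Return the sorted alertname labels of alerts in the 'firing' state."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus API did not report success")
    alerts = payload.get("data", {}).get("alerts", [])
    return sorted(
        alert.get("labels", {}).get("alertname", "<unnamed>")
        for alert in alerts
        if alert.get("state") == "firing"
    )


def fetch_firing_alerts(base_url: str = PROMETHEUS_URL,
                        timeout_s: float = 5.0) -> List[str]:
    """Query /api/v1/alerts and return the names of firing alerts."""
    with urlopen(f"{base_url}/api/v1/alerts", timeout=timeout_s) as resp:
        return firing_alerts(json.load(resp))
```

Keeping the parsing in firing_alerts separate from the network call makes it easy to test against a saved response without a running Prometheus.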

Implementation

Consequences

Positive

Trade-offs

Page history

| Date | Change | Author |
| --- | --- | --- |
| | Added Page history section (repository baseline). | Ivan Boyarkin |