ADR 0009: Health, Readiness, and Prometheus Observability
Ratification
Adopted before ADR 0018. There was no separate ratification process. Git history for this file on main is the record.
- Discussion Issue: not recorded (before ADR 0018)
- Merge PR: see git history for this file
- Accepted: as merged to `main`
Context
Why this matters: Runtimes and load balancers must know whether to restart an instance and whether to send it traffic. Liveness answers “is the process alive?”; readiness answers “can this instance serve requests right now?”, including whether the database is reachable. Without separate probes, restarts and routing decisions are guesses.
Metrics (latency, status codes, DB time) turn “it feels slow” into graphs and alerts. We had a basic health path but needed explicit readiness and a Prometheus baseline aligned with staging/production.
Decision
- Introduce `/live` for process liveness checks.
- Introduce `/ready` for dependency readiness checks (including DB probe).
- Expose Prometheus metrics via `/metrics`.
- Collect API latency, HTTP status counters, and DB operation latency with low-cardinality labels.
- Adopt OpenAPI snapshot contract tests (`make contract-test`) in the quality gate.
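The difference between the two probes can be sketched as follows. This is a minimal, framework-agnostic illustration, not the code in the repository: the function names, the response shape, and the use of `sqlite3` as a stand-in database are all assumptions.

```python
import sqlite3
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str]]


def live() -> Response:
    """Liveness: succeeds whenever the process can answer at all."""
    return 200, {"status": "alive"}


def ready(db_probe: Callable[[], None]) -> Response:
    """Readiness: succeeds only if the dependency probe does not raise."""
    try:
        db_probe()
    except Exception as exc:  # any probe failure means "do not route traffic here"
        return 503, {"status": "not_ready", "reason": type(exc).__name__}
    return 200, {"status": "ready"}


def sqlite_probe(conn: sqlite3.Connection) -> Callable[[], None]:
    """Build a probe that runs a trivial query (stand-in for the real DB check)."""
    def probe() -> None:
        conn.execute("SELECT 1").fetchone()
    return probe
```

When the probe raises, `ready` reports 503 while `live` still reports 200. That split is what lets an orchestrator stop routing traffic to an instance without restarting it.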
Implementation
- Health/readiness endpoints in `app/main.py` and schemas in `app/schemas/system.py`.
- Metrics collector and SQLAlchemy hooks in `app/core/metrics.py`.
- Config toggles and thresholds in `app/core/config.py` and `env/example`.
- Operational commands in `Makefile` and observability stack in `docker-compose.observability.yml`.
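To illustrate what a DB latency collector might record, here is a dependency-free sketch. The class name, bucket bounds, and operation labels are hypothetical. With SQLAlchemy, the timing would typically be attached through the `before_cursor_execute` and `after_cursor_execute` engine events rather than a context manager.

```python
import time
from bisect import bisect_left
from contextlib import contextmanager
from typing import Dict, Iterator, List

# Hypothetical bucket upper bounds in seconds; real thresholds belong in config.
BUCKETS: List[float] = [0.005, 0.025, 0.1, 0.5, 2.5]


class LatencyHistogram:
    """Per-bucket (non-cumulative) counts, keyed by a small fixed set of operations."""

    def __init__(self) -> None:
        self.counts: Dict[str, List[int]] = {}
        self.sums: Dict[str, float] = {}

    def observe(self, operation: str, seconds: float) -> None:
        # One extra slot holds observations above the last bound (the +Inf bucket).
        slots = self.counts.setdefault(operation, [0] * (len(BUCKETS) + 1))
        # bisect_left puts a value equal to a bound into that bound's bucket (le semantics).
        slots[bisect_left(BUCKETS, seconds)] += 1
        self.sums[operation] = self.sums.get(operation, 0.0) + seconds

    @contextmanager
    def timed(self, operation: str) -> Iterator[None]:
        """Time the wrapped block and record it, even if the block raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.observe(operation, time.perf_counter() - start)
```

Keying by operation type (for example `select` or `insert`) rather than by SQL text keeps the label set small and bounded.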
Operational Model
- Prometheus uses pull-based scraping from `/metrics`.
- Dockerized Prometheus + Grafana is the default local/staging bootstrap path.
- Production deployment may run the same stack outside Docker (for example on a VM) with the same endpoint contract.
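Because scraping is pull-based, the endpoint contract amounts to returning the current metric values as text on each request. The sketch below renders one counter family in the Prometheus text exposition format. The metric name and label names are hypothetical, and label values are assumed to need no escaping.

```python
from typing import Dict, Tuple

# Counter samples keyed by (method, status_class); values are request counts.
Samples = Dict[Tuple[str, str], int]


def render_counter(name: str, help_text: str, samples: Samples) -> str:
    """Render one counter family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for (method, status_class), value in sorted(samples.items()):
        lines.append(
            f'{name}{{method="{method}",status_class="{status_class}"}} {value}'
        )
    return "\n".join(lines) + "\n"
```

Any deployment that serves this text at the same path satisfies the contract, whether the stack runs in Docker or on a VM.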
Consequences
Positive
- Faster diagnosis of incidents via explicit readiness and standardized metrics.
- Reduced risk of accidental contract drift through snapshot tests.
- Consistent observability baseline from local to staging/production.
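The drift protection from snapshot tests can be sketched as a comparison between the current OpenAPI schema and a stored copy. The helper names and the snapshot file layout are assumptions for illustration; what `make contract-test` actually runs is defined in the repository.

```python
import json
from pathlib import Path
from typing import Any, Dict


def canonical(schema: Dict[str, Any]) -> str:
    """Serialize with sorted keys so that key order never causes a false diff."""
    return json.dumps(schema, sort_keys=True, indent=2)


def check_snapshot(schema: Dict[str, Any], snapshot_path: Path, update: bool = False) -> bool:
    """Return True when the schema matches the stored snapshot.

    With update=True, or when no snapshot exists yet, the snapshot is (re)written.
    """
    current = canonical(schema)
    if update or not snapshot_path.exists():
        snapshot_path.write_text(current, encoding="utf-8")
        return True
    return snapshot_path.read_text(encoding="utf-8") == current
```

An intentional API change is then an explicit snapshot update in the same commit, which makes the contract change visible in review.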
Trade-offs
- Additional maintenance for metrics and dashboards.
- Need to enforce low-cardinality labels and avoid exposing sensitive dimensions.
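One way to enforce the label constraint is an allowlist applied wherever labels are built. The sketch below is illustrative only: the label names, the allowlist, and both helper functions are hypothetical.

```python
from typing import Dict

# Hypothetical allowlist: every label here has a small, bounded set of values.
ALLOWED_LABELS = frozenset({"method", "route", "status_class"})


def status_class(status_code: int) -> str:
    """Collapse individual status codes into classes, e.g. 404 -> '4xx'."""
    return f"{status_code // 100}xx"


def validate_labels(labels: Dict[str, str]) -> Dict[str, str]:
    """Reject any dimension outside the allowlist (user IDs, raw paths, and so on)."""
    rejected = sorted(set(labels) - ALLOWED_LABELS)
    if rejected:
        raise ValueError(f"labels not allowed: {rejected}")
    return labels


def request_labels(method: str, route_template: str, status_code: int) -> Dict[str, str]:
    """Build labels from the route template, never from the raw request path."""
    return validate_labels({
        "method": method.upper(),
        "route": route_template,
        "status_class": status_class(status_code),
    })
```

Using the route template (such as `/users/{user_id}`) instead of the concrete path keeps the number of time series bounded and keeps identifiers out of the metrics.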
Related
- ADR 0011: SLOs, SLAs, error budget, and monitoring alerts — targets, recording rules, and readiness probing.
- ADR 0023: Structured logs, request correlation, optional local Elasticsearch — complements metrics with searchable NDJSON and `X-Request-Id`; not a substitute for Prometheus.
- Runbook: Error budget exhaustion — response when SLO budget is burned or alerts fire.
Page history
| Date | Change | Author |
|---|---|---|
| | Added Page history section (repository baseline). | Ivan Boyarkin |