ADR 0009: Health, Readiness, and Prometheus Observability

Ratification

Adopted before ADR 0018. There was no separate ratification process. Git history for this file on main is the record.

Discussion Issue: not recorded (before ADR 0018)
Merge PR: see git history for this file
Accepted: as merged to main

Context

Why this matters: Runtimes and load balancers must decide whether to send traffic to an instance. That takes two separate signals: liveness (“is the process alive?”) and readiness (“can this instance serve requests right now?”, which includes database connectivity). Without separate probes, restart and routing decisions are guesses.

Metrics (latency, status codes, DB time) turn “it feels slow” into graphs and alerts. We had a basic health path but needed explicit readiness and a Prometheus baseline aligned with staging/production.

Decision

Introduce /live for process liveness checks.
Introduce /ready for dependency readiness checks (including a DB probe).
Expose Prometheus metrics via /metrics.
Collect API latency, HTTP status counters, and DB operation latency with low-cardinality labels.
Adopt OpenAPI snapshot contract tests (make contract-test) in the quality gate.
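
The split between /live and /ready can be sketched without any web framework. This is a minimal illustration, not the code in app/main.py: the function names, the response fields, and the use of sqlite3 as the probed database are all assumptions made for the example.

```python
import sqlite3
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str]]


def live() -> Response:
    """Liveness: succeeds whenever the process is able to run this function."""
    return 200, {"status": "alive"}


def ready(db_probe: Callable[[], None]) -> Response:
    """Readiness: 200 only when the dependency probe succeeds, otherwise 503."""
    try:
        db_probe()
    except Exception:
        # Any probe failure means traffic should not be routed to this instance.
        return 503, {"status": "not_ready", "database": "unavailable"}
    return 200, {"status": "ready", "database": "ok"}


def sqlite_probe(conn: sqlite3.Connection) -> Callable[[], None]:
    """Build a probe that runs a trivial query on the given connection (hypothetical helper)."""
    def probe() -> None:
        conn.execute("SELECT 1").fetchone()
    return probe


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(live())
    print(ready(sqlite_probe(conn)))
    conn.close()
    print(ready(sqlite_probe(conn)))  # the closed connection makes the probe fail
```

Keeping the probe injectable means liveness never touches the database, so a database outage takes an instance out of rotation without triggering a restart.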

Implementation

Health/readiness endpoints are in app/main.py, with response schemas in app/schemas/system.py.
The metrics collector and SQLAlchemy hooks are in app/core/metrics.py.
Config toggles and thresholds are in app/core/config.py and env/example.
Operational commands are in the Makefile, and the observability stack is defined in docker-compose.observability.yml.
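
As an illustration of recording DB operation latency, the sketch below times a statement and files the result into cumulative histogram buckets. It deliberately swaps the SQLAlchemy hooks for a plain context manager so that it runs on the standard library alone. The bucket bounds, the `db_latency` store, and every function name are assumptions, not the contents of app/core/metrics.py.

```python
import sqlite3
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, Iterator, List

# Hypothetical bucket upper bounds in seconds; real thresholds would come from config.
BUCKETS: List[float] = [0.005, 0.025, 0.1, 0.5, 1.0]

# operation label -> cumulative bucket counts (the last slot is the +Inf bucket)
db_latency: Dict[str, List[int]] = defaultdict(lambda: [0] * (len(BUCKETS) + 1))


def operation_label(statement: str) -> str:
    """Reduce a SQL statement to one of a small, fixed set of labels."""
    words = statement.split()
    first = words[0].lower() if words else ""
    return first if first in {"select", "insert", "update", "delete"} else "other"


def observe(operation: str, seconds: float) -> None:
    """Record one observation using Prometheus-style cumulative buckets."""
    counts = db_latency[operation]
    for i, bound in enumerate(BUCKETS):
        if seconds <= bound:
            counts[i] += 1
    counts[-1] += 1  # the +Inf bucket counts every observation


@contextmanager
def timed(statement: str) -> Iterator[None]:
    """Time the wrapped block and record it under the statement's operation label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        observe(operation_label(statement), time.perf_counter() - start)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    sql = "SELECT 1"
    with timed(sql):
        conn.execute(sql).fetchone()
    print(dict(db_latency))
```

Labelling by operation kind rather than by full statement text keeps the number of series fixed regardless of how many distinct queries the application issues.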

Operational Model

Prometheus uses pull-based scraping from /metrics.
Dockerized Prometheus + Grafana is the default local/staging bootstrap path.
Production deployment may run the same stack outside Docker (for example on a VM) with the same endpoint contract.
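
Because scraping is pull-based, the application only has to return its current values as plain text when /metrics is requested. The sketch below renders a counter in the Prometheus text exposition format. The metric name and labels are hypothetical, and a real service would more likely rely on a Prometheus client library than on hand-written rendering.

```python
from typing import Dict, Tuple

Labels = Tuple[Tuple[str, str], ...]


def escape(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: Dict[Labels, int]) -> str:
    """Render one counter family as Prometheus text exposition lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        if labels:
            label_text = ",".join(f'{k}="{escape(v)}"' for k, v in labels)
            lines.append(f"{name}{{{label_text}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric name and label set, for illustration only.
    samples = {
        (("method", "GET"), ("status_class", "2xx")): 3,
        (("method", "GET"), ("status_class", "5xx")): 1,
    }
    print(render_counter("http_requests_total", "Total HTTP requests.", samples), end="")
```

The same text body serves Dockerized and VM-hosted Prometheus alike, which is what makes the endpoint contract portable between environments.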

Consequences

Positive

Faster diagnosis of incidents via explicit readiness and standardized metrics.
Reduced risk of accidental contract drift through snapshot tests.
Consistent observability baseline from local to staging/production.
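
How a snapshot test catches contract drift can be sketched as a comparison between a stored OpenAPI document and the one the application currently generates. What make contract-test actually runs is not shown in this ADR, so the function names and the reporting format here are assumptions.

```python
import json
from typing import Any, Dict, List


def canonical(schema: Dict[str, Any]) -> str:
    """Serialize with sorted keys so that key order never causes a false diff."""
    return json.dumps(schema, sort_keys=True, indent=2)


def contract_drift(snapshot: Dict[str, Any], current: Dict[str, Any]) -> List[str]:
    """Describe how the current schema differs from the stored snapshot."""
    old = set(snapshot.get("paths", {}))
    new = set(current.get("paths", {}))
    changes = [f"removed: {p}" for p in sorted(old - new)]
    changes += [f"added: {p}" for p in sorted(new - old)]
    if not changes and canonical(snapshot) != canonical(current):
        changes.append("changed: schema differs from snapshot")
    return changes


if __name__ == "__main__":
    snapshot = {"paths": {"/live": {}, "/ready": {}}}
    current = {"paths": {"/live": {}, "/ready": {}, "/metrics": {}}}
    for line in contract_drift(snapshot, current):
        print(line)
```

A test built on this would fail whenever the returned list is non-empty, so any intentional API change requires an explicit snapshot update in the same change.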

Trade-offs

Additional maintenance for metrics and dashboards.
Need to enforce low-cardinality labels and avoid exposing sensitive dimensions.
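
One way to keep labels low-cardinality is to normalize request attributes before they reach the metrics collector. The sketch below collapses numeric and UUID path segments into a placeholder and groups status codes by class. The patterns and function names are illustrative assumptions, not the rules used in app/core/metrics.py.

```python
import re
from typing import Tuple

# Hypothetical rule: treat purely numeric segments and UUIDs as identifiers.
_ID_SEGMENT = re.compile(
    r"\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def route_label(path: str) -> str:
    """Drop the query string and replace identifier-like segments with {id}."""
    path = path.split("?", 1)[0]
    segments = ["{id}" if _ID_SEGMENT.fullmatch(s) else s for s in path.split("/")]
    return "/".join(segments) or "/"


def status_class(status_code: int) -> str:
    """Group status codes into classes such as 2xx or 5xx."""
    return f"{status_code // 100}xx"


def http_labels(method: str, path: str, status_code: int) -> Tuple[str, str, str]:
    """Build the (method, route, status class) label triple for one request."""
    return method.upper(), route_label(path), status_class(status_code)


if __name__ == "__main__":
    print(http_labels("get", "/users/42/orders/7?page=2", 200))
```

Pattern-based normalization still lets arbitrary unmatched paths create new series, so using the router's matched route template, where one is available, is the safer source for the route label.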

Related

ADR 0011: SLOs, SLAs, error budget, and monitoring alerts: targets, recording rules, and readiness probing.
ADR 0023: Structured logs, request correlation, optional local Elasticsearch — complements metrics with searchable NDJSON and X-Request-Id; not a substitute for Prometheus.
Runbook: Error budget exhaustion — response when SLO budget is burned or alerts fire.

Page history

Date	Change	Author
2026-04-21	Added Page history section (repository baseline).	Ivan Boyarkin