Runbook: Observability Scrape Failing
Trigger
- Prometheus target is `DOWN`.
- The `/metrics` endpoint returns a non-200 status or an empty payload.
- Grafana dashboard panels show `No data`.
Fast triage
- Ensure the API is running: `make run`.
- Ensure the observability stack is running: `make observability-up`.
- Run the smoke check: `make observability-smoke`.
- Verify the endpoint directly: `curl http://127.0.0.1:8000/metrics`.
- Open the Prometheus targets page and inspect the scrape error details.
Useful links:
- Prometheus targets: http://127.0.0.1:9090/targets
- Grafana dashboard: http://127.0.0.1:3001/d/study-app-observability/study-app-observability?orgId=1
Quick Prometheus queries:
- RPS
- Error rate
- API p95
- DB p95
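Any of these queries can also be evaluated from a terminal through the Prometheus HTTP API (`/api/v1/query`), which helps when Grafana itself is suspect. The sketch below is a minimal helper under stated assumptions: `prom_query` and `PROM_URL` are hypothetical names, Prometheus is assumed to listen on `http://127.0.0.1:9090`, and the RPS expression is only an illustration built on `http_requests_total`, not necessarily the query the dashboard uses.

```shell
# Hypothetical helper: evaluate a PromQL expression via the Prometheus HTTP API.
# Assumption: Prometheus listens on PROM_URL (default http://127.0.0.1:9090).
PROM_URL="${PROM_URL:-http://127.0.0.1:9090}"

prom_query() {
  # -G sends the --data-urlencode payload as a URL-encoded query string on a GET.
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Illustrative RPS expression (assumes http_requests_total is a counter):
# prom_query 'sum(rate(http_requests_total[5m]))'
```

Using `--data-urlencode` avoids hand-escaping brackets and braces in PromQL.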
Most common causes
- The API process is down or was started on a different port.
- `METRICS_ENABLED=false` in the runtime environment.
- Wrong scrape target in `ops/prometheus/prometheus.tpl.yml` (rendered to `ops/prometheus/prometheus.yml`).
- Docker cannot reach the host (`host.docker.internal` is missing or broken).
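Two of the causes above can be checked from the shell before restarting anything. This is a sketch, assuming the rendered config sits at `ops/prometheus/prometheus.yml` and uses a `targets` key; `check_scrape_causes` is a hypothetical name. It flags `METRICS_ENABLED=false` and prints the configured target lines for a manual comparison with the port the API actually uses.

```shell
# Hypothetical helper: check two of the common causes listed above.
# Assumption: the rendered config lives at ops/prometheus/prometheus.yml.
check_scrape_causes() {
  local config="${1:-ops/prometheus/prometheus.yml}"
  local status=0

  # Cause: metrics disabled in the runtime environment of this shell.
  if [ "${METRICS_ENABLED:-}" = "false" ]; then
    echo "METRICS_ENABLED=false: /metrics is likely disabled" >&2
    status=1
  fi

  # Cause: wrong scrape target. Print target lines (plus two lines of context)
  # so they can be compared with the host and port the API listens on.
  if [ -f "$config" ]; then
    grep -n -A2 'targets' "$config"
  else
    echo "rendered config not found: $config" >&2
    status=1
  fi
  return "$status"
}
```

Note that this only sees the current shell's environment. If the API runs elsewhere (for example in a container), inspect that process's environment instead.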
Recovery steps
Endpoint-level checks
- Run `curl -i http://127.0.0.1:8000/metrics` and confirm the status is `200`.
- Confirm the payload contains metric names such as `http_requests_total`.
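Both endpoint checks can be combined into a single call that reports which one failed. This is a minimal sketch, assuming the API serves `/metrics` on `127.0.0.1:8000`. `check_metrics_endpoint` and `METRICS_URL` are hypothetical names. `curl -w '%{http_code}'` prints only the status code, and the body is written to a temporary file.

```shell
# Hypothetical helper: fetch /metrics once and apply the endpoint-level checks.
# Assumption: the API serves metrics at METRICS_URL and exports http_requests_total.
METRICS_URL="${METRICS_URL:-http://127.0.0.1:8000/metrics}"

check_metrics_endpoint() {
  local body code rc=0
  body="$(mktemp)"
  # Body goes to $body; stdout receives only the status (000 if unreachable).
  code="$(curl -s -o "$body" -w '%{http_code}' "$METRICS_URL")"
  if [ "$code" != "200" ]; then
    echo "unexpected status: ${code:-none}" >&2; rc=1
  elif [ ! -s "$body" ]; then
    echo "empty payload" >&2; rc=1
  elif ! grep -q 'http_requests_total' "$body"; then
    echo "http_requests_total not found in payload" >&2; rc=1
  else
    echo "OK: status 200, payload contains http_requests_total"
  fi
  rm -f "$body"
  return "$rc"
}
```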
Prometheus target checks
- Open http://127.0.0.1:9090/targets.
- If the target is down, inspect `PROMETHEUS_SCRAPE_TARGET`, then re-run `make observability-up` to render and apply the config.
- Restart the stack: `make observability-down && make observability-up`.
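The targets page is backed by the Prometheus `/api/v1/targets` endpoint, so the same information can be read without a browser. This sketch avoids a `jq` dependency by assuming the API returns compact JSON with `health` and `lastError` fields per active target. `count_down_targets`, `show_scrape_errors` and `PROM_URL` are hypothetical names. For anything beyond a quick look, a real JSON parser is more robust.

```shell
# Hypothetical helpers for the Prometheus targets API.
# Assumption: /api/v1/targets returns compact JSON (no spaces around ':').
PROM_URL="${PROM_URL:-http://127.0.0.1:9090}"

count_down_targets() {
  # Reads targets JSON on stdin; prints how many targets report health "down".
  grep -o '"health":"down"' | wc -l | tr -d ' '
}

show_scrape_errors() {
  # Reads targets JSON on stdin; prints every non-empty lastError field.
  # Messages containing escaped quotes are truncated at the first quote.
  grep -o '"lastError":"[^"]*"' | grep -v '^"lastError":""$'
}

# Usage (requires a running Prometheus):
# curl -s "${PROM_URL}/api/v1/targets" | count_down_targets
# curl -s "${PROM_URL}/api/v1/targets" | show_scrape_errors
```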
Grafana checks
- Open Grafana (http://127.0.0.1:3001) and verify the Prometheus datasource is healthy.
- Generate test traffic to create fresh metrics:
for i in {1..20}; do curl -s http://127.0.0.1:8000/live > /dev/null; done
for i in {1..20}; do curl -s http://127.0.0.1:8000/ready > /dev/null; done
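The loops above discard every response, so a failing `/live` or `/ready` probe goes unnoticed. A hypothetical variant, `generate_traffic` (with an assumed `API_URL` defaulting to `http://127.0.0.1:8000`), sends the same requests and prints how many did not return HTTP 200. It uses a C-style `for` loop because bash brace expansion such as `{1..$n}` does not expand variables.

```shell
# Hypothetical variant of the traffic loops: count non-200 responses.
# Assumption: the API is reachable at API_URL and exposes /live and /ready.
API_URL="${API_URL:-http://127.0.0.1:8000}"

generate_traffic() {
  local n="${1:-20}" path code failures=0 i
  for path in /live /ready; do
    for ((i = 1; i <= n; i++)); do
      # Discard the body; keep only the HTTP status code.
      code="$(curl -s -o /dev/null -w '%{http_code}' "${API_URL}${path}")"
      [ "$code" = "200" ] || failures=$((failures + 1))
    done
  done
  echo "$failures"
}

# Usage: generate_traffic 20   # prints the number of non-200 responses
```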
Exit criteria
- Prometheus target is `UP`.
- `/metrics` returns the expected metrics.
- Grafana dashboard panels show non-empty time-series data.
Follow-up
- If scrape targets or metrics policy changed, update ops docs or an ADR.
- If the same scrape failure repeats, add a note to this runbook.
Page history
| Date | Change | Author |
|---|---|---|
| | Added Page history section (repository baseline). | Ivan Boyarkin |