Runbook: Observability Scrape Failing

Trigger

  • Prometheus target is DOWN.
  • /metrics endpoint returns non-200 or empty payload.
  • Grafana dashboard panels show No data.

Fast triage

  1. Ensure API is running: make run.
  2. Ensure observability stack is running: make observability-up.
  3. Run smoke-check: make observability-smoke.
  4. Verify endpoint directly: curl http://127.0.0.1:8000/metrics.
  5. Open Prometheus targets page and inspect scrape error details.

Useful links:
Prometheus targets: http://127.0.0.1:9090/targets
Grafana dashboard: http://127.0.0.1:3001/d/study-app-observability/study-app-observability?orgId=1

Quick Prometheus queries: RPS, error rate, API p95, DB p95.

Most common causes

  • API process is down or started on a different port.
  • METRICS_ENABLED=false in runtime environment.
  • Wrong scrape target in ops/prometheus/prometheus.tpl.yml (rendered to ops/prometheus/prometheus.yml).
  • The Prometheus container cannot reach the host (host.docker.internal is missing or does not resolve).
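
The port and scrape-target causes above can be narrowed down by listing what the rendered config actually points at and comparing it with the port the API listens on. A minimal sketch, assuming the rendered file uses inline target lists such as targets: ["host.docker.internal:8000"]; list_scrape_targets is a hypothetical helper, not part of the repository.

```shell
# Hypothetical helper: print every host:port found on "targets:" lines of a
# Prometheus config read from stdin. Assumes inline lists on the same line,
# e.g. targets: ["host.docker.internal:8000"]; block-style lists are not handled.
list_scrape_targets() {
  grep -E '^[[:space:]]*(-[[:space:]]*)?targets:' |
    grep -oE '[A-Za-z0-9_.-]+:[0-9]+'
}

# Usage (path taken from this runbook):
#   list_scrape_targets < ops/prometheus/prometheus.yml
```

If the printed port differs from the one the API was started on, the target or the API port is the likely cause.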

Recovery steps

Endpoint-level checks

  • Run curl -i http://127.0.0.1:8000/metrics and confirm status 200.
  • Confirm payload contains metric names such as http_requests_total.
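
Both endpoint checks can be combined into one command. A minimal sketch, assuming curl and mktemp are available; has_metric and check_metrics are hypothetical helpers, and the default URL and metric name are the ones used in this runbook.

```shell
# Hypothetical helpers for the endpoint-level checks; not part of the repository.

# Succeeds if the Prometheus text payload on stdin has a "# TYPE" line or a
# sample line for the metric named in $1.
has_metric() {
  grep -Eq "^(# TYPE )?${1}[{ ]"
}

# Fetches the endpoint, then checks the status code and the payload.
# The function body runs in a subshell, so its variables stay local.
check_metrics() (
  url="${1:-http://127.0.0.1:8000/metrics}"
  metric="${2:-http_requests_total}"
  body="$(mktemp)"
  # curl reports 000 as the status code when the connection itself fails.
  status="$(curl -s -o "$body" -w '%{http_code}' "$url")"
  if [ "$status" != "200" ]; then
    echo "FAIL: $url returned status $status"
    result=1
  elif ! has_metric "$metric" < "$body"; then
    echo "FAIL: $metric not found in payload"
    result=1
  else
    echo "OK: $url serves $metric"
    result=0
  fi
  rm -f "$body"
  exit "$result"
)

# Usage:
#   check_metrics
#   check_metrics http://127.0.0.1:8000/metrics http_requests_total
```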

Prometheus target checks

  • Open http://127.0.0.1:9090/targets.
  • If the target is down, check the PROMETHEUS_SCRAPE_TARGET value, then re-run make observability-up to render and apply the config.
  • If the target is still down, restart the stack: make observability-down && make observability-up.
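
The scrape error shown on the targets page is also available from the Prometheus HTTP API at /api/v1/targets. A minimal sketch that extracts the relevant fields with grep, assuming the compact JSON that Prometheus normally returns; target_health is a hypothetical helper, and a JSON tool such as jq would be more robust where it is installed.

```shell
# Hypothetical helper: print scrapeUrl, lastError and health for each target
# from a /api/v1/targets JSON response read from stdin.
# Assumes compact JSON (no spaces around ":"); escaped quotes inside
# lastError are kept intact by the regular expression.
target_health() {
  grep -oE '"(scrapeUrl|lastError|health)":"([^"\\]|\\.)*"'
}

# Usage (Prometheus address taken from this runbook):
#   curl -s http://127.0.0.1:9090/api/v1/targets | target_health
```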

Grafana checks

  • Open Grafana (http://127.0.0.1:3001) and verify Prometheus datasource is healthy.
  • Generate test traffic to create fresh metrics:
      for i in {1..20}; do curl -s http://127.0.0.1:8000/live > /dev/null; done
      for i in {1..20}; do curl -s http://127.0.0.1:8000/ready > /dev/null; done
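
To confirm that the test traffic actually produced fresh metrics, the request counter can be compared before and after the loops. A minimal sketch, assuming the counter is exposed as http_requests_total, that requests to /live are counted by it, and that sample lines carry no trailing timestamp; sum_metric is a hypothetical helper.

```shell
# Hypothetical helper: sum all sample values of the metric named in $1 from a
# Prometheus text payload read from stdin. Assumes the value is the last field
# on each sample line (no trailing timestamp).
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP/TYPE comment lines
    {
      brace = index($1, "{")            # strip the label set, if any
      metric = (brace > 0) ? substr($1, 1, brace - 1) : $1
      if (metric == name) total += $NF
    }
    END { printf "%d\n", total }
  '
}

# Usage: under the assumptions above, the second number should be larger.
#   curl -s http://127.0.0.1:8000/metrics | sum_metric http_requests_total
#   for i in $(seq 1 20); do curl -s http://127.0.0.1:8000/live > /dev/null; done
#   curl -s http://127.0.0.1:8000/metrics | sum_metric http_requests_total
```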

Exit criteria

  • Prometheus target is UP.
  • /metrics returns expected metrics.
  • Grafana dashboard panels show non-empty time-series data.

Follow-up

  • If scrape targets or metrics policy changed, update ops docs or an ADR.
  • If the same scrape failure repeats, add a note to this runbook.

Page history

Date | Change | Author
- | Added Page history section (repository baseline). | Ivan Boyarkin