Runbooks Index

Overview

Use this page to pick a runbook by incident type. Each runbook lists triggers, quick checks, likely causes, recovery steps, and how to confirm the problem is gone.

Runbooks

Document	Description
Template	Skeleton for a new runbook; every runbook uses the same sections.
Tests failing	Failing tests in CI or locally: triage and fix.
Migrations failing	Alembic errors, bad migration files, or schema mismatch.
Logging failing	No log lines, bad paths, or permissions on `logs/`.
Pre-commit failing	Hooks fail on commit; staging and auto-format conflicts.
Quality check failing	`make verify` fails: lint, types, tests, or docs sync.
API security failing	Auth, rate limits, CORS, headers, or body size problems.
OpenAPI contract test failing	Contract test or OpenAPI baseline drift.
Observability scrape failing	Prometheus targets down or Grafana panels empty.
Error budget exhaustion	SLO alerts, high 5xx, latency, or readiness failures.
In-page TOC missing	“On this page” sidebar missing, empty, or wrong.

Operational rule

If you find a policy or architecture gap, add or update an ADR (architecture decision record).
If you find a process or step gap, update the relevant runbook.

Mini-SLA

During an incident: add short notes or a checklist to the runbook within 30 minutes.
After the fix: publish a clean runbook update within 24 hours.
Same incident twice in 30 days: update the runbook and add a preventive action (docs or ADR).
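
The three time limits above can be checked mechanically. The following is a minimal sketch, not part of the repository: the function names are hypothetical, the 30-minute window is assumed to start when the incident starts, and the 30-day recurrence window is assumed to include the boundary.

```python
from datetime import datetime, timedelta

# Time limits taken from the Mini-SLA above.
NOTES_WINDOW = timedelta(minutes=30)
PUBLISH_WINDOW = timedelta(hours=24)
RECURRENCE_WINDOW = timedelta(days=30)


def notes_due(incident_start: datetime) -> datetime:
    """Latest time to add notes or a checklist (assumes the clock starts at incident start)."""
    return incident_start + NOTES_WINDOW


def publish_due(fix_time: datetime) -> datetime:
    """Latest time to publish the clean runbook update after the fix."""
    return fix_time + PUBLISH_WINDOW


def is_recurrence(previous: datetime, current: datetime) -> bool:
    """True if two occurrences of the same incident fall within 30 days (boundary included by assumption)."""
    return abs(current - previous) <= RECURRENCE_WINDOW


if __name__ == "__main__":
    start = datetime(2026, 4, 21, 10, 0)
    fixed = datetime(2026, 4, 21, 12, 15)
    print("Notes due by:", notes_due(start))
    print("Clean update due by:", publish_due(fixed))
    print("Recurrence:", is_recurrence(datetime(2026, 4, 1, 9, 0), start))
```

If a recurrence is detected, the rule above applies: update the runbook and add a preventive action.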

Page history

Date	Change	Author
2026-04-21	Added Page history section (repository baseline).	Ivan Boyarkin