Runbooks Index

Overview

Use this page to pick a runbook by incident type. Each runbook lists triggers, quick checks, likely causes, recovery steps, and how to confirm the problem is gone.

Runbooks

Document                        Description
Template                        Shell for a new runbook (same sections every time).
Tests failing                   Failing tests in CI or locally: triage and fix.
Migrations failing              Alembic errors, bad migration files, or schema mismatch.
Logging failing                 No log lines, bad paths, or permissions on logs/.
Pre-commit failing              Hooks fail on commit; staging and auto-format conflicts.
Quality check failing           make verify fails: lint, types, tests, or docs sync.
API security failing            Auth, rate limits, CORS, headers, or body size problems.
OpenAPI contract test failing   Contract test or OpenAPI baseline drift.
Observability scrape failing    Prometheus down targets or empty Grafana panels.
Error budget exhaustion         SLO alerts, high 5xx, latency, or readiness failures.
In-page TOC missing             “On this page” sidebar missing, empty, or wrong.

Operational rule

  • If you find a policy or architecture gap, add or update an ADR.
  • If you find a process or step gap, update the affected runbook.

Mini-SLA

  • During an incident: add short notes or a checklist to the runbook within 30 minutes.
  • After the fix: publish a clean runbook update within 24 hours.
  • Same incident twice in 30 days: update the runbook and add a preventive action (docs or ADR).
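
The time windows above can be turned into concrete deadlines. The sketch below is illustrative only and is not part of the project's tooling. The function names are hypothetical, and it assumes that the 30-minute window is counted from the start of the incident, that the 24-hour window is counted from the time of the fix, and that a recurrence exactly 30 days later still counts as a repeat.

```python
from datetime import datetime, timedelta

# Windows taken from the Mini-SLA list above.
NOTES_WINDOW = timedelta(minutes=30)   # short notes or checklist during an incident
UPDATE_WINDOW = timedelta(hours=24)    # clean runbook update after the fix
REPEAT_WINDOW = timedelta(days=30)     # same incident twice triggers a preventive action


def notes_due(incident_start: datetime) -> datetime:
    """Latest time to add notes, assuming the clock starts at incident start."""
    return incident_start + NOTES_WINDOW


def update_due(fixed_at: datetime) -> datetime:
    """Latest time to publish the clean update, assuming the clock starts at the fix."""
    return fixed_at + UPDATE_WINDOW


def is_repeat(previous_start: datetime, current_start: datetime) -> bool:
    """True if the current incident began within 30 days after the previous one.

    The boundary (exactly 30 days) is treated as a repeat; that is an assumption.
    """
    gap = current_start - previous_start
    return timedelta(0) <= gap <= REPEAT_WINDOW
```

For example, under these assumptions an incident that starts at 09:00 needs notes by 09:30, and a fix that lands at 11:15 needs a published runbook update by 11:15 the next day.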

Page history

Date Change Author
Added Page history section (repository baseline). Ivan Boyarkin