RFC 0001: Documentation Search Implementation
Metadata
- Date: 2026-04-17
- Related decision: ADR 0027
Goals and non-goals
Goals
- Fast client-side search across all HTML docs pages.
- Lexical ranking that is deterministic and easy to tune.
- Few runtime dependencies; works on static hosting.
- Telemetry we can act on: zero-result rate, query click-through, and time-to-first-success.
Non-goals
- Semantic/vector search, typo tolerance, or language-specific stemming and lemmatization.
- Hosted search products (e.g. cloud SaaS).
Architecture overview
- The build writes docs/assets/search-index.json from docs/**/*.html.
- The browser loads the index once; ranking runs in JavaScript at query time.
- The UI shows the top results and supports keyboard navigation.
- The browser sends telemetry events to the API.
- The API appends rows to SQLite table docs_search_events.
- The API exposes KPI snapshots at /internal/telemetry/docs-search/metrics.
Build-time indexing
Implemented in scripts/build_docs_search_index.py.
Input corpus
- Includes: all docs/**/*.html.
- Excludes: files under docs/assets/.
Extraction and normalization
- Title from <title> with fallback to file stem.
- Content from <main>; fallback to <body>.
- Removes script/style/noscript/svg/template blocks and HTML tags.
- Whitespace collapse and entity decoding.
- Lowercase normalization and alphanumeric tokenization.
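The extraction steps above can be sketched in a few lines of Python. This is a minimal illustration, not the contents of scripts/build_docs_search_index.py: the function names (`clean_text`, `extract`, `tokenize`) are hypothetical, the regex-based HTML handling is a simplification that a real parser would replace, and "alphanumeric" is assumed to mean ASCII letters and digits.

```python
import html
import re
from typing import Optional

# Blocks whose contents are dropped entirely, per the list above.
_DROP_BLOCKS = re.compile(
    r"<(script|style|noscript|svg|template)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
_TAG = re.compile(r"<[^>]+>")
_TOKEN = re.compile(r"[a-z0-9]+")  # assumption: ASCII alphanumeric runs


def _first_match(pattern: str, text: str) -> Optional[str]:
    m = re.search(pattern, text, re.IGNORECASE | re.DOTALL)
    return m.group(1) if m else None


def clean_text(fragment: str) -> str:
    """Drop unwanted blocks and tags, decode entities, collapse whitespace."""
    fragment = _DROP_BLOCKS.sub(" ", fragment)
    fragment = _TAG.sub(" ", fragment)
    fragment = html.unescape(fragment)
    return re.sub(r"\s+", " ", fragment).strip()


def extract(page_html: str, file_stem: str) -> dict:
    """Return title and content, applying the documented fallbacks."""
    title_raw = _first_match(r"<title\b[^>]*>(.*?)</title>", page_html)
    title = clean_text(title_raw) if title_raw else ""
    body_raw = (
        _first_match(r"<main\b[^>]*>(.*?)</main>", page_html)
        or _first_match(r"<body\b[^>]*>(.*?)</body>", page_html)
        or ""
    )
    return {"title": title or file_stem, "content": clean_text(body_raw)}


def tokenize(text: str) -> list:
    """Lowercase, then split into alphanumeric tokens."""
    return _TOKEN.findall(text.lower())
```

For example, `tokenize("Install & Setup v2")` yields `["install", "setup", "v2"]`, and a page without a `<title>` element falls back to the file stem passed in by the caller.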
Index schema (version 2)
- meta: doc count, vocabulary size, avg content length, formulas.
- docs: metadata for rendering and length normalization.
- doc_freq: token document frequencies.
- postings: token postings with per-field TF values.
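To make the four top-level keys concrete, here is a hedged sketch of how such an index could be assembled from already-tokenized pages. Only the top-level keys (meta, docs, doc_freq, postings) come from this RFC; the nested key names (`doc_count`, `vocab_size`, `avg_content_len`, `content_len`, `doc`) and the input shape are assumptions made for illustration, and the `formulas` entry is omitted.

```python
from collections import Counter

# Field names follow the ranking formula (title, url, section, content).
FIELDS = ("title", "url", "section", "content")


def build_index(docs: list) -> dict:
    """docs: [{"url": str, "title": str, "fields": {field: [tokens]}}] (assumed shape)."""
    doc_freq = Counter()
    postings = {}
    doc_meta = []
    for doc_id, doc in enumerate(docs):
        tf_by_field = {f: Counter(doc["fields"].get(f, [])) for f in FIELDS}
        vocab = set().union(*tf_by_field.values())
        for token in sorted(vocab):
            doc_freq[token] += 1  # one count per document containing the token
            postings.setdefault(token, []).append(
                {"doc": doc_id, **{f: tf_by_field[f][token] for f in FIELDS}}
            )
        doc_meta.append({
            "url": doc["url"],
            "title": doc["title"],
            "content_len": len(doc["fields"].get("content", [])),
        })
    n = len(doc_meta)
    avg_len = sum(d["content_len"] for d in doc_meta) / n if n else 0.0
    return {
        "meta": {"version": 2, "doc_count": n,
                 "vocab_size": len(doc_freq), "avg_content_len": avg_len},
        "docs": doc_meta,
        "doc_freq": dict(doc_freq),
        "postings": postings,
    }
```

Storing per-field term frequencies in each posting lets the browser apply field weights at query time without re-reading document text.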
Query-time ranking
Implemented in docs/assets/docs-nav.js.
Core formula
score_base(d, T) = Σ over t in T:
    idf(t) * (
          8.0 * log(1 + tf_title(t, d))
        + 4.0 * log(1 + tf_url(t, d))
        + 2.0 * log(1 + tf_section(t, d))
        + 1.4 * log(1 + tf_content(t, d))
    )
idf(t) = log(1 + (N + 1) / (df(t) + 0.5))
norm(d) = 1 / (1 + 0.08 * max(0, content_len(d) / avg_content_len - 1))
score_norm(d, T) = score_base(d, T) * norm(d)
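The formula above translates directly into code. The sketch below is a Python transcription for checking numbers by hand, not the JavaScript in docs/assets/docs-nav.js; the function names and argument shapes are hypothetical, `log` is taken to be the natural logarithm, and `avg_content_len` is assumed to be positive.

```python
import math

# Field weights as given in the core formula.
FIELD_WEIGHTS = {"title": 8.0, "url": 4.0, "section": 2.0, "content": 1.4}


def idf(n_docs: int, df: int) -> float:
    return math.log(1 + (n_docs + 1) / (df + 0.5))


def length_norm(content_len: float, avg_content_len: float) -> float:
    # Only documents longer than average are penalized.
    return 1 / (1 + 0.08 * max(0.0, content_len / avg_content_len - 1))


def score(query_tokens, tf, df, n_docs, content_len, avg_content_len) -> float:
    """tf: {token: {field: count}} for one document; df: {token: doc frequency}."""
    base = 0.0
    for t in query_tokens:
        field_sum = sum(
            w * math.log(1 + tf.get(t, {}).get(f, 0))
            for f, w in FIELD_WEIGHTS.items()
        )
        base += idf(n_docs, df.get(t, 0)) * field_sum
    return base * length_norm(content_len, avg_content_len)
```

As a worked case: with N = 9 and df(t) = 2, idf(t) = log(1 + 10 / 2.5) = log 5. A document three times the average length gets norm = 1 / (1 + 0.08 * 2) = 1 / 1.16, while any document at or below average length keeps norm = 1.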
Precision boosts
- All tokens present in title.
- All tokens present in URL.
- Exact query phrase in title.
- Exact query phrase in URL.
- Title prefix match.
- Exact section match.
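The RFC lists the boost conditions but not their weights, so the sketch below only detects which conditions hold for one document. The matching rules are one plausible reading and are assumptions: "phrase" is treated as a contiguous token sequence, "title prefix" as the title starting with the query tokens, and "exact section" as a section heading whose tokens equal the query tokens. All names are hypothetical.

```python
import re

_TOKEN = re.compile(r"[a-z0-9]+")


def _tokens(text: str) -> list:
    return _TOKEN.findall(text.lower())


def _contains_phrase(haystack: list, needle: list) -> bool:
    """True if needle occurs in haystack as a contiguous run of tokens."""
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )


def matched_boosts(query: str, title: str, url: str, sections: list) -> set:
    """Return the names of the boost conditions that hold (names are illustrative)."""
    q = _tokens(query)
    if not q:
        return set()
    t, u = _tokens(title), _tokens(url)
    checks = {
        "all_tokens_in_title": set(q) <= set(t),
        "all_tokens_in_url": set(q) <= set(u),
        "phrase_in_title": _contains_phrase(t, q),
        "phrase_in_url": _contains_phrase(u, q),
        "title_prefix": t[:len(q)] == q,
        "exact_section": any(_tokens(s) == q for s in sections),
    }
    return {name for name, hit in checks.items() if hit}
```

Returning the set of matched conditions, rather than a single number, keeps the weights tunable separately and makes ranking decisions easier to debug.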
Prefix expansion policy
- Applied only to the last query token.
- Enabled for token length 3+.
- Capped by DOCS_SEARCH_MAX_PREFIX_EXPANSIONS for bounded latency.
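The three rules above can be sketched as follows. The cap value of 20 is a placeholder, since this RFC does not state the value of DOCS_SEARCH_MAX_PREFIX_EXPANSIONS, and the choice to prefer higher document frequency (then alphabetical order) when truncating is an assumption made to keep the output deterministic.

```python
MAX_PREFIX_EXPANSIONS = 20  # placeholder; the real cap is not stated in this RFC
MIN_PREFIX_LEN = 3


def expand_query(tokens: list, doc_freq: dict,
                 max_expansions: int = MAX_PREFIX_EXPANSIONS) -> list:
    """Return one candidate list per query token; only the last token is expanded."""
    if not tokens:
        return []
    expanded = [[t] for t in tokens[:-1]]
    last = tokens[-1]
    candidates = [last]
    if len(last) >= MIN_PREFIX_LEN:
        matches = [v for v in doc_freq if v.startswith(last) and v != last]
        # Assumed tie-break: most frequent first, then alphabetical.
        matches.sort(key=lambda v: (-doc_freq[v], v))
        candidates += matches[:max_expansions]
    expanded.append(candidates)
    return expanded
```

Restricting expansion to the last token matches the typing pattern of search-as-you-type: earlier tokens are complete words, and only the final one is likely to be a partial word.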
Telemetry implementation
Client event types
- search_query: emitted after each completed query execution.
- search_result_click: emitted on result click / enter activation.
- search_success: first successful click within active search session.
- search_query_error: emitted when index loading fails.
Transport behavior
- Uses fetch with keepalive: true.
- Locally, docs served on 127.0.0.1:8765 send telemetry to the API on 127.0.0.1:8000.
- You can override the endpoint with <meta name="docs-search-telemetry-endpoint" ...>.
Server-side ingestion and storage
- Ingest endpoint: POST /internal/telemetry/docs-search.
- Storage table: docs_search_events (append-only).
- DB file: same SQLite file configured by SQLITE_DB_PATH.
- WAL mode enabled in telemetry store.
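A minimal sketch of an append-only store with WAL enabled, using Python's standard sqlite3 module. The column names are the ones referenced elsewhere in this RFC (the validation query and the KPI definitions); their types, the constraints, and the helper names `open_store` and `append_event` are assumptions, and the real table may carry more columns.

```python
import sqlite3
import time


def open_store(db_path: str) -> sqlite3.Connection:
    """Open the SQLite file, enable WAL, and create the table if missing."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS docs_search_events (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event TEXT NOT NULL,
            session_id TEXT NOT NULL,
            query_id TEXT,
            results_count INTEGER,
            time_to_success_ms INTEGER,
            emitted_at_ms INTEGER NOT NULL
        )
        """
    )
    conn.commit()
    return conn


def append_event(conn: sqlite3.Connection, event: dict) -> int:
    """Insert one event row and return its id; rows are never updated or deleted."""
    cur = conn.execute(
        "INSERT INTO docs_search_events "
        "(event, session_id, query_id, results_count, time_to_success_ms, emitted_at_ms) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            event["event"],
            event["session_id"],
            event.get("query_id"),
            event.get("results_count"),
            event.get("time_to_success_ms"),
            event.get("emitted_at_ms", int(time.time() * 1000)),
        ),
    )
    conn.commit()
    return cur.lastrowid
```

WAL mode lets readers such as the metrics endpoint run while the ingest path is writing, which suits an append-heavy table that shares a database file with other application data.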
Canonical KPI definitions
In a rolling time window:
- Zero-result rate = count(search_query where results_count = 0) / count(search_query).
- Query CTR = count(distinct query_id with at least one search_result_click) / count(search_query).
- Time-to-first-success = distribution of first search_success.time_to_success_ms per session (reported as p50/p75).
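The definitions above can be checked against a small event list. The sketch below assumes the events have already been filtered to the rolling window and are plain dicts with the field names used in this RFC. The nearest-rank percentile method and the output key names are assumptions; the metrics endpoint may interpolate differently.

```python
import math
from typing import Optional


def percentile(values: list, p: float) -> Optional[float]:
    """Nearest-rank percentile; returns None for an empty sample."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def compute_kpis(events: list) -> dict:
    """Compute the three KPIs from events already limited to the window."""
    queries = [e for e in events if e["event"] == "search_query"]
    zero = sum(1 for e in queries if e.get("results_count") == 0)
    # Distinct query_ids with at least one click.
    clicked = {e["query_id"] for e in events if e["event"] == "search_result_click"}
    # Keep only the earliest search_success per session.
    first_success = {}
    for e in sorted(events, key=lambda e: e["emitted_at_ms"]):
        if e["event"] == "search_success":
            first_success.setdefault(e["session_id"], e["time_to_success_ms"])
    n = len(queries)
    times = list(first_success.values())
    return {
        "zero_result_rate": zero / n if n else None,
        "query_ctr": len(clicked) / n if n else None,
        "tts_p50_ms": percentile(times, 50),
        "tts_p75_ms": percentile(times, 75),
    }
```

Counting distinct query_id values in the CTR numerator means that several clicks on the results of one query count once, so CTR cannot exceed 1.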
Metrics endpoint: GET /internal/telemetry/docs-search/metrics?window_minutes=<N>.
Local validation playbook
- Rebuild the search index: python3 scripts/build_docs_search_index.py
- Serve docs over HTTP (not file://): cd docs && python3 -m http.server 8765
- Run the API: make run
- Open http://127.0.0.1:8765/index.html, run searches, click a result.
- In the browser network tab, confirm GET /assets/search-index.json returns 200 or 304.
- Check the database: sqlite3 study_app.db "select id,event,session_id,query_id,datetime(emitted_at_ms/1000,'unixepoch','localtime') from docs_search_events order by id desc limit 20;"
- Check the metrics API: curl "http://127.0.0.1:8000/internal/telemetry/docs-search/metrics?window_minutes=60"
Troubleshooting
Could not load search index
- Browsers block fetch for file:// pages. Serve docs over HTTP.
- A bad relative path can request /internal/assets/search-index.json instead of /assets/search-index.json.
Telemetry preflight returns 200 but the DB has no rows
- OPTIONS only checks CORS; confirm a real POST and new rows in the DB.
- Hard-refresh the page so the browser does not use an old docs-nav.js.
- Some SQLite GUI tools do not refresh for WAL mode; use the terminal query above.
Change management
- ADRs stay for architecture (what/why); RFCs hold implementation details (how).
- If ranking weights or formulas change, update this RFC and run the local validation playbook.
- If the telemetry schema changes, keep the metrics endpoint backward compatible when you can.
Page history
| Date | Change | Author |
|---|---|---|
| | Added Page history section (repository baseline). | Ivan Boyarkin |