ADR 0027: High-relevance client-side docs search (inverted index, IDF, and ranking boosts)
Ratification
- Discussion Issue: N/A (implemented as architecture improvement for docs UX)
- Merge PR: pending
- Accepted: 2026-04-17
Context
Docs are hosted as static files. We need fast, useful search across ADRs, runbooks, internal guides, developer pages, and API docs.
A simple “word present or not” scorer is easy to maintain but weak on large sites: common words dominate the ranking, long pages win by sheer volume, and type-ahead prefix search behaves poorly.
Decision
We replace the old scorer with an inverted-index lexical model that combines:
- Per-field term frequencies at build time (title, URL, section, content).
- IDF-based weighting at query time.
- Log-scaled TF and document length normalization.
- High-precision boosts for exact phrase, all-token coverage, and title prefix matches.
- Prefix expansion only for the last query token for good type-ahead UX.
We also add search telemetry (append-only) in the app SQLite database so local runs stay simple and we can verify behavior.
Theoretical basis
Representation
Each document is represented as:
d = (title, url, section, preview, content_len, tf_title, tf_url, tf_section, tf_content)
where tf_field(term, d) is term frequency in that field.
Normalization
Input text is lowercased, punctuation-normalized, and whitespace-collapsed:
N(x) = trim(collapseSpaces(lowercase(x)))
Tokens are alphanumeric terms extracted from normalized text.
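The normalization and tokenization rules above can be sketched in Python (a minimal sketch; the shipped tokenizer is client-side JavaScript, and the function names here are illustrative):

```python
import re

def normalize(text: str) -> str:
    """N(x) = trim(collapseSpaces(lowercase(x))).

    Punctuation normalization is modeled here as replacing any
    non-alphanumeric run with a single space.
    """
    lowered = text.lower()
    collapsed = re.sub(r"[^a-z0-9]+", " ", lowered)
    return collapsed.strip()

def tokenize(text: str) -> list[str]:
    """Alphanumeric terms extracted from normalized text."""
    normalized = normalize(text)
    return normalized.split() if normalized else []
```

For example, `tokenize("ADR-0027: Docs Search!")` yields `["adr", "0027", "docs", "search"]`, so hyphens and punctuation never leak into index terms.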
Core ranking formula
For query tokens T, base score is:
score_base(d, T) = Σ over t in T:
idf(t) * (
w_title * log(1 + tf_title(t, d))
+ w_url * log(1 + tf_url(t, d))
+ w_section * log(1 + tf_section(t, d))
+ w_content * log(1 + tf_content(t, d))
)
Weights are tuned for precision-first behavior:
w_title = 8.0
w_url = 4.0
w_section = 2.0
w_content = 1.4
IDF and length normalization
idf(t) = log(1 + (N + 1) / (df(t) + 0.5))
len_ratio(d) = content_len(d) / avg_content_len
norm(d) = 1 / (1 + 0.08 * max(0, len_ratio(d) - 1))
score_norm(d, T) = score_base(d, T) * norm(d)
This downweights very common tokens and prevents long pages from winning by volume alone.
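Putting the base score, IDF, and length normalization together, the math above can be sketched as follows (signatures and data shapes are assumptions for illustration; the shipped scorer is the JavaScript in docs/assets/docs-nav.js):

```python
import math

# Field weights from this ADR (precision-first tuning).
WEIGHTS = {"title": 8.0, "url": 4.0, "section": 2.0, "content": 1.4}

def idf(df: int, n_docs: int) -> float:
    """idf(t) = log(1 + (N + 1) / (df(t) + 0.5))"""
    return math.log(1 + (n_docs + 1) / (df + 0.5))

def length_norm(content_len: int, avg_content_len: float) -> float:
    """norm(d) = 1 / (1 + 0.08 * max(0, len_ratio(d) - 1))"""
    len_ratio = content_len / avg_content_len
    return 1.0 / (1.0 + 0.08 * max(0.0, len_ratio - 1.0))

def score_norm(doc_tfs, query_tokens, df_by_term, n_docs,
               content_len, avg_content_len):
    """score_base(d, T) * norm(d).

    doc_tfs is assumed to map field -> {term: tf_field(term, d)}.
    """
    base = 0.0
    for t in query_tokens:
        field_sum = sum(
            w * math.log(1 + doc_tfs.get(field, {}).get(t, 0))
            for field, w in WEIGHTS.items()
        )
        base += idf(df_by_term.get(t, 0), n_docs) * field_sum
    return base * length_norm(content_len, avg_content_len)
```

Note that documents at or below average length get norm(d) = 1.0 exactly; the penalty only kicks in past the average, which keeps short pages unaffected.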
Precision boosts
Final score adds deterministic bonuses:
score_final = score_norm
+ B_all_tokens_in_title
+ B_all_tokens_in_url
+ B_exact_phrase_in_title
+ B_exact_phrase_in_url
+ B_title_prefix
+ B_exact_section
These bonuses enforce intuitive ranking for navigational queries and short phrase queries.
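The boost pass can be sketched as below. The ADR does not pin exact bonus constants, so the values here are assumptions chosen only to show the relative ordering (phrase > prefix > coverage); all inputs are assumed pre-normalized with N(x):

```python
# Illustrative bonus values — NOT the shipped constants.
BONUSES = {
    "all_tokens_in_title": 30.0,
    "all_tokens_in_url": 15.0,
    "exact_phrase_in_title": 50.0,
    "exact_phrase_in_url": 25.0,
    "title_prefix": 40.0,
    "exact_section": 10.0,
}

def apply_boosts(score, title, url, section, query, tokens):
    """score_final = score_norm + deterministic bonuses."""
    title_tokens = set(title.split())
    url_tokens = set(url.split())
    if tokens and all(t in title_tokens for t in tokens):
        score += BONUSES["all_tokens_in_title"]
    if tokens and all(t in url_tokens for t in tokens):
        score += BONUSES["all_tokens_in_url"]
    if query and query in title:
        score += BONUSES["exact_phrase_in_title"]
    if query and query in url:
        score += BONUSES["exact_phrase_in_url"]
    if query and title.startswith(query):
        score += BONUSES["title_prefix"]
    if query and section == query:
        score += BONUSES["exact_section"]
    return score
```

Because the bonuses are additive constants rather than multipliers, a navigational query that exactly prefixes a title reliably outranks pages that merely mention the terms often.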
Complexity model
With an inverted index, query complexity becomes proportional to postings, not all documents:
Build: O(total_tokens)
Query: O(sum_postings_for_query_terms + rerank_candidates)
Space: O(vocabulary + postings + doc_metadata)
This is substantially faster than scanning every document on each query.
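The build/query asymmetry above can be sketched with a minimal postings structure (term → {doc_id: tf}); names and shapes are illustrative, not the artifact schema:

```python
from collections import defaultdict

def build_index(docs: dict[str, list[str]]) -> dict:
    """Build postings term -> {doc_id: tf}. Cost: O(total_tokens)."""
    postings: dict[str, dict[str, int]] = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for t in tokens:
            postings[t][doc_id] = postings[t].get(doc_id, 0) + 1
    return postings

def candidates(postings: dict, query_tokens: list[str]) -> set[str]:
    """Union of postings for the query terms.

    Only these documents are scored/reranked, so query cost tracks
    postings size rather than corpus size.
    """
    docs: set[str] = set()
    for t in query_tokens:
        docs.update(postings.get(t, {}))
    return docs
```

A query term absent from every document contributes an empty postings list, so rare-term queries touch almost nothing.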
Scope
- In scope: all HTML pages under docs/, including docs/api/.
- Out of scope: neural/vector search, typo-edit-distance search, multilingual morphology.
Alternatives considered
- Simple field-presence scoring
- Pros: tiny implementation.
- Cons: weak ranking on larger corpora; limited precision control.
- Hosted search providers
- Pros: rich relevance and analytics.
- Cons: external dependency, operational overhead, crawler governance.
- Client-side third-party full-text libraries
- Pros: mature ranking options.
- Cons: larger runtime dependency surface than needed.
Consequences
Positive
- Higher relevance for exact and navigational queries.
- Predictable and debuggable ranking math.
- Fast query path via postings-based candidate generation.
Trade-offs
- Larger index compared to minimal list-based format.
- More ranking parameters to maintain.
- Still lexical: no semantic similarity understanding.
Compatibility and migration
- Backward-compatibility impact: low (search implementation change only).
- Migration plan: generate new index version in existing docs pipeline.
- Rollback strategy: revert to previous scorer and simple index schema.
Implementation mapping
- Index builder: scripts/build_docs_search_index.py
- Runtime search and reranking: docs/assets/docs-nav.js
- UI styling and snippets: docs/assets/docs.css
- Artifact: docs/assets/search-index.json (schema version 2)
- Telemetry ingest API: POST /internal/telemetry/docs-search
- Telemetry metrics API: GET /internal/telemetry/docs-search/metrics
- Telemetry store module: app/core/docs_search_telemetry.py
- Telemetry database file: same SQLite file configured by SQLITE_DB_PATH
Telemetry events and storage
Client events
- search_query: one emitted query execution with query length, token count, result count, latency, and top-N impressions.
- search_result_click: click-through on a result (mouse/keyboard), including rank and URL.
- search_success: first successful click in a search session; includes time-to-success and time-to-click.
- search_query_error: client-side index-load failure for observability and diagnostics.
Persistence model
- All events are written append-only to docs_search_events in the current application SQLite DB.
- Telemetry data and business data share one DB file to simplify local setup.
- SQLite runs in WAL mode to reduce writer contention for frequent event inserts.
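A minimal sketch of this persistence model follows; the table schema and function names here are assumptions for illustration, and the actual schema lives in app/core/docs_search_telemetry.py:

```python
import sqlite3

def open_telemetry_db(path: str) -> sqlite3.Connection:
    """Open the shared app DB in WAL mode and ensure the events table exists."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # reduce writer contention
    conn.execute("""
        CREATE TABLE IF NOT EXISTS docs_search_events (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL,
            created_at TEXT NOT NULL DEFAULT (datetime('now'))
        )
    """)
    return conn

def record_event(conn: sqlite3.Connection, event_type: str, payload_json: str):
    """Append-only: inserts only; no UPDATE/DELETE path touches this table."""
    conn.execute(
        "INSERT INTO docs_search_events (event_type, payload) VALUES (?, ?)",
        (event_type, payload_json),
    )
    conn.commit()
```

WAL mode lets concurrent readers proceed while a single writer appends, which suits the one-writer, append-only event shape here.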
Metric definitions (canonical formulas)
Let Q be all search_query events in a time window, and S be the first search_success per session in the same window.
Zero-result rate
zero_result_rate = count(q in Q where q.results_count = 0) / count(Q)
Query CTR
query_ctr = count(distinct query_id in Q with at least one search_result_click) / count(Q)
Time-to-first-success
TTFS = distribution of s.time_to_success_ms for s in S
reported as p50 and p75
For this ADR, dashboard defaults are: p50 and p75 over the selected rolling window.
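The canonical formulas above can be sketched directly in Python (event records are assumed to be dicts with the fields named in the formulas; a nearest-rank percentile stands in for whatever quantile method the dashboard uses):

```python
import math

def zero_result_rate(queries: list[dict]) -> float:
    """count(q in Q where q.results_count = 0) / count(Q)."""
    if not queries:
        return 0.0
    zero = sum(1 for q in queries if q["results_count"] == 0)
    return zero / len(queries)

def query_ctr(queries: list[dict], clicks: list[dict]) -> float:
    """Distinct query_ids with at least one click, over count(Q)."""
    if not queries:
        return 0.0
    clicked = {c["query_id"] for c in clicks}
    hits = sum(1 for q in queries if q["query_id"] in clicked)
    return hits / len(queries)

def percentile(values: list[float], p: float):
    """Nearest-rank percentile, for reporting TTFS p50/p75."""
    if not values:
        return None
    ordered = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]
```

Both rates are defined as 0.0 on an empty window rather than raising, which keeps dashboard queries total over quiet periods.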
Validation
- Run make docs-fix and verify the index generation step succeeds.
- Smoke-test queries across ADR/internal/developer/runbooks/api paths.
- Verify keyboard UX: up/down/enter/escape.
- Track payload size and first-query latency after docs growth.
- Send a search event and verify row creation in the telemetry DB (docs_search_events).
- Call GET /internal/telemetry/docs-search/metrics and verify non-zero aggregates.
References
Page history
| Date | Change | Author |
|---|---|---|
| | Added Page history section (repository baseline). | Ivan Boyarkin |