ADR 0027: High-relevance client-side docs search (inverted index, IDF, and ranking boosts)

Ratification

Context

Docs are hosted as static files. We need fast, useful search across ADRs, runbooks, internal guides, developer pages, and API docs.

A simple “word present or not” scorer is easy to maintain but weak on large sites: common words win too often, long pages rank too high by sheer volume, and prefix matching while the user types behaves poorly.

Decision

We replace the old scorer with an inverted-index lexical model that combines:

  1. Per-field term frequencies at build time (title, URL, section, content).
  2. IDF-based weighting at query time.
  3. Log-scaled TF and document length normalization.
  4. High-precision boosts for exact phrase, all-token coverage, and title prefix matches.
  5. Prefix expansion only for the last query token for good type-ahead UX.

We also add search telemetry (append-only) in the app SQLite database so local runs stay simple and we can verify behavior.

Theoretical basis

Representation

Each document is represented as:

d = (title, url, section, preview, content_len, tf_title, tf_url, tf_section, tf_content)

where tf_field(term, d) is term frequency in that field.

Normalization

Input text is lowercased, punctuation-normalized, and whitespace-collapsed:

N(x) = trim(collapseSpaces(lowercase(x)))

Tokens are alphanumeric terms extracted from normalized text.
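A minimal sketch of N(x) and tokenization, assuming punctuation is mapped to spaces before collapsing (the ADR does not pin down the exact punctuation rule):

```python
import re

def normalize(x: str) -> str:
    # N(x) = trim(collapseSpaces(lowercase(x)))
    # Assumption: punctuation normalization maps non-alphanumerics to spaces,
    # which also collapses runs of whitespace in one pass.
    lowered = x.lower()
    spaced = re.sub(r"[^a-z0-9]+", " ", lowered)
    return spaced.strip()

def tokenize(x: str) -> list[str]:
    # Alphanumeric terms extracted from normalized text.
    n = normalize(x)
    return n.split() if n else []
```

Applying one regex for both punctuation and whitespace keeps normalization deterministic, so build-time indexing and query-time lookup agree on token boundaries.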

Core ranking formula

For query tokens T, base score is:

score_base(d, T) = Σ over t in T:
    idf(t) * (
          w_title   * log(1 + tf_title(t, d))
        + w_url     * log(1 + tf_url(t, d))
        + w_section * log(1 + tf_section(t, d))
        + w_content * log(1 + tf_content(t, d))
    )

Weights are tuned for precision-first behavior:

w_title   = 8.0
w_url     = 4.0
w_section = 2.0
w_content = 1.4
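The base score can be sketched as follows, assuming each document carries one term-frequency map per field (the field names mirror the representation above; the dict shape is an assumption of this sketch):

```python
import math

# Precision-first field weights from the ADR.
W_TITLE, W_URL, W_SECTION, W_CONTENT = 8.0, 4.0, 2.0, 1.4

def score_base(doc: dict, tokens: list[str], idf: dict) -> float:
    # doc holds per-field term-frequency maps, e.g. doc["tf_title"]["search"] = 2.
    total = 0.0
    for t in tokens:
        field_part = (
              W_TITLE   * math.log(1 + doc["tf_title"].get(t, 0))
            + W_URL     * math.log(1 + doc["tf_url"].get(t, 0))
            + W_SECTION * math.log(1 + doc["tf_section"].get(t, 0))
            + W_CONTENT * math.log(1 + doc["tf_content"].get(t, 0))
        )
        total += idf.get(t, 0.0) * field_part
    return total
```

The log(1 + tf) scaling means a term's second occurrence in a field adds less than its first, so repetition cannot dominate the title weight.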

IDF and length normalization

idf(t) = log(1 + (N + 1) / (df(t) + 0.5))

len_ratio(d) = content_len(d) / avg_content_len
norm(d)      = 1 / (1 + 0.08 * max(0, len_ratio(d) - 1))

score_norm(d, T) = score_base(d, T) * norm(d)

This downweights very common tokens and prevents long pages from winning by volume alone.
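A direct transcription of the two formulas, assuming N is the corpus size and df(t) the number of documents containing t:

```python
import math

def idf(df_t: int, n_docs: int) -> float:
    # idf(t) = log(1 + (N + 1) / (df(t) + 0.5)); the +0.5 keeps the
    # ratio finite for terms seen in every (or no) document.
    return math.log(1 + (n_docs + 1) / (df_t + 0.5))

def length_norm(content_len: int, avg_content_len: float) -> float:
    # Penalize only documents longer than average; 0.08 is the tuned slope.
    len_ratio = content_len / avg_content_len
    return 1 / (1 + 0.08 * max(0.0, len_ratio - 1))
```

Note the asymmetry: shorter-than-average pages get norm(d) = 1 exactly, so the penalty never becomes a bonus for stub pages.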

Precision boosts

Final score adds deterministic bonuses:

score_final = score_norm
              + B_all_tokens_in_title
              + B_all_tokens_in_url
              + B_exact_phrase_in_title
              + B_exact_phrase_in_url
              + B_title_prefix
              + B_exact_section

These bonuses enforce intuitive ranking for navigational queries and short phrase queries.
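A sketch of the title-side bonuses; the ADR fixes which bonus terms exist but not their magnitudes, so the constants below are illustrative assumptions:

```python
# Illustrative magnitudes only; the ADR does not specify these values.
B_ALL_TOKENS_IN_TITLE = 6.0
B_EXACT_PHRASE_IN_TITLE = 10.0
B_TITLE_PREFIX = 4.0

def title_boosts(title_norm: str, tokens: list[str], phrase: str) -> float:
    # title_norm and phrase are assumed already normalized (N(x) above).
    bonus = 0.0
    title_tokens = set(title_norm.split())
    if tokens and all(t in title_tokens for t in tokens):
        bonus += B_ALL_TOKENS_IN_TITLE      # all-token coverage
    if phrase and phrase in title_norm:
        bonus += B_EXACT_PHRASE_IN_TITLE    # exact phrase match
    if phrase and title_norm.startswith(phrase):
        bonus += B_TITLE_PREFIX             # title prefix match
    return bonus
```

Because the bonuses are additive and deterministic, a navigational query whose phrase exactly prefixes a title always outranks a page that merely mentions the same tokens in its body.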

Complexity model

With an inverted index, query complexity becomes proportional to postings, not all documents:

Build: O(total_tokens)
Query: O(sum_postings_for_query_terms + rerank_candidates)
Space: O(vocabulary + postings + doc_metadata)

This is substantially faster than scanning every document on each query.
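A minimal sketch of the postings structure and candidate generation, including prefix expansion for the last query token only. The linear vocabulary scan for the prefix is an assumption of this sketch; a production index would use a sorted term array or trie:

```python
from collections import defaultdict

def build_index(docs: dict[str, list[str]]) -> dict[str, set[str]]:
    # postings: term -> set of doc ids containing it; O(total_tokens) build.
    postings = defaultdict(set)
    for doc_id, tokens in docs.items():
        for t in tokens:
            postings[t].add(doc_id)
    return postings

def candidates(postings: dict, tokens: list[str]) -> set[str]:
    # Exact lookup for every token except the last, which is expanded
    # as a prefix for type-ahead behavior.
    out = set()
    for i, t in enumerate(tokens):
        if i == len(tokens) - 1:
            for term, ids in postings.items():   # sketch: linear scan
                if term.startswith(t):
                    out |= ids
        else:
            out |= postings.get(t, set())
    return out
```

Only the candidate set produced here is scored and reranked, which is what keeps query cost proportional to postings touched rather than corpus size.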

Scope

Alternatives considered

  1. Simple field-presence scoring
    • Pros: tiny implementation.
    • Cons: weak ranking on larger corpora; limited precision control.
  2. Hosted search providers
    • Pros: rich relevance and analytics.
    • Cons: external dependency, operational overhead, crawler governance.
  3. Client-side third-party full-text libraries
    • Pros: mature ranking options.
    • Cons: larger runtime dependency surface than needed.

Consequences

Positive

Trade-offs

Compatibility and migration

Implementation mapping

Telemetry events and storage

Client events

Persistence model

Metric definitions (canonical formulas)

Let Q be all search_query events in a time window, and S be first search_success per session in the same window.

Zero-result rate

zero_result_rate = count(q in Q where q.results_count = 0) / count(Q)

Query CTR

query_ctr = count(distinct query_id in Q with at least one search_result_click) / count(Q)

Time-to-first-success

TTFS = distribution of s.time_to_success_ms for s in S, reported as p50 and p75

For this ADR, dashboard defaults are: p50 and p75 over the selected rolling window.
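The first two metrics can be computed directly from the event lists; the event dict keys (`results_count`, `query_id`) are assumed to mirror the telemetry fields named above:

```python
def zero_result_rate(queries: list[dict]) -> float:
    # queries: search_query events in the window, each with "results_count".
    if not queries:
        return 0.0
    zero = sum(1 for q in queries if q["results_count"] == 0)
    return zero / len(queries)

def query_ctr(queries: list[dict], clicks: list[dict]) -> float:
    # clicks: search_result_click events carrying the originating "query_id".
    if not queries:
        return 0.0
    clicked = {c["query_id"] for c in clicks}
    with_click = sum(1 for q in queries if q["query_id"] in clicked)
    return with_click / len(queries)
```

Deduplicating clicks through a set implements "at least one search_result_click": a query with five clicks counts once in the numerator.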

Validation

References

Page history

Date Change Author
Added Page history section (repository baseline). Ivan Boyarkin