feat(ui): embed-vs-store latency breakdown + warm HTTP API on startup#60

Merged
Neverdecel merged 2 commits into master from claude/retrieval-timing-breakdown
Jun 19, 2026

@Neverdecel

Owner

Context

Follow-up to the LanceDB latency investigation (#59). A user reported 207 ms retrieval over only 906 chunks. Measured on normal hardware the full warm path at that scale is ~17 ms (≈3 ms query embedding + ≈14 ms LanceDB store), so the high number is host-bound (a shared/throttled demo box), not a code regression. To make that provable from the UI instead of guessed, this adds a per-phase latency breakdown — and fixes one real warm-up gap found along the way.

Changes

1. Embed-vs-store breakdown in the demo speed badge

  • HybridSearcher.search() now takes an optional timings dict and records per-phase latency (embed_ms / dense_ms / lexical_ms / hydrate_ms / rerank_ms). Default None keeps every existing caller unchanged.
  • The demo UI splits the ⚡ badge into embed (model inference) vs store (vector + BM25 + hydrate), so a slow query is immediately attributable. On a busy host the embedding usually dominates — the single number hid that.
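The optional-timings pattern described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the `phase` helper, the free function `search`, the `badge` formatter, and the placeholder phase bodies are all hypothetical. Only the key names and the embed/store grouping come from the description. Where `rerank_ms` is counted is not stated, so this sketch leaves it out of both figures.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Optional


@contextmanager
def phase(timings: Optional[Dict[str, float]], key: str) -> Iterator[None]:
    """Record elapsed milliseconds under `key`; does nothing when timings is None."""
    start = time.perf_counter()
    try:
        yield
    finally:
        if timings is not None:
            timings[key] = (time.perf_counter() - start) * 1000.0


def search(query: str, timings: Optional[Dict[str, float]] = None) -> List[str]:
    """Hypothetical stand-in for HybridSearcher.search(); phase bodies are placeholders."""
    with phase(timings, "embed_ms"):
        _vector = [float(len(query))]  # placeholder for model inference
    with phase(timings, "dense_ms"):
        dense = ["chunk-a", "chunk-b"]  # placeholder for vector search
    with phase(timings, "lexical_ms"):
        lexical = ["chunk-b", "chunk-c"]  # placeholder for BM25
    with phase(timings, "hydrate_ms"):
        merged = list(dict.fromkeys(dense + lexical))  # de-duplicate, keep order
    with phase(timings, "rerank_ms"):
        ranked = sorted(merged)  # placeholder for reranking
    return ranked


def badge(timings: Dict[str, float]) -> str:
    """Collapse per-phase numbers into the two figures shown in the badge."""
    embed = timings.get("embed_ms", 0.0)
    # "store" groups vector + BM25 + hydrate, as in the PR description.
    store = sum(timings.get(k, 0.0) for k in ("dense_ms", "lexical_ms", "hydrate_ms"))
    return f"embed {embed:.0f} · store {store:.0f} ms"
```

A caller that wants the breakdown passes a dict (`t = {}; search("q", timings=t); badge(t)`), while callers that omit the argument see no change in behavior.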

2. Warm the embedding model on HTTP API startup (separate commit)

  • run_server() called cr.status(), which builds the provider/store but never embeds — so the first HTTP /search paid the full cold model load + ONNX JIT. run_ui() already calls cr.warm(); this does the same for the HTTP surface.
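The effect of the change can be illustrated with a toy model. Everything here is an assumption made for the example: `FakeCodeRag` is a hypothetical stand-in for the object the PR calls `cr`, and the `serve` callback replaces whatever actually starts the HTTP server. The sketch only shows why calling `warm()` before serving moves the cold-start cost off the first request.

```python
from typing import Callable, Dict, List


class FakeCodeRag:
    """Hypothetical stand-in for `cr`; tracks whether the model has been loaded."""

    def __init__(self) -> None:
        self.calls: List[str] = []
        self.model_loaded = False

    def status(self) -> Dict[str, bool]:
        # Builds provider/store state but never embeds, so the model stays cold.
        self.calls.append("status")
        return {"ok": True}

    def warm(self) -> None:
        # Embeds once, which forces the model load before any request arrives.
        self.calls.append("warm")
        self.model_loaded = True

    def search(self, query: str) -> Dict[str, object]:
        paid_cold_start = not self.model_loaded
        self.model_loaded = True
        return {"query": query, "paid_cold_start": paid_cold_start}


def run_server(cr: FakeCodeRag, serve: Callable[[FakeCodeRag], None] = lambda cr: None) -> None:
    """Warm first, then hand off to the (hypothetical) serving loop."""
    cr.warm()
    serve(cr)
```

With `status()` alone, the first `search()` in this model reports `paid_cold_start=True`; after `run_server()` it reports `False`.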

Verification

  • Rendered the demo UI end-to-end (TestClient + fake provider): badge shows embed X · store Y ms.
  • All retrieval / store / indexer / webui / surfaces tests pass; lint + format clean.
  • Measurements backing the diagnosis: bge-small warm embed ≈2.5 ms; full LanceDB store path at 906 rows ≈14 ms (median).
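For reference, a median figure like the ones above can be collected with a small harness along these lines. This is a generic sketch, not the script used for the PR: the function name, the run count, and the warm-up count are all assumptions.

```python
import statistics
import time
from typing import Callable, List


def median_latency_ms(fn: Callable[[], object], runs: int = 20, warmup: int = 3) -> float:
    """Return the median wall-clock latency of fn() in milliseconds."""
    # Discard a few calls first so one-off costs (model load, index load)
    # do not skew the warm-path measurement.
    for _ in range(warmup):
        fn()
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)
```

The median is less sensitive than the mean to occasional slow runs, which matters on a shared or throttled host.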

🤖 Generated with Claude Code


@codecov-commenter

codecov-commenter commented Jun 19, 2026


⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 48.00000% with 13 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| coderag/surfaces/webui.py | 0.00% | 6 Missing ⚠️ |
| coderag/retrieval/search.py | 70.58% | 5 Missing ⚠️ |
| coderag/surfaces/http_api.py | 0.00% | 2 Missing ⚠️ |

claude added 2 commits June 19, 2026 10:36
The ⚡ badge timed the whole search() call, so a slow result couldn't be
attributed. On a busy/throttled host the query embedding (model inference)
usually dominates — not the LanceDB retrieval — but the single number hid that.

- HybridSearcher.search() takes an optional timings dict and records per-phase
  latency (embed_ms / dense_ms / lexical_ms / hydrate_ms / rerank_ms). Default
  None keeps every existing caller unchanged.
- The demo UI splits the badge into "embed" (model) vs "store" (vector + BM25 +
  hydrate) so a slow query is immediately attributable.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Y1DfHPqxHppXF6zEYgFKi3

run_server() called cr.status(), which builds the provider/store but never
embeds — so the first HTTP /search paid the full cold model load + ONNX JIT.
run_ui() already calls cr.warm(); do the same here so the first request is fast.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Y1DfHPqxHppXF6zEYgFKi3
@Neverdecel Neverdecel force-pushed the claude/retrieval-timing-breakdown branch from e090396 to 3dd014d Compare June 19, 2026 10:36
@Neverdecel Neverdecel merged commit 895f9d4 into master Jun 19, 2026
13 checks passed
Neverdecel added a commit that referenced this pull request Jun 19, 2026
…61)

warm() ran status() + embed_query(), so the store's vector/FTS/scalar indexes and LanceDB's query path stayed cold until the first real query — which then paid the full index-load cost (visible via #60's badge as a large store_ms: embed 26ms vs store 363ms over 548 chunks on the demo).

Run one representative search() in warm() so the retrieval indexes are resident before the first user query. Local repro (~550 chunks): first-query store drops from ~35ms to ~14ms. Best-effort and guarded so warm-up can't block startup.
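A best-effort warm-up of this shape might look like the sketch below. It is only an illustration: the callables passed in, the probe string, and the logger name are hypothetical, and "guarded" is interpreted here as catching exceptions from the warm-up search. The actual commit may guard differently.

```python
import logging
from typing import Callable

logger = logging.getLogger("warmup")  # hypothetical logger name


def warm(
    status: Callable[[], object],
    embed_query: Callable[[str], object],
    search: Callable[[str], object],
    probe: str = "warm-up query",  # assumed representative query
) -> bool:
    """Run status + embed as before, then one search so retrieval indexes load early."""
    status()
    embed_query(probe)
    try:
        # One real search pulls the vector/FTS/scalar indexes into memory
        # before the first user query.
        search(probe)
    except Exception:
        # Best-effort: a failed warm-up search is logged, never raised.
        logger.warning("warm-up search failed; continuing startup", exc_info=True)
        return False
    return True
```

Returning a flag rather than raising keeps startup unaffected while still letting the caller report whether the indexes were warmed.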