feat(indexing): surface live progress and reframe large-tree guidance #55

Merged
Neverdecel merged 12 commits into master from claude/pensive-heisenberg-iniu7p on Jun 18, 2026

@Neverdecel

Owner

Make a long index legible instead of a silent wait, and stop discouraging
large workspaces — they are a supported use case, not a footgun.

  • indexer: add _ProgressReporter (stderr, tty/non-tty, throttled) narrating
    the discovery walk and embedding pass, so a big index never sits at a
    silent "0%".
  • cli: `coderag index` prints a prep line + elapsed time.
  • mcp_server: _notify() emits startup / background-index / ready lifecycle
    lines on stderr (stdout stays the stdio MCP channel).
  • install wizard: reframe the broad-root message from "don't" to "large but
    supported" (background indexing, takes longer, skips VCS/build/deps), keep
    only the genuine "/" pseudo-filesystem caveat; neutralize the prompt label.
  • default_workspace(): git-root default for the common case.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
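For readers who want a picture of the throttled, tty-aware reporting described above, here is a minimal sketch. It is not the PR's `_ProgressReporter`: the class name `ProgressReporter`, the `min_interval` and `clock` parameters, and the line format are all assumptions made for illustration.

```python
import sys
import time
from typing import Callable, Optional, TextIO


class ProgressReporter:
    """Hypothetical throttled progress reporter (stderr by default)."""

    def __init__(
        self,
        stream: TextIO = sys.stderr,
        min_interval: float = 0.5,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.stream = stream
        self.min_interval = min_interval
        self.clock = clock
        self.is_tty = hasattr(stream, "isatty") and stream.isatty()
        self._last_emit: Optional[float] = None

    def update(self, phase: str, done: int, total: Optional[int] = None) -> bool:
        """Write a progress line unless one was written too recently.

        Returns True when a line was written.
        """
        now = self.clock()
        if self._last_emit is not None and now - self._last_emit < self.min_interval:
            return False
        self._last_emit = now
        count = f"{done}/{total}" if total is not None else str(done)
        line = f"[{phase}] {count} files"
        if self.is_tty:
            # On a terminal, redraw in place: carriage return, then erase to end of line.
            self.stream.write(f"\r{line}\x1b[K")
        else:
            # Pipes and log files get one plain line per update.
            self.stream.write(line + "\n")
        self.stream.flush()
        return True

    def finish(self, message: str) -> None:
        """Always print the final line, bypassing the throttle."""
        prefix = "\r\x1b[K" if self.is_tty else ""
        self.stream.write(f"{prefix}{message}\n")
        self.stream.flush()
```

Writing to stderr keeps stdout free, which matters for the MCP server because stdout is the stdio protocol channel.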

claude added 11 commits June 18, 2026 15:15
…e detection

The connection runs in autocommit, so the old per-chunk INSERT loop fsynced
once per chunk — ~1.25M fsyncs for a 125k-file tree, the dominant write cost.

- sqlite_store: add a `transaction()` context manager; the indexer now wraps a
  file's delete+upsert+add in ONE commit (one fsync/file) and the
  delete-before-add is atomic. Add throughput PRAGMAs (cache_size, mmap_size,
  temp_store=MEMORY, busy_timeout).
- schema: add a nullable `size` column (SCHEMA_VERSION 2) with an idempotent
  ALTER-TABLE migration in bootstrap() for legacy stores.
- indexer: cheap (size, mtime) stat check skips the read+hash of unchanged
  files on re-index (hash stays the authority on content change); carry the
  existing file id so _write drops its second get_file; mutate FAISS only
  after the store commit (consistent on interrupt, self-heals on reopen).
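The stat check in the last bullet can be sketched as follows. `FileRecord` and `needs_reindex` are hypothetical names, and SHA-256 is an assumption; the commit does not say which hash the indexer uses.

```python
import hashlib
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileRecord:
    """Hypothetical stored metadata for one indexed file."""

    size: Optional[int]  # None models a legacy row written before the size column existed
    mtime: float
    sha256: str


def needs_reindex(path: str, known: Optional[FileRecord]) -> bool:
    """Return True when the file must be re-chunked and re-embedded."""
    st = os.stat(path)
    if known is not None and known.size == st.st_size and known.mtime == st.st_mtime:
        # Cheap path: size and mtime both match, so skip the read and the hash.
        return False
    with open(path, "rb") as fh:
        digest = hashlib.sha256(fh.read()).hexdigest()
    # The hash stays the authority: a touched but identical file is still unchanged.
    return known is None or known.sha256 != digest
```

A legacy row with `size=None` never matches the stat check, so it falls through to the hash comparison rather than being skipped by mistake.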

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
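The figures in this commit work out to roughly 10 chunks per file (1.25M / 125k), so one commit per file should cut the commit count by about an order of magnitude. A minimal sketch of a `transaction()` context manager on an autocommit `sqlite3` connection follows; the `chunks` table and the `replace_chunks` helper are hypothetical, not the project's schema.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator, Sequence


@contextmanager
def transaction(conn: sqlite3.Connection) -> Iterator[sqlite3.Connection]:
    """Group statements on an autocommit connection into a single commit."""
    conn.execute("BEGIN")
    try:
        yield conn
    except BaseException:
        # Undo the whole group so a failed replace leaves the old rows in place.
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")


def replace_chunks(conn: sqlite3.Connection, path: str, chunks: Sequence[str]) -> None:
    """Delete a file's old chunks and add the new ones atomically."""
    with transaction(conn):
        conn.execute("DELETE FROM chunks WHERE path = ?", (path,))
        conn.executemany(
            "INSERT INTO chunks (path, text) VALUES (?, ?)",
            [(path, text) for text in chunks],
        )
```

Usage, assuming the connection was opened with `isolation_level=None` (the `sqlite3` module's autocommit mode): `replace_chunks(conn, "a.py", ["chunk one", "chunk two"])`.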
…le ignores

At /home scale, dependency and cache directories are the bulk of the file
count. Prune them by default so a broad index covers real source.

- config: expand DEFAULT_IGNORE_GLOBS with caches/build/dep dirs
  (site-packages, .cache, .local, .npm, .cargo, target, vendor, .gradle,
  .terraform, .idea/.vscode, …); each `<name>/*` prunes that dir anywhere.
- config: CODERAG_IGNORE_GLOBS env *appends* extra excludes to the defaults.
- config: add `use_gitignore` (CODERAG_GITIGNORE, default on) — wired up next.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
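A rough sketch of the append semantics and the `<name>/*` pruning rule described in this commit. The short `DEFAULT_IGNORE_GLOBS` tuple is a subset chosen for illustration, the comma separator for `CODERAG_IGNORE_GLOBS` is an assumption, and both function names are hypothetical.

```python
import fnmatch
import os
from typing import Mapping, Sequence, Tuple

# Illustrative subset only; the real default list is longer.
DEFAULT_IGNORE_GLOBS: Tuple[str, ...] = (
    "site-packages/*",
    ".cache/*",
    "target/*",
    "vendor/*",
)


def effective_ignore_globs(env: Mapping[str, str] = os.environ) -> Tuple[str, ...]:
    """Defaults plus any extra excludes from the environment (appended, never replaced)."""
    extra = env.get("CODERAG_IGNORE_GLOBS", "")
    # Assumption: entries are comma-separated; the real separator may differ.
    additions = tuple(g.strip() for g in extra.split(",") if g.strip())
    return DEFAULT_IGNORE_GLOBS + additions


def is_pruned_dir(rel_dir: str, globs: Sequence[str]) -> bool:
    """True when any path component matches the <name> of a `<name>/*` glob."""
    names = [g[:-2] for g in globs if g.endswith("/*")]
    parts = rel_dir.replace(os.sep, "/").split("/")
    return any(fnmatch.fnmatch(part, name) for part in parts for name in names)
```

Matching on individual path components is what lets a single `vendor/*` entry prune a `vendor` directory at any depth.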
Honor repo .gitignore files (nested, with negation and dir-only rules) so a
large tree drops build/output the repo already excludes — on top of the
built-in dependency/cache excludes.

- _ignore: add walk_files(), the single ignore-aware enumerator, plus a
  pathspec-backed nested .gitignore matcher (nearest-rule-wins via the
  tri-state check_file). Prunes ignored dirs before descending.
- indexer._walk and fs_search now both go through walk_files, so semantic
  search and exact search see exactly the same workspace (no divergence).
- config: `use_gitignore` (default on); cli: `--gitignore/--no-gitignore`.
- deps: add pathspec (>=0.12,<2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
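To illustrate the nearest-rule-wins, tri-state idea, here is a deliberately simplified stdlib sketch. It is not the pathspec-backed matcher from this commit: it matches basenames with `fnmatch` only and ignores most gitignore syntax (anchored patterns, `**`, escapes). The names `check` and `is_ignored` are hypothetical.

```python
import fnmatch
from typing import Optional, Sequence


def check(rules: Sequence[str], name: str, is_dir: bool) -> Optional[bool]:
    """Verdict from one .gitignore: True=ignored, False=re-included, None=no rule matched."""
    verdict: Optional[bool] = None
    for rule in rules:
        negated = rule.startswith("!")
        pattern = rule[1:] if negated else rule
        dir_only = pattern.endswith("/")
        pattern = pattern.rstrip("/")
        if dir_only and not is_dir:
            continue  # a trailing slash restricts the rule to directories
        if fnmatch.fnmatch(name, pattern):
            verdict = not negated  # later rules in the same file override earlier ones
    return verdict


def is_ignored(levels: Sequence[Sequence[str]], name: str, is_dir: bool) -> bool:
    """`levels` holds rule lists from the workspace root down to the entry's directory."""
    for rules in reversed(levels):  # consult the nearest .gitignore first
        verdict = check(rules, name, is_dir)
        if verdict is not None:
            return verdict
    return False
```

The `None` state is what makes nesting work: a nested file that says nothing about an entry defers to its parent instead of silently un-ignoring it.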
Embedding is the dominant first-index cost; give it the hardware. fastembed
now runs on a GPU when one is present and uses the configured batch size.

- config: add embed_device (auto|cpu|cuda, CODERAG_EMBED_DEVICE) and
  embed_threads (CODERAG_EMBED_THREADS).
- fastembed provider: select ONNX providers for the device (auto probes
  onnxruntime for CUDA), with a graceful CPU fallback if GPU init fails;
  pass embed_batch_size through to passage_embed (it was unused before).
- pyproject: optional `gpu` extra (onnxruntime-gpu) for NVIDIA boxes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
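One plausible shape for the device-to-provider mapping is sketched below. `select_providers` and `probe_available` are hypothetical names, and the behaviour for an explicit `cuda` request on a machine without CUDA is an assumption; `CUDAExecutionProvider` and `CPUExecutionProvider` are the standard ONNX Runtime provider names.

```python
from typing import List, Sequence


def select_providers(device: str, available: Sequence[str]) -> List[str]:
    """Map a device setting (auto|cpu|cuda) to an ordered ONNX provider list."""
    if device not in ("auto", "cpu", "cuda"):
        raise ValueError(f"unknown embed device: {device!r}")
    if device == "cpu":
        return ["CPUExecutionProvider"]
    if device == "cuda" or "CUDAExecutionProvider" in available:
        # CPU stays last in the list as the fallback if CUDA cannot be used.
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]


def probe_available() -> List[str]:
    """Ask onnxruntime which providers exist; assume CPU-only if it is not installed."""
    try:
        import onnxruntime as ort
    except ImportError:
        return ["CPUExecutionProvider"]
    return list(ort.get_available_providers())
```

Usage: `select_providers("auto", probe_available())` yields a CUDA-first list only when the installed onnxruntime build reports CUDA support.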
…o beat)

Adds scripts/bench_store.py: generates a synthetic tree (with junk dirs and a
.gitignore) and indexes it with the fake provider to measure the store/pipeline
path without model/network cost — first-index throughput, peak RSS, on-disk
size, and incremental re-index time. Establishes the SQLite+FAISS baseline for
the LanceDB investigation.

Measured (fake provider): ~600 files/s store throughput, junk correctly
excluded, and incremental re-index of a few-thousand-file tree in well under a
second (stat-skip + per-file transactions).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

…te+FAISS

Investigates replacing SQLite+FAISS with one embedded vector DB (the user's
ask). LanceDB is the only embedded, OSS engine that scales to millions and
unifies vector ANN + BM25 in a single store; sqlite-vec/Qdrant-local/DuckDB-VSS
are disqualified (see docs/research/lancedb-spike.md).

- coderag/store/lance_store.py: self-contained LanceStore (buffered/batched
  writes, cosine vector search, Tantivy BM25, RRF fusion → SearchHit). NOT
  wired into production — it exists for the head-to-head only.
- scripts/bench_lance.py: throughput/disk/latency bake-off (offline, fake
  provider). Result: LanceDB indexes ~3x faster and uses about half the disk,
  with comparable latency; per-file appends are pathological, so writes must
  be batched.
- scripts/eval_lance.py: retrieval-quality bake-off via the eval harness
  (needs a real embedding model — the adoption gate; blocked by egress here).
- tests/test_lance_store.py: importorskip(lancedb) so CI stays green.
- pyproject: optional `lance` extra; docs/research/lancedb-spike.md writeup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
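For context on the RRF fusion step, a minimal reciprocal rank fusion sketch over two ranked id lists. `rrf_fuse` is a hypothetical name, and `k=60` is the commonly used constant, not necessarily the value LanceStore uses.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists: each list contributes 1 / (k + rank) per document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["a", "b", "c"]  # ids ranked by cosine similarity
bm25_hits = ["b", "d", "a"]    # ids ranked by BM25
fused = rrf_fuse([vector_hits, bm25_hits])
```

RRF uses only ranks, so the cosine and BM25 scores never have to be put on a common scale.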
LanceDB now holds chunk metadata, text (BM25), and vectors (ANN) in one
embedded store — removing the SQLite source-of-truth + separate FAISS cache and
all the hand-rolled FAISS<->SQLite consistency, IVF-training, and rebuild code.

- store: production LanceStore (two tables: files for change-detection, chunks
  for BM25+vectors). Buffered batched writes (per-file appends are pathological
  for a columnar store), delete-before-add via write_file(replace=), cosine
  vector search, Tantivy BM25, RRF fusion, symbol index (gen-cached), stats,
  bootstrap (clear-on-model-change via a meta.json), optimize (FTS + vector ANN
  above a row threshold; brute-force below).
- engine: api/indexer/search/webui go through cr.store only (cr.vectors gone).
  Indexer preloads file metadata in one scan, keeps the stat(size+mtime) skip,
  and only rebuilds indexes (optimize) on a full pass that changed something —
  single-file/watcher edits just flush (searchable via the unindexed tail scan).
- config: drop index_type/ivf_* and db_path/faiss_path (FAISS-specific).
- deps: lancedb + pylance + pyarrow are core; faiss-cpu and the lance extra
  removed. tests: delete the SQLite/FAISS-specific test_store.py, retarget the
  store-invariant assertions, expand test_lance_store.py (now the primary store
  suite). Remove the now-obsolete two-backend bake-off scripts.

End-to-end (fake provider, integrated): ~750 files/s indexing, junk filtered,
incremental re-index correct. CI gates (ruff/mypy/pytest -m "not integration")
green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
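The buffered-write idea can be sketched independently of LanceDB. `BufferedWriter`, `add`, and `max_rows` are hypothetical names; in the real store the flush callback would presumably be a table append, which is not shown here.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class BufferedWriter:
    """Accumulate rows across files and hand them to the store in large batches."""

    def __init__(self, flush_rows: Callable[[List[Row]], None], max_rows: int = 1000) -> None:
        self._flush_rows = flush_rows
        self._max_rows = max_rows
        self._pending: List[Row] = []

    def add(self, rows: List[Row]) -> None:
        """Queue one file's rows; flush only once the buffer reaches max_rows."""
        self._pending.extend(rows)
        if len(self._pending) >= self._max_rows:
            self.flush()

    def flush(self) -> None:
        """Write everything pending as a single batch (no-op when empty)."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._flush_rows(batch)
```

Callers still need an explicit `flush()` at the end of a pass, which matches the commit's note that single-file and watcher edits "just flush".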
Sweep stale SQLite/FAISS/IVF references now that LanceDB is the single store.

- README, AGENTS.md, DEVELOPMENT.md: rewrite the architecture/invariants to the
  one-LanceDB-store model; drop the FAISS/Flat→IVF diagram and CODERAG_INDEX_TYPE
  / CODERAG_IVF_THRESHOLD from the env tables.
- example.env: remove the dead index-type/IVF block.
- docs/configuration.md: store-dir/env wording.
- deploy (helm + README): drop indexType/ivfThreshold values, schema, and
  configmap env; reword the persistence/single-writer notes; Chart keyword
  faiss → lancedb.
- eval datasets: repoint the deleted store files to coderag/store/lance_store.py.
- docs/research/lancedb-spike.md: mark adopted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

- bench_store.py: measure the whole store dir for on-disk size (the old
  coderag.db/index.faiss paths no longer exist under LanceDB).
- coderag/__init__.py: comment wording (lazy import keeps lancedb/fastembed out).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
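Measuring a whole store directory instead of two fixed file paths could look like the sketch below. `dir_size_bytes` is a hypothetical helper, not necessarily what bench_store.py does.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```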
@codecov-commenter

Codecov Report

❌ Patch coverage is 89.84375% with 52 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| coderag/store/lance_store.py | 94.50% | 15 Missing ⚠️ |
| coderag/embeddings/fastembed_provider.py | 58.33% | 10 Missing ⚠️ |
| coderag/surfaces/cli.py | 42.85% | 8 Missing ⚠️ |
| coderag/surfaces/mcp_server.py | 41.66% | 7 Missing ⚠️ |
| coderag/_ignore.py | 91.30% | 6 Missing ⚠️ |
| coderag/indexer.py | 94.66% | 4 Missing ⚠️ |
| coderag/embeddings/__init__.py | 0.00% | 1 Missing ⚠️ |
| coderag/surfaces/webui.py | 0.00% | 1 Missing ⚠️ |

…ive)

bandit flagged the "/tmp" path literal in the install wizard's broad-root
warn-list as hardcoded_tmp_directory (B108, Medium), reddening the Security
workflow. It's a path name, not temp-file usage. Remove it: "/tmp" isn't a
meaningful workspace root to warn about, while "/", $HOME, /home, /usr, …
stay on the warn-list.
`bandit -r coderag -ll` now clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
@Neverdecel Neverdecel merged commit 16b4658 into master Jun 18, 2026
14 checks passed

3 participants