feat(indexing): surface live progress and reframe large-tree guidance #55
Merged
Make a long index legible instead of a silent wait, and stop discouraging large workspaces: they are a supported use case, not a footgun.

- indexer: add _ProgressReporter (stderr, tty/non-tty, throttled) narrating the discovery walk and embedding pass, so a big index never sits at a silent "0%".
- cli: `coderag index` prints a prep line + elapsed time.
- mcp_server: _notify() emits startup / background-index / ready lifecycle lines on stderr (stdout stays the stdio MCP channel).
- install wizard: reframe the broad-root message from "don't" to "large but supported" (background indexing, takes longer, skips VCS/build/deps), keep only the genuine "/" pseudo-filesystem caveat; neutralize the prompt label.
- default_workspace(): git-root default for the common case.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
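A throttled stderr reporter of the kind described above could look roughly like the following. This is a minimal sketch, not the PR's `_ProgressReporter`: the class name, the `update()` signature, the `min_interval` default, and the line format are all assumptions made for illustration.

```python
import sys
import time


class ProgressReporter:
    """Hypothetical throttled reporter; names and format are illustrative."""

    def __init__(self, stream=None, min_interval=0.5, clock=time.monotonic):
        self.stream = stream if stream is not None else sys.stderr
        self.min_interval = min_interval
        self.clock = clock
        self._last_emit = None
        # On a terminal, redraw one line in place; otherwise append log lines.
        self._is_tty = bool(getattr(self.stream, "isatty", lambda: False)())

    def update(self, phase, done, total=None, force=False):
        """Write a progress line unless one was written too recently."""
        now = self.clock()
        recent = self._last_emit is not None and now - self._last_emit < self.min_interval
        if recent and not force:
            return False
        self._last_emit = now
        count = f"{done}/{total}" if total else f"{done}"
        line = f"[{phase}] {count} files"
        if self._is_tty:
            self.stream.write("\r" + line)
        else:
            self.stream.write(line + "\n")
        self.stream.flush()
        return True
```

Writing to stderr keeps stdout free for the stdio MCP channel, and the throttle bounds the number of lines a non-tty log collects during a long walk.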
…e detection

The connection runs in autocommit, so the old per-chunk INSERT loop fsynced once per chunk: ~1.25M fsyncs for a 125k-file tree, the dominant write cost.

- sqlite_store: add a `transaction()` context manager; the indexer now wraps a file's delete+upsert+add in ONE commit (one fsync/file) and the delete-before-add is atomic. Add throughput PRAGMAs (cache_size, mmap_size, temp_store=MEMORY, busy_timeout).
- schema: add a nullable `size` column (SCHEMA_VERSION 2) with an idempotent ALTER-TABLE migration in bootstrap() for legacy stores.
- indexer: cheap (size, mtime) stat check skips the read+hash of unchanged files on re-index (hash stays the authority on content change); carry the existing file id so _write drops its second get_file; mutate FAISS only after the store commit (consistent on interrupt, self-heals on reopen).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
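The one-commit-per-file idea can be sketched with the standard `sqlite3` module. The `transaction()` name comes from the commit message, but this body is an assumption about how such a helper might work, and the `chunks` table, its columns, and `replace_chunks()` are hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def transaction(conn):
    """Group statements into one commit on an autocommit connection."""
    conn.execute("BEGIN")
    try:
        yield conn
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")


def replace_chunks(conn, path, chunks):
    """Delete a file's old chunks and add the new ones atomically."""
    with transaction(conn):
        conn.execute("DELETE FROM chunks WHERE path = ?", (path,))
        conn.executemany(
            "INSERT INTO chunks (path, body) VALUES (?, ?)",
            [(path, body) for body in chunks],
        )


# isolation_level=None puts the connection in autocommit mode.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE chunks (path TEXT, body TEXT)")  # hypothetical schema
replace_chunks(conn, "a.py", ["one", "two"])
replace_chunks(conn, "a.py", ["three"])
```

Because the delete and the inserts share one transaction, an interrupt between them rolls back to the previous chunks rather than leaving the file with none.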
…le ignores

At /home scale, dependency and cache directories are the bulk of the file count. Prune them by default so a broad index covers real source.

- config: expand DEFAULT_IGNORE_GLOBS with caches/build/dep dirs (site-packages, .cache, .local, .npm, .cargo, target, vendor, .gradle, .terraform, .idea/.vscode, …); each `<name>/*` prunes that dir anywhere.
- config: CODERAG_IGNORE_GLOBS env *appends* extra excludes to the defaults.
- config: add `use_gitignore` (CODERAG_GITIGNORE, default on); wired up next.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
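As an illustration of append-to-defaults plus pruning before descent, here is a stdlib-only sketch. The glob tuple is a small subset of the names listed above, the comma-separated format of `CODERAG_IGNORE_GLOBS` is an assumption, and `effective_globs()`, `pruned_dir_names()`, and `walk_source_files()` are hypothetical helpers, not the project's functions.

```python
import fnmatch
import os

# Illustrative subset only; the real DEFAULT_IGNORE_GLOBS is longer.
DEFAULT_IGNORE_GLOBS = ("site-packages/*", ".cache/*", "target/*", "vendor/*")


def effective_globs(environ=None):
    """Defaults plus extras from CODERAG_IGNORE_GLOBS (assumed comma-separated)."""
    environ = os.environ if environ is None else environ
    raw = environ.get("CODERAG_IGNORE_GLOBS", "")
    extras = tuple(g.strip() for g in raw.split(",") if g.strip())
    return DEFAULT_IGNORE_GLOBS + extras


def pruned_dir_names(globs):
    """Directory names named by '<name>/*' patterns."""
    return {g[:-2] for g in globs if g.endswith("/*")}


def walk_source_files(root, globs):
    """Yield files under root, skipping pruned dirs and ignored file names."""
    pruned = pruned_dir_names(globs)
    file_globs = [g for g in globs if not g.endswith("/*")]
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = sorted(d for d in dirnames if d not in pruned)
        for name in sorted(filenames):
            if not any(fnmatch.fnmatch(name, g) for g in file_globs):
                yield os.path.join(dirpath, name)
```

Pruning at the directory level matters at this scale: an ignored dependency tree is never listed at all, rather than being walked and filtered file by file.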
Honor repo .gitignore files (nested, with negation and dir-only rules) so a large tree drops build/output the repo already excludes, on top of the built-in dependency/cache excludes.

- _ignore: add walk_files(), the single ignore-aware enumerator, plus a pathspec-backed nested .gitignore matcher (nearest-rule-wins via the tri-state check_file). Prunes ignored dirs before descending.
- indexer._walk and fs_search now both go through walk_files, so semantic search and exact search see exactly the same workspace (no divergence).
- config: `use_gitignore` (default on); cli: `--gitignore/--no-gitignore`.
- deps: add pathspec (>=0.12,<2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
Embedding is the dominant first-index cost; give it the hardware. fastembed now runs on a GPU when one is present and uses the configured batch size.

- config: add embed_device (auto|cpu|cuda, CODERAG_EMBED_DEVICE) and embed_threads (CODERAG_EMBED_THREADS).
- fastembed provider: select ONNX providers for the device (auto probes onnxruntime for CUDA), with a graceful CPU fallback if GPU init fails; pass embed_batch_size through to passage_embed (it was unused before).
- pyproject: optional `gpu` extra (onnxruntime-gpu) for NVIDIA boxes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
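The device-to-provider mapping might be sketched as below. `onnxruntime.get_available_providers()` and the `CUDAExecutionProvider` / `CPUExecutionProvider` names are real ONNX Runtime identifiers; `available_providers()` and `select_providers()` are hypothetical helpers, and how the PR actually handles a failed GPU init is not shown here.

```python
def available_providers():
    """Return onnxruntime's providers, or CPU only if it is not installed."""
    try:
        import onnxruntime
    except ImportError:
        return ["CPUExecutionProvider"]
    return list(onnxruntime.get_available_providers())


def select_providers(device="auto", available=None):
    """Map a device setting (auto|cpu|cuda) to an ordered ONNX provider list."""
    if device not in ("auto", "cpu", "cuda"):
        raise ValueError(f"unknown embed device: {device!r}")
    if available is None:
        available = available_providers()
    has_cuda = "CUDAExecutionProvider" in available
    if device == "cpu" or (device == "auto" and not has_cuda):
        return ["CPUExecutionProvider"]
    # CPU is listed last so it remains available as the fallback provider.
    return ["CUDAExecutionProvider", "CPUExecutionProvider"]
```

Passing `available` explicitly keeps the selection logic testable on machines without a GPU or without onnxruntime installed.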
…o beat)

Adds scripts/bench_store.py: generates a synthetic tree (with junk dirs and a .gitignore) and indexes it with the fake provider to measure the store/pipeline path without model/network cost: first-index throughput, peak RSS, on-disk size, and incremental re-index time. Establishes the SQLite+FAISS baseline for the LanceDB investigation.

Measured (fake provider): ~600 files/s store throughput, junk correctly excluded, and incremental re-index of a few-thousand-file tree in well under a second (stat-skip + per-file transactions).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
…te+FAISS

Investigates replacing SQLite+FAISS with one embedded vector DB (the user's ask). LanceDB is the only embedded, OSS engine that scales to millions and unifies vector ANN + BM25 in a single store; sqlite-vec/Qdrant-local/DuckDB-VSS are disqualified (see docs/research/lancedb-spike.md).

- coderag/store/lance_store.py: self-contained LanceStore (buffered/batched writes, cosine vector search, Tantivy BM25, RRF fusion → SearchHit). NOT wired into production; it exists for the head-to-head only.
- scripts/bench_lance.py: throughput/disk/latency bake-off (offline, fake provider). Result: LanceDB indexes ~3x faster in ~2x less disk, latency comparable; per-file appends are pathological so writes must be batched.
- scripts/eval_lance.py: retrieval-quality bake-off via the eval harness (needs a real embedding model, which is the adoption gate; blocked by egress here).
- tests/test_lance_store.py: importorskip(lancedb) so CI stays green.
- pyproject: optional `lance` extra; docs/research/lancedb-spike.md writeup.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
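For readers unfamiliar with RRF (reciprocal rank fusion), the general technique merges the vector and BM25 result lists by summing 1 / (k + rank) for each document across lists. The sketch below shows that formula only; `rrf_fuse()` is a hypothetical function, and k = 60 is the commonly used constant, not necessarily the value LanceStore uses.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["a", "b", "c"]  # ids ordered by cosine similarity
bm25_hits = ["b", "d", "a"]    # ids ordered by BM25 score
fused = rrf_fuse([vector_hits, bm25_hits])
```

RRF uses only ranks, so cosine distances and BM25 scores never need to be put on a common scale.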
LanceDB now holds chunk metadata, text (BM25), and vectors (ANN) in one embedded store, removing the SQLite source-of-truth + separate FAISS cache and all the hand-rolled FAISS<->SQLite consistency, IVF-training, and rebuild code.

- store: production LanceStore (two tables: files for change-detection, chunks for BM25+vectors). Buffered batched writes (per-file appends are pathological for a columnar store), delete-before-add via write_file(replace=), cosine vector search, Tantivy BM25, RRF fusion, symbol index (gen-cached), stats, bootstrap (clear-on-model-change via a meta.json), optimize (FTS + vector ANN above a row threshold; brute-force below).
- engine: api/indexer/search/webui go through cr.store only (cr.vectors gone). Indexer preloads file metadata in one scan, keeps the stat(size+mtime) skip, and only rebuilds indexes (optimize) on a full pass that changed something; single-file/watcher edits just flush (searchable via the unindexed tail scan).
- config: drop index_type/ivf_* and db_path/faiss_path (FAISS-specific).
- deps: lancedb + pylance + pyarrow are core; faiss-cpu and the lance extra removed.
- tests: delete the SQLite/FAISS-specific test_store.py, retarget the store-invariant assertions, expand test_lance_store.py (now the primary store suite). Remove the now-obsolete two-backend bake-off scripts.

End-to-end (fake provider, integrated): ~750 files/s indexing, junk filtered, incremental re-index correct. CI gates (ruff/mypy/pytest -m "not integration") green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
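The buffered-write pattern mentioned above (collect rows across files, append in batches) can be sketched independently of LanceDB. `BufferedWriter`, its `sink` callable, and the `batch_rows` default are hypothetical; in a real store the sink would be whatever call appends a batch of rows to the table.

```python
class BufferedWriter:
    """Hypothetical write buffer: collect rows, hand them to `sink` in batches."""

    def __init__(self, sink, batch_rows=1000):
        self.sink = sink
        self.batch_rows = batch_rows
        self._rows = []

    def add(self, rows):
        """Queue one file's rows; flush once the buffer is large enough."""
        self._rows.extend(rows)
        if len(self._rows) >= self.batch_rows:
            self.flush()

    def flush(self):
        """Send any pending rows as a single batch."""
        if self._rows:
            self.sink(self._rows)
            self._rows = []


batches = []
writer = BufferedWriter(batches.append, batch_rows=3)
writer.add([1, 2])   # below the threshold, stays buffered
writer.add([3, 4])   # crosses the threshold, one batch of four rows
writer.add([5])
writer.flush()       # end of pass: push the remainder
```

An explicit `flush()` at the end of a pass, or after a single-file edit, keeps buffered rows from being lost or left unsearchable.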
Sweep stale SQLite/FAISS/IVF references now that LanceDB is the single store.

- README, AGENTS.md, DEVELOPMENT.md: rewrite the architecture/invariants to the one-LanceDB-store model; drop the FAISS/Flat→IVF diagram and CODERAG_INDEX_TYPE / CODERAG_IVF_THRESHOLD from the env tables.
- example.env: remove the dead index-type/IVF block.
- docs/configuration.md: store-dir/env wording.
- deploy (helm + README): drop indexType/ivfThreshold values, schema, and configmap env; reword the persistence/single-writer notes; Chart keyword faiss → lancedb.
- eval datasets: repoint the deleted store files to coderag/store/lance_store.py.
- docs/research/lancedb-spike.md: mark adopted.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
- bench_store.py: measure the whole store dir for on-disk size (the old coderag.db/index.faiss paths no longer exist under LanceDB).
- coderag/__init__.py: comment wording (lazy import keeps lancedb/fastembed out).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7
…ive)

bandit flagged the "/tmp" path literal in the install wizard's broad-root warn-list as hardcoded_tmp_directory (B108, Medium), reddening the Security workflow. It's a path name, not temp-file usage. Remove it: "/tmp" isn't a meaningful workspace root to warn about; "/", $HOME, /home, /usr, … still trigger the warning. `bandit -r coderag -ll` is now clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7