feat(indexing): surface live progress and reframe large-tree guidance by Neverdecel · Pull Request #55 · Neverdecel/CodeRAG

Neverdecel · 2026-06-18T17:23:49Z

Make a long index legible instead of a silent wait, and stop discouraging
large workspaces — they are a supported use case, not a footgun.

indexer: add _ProgressReporter (stderr, tty/non-tty, throttled) narrating
the discovery walk and embedding pass, so a big index never sits at a
silent "0%".
cli: coderag index prints a prep line + elapsed time.
mcp_server: _notify() emits startup / background-index / ready lifecycle
lines on stderr (stdout stays the stdio MCP channel).
install wizard: reframe the broad-root message from "don't" to "large but
supported" (background indexing, takes longer, skips VCS/build/deps), keep
only the genuine "/" pseudo-filesystem caveat; neutralize the prompt label.
default_workspace(): git-root default for the common case.

Co-Authored-By: Claude Opus 4.8 noreply@anthropic.com
Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

Make a long index legible instead of a silent wait, and stop discouraging large workspaces — they are a supported use case, not a footgun. - indexer: add _ProgressReporter (stderr, tty/non-tty, throttled) narrating the discovery walk and embedding pass, so a big index never sits at a silent "0%". - cli: `coderag index` prints a prep line + elapsed time. - mcp_server: _notify() emits startup / background-index / ready lifecycle lines on stderr (stdout stays the stdio MCP channel). - install wizard: reframe the broad-root message from "don't" to "large but supported" (background indexing, takes longer, skips VCS/build/deps), keep only the genuine "/" pseudo-filesystem caveat; neutralize the prompt label. - default_workspace(): git-root default for the common case. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

…e detection The connection runs in autocommit, so the old per-chunk INSERT loop fsynced once per chunk — ~1.25M fsyncs for a 125k-file tree, the dominant write cost. - sqlite_store: add a `transaction()` context manager; the indexer now wraps a file's delete+upsert+add in ONE commit (one fsync/file) and the delete-before-add is atomic. Add throughput PRAGMAs (cache_size, mmap_size, temp_store=MEMORY, busy_timeout). - schema: add a nullable `size` column (SCHEMA_VERSION 2) with an idempotent ALTER-TABLE migration in bootstrap() for legacy stores. - indexer: cheap (size, mtime) stat check skips the read+hash of unchanged files on re-index (hash stays the authority on content change); carry the existing file id so _write drops its second get_file; mutate FAISS only after the store commit (consistent on interrupt, self-heals on reopen). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

…le ignores At /home scale, dependency and cache directories are the bulk of the file count. Prune them by default so a broad index covers real source. - config: expand DEFAULT_IGNORE_GLOBS with caches/build/dep dirs (site-packages, .cache, .local, .npm, .cargo, target, vendor, .gradle, .terraform, .idea/.vscode, …); each `<name>/*` prunes that dir anywhere. - config: CODERAG_IGNORE_GLOBS env *appends* extra excludes to the defaults. - config: add `use_gitignore` (CODERAG_GITIGNORE, default on) — wired up next. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

Honor repo .gitignore files (nested, with negation and dir-only rules) so a large tree drops build/output the repo already excludes — on top of the built-in dependency/cache excludes. - _ignore: add walk_files(), the single ignore-aware enumerator, plus a pathspec-backed nested .gitignore matcher (nearest-rule-wins via the tri-state check_file). Prunes ignored dirs before descending. - indexer._walk and fs_search now both go through walk_files, so semantic search and exact search see exactly the same workspace (no divergence). - config: `use_gitignore` (default on); cli: `--gitignore/--no-gitignore`. - deps: add pathspec (>=0.12,<2). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

Embedding is the dominant first-index cost; give it the hardware. fastembed now runs on a GPU when one is present and uses the configured batch size. - config: add embed_device (auto|cpu|cuda, CODERAG_EMBED_DEVICE) and embed_threads (CODERAG_EMBED_THREADS). - fastembed provider: select ONNX providers for the device (auto probes onnxruntime for CUDA), with a graceful CPU fallback if GPU init fails; pass embed_batch_size through to passage_embed (it was unused before). - pyproject: optional `gpu` extra (onnxruntime-gpu) for NVIDIA boxes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

…o beat) Adds scripts/bench_store.py: generates a synthetic tree (with junk dirs and a .gitignore) and indexes it with the fake provider to measure the store/pipeline path without model/network cost — first-index throughput, peak RSS, on-disk size, and incremental re-index time. Establishes the SQLite+FAISS baseline for the LanceDB investigation. Measured (fake provider): ~600 files/s store throughput, junk correctly excluded, and incremental re-index of a few-thousand-file tree in well under a second (stat-skip + per-file transactions). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

…te+FAISS Investigates replacing SQLite+FAISS with one embedded vector DB (the user's ask). LanceDB is the only embedded, OSS engine that scales to millions and unifies vector ANN + BM25 in a single store; sqlite-vec/Qdrant-local/DuckDB-VSS are disqualified (see docs/research/lancedb-spike.md). - coderag/store/lance_store.py: self-contained LanceStore (buffered/batched writes, cosine vector search, Tantivy BM25, RRF fusion → SearchHit). NOT wired into production — it exists for the head-to-head only. - scripts/bench_lance.py: throughput/disk/latency bake-off (offline, fake provider). Result: LanceDB indexes ~3x faster in ~2x less disk, latency comparable; per-file appends are pathological so writes must be batched. - scripts/eval_lance.py: retrieval-quality bake-off via the eval harness (needs a real embedding model — the adoption gate; blocked by egress here). - tests/test_lance_store.py: importorskip(lancedb) so CI stays green. - pyproject: optional `lance` extra; docs/research/lancedb-spike.md writeup. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

LanceDB now holds chunk metadata, text (BM25), and vectors (ANN) in one embedded store — removing the SQLite source-of-truth + separate FAISS cache and all the hand-rolled FAISS<->SQLite consistency, IVF-training, and rebuild code. - store: production LanceStore (two tables: files for change-detection, chunks for BM25+vectors). Buffered batched writes (per-file appends are pathological for a columnar store), delete-before-add via write_file(replace=), cosine vector search, Tantivy BM25, RRF fusion, symbol index (gen-cached), stats, bootstrap (clear-on-model-change via a meta.json), optimize (FTS + vector ANN above a row threshold; brute-force below). - engine: api/indexer/search/webui go through cr.store only (cr.vectors gone). Indexer preloads file metadata in one scan, keeps the stat(size+mtime) skip, and only rebuilds indexes (optimize) on a full pass that changed something — single-file/watcher edits just flush (searchable via the unindexed tail scan). - config: drop index_type/ivf_* and db_path/faiss_path (FAISS-specific). - deps: lancedb + pylance + pyarrow are core; faiss-cpu and the lance extra removed. tests: delete the SQLite/FAISS-specific test_store.py, retarget the store-invariant assertions, expand test_lance_store.py (now the primary store suite). Remove the now-obsolete two-backend bake-off scripts. End-to-end (fake provider, integrated): ~750 files/s indexing, junk filtered, incremental re-index correct. CI gates (ruff/mypy/pytest -m "not integration") green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

Sweep stale SQLite/FAISS/IVF references now that LanceDB is the single store. - README, AGENTS.md, DEVELOPMENT.md: rewrite the architecture/invariants to the one-LanceDB-store model; drop the FAISS/Flat→IVF diagram and CODERAG_INDEX_TYPE / CODERAG_IVF_THRESHOLD from the env tables. - example.env: remove the dead index-type/IVF block. - docs/configuration.md: store-dir/env wording. - deploy (helm + README): drop indexType/ivfThreshold values, schema, and configmap env; reword the persistence/single-writer notes; Chart keyword faiss → lancedb. - eval datasets: repoint the deleted store files to coderag/store/lance_store.py. - docs/research/lancedb-spike.md: mark adopted. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

- bench_store.py: measure the whole store dir for on-disk size (the old coderag.db/index.faiss paths no longer exist under LanceDB). - coderag/__init__.py: comment wording (lazy import keeps lancedb/fastembed out). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

codecov-commenter · 2026-06-18T17:24:55Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 89.84375% with 52 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
coderag/store/lance_store.py	94.50%	15 Missing ⚠️
coderag/embeddings/fastembed_provider.py	58.33%	10 Missing ⚠️
coderag/surfaces/cli.py	42.85%	8 Missing ⚠️
coderag/surfaces/mcp_server.py	41.66%	7 Missing ⚠️
coderag/_ignore.py	91.30%	6 Missing ⚠️
coderag/indexer.py	94.66%	4 Missing ⚠️
coderag/embeddings/__init__.py	0.00%	1 Missing ⚠️
coderag/surfaces/webui.py	0.00%	1 Missing ⚠️

📢 Thoughts on this report? Let us know!

…ive) bandit flagged the "/tmp" path literal in the install wizard's broad-root warn-list as hardcoded_tmp_directory (B108, Medium), reddening the Security workflow. It's a path name, not temp-file usage. Remove it — "/tmp" isn't a meaningful workspace root to warn about; "/", $HOME, /home, /usr, … still do. `bandit -r coderag -ll` now clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

claude added 11 commits June 18, 2026 15:15

docs: bench_store docstring reflects LanceDB backend

ad609a1

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01PMicd6oQ89MDbp76KR2Hg7

Neverdecel merged commit 16b4658 into master Jun 18, 2026
14 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(indexing): surface live progress and reframe large-tree guidance#55

feat(indexing): surface live progress and reframe large-tree guidance#55
Neverdecel merged 12 commits into
masterfrom
claude/pensive-heisenberg-iniu7p

Neverdecel commented Jun 18, 2026

Uh oh!

codecov-commenter commented Jun 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Neverdecel commented Jun 18, 2026

Uh oh!

codecov-commenter commented Jun 18, 2026

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants