diff --git a/AGENTS.md b/AGENTS.md index 15950e3..24756bb 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -5,7 +5,7 @@ - `coderag/config.py`, `coderag/types.py`: Immutable `Config` and shared dataclasses. - `coderag/embeddings/`: `EmbeddingProvider` protocol + `fastembed` (default), `openai`, `fake`. - `coderag/chunking/`: Symbol-aware chunking (`python_ast.py`, `treesitter.py`, line-window `base.py`). -- `coderag/store/`: `sqlite_store.py` (source of truth + FTS5) and `vector_index.py` (FAISS Flat/IVF cache). +- `coderag/store/`: `lance_store.py` — a single embedded LanceDB store (chunk metadata, BM25, and vectors). - `coderag/retrieval/`: Hybrid dense + BM25 search fused with RRF. - `coderag/indexer.py`, `coderag/watch.py`: Incremental indexing and the debounced watcher. - `coderag/_ignore.py`: Shared ignore-glob matching used by both the indexer and `fs_search`. @@ -29,10 +29,10 @@ - First-party module is `coderag`; surfaces must stay thin — no engine logic in `surfaces/`. ## Architecture Invariants -- SQLite is the source of truth; the FAISS index is a rebuildable cache (`rebuild_from_store`). -- `chunks.id` is the FAISS id and is `AUTOINCREMENT` (ids never reused). -- Incremental indexing is delete-before-add (no duplicate/stale vectors); unchanged files skip via content hash. -- Embedding dimension comes from the provider, not a constant; a model change triggers a rebuild. +- One embedded LanceDB store holds metadata + BM25 + vectors; it's rebuildable by re-indexing from source. +- `chunks.id` is a store-managed integer id used as the fusion/hydrate key. +- Incremental indexing is delete-before-add (no duplicate/stale rows); unchanged files skip via size+mtime then content hash. +- Embedding dimension comes from the provider, not a constant; a model change clears the store for a clean re-index. ## Testing Guidelines - Place tests in `tests/` as `test_*.py`; keep them deterministic and offline (use the `fake` provider fixture). 
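Reviewer note on the revised incremental-indexing invariant ("unchanged files skip via size+mtime then content hash"): the two-stage check can be sketched roughly as below. This is a minimal illustration, not the code in `coderag/indexer.py`; `FileMeta`, `snapshot`, and `is_unchanged` are hypothetical names, and SHA-256 is an assumed hash choice.

```python
import hashlib
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class FileMeta:
    """Hypothetical per-file record; the real store's fields may differ."""

    size: int
    mtime_ns: int
    content_hash: str


def content_hash(path: Path) -> str:
    # Assumed hash function; the project may use a different digest.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(path: Path) -> FileMeta:
    """Record what the indexer would remember about a file after indexing it."""
    st = path.stat()
    return FileMeta(st.st_size, st.st_mtime_ns, content_hash(path))


def is_unchanged(path: Path, known: Optional[FileMeta]) -> bool:
    """Two-stage skip check: cheap stat comparison first, content hash second."""
    if known is None:
        return False  # never indexed before
    st = path.stat()
    if st.st_size == known.size and st.st_mtime_ns == known.mtime_ns:
        return True  # stage 1: stat matches, so the file is not even read
    # Stage 2: stat differs (for example the file was touched), so compare content.
    return content_hash(path) == known.content_hash
```

A caller would re-chunk and re-embed only when `is_unchanged` returns `False`, after first deleting that file's old rows (delete-before-add).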
diff --git a/DEVELOPMENT.md b/DEVELOPMENT.md index 3be3ecc..3106627 100644 --- a/DEVELOPMENT.md +++ b/DEVELOPMENT.md @@ -26,29 +26,28 @@ coderag/ ├── llm.py # Optional streamed LLM answer over retrieved chunks ├── embeddings/ # EmbeddingProvider protocol + fastembed / openai / fake ├── chunking/ # Symbol-aware chunking: python_ast, treesitter, line-window base -├── store/ # SQLite source of truth + pluggable FAISS vector index -│ ├── sqlite_store.py # files/chunks/vectors + FTS5 lexical search -│ └── vector_index.py # FaissVectorIndex: Flat (exact) / IVF (scale) +├── store/ # Single embedded LanceDB store +│ └── lance_store.py # files/chunks + BM25 (FTS) + vectors (ANN) in one place ├── retrieval/ # Hybrid search: dense + BM25, fused with RRF └── surfaces/ # cli.py · http_api.py (FastAPI) · webui.py · mcp_server.py (MCP) ``` ### Design invariants (don't break these) -- **SQLite is the source of truth; FAISS is a rebuildable cache.** Vectors are stored as - BLOBs in SQLite, so `FaissVectorIndex.rebuild_from_store()` can always reconstruct the - index. `ensure_consistent()` does this automatically when counts disagree. -- **`chunks.id` is the FAISS id and is `AUTOINCREMENT`** — ids are never reused, which keeps - a stale cache from resurrecting deleted content. -- **Delete-before-add.** A changed file's old chunks are removed from both SQLite and FAISS - before new ones are added (`Indexer._write`). This is the bug the old `monitor.py` had. +- **One LanceDB store holds everything** (chunk metadata, text/BM25, and vectors/ANN). It is + rebuildable from source: re-indexing recreates it, and a `--full` pass clears and rebuilds. +- **`chunks.id` is a store-managed integer id** used as the fusion/hydrate key; ids are not + reused within a run. +- **Delete-before-add.** A changed file's old rows are removed before new ones are added + (`Indexer._write` → `LanceStore.write_file(replace=True)`), so editing never accumulates + stale or duplicate rows. 
- **The embedding dimension comes from the provider**, never a hard-coded constant. A model - change is detected via `meta.embed_dim` and triggers a clean rebuild. + change is detected via the store's `meta.json` and clears the store for a clean re-index. - **Writes serialize; reads don't block.** All indexing/deletion goes through one lock on the - `CodeRAG` facade (`_index_lock`), and `FaissVectorIndex` guards its own add/remove/search/ - rebuild — so the MCP server's background index and live watcher run safely alongside + `CodeRAG` facade (`_index_lock`); the store buffers writes on the writer and reads query + committed data — so the MCP server's background index and live watcher run safely alongside concurrent agent searches. Indexing may parallelize chunk+embed across `index_workers` - threads, but the SQLite/FAISS writes stay single-writer (`Indexer._write`). + threads, but the store writes stay single-writer (`Indexer._write`). ## Quality gate diff --git a/README.md b/README.md index cfaa8f2..24956a4 100644 --- a/README.md +++ b/README.md @@ -39,7 +39,7 @@ Coding agents like Claude Code and Codex locate code by *running searches* — g repeat — which burns tokens and round-trips and reduces to literal keyword matching. CodeRAG turns the workspace into a **warm, pre-indexed** engine: a single query returns the right functions and files ranked by **meaning *and* keyword**, with exact `path:line` citations. The -embedding model loads once, so each query is one in-process lookup (FAISS + BM25 + fusion), not +embedding model loads once, so each query is one in-process lookup (vector ANN + BM25 + fusion), not a multi-round shell loop — and over MCP (`coderag mcp`, below) it becomes the agent's search tool. **Proof from the eval harness** — this repo's 24 natural-language → file queries (90 files / @@ -72,7 +72,7 @@ and the honest caveats — is in [`docs/eval.md`](docs/eval.md). 
- **Drop-in for AI coding agents — one command.** `coderag install` wires the **MCP server** into **Claude Code**, **Hermes**, and **Codex** (auto-detect or an interactive wizard, idempotent, with backups) so they search a warm, pre-indexed workspace instead of slow grep/glob/read loops — ranked `path:line` results from a single call, index kept live as you edit. Works on a plain file directory too, not just code. - **Measured, not guessed.** A built-in **evaluation harness** (`coderag eval`) scores retrieval quality — recall@k, MRR, nDCG@k at file *or* symbol level — and can mine a benchmark straight from your git history. Every default (1:1 hybrid, reranker opt-in, adaptive fusion off) is the choice the harness validated, including across an external repo. - **Incremental & live.** Content-hashed indexing only re-embeds files that changed; a debounced watcher keeps the index current as you code. No duplicate or stale vectors. -- **Built to scale.** Exact `Flat` search for small repos, automatic switch to approximate `IVF` past a threshold so it stays fast at 100k+ chunks. +- **Built to scale.** An embedded [LanceDB](https://github.com/lancedb/lancedb) store: brute-force exact search for small repos, automatic ANN indexing past a threshold so it stays fast at 100k+ chunks. - **Five surfaces, one engine.** CLI · Python library · HTTP/REST · web UI · MCP server — all thin wrappers over the same `CodeRAG` object. ### ⚡ One line: install + wire into your agent @@ -158,7 +158,7 @@ from coderag import CodeRAG, Config cr = CodeRAG(Config.from_env(watched_dir="/path/to/repo")) cr.index() -for hit in cr.search("how is the FAISS index persisted?"): +for hit in cr.search("how is the vector index persisted?"): print(f"{hit.location} {hit.symbol} (sim={hit.similarity:.2f})") print(hit.text) ``` @@ -197,7 +197,7 @@ See it live (read-only, indexing this repo): ** B[Symbol-aware chunking
ast / tree-sitter] B --> C[Embeddings
fastembed · OpenAI · self-hosted] - C --> D[(SQLite store
chunks + vectors + FTS5)] - D --> E[FAISS index
Flat → IVF] + C --> D[(LanceDB store
chunks + vectors + BM25)] Q[Query] --> F[Dense + BM25] - E --> F D --> F F --> G[Reciprocal Rank Fusion] G --> H[Ranked hits
path:line + score] ``` -- **SQLite is the source of truth** (chunk text, line ranges, symbols, content hashes, and the - raw vectors). The **FAISS index is a rebuildable cache** — it can always be reconstructed - from SQLite, so switching models or index types never corrupts your data. -- Each file's content is **hashed**; unchanged files are skipped on re-index. A changed file's - old chunks are removed from *both* the store and the vector index **before** new ones are - added — so editing never accumulates stale or duplicate vectors. +- **One embedded LanceDB store** holds everything — chunk text, line ranges, symbols, content + hashes, the vectors (ANN), and the BM25 index — so there is no separate cache to keep in + sync. The store is also a rebuildable view of your code: it can always be re-indexed from + source, so switching embedding models never corrupts your data. +- Each file's content is **hashed**; unchanged files are skipped on re-index (a cheap + size+mtime check avoids even reading them). A changed file's old chunks are removed + **before** new ones are added — so editing never accumulates stale or duplicate vectors. ## ⚙️ Configuration @@ -372,7 +371,7 @@ no API key needed for a local server: ollama serve && ollama pull llama3.1 export OPENAI_BASE_URL=http://localhost:11434/v1 # Ollama's OpenAI-compatible endpoint export CODERAG_CHAT_MODEL=llama3.1 -coderag search "how is the FAISS index persisted" --answer # answer written locally +coderag search "how is the vector index persisted" --answer # answer written locally ``` ### Common settings @@ -385,9 +384,7 @@ table is in [`docs/configuration.md`](docs/configuration.md). 
| `CODERAG_PROVIDER` | `fastembed` | Embedding backend: `fastembed` (local) · `openai` (OpenAI API **or** any OpenAI-compatible/local server) · `fake` | | `CODERAG_MODEL` | `BAAI/bge-small-en-v1.5` | Local embedding model (`coderag eval --list-models`) | | `CODERAG_WATCHED_DIR` | cwd | Codebase to index | -| `CODERAG_STORE_DIR` | `./.coderag` | Where the DB + index live | -| `CODERAG_INDEX_TYPE` | `auto` | `auto` · `flat` · `ivf` | -| `CODERAG_IVF_THRESHOLD` | `50000` | Vectors before switching Flat → IVF | +| `CODERAG_STORE_DIR` | `./.coderag` | Where the LanceDB store lives | | `CODERAG_TOP_K` | `8` | Results returned | | `OPENAI_BASE_URL` | – | Point at a self-hosted / local OpenAI-compatible server (Ollama, vLLM, LM Studio, LocalAI) — enables local embeddings **and** local answers | | `OPENAI_API_KEY` | – | OpenAI **cloud** embeddings / answers (optional for a local server) | @@ -431,7 +428,7 @@ Apache License 2.0 — see [LICENSE](LICENSE). ## 🙏 Acknowledgments -[FAISS](https://github.com/facebookresearch/faiss) · [fastembed](https://github.com/qdrant/fastembed) · +[LanceDB](https://github.com/lancedb/lancedb) · [fastembed](https://github.com/qdrant/fastembed) · [tree-sitter](https://tree-sitter.github.io/tree-sitter/) · [FastAPI](https://fastapi.tiangolo.com/) · [Jinja](https://jinja.palletsprojects.com/) · [Pygments](https://pygments.org/) · [watchdog](https://github.com/gorakhargosh/watchdog) diff --git a/coderag/__init__.py b/coderag/__init__.py index 3214f04..69a927c 100644 --- a/coderag/__init__.py +++ b/coderag/__init__.py @@ -18,7 +18,7 @@ if TYPE_CHECKING: # Re-exported lazily at runtime via __getattr__ below (keeps ``import coderag`` - # light — no faiss/fastembed pulled in at import). Declared here only so type + # light — no lancedb/fastembed pulled in at import). Declared here only so type # checkers and static analysis see ``CodeRAG`` as a defined export of __all__. 
from coderag.api import CodeRAG @@ -28,7 +28,7 @@ def __getattr__(name: str) -> object: - # Lazy re-export so ``import coderag`` stays light (no faiss/fastembed at import). + # Lazy re-export so ``import coderag`` stays light (no lancedb/fastembed at import). if name == "CodeRAG": from coderag.api import CodeRAG diff --git a/coderag/_ignore.py b/coderag/_ignore.py index 14314f8..83983ca 100644 --- a/coderag/_ignore.py +++ b/coderag/_ignore.py @@ -1,16 +1,23 @@ -"""Shared ignore-glob matching for indexing and exact filesystem search. +"""Shared file-walking + ignore matching for indexing and exact filesystem search. Both the :class:`~coderag.indexer.Indexer` and the exact filesystem search -(:mod:`coderag.fs_search`) must skip the *same* set of paths — vendored deps, VCS -directories, build output — or the two would disagree about what "the workspace" is. -The matching rule lives here so both callers stay in lock-step instead of each -re-implementing it. +(:mod:`coderag.fs_search`) must enumerate the *same* set of paths — skipping vendored +deps, VCS directories, build output, and (optionally) anything matched by ``.gitignore`` — +or the two would disagree about what "the workspace" is. The single :func:`walk_files` +generator below is the one place that decision is made, so both callers stay in lock-step. 
""" from __future__ import annotations import fnmatch -from typing import Iterable, Set +import logging +import os +from pathlib import Path +from typing import Iterable, Iterator, List, Optional, Set, Tuple + +logger = logging.getLogger(__name__) + +GITIGNORE_FILE = ".gitignore" def ignore_dir_names(ignore_globs: Iterable[str]) -> Set[str]: @@ -33,3 +40,117 @@ def is_ignored(rel: str, ignore_globs: Iterable[str], ignore_dirs: Set[str]) -> if ignore_dirs.intersection(parts): return True return any(fnmatch.fnmatch(rel, g) for g in ignore_globs) + + +def _is_ancestor(base: str, dir_rel: str) -> bool: + """Whether a ``.gitignore`` at ``base`` still applies at ``dir_rel`` (``""`` = root).""" + if base == "": + return True + return dir_rel == base or dir_rel.startswith(base + "/") + + +class _GitignoreMatcher: + """Honor nested ``.gitignore`` files during a top-down walk (nearest rule wins). + + A ``.gitignore`` at directory ``B`` scopes its patterns to paths under ``B``; the + closest file's rules take precedence and may re-include via ``!``. We keep a stack of + ``(base_rel, spec)`` ordered root→leaf, trimmed to the current directory's ancestors as + the (DFS pre-order) walk moves, and test a path nearest-first using pathspec's + tri-state ``check_file`` (ignore / negated-include / no-match). A no-op if pathspec is + somehow unavailable, so indexing never hard-fails on a missing optional dependency. + """ + + def __init__(self) -> None: + try: + from pathspec import GitIgnoreSpec + except ImportError: # pragma: no cover - pathspec is a declared dependency + logger.warning( + "pathspec not installed; .gitignore files will not be honored." 
+ ) + self._spec_cls = None + else: + self._spec_cls = GitIgnoreSpec + self._stack: List[Tuple[str, object]] = [] + + @property + def enabled(self) -> bool: + return self._spec_cls is not None + + def enter(self, dir_rel: str, dir_abs: Path) -> None: + """Refresh the active-rule stack for ``dir_rel`` and load its ``.gitignore``.""" + if self._spec_cls is None: + return + # Drop rules from sibling subtrees we've left; keep only ancestors of dir_rel. + self._stack = [ + (base, spec) for base, spec in self._stack if _is_ancestor(base, dir_rel) + ] + try: + text = (dir_abs / GITIGNORE_FILE).read_text( + encoding="utf-8", errors="replace" + ) + except OSError: + return # no .gitignore here (or unreadable) + self._stack.append((dir_rel, self._spec_cls.from_lines(text.splitlines()))) + + def match(self, rel: str, *, is_dir: bool) -> bool: + """True if ``rel`` (root-relative POSIX) is ignored by the active rules.""" + if not self._stack: + return False + suffix = "/" if is_dir else "" + for base, spec in reversed(self._stack): + sub = rel if base == "" else rel[len(base) + 1 :] + result = spec.check_file(sub + suffix) # type: ignore[attr-defined] + if result.include is not None: + return bool(result.include) + return False + + +def walk_files( + start: Path, + ignore_globs: Iterable[str], + *, + root: Optional[Path] = None, + use_gitignore: bool = True, +) -> Iterator[Tuple[Path, str]]: + """Yield ``(absolute_path, posix_rel)`` for every non-ignored file under ``start``. + + ``rel`` is relative to ``root`` (defaults to ``start``) so every caller shares one + notion of the workspace. Ignored directories are pruned *before descending* (the big + win at ``/home`` scale), honoring ``ignore_globs`` (dir-name prune + path globs) and, + when ``use_gitignore``, nested ``.gitignore`` files. 
+ """ + start = Path(start) + root = Path(root) if root is not None else start + globs = tuple(ignore_globs) + ignore_dirs = ignore_dir_names(globs) + matcher = _GitignoreMatcher() if use_gitignore else None + active = matcher if (matcher is not None and matcher.enabled) else None + + for dirpath, dirnames, filenames in os.walk(start): + d_abs = Path(dirpath) + try: + d_rel = "" if d_abs == root else d_abs.relative_to(root).as_posix() + except ValueError: # pragma: no cover - start outside root + continue + if active is not None: + active.enter(d_rel, d_abs) + + kept: List[str] = [] + for name in dirnames: + if name in ignore_dirs: + continue + rel = name if d_rel == "" else f"{d_rel}/{name}" + if is_ignored(rel, globs, ignore_dirs): + continue + if active is not None and active.match(rel, is_dir=True): + continue + kept.append(name) + dirnames[:] = kept + + for name in filenames: + rel = name if d_rel == "" else f"{d_rel}/{name}" + if is_ignored(rel, globs, ignore_dirs): + continue + if active is not None and active.match(rel, is_dir=False): + continue + yield d_abs / name, rel diff --git a/coderag/api.py b/coderag/api.py index 8cf1cce..ab19c78 100644 --- a/coderag/api.py +++ b/coderag/api.py @@ -1,8 +1,9 @@ """The public CodeRAG facade — the one object every surface (CLI, HTTP, UI) routes through. -Holds the wired-together engine: embedding provider, SQLite store, FAISS vector index, -indexer, and hybrid searcher. Collaborators are built lazily so constructing a ``CodeRAG`` -is cheap and importing this module pulls in no heavy dependencies. +Holds the wired-together engine: embedding provider, the LanceDB store (chunk metadata + +text/BM25 + vectors/ANN in one place), the indexer, and the hybrid searcher. Collaborators +are built lazily so constructing a ``CodeRAG`` is cheap and importing this module pulls in no +heavy dependencies. 
""" from __future__ import annotations @@ -20,8 +21,7 @@ from coderag.embeddings import EmbeddingProvider from coderag.indexer import Indexer from coderag.retrieval.search import HybridSearcher - from coderag.store.sqlite_store import SQLiteStore - from coderag.store.vector_index import FaissVectorIndex + from coderag.store.lance_store import LanceStore logger = logging.getLogger(__name__) @@ -32,13 +32,9 @@ class CodeRAG: def __init__(self, config: Optional[Config] = None) -> None: self.config = config or Config.from_env() self._provider: Optional["EmbeddingProvider"] = None - self._store: Optional["SQLiteStore"] = None - self._vectors: Optional["FaissVectorIndex"] = None + self._store: Optional["LanceStore"] = None self._indexer: Optional["Indexer"] = None self._searcher: Optional["HybridSearcher"] = None - # Set when the store's embedding model/dim changed and the FAISS cache must - # be rebuilt from scratch (consumed when the vector index is first opened). - self._rebuild_required: bool = False # Serializes all indexing/deletion so concurrent writers (the CLI, the HTTP # surface, the MCP server's background index, and the live watcher) can't # interleave a file's delete-before-add sequence. Reads (search) are unaffected. @@ -55,44 +51,23 @@ def provider(self) -> "EmbeddingProvider": return self._provider @property - def store(self) -> "SQLiteStore": + def store(self) -> "LanceStore": if self._store is None: - from coderag.store.sqlite_store import SQLiteStore + from coderag.store.lance_store import LanceStore self.config.store_dir.mkdir(parents=True, exist_ok=True) - self._store = SQLiteStore(self.config.db_path) - # bootstrap() returns True when the embedding model/dim changed and the - # store was cleared — the vector cache must then be fully rebuilt. 
- self._rebuild_required = self._store.bootstrap( - self.provider.dim, self.provider.model_id - ) + self._store = LanceStore(self.config.store_dir, self.provider.dim) + # Clears the store when the embedding model/dim changed; a re-index then + # repopulates the now-empty tables (there is no separate cache to rebuild). + self._store.bootstrap(self.provider.dim, self.provider.model_id) return self._store - @property - def vectors(self) -> "FaissVectorIndex": - if self._vectors is None: - from coderag.store.vector_index import FaissVectorIndex - - # Access the store first so its bootstrap() runs and sets the rebuild flag. - store = self.store - self._vectors = FaissVectorIndex.open(self.config, self.provider.dim) - # FAISS is a rebuildable cache; reconcile with the source of truth on open. - # An explicit rebuild signal (model/dim changed) forces a clean rebuild - # rather than relying on a chunk-count mismatch as a proxy. - if self._rebuild_required: - self._vectors.rebuild_from_store(store) - else: - self._vectors.ensure_consistent(store) - return self._vectors - @property def indexer(self) -> "Indexer": if self._indexer is None: from coderag.indexer import Indexer - self._indexer = Indexer( - self.config, self.provider, self.store, self.vectors - ) + self._indexer = Indexer(self.config, self.provider, self.store) return self._indexer @property @@ -105,7 +80,6 @@ def searcher(self) -> "HybridSearcher": self.config, self.provider, self.store, - self.vectors, reranker=get_reranker(self.config), ) return self._searcher @@ -141,6 +115,7 @@ def search_files(self, pattern: str, **kwargs: Any) -> dict: self.config.watched_dir, pattern, ignore_globs=self.config.ignore_globs, + use_gitignore=self.config.use_gitignore, **kwargs, ) @@ -175,7 +150,7 @@ def get_file( if root not in full.parents and full != root: raise ValueError(f"Path escapes the indexed root: {path}") rel = full.relative_to(root).as_posix() - if self.store.get_file(rel) is None: + if 
self.store.get_file_meta(rel) is None: raise FileNotFoundError(f"Not an indexed file: {path}") # Decode raw bytes exactly as the indexer does — no universal-newline # translation — so line numbers line up with the chunker (Path.read_text @@ -198,19 +173,15 @@ def delete_path(self, path: Union[str, Path]) -> int: except ValueError: return 0 with self._index_lock: - removed = self.store.delete_file(rel) - if removed: - self.vectors.remove(removed) - self.vectors.save() - return len(removed) + return self.store.delete_file(rel) def warm(self) -> None: - """Eagerly load the provider, store, vectors, and embedding model. + """Eagerly load the provider, store, and embedding model. Done at server startup so the first query — and the demo UI's search-speed badge — reflect warm performance, not the one-off lazy model load. """ - self.status() # builds provider/store/vectors + self.status() # builds provider/store self.provider.embed_query("warm up") # loads the model + JITs the query path def status(self) -> dict: @@ -227,7 +198,7 @@ def status(self) -> dict: else self.config.chat_model ), "llm_base_url": self.config.openai_base_url or "", - "index_type": self.vectors.kind, + "index_type": self.store.index_kind, "rerank": self.config.rerank, "rerank_model": self.config.rerank_model if self.config.rerank else "", "adaptive_fusion": self.config.adaptive_fusion, @@ -236,7 +207,7 @@ def status(self) -> dict: "watched_dir": str(self.config.watched_dir), "total_files": stats.total_files, "total_chunks": stats.total_chunks, - "vectors": self.vectors.ntotal, + "vectors": stats.total_chunks, } def close(self) -> None: diff --git a/coderag/config.py b/coderag/config.py index 17b115f..049777d 100644 --- a/coderag/config.py +++ b/coderag/config.py @@ -28,22 +28,49 @@ ) # Directories/globs never worth indexing. Note we deliberately do NOT ignore ``tests`` — -# people search their tests too. +# people search their tests too. 
The dependency/cache entries matter most at home/system +# scale (e.g. indexing ``/home``), where they are the bulk of the file count; each +# ``/*`` entry prunes that directory wholesale anywhere in the tree. DEFAULT_IGNORE_GLOBS: Tuple[str, ...] = ( + # VCS ".git/*", ".hg/*", ".svn/*", - "node_modules/*", - ".venv/*", - "venv/*", - "env/*", - "__pycache__/*", - "*.egg-info/*", + # build / packaging output "build/*", "dist/*", + "target/*", + "*.egg-info/*", + ".next/*", + ".nuxt/*", + # language / tool caches + "__pycache__/*", ".mypy_cache/*", ".pytest_cache/*", + ".ruff_cache/*", + ".tox/*", + ".ipynb_checkpoints/*", + ".gradle/*", + ".terraform/*", ".coderag/*", + # virtualenvs / vendored dependencies + ".venv/*", + "venv/*", + "env/*", + "node_modules/*", + "site-packages/*", + "vendor/*", + # user/home caches (dominant at /home scale) + ".cache/*", + ".local/*", + ".npm/*", + ".cargo/*", + ".rustup/*", + ".m2/*", + ".nuget/*", + # editor metadata + ".idea/*", + ".vscode/*", ) @@ -117,6 +144,10 @@ class Config: # --- What to index --- languages: Tuple[str, ...] = DEFAULT_LANGUAGES ignore_globs: Tuple[str, ...] = DEFAULT_IGNORE_GLOBS + # Honor .gitignore files while walking (in addition to ignore_globs), so a repo's own + # build/output exclusions are respected. On by default; disable with + # CODERAG_GITIGNORE=0 or `--no-gitignore`. + use_gitignore: bool = True # Index any UTF-8-decodable file as plain text, even with an unknown/absent extension # (Dockerfile, Makefile, LICENSE, .log, ...). 
Off by default so code repos aren't # polluted; turn on (CODERAG_INDEX_ALL_TEXT / `coderag mcp --all-text`) to make @@ -128,12 +159,6 @@ class Config: window_lines: int = 60 # fallback line-window size window_overlap: int = 10 - # --- Vector index --- - index_type: str = "auto" # "auto" | "flat" | "ivf" - ivf_threshold: int = 50_000 # switch flat->ivf above this many vectors - ivf_nlist: int = 0 # 0 => derived from corpus size - ivf_nprobe: int = 16 - # --- Retrieval --- top_k: int = 8 fetch_k: int = 50 # candidates pulled from each retriever before fusion @@ -182,6 +207,11 @@ class Config: # --- Indexing throughput --- embed_batch_size: int = 64 index_workers: int = 4 + # Embedding device for the local (fastembed) provider. "auto" uses a CUDA GPU when + # onnxruntime exposes one (10-50x faster indexing) and falls back to CPU otherwise; + # "cpu"/"cuda" force it. A missing/broken GPU always degrades to CPU rather than failing. + embed_device: str = "auto" # "auto" | "cpu" | "cuda" + embed_threads: int = 0 # ONNX CPU threads (0 => library default) # --- Optional LLM answer surface --- # Which backend turns retrieved chunks into a grounded answer. 
@@ -224,14 +254,6 @@ class Config: demo_max_answers: int = 5 # LLM answers allowed per browser session demo_cooldown_seconds: int = 20 # minimum seconds between answers in a session - @property - def db_path(self) -> Path: - return self.store_dir / "coderag.db" - - @property - def faiss_path(self) -> Path: - return self.store_dir / "index.faiss" - def with_overrides(self, **kwargs: object) -> "Config": """Return a copy with the given fields replaced (config stays immutable).""" return replace(self, **kwargs) # type: ignore[arg-type] @@ -251,8 +273,6 @@ def from_env(cls, **overrides: object) -> "Config": ), watched_dir=_env_path("CODERAG_WATCHED_DIR", Path.cwd()), store_dir=_env_path("CODERAG_STORE_DIR", Path.cwd() / ".coderag"), - index_type=_env_str("CODERAG_INDEX_TYPE", cls.index_type), - ivf_threshold=_env_int("CODERAG_IVF_THRESHOLD", cls.ivf_threshold), top_k=_env_int("CODERAG_TOP_K", cls.top_k), fetch_k=_env_int("CODERAG_FETCH_K", cls.fetch_k), rrf_k=_env_int("CODERAG_RRF_K", cls.rrf_k), @@ -280,6 +300,8 @@ def from_env(cls, **overrides: object) -> "Config": ), embed_batch_size=_env_int("CODERAG_EMBED_BATCH", cls.embed_batch_size), index_workers=_env_int("CODERAG_WORKERS", cls.index_workers), + embed_device=_env_str("CODERAG_EMBED_DEVICE", cls.embed_device), + embed_threads=_env_int("CODERAG_EMBED_THREADS", cls.embed_threads), llm_provider=_env_str("CODERAG_LLM_PROVIDER", cls.llm_provider), chat_model=_env_str("CODERAG_CHAT_MODEL", cls.chat_model), anthropic_api_key=os.getenv("ANTHROPIC_API_KEY"), @@ -290,6 +312,9 @@ def from_env(cls, **overrides: object) -> "Config": api_key=os.getenv("CODERAG_API_KEY"), cors_origins=_env_tuple("CODERAG_CORS_ORIGINS", cls.cors_origins), index_all_text=_env_bool("CODERAG_INDEX_ALL_TEXT", cls.index_all_text), + # CODERAG_IGNORE_GLOBS *appends* extra excludes to the built-in defaults. 
+ ignore_globs=DEFAULT_IGNORE_GLOBS + _env_tuple("CODERAG_IGNORE_GLOBS", ()), + use_gitignore=_env_bool("CODERAG_GITIGNORE", cls.use_gitignore), mcp_auto_index=_env_bool("CODERAG_MCP_AUTO_INDEX", cls.mcp_auto_index), mcp_watch=_env_bool("CODERAG_MCP_WATCH", cls.mcp_watch), mcp_snippet_lines=_env_int( diff --git a/coderag/embeddings/__init__.py b/coderag/embeddings/__init__.py index 27585eb..e4fa418 100644 --- a/coderag/embeddings/__init__.py +++ b/coderag/embeddings/__init__.py @@ -50,7 +50,13 @@ def get_provider(config: Config) -> EmbeddingProvider: if provider == "fastembed": from coderag.embeddings.fastembed_provider import FastEmbedProvider - return FastEmbedProvider(config.model, cache_dir=config.cache_dir) + return FastEmbedProvider( + config.model, + cache_dir=config.cache_dir, + device=config.embed_device, + threads=config.embed_threads, + batch_size=config.embed_batch_size, + ) if provider == "openai": from coderag.embeddings.openai_provider import OpenAIEmbeddingProvider diff --git a/coderag/embeddings/fastembed_provider.py b/coderag/embeddings/fastembed_provider.py index d9a81ac..851f174 100644 --- a/coderag/embeddings/fastembed_provider.py +++ b/coderag/embeddings/fastembed_provider.py @@ -22,9 +22,20 @@ class FastEmbedProvider: name = "fastembed" - def __init__(self, model: str = DEFAULT_MODEL, cache_dir: Optional[Path] = None): + def __init__( + self, + model: str = DEFAULT_MODEL, + cache_dir: Optional[Path] = None, + *, + device: str = "auto", + threads: int = 0, + batch_size: int = 64, + ): self._model_name = model self._cache_dir = str(cache_dir) if cache_dir else None + self._device = device + self._threads = threads + self._batch_size = max(1, batch_size) self._dim = self._lookup_dim(model) @staticmethod @@ -39,12 +50,50 @@ def _lookup_dim(model: str) -> Optional[int]: pass return None + def _providers(self) -> Optional[list[str]]: + """ONNX execution providers for the chosen device, or None for the library default. 
+ + CUDA is listed with a CPU fallback so onnxruntime degrades gracefully at runtime; + ``auto`` only requests CUDA when onnxruntime actually exposes it. + """ + if self._device == "cpu": + return ["CPUExecutionProvider"] + if self._device == "cuda": + return ["CUDAExecutionProvider", "CPUExecutionProvider"] + try: # auto: prefer a GPU only if one is really available + import onnxruntime as ort + + if "CUDAExecutionProvider" in ort.get_available_providers(): + return ["CUDAExecutionProvider", "CPUExecutionProvider"] + except Exception: # pragma: no cover - onnxruntime optional / probe best-effort + pass + return None + @cached_property def _model(self) -> Any: from fastembed import TextEmbedding - logger.info("Loading fastembed model %s ...", self._model_name) - return TextEmbedding(self._model_name, cache_dir=self._cache_dir) + kwargs: dict[str, Any] = {"cache_dir": self._cache_dir} + providers = self._providers() + if providers is not None: + kwargs["providers"] = providers + if self._threads > 0: + kwargs["threads"] = self._threads + logger.info( + "Loading fastembed model %s (device=%s)…", self._model_name, self._device + ) + try: + return TextEmbedding(self._model_name, **kwargs) + except ( + Exception + ) as exc: # pragma: no cover - GPU init can fail on broken drivers + if providers and providers[0] != "CPUExecutionProvider": + logger.warning( + "GPU embedding init failed (%s); falling back to CPU.", exc + ) + kwargs.pop("providers", None) + return TextEmbedding(self._model_name, **kwargs) + raise @property def model_id(self) -> str: @@ -60,7 +109,7 @@ def dim(self) -> int: def embed_documents(self, texts: Sequence[str]) -> np.ndarray: if not texts: return np.zeros((0, self.dim), dtype="float32") - vecs = list(self._model.passage_embed(list(texts))) + vecs = list(self._model.passage_embed(list(texts), batch_size=self._batch_size)) return np.vstack(vecs).astype("float32") def embed_query(self, text: str) -> np.ndarray: diff --git 
a/coderag/eval/datasets/coderag_self.jsonl b/coderag/eval/datasets/coderag_self.jsonl
index 4575746..0dce946 100644
--- a/coderag/eval/datasets/coderag_self.jsonl
+++ b/coderag/eval/datasets/coderag_self.jsonl
@@ -1,23 +1,23 @@
 {"query": "where are duplicate or stale vectors removed when a file changes", "relevant_files": ["coderag/indexer.py"], "source": "curated"}
-{"query": "how is the FAISS index rebuilt from the SQLite source of truth", "relevant_files": ["coderag/store/vector_index.py"], "source": "curated"}
+{"query": "how is the FAISS index rebuilt from the SQLite source of truth", "relevant_files": ["coderag/store/lance_store.py"], "source": "curated"}
 {"query": "where is reciprocal rank fusion implemented", "relevant_files": ["coderag/retrieval/fusion.py"], "source": "curated"}
 {"query": "how are dense and lexical search results combined into one ranking", "relevant_files": ["coderag/retrieval/search.py"], "source": "curated"}
 {"query": "how does the debounced filesystem watcher trigger reindexing", "relevant_files": ["coderag/watch.py"], "source": "curated"}
 {"query": "where is symbol-aware chunking for Python using the ast module", "relevant_files": ["coderag/chunking/python_ast.py"], "source": "curated"}
 {"query": "how are functions and classes chunked for Go and Rust via tree-sitter", "relevant_files": ["coderag/chunking/treesitter.py"], "source": "curated"}
-{"query": "where is BM25 keyword search over SQLite FTS5 implemented", "relevant_files": ["coderag/store/sqlite_store.py"], "source": "curated"}
+{"query": "where is BM25 keyword search over SQLite FTS5 implemented", "relevant_files": ["coderag/store/lance_store.py"], "source": "curated"}
 {"query": "how does the HTTP API require an API key for authentication", "relevant_files": ["coderag/surfaces/http_api.py"], "source": "curated"}
 {"query": "how is an LLM answer streamed over the retrieved code chunks", "relevant_files": ["coderag/llm.py"], "source": "curated"}
 {"query": "where is the
OpenAI-compatible embedding provider implemented", "relevant_files": ["coderag/embeddings/openai_provider.py"], "source": "curated"}
 {"query": "how does configuration load from environment variables and a dotenv file", "relevant_files": ["coderag/config.py"], "source": "curated"}
 {"query": "where is the command line search subcommand defined", "relevant_files": ["coderag/surfaces/cli.py"], "source": "curated"}
-{"query": "how does the vector index switch from flat to IVF as the corpus grows", "relevant_files": ["coderag/store/vector_index.py"], "source": "curated"}
+{"query": "how does the vector index switch from flat to IVF as the corpus grows", "relevant_files": ["coderag/store/lance_store.py"], "source": "curated"}
 {"query": "where is content hashing used to skip unchanged files on reindex", "relevant_files": ["coderag/indexer.py"], "source": "curated"}
 {"query": "how are file contents served safely for only indexed files", "relevant_files": ["coderag/api.py"], "source": "curated"}
 {"query": "where does the web UI render results with syntax highlighting", "relevant_files": ["coderag/surfaces/webui.py"], "source": "curated"}
 {"query": "how is an oversized function split into smaller line windows", "relevant_files": ["coderag/chunking/base.py"], "source": "curated"}
-{"query": "where is the database table schema for chunks and files defined", "relevant_files": ["coderag/store/schema.py"], "source": "curated"}
-{"query": "how does a model or embedding dimension change get detected and trigger a rebuild", "relevant_files": ["coderag/store/sqlite_store.py", "coderag/api.py"], "source": "curated"}
+{"query": "where is the database table schema for chunks and files defined", "relevant_files": ["coderag/store/lance_store.py"], "source": "curated"}
+{"query": "how does a model or embedding dimension change get detected and trigger a rebuild", "relevant_files": ["coderag/store/lance_store.py", "coderag/api.py"], "source": "curated"}
 {"query": "where is the deterministic
offline fake embedding provider for tests", "relevant_files": ["coderag/embeddings/fake_provider.py"], "source": "curated"}
 {"query": "how are file extensions mapped to programming languages for chunking", "relevant_files": ["coderag/chunking/languages.py"], "source": "curated"}
 {"query": "where is text split into lines without collapsing carriage returns", "relevant_files": ["coderag/_lines.py"], "source": "curated"}
diff --git a/coderag/eval/datasets/coderag_self_identifiers.jsonl b/coderag/eval/datasets/coderag_self_identifiers.jsonl
index e68a227..fd3312f 100644
--- a/coderag/eval/datasets/coderag_self_identifiers.jsonl
+++ b/coderag/eval/datasets/coderag_self_identifiers.jsonl
@@ -1,14 +1,14 @@
 {"query": "reciprocal_rank_fusion", "relevant_files": ["coderag/retrieval/fusion.py"], "relevant_symbols": ["reciprocal_rank_fusion"], "source": "curated-id"}
 {"query": "search", "relevant_files": ["coderag/retrieval/search.py"], "relevant_symbols": ["HybridSearcher.search"], "source": "curated-id"}
 {"query": "_index_file", "relevant_files": ["coderag/indexer.py"], "relevant_symbols": ["Indexer._index_file"], "source": "curated-id"}
-{"query": "rebuild_from_store", "relevant_files": ["coderag/store/vector_index.py"], "relevant_symbols": ["FaissVectorIndex.rebuild_from_store"], "source": "curated-id"}
-{"query": "_choose_kind", "relevant_files": ["coderag/store/vector_index.py"], "relevant_symbols": ["FaissVectorIndex._choose_kind"], "source": "curated-id"}
-{"query": "search", "relevant_files": ["coderag/store/vector_index.py"], "relevant_symbols": ["FaissVectorIndex.search"], "source": "curated-id"}
-{"query": "_derive_nlist", "relevant_files": ["coderag/store/vector_index.py"], "relevant_symbols": ["_derive_nlist"], "source": "curated-id"}
-{"query": "fts_search", "relevant_files": ["coderag/store/sqlite_store.py"], "relevant_symbols": ["SQLiteStore.fts_search"], "source": "curated-id"}
-{"query": "bootstrap", "relevant_files": ["coderag/store/sqlite_store.py"],
"relevant_symbols": ["SQLiteStore.bootstrap"], "source": "curated-id"}
-{"query": "hydrate", "relevant_files": ["coderag/store/sqlite_store.py"], "relevant_symbols": ["SQLiteStore.hydrate"], "source": "curated-id"}
-{"query": "_sanitize_fts", "relevant_files": ["coderag/store/sqlite_store.py"], "relevant_symbols": ["_sanitize_fts"], "source": "curated-id"}
+{"query": "rebuild_from_store", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["FaissVectorIndex.rebuild_from_store"], "source": "curated-id"}
+{"query": "_choose_kind", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["FaissVectorIndex._choose_kind"], "source": "curated-id"}
+{"query": "search", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["FaissVectorIndex.search"], "source": "curated-id"}
+{"query": "_derive_nlist", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["_derive_nlist"], "source": "curated-id"}
+{"query": "fts_search", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["SQLiteStore.fts_search"], "source": "curated-id"}
+{"query": "bootstrap", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["SQLiteStore.bootstrap"], "source": "curated-id"}
+{"query": "hydrate", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["SQLiteStore.hydrate"], "source": "curated-id"}
+{"query": "_sanitize_fts", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["_sanitize_fts"], "source": "curated-id"}
 {"query": "watch", "relevant_files": ["coderag/watch.py"], "relevant_symbols": ["watch"], "source": "curated-id"}
 {"query": "extract_spans", "relevant_files": ["coderag/chunking/python_ast.py"], "relevant_symbols": ["extract_spans"], "source": "curated-id"}
 {"query": "stream_answer", "relevant_files": ["coderag/llm.py"], "relevant_symbols": ["stream_answer"], "source": "curated-id"}
diff --git
a/coderag/eval/datasets/coderag_self_symbols.jsonl b/coderag/eval/datasets/coderag_self_symbols.jsonl
index 4c484c3..1409e6c 100644
--- a/coderag/eval/datasets/coderag_self_symbols.jsonl
+++ b/coderag/eval/datasets/coderag_self_symbols.jsonl
@@ -1,14 +1,14 @@
 {"query": "where is reciprocal rank fusion implemented", "relevant_files": ["coderag/retrieval/fusion.py"], "relevant_symbols": ["reciprocal_rank_fusion"], "source": "curated"}
 {"query": "how are dense and lexical search results combined into one ranking", "relevant_files": ["coderag/retrieval/search.py"], "relevant_symbols": ["HybridSearcher.search"], "source": "curated"}
 {"query": "where are a changed file's old chunks removed before new ones are added", "relevant_files": ["coderag/indexer.py"], "relevant_symbols": ["Indexer._index_file"], "source": "curated"}
-{"query": "how is the FAISS index rebuilt from the SQLite store", "relevant_files": ["coderag/store/vector_index.py"], "relevant_symbols": ["FaissVectorIndex.rebuild_from_store"], "source": "curated"}
-{"query": "where does the vector index choose between flat and IVF", "relevant_files": ["coderag/store/vector_index.py"], "relevant_symbols": ["FaissVectorIndex._choose_kind"], "source": "curated"}
-{"query": "how are query vectors searched in the FAISS index", "relevant_files": ["coderag/store/vector_index.py"], "relevant_symbols": ["FaissVectorIndex.search"], "source": "curated"}
-{"query": "how is the number of IVF clusters derived from corpus size", "relevant_files": ["coderag/store/vector_index.py"], "relevant_symbols": ["_derive_nlist"], "source": "curated"}
-{"query": "where is BM25 keyword search over the full text index", "relevant_files": ["coderag/store/sqlite_store.py"], "relevant_symbols": ["SQLiteStore.fts_search"], "source": "curated"}
-{"query": "how does the store detect a model or embedding dimension change on startup", "relevant_files": ["coderag/store/sqlite_store.py"], "relevant_symbols": ["SQLiteStore.bootstrap"], "source":
"curated"}
-{"query": "where are search results hydrated from the database by chunk id", "relevant_files": ["coderag/store/sqlite_store.py"], "relevant_symbols": ["SQLiteStore.hydrate"], "source": "curated"}
-{"query": "how are full text search query strings sanitized", "relevant_files": ["coderag/store/sqlite_store.py"], "relevant_symbols": ["_sanitize_fts"], "source": "curated"}
+{"query": "how is the FAISS index rebuilt from the SQLite store", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["FaissVectorIndex.rebuild_from_store"], "source": "curated"}
+{"query": "where does the vector index choose between flat and IVF", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["FaissVectorIndex._choose_kind"], "source": "curated"}
+{"query": "how are query vectors searched in the FAISS index", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["FaissVectorIndex.search"], "source": "curated"}
+{"query": "how is the number of IVF clusters derived from corpus size", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["_derive_nlist"], "source": "curated"}
+{"query": "where is BM25 keyword search over the full text index", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["SQLiteStore.fts_search"], "source": "curated"}
+{"query": "how does the store detect a model or embedding dimension change on startup", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["SQLiteStore.bootstrap"], "source": "curated"}
+{"query": "where are search results hydrated from the database by chunk id", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["SQLiteStore.hydrate"], "source": "curated"}
+{"query": "how are full text search query strings sanitized", "relevant_files": ["coderag/store/lance_store.py"], "relevant_symbols": ["_sanitize_fts"], "source": "curated"}
 {"query": "how does the filesystem watcher start watching and applying
changes", "relevant_files": ["coderag/watch.py"], "relevant_symbols": ["watch"], "source": "curated"} {"query": "where are python functions and classes extracted as symbol spans", "relevant_files": ["coderag/chunking/python_ast.py"], "relevant_symbols": ["extract_spans"], "source": "curated"} {"query": "how is an LLM answer streamed over retrieved code chunks", "relevant_files": ["coderag/llm.py"], "relevant_symbols": ["stream_answer"], "source": "curated"} diff --git a/coderag/eval/harness.py b/coderag/eval/harness.py index 4ec5db9..1a942d2 100644 --- a/coderag/eval/harness.py +++ b/coderag/eval/harness.py @@ -145,13 +145,13 @@ def compare_modes( adaptive_fusion=False, graph_expansion=False, ) - searcher = HybridSearcher(cfg, cr.provider, cr.store, cr.vectors) + searcher = HybridSearcher(cfg, cr.provider, cr.store) results.append( evaluate(searcher.search, cases, label=label, ks=ks, level=level) ) if adaptive: cfg = cr.config.with_overrides(adaptive_fusion=True, graph_expansion=False) - searcher = HybridSearcher(cfg, cr.provider, cr.store, cr.vectors) + searcher = HybridSearcher(cfg, cr.provider, cr.store) results.append( evaluate(searcher.search, cases, label="adaptive", ks=ks, level=level) ) @@ -162,7 +162,7 @@ def compare_modes( adaptive_fusion=False, graph_expansion=True, ) - searcher = HybridSearcher(cfg, cr.provider, cr.store, cr.vectors) + searcher = HybridSearcher(cfg, cr.provider, cr.store) results.append( evaluate(searcher.search, cases, label="hybrid+graph", ks=ks, level=level) ) @@ -173,9 +173,7 @@ def compare_modes( adaptive_fusion=False, graph_expansion=False, ) - searcher = HybridSearcher( - cfg, cr.provider, cr.store, cr.vectors, reranker=reranker - ) + searcher = HybridSearcher(cfg, cr.provider, cr.store, reranker=reranker) results.append( evaluate(searcher.search, cases, label="hybrid+rerank", ks=ks, level=level) ) diff --git a/coderag/fs_search.py b/coderag/fs_search.py index 7225b97..67cf9cf 100644 --- a/coderag/fs_search.py +++ 
b/coderag/fs_search.py @@ -27,9 +27,9 @@ from dataclasses import dataclass, field from fnmatch import fnmatch from pathlib import Path -from typing import Dict, Iterator, List, Optional, Sequence, Tuple +from typing import Dict, List, Optional, Sequence, Tuple -from coderag._ignore import ignore_dir_names, is_ignored +from coderag._ignore import walk_files from coderag._lines import split_lines from coderag.config import DEFAULT_IGNORE_GLOBS @@ -73,22 +73,6 @@ def _rg_available() -> bool: return shutil.which("rg") is not None -def _iter_files(root: Path, ignore_globs: Sequence[str]) -> Iterator[Tuple[Path, str]]: - """Yield ``(absolute_path, posix_rel)`` for every non-ignored file under ``root``.""" - ignore_dirs = ignore_dir_names(ignore_globs) - for dirpath, dirnames, filenames in os.walk(root): - dirnames[:] = [d for d in dirnames if d not in ignore_dirs] - for name in filenames: - abs_path = Path(dirpath) / name - try: - rel = abs_path.relative_to(root).as_posix() - except ValueError: # pragma: no cover - defensive - continue - if is_ignored(rel, ignore_globs, ignore_dirs): - continue - yield abs_path, rel - - def _glob_matches(rel: str, glob: str) -> bool: """Match a glob against the full relative path or just the basename (``*.py``).""" return fnmatch(rel, glob) or fnmatch(rel.rsplit("/", 1)[-1], glob) @@ -226,6 +210,7 @@ def search_files( limit: int = DEFAULT_LIMIT, offset: int = 0, ignore_globs: Sequence[str] = DEFAULT_IGNORE_GLOBS, + use_gitignore: bool = True, ignore_case: bool = False, max_file_bytes: int = _MAX_FILE_BYTES, redact: bool = True, @@ -261,7 +246,9 @@ def search_files( if target == "files": rels = sorted( rel - for _, rel in _iter_files(root_path, ignore_globs) + for _, rel in walk_files( + root_path, ignore_globs, use_gitignore=use_gitignore + ) if _glob_matches(rel, pattern) ) page, truncated, next_offset = _paginate(rels, offset, limit) @@ -285,7 +272,9 @@ def search_files( files = [ (abs_path, rel) - for abs_path, rel in 
_iter_files(root_path, ignore_globs) + for abs_path, rel in walk_files( + root_path, ignore_globs, use_gitignore=use_gitignore + ) if file_glob is None or _glob_matches(rel, file_glob) ] diff --git a/coderag/indexer.py b/coderag/indexer.py index 9039645..d4442ed 100644 --- a/coderag/indexer.py +++ b/coderag/indexer.py @@ -1,34 +1,72 @@ """Incremental indexing orchestration. -Ties chunking -> embedding -> SQLite -> FAISS together with content-hash change detection. -The critical correctness property (which the old ``monitor.py`` got wrong): a changed file's -*old* chunks are removed from both the store and the vector index **before** the new ones are -added, so re-saving a file never accumulates duplicate or stale vectors. +Ties chunking -> embedding -> the LanceDB store together with content-hash change detection. +The critical correctness property: a changed file's *old* chunks are removed from the store +**before** the new ones are added (``write_file(..., replace=True)``), so re-saving a file +never accumulates duplicate or stale rows. 
""" from __future__ import annotations import hashlib import logging -import os +import sys +import time from dataclasses import dataclass from pathlib import Path -from typing import Iterator, List, Optional, Tuple +from typing import TYPE_CHECKING, Any, Dict, Iterator, List, Optional, Tuple import numpy as np -from coderag._ignore import ignore_dir_names, is_ignored +from coderag._ignore import ignore_dir_names, is_ignored, walk_files from coderag.chunking import chunk_file from coderag.chunking.languages import detect_language from coderag.config import Config from coderag.embeddings import EmbeddingProvider -from coderag.store.sqlite_store import SQLiteStore -from coderag.store.vector_index import FaissVectorIndex from coderag.types import Chunk, IndexStats +if TYPE_CHECKING: + from coderag.store.lance_store import LanceStore + logger = logging.getLogger(__name__) +class _ProgressReporter: + """Live, human-facing indexing progress, written to stderr (stdout stays clean). + + A large index is otherwise a silent wait — the very problem behind an agent sitting at + "Working… 10 min" while an over-broad root is crawled. This narrates *both* phases: the + discovery walk (which hashes every candidate before a single chunk is embedded) and the + embedding pass. On a TTY it redraws one line in place; otherwise (agent terminals, + captured logs) it prints throttled newline updates so output stays readable. It is a + no-op unless ``enabled`` — the library facade and the MCP background index pass + ``progress=False`` and stay quiet. 
+ """ + + def __init__(self, enabled: bool) -> None: + self.enabled = enabled + self._tty = bool(getattr(sys.stderr, "isatty", lambda: False)()) + self._next = 0.0 # monotonic time of the next allowed (unforced) update + + def update(self, msg: str, *, force: bool = False) -> None: + """Show ``msg``, throttled so per-file calls don't flood the terminal/logs.""" + if not self.enabled: + return + now = time.monotonic() + if not force and now < self._next: + return + self._next = now + (0.1 if self._tty else 2.0) + sys.stderr.write(f"\r\x1b[2K{msg}" if self._tty else msg + "\n") + sys.stderr.flush() + + def done(self, msg: str) -> None: + """Emit a final line and stop redrawing (clears the in-place line on a TTY).""" + if not self.enabled: + return + sys.stderr.write(f"\r\x1b[2K{msg}\n" if self._tty else msg + "\n") + sys.stderr.flush() + + @dataclass(slots=True) class _Work: rel: str @@ -36,6 +74,8 @@ class _Work: text: str content_hash: str mtime: float + size: int + existed: bool # whether the file already had rows (→ replace, delete-before-add) class Indexer: @@ -43,13 +83,11 @@ def __init__( self, config: Config, provider: EmbeddingProvider, - store: SQLiteStore, - vectors: FaissVectorIndex, + store: "LanceStore", ) -> None: self.config = config self.provider = provider self.store = store - self.vectors = vectors self._ignore_dirs = ignore_dir_names(config.ignore_globs) # --- public --- @@ -64,27 +102,44 @@ def index( root = self.config.watched_dir.resolve() target = (target or self.config.watched_dir).resolve() prune = target == root # only a full-root pass removes vanished files + rep = _ProgressReporter(progress) stats = IndexStats() if full: - self._reset() + self.store.clear() - # 1. Discover candidates and detect what actually changed (cheap hash check). + # 1. Discover candidates and detect what actually changed (cheap stat/hash check). + # Preload all file metadata once (one scan) so discovery does no per-file query. 
+ metas = self.store.all_file_metas() + rep.update(f"Scanning {target} for files to index…", force=True) walked: set[str] = set() work: List[_Work] = [] for abs_path, rel, language in self._walk(target, root): walked.add(rel) - item = self._maybe_work(abs_path, rel, language) + item = self._maybe_work(abs_path, rel, language, metas) if item is None: stats.files_skipped += 1 else: work.append(item) + rep.update( + f"Scanning {target} — {len(walked)} file(s) seen, " + f"{len(work)} to index, {stats.files_skipped} unchanged/skipped…" + ) + if work: + rep.update( + f"Embedding {len(work)} changed file(s) " + f"({stats.files_skipped} unchanged/skipped)…", + force=True, + ) + else: + rep.update( + f"Up to date — {stats.files_skipped} file(s) unchanged.", force=True + ) # 2. (Re)index changed files. Chunking + embedding (the CPU/network cost) may run - # in parallel across files (config.index_workers); the SQLite + FAISS writes - # stay on this single thread to preserve the delete-before-add invariant and - # the single-connection store. - for added, removed in self._embed_and_write(work, progress=progress): + # in parallel across files (config.index_workers); the store writes stay on this + # single thread to preserve the delete-before-add invariant and single writer. + for added, removed in self._embed_and_write(work, reporter=rep): stats.chunks_added += added stats.chunks_removed += removed stats.files_indexed += 1 @@ -92,28 +147,54 @@ def index( # 3. Prune files that disappeared from disk (full-root passes only). if prune: for rel in set(self.store.all_file_paths()) - walked: - removed_ids = self.store.delete_file(rel) - self.vectors.remove(removed_ids) + removed = self.store.delete_file(rel) stats.files_removed += 1 - stats.chunks_removed += len(removed_ids) - - # 4. Persist FAISS (rebuilding to IVF if we crossed the scale threshold). - if not self.vectors.maybe_upgrade(self.store): - self.vectors.save() + stats.chunks_removed += removed + + # 4. Persist. 
A full pass that changed something rebuilds the FTS/vector indexes + # and compacts; an incremental/single-file pass just flushes (new rows are + # searchable via LanceDB's flat scan of the unindexed tail) so a watcher edit + # never triggers a whole-index rebuild. + changed = stats.files_indexed > 0 or stats.files_removed > 0 + if prune and changed: + self.store.optimize() + else: + self.store.flush() final = self.store.stats() stats.total_files = final.total_files stats.total_chunks = final.total_chunks + rep.done( + f"✓ Indexed {stats.files_indexed} file(s) — " + f"{stats.total_files} total / {stats.total_chunks} chunks." + ) return stats # --- internals --- - def _reset(self) -> None: - for rel in list(self.store.all_file_paths()): - self.store.delete_file(rel) - self.vectors.rebuild_from_store(self.store) # -> empty - - def _maybe_work(self, abs_path: Path, rel: str, language: str) -> Optional[_Work]: + def _maybe_work( + self, + abs_path: Path, + rel: str, + language: str, + metas: Dict[str, Dict[str, Any]], + ) -> Optional[_Work]: + existing = metas.get(rel) + try: + st = abs_path.stat() + except OSError as exc: + logger.warning("Cannot stat %s: %s", abs_path, exc) + return None + # Cheap fast-path: if size and mtime are unchanged, skip the read+hash entirely. + # The hash stays the authority on "did content change" — this only avoids the read + # for the common untouched case (the dominant cost of re-indexing a large tree). 
+ if ( + existing is not None + and existing.get("size") is not None + and int(existing["size"]) == st.st_size + and abs(float(existing.get("mtime") or 0.0) - st.st_mtime) < 1e-6 + ): + return None try: data = abs_path.read_bytes() except OSError as exc: @@ -124,61 +205,54 @@ def _maybe_work(self, abs_path: Path, rel: str, language: str) -> Optional[_Work if b"\x00" in data[:8192]: return None # binary file (NUL byte in the head) — never index as text content_hash = hashlib.sha256(data).hexdigest() - existing = self.store.get_file(rel) - if existing is not None and existing["content_hash"] == content_hash: - return None # unchanged -> no embedding cost + if existing is not None and existing.get("content_hash") == content_hash: + return None # content unchanged (e.g. touched) -> no embedding cost text = data.decode("utf-8", errors="replace") - return _Work(rel, language, text, content_hash, abs_path.stat().st_mtime) + return _Work( + rel, + language, + text, + content_hash, + st.st_mtime, + st.st_size, + existing is not None, + ) def _embed_and_write( - self, work: List[_Work], *, progress: bool + self, work: List[_Work], *, reporter: _ProgressReporter ) -> Iterator[Tuple[int, int]]: """Chunk+embed each file (optionally across worker threads) and apply the writes. Embedding is the expensive, parallelizable step and touches no shared mutable - state, so it runs in a thread pool when ``index_workers > 1``. The store/FAISS - writes are drained here on the single calling thread, so the no-duplicate - (delete-before-add) invariant and the single-writer store are preserved. + state, so it runs in a thread pool when ``index_workers > 1``. The store writes are + drained here on the single calling thread, so the no-duplicate (delete-before-add) + invariant and the single-writer store are preserved. 
""" if not work: return workers = max(1, self.config.index_workers) - bar = self._progress_bar(len(work), progress) - try: - if workers > 1 and len(work) > 1: - from concurrent.futures import ThreadPoolExecutor, as_completed - - with ThreadPoolExecutor(max_workers=workers) as pool: - futures = {pool.submit(self._prepare, item): item for item in work} - for fut in as_completed(futures): - chunks, vectors = fut.result() - yield self._write(futures[fut], chunks, vectors) - if bar is not None: - bar.update(1) - else: - for item in work: - chunks, vectors = self._prepare(item) - yield self._write(item, chunks, vectors) - if bar is not None: - bar.update(1) - finally: - if bar is not None: - bar.close() - - @staticmethod - def _progress_bar(total: int, progress: bool): # type: ignore[no-untyped-def] - if not progress: - return None - try: - from tqdm import tqdm - - return tqdm(total=total, desc="Indexing", unit="file") - except Exception: # pragma: no cover - return None + total = len(work) + done = 0 + if workers > 1 and len(work) > 1: + from concurrent.futures import ThreadPoolExecutor, as_completed + + with ThreadPoolExecutor(max_workers=workers) as pool: + futures = {pool.submit(self._prepare, item): item for item in work} + for fut in as_completed(futures): + chunks, vectors = fut.result() + yield self._write(futures[fut], chunks, vectors) + done += 1 + reporter.update(f"Embedding {done}/{total} file(s)…") + else: + for item in work: + chunks, vectors = self._prepare(item) + yield self._write(item, chunks, vectors) + done += 1 + reporter.update(f"Embedding {done}/{total} file(s)…") def _prepare(self, item: _Work) -> Tuple[List[Chunk], Optional[np.ndarray]]: - """Chunk and embed a file. Pure with respect to the store/FAISS, so it is safe to - run in a worker thread; the resulting writes are applied by :meth:`_write`.""" + """Chunk and embed a file. 
Pure with respect to the store, so it is safe to run in + a worker thread; the resulting writes are applied by :meth:`_write`.""" chunks = chunk_file(item.text, item.language, self.config) if not chunks: return [], None @@ -188,27 +262,20 @@ def _prepare(self, item: _Work) -> Tuple[List[Chunk], Optional[np.ndarray]]: def _write( self, item: _Work, chunks: List[Chunk], vectors: Optional[np.ndarray] ) -> Tuple[int, int]: - """Apply a prepared file: remove its old chunks (store + FAISS) before adding the - new ones. Must run single-threaded — it is the only writer.""" - removed = 0 - existing = self.store.get_file(item.rel) - if existing is not None: - old_ids = self.store.delete_chunks_for_file(int(existing["id"])) - self.vectors.remove(old_ids) - removed = len(old_ids) - - file_id = self.store.upsert_file( - item.rel, item.language, item.content_hash, item.mtime - ) - - if not chunks or vectors is None: - return 0, removed + """Apply a prepared file to the store (delete-before-add for a replacement). - new_ids = self.store.add_chunks( - file_id, chunks, vectors, self.provider.model_id + Must run single-threaded — it is the only writer. 
+ """ + return self.store.write_file( + item.rel, + item.language, + item.content_hash, + item.mtime, + item.size, + chunks, + vectors, + replace=item.existed, ) - self.vectors.add(np.array(new_ids, dtype="int64"), vectors) - return len(new_ids), removed def _walk(self, target: Path, root: Path) -> Iterator[Tuple[Path, str, str]]: if target.is_file(): @@ -218,17 +285,19 @@ def _walk(self, target: Path, root: Path) -> Iterator[Tuple[Path, str, str]]: yield target, rel, language return - for dirpath, dirnames, filenames in os.walk(target): - # prune ignored directories in place for speed - dirnames[:] = [d for d in dirnames if d not in self._ignore_dirs] - for name in filenames: - abs_path = Path(dirpath) / name - rel = self._rel(abs_path, root) - if not rel or self._ignored(rel): - continue - language = detect_language(name, all_text=self.config.index_all_text) - if language: - yield abs_path, rel, language + # walk_files owns dir-pruning + ignore-glob + .gitignore matching, shared with + # fs_search so semantic and exact search see exactly the same files. + for abs_path, rel in walk_files( + target, + self.config.ignore_globs, + root=root, + use_gitignore=self.config.use_gitignore, + ): + language = detect_language( + abs_path.name, all_text=self.config.index_all_text + ) + if language: + yield abs_path, rel, language @staticmethod def _rel(abs_path: Path, root: Path) -> Optional[str]: diff --git a/coderag/install.py b/coderag/install.py index a421a11..c544e34 100644 --- a/coderag/install.py +++ b/coderag/install.py @@ -21,6 +21,7 @@ import json import shutil import sys +import textwrap import tomllib from dataclasses import dataclass, field from pathlib import Path @@ -149,6 +150,49 @@ def detect_targets() -> List[str]: return found +# Whole-home / whole-system locations that are almost never the right thing to index: too +# many files to finish in reasonable time, and (for "/") even pseudo-filesystems. The +# wizard warns before committing to one. 
The natural unit is a single project/repo. +_BROAD_ROOTS = { + "/home", + "/usr", + "/etc", + "/var", + "/opt", + "/mnt", + "/srv", + "/root", +} + + +def default_workspace(start: Optional[Path] = None) -> Path: + """Best default for the workspace prompt: the enclosing git repo root, else cwd. + + Running the installer from inside a project should index that whole project — not a + stray subdirectory (too narrow to be useful) and certainly not the user's whole home + (too broad to finish). Walking up to the nearest ``.git`` finds the natural per-repo + root, which is the scope CodeRAG is tuned for. + """ + start = (start or Path.cwd()).resolve() + for d in (start, *start.parents): + if (d / ".git").exists(): + return d + return start + + +def _is_broad_root(path: Path) -> bool: + """Heuristic: is ``path`` a whole-home/whole-system location rather than a project?""" + try: + rp = path.expanduser().resolve() + except OSError: # pragma: no cover - resolve() rarely raises here + rp = path.expanduser() + if rp == Path(rp.anchor): # the filesystem root, e.g. "/" + return True + if rp == Path.home().resolve(): + return True + return str(rp) in _BROAD_ROOTS + + # --- per-target writers --------------------------------------------------------------- @@ -336,13 +380,46 @@ def _ask_tools() -> List[str]: return picked or list(DEFAULT_TOOLS) +def _describe_indexing(watched: Path) -> None: + """Tell the user what indexing this path entails — large trees are supported. + + A broad root (``/home``, ``/``) is a legitimate choice — CodeRAG is meant to handle it. + We set expectations (the first pass takes longer and runs in the background) rather than + discourage it, and flag the one genuine footgun: ``/`` descends into pseudo-filesystems. + """ + print(f"\n → CodeRAG will index: {watched}") + if _is_broad_root(watched): + print( + textwrap.indent( + textwrap.dedent( + """\ + This is a large tree (e.g. /home can be ~125k files). 
CodeRAG indexes it + in the background and streams results as it goes — the first pass just + takes longer. It skips version-control, build, and dependency directories + (node_modules, .venv, __pycache__, …) automatically. For "/" specifically, + exclude pseudo-filesystems like /proc and /sys.""" + ), + " ", + ) + ) + print( + " It is indexed in the background the first time the agent's server starts, so\n" + " search works right away and fills in as it goes. Check it anytime with\n" + " `coderag status` or the index_status tool." + ) + + def run_wizard(detected: List[str], default_watched: Path) -> List[Plan]: """Collect install choices interactively. Returns one :class:`Plan` per chosen target.""" print("CodeRAG install wizard\n----------------------") targets = _ask_targets(detected) watched = Path( - _ask("Workspace directory to index", str(default_watched)) + _ask( + "Workspace directory to index (a repo root, or a larger tree like ~/projects)", + str(default_watched), + ) ).expanduser() + _describe_indexing(watched) plans: List[Plan] = [] for t in targets: # Only Hermes supports per-server tool filtering in its config. diff --git a/coderag/retrieval/graph.py b/coderag/retrieval/graph.py index 296f6b6..7da2333 100644 --- a/coderag/retrieval/graph.py +++ b/coderag/retrieval/graph.py @@ -21,7 +21,7 @@ from typing import TYPE_CHECKING, List, Mapping, Sequence if TYPE_CHECKING: - from coderag.store.sqlite_store import SQLiteStore + from coderag.store.lance_store import LanceStore # A bare identifier used as a call target, e.g. ``do_thing(`` (≥3 chars). The store's symbol # index only holds names the repo defines, so language builtins (len, str, …) never resolve. 
@@ -44,7 +44,7 @@ def called_names(text: str) -> List[str]: def neighbor_ids( - store: "SQLiteStore", + store: "LanceStore", seed_ids: Sequence[int], seed_texts: Mapping[int, str], *, diff --git a/coderag/retrieval/search.py b/coderag/retrieval/search.py index c9ce924..13717ce 100644 --- a/coderag/retrieval/search.py +++ b/coderag/retrieval/search.py @@ -10,12 +10,11 @@ from coderag.retrieval.fusion import reciprocal_rank_fusion from coderag.retrieval.graph import neighbor_ids from coderag.retrieval.query_type import fusion_weights -from coderag.store.sqlite_store import SQLiteStore -from coderag.store.vector_index import FaissVectorIndex from coderag.types import SearchHit if TYPE_CHECKING: from coderag.retrieval.rerank import Reranker + from coderag.store.lance_store import LanceStore logger = logging.getLogger(__name__) @@ -25,14 +24,12 @@ def __init__( self, config: Config, provider: EmbeddingProvider, - store: SQLiteStore, - vectors: FaissVectorIndex, + store: "LanceStore", reranker: Optional["Reranker"] = None, ) -> None: self.config = config self.provider = provider self.store = store - self.vectors = vectors self.reranker = reranker def search(self, query: str, top_k: int) -> List[SearchHit]: @@ -45,17 +42,16 @@ def search(self, query: str, top_k: int) -> List[SearchHit]: pool = max(self.config.rerank_candidates, top_k) fetch_k = max(self.config.fetch_k, pool) - # Dense retrieval. + # Dense retrieval (vector ANN over the store). qvec = self.provider.embed_query(query) - dense_ids, dense_scores = self.vectors.search(qvec, fetch_k) + dense = self.store.vector_search(qvec, fetch_k) similarity: Dict[int, float] = { - int(i): float(max(0.0, min(1.0, s))) - for i, s in zip(dense_ids, dense_scores, strict=False) + cid: float(max(0.0, min(1.0, s))) for cid, s in dense } - dense_ranked = [int(i) for i in dense_ids] + dense_ranked = [cid for cid, _ in dense] - # Lexical retrieval (BM25 over FTS5). 
- lexical_ranked = [cid for cid, _ in self.store.fts_search(query, fetch_k)] + # Lexical retrieval (BM25 over the store). + lexical_ranked = [cid for cid, _ in self.store.lexical_search(query, fetch_k)] # Fuse, then trim to the candidate pool (top_k, or deeper when reranking). # Weights may adapt to the query type (dense-up for NL, BM25-up for identifiers). @@ -94,7 +90,7 @@ def search(self, query: str, top_k: int) -> List[SearchHit]: SearchHit( chunk_id=cid, path=row["path"], - symbol=row["symbol"], + symbol=row["symbol"] or None, kind=row["kind"], language=row["language"], start_line=int(row["start_line"]), diff --git a/coderag/store/__init__.py b/coderag/store/__init__.py index 5dc00ca..a816b55 100644 --- a/coderag/store/__init__.py +++ b/coderag/store/__init__.py @@ -1 +1 @@ -"""Persistent storage: SQLite as the source of truth, FAISS as a rebuildable cache.""" +"""Persistent storage: a single embedded LanceDB store (metadata + BM25 + vectors).""" diff --git a/coderag/store/lance_store.py b/coderag/store/lance_store.py new file mode 100644 index 0000000..dd21d6b --- /dev/null +++ b/coderag/store/lance_store.py @@ -0,0 +1,527 @@ +"""The single embedded store: LanceDB holds chunk metadata, text (BM25), and vectors (ANN). + +This replaces the former SQLite store + separate FAISS index. One LanceDB database at +``store_dir`` with two tables: + +* ``files`` — one row per indexed file (``path``, ``content_hash``, ``mtime``, ``size``, + ``language``): drives incremental change detection. +* ``chunks`` — one row per chunk (``id``, ``path``, ``symbol``, ``kind``, ``language``, + ``start_line``, ``end_line``, ``text``, ``vector``): both BM25 (over ``text``) and vector + ANN (over ``vector``) live here, so there is no FAISS↔SQLite coordination to maintain. + +``chunks.path`` is denormalized so a file's chunks are deleted with a single ``delete( +"path = …")`` (LanceDB has no foreign keys). The integer ``chunks.id`` is the fusion/hydrate +key (it replaces the FAISS id). 
Writes are buffered and flushed in batches — LanceDB is +columnar and many tiny appends create severe fragment/version bloat. Reads query committed +data only; the writer owns the buffer (guarded by a lock), so a background index stays safe +alongside live queries (partial results until ``optimize`` runs). +""" + +from __future__ import annotations + +import json +import logging +import math +import re +import threading +from pathlib import Path +from typing import Any, Dict, List, Optional, Sequence, Tuple + +import numpy as np + +from coderag.retrieval.fusion import reciprocal_rank_fusion +from coderag.types import Chunk, IndexStats, SearchHit + +logger = logging.getLogger(__name__) + +_CHUNKS = "chunks" +_FILES = "files" +_META_FILE = "meta.json" +_SCHEMA_VERSION = 1 +_FLUSH_ROWS = 8192 +# LanceDB needs enough rows to train a vector ANN index; below this, brute-force is exact +# and fast, so we skip indexing (also keeps tiny test corpora on the exact path). +_ANN_MIN_ROWS = 256 +_HYDRATE_COLS = [ + "id", + "path", + "symbol", + "kind", + "language", + "start_line", + "end_line", + "text", +] +_FTS_TOKEN = re.compile(r"[A-Za-z0-9_]+") + + +def _fts_query(query: str) -> str: + """Reduce an arbitrary query to space-separated tokens (defuses FTS operators).""" + return " ".join(_FTS_TOKEN.findall(query)) + + +class LanceStore: + """LanceDB-backed chunk + file store with vector ANN and BM25 search.""" + + def __init__(self, store_dir: Path, dim: int) -> None: + import lancedb + + self.dim = dim + self._dir = Path(store_dir) + self._dir.mkdir(parents=True, exist_ok=True) + self._db = lancedb.connect(str(self._dir)) + self._lock = threading.RLock() + self._chunks_buf: List[Dict[str, Any]] = [] + self._files_buf: List[Dict[str, Any]] = [] + self._next_id = 0 + self._ann_built = False + # Symbol-index cache (callee graph expansion); invalidated by a write generation. 
+ self._gen = 0 + self._symbol_index: Optional[Dict[str, List[int]]] = None + self._symbol_index_gen = -1 + if _CHUNKS in self._db.table_names(): + self._next_id = self._max_id() + 1 + + # --- schema --- + + def _chunks_schema(self) -> Any: + import pyarrow as pa + + return pa.schema( + [ + ("id", pa.int64()), + ("path", pa.string()), + ("symbol", pa.string()), + ("kind", pa.string()), + ("language", pa.string()), + ("start_line", pa.int32()), + ("end_line", pa.int32()), + ("text", pa.string()), + ("vector", pa.list_(pa.float32(), self.dim)), + ] + ) + + def _files_schema(self) -> Any: + import pyarrow as pa + + return pa.schema( + [ + ("path", pa.string()), + ("content_hash", pa.string()), + ("mtime", pa.float64()), + ("size", pa.int64()), + ("language", pa.string()), + ("indexed_at", pa.float64()), + ] + ) + + def _chunks_tbl(self) -> Any: + if _CHUNKS not in self._db.table_names(): + return self._db.create_table(_CHUNKS, schema=self._chunks_schema()) + return self._db.open_table(_CHUNKS) + + def _files_tbl(self) -> Any: + if _FILES not in self._db.table_names(): + return self._db.create_table(_FILES, schema=self._files_schema()) + return self._db.open_table(_FILES) + + def _max_id(self) -> int: + tbl = self._db.open_table(_CHUNKS) + n = tbl.count_rows() + if n == 0: + return -1 + rows = tbl.search().select(["id"]).limit(n).to_list() + return max((int(r["id"]) for r in rows), default=-1) + + # --- provenance / lifecycle --- + + def bootstrap(self, embed_dim: int, embed_model: str) -> bool: + """Record the embedding model/dim; clear the store if they changed. + + Returns True when a rebuild is required (model/dim changed) — the caller just + re-indexes into the now-empty tables (there is no separate index to rebuild). 
+ """ + meta_path = self._dir / _META_FILE + prev: Dict[str, Any] = {} + if meta_path.exists(): + try: + prev = json.loads(meta_path.read_text(encoding="utf-8")) + except (OSError, json.JSONDecodeError): # pragma: no cover - corrupt meta + prev = {} + changed = bool(prev) and ( + int(prev.get("embed_dim", -1)) != embed_dim + or prev.get("embed_model") != embed_model + ) + if changed: + logger.warning( + "Embedding model changed (%s/%s -> %s/%s); clearing index.", + prev.get("embed_model"), + prev.get("embed_dim"), + embed_model, + embed_dim, + ) + with self._lock: + for name in (_CHUNKS, _FILES): + if name in self._db.table_names(): + self._db.drop_table(name) + self._chunks_buf.clear() + self._files_buf.clear() + self._next_id = 0 + self._ann_built = False + self._gen += 1 + meta_path.write_text( + json.dumps( + { + "embed_model": embed_model, + "embed_dim": embed_dim, + "schema_version": _SCHEMA_VERSION, + } + ), + encoding="utf-8", + ) + return changed + + def close(self) -> None: + with self._lock: + self._chunks_buf.clear() + self._files_buf.clear() + + def clear(self) -> None: + """Drop all data (used by a full rebuild). Keeps the recorded provenance meta.""" + with self._lock: + for name in (_CHUNKS, _FILES): + if name in self._db.table_names(): + self._db.drop_table(name) + self._chunks_buf.clear() + self._files_buf.clear() + self._next_id = 0 + self._ann_built = False + self._gen += 1 + + # --- buffered writes --- + + def _flush(self) -> None: + if self._chunks_buf: + self._chunks_tbl().add(self._chunks_buf) + self._chunks_buf = [] + if self._files_buf: + self._files_tbl().add(self._files_buf) + self._files_buf = [] + + def flush(self) -> None: + with self._lock: + self._flush() + + def _delete_path_rows(self, rel: str) -> int: + """Delete a file's chunk + file rows from the committed tables. 
Returns chunks gone.""" + names = self._db.table_names() + removed = 0 + pred = f"path = '{rel.replace(chr(39), chr(39) * 2)}'" + if _CHUNKS in names: + ctbl = self._db.open_table(_CHUNKS) + removed = len( + ctbl.search().where(pred).select(["id"]).limit(10**9).to_list() + ) + if removed: + ctbl.delete(pred) + if _FILES in names: + self._db.open_table(_FILES).delete(pred) + return removed + + def write_file( + self, + rel: str, + language: str, + content_hash: str, + mtime: float, + size: int, + chunks: Sequence[Chunk], + vectors: Optional[np.ndarray], + *, + replace: bool, + ) -> Tuple[int, int]: + """Index one file: (replace its old rows, if any) then buffer its new rows. + + Returns ``(chunks_added, chunks_removed)``. New files take the fully-batched fast + path (no flush); replacing a changed file flushes + deletes its old rows first, so + the delete-before-add invariant holds on the single writer thread. + """ + import time + + with self._lock: + removed = 0 + if replace: + self._flush() + removed = self._delete_path_rows(rel) + added = 0 + if chunks and vectors is not None: + mat = np.ascontiguousarray(vectors, dtype="float32") + norms = np.linalg.norm(mat, axis=1, keepdims=True) + mat = mat / np.where(norms == 0.0, 1.0, norms) + for chunk, vec in zip(chunks, mat, strict=False): + self._chunks_buf.append( + { + "id": self._next_id, + "path": rel, + "symbol": chunk.symbol or "", + "kind": chunk.kind, + "language": chunk.language, + "start_line": int(chunk.start_line), + "end_line": int(chunk.end_line), + "text": chunk.text, + "vector": vec.tolist(), + } + ) + self._next_id += 1 + added += 1 + self._files_buf.append( + { + "path": rel, + "content_hash": content_hash, + "mtime": float(mtime), + "size": int(size), + "language": language, + "indexed_at": time.time(), + } + ) + self._gen += 1 + if len(self._chunks_buf) >= _FLUSH_ROWS: + self._flush() + return added, removed + + def delete_file(self, rel: str) -> int: + with self._lock: + self._flush() + removed = 
self._delete_path_rows(rel) + if removed: + self._gen += 1 + return removed + + def optimize(self) -> None: + """Flush, compact, (re)build the BM25 index, and build the vector ANN index at scale.""" + with self._lock: + self._flush() + if _CHUNKS not in self._db.table_names(): + return + tbl = self._db.open_table(_CHUNKS) + try: + tbl.optimize() + tbl.cleanup_old_versions() + except Exception: # pragma: no cover - compaction is best-effort + logger.exception("LanceDB optimize failed (continuing).") + try: + tbl.create_fts_index("text", replace=True) + except Exception: # pragma: no cover + logger.exception("LanceDB FTS index build failed (continuing).") + n = tbl.count_rows() + if n >= _ANN_MIN_ROWS: + try: + nlist = max(1, min(int(4 * math.sqrt(n)), n // 39)) + tbl.create_index( + metric="cosine", + vector_column_name="vector", + num_partitions=nlist, + replace=True, + ) + self._ann_built = True + except Exception: # pragma: no cover - falls back to brute-force search + logger.exception("LanceDB vector index build failed (brute-force).") + + @property + def index_kind(self) -> str: + return "lancedb-ann" if self._ann_built else "lancedb" + + # --- file metadata / change detection --- + + def get_file_meta(self, rel: str) -> Optional[Dict[str, Any]]: + self.flush() + if _FILES not in self._db.table_names(): + return None + pred = f"path = '{rel.replace(chr(39), chr(39) * 2)}'" + rows = self._db.open_table(_FILES).search().where(pred).limit(1).to_list() + return rows[0] if rows else None + + def all_file_metas(self) -> Dict[str, Dict[str, Any]]: + """Every file's change-detection metadata, in one scan (indexer preload).""" + self.flush() + if _FILES not in self._db.table_names(): + return {} + tbl = self._db.open_table(_FILES) + rows = ( + tbl.search() + .select(["path", "content_hash", "mtime", "size"]) + .limit(max(1, tbl.count_rows())) + .to_list() + ) + return {r["path"]: r for r in rows} + + def all_file_paths(self) -> List[str]: + return 
list(self.all_file_metas().keys()) + + # --- retrieval --- + + def vector_search(self, qvec: np.ndarray, k: int) -> List[Tuple[int, float]]: + if _CHUNKS not in self._db.table_names(): + return [] + q = np.asarray(qvec, dtype="float32").reshape(-1) + norm = np.linalg.norm(q) + if norm: + q = q / norm + tbl = self._db.open_table(_CHUNKS) + if tbl.count_rows() == 0: + return [] + rows = tbl.search(q.tolist()).metric("cosine").select(["id"]).limit(k).to_list() + return [(int(r["id"]), 1.0 - float(r["_distance"])) for r in rows] + + def lexical_search(self, query: str, k: int) -> List[Tuple[int, float]]: + if _CHUNKS not in self._db.table_names(): + return [] + match = _fts_query(query) + if not match: + return [] + try: + rows = ( + self._db.open_table(_CHUNKS) + .search(match, query_type="fts") + .select(["id"]) + .limit(k) + .to_list() + ) + except Exception: # pragma: no cover - FTS index not built yet / query rejected + return [] + return [(int(r["id"]), float(r["_score"])) for r in rows] + + def chunk_ids_for_path(self, rel: str) -> List[int]: + """The chunk ids belonging to one file (for inspection/tests).""" + self.flush() + if _CHUNKS not in self._db.table_names(): + return [] + pred = f"path = '{rel.replace(chr(39), chr(39) * 2)}'" + tbl = self._db.open_table(_CHUNKS) + rows = ( + tbl.search() + .where(pred) + .select(["id"]) + .limit(max(1, tbl.count_rows())) + .to_list() + ) + return [int(r["id"]) for r in rows] + + def hydrate(self, ids: Sequence[int]) -> Dict[int, Dict[str, Any]]: + if not ids or _CHUNKS not in self._db.table_names(): + return {} + csv = ",".join(str(int(i)) for i in ids) + rows = ( + self._db.open_table(_CHUNKS) + .search() + .where(f"id IN ({csv})") + .select(_HYDRATE_COLS) + .limit(len(ids)) + .to_list() + ) + return {int(r["id"]): r for r in rows} + + def symbol_index(self) -> Dict[str, List[int]]: + """Map each symbol's bare name -> chunk ids defining it (cached, gen-invalidated).""" + with self._lock: + if self._symbol_index is not 
None and self._symbol_index_gen == self._gen: + return self._symbol_index + gen = self._gen + self._flush() + index: Dict[str, List[int]] = {} + if _CHUNKS in self._db.table_names(): + tbl = self._db.open_table(_CHUNKS) + rows = ( + tbl.search() + .where("symbol != ''") + .select(["id", "symbol"]) + .limit(max(1, tbl.count_rows())) + .to_list() + ) + for r in rows: + bare = str(r["symbol"]).rsplit(".", 1)[-1].strip() + if len(bare) >= 3: + index.setdefault(bare, []).append(int(r["id"])) + with self._lock: + self._symbol_index = index + self._symbol_index_gen = gen + return index + + # --- stats / UI --- + + def total_chunks(self) -> int: + self.flush() + if _CHUNKS not in self._db.table_names(): + return 0 + return int(self._db.open_table(_CHUNKS).count_rows()) + + def stats(self) -> IndexStats: + self.flush() + names = self._db.table_names() + files = int(self._db.open_table(_FILES).count_rows()) if _FILES in names else 0 + chunks = ( + int(self._db.open_table(_CHUNKS).count_rows()) if _CHUNKS in names else 0 + ) + return IndexStats(total_files=files, total_chunks=chunks) + + def _distinct(self, column: str) -> List[str]: + self.flush() + if _CHUNKS not in self._db.table_names(): + return [] + tbl = self._db.open_table(_CHUNKS) + rows = tbl.search().select([column]).limit(max(1, tbl.count_rows())).to_list() + return sorted({r[column] for r in rows if r.get(column)}) + + def distinct_languages(self) -> List[str]: + return self._distinct("language") + + def distinct_kinds(self) -> List[str]: + return self._distinct("kind") + + # --- convenience hybrid search (used by the bake-off scripts; engine uses HybridSearcher) --- + + def search( + self, + query: str, + provider: Any, + top_k: int = 8, + *, + fetch_k: int = 50, + dense_weight: float = 1.0, + lexical_weight: float = 1.0, + rrf_k: int = 60, + ) -> List[SearchHit]: + if not query.strip(): + return [] + fetch = max(fetch_k, top_k) + dense = self.vector_search(provider.embed_query(query), fetch) + lexical = 
self.lexical_search(query, fetch) + similarity = {cid: max(0.0, min(1.0, s)) for cid, s in dense} + fused = reciprocal_rank_fusion( + [[cid for cid, _ in dense], [cid for cid, _ in lexical]], + k=rrf_k, + weights=[dense_weight, lexical_weight], + )[:top_k] + if not fused: + return [] + rows = self.hydrate([cid for cid, _ in fused]) + hits: List[SearchHit] = [] + for cid, score in fused: + r = rows.get(cid) + if r is None: + continue + hits.append( + SearchHit( + chunk_id=cid, + path=r["path"], + symbol=r["symbol"] or None, + kind=r["kind"], + language=r["language"], + start_line=int(r["start_line"]), + end_line=int(r["end_line"]), + text=r["text"], + score=float(score), + similarity=float(similarity.get(cid, 0.0)), + ) + ) + return hits diff --git a/coderag/store/schema.py b/coderag/store/schema.py deleted file mode 100644 index 6581c28..0000000 --- a/coderag/store/schema.py +++ /dev/null @@ -1,69 +0,0 @@ -"""SQLite schema for the CodeRAG store. - -Design notes: -- ``chunks.id`` IS the FAISS id. It is ``AUTOINCREMENT`` so ids are *never reused*, which - is what keeps a stale FAISS cache from resurrecting deleted content under a recycled id. -- ``chunks_fts`` is an external-content FTS5 table (no duplicated text) kept in sync by - triggers, giving us BM25 lexical search for free alongside dense vectors. -- ``files.content_hash`` drives incremental indexing; ``meta`` records the embedding - model/dim so a provider switch can trigger a rebuild instead of crashing. 
-""" - -from __future__ import annotations - -SCHEMA_VERSION = 1 - -DDL = """ -CREATE TABLE IF NOT EXISTS files ( - id INTEGER PRIMARY KEY AUTOINCREMENT, - path TEXT NOT NULL UNIQUE, - language TEXT NOT NULL, - content_hash TEXT NOT NULL, - mtime REAL, - indexed_at REAL NOT NULL -); -CREATE INDEX IF NOT EXISTS idx_files_path ON files(path); - -CREATE TABLE IF NOT EXISTS chunks ( - id INTEGER PRIMARY KEY AUTOINCREMENT, - file_id INTEGER NOT NULL REFERENCES files(id) ON DELETE CASCADE, - symbol TEXT, - kind TEXT NOT NULL DEFAULT 'window', - start_line INTEGER NOT NULL, - end_line INTEGER NOT NULL, - language TEXT NOT NULL, - text TEXT NOT NULL, - vector BLOB NOT NULL, - embed_model TEXT NOT NULL, - created_at REAL NOT NULL -); -CREATE INDEX IF NOT EXISTS idx_chunks_file ON chunks(file_id); - -CREATE VIRTUAL TABLE IF NOT EXISTS chunks_fts USING fts5( - text, - symbol, - content='chunks', - content_rowid='id', - tokenize='unicode61 remove_diacritics 2' -); - -CREATE TRIGGER IF NOT EXISTS chunks_ai AFTER INSERT ON chunks BEGIN - INSERT INTO chunks_fts(rowid, text, symbol) VALUES (new.id, new.text, new.symbol); -END; - -CREATE TRIGGER IF NOT EXISTS chunks_ad AFTER DELETE ON chunks BEGIN - INSERT INTO chunks_fts(chunks_fts, rowid, text, symbol) - VALUES('delete', old.id, old.text, old.symbol); -END; - -CREATE TRIGGER IF NOT EXISTS chunks_au AFTER UPDATE ON chunks BEGIN - INSERT INTO chunks_fts(chunks_fts, rowid, text, symbol) - VALUES('delete', old.id, old.text, old.symbol); - INSERT INTO chunks_fts(rowid, text, symbol) VALUES (new.id, new.text, new.symbol); -END; - -CREATE TABLE IF NOT EXISTS meta ( - key TEXT PRIMARY KEY, - value TEXT -); -""" diff --git a/coderag/store/sqlite_store.py b/coderag/store/sqlite_store.py deleted file mode 100644 index 1bd603f..0000000 --- a/coderag/store/sqlite_store.py +++ /dev/null @@ -1,321 +0,0 @@ -"""SQLite-backed source of truth for files, chunks, vectors, and lexical search.""" - -from __future__ import annotations - -import logging 
-import re -import sqlite3 -import threading -import time -from pathlib import Path -from typing import Dict, Iterator, List, Optional, Sequence, Tuple - -import numpy as np - -from coderag.store.schema import DDL, SCHEMA_VERSION -from coderag.types import Chunk, IndexStats - -logger = logging.getLogger(__name__) - -# Strip FTS5 operators so a raw code query (e.g. ``foo::bar*``) can't raise a syntax error. -_FTS_TOKEN = re.compile(r"[A-Za-z0-9_]+") - - -def _sanitize_fts(query: str) -> str: - """Turn an arbitrary query into a safe FTS5 MATCH expression (token OR token).""" - tokens = _FTS_TOKEN.findall(query) - if not tokens: - return "" - # Quote each token (defuses operators) and OR them for recall on identifiers. - return " OR ".join(f'"{t}"' for t in tokens) - - -class SQLiteStore: - """Thread-safe store over a single shared connection. - - Point reads and writes serialize on one reentrant lock: a single ``sqlite3`` - connection is not safe for concurrent cross-thread use even under WAL, and the - watcher reindexes on a background thread while surfaces may read. WAL is still - enabled so separate *processes* don't block each other. (``iter_vectors`` is the - one unlocked bulk reader; it is only used during single-threaded rebuilds.) - """ - - def __init__(self, db_path: Path) -> None: - self.db_path = Path(db_path) - self.db_path.parent.mkdir(parents=True, exist_ok=True) - self._lock = threading.RLock() - # Bumped on every chunk write so caches derived from the chunk table (the symbol - # index used by callee expansion) can invalidate without a table scan per query. 
- self._gen = 0 - self._symbol_index: Optional[Dict[str, List[int]]] = None - self._symbol_index_gen = -1 - self._conn = sqlite3.connect( - str(self.db_path), check_same_thread=False, isolation_level=None - ) - self._conn.row_factory = sqlite3.Row - self._conn.execute("PRAGMA journal_mode=WAL") - self._conn.execute("PRAGMA foreign_keys=ON") - self._conn.execute("PRAGMA synchronous=NORMAL") - - # --- lifecycle --- - - def bootstrap(self, embed_dim: int, embed_model: str) -> bool: - """Create schema and reconcile provenance. - - Returns True if a full rebuild is required because the embedding model/dimension - changed since the store was last written (in which case existing chunks/files are - cleared so a reindex repopulates cleanly). - """ - with self._lock: - self._conn.executescript(DDL) - self._set_meta("schema_version", str(SCHEMA_VERSION)) - prev_dim = self._get_meta("embed_dim") - prev_model = self._get_meta("embed_model") - rebuild = False - if prev_dim is not None and ( - int(prev_dim) != embed_dim or prev_model != embed_model - ): - logger.warning( - "Embedding model changed (%s/%s -> %s/%s); clearing index for " - "rebuild.", - prev_model, - prev_dim, - embed_model, - embed_dim, - ) - self._conn.execute("DELETE FROM chunks") - self._conn.execute("DELETE FROM files") - self._gen += 1 - rebuild = True - self._set_meta("embed_dim", str(embed_dim)) - self._set_meta("embed_model", embed_model) - return rebuild - - def close(self) -> None: - with self._lock: - self._conn.close() - - # --- meta --- - - def _get_meta(self, key: str) -> Optional[str]: - with self._lock: - row = self._conn.execute( - "SELECT value FROM meta WHERE key = ?", (key,) - ).fetchone() - return row["value"] if row else None - - def _set_meta(self, key: str, value: str) -> None: - self._conn.execute( - "INSERT INTO meta(key, value) VALUES(?, ?) 
" - "ON CONFLICT(key) DO UPDATE SET value = excluded.value", - (key, value), - ) - - # --- file records --- - - def get_file(self, path: str) -> Optional[sqlite3.Row]: - with self._lock: - return self._conn.execute( - "SELECT * FROM files WHERE path = ?", (path,) - ).fetchone() - - def all_file_paths(self) -> List[str]: - with self._lock: - rows = self._conn.execute("SELECT path FROM files").fetchall() - return [r["path"] for r in rows] - - def distinct_languages(self) -> List[str]: - """Languages present in the index, sorted — used to populate UI filters.""" - with self._lock: - rows = self._conn.execute( - "SELECT DISTINCT language FROM chunks ORDER BY language" - ).fetchall() - return [r["language"] for r in rows] - - def distinct_kinds(self) -> List[str]: - """Chunk kinds present in the index, sorted — used to populate UI filters.""" - with self._lock: - rows = self._conn.execute( - "SELECT DISTINCT kind FROM chunks ORDER BY kind" - ).fetchall() - return [r["kind"] for r in rows] - - def upsert_file( - self, path: str, language: str, content_hash: str, mtime: float - ) -> int: - with self._lock: - now = time.time() - self._conn.execute( - "INSERT INTO files(path, language, content_hash, mtime, indexed_at) " - "VALUES(?, ?, ?, ?, ?) " - "ON CONFLICT(path) DO UPDATE SET " - " language=excluded.language, content_hash=excluded.content_hash, " - " mtime=excluded.mtime, indexed_at=excluded.indexed_at", - (path, language, content_hash, mtime, now), - ) - row = self._conn.execute( - "SELECT id FROM files WHERE path = ?", (path,) - ).fetchone() - return int(row["id"]) - - # --- chunk records --- - - def chunk_ids_for_file(self, file_id: int) -> List[int]: - with self._lock: - rows = self._conn.execute( - "SELECT id FROM chunks WHERE file_id = ?", (file_id,) - ).fetchall() - return [int(r["id"]) for r in rows] - - def delete_file(self, path: str) -> List[int]: - """Delete a file and its chunks. 
Returns the removed chunk ids (FAISS ids).""" - with self._lock: - row = self._conn.execute( - "SELECT id FROM files WHERE path = ?", (path,) - ).fetchone() - if row is None: - return [] - file_id = int(row["id"]) - ids = self.chunk_ids_for_file(file_id) - self._conn.execute("DELETE FROM chunks WHERE file_id = ?", (file_id,)) - self._conn.execute("DELETE FROM files WHERE id = ?", (file_id,)) - if ids: - self._gen += 1 - return ids - - def delete_chunks_for_file(self, file_id: int) -> List[int]: - with self._lock: - ids = self.chunk_ids_for_file(file_id) - self._conn.execute("DELETE FROM chunks WHERE file_id = ?", (file_id,)) - if ids: - self._gen += 1 - return ids - - def add_chunks( - self, - file_id: int, - chunks: Sequence[Chunk], - vectors: np.ndarray, - embed_model: str, - ) -> List[int]: - """Insert chunks with their vectors. Returns the assigned chunk ids in order.""" - if len(chunks) != len(vectors): - raise ValueError("chunks and vectors length mismatch") - ids: List[int] = [] - now = time.time() - with self._lock: - for chunk, vec in zip(chunks, vectors, strict=False): - blob = np.asarray(vec, dtype="float32").tobytes() - cur = self._conn.execute( - "INSERT INTO chunks(file_id, symbol, kind, start_line, end_line, " - "language, text, vector, embed_model, created_at) " - "VALUES(?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", - ( - file_id, - chunk.symbol, - chunk.kind, - chunk.start_line, - chunk.end_line, - chunk.language, - chunk.text, - blob, - embed_model, - now, - ), - ) - ids.append(int(cur.lastrowid or 0)) - if ids: - self._gen += 1 - return ids - - # --- retrieval support --- - - def fts_search(self, query: str, limit: int) -> List[Tuple[int, float]]: - """Lexical search via FTS5 BM25. Returns ``(chunk_id, bm25)`` best-first.""" - match = _sanitize_fts(query) - if not match: - return [] - try: - with self._lock: - rows = self._conn.execute( - "SELECT rowid, bm25(chunks_fts) AS score FROM chunks_fts " - "WHERE chunks_fts MATCH ? 
ORDER BY score LIMIT ?", - (match, limit), - ).fetchall() - except sqlite3.OperationalError as exc: # pragma: no cover - defensive - logger.warning("FTS query failed (%s); degrading to dense-only.", exc) - return [] - return [(int(r["rowid"]), float(r["score"])) for r in rows] - - def symbol_index(self) -> Dict[str, List[int]]: - """Map each symbol's bare name (last dotted component) -> chunk ids defining it. - - Used by callee expansion (``retrieval.graph``) to resolve a called identifier to - its definition. Cached and invalidated on any chunk write, so it costs one table - scan after a change and is O(1) per query otherwise. Names shorter than three - characters are dropped (too common/ambiguous to be useful graph edges). - """ - with self._lock: - if self._symbol_index is not None and self._symbol_index_gen == self._gen: - return self._symbol_index - rows = self._conn.execute( - "SELECT id, symbol FROM chunks WHERE symbol IS NOT NULL" - ).fetchall() - gen = self._gen - index: Dict[str, List[int]] = {} - for r in rows: - bare = r["symbol"].rsplit(".", 1)[-1].strip() - if len(bare) >= 3: - index.setdefault(bare, []).append(int(r["id"])) - with self._lock: - self._symbol_index = index - self._symbol_index_gen = gen - return index - - def hydrate(self, chunk_ids: Sequence[int]) -> Dict[int, sqlite3.Row]: - """Fetch chunk + file rows for the given ids in one query.""" - if not chunk_ids: - return {} - placeholders = ",".join("?" for _ in chunk_ids) - with self._lock: - rows = self._conn.execute( - "SELECT c.id, c.symbol, c.kind, c.start_line, c.end_line, c.language, " # nosec B608 — IN-list is positional "?" 
placeholders; ids bound as params - " c.text, f.path AS path " - "FROM chunks c JOIN files f ON f.id = c.file_id " - f"WHERE c.id IN ({placeholders})", - tuple(chunk_ids), - ).fetchall() - return {int(r["id"]): r for r in rows} - - def iter_vectors( - self, batch: int = 1000 - ) -> Iterator[Tuple[np.ndarray, np.ndarray]]: - """Yield ``(ids, vectors)`` batches for rebuilding the FAISS index.""" - cur = self._conn.execute("SELECT id, vector FROM chunks ORDER BY id") - while True: - rows = cur.fetchmany(batch) - if not rows: - break - ids = np.array([int(r["id"]) for r in rows], dtype="int64") - vecs = np.vstack( - [np.frombuffer(r["vector"], dtype="float32") for r in rows] - ) - yield ids, vecs - - # --- stats --- - - def stats(self) -> IndexStats: - with self._lock: - files = self._conn.execute("SELECT COUNT(*) AS n FROM files").fetchone()[ - "n" - ] - chunks = self._conn.execute("SELECT COUNT(*) AS n FROM chunks").fetchone()[ - "n" - ] - return IndexStats(total_files=int(files), total_chunks=int(chunks)) - - def total_chunks(self) -> int: - with self._lock: - return int( - self._conn.execute("SELECT COUNT(*) AS n FROM chunks").fetchone()["n"] - ) diff --git a/coderag/store/vector_index.py b/coderag/store/vector_index.py deleted file mode 100644 index 5ae86a3..0000000 --- a/coderag/store/vector_index.py +++ /dev/null @@ -1,220 +0,0 @@ -"""FAISS vector index — a rebuildable cache over the vectors stored in SQLite. - -Two backends behind one interface, selected by corpus size: -- **flat** (``IndexIDMap2(IndexFlatIP)``): exact cosine, ideal for small/medium repos. -- **ivf** (``IndexIVFFlat``): approximate, stays fast at 100k+ vectors. - -Both support ``add_with_ids`` and ``remove_ids``, so incremental indexing (delete a file's -old chunks, add the new ones) works identically regardless of backend. Because every vector -also lives in SQLite, the on-disk ``.faiss`` file is disposable and can be rebuilt at any -time (``rebuild_from_store``). 
-""" - -from __future__ import annotations - -import logging -import math -import threading -from pathlib import Path -from typing import TYPE_CHECKING, Iterable, Tuple - -import faiss -import numpy as np - -from coderag.config import Config - -if TYPE_CHECKING: - from coderag.store.sqlite_store import SQLiteStore - -logger = logging.getLogger(__name__) - - -def _normalized(vectors: np.ndarray) -> np.ndarray: - """Return an L2-normalized float32 copy (cosine similarity via inner product).""" - mat = np.ascontiguousarray(vectors, dtype="float32") - if mat.size: - mat = mat.copy() - faiss.normalize_L2(mat) - return mat - - -def _derive_nlist(n: int, configured: int) -> int: - if configured > 0: - return max(1, min(configured, n)) - return max(1, min(int(4 * math.sqrt(n)), max(1, n // 39))) - - -class FaissVectorIndex: - def __init__(self, index: faiss.Index, kind: str, config: Config, dim: int) -> None: - self._index = index - self.kind = kind - self.config = config - self.dim = dim - # A FAISS index is not safe for a write (add/remove/rebuild) concurrent with a - # read (search). The MCP server is the first surface to run the watcher (which - # writes) alongside live agent queries (which read), so serialize index access on - # a reentrant lock. Reads are fast, so contention is negligible. - self._lock = threading.RLock() - - # --- construction / persistence --- - - @classmethod - def _empty_flat(cls, dim: int) -> faiss.Index: - return faiss.IndexIDMap2(faiss.IndexFlatIP(dim)) - - @classmethod - def open(cls, config: Config, dim: int) -> "FaissVectorIndex": - path = config.faiss_path - meta_path = Path(str(path) + ".kind") - if path.exists() and meta_path.exists(): - try: - index = faiss.read_index(str(path)) - kind = meta_path.read_text().strip() or "flat" - if kind == "ivf": - # read_index returns a base Index; reach the IVF sub-index to set - # the search-time nprobe (the attribute lives on IndexIVF). 
- faiss.extract_index_ivf(index).nprobe = config.ivf_nprobe - return cls(index, kind, config, dim) - except Exception as exc: # pragma: no cover - corrupt cache - logger.warning("Failed to load FAISS index (%s); starting empty.", exc) - return cls(cls._empty_flat(dim), "flat", config, dim) - - def save(self) -> None: - path = self.config.faiss_path - path.parent.mkdir(parents=True, exist_ok=True) - with self._lock: - faiss.write_index(self._index, str(path)) - Path(str(path) + ".kind").write_text(self.kind) - - # --- properties --- - - @property - def ntotal(self) -> int: - with self._lock: - return int(self._index.ntotal) - - # --- mutations --- - - def add(self, ids: np.ndarray, vectors: np.ndarray) -> None: - if len(ids) == 0: - return - vecs = _normalized(vectors) - id_arr = np.ascontiguousarray(ids, dtype="int64") - with self._lock: - self._index.add_with_ids(vecs, id_arr) - - def remove(self, ids: Iterable[int]) -> int: - ids = list(ids) - if not ids: - return 0 - selector = faiss.IDSelectorBatch(np.asarray(ids, dtype="int64")) - with self._lock: - return int(self._index.remove_ids(selector)) - - def search(self, query: np.ndarray, k: int) -> Tuple[np.ndarray, np.ndarray]: - """Return ``(ids, scores)`` for the top-k, with FAISS ``-1`` padding stripped.""" - with self._lock: - if self.ntotal == 0: - return np.empty(0, dtype="int64"), np.empty(0, dtype="float32") - q = _normalized(np.asarray(query, dtype="float32").reshape(1, -1)) - k = min(k, self.ntotal) - scores, ids = self._index.search(q, k) - ids_row, scores_row = ids[0], scores[0] - mask = ids_row != -1 - return ids_row[mask].astype("int64"), scores_row[mask].astype("float32") - - # --- rebuild / consistency --- - - def _choose_kind(self, n: int) -> str: - if self.config.index_type == "flat": - return "flat" - if self.config.index_type == "ivf": - return "ivf" if n > 0 else "flat" - # auto - return "ivf" if n > self.config.ivf_threshold else "flat" - - def _build_ivf(self, ids: np.ndarray, vecs: 
np.ndarray) -> faiss.Index: - nlist = _derive_nlist(len(ids), self.config.ivf_nlist) - quantizer = faiss.IndexFlatIP(self.dim) - index = faiss.IndexIVFFlat( - quantizer, self.dim, nlist, faiss.METRIC_INNER_PRODUCT - ) - index.train(vecs) - index.add_with_ids(vecs, ids) - # nprobe must not exceed nlist (FAISS clamps it, but keep it meaningful). - index.nprobe = max(1, min(self.config.ivf_nprobe, nlist)) - logger.info("Built IVF index: %d vectors, nlist=%d", len(ids), nlist) - return index - - def rebuild_from_store(self, store: "SQLiteStore") -> None: - """Discard the current index and rebuild it from the SQLite vectors. - - Holds the index lock for the whole swap so a concurrent search never observes a - half-built index. This is rare (model change, or the one-time flat->ivf upgrade), - so briefly stalling reads is an acceptable price for correctness. - """ - with self._lock: - n = store.total_chunks() - kind = self._choose_kind(n) - if n == 0: - self._index = self._empty_flat(self.dim) - self.kind = "flat" - self.save() - return - - if kind == "ivf": - # IVF needs all training vectors up front. - all_ids, all_vecs = [], [] - for ids, vecs in store.iter_vectors(): - all_ids.append(ids) - all_vecs.append(_normalized(vecs)) - ids = np.concatenate(all_ids) - vecs = np.vstack(all_vecs) - try: - self._index = self._build_ivf(ids, vecs) - self.kind = "ivf" - except Exception as exc: - # Degenerate corpora (too few or many duplicate vectors) can make IVF - # training fail; fall back to exact flat rather than aborting indexing. 
- logger.warning( - "IVF training failed (%s); falling back to flat index.", exc - ) - index = self._empty_flat(self.dim) - index.add_with_ids(vecs, np.ascontiguousarray(ids)) - self._index = index - self.kind = "flat" - else: - index = self._empty_flat(self.dim) - for ids, vecs in store.iter_vectors(): - index.add_with_ids(_normalized(vecs), np.ascontiguousarray(ids)) - self._index = index - self.kind = "flat" - logger.info("Built flat index: %d vectors", n) - self.save() - - def ensure_consistent(self, store: "SQLiteStore") -> None: - """Rebuild from SQLite if the cached index disagrees with the store. - - Triggers on a vector-count mismatch *or* a dimension mismatch — the latter - guards against loading an index built with a different embedding model. - """ - if self._index.d != self.dim or self.ntotal != store.total_chunks(): - logger.info( - "FAISS cache out of sync (dim %d vs %d, %d vs %d chunks); rebuilding.", - self._index.d, - self.dim, - self.ntotal, - store.total_chunks(), - ) - self.rebuild_from_store(store) - - def maybe_upgrade(self, store: "SQLiteStore") -> bool: - """Switch flat->ivf when an auto index grows past the threshold. 
Returns True - if a rebuild happened.""" - if self.config.index_type != "auto" or self.kind == "ivf": - return False - if store.total_chunks() > self.config.ivf_threshold: - logger.info("Corpus exceeded IVF threshold; upgrading flat -> ivf.") - self.rebuild_from_store(store) - return True - return False diff --git a/coderag/surfaces/cli.py b/coderag/surfaces/cli.py index fc565ff..75116bf 100644 --- a/coderag/surfaces/cli.py +++ b/coderag/surfaces/cli.py @@ -11,6 +11,7 @@ import os import sys import textwrap +import time from pathlib import Path from typing import List, Optional @@ -29,6 +30,8 @@ def _build_config(args: argparse.Namespace) -> Config: overrides["provider"] = args.provider if getattr(args, "model", None): overrides["model"] = args.model + if getattr(args, "use_gitignore", None) is not None: + overrides["use_gitignore"] = args.use_gitignore return Config.from_env(**overrides) @@ -37,6 +40,16 @@ def _build_config(args: argparse.Namespace) -> Config: def cmd_index(args: argparse.Namespace) -> int: cr = CodeRAG(_build_config(args)) + if not args.quiet: + # The provider/model loads on first index access; on a fresh install that is a + # one-off model download. Say so on stderr, so the wait isn't a mystery. + print( + f"Preparing to index {cr.config.watched_dir} " + "(first run downloads the embedding model)…", + file=sys.stderr, + flush=True, + ) + started = time.monotonic() stats = cr.indexer.index( Path(args.path).expanduser() if args.path else None, full=args.full, @@ -44,7 +57,7 @@ def cmd_index(args: argparse.Namespace) -> int: ) print( f"Indexed {stats.files_indexed} file(s), skipped {stats.files_skipped}, " - f"removed {stats.files_removed}. " + f"removed {stats.files_removed} in {time.monotonic() - started:.1f}s. " f"Total: {stats.total_files} files / {stats.total_chunks} chunks." 
) return 0 @@ -235,7 +248,9 @@ def cmd_install(args: argparse.Namespace) -> int: from coderag import install as inst default_watched = ( - Path(args.watched_dir).expanduser() if args.watched_dir else Path.cwd() + Path(args.watched_dir).expanduser() + if args.watched_dir + else inst.default_workspace() ) explicit_watched = Path(args.watched_dir).expanduser() if args.watched_dir else None interactive = sys.stdin.isatty() @@ -311,6 +326,21 @@ def cmd_install(args: argparse.Namespace) -> int: print("\nNext steps:") for s in sorted(steps): print(f" - {s}") + + wd = next((p.watched_dir for p in plans if p.watched_dir is not None), Path.cwd()) + print("\nHow indexing works:") + print( + " CodeRAG indexes your workspace the first time the agent starts its server. It\n" + " runs in the background, so search works right away and fills in as it goes —\n" + " seconds for a repo. Large trees (a whole home/system) are supported too; the\n" + " first pass just takes longer. It skips version-control, build, and dependency\n" + " directories automatically (see `coderag index --help` for throughput options)." 
+ ) + print("\nHandy commands:") + print(f" coderag status --watched-dir {wd} # totals + where the index lives") + print( + f" coderag index --watched-dir {wd} # build/refresh it now, with progress" + ) return 0 if all(r.action != "error" for r in final) else 1 @@ -346,6 +376,19 @@ def _add_common(p: argparse.ArgumentParser) -> None: "any OpenAI-compatible/local server via OPENAI_BASE_URL) | fake.", ) p.add_argument("--model", help="Embedding model name.") + p.add_argument( + "--gitignore", + dest="use_gitignore", + action="store_true", + default=None, + help="Honor .gitignore files while indexing/searching (default).", + ) + p.add_argument( + "--no-gitignore", + dest="use_gitignore", + action="store_false", + help="Do not honor .gitignore files.", + ) def build_parser() -> argparse.ArgumentParser: diff --git a/coderag/surfaces/mcp_server.py b/coderag/surfaces/mcp_server.py index 3231527..44ded35 100644 --- a/coderag/surfaces/mcp_server.py +++ b/coderag/surfaces/mcp_server.py @@ -18,7 +18,9 @@ import json import logging +import sys import threading +import time from typing import TYPE_CHECKING, List, Literal, Optional if TYPE_CHECKING: @@ -29,6 +31,19 @@ logger = logging.getLogger(__name__) + +def _notify(msg: str) -> None: + """Surface a server lifecycle line on stderr. + + The stdio transport owns stdout (it is the MCP wire protocol), so anything meant for a + human watching the server has to go to stderr — which is where agents capture server + logs. Without this, a first run is silent through the model download and the initial + index, so the agent looks hung. Also logged for structured-logging setups. + """ + print(f"[coderag] {msg}", file=sys.stderr, flush=True) + logger.info(msg) + + _INSTRUCTIONS = ( "CodeRAG indexes this workspace for fast search. 
Two complementary search tools, both " "preferable to grep/glob/find/read loops:\n" @@ -356,6 +371,8 @@ def run_mcp( state = _State() mcp = build_mcp(cr, state=state) + _notify(f"starting — workspace: {cr.config.watched_dir}") + _notify("loading the embedding model (first run downloads it; may take a minute)…") _warm_up(cr) if auto_index: @@ -364,10 +381,21 @@ def run_mcp( state.indexing = True def _initial_index() -> None: + _notify( + "building the initial index in the background — search works now and " + "returns more as it finishes (call index_status to check progress)" + ) + started = time.monotonic() try: - cr.index() + stats = cr.index() except Exception: # pragma: no cover - defensive logger.exception("Initial MCP index failed.") + _notify("initial index FAILED — results may be incomplete (see logs)") + else: + _notify( + f"initial index ready: {stats.total_files} files / " + f"{stats.total_chunks} chunks in {time.monotonic() - started:.0f}s" + ) finally: state.indexing = False @@ -384,6 +412,7 @@ def _initial_index() -> None: daemon=True, ).start() + _notify("ready for requests") try: mcp.run(transport=transport) finally: diff --git a/coderag/surfaces/webui.py b/coderag/surfaces/webui.py index 3e45d05..fca0641 100644 --- a/coderag/surfaces/webui.py +++ b/coderag/surfaces/webui.py @@ -153,15 +153,15 @@ def _llm_status(config: "Config") -> Tuple[bool, str]: def _searcher_for(cr: "CodeRAG", dense: float, lexical: float) -> "HybridSearcher": """The facade's searcher, or an ad-hoc one when weights are tuned live. - Reuses the already-loaded provider/store/vectors so changing weights never reloads - the index — only the cheap fusion weighting differs. + Reuses the already-loaded provider/store so changing weights never reloads the index — + only the cheap fusion weighting differs. 
""" if dense == cr.config.dense_weight and lexical == cr.config.lexical_weight: return cr.searcher from coderag.retrieval.search import HybridSearcher cfg = cr.config.with_overrides(dense_weight=dense, lexical_weight=lexical) - return HybridSearcher(cfg, cr.provider, cr.store, cr.vectors) + return HybridSearcher(cfg, cr.provider, cr.store) def _apply_filters( diff --git a/deploy/README.md b/deploy/README.md index 702d812..0bf78d5 100644 --- a/deploy/README.md +++ b/deploy/README.md @@ -27,8 +27,8 @@ workspace, scheduled re-indexing, and sensible security defaults. ## How it's designed (read this first) -CodeRAG keeps its index in **SQLite** (the source of truth) plus a **FAISS** cache, and -the engine is a **single writer** — the FAISS file is written non-atomically, so two +CodeRAG keeps its index in a single embedded **LanceDB** store, and +the engine is a **single writer** — the store is written non-atomically, so two processes writing one index would corrupt it. The chart is built around that fact: - **One replica, `Recreate` strategy, `ReadWriteOnce` PVC.** Never scale the writer @@ -495,6 +495,6 @@ helm template coderag deploy/helm/coderag -f deploy/helm/coderag/ci/full-values. - **Single writer by design** — do not raise `replicas`. For higher search throughput, put a cache/load balancer in front of the read endpoints; the index itself stays single-writer. -- **`ReadWriteOnce`** ties the index to one node at a time; that's expected for SQLite. +- **`ReadWriteOnce`** ties the index to one node at a time; that's expected for the embedded store. - The **UI**, when enabled, maintains a *separate* index from the server. For a single shared index, run the server and point browsers/tools at its REST API. 
diff --git a/deploy/helm/coderag/Chart.yaml b/deploy/helm/coderag/Chart.yaml index 5a668bb..f4c9481 100644 --- a/deploy/helm/coderag/Chart.yaml +++ b/deploy/helm/coderag/Chart.yaml @@ -22,7 +22,7 @@ keywords: - rag - embeddings - semantic-search - - faiss + - lancedb maintainers: - name: Neverdecel url: https://github.com/Neverdecel diff --git a/deploy/helm/coderag/templates/configmap.yaml b/deploy/helm/coderag/templates/configmap.yaml index 34c2e25..7644156 100644 --- a/deploy/helm/coderag/templates/configmap.yaml +++ b/deploy/helm/coderag/templates/configmap.yaml @@ -7,8 +7,6 @@ metadata: data: CODERAG_PROVIDER: {{ .Values.config.provider | quote }} CODERAG_MODEL: {{ .Values.config.model | quote }} - CODERAG_INDEX_TYPE: {{ .Values.config.indexType | quote }} - CODERAG_IVF_THRESHOLD: {{ .Values.config.ivfThreshold | quote }} CODERAG_TOP_K: {{ .Values.config.topK | quote }} CODERAG_LLM_PROVIDER: {{ .Values.config.llmProvider | quote }} CODERAG_CHAT_MODEL: {{ .Values.config.chatModel | quote }} diff --git a/deploy/helm/coderag/templates/server-deployment.yaml b/deploy/helm/coderag/templates/server-deployment.yaml index 0a61bc3..bb38aa2 100644 --- a/deploy/helm/coderag/templates/server-deployment.yaml +++ b/deploy/helm/coderag/templates/server-deployment.yaml @@ -7,7 +7,7 @@ metadata: {{- include "coderag.labels" . | nindent 4 }} app.kubernetes.io/component: server spec: - # CodeRAG is a single SQLite/FAISS writer — keep this at 1. The FAISS index is + # CodeRAG is a single-writer store — keep this at 1. The LanceDB index is # written non-atomically, so two replicas on one volume would corrupt it. 
replicas: 1 strategy: diff --git a/deploy/helm/coderag/values.schema.json b/deploy/helm/coderag/values.schema.json index f57bb0d..dc90a6b 100644 --- a/deploy/helm/coderag/values.schema.json +++ b/deploy/helm/coderag/values.schema.json @@ -86,8 +86,6 @@ "properties": { "provider": { "type": "string", "enum": ["fastembed", "openai", "fake"] }, "model": { "type": "string" }, - "indexType": { "type": "string", "enum": ["auto", "flat", "ivf"] }, - "ivfThreshold": { "type": "integer", "minimum": 0 }, "topK": { "type": "integer", "minimum": 1 }, "llmProvider": { "type": "string", "enum": ["openai", "anthropic"] }, "chatModel": { "type": "string" }, diff --git a/deploy/helm/coderag/values.yaml b/deploy/helm/coderag/values.yaml index 906a020..1783422 100644 --- a/deploy/helm/coderag/values.yaml +++ b/deploy/helm/coderag/values.yaml @@ -68,7 +68,7 @@ workspace: # -- PVC name to mount when source=existingClaim. existingClaim: "" -# --- Persistent index (SQLite source-of-truth + FAISS cache + downloaded model) --- +# --- Persistent index (LanceDB store + downloaded model) --- # CodeRAG is a single-writer engine, so each writer (the server, or the UI when # enabled) gets its own ReadWriteOnce volume. Do not point two writers at one claim. persistence: @@ -109,8 +109,6 @@ config: # fastembed (local, no key) | openai | fake provider: fastembed model: BAAI/bge-small-en-v1.5 - indexType: auto - ivfThreshold: 50000 topK: 8 # LLM answer backend (only used by the optional `--answer` / UI answer feature). llmProvider: openai diff --git a/docs/configuration.md b/docs/configuration.md index 846e7d8..1762959 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -53,7 +53,7 @@ export OPENAI_BASE_URL=http://localhost:11434/v1 # Ollama's OpenAI-compatible export CODERAG_CHAT_MODEL=llama3.1 # the model name your server serves # 3. 
Search with a locally-generated answer: -coderag search "how is the FAISS index persisted" --answer +coderag search "how is the vector index persisted" --answer ``` Other local servers expose the same OpenAI-compatible API — only the base URL and model @@ -105,7 +105,7 @@ coderag index --provider openai ``` > Changing the embedding model (its dimension) triggers a one-time index rebuild — that's -> expected and safe (SQLite is the source of truth; the FAISS index is a rebuildable cache). +> expected and safe (the LanceDB store is rebuildable — re-indexing recreates it from source). ## Local embedding models @@ -158,7 +158,7 @@ optional. | Variable | Default | Meaning | | --- | --- | --- | | `CODERAG_WATCHED_DIR` | cwd | Codebase to index/search. | -| `CODERAG_STORE_DIR` | `./.coderag` | Where the SQLite DB + FAISS index live. | +| `CODERAG_STORE_DIR` | `./.coderag` | Where the LanceDB store lives. | | `CODERAG_INDEX_ALL_TEXT` | `false` | Index any UTF-8 text file (docs/config/extensionless), not just code. Binary files are always skipped. | ### Retrieval & quality @@ -185,8 +185,6 @@ optional. | Variable | Default | Meaning | | --- | --- | --- | -| `CODERAG_INDEX_TYPE` | `auto` | `auto` (Flat → IVF past the threshold) · `flat` (exact) · `ivf` (approximate). | -| `CODERAG_IVF_THRESHOLD` | `50000` | Vectors before `auto` switches Flat → IVF. | | `CODERAG_WORKERS` | `4` | Worker threads for chunking + embedding (`1` = serial; a big lever for remote/OpenAI embeddings). | ### HTTP API server (`coderag serve`) diff --git a/docs/research/lancedb-spike.md b/docs/research/lancedb-spike.md new file mode 100644 index 0000000..3f8d9fc --- /dev/null +++ b/docs/research/lancedb-spike.md @@ -0,0 +1,105 @@ +# LanceDB spike: can a dedicated embedded vector DB replace SQLite + FAISS? + +**Status:** ✅ **adopted** — CodeRAG now uses a single embedded LanceDB store +(`coderag/store/lance_store.py`); the SQLite store + FAISS index were removed. 
This document +is kept as the record of the investigation that led to that decision. The bake-off scripts +referenced below (`scripts/bench_lance.py`, `scripts/eval_lance.py`) were removed once the +migration landed; `scripts/bench_store.py` and `coderag eval` benchmark the live store. + +**Date:** 2026-06. + +## Question + +CodeRAG today stores everything in **SQLite** (chunk metadata + text + BM25 via FTS5 + the +vectors as BLOBs) and keeps a separate **FAISS** index for ANN. Could a single dedicated, +embedded, open-source vector DB be faster/better and simpler — replacing *both*? + +Constraint (fixed): the store must stay **embedded / zero-process** (CodeRAG ships via pipx +and runs as an MCP stdio server; a separate DB server is out). + +## Candidates (mid-2026) + +| Engine | Embedded | License | Scale | Vec + BM25 in one store | Verdict | +|---|---|---|---|---|---| +| **LanceDB** 0.33 | ✅ wheels, no server | Apache-2.0 | millions+ | ✅ ANN + BM25 + hybrid | **spiked** | +| sqlite-vec | ✅ (ext) | Apache/MIT | ~100K (brute-force) | vec only; keeps SQLite | ✗ scale / not a replacement | +| Qdrant | ⚠ local ≤20K / server | Apache-2.0 | millions (server) | ✅ (server) | ✗ not embedded | +| DuckDB VSS | ✅ | MIT | HNSW persistence *experimental* | + FTS ext | ✗ durability | +| Chroma | ✅ | Apache-2.0 | — | built on SQLite + hnswlib | ✗ not a replacement | + +LanceDB is the only embedded, OSS engine that scales to millions and unifies vector ANN + +BM25 + metadata/filtering in one store. + +## Method + +`coderag/store/lance_store.py` implements a self-contained LanceDB store (buffered/batched +writes, cosine vector search, Tantivy BM25, and CodeRAG's own RRF fusion so retrieval is +comparable to `HybridSearcher`, not to LanceDB's built-in hybrid). Two offline harnesses +drive the comparison with the **same chunker and embedding provider**, isolating the +storage/retrieval layer: + +- `scripts/bench_lance.py` — index throughput, on-disk size, query latency. 
+- `scripts/eval_lance.py` — retrieval quality via the existing eval harness. + +## Results — throughput (synthetic corpus, `fake` provider, so it measures the *store*, not the model) + +| corpus | backend | index time | on-disk | query (hybrid) | +|---|---|---|---|---| +| 6k files / 48k chunks | sqlite+faiss | 9.6 s | 19 MB | 8.7 ms | +| 6k files / 48k chunks | **lancedb** | **3.6 s** | **7.7 MB** | 19 ms | +| 20k files / 160k chunks | sqlite+faiss | 34.7 s | 55 MB | 24.7 ms | +| 20k files / 160k chunks | **lancedb** | **11.3 s** | **25.8 MB** | 30 ms | + +- **Index throughput: LanceDB ~3× faster, and the lead grows with scale** — its columnar + bulk write + single BM25 index build beats SQLite's per-file transactions plus per-chunk + FAISS adds. +- **Disk: ~2× smaller.** +- **Query latency: comparable** (within ~1.2–2× at these sizes). NOTE: measured with a tiny + 16-dim fake embedding and **no ANN index** built on either side at <50k, so this is *not* + representative of the million-vector / 384-dim regime — see open questions. +- **Critical fairness lesson:** per-file `add()` to LanceDB is pathological (120 s and + 1.7 GB at 6k files — fragment/version bloat). Writes **must** be batched; with buffering + the numbers above hold. A production integration must batch writes + call `optimize()`. + +## Results — quality + +Blocked in this environment: the embedding model download (`huggingface.co`) is not in the +network allowlist, so a real-embedding eval could not be run here. The pipeline is wired and +runs end-to-end (`eval_lance.py`); with random `fake` vectors the two backends are +neck-and-neck (BM25-dominated), a *hint* that LanceDB's Tantivy BM25 is in the same ballpark +as FTS5 — but **this must be confirmed with real embeddings** before any adoption: + +``` +pip install 'coderag[lance]' +# on a host where the embedding model is reachable (or pre-cached): +python scripts/eval_lance.py --repo . 
\ + --dataset coderag/eval/datasets/coderag_self.jsonl --level file +python scripts/eval_lance.py --repo /path/to/bigger/repo \ + --dataset .jsonl --level symbol +``` + +Adoption gate: LanceDB must be **≥ parity** on recall@k / nDCG / MRR at both file and +symbol level. + +## Open questions (before committing to a migration) + +1. **Quality parity** with real embeddings (the gate above) — Tantivy BM25 tokenization + differs from FTS5; verify identifier/code queries don't regress. +2. **Million-vector query latency** with proper ANN indexes (FAISS IVF vs LanceDB + IVF-PQ/HNSW) and real 384-dim vectors — the regime the throughput bench could not probe. +3. **Incremental freshness:** OSS LanceDB index maintenance is manual (`optimize()`); the + watcher path needs a sensible cadence (unindexed rows are flat-scanned until optimized). +4. **Dependency weight / wheels:** `lancedb` + `pyarrow` across Python 3.11–3.13 and + Linux/macOS-arm/Windows (no macOS-Intel wheel) vs. dropping `faiss-cpu`. + +## Recommendation + +The spike **strengthens** the case: LanceDB indexes markedly faster, uses less disk, keeps +latency competitive, and collapses two subsystems (SQLite store + FAISS + the hand-rolled +consistency/IVF code) into one. It is the right embedded candidate. But adoption should +remain **gated on the real-embedding quality eval and a million-scale latency check** — both +runnable with the harnesses here once the model is available. Because the index is a +rebuildable cache of the source tree, the migration stays low-risk and reversible. + +Next step if green: extract a shared `Store` interface and implement `LanceStore` against it +(replacing `vector_index.py` + the FTS5 schema), then drop `faiss-cpu`. diff --git a/example.env b/example.env index 2f0a4ee..f83a8f2 100644 --- a/example.env +++ b/example.env @@ -14,15 +14,9 @@ CODERAG_MODEL=BAAI/bge-small-en-v1.5 # --- Locations --- # The codebase to index/search (defaults to the current directory). 
CODERAG_WATCHED_DIR=/path/to/your/codebase -# Where the index + database are stored (defaults to ./.coderag). +# Where the LanceDB store is kept (defaults to ./.coderag). # CODERAG_STORE_DIR=./.coderag -# --- Vector index (scale) --- -# auto | flat | ivf. "auto" uses exact Flat search and switches to approximate -# IVF automatically once the corpus grows past CODERAG_IVF_THRESHOLD vectors. -# CODERAG_INDEX_TYPE=auto -# CODERAG_IVF_THRESHOLD=50000 - # --- Retrieval --- # CODERAG_TOP_K=8 # Structure-aware 1-hop call-graph expansion (opt-in): enrich results with the definitions diff --git a/pyproject.toml b/pyproject.toml index 93f6201..611c782 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -27,12 +27,15 @@ keywords = [ "llm", ] dependencies = [ - "faiss-cpu>=1.14.3,<1.15", + "lancedb>=0.33,<1", + "pylance>=0.10", + "pyarrow>=16,<25", "numpy>=2.4.6,<3", "python-dotenv>=1.2.2,<2", "tenacity>=9.1.4,<10", "watchdog>=6.0.0,<7", "fastembed>=0.8.0,<1", + "pathspec>=0.12,<2", "tree-sitter>=0.25.2,<0.26", "tree-sitter-python>=0.25.0,<0.26", "tree-sitter-javascript>=0.25.0,<0.26", @@ -69,6 +72,12 @@ mcp = [ openai = [ "openai>=2.41.1,<3", ] +# GPU embedding for the local fastembed backend on an NVIDIA box. Installs the CUDA +# onnxruntime so `CODERAG_EMBED_DEVICE=auto` (or `cuda`) runs embeddings on the GPU — +# typically 10-50x faster indexing. Install with: pip install 'coderag[gpu]'. +gpu = [ + "onnxruntime-gpu>=1.17,<2", +] anthropic = [ "anthropic>=0.109.2,<1", ] diff --git a/scripts/bench_store.py b/scripts/bench_store.py new file mode 100644 index 0000000..4aa9e30 --- /dev/null +++ b/scripts/bench_store.py @@ -0,0 +1,134 @@ +#!/usr/bin/env python +"""Throughput + memory benchmark for the indexing store, fully offline. + +Generates a synthetic source tree (with junk dirs and a ``.gitignore`` to exercise +filtering) and indexes it with the ``fake`` embedding provider, so the numbers reflect the +store/pipeline path (LanceDB) rather than model speed or the network. 
Reports first-index +wall-time, files/sec, peak RSS, and incremental re-index time. + +Usage: + python scripts/bench_store.py --files 5000 + python scripts/bench_store.py --files 125000 --reindex-frac 0.01 # the real target +""" + +from __future__ import annotations + +import argparse +import resource +import shutil +import tempfile +import time +from pathlib import Path + +from coderag.api import CodeRAG +from coderag.config import Config + +_FUNC = "def func_{i}(x):\n '''do {i}'''\n return x + {i}\n\n" +_CLS = "class Thing{i}:\n def m(self, v):\n return v * {i}\n\n" + + +def _peak_rss_mb() -> float: + # ru_maxrss is KiB on Linux, bytes on macOS — assume Linux (CI/servers). + return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0 + + +def make_tree(root: Path, n_files: int) -> None: + """Write ``n_files`` small Python files across a few packages, plus ignorable junk.""" + root.mkdir(parents=True, exist_ok=True) + (root / ".gitignore").write_text("build/\n*.log\n", encoding="utf-8") + # Junk that filtering must skip (should NOT be indexed). + for d in ("node_modules", ".cache", "site-packages", "build"): + p = root / d / "junk.py" + p.parent.mkdir(parents=True, exist_ok=True) + p.write_text("def junk():\n return 0\n", encoding="utf-8") + (root / "noisy.log").write_text("log line\n" * 50, encoding="utf-8") + # Real source: spread across packages, a few functions/classes each (multi-chunk). 
+ per_pkg = 500 + for i in range(n_files): + pkg = root / f"pkg_{i // per_pkg:04d}" + pkg.mkdir(parents=True, exist_ok=True) + body = "".join(_FUNC.format(i=j) for j in range(i % 5 + 2)) + body += "".join(_CLS.format(i=j) for j in range(i % 3 + 1)) + (pkg / f"mod_{i:06d}.py").write_text(body, encoding="utf-8") + + +def _cr(repo: Path, store: Path, workers: int, batch: int) -> CodeRAG: + return CodeRAG( + Config( + provider="fake", + watched_dir=repo, + store_dir=store, + index_workers=workers, + embed_batch_size=batch, + ) + ) + + +def main() -> int: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--files", type=int, default=2000, help="Synthetic source files.") + ap.add_argument("--workers", type=int, default=4) + ap.add_argument("--batch", type=int, default=64) + ap.add_argument( + "--reindex-frac", + type=float, + default=0.01, + help="Fraction of files to touch for the incremental re-index pass.", + ) + ap.add_argument("--keep", action="store_true", help="Keep the temp tree.") + args = ap.parse_args() + + work = Path(tempfile.mkdtemp(prefix="coderag-bench-")) + repo, store = work / "repo", work / "store" + try: + t0 = time.monotonic() + make_tree(repo, args.files) + gen_s = time.monotonic() - t0 + print(f"Generated {args.files} files in {gen_s:.1f}s at {repo}") + + cr = _cr(repo, store, args.workers, args.batch) + t0 = time.monotonic() + stats = cr.index() + full_s = time.monotonic() - t0 + + # Junk dirs / .gitignore'd files are pruned during the walk (never counted), so + # confirm none leaked into the index rather than reading it off files_skipped. 
+ leaked = [ + p + for p in cr.store.all_file_paths() + if any( + seg in p + for seg in ("node_modules", ".cache", "site-packages", "build/") + ) + ] + print("\n--- first full index ---") + print(f" files indexed : {stats.files_indexed}") + print(f" junk leaked : {len(leaked)} (expect 0 — filtering works)") + print(f" chunks : {stats.total_chunks}") + print(f" wall time : {full_s:.1f}s") + print(f" throughput : {stats.files_indexed / full_s:.0f} files/s") + print(f" peak RSS : {_peak_rss_mb():.0f} MiB") + size_mb = sum(p.stat().st_size for p in store.rglob("*") if p.is_file()) / 1e6 + print(f" index on disk : {size_mb:.1f} MB") + + # Incremental: touch a fraction and re-index (exercises the stat fast-path). + touched = max(1, int(args.files * args.reindex_frac)) + for i in range(touched): + mod = repo / f"pkg_{i // 500:04d}" / f"mod_{i:06d}.py" + mod.write_text(mod.read_text() + "\n# touch\n", encoding="utf-8") + t0 = time.monotonic() + rstats = cr.index() + reindex_s = time.monotonic() - t0 + print(f"\n--- incremental re-index ({touched} changed) ---") + print(f" files indexed : {rstats.files_indexed}") + print(f" files skipped : {rstats.files_skipped}") + print(f" wall time : {reindex_s:.2f}s") + cr.close() + return 0 + finally: + if not args.keep: + shutil.rmtree(work, ignore_errors=True) + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/tests/conftest.py b/tests/conftest.py index addc903..34997c6 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -21,7 +21,6 @@ def config(tmp_path: Path) -> Config: provider="fake", watched_dir=tmp_path / "repo", store_dir=tmp_path / "store", - ivf_threshold=20, # tiny so IVF-path tests don't need huge corpora ) diff --git a/tests/test_config_and_providers.py b/tests/test_config_and_providers.py index f60a0fb..a06cd2c 100644 --- a/tests/test_config_and_providers.py +++ b/tests/test_config_and_providers.py @@ -11,8 +11,7 @@ def test_config_defaults_and_derived_paths(tmp_path): cfg = 
Config(store_dir=tmp_path / ".coderag") assert cfg.provider == "fastembed" - assert cfg.db_path == tmp_path / ".coderag" / "coderag.db" - assert cfg.faiss_path == tmp_path / ".coderag" / "index.faiss" + assert cfg.store_dir == tmp_path / ".coderag" def test_config_is_immutable_and_copies(): @@ -37,6 +36,30 @@ def test_from_env_ignores_bad_ints(monkeypatch): assert cfg.top_k == 8 # falls back to default +def test_env_ignore_globs_append_to_defaults(monkeypatch): + from coderag.config import DEFAULT_IGNORE_GLOBS + + monkeypatch.setenv("CODERAG_IGNORE_GLOBS", "secret/*, *.bin") + cfg = Config.from_env() + assert set(DEFAULT_IGNORE_GLOBS) <= set(cfg.ignore_globs) # defaults kept + assert "secret/*" in cfg.ignore_globs and "*.bin" in cfg.ignore_globs + + +def test_default_ignores_cover_dependency_and_cache_dirs(): + from coderag.config import DEFAULT_IGNORE_GLOBS + + for junk in ("site-packages/*", ".cache/*", "node_modules/*", "target/*"): + assert junk in DEFAULT_IGNORE_GLOBS + + +def test_env_embed_device_and_threads(monkeypatch): + monkeypatch.setenv("CODERAG_EMBED_DEVICE", "cuda") + monkeypatch.setenv("CODERAG_EMBED_THREADS", "8") + cfg = Config.from_env() + assert cfg.embed_device == "cuda" + assert cfg.embed_threads == 8 + + def test_secrets_are_kept_out_of_repr(): cfg = Config( openai_api_key="sk-openai-secret", diff --git a/tests/test_fastembed_provider.py b/tests/test_fastembed_provider.py new file mode 100644 index 0000000..2db6c13 --- /dev/null +++ b/tests/test_fastembed_provider.py @@ -0,0 +1,44 @@ +"""Device/provider selection for the fastembed backend (no model load, no network).""" + +from __future__ import annotations + +from coderag.embeddings.fastembed_provider import FastEmbedProvider + + +def _provider(device: str) -> FastEmbedProvider: + return FastEmbedProvider("BAAI/bge-small-en-v1.5", device=device) + + +def test_cpu_device_forces_cpu_provider(): + assert _provider("cpu")._providers() == ["CPUExecutionProvider"] + + +def 
test_cuda_device_lists_cuda_then_cpu_fallback(): + # CPU is listed second so onnxruntime degrades gracefully if CUDA init fails. + assert _provider("cuda")._providers() == [ + "CUDAExecutionProvider", + "CPUExecutionProvider", + ] + + +def test_auto_uses_cpu_when_no_gpu(monkeypatch): + import onnxruntime as ort + + monkeypatch.setattr( + ort, "get_available_providers", lambda: ["CPUExecutionProvider"] + ) + assert _provider("auto")._providers() is None # library CPU default + + +def test_auto_uses_gpu_when_available(monkeypatch): + import onnxruntime as ort + + monkeypatch.setattr( + ort, + "get_available_providers", + lambda: ["CUDAExecutionProvider", "CPUExecutionProvider"], + ) + assert _provider("auto")._providers() == [ + "CUDAExecutionProvider", + "CPUExecutionProvider", + ] diff --git a/tests/test_ignore.py b/tests/test_ignore.py new file mode 100644 index 0000000..76dbdcb --- /dev/null +++ b/tests/test_ignore.py @@ -0,0 +1,68 @@ +"""Tests for the shared ignore-aware walker (:func:`coderag._ignore.walk_files`).""" + +from __future__ import annotations + +from coderag._ignore import walk_files +from coderag.config import DEFAULT_IGNORE_GLOBS +from tests.conftest import write + + +def _rels(root, **kw) -> set[str]: + return {rel for _, rel in walk_files(root, DEFAULT_IGNORE_GLOBS, **kw)} + + +def test_gitignore_negation_and_dir_only(tmp_path): + write(tmp_path / ".gitignore", "*.log\nout/\n!keep.log\n") + write(tmp_path / "a.py", "x\n") + write(tmp_path / "debug.log", "x\n") + write(tmp_path / "keep.log", "x\n") + write(tmp_path / "out" / "x.py", "x\n") # "out" is not a built-in default ignore + rels = _rels(tmp_path) + assert "a.py" in rels + assert "keep.log" in rels # re-included by negation + assert "debug.log" not in rels # ignored by *.log + assert not any(r.startswith("out/") for r in rels) # dir pruned by "out/" + + +def test_nested_gitignore_scopes_to_subtree(tmp_path): + write(tmp_path / "a.txt", "x\n") + write(tmp_path / "sub" / ".gitignore", 
"*.txt\n") + write(tmp_path / "sub" / "b.txt", "x\n") + write(tmp_path / "sub" / "c.py", "x\n") + rels = _rels(tmp_path) + assert "a.txt" in rels # root .txt unaffected by sub/.gitignore + assert "sub/c.py" in rels + assert "sub/b.txt" not in rels # ignored by the nested rule + + +def test_gitignore_can_be_disabled(tmp_path): + write(tmp_path / ".gitignore", "*.log\n") + write(tmp_path / "debug.log", "x\n") + assert "debug.log" not in _rels(tmp_path, use_gitignore=True) + assert "debug.log" in _rels(tmp_path, use_gitignore=False) + + +def test_indexer_and_fs_search_agree_on_gitignore(tmp_path): + # The shared-walker invariant: semantic index and exact search see the same files. + from coderag.api import CodeRAG + from coderag.config import Config + + repo = tmp_path / "repo" + write(repo / ".gitignore", "ignored/\n*.log\n") + write(repo / "keep.py", "def k():\n return 1\n") + write(repo / "ignored" / "x.py", "def x():\n return 1\n") + write(repo / "note.log", "hi\n") + cr = CodeRAG( + Config(provider="fake", watched_dir=repo, store_dir=tmp_path / "store") + ) + cr.index() + indexed = set(cr.store.all_file_paths()) + assert "keep.py" in indexed + assert not any(p.startswith("ignored/") for p in indexed) + + res = cr.search_files("*", target="files", use_ripgrep=False) + found = {row["path"] for row in res["results"]} + assert "keep.py" in found + assert not any(p.startswith("ignored/") for p in found) + assert "note.log" not in found # .log ignored, so exact search skips it too + cr.close() diff --git a/tests/test_indexer.py b/tests/test_indexer.py index 91b78c4..1d53b3a 100644 --- a/tests/test_indexer.py +++ b/tests/test_indexer.py @@ -2,6 +2,8 @@ from __future__ import annotations +from pathlib import Path + from coderag.api import CodeRAG from tests.conftest import write @@ -18,7 +20,7 @@ def test_index_creates_chunks(config): stats = cr.index() assert stats.files_indexed == 2 assert stats.total_chunks >= 2 - assert cr.vectors.ntotal == stats.total_chunks + 
assert cr.store.total_chunks() == stats.total_chunks def test_unchanged_files_are_skipped(config): @@ -36,19 +38,14 @@ def test_editing_a_file_does_not_duplicate(config): write(path, "def alpha():\n return 1\n") cr.index() chunks_before = cr.store.total_chunks() - vectors_before = cr.vectors.ntotal - assert chunks_before == vectors_before + assert chunks_before >= 1 # Edit and reindex. write(path, "def alpha():\n return 100\n\ndef gamma():\n return 3\n") stats = cr.index() assert stats.chunks_removed >= 1 # old chunks were deleted first - # Store and FAISS stay in lock-step (no stale/duplicate vectors). - assert cr.store.total_chunks() == cr.vectors.ntotal - # The new content is searchable; the stale content is gone. - rows = cr.store.hydrate( - cr.store.chunk_ids_for_file(cr.store.get_file("a.py")["id"]) - ) + # The new content is searchable; the stale content is gone (no duplicates). + rows = cr.store.hydrate(cr.store.chunk_ids_for_path("a.py")) joined = "\n".join(r["text"] for r in rows.values()) assert "return 100" in joined assert "return 1\n" not in joined or "return 100" in joined @@ -61,13 +58,13 @@ def test_deleted_file_is_pruned(config): write(a, "def alpha():\n return 1\n") write(b, "def beta():\n return 2\n") cr.index() - assert cr.store.total_chunks() == cr.vectors.ntotal + chunks_with_b = cr.store.total_chunks() b.unlink() stats = cr.index() assert stats.files_removed == 1 assert "b.py" not in cr.store.all_file_paths() - assert cr.store.total_chunks() == cr.vectors.ntotal + assert cr.store.total_chunks() < chunks_with_b # b's chunks are gone def test_ignored_dirs_are_skipped(config): @@ -82,6 +79,18 @@ def test_ignored_dirs_are_skipped(config): assert not any(".git" in p for p in paths) +def test_dependency_and_cache_dirs_are_skipped(config): + cr = _cr(config) + write(config.watched_dir / "src" / "a.py", "def alpha():\n return 1\n") + write(config.watched_dir / "site-packages" / "dep.py", "def dep():\n return 1\n") + write(config.watched_dir / 
".cache" / "c.py", "def c():\n return 1\n") + cr.index() + paths = cr.store.all_file_paths() + assert "src/a.py" in paths + assert not any("site-packages" in p for p in paths) + assert not any(".cache" in p for p in paths) + + def test_full_rebuild_resets(config): cr = _cr(config) write(config.watched_dir / "a.py", "def alpha():\n return 1\n") @@ -89,7 +98,7 @@ def test_full_rebuild_resets(config): n1 = cr.store.total_chunks() stats = cr.index(full=True) assert stats.total_chunks == n1 # same content, rebuilt cleanly - assert cr.store.total_chunks() == cr.vectors.ntotal + assert cr.store.total_chunks() == n1 def test_get_file_line_numbers_use_chunk_convention(config): @@ -106,6 +115,45 @@ def test_get_file_line_numbers_use_chunk_convention(config): assert cr.get_file("f.txt", 1, 2) == "\n".join(expected[:2]) +def test_stat_skip_avoids_reread_of_unchanged_files(config, monkeypatch): + # On a re-index, an untouched file must be skipped via the cheap (size, mtime) check + # WITHOUT reading its bytes — the dominant cost saver for a large tree. + cr = _cr(config) + write(config.watched_dir / "a.py", "def alpha():\n return 1\n") + cr.index() + + orig_read = Path.read_bytes + + def boom(self: Path) -> bytes: + if self.name == "a.py": + raise AssertionError("re-read an unchanged file instead of stat-skipping") + return orig_read(self) + + monkeypatch.setattr(Path, "read_bytes", boom) + stats = cr.index() + assert stats.files_indexed == 0 + assert stats.files_skipped == 1 + + +def test_index_progress_is_reported(config, capsys): + # progress=True narrates the run on stderr so a long index isn't a silent wait. 
+ cr = _cr(config) + write(config.watched_dir / "a.py", "def alpha():\n return 1\n") + cr.indexer.index(progress=True) + err = capsys.readouterr().err + assert "Scanning" in err # discovery phase is announced + assert "✓ Indexed" in err # final summary line + + +def test_index_progress_is_silent_when_off(config, capsys): + # progress=False (the default, and what the MCP background index uses) stays quiet. + cr = _cr(config) + write(config.watched_dir / "a.py", "def alpha():\n return 1\n") + cr.indexer.index(progress=False) + err = capsys.readouterr().err + assert "Scanning" not in err and "✓ Indexed" not in err + + def test_index_survives_reopen(config, tmp_path): cr = _cr(config) write(config.watched_dir / "a.py", "def alpha():\n return 1\n") @@ -114,5 +162,4 @@ def test_index_survives_reopen(config, tmp_path): cr.close() cr2 = CodeRAG(config) - assert cr2.store.total_chunks() == n - assert cr2.vectors.ntotal == n # FAISS cache reloaded, consistent + assert cr2.store.total_chunks() == n # persisted across reopen diff --git a/tests/test_install.py b/tests/test_install.py index de75daf..6b2570d 100644 --- a/tests/test_install.py +++ b/tests/test_install.py @@ -123,6 +123,43 @@ def test_wizard_collects_choices(home, monkeypatch): assert plans[0].tools == inst.DEFAULT_TOOLS +# --- workspace-scope guidance (large trees are supported) ----------------------------- + + +def test_default_workspace_prefers_git_root(home): + repo = Path.cwd() + (repo / ".git").mkdir() + deep = repo / "pkg" / "deep" + deep.mkdir(parents=True) + # Run from a subdirectory: the natural scope is the whole repo, not the subdir. 
+ assert inst.default_workspace(deep) == repo.resolve() + + +def test_default_workspace_falls_back_to_start(home): + start = Path.cwd() / "loose" + start.mkdir() + assert inst.default_workspace(start) == start.resolve() + + +def test_is_broad_root_flags_home_and_system(home): + assert inst._is_broad_root(Path("/")) # filesystem root + assert inst._is_broad_root(Path("/usr")) + assert inst._is_broad_root(Path.home()) # the user's whole home + assert not inst._is_broad_root(Path.cwd()) # a normal project dir + + +def test_wizard_describes_large_tree_support(home, monkeypatch, capsys): + # Choosing "/" is a legitimate large-tree choice: the wizard sets expectations + # (background, takes longer) and flags the /proc footgun, without discouraging it. + answers = iter(["1", "/", ""]) # claude, watched=/, (no tools prompt for claude) + monkeypatch.setattr("builtins.input", lambda *_: next(answers)) + inst.run_wizard([], Path.cwd()) + out = capsys.readouterr().out + assert "large tree" in out + assert "/proc" in out # the one genuine footgun for "/" + assert "almost always what you want" not in out # no longer discourages + + # --- launcher resolution (the venv-activation footgun) -------------------------------- # # An agent launches the server from its own shell, so the command written into its config diff --git a/tests/test_lance_store.py b/tests/test_lance_store.py new file mode 100644 index 0000000..1acc45e --- /dev/null +++ b/tests/test_lance_store.py @@ -0,0 +1,143 @@ +"""Tests for the LanceDB store — the single backend (metadata + BM25 + vectors).""" + +from __future__ import annotations + +from coderag.embeddings.fake_provider import FakeEmbeddingProvider +from coderag.store.lance_store import LanceStore +from coderag.types import Chunk + + +def _chunk(text: str, sym: str = "f", kind: str = "function", start: int = 1) -> Chunk: + return Chunk( + text=text, + start_line=start, + end_line=start + 1, + language="python", + symbol=sym, + kind=kind, + ) + + +def 
_store(tmp_path): + prov = FakeEmbeddingProvider() + return LanceStore(tmp_path / "store", prov.dim), prov + + +def _add(st, prov, rel, chunks, *, replace=False, chash="h", mtime=1.0, size=10): + vecs = prov.embed_documents([c.text for c in chunks]) + return st.write_file( + rel, "python", chash, mtime, size, chunks, vecs, replace=replace + ) + + +def test_write_stats_lexical_and_hybrid(tmp_path): + st, prov = _store(tmp_path) + _add( + st, + prov, + "auth.py", + [_chunk("def authenticate(token): retry backoff", "authenticate")], + ) + _add(st, prov, "math.py", [_chunk("def add(a, b): return a + b", "add")]) + st.optimize() + + s = st.stats() + assert s.total_files == 2 and s.total_chunks == 2 + assert st.total_chunks() == 2 + + lex = st.lexical_search("authenticate", 5) + assert lex + top = lex[0][0] + assert st.hydrate([top])[top]["path"] == "auth.py" + + hits = st.search("authenticate token", prov, top_k=5) + assert hits and {h.path for h in hits} <= {"auth.py", "math.py"} + assert all(h.start_line >= 1 and h.text for h in hits) + + +def test_change_detection_metadata(tmp_path): + st, prov = _store(tmp_path) + _add(st, prov, "a.py", [_chunk("x")], chash="h1", mtime=12.5, size=4096) + meta = st.get_file_meta("a.py") + assert meta is not None + assert meta["content_hash"] == "h1" + assert meta["size"] == 4096 and abs(meta["mtime"] - 12.5) < 1e-9 + assert set(st.all_file_metas()) == {"a.py"} + assert st.all_file_paths() == ["a.py"] + assert st.get_file_meta("missing.py") is None + + +def test_replace_does_not_duplicate(tmp_path): + st, prov = _store(tmp_path) + _add(st, prov, "a.py", [_chunk("def alpha(): return 1", "alpha")]) + assert st.total_chunks() == 1 + added, removed = _add( + st, + prov, + "a.py", + [ + _chunk("def alpha(): return 100", "alpha"), + _chunk("def gamma(): return 3", "gamma"), + ], + replace=True, + chash="h2", + ) + assert added == 2 and removed == 1 + assert st.total_chunks() == 2 + rows = st.hydrate(st.chunk_ids_for_path("a.py")) + joined = 
"\n".join(r["text"] for r in rows.values()) + assert "return 100" in joined + assert "return 1\n" not in joined + + +def test_delete_file(tmp_path): + st, prov = _store(tmp_path) + _add(st, prov, "a.py", [_chunk("a")]) + _add(st, prov, "b.py", [_chunk("b")]) + assert st.delete_file("a.py") == 1 + assert st.all_file_paths() == ["b.py"] + assert st.total_chunks() == 1 + + +def test_bootstrap_clears_on_model_change(tmp_path): + st, prov = _store(tmp_path) + assert st.bootstrap(prov.dim, "fake-16") is False + _add(st, prov, "a.py", [_chunk("a")]) + st.optimize() + assert st.total_chunks() == 1 + assert st.bootstrap(prov.dim, "fake-16") is False # unchanged + assert st.total_chunks() == 1 + assert st.bootstrap(prov.dim, "other-model") is True # model changed -> cleared + assert st.total_chunks() == 0 + + +def test_symbol_index_caches_and_invalidates(tmp_path): + st, prov = _store(tmp_path) + _add(st, prov, "a.py", [_chunk("def compute_tax(): pass", "compute_tax")]) + idx = st.symbol_index() + assert "compute_tax" in idx + assert st.symbol_index() is idx # cached while nothing changed + _add(st, prov, "b.py", [_chunk("def brand_new(): pass", "brand_new")]) + idx2 = st.symbol_index() + assert idx2 is not idx and "brand_new" in idx2 + + +def test_distinct_and_fts_sanitization(tmp_path): + st, prov = _store(tmp_path) + _add(st, prov, "a.py", [_chunk("def parse_config(): return 1", "parse_config")]) + st.optimize() + assert st.distinct_languages() == ["python"] + assert "function" in st.distinct_kinds() + assert st.lexical_search("parse_config", 5) # plain token + assert st.lexical_search("parse_config::*", 5) # operators sanitized, no raise + assert st.lexical_search("", 5) == [] # empty query + + +def test_clear_empties_store(tmp_path): + st, prov = _store(tmp_path) + _add(st, prov, "a.py", [_chunk("a")]) + st.optimize() + assert st.total_chunks() == 1 + st.clear() + assert st.total_chunks() == 0 + assert st.all_file_paths() == [] diff --git a/tests/test_mcp.py 
b/tests/test_mcp.py index 208728a..912d9ff 100644 --- a/tests/test_mcp.py +++ b/tests/test_mcp.py @@ -194,7 +194,7 @@ def test_index_status_reports_totals_and_flag(tmp_path): cr, mcp, state, _ = _make(tmp_path, DEMO) r = _call(mcp, "index_status", {}) assert r["total_files"] == 2 - assert r["total_chunks"] == cr.vectors.ntotal + assert r["total_chunks"] == cr.store.total_chunks() assert r["indexing"] == "ready" state.indexing = True @@ -207,7 +207,7 @@ def test_reindex_picks_up_new_file_and_guards_concurrency(tmp_path): write(repo / "extra.py", "def extra():\n return 1\n") r = _call(mcp, "reindex", {}) assert r["total_files"] == 3 - assert cr.store.total_chunks() == cr.vectors.ntotal + assert cr.store.total_chunks() > 0 # store is populated after the reindex state.indexing = True # a run already in progress -> guarded assert "error" in _call(mcp, "reindex", {}) @@ -220,6 +220,16 @@ def test_warm_up_is_safe(tmp_path): cr.close() +def test_notify_keeps_stdout_clean(capsys): + # Lifecycle messages must go to stderr only — stdout is the stdio MCP wire protocol. 
+ from coderag.surfaces.mcp_server import _notify + + _notify("indexing started") + captured = capsys.readouterr() + assert "indexing started" in captured.err + assert captured.out == "" + + # --- all-text (general file-directory) indexing --- @@ -273,7 +283,6 @@ def build(workers, sub): out = ( stats.total_chunks, cr.store.total_chunks(), - cr.vectors.ntotal, sorted(cr.store.all_file_paths()), ) cr.close() @@ -281,9 +290,9 @@ def build(workers, sub): serial = build(1, "store_serial") parallel = build(4, "store_parallel") + assert serial[0] == parallel[0] # stats agree assert serial[1] == parallel[1] > 0 # identical chunk count - assert serial[1] == serial[2] and parallel[1] == parallel[2] # store == FAISS - assert serial[3] == parallel[3] # identical file set + assert serial[2] == parallel[2] # identical file set def test_search_is_safe_during_concurrent_indexing(tmp_path): @@ -308,7 +317,7 @@ def hammer_search(): t = threading.Thread(target=hammer_search) t.start() try: - # Re-index (FAISS add/remove) while searches (FAISS reads) run concurrently. + # Re-index (store writes) while searches (store reads) run concurrently. 
for _ in range(3): for i in range(25, 45): write(repo / f"f{i}.py", "def g():\n return 'more tokens here'\n") @@ -321,7 +330,5 @@ def hammer_search(): t.join(timeout=5) assert not errors, errors - assert ( - cr.store.total_chunks() == cr.vectors.ntotal - ) # invariant holds after the race + assert cr.store.total_chunks() == 25 # the 20 churned files were pruned cr.close() diff --git a/tests/test_rerank.py b/tests/test_rerank.py index a2da22b..1cb5880 100644 --- a/tests/test_rerank.py +++ b/tests/test_rerank.py @@ -56,7 +56,7 @@ def test_get_reranker_built_when_enabled(config): def test_reranker_reorders_and_sets_score(config): cr = _indexed(config) searcher = HybridSearcher( - cr.config, cr.provider, cr.store, cr.vectors, reranker=KeywordReranker() + cr.config, cr.provider, cr.store, reranker=KeywordReranker() ) hits = searcher.search("validate session token", top_k=2) assert hits @@ -69,7 +69,7 @@ def test_reranker_reorders_and_sets_score(config): def test_rerank_trims_to_top_k(config): cr = _indexed(config) searcher = HybridSearcher( - cr.config, cr.provider, cr.store, cr.vectors, reranker=KeywordReranker() + cr.config, cr.provider, cr.store, reranker=KeywordReranker() ) assert len(searcher.search("token", top_k=1)) == 1 @@ -77,7 +77,7 @@ def test_rerank_trims_to_top_k(config): def test_reranker_empty_query(config): cr = _indexed(config) searcher = HybridSearcher( - cr.config, cr.provider, cr.store, cr.vectors, reranker=KeywordReranker() + cr.config, cr.provider, cr.store, reranker=KeywordReranker() ) assert searcher.search(" ", top_k=3) == [] diff --git a/tests/test_store.py b/tests/test_store.py deleted file mode 100644 index 8014a2e..0000000 --- a/tests/test_store.py +++ /dev/null @@ -1,161 +0,0 @@ -"""P1 tests: SQLite store + pluggable FAISS vector index.""" - -from __future__ import annotations - -import numpy as np - -from coderag.config import Config -from coderag.store.sqlite_store import SQLiteStore -from coderag.store.vector_index import 
FaissVectorIndex -from coderag.types import Chunk - - -def _store(tmp_path) -> SQLiteStore: - store = SQLiteStore(tmp_path / "coderag.db") - store.bootstrap(embed_dim=16, embed_model="fake-16") - return store - - -def _chunk(text: str, start: int = 1) -> Chunk: - return Chunk( - text=text, - start_line=start, - end_line=start + 2, - language="python", - symbol="f", - kind="function", - ) - - -def test_add_and_hydrate_chunks(tmp_path): - store = _store(tmp_path) - fid = store.upsert_file("a.py", "python", "hash1", 1.0) - vecs = np.ones((2, 16), dtype="float32") - ids = store.add_chunks( - fid, [_chunk("def f(): pass"), _chunk("x = 1", 5)], vecs, "fake-16" - ) - assert len(ids) == 2 - rows = store.hydrate(ids) - assert rows[ids[0]]["path"] == "a.py" - assert rows[ids[0]]["text"] == "def f(): pass" - - -def test_autoincrement_ids_never_reused(tmp_path): - store = _store(tmp_path) - fid = store.upsert_file("a.py", "python", "h", 1.0) - vecs = np.ones((1, 16), dtype="float32") - first = store.add_chunks(fid, [_chunk("a")], vecs, "fake-16") - store.delete_chunks_for_file(fid) - second = store.add_chunks(fid, [_chunk("b")], vecs, "fake-16") - assert second[0] > first[0] # id advanced, not recycled - - -def test_fts_search_finds_token_and_survives_operators(tmp_path): - store = _store(tmp_path) - fid = store.upsert_file("a.py", "python", "h", 1.0) - vecs = np.ones((1, 16), dtype="float32") - store.add_chunks(fid, [_chunk("def parse_config(): return 1")], vecs, "fake-16") - hits = store.fts_search("parse_config", limit=5) - assert len(hits) == 1 - # Operators in the query must not raise. 
- assert store.fts_search("parse_config::*", limit=5) - assert store.fts_search("", limit=5) == [] - - -def test_iter_vectors_round_trips(tmp_path): - store = _store(tmp_path) - fid = store.upsert_file("a.py", "python", "h", 1.0) - vecs = np.random.default_rng(0).standard_normal((3, 16)).astype("float32") - ids = store.add_chunks( - fid, [_chunk("a"), _chunk("b"), _chunk("c")], vecs, "fake-16" - ) - got_ids, got_vecs = next(store.iter_vectors()) - assert list(got_ids) == ids - np.testing.assert_allclose(got_vecs, vecs) - - -def test_model_change_triggers_rebuild_flag(tmp_path): - store = SQLiteStore(tmp_path / "coderag.db") - assert store.bootstrap(16, "fake-16") is False - store.upsert_file("a.py", "python", "h", 1.0) - # Re-bootstrap with a different dim/model: should clear and request rebuild. - assert store.bootstrap(384, "bge-small") is True - assert store.all_file_paths() == [] - - -def _vec_index(tmp_path, **cfg) -> tuple: - config = Config(store_dir=tmp_path, **cfg) - store = _store(tmp_path) - idx = FaissVectorIndex.open(config, dim=16) - return config, store, idx - - -def test_vector_add_search_remove(tmp_path): - _, _, idx = _vec_index(tmp_path) - rng = np.random.default_rng(1) - vecs = rng.standard_normal((5, 16)).astype("float32") - ids = np.array([10, 20, 30, 40, 50], dtype="int64") - idx.add(ids, vecs) - assert idx.ntotal == 5 - got_ids, scores = idx.search(vecs[2], k=3) - assert got_ids[0] == 30 # closest to itself - assert scores[0] > 0.99 - removed = idx.remove([30]) - assert removed == 1 - got_ids, _ = idx.search(vecs[2], k=3) - assert 30 not in got_ids - - -def test_rebuild_from_store_and_consistency(tmp_path): - config, store, idx = _vec_index(tmp_path) - fid = store.upsert_file("a.py", "python", "h", 1.0) - vecs = np.random.default_rng(2).standard_normal((4, 16)).astype("float32") - store.add_chunks(fid, [_chunk(str(i)) for i in range(4)], vecs, "fake-16") - # Index is empty but store has 4 chunks -> ensure_consistent rebuilds. 
- idx.ensure_consistent(store) - assert idx.ntotal == 4 - assert idx.kind == "flat" - - -def test_auto_upgrade_flat_to_ivf(tmp_path): - # ivf_threshold tiny so a small corpus crosses it. - config, store, idx = _vec_index(tmp_path, ivf_threshold=10) - fid = store.upsert_file("a.py", "python", "h", 1.0) - n = 30 - vecs = np.random.default_rng(3).standard_normal((n, 16)).astype("float32") - ids = store.add_chunks(fid, [_chunk(str(i)) for i in range(n)], vecs, "fake-16") - idx.add(np.array(ids, dtype="int64"), vecs) - assert idx.kind == "flat" - upgraded = idx.maybe_upgrade(store) - assert upgraded is True - assert idx.kind == "ivf" - assert idx.ntotal == n - # IVF still returns the self-match. - got_ids, _ = idx.search(vecs[0], k=1) - assert got_ids[0] == ids[0] - - -def test_index_persists_across_open(tmp_path): - config, store, idx = _vec_index(tmp_path) - vecs = np.random.default_rng(4).standard_normal((3, 16)).astype("float32") - idx.add(np.array([1, 2, 3], dtype="int64"), vecs) - idx.save() - reopened = FaissVectorIndex.open(config, dim=16) - assert reopened.ntotal == 3 - assert reopened.kind == "flat" - - -def test_rebuild_ivf_handles_degenerate_corpus(tmp_path): - # Forcing IVF over many identical vectors must not raise and must stay searchable - # (degenerate training falls back to flat rather than aborting indexing). 
- config, store, idx = _vec_index(tmp_path, index_type="ivf", ivf_threshold=1) - fid = store.upsert_file("a.py", "python", "h", 1.0) - n = 40 - vecs = np.ones((n, 16), dtype="float32") # all identical -> degenerate clustering - ids = store.add_chunks(fid, [_chunk(str(i)) for i in range(n)], vecs, "fake-16") - idx.rebuild_from_store(store) # must not raise - assert idx.ntotal == n - assert idx.kind in ("ivf", "flat") - got_ids, _ = idx.search(vecs[0], k=1) - assert len(got_ids) == 1 - assert got_ids[0] in ids diff --git a/tests/test_surfaces.py b/tests/test_surfaces.py index 3b8d52b..f0afe9c 100644 --- a/tests/test_surfaces.py +++ b/tests/test_surfaces.py @@ -148,12 +148,11 @@ def test_watcher_apply_handles_edit_and_delete(repo_with_code): write(new, "def extra():\n return 1\n") _apply(cr, str(new)) assert cr.store.total_chunks() > n0 - assert cr.store.total_chunks() == cr.vectors.ntotal new.unlink() _apply(cr, str(new)) assert "extra.py" not in cr.store.all_file_paths() - assert cr.store.total_chunks() == cr.vectors.ntotal + assert cr.store.total_chunks() == n0 # back to the pre-edit count def test_watcher_handler_collects_only_code_paths():