10 changes: 5 additions & 5 deletions AGENTS.md
@@ -5,7 +5,7 @@
- `coderag/config.py`, `coderag/types.py`: Immutable `Config` and shared dataclasses.
- `coderag/embeddings/`: `EmbeddingProvider` protocol + `fastembed` (default), `openai`, `fake`.
- `coderag/chunking/`: Symbol-aware chunking (`python_ast.py`, `treesitter.py`, line-window `base.py`).
- `coderag/store/`: `sqlite_store.py` (source of truth + FTS5) and `vector_index.py` (FAISS Flat/IVF cache).
- `coderag/store/`: `lance_store.py` — a single embedded LanceDB store (chunk metadata, BM25, and vectors).
- `coderag/retrieval/`: Hybrid dense + BM25 search fused with RRF.
- `coderag/indexer.py`, `coderag/watch.py`: Incremental indexing and the debounced watcher.
- `coderag/_ignore.py`: Shared ignore-glob matching used by both the indexer and `fs_search`.
@@ -29,10 +29,10 @@
- First-party module is `coderag`; surfaces must stay thin — no engine logic in `surfaces/`.

## Architecture Invariants
- SQLite is the source of truth; the FAISS index is a rebuildable cache (`rebuild_from_store`).
- `chunks.id` is the FAISS id and is `AUTOINCREMENT` (ids never reused).
- Incremental indexing is delete-before-add (no duplicate/stale vectors); unchanged files skip via content hash.
- Embedding dimension comes from the provider, not a constant; a model change triggers a rebuild.
- One embedded LanceDB store holds metadata + BM25 + vectors; it's rebuildable by re-indexing from source.
- `chunks.id` is a store-managed integer id used as the fusion/hydrate key.
- Incremental indexing is delete-before-add (no duplicate/stale rows); unchanged files skip via size+mtime then content hash.
- Embedding dimension comes from the provider, not a constant; a model change clears the store for a clean re-index.
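
The "size+mtime then content hash" skip rule above can be sketched as follows. This is a minimal illustration, not the project's code: `FileRecord`, `content_hash`, and `needs_reindex` are hypothetical names, and SHA-256 is an assumed hash choice.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class FileRecord:
    """What a hypothetical index remembers about a file it already embedded."""

    size: int
    mtime_ns: int
    content_hash: str


def content_hash(path: Path) -> str:
    """Hash the file bytes (SHA-256 is an assumption made for this sketch)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def needs_reindex(path: Path, known: Optional[FileRecord]) -> bool:
    """Decide whether a file must be re-chunked and re-embedded."""
    if known is None:
        return True  # never indexed before
    st = path.stat()
    # Cheap check first: identical size and mtime means we skip without reading.
    if st.st_size == known.size and st.st_mtime_ns == known.mtime_ns:
        return False
    # Metadata changed (e.g. a touch or checkout): fall back to the content hash.
    return content_hash(path) != known.content_hash
```

The ordering matters for cost: the `stat` call is much cheaper than reading and hashing, so the hash only runs for files whose metadata moved.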

## Testing Guidelines
- Place tests in `tests/` as `test_*.py`; keep them deterministic and offline (use the `fake` provider fixture).
27 changes: 13 additions & 14 deletions DEVELOPMENT.md
@@ -26,29 +26,28 @@ coderag/
├── llm.py # Optional streamed LLM answer over retrieved chunks
├── embeddings/ # EmbeddingProvider protocol + fastembed / openai / fake
├── chunking/ # Symbol-aware chunking: python_ast, treesitter, line-window base
├── store/ # SQLite source of truth + pluggable FAISS vector index
│ ├── sqlite_store.py # files/chunks/vectors + FTS5 lexical search
│ └── vector_index.py # FaissVectorIndex: Flat (exact) / IVF (scale)
├── store/ # Single embedded LanceDB store
│ └── lance_store.py # files/chunks + BM25 (FTS) + vectors (ANN) in one place
├── retrieval/ # Hybrid search: dense + BM25, fused with RRF
└── surfaces/ # cli.py · http_api.py (FastAPI) · webui.py · mcp_server.py (MCP)
```

### Design invariants (don't break these)

- **SQLite is the source of truth; FAISS is a rebuildable cache.** Vectors are stored as
BLOBs in SQLite, so `FaissVectorIndex.rebuild_from_store()` can always reconstruct the
index. `ensure_consistent()` does this automatically when counts disagree.
- **`chunks.id` is the FAISS id and is `AUTOINCREMENT`** — ids are never reused, which keeps
a stale cache from resurrecting deleted content.
- **Delete-before-add.** A changed file's old chunks are removed from both SQLite and FAISS
before new ones are added (`Indexer._write`). This is the bug the old `monitor.py` had.
- **One LanceDB store holds everything** (chunk metadata, text/BM25, and vectors/ANN). It is
rebuildable from source: re-indexing recreates it, and a `--full` pass clears and rebuilds.
- **`chunks.id` is a store-managed integer id** used as the fusion/hydrate key; ids are not
reused within a run.
- **Delete-before-add.** A changed file's old rows are removed before new ones are added
(`Indexer._write` → `LanceStore.write_file(replace=True)`), so editing never accumulates
stale or duplicate rows.
- **The embedding dimension comes from the provider**, never a hard-coded constant. A model
change is detected via `meta.embed_dim` and triggers a clean rebuild.
change is detected via the store's `meta.json` and clears the store for a clean re-index.
- **Writes serialize; reads don't block.** All indexing/deletion goes through one lock on the
`CodeRAG` facade (`_index_lock`), and `FaissVectorIndex` guards its own add/remove/search/
rebuild — so the MCP server's background index and live watcher run safely alongside
`CodeRAG` facade (`_index_lock`); the store buffers writes on the writer and reads query
committed data — so the MCP server's background index and live watcher run safely alongside
concurrent agent searches. Indexing may parallelize chunk+embed across `index_workers`
threads, but the SQLite/FAISS writes stay single-writer (`Indexer._write`).
threads, but the store writes stay single-writer (`Indexer._write`).
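
The delete-before-add and single-writer invariants above can be illustrated with a toy in-memory store. This is a sketch under stated assumptions: `InMemoryStore` is hypothetical and only mirrors the `write_file(replace=True)` shape named in the list; it says nothing about how the real `LanceStore` is implemented.

```python
import threading
from typing import Dict, List, Sequence


class InMemoryStore:
    """Toy stand-in for a chunk store: rows are keyed by file path."""

    def __init__(self) -> None:
        self._rows: Dict[str, List[str]] = {}
        # One lock serializes all writes (the role `_index_lock` plays above).
        self._lock = threading.Lock()

    def write_file(
        self, path: str, chunks: Sequence[str], *, replace: bool = True
    ) -> None:
        with self._lock:
            if replace:
                # Delete-before-add: drop the file's old rows first so an
                # edited file never leaves stale or duplicate rows behind.
                self._rows.pop(path, None)
            self._rows.setdefault(path, []).extend(chunks)

    def rows_for(self, path: str) -> List[str]:
        with self._lock:
            return list(self._rows.get(path, []))


store = InMemoryStore()
store.write_file("a.py", ["def old(): ..."])
store.write_file("a.py", ["def new(): ...", "class New: ..."])  # re-index after an edit
```

After the second call only the new chunks remain for `a.py`, which is the property the invariant protects.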

## Quality gate

33 changes: 15 additions & 18 deletions README.md
@@ -39,7 +39,7 @@ Coding agents like Claude Code and Codex locate code by *running searches* — g
repeat — which burns tokens and round-trips and reduces to literal keyword matching. CodeRAG
turns the workspace into a **warm, pre-indexed** engine: a single query returns the right
functions and files ranked by **meaning *and* keyword**, with exact `path:line` citations. The
embedding model loads once, so each query is one in-process lookup (FAISS + BM25 + fusion), not
embedding model loads once, so each query is one in-process lookup (vector ANN + BM25 + fusion), not
a multi-round shell loop — and over MCP (`coderag mcp`, below) it becomes the agent's search tool.

**Proof from the eval harness** — this repo's 24 natural-language → file queries (90 files /
@@ -72,7 +72,7 @@ and the honest caveats — is in [`docs/eval.md`](docs/eval.md).
- **Drop-in for AI coding agents — one command.** `coderag install` wires the **MCP server** into **Claude Code**, **Hermes**, and **Codex** (auto-detect or an interactive wizard, idempotent, with backups) so they search a warm, pre-indexed workspace instead of slow grep/glob/read loops — ranked `path:line` results from a single call, index kept live as you edit. Works on a plain file directory too, not just code.
- **Measured, not guessed.** A built-in **evaluation harness** (`coderag eval`) scores retrieval quality — recall@k, MRR, nDCG@k at file *or* symbol level — and can mine a benchmark straight from your git history. Every default (1:1 hybrid, reranker opt-in, adaptive fusion off) is the choice the harness validated, including across an external repo.
- **Incremental & live.** Content-hashed indexing only re-embeds files that changed; a debounced watcher keeps the index current as you code. No duplicate or stale vectors.
- **Built to scale.** Exact `Flat` search for small repos, automatic switch to approximate `IVF` past a threshold so it stays fast at 100k+ chunks.
- **Built to scale.** An embedded [LanceDB](https://github.com/lancedb/lancedb) store: brute-force exact search for small repos, automatic ANN indexing past a threshold so it stays fast at 100k+ chunks.
- **Five surfaces, one engine.** CLI · Python library · HTTP/REST · web UI · MCP server — all thin wrappers over the same `CodeRAG` object.

### ⚡ One line: install + wire into your agent
@@ -158,7 +158,7 @@ from coderag import CodeRAG, Config
cr = CodeRAG(Config.from_env(watched_dir="/path/to/repo"))
cr.index()

for hit in cr.search("how is the FAISS index persisted?"):
for hit in cr.search("how is the vector index persisted?"):
    print(f"{hit.location} {hit.symbol} (sim={hit.similarity:.2f})")
    print(hit.text)
```
@@ -197,7 +197,7 @@ See it live (read-only, indexing this repo): **<https://coderag-ui.neverdecel.co
Tools like Claude Code and Codex locate code with iterative `grep`/`glob`/read loops. CodeRAG
exposes the same workspace as a **Model Context Protocol** server, so an agent gets fast,
ranked `path:line` results from a single call against a **warm, pre-indexed** workspace — the
embedding model loads once and every query is then one in-process lookup (FAISS + BM25 +
embedding model loads once and every query is then one in-process lookup (vector ANN + BM25 +
fusion), not a multi-round shell search.

```bash
@@ -327,21 +327,20 @@ scheduled reindex — in [`deploy/README.md`](deploy/README.md).
graph LR
A[Source files] --> B[Symbol-aware chunking<br/>ast / tree-sitter]
B --> C[Embeddings<br/>fastembed · OpenAI · self-hosted]
C --> D[(SQLite store<br/>chunks + vectors + FTS5)]
D --> E[FAISS index<br/>Flat → IVF]
C --> D[(LanceDB store<br/>chunks + vectors + BM25)]
Q[Query] --> F[Dense + BM25]
E --> F
D --> F
F --> G[Reciprocal Rank Fusion]
G --> H[Ranked hits<br/>path:line + score]
```

- **SQLite is the source of truth** (chunk text, line ranges, symbols, content hashes, and the
raw vectors). The **FAISS index is a rebuildable cache** — it can always be reconstructed
from SQLite, so switching models or index types never corrupts your data.
- Each file's content is **hashed**; unchanged files are skipped on re-index. A changed file's
old chunks are removed from *both* the store and the vector index **before** new ones are
added — so editing never accumulates stale or duplicate vectors.
- **One embedded LanceDB store** holds everything — chunk text, line ranges, symbols, content
hashes, the vectors (ANN), and the BM25 index — so there is no separate cache to keep in
sync. The store is also a rebuildable view of your code: it can always be re-indexed from
source, so switching embedding models never corrupts your data.
- Each file's content is **hashed**; unchanged files are skipped on re-index (a cheap
size+mtime check avoids even reading them). A changed file's old chunks are removed
**before** new ones are added — so editing never accumulates stale or duplicate vectors.
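
The fusion step in the diagram (dense and BM25 rankings merged by Reciprocal Rank Fusion) can be sketched like this. It is a generic RRF illustration, not the project's retrieval code: the function name is hypothetical, and `k = 60` is the constant commonly used with RRF, assumed here because the text does not state the value CodeRAG uses.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Fuse several best-first rankings into one list of ids."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); only ranks matter, so the
            # dense and BM25 scores never need to share a scale.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_ids = [3, 1, 2]  # chunk ids from vector search, best first
bm25_ids = [1, 4, 3]   # chunk ids from lexical search, best first
fused = reciprocal_rank_fusion([dense_ids, bm25_ids])
print(fused)  # [1, 3, 4, 2]
```

Chunks that appear in both lists (ids 1 and 3 here) rise to the top, which is why a shared integer id per chunk is convenient as the fusion key.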

## ⚙️ Configuration

@@ -372,7 +371,7 @@ no API key needed for a local server:
ollama serve && ollama pull llama3.1
export OPENAI_BASE_URL=http://localhost:11434/v1 # Ollama's OpenAI-compatible endpoint
export CODERAG_CHAT_MODEL=llama3.1
coderag search "how is the FAISS index persisted" --answer # answer written locally
coderag search "how is the vector index persisted" --answer # answer written locally
```

### Common settings
@@ -385,9 +384,7 @@ table is in [`docs/configuration.md`](docs/configuration.md).
| `CODERAG_PROVIDER` | `fastembed` | Embedding backend: `fastembed` (local) · `openai` (OpenAI API **or** any OpenAI-compatible/local server) · `fake` |
| `CODERAG_MODEL` | `BAAI/bge-small-en-v1.5` | Local embedding model (`coderag eval --list-models`) |
| `CODERAG_WATCHED_DIR` | cwd | Codebase to index |
| `CODERAG_STORE_DIR` | `./.coderag` | Where the DB + index live |
| `CODERAG_INDEX_TYPE` | `auto` | `auto` · `flat` · `ivf` |
| `CODERAG_IVF_THRESHOLD` | `50000` | Vectors before switching Flat → IVF |
| `CODERAG_STORE_DIR` | `./.coderag` | Where the LanceDB store lives |
| `CODERAG_TOP_K` | `8` | Results returned |
| `OPENAI_BASE_URL` | – | Point at a self-hosted / local OpenAI-compatible server (Ollama, vLLM, LM Studio, LocalAI) — enables local embeddings **and** local answers |
| `OPENAI_API_KEY` | – | OpenAI **cloud** embeddings / answers (optional for a local server) |
@@ -431,7 +428,7 @@ Apache License 2.0 — see [LICENSE](LICENSE).

## 🙏 Acknowledgments

[FAISS](https://github.com/facebookresearch/faiss) · [fastembed](https://github.com/qdrant/fastembed) ·
[LanceDB](https://github.com/lancedb/lancedb) · [fastembed](https://github.com/qdrant/fastembed) ·
[tree-sitter](https://tree-sitter.github.io/tree-sitter/) · [FastAPI](https://fastapi.tiangolo.com/) ·
[Jinja](https://jinja.palletsprojects.com/) · [Pygments](https://pygments.org/) · [watchdog](https://github.com/gorakhargosh/watchdog)

4 changes: 2 additions & 2 deletions coderag/__init__.py
@@ -18,7 +18,7 @@

if TYPE_CHECKING:
    # Re-exported lazily at runtime via __getattr__ below (keeps ``import coderag``
    # light — no faiss/fastembed pulled in at import). Declared here only so type
    # light — no lancedb/fastembed pulled in at import). Declared here only so type
    # checkers and static analysis see ``CodeRAG`` as a defined export of __all__.
    from coderag.api import CodeRAG

@@ -28,7 +28,7 @@


def __getattr__(name: str) -> object:
    # Lazy re-export so ``import coderag`` stays light (no faiss/fastembed at import).
    # Lazy re-export so ``import coderag`` stays light (no lancedb/fastembed at import).
    if name == "CodeRAG":
        from coderag.api import CodeRAG

133 changes: 127 additions & 6 deletions coderag/_ignore.py
@@ -1,16 +1,23 @@
"""Shared ignore-glob matching for indexing and exact filesystem search.
"""Shared file-walking + ignore matching for indexing and exact filesystem search.

Both the :class:`~coderag.indexer.Indexer` and the exact filesystem search
(:mod:`coderag.fs_search`) must skip the *same* set of paths — vendored deps, VCS
directories, build output — or the two would disagree about what "the workspace" is.
The matching rule lives here so both callers stay in lock-step instead of each
re-implementing it.
(:mod:`coderag.fs_search`) must enumerate the *same* set of paths — skipping vendored
deps, VCS directories, build output, and (optionally) anything matched by ``.gitignore`` —
or the two would disagree about what "the workspace" is. The single :func:`walk_files`
generator below is the one place that decision is made, so both callers stay in lock-step.
"""

from __future__ import annotations

import fnmatch
from typing import Iterable, Set
import logging
import os
from pathlib import Path
from typing import Iterable, Iterator, List, Optional, Set, Tuple

logger = logging.getLogger(__name__)

GITIGNORE_FILE = ".gitignore"


def ignore_dir_names(ignore_globs: Iterable[str]) -> Set[str]:
@@ -33,3 +40,117 @@ def is_ignored(rel: str, ignore_globs: Iterable[str], ignore_dirs: Set[str]) ->
    if ignore_dirs.intersection(parts):
        return True
    return any(fnmatch.fnmatch(rel, g) for g in ignore_globs)


def _is_ancestor(base: str, dir_rel: str) -> bool:
    """Whether a ``.gitignore`` at ``base`` still applies at ``dir_rel`` (``""`` = root)."""
    if base == "":
        return True
    return dir_rel == base or dir_rel.startswith(base + "/")


class _GitignoreMatcher:
    """Honor nested ``.gitignore`` files during a top-down walk (nearest rule wins).

    A ``.gitignore`` at directory ``B`` scopes its patterns to paths under ``B``; the
    closest file's rules take precedence and may re-include via ``!``. We keep a stack of
    ``(base_rel, spec)`` ordered root→leaf, trimmed to the current directory's ancestors as
    the (DFS pre-order) walk moves, and test a path nearest-first using pathspec's
    tri-state ``check_file`` (ignore / negated-include / no-match). A no-op if pathspec is
    somehow unavailable, so indexing never hard-fails on a missing optional dependency.
    """

    def __init__(self) -> None:
        try:
            from pathspec import GitIgnoreSpec
        except ImportError:  # pragma: no cover - pathspec is a declared dependency
            logger.warning(
                "pathspec not installed; .gitignore files will not be honored."
            )
            self._spec_cls = None
        else:
            self._spec_cls = GitIgnoreSpec
        self._stack: List[Tuple[str, object]] = []

    @property
    def enabled(self) -> bool:
        return self._spec_cls is not None

    def enter(self, dir_rel: str, dir_abs: Path) -> None:
        """Refresh the active-rule stack for ``dir_rel`` and load its ``.gitignore``."""
        if self._spec_cls is None:
            return
        # Drop rules from sibling subtrees we've left; keep only ancestors of dir_rel.
        self._stack = [
            (base, spec) for base, spec in self._stack if _is_ancestor(base, dir_rel)
        ]
        try:
            text = (dir_abs / GITIGNORE_FILE).read_text(
                encoding="utf-8", errors="replace"
            )
        except OSError:
            return  # no .gitignore here (or unreadable)
        self._stack.append((dir_rel, self._spec_cls.from_lines(text.splitlines())))

    def match(self, rel: str, *, is_dir: bool) -> bool:
        """True if ``rel`` (root-relative POSIX) is ignored by the active rules."""
        if not self._stack:
            return False
        suffix = "/" if is_dir else ""
        for base, spec in reversed(self._stack):
            sub = rel if base == "" else rel[len(base) + 1 :]
            result = spec.check_file(sub + suffix)  # type: ignore[attr-defined]
            if result.include is not None:
                return bool(result.include)
        return False


def walk_files(
    start: Path,
    ignore_globs: Iterable[str],
    *,
    root: Optional[Path] = None,
    use_gitignore: bool = True,
) -> Iterator[Tuple[Path, str]]:
    """Yield ``(absolute_path, posix_rel)`` for every non-ignored file under ``start``.

    ``rel`` is relative to ``root`` (defaults to ``start``) so every caller shares one
    notion of the workspace. Ignored directories are pruned *before descending* (the big
    win at ``/home`` scale), honoring ``ignore_globs`` (dir-name prune + path globs) and,
    when ``use_gitignore``, nested ``.gitignore`` files.
    """
    start = Path(start)
    root = Path(root) if root is not None else start
    globs = tuple(ignore_globs)
    ignore_dirs = ignore_dir_names(globs)
    matcher = _GitignoreMatcher() if use_gitignore else None
    active = matcher if (matcher is not None and matcher.enabled) else None

    for dirpath, dirnames, filenames in os.walk(start):
        d_abs = Path(dirpath)
        try:
            d_rel = "" if d_abs == root else d_abs.relative_to(root).as_posix()
        except ValueError:  # pragma: no cover - start outside root
            continue
        if active is not None:
            active.enter(d_rel, d_abs)

        kept: List[str] = []
        for name in dirnames:
            if name in ignore_dirs:
                continue
            rel = name if d_rel == "" else f"{d_rel}/{name}"
            if is_ignored(rel, globs, ignore_dirs):
                continue
            if active is not None and active.match(rel, is_dir=True):
                continue
            kept.append(name)
        dirnames[:] = kept

        for name in filenames:
            rel = name if d_rel == "" else f"{d_rel}/{name}"
            if is_ignored(rel, globs, ignore_dirs):
                continue
            if active is not None and active.match(rel, is_dir=False):
                continue
            yield d_abs / name, rel
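
The "pruned before descending" behavior of `walk_files` rests on a standard `os.walk` idiom: in a top-down walk, assigning to `dirnames[:]` in place controls which subdirectories are visited next. A stripped-down sketch follows, assuming a hypothetical `walk_pruned` helper that prunes by directory name only (no globs, no `.gitignore`).

```python
import os
from pathlib import Path
from typing import Iterator, Set


def walk_pruned(start: Path, skip_dirs: Set[str]) -> Iterator[str]:
    """Yield POSIX paths relative to ``start``, never entering ``skip_dirs``."""
    for dirpath, dirnames, filenames in os.walk(start):
        # Slice assignment mutates the list os.walk reads back, so pruned
        # directories are never listed. Rebinding with `dirnames = [...]`
        # would not prune anything.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
        for name in sorted(filenames):
            yield (Path(dirpath) / name).relative_to(start).as_posix()
```

Filtering the yielded files afterwards would give the same result set, but the walk would still pay for listing every pruned subtree, which is the cost the docstring calls out at `/home` scale.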