Merged
19 changes: 19 additions & 0 deletions .gitleaks.toml
@@ -0,0 +1,19 @@
# gitleaks configuration for CodeRAG.
#
# Keeps the full default ruleset and only adds a narrow allowlist for fake,
# secret-shaped strings used in test fixtures (e.g. the redaction test feeds a
# dummy `token = "..."` line to confirm it gets masked). These are not real
# credentials; without this, gitleaks' generic-api-key rule fails CI on test data.

[extend]
useDefault = true

[allowlist]
description = "Fake secret-shaped strings in test fixtures (not real secrets)"
# Match if the finding is in this test file OR is the known dummy literal.
paths = [
'''tests/test_fs_search\.py''',
]
regexes = [
'''abcdef123456''',
]
3 changes: 3 additions & 0 deletions AGENTS.md
@@ -8,6 +8,9 @@
- `coderag/store/`: `sqlite_store.py` (source of truth + FTS5) and `vector_index.py` (FAISS Flat/IVF cache).
- `coderag/retrieval/`: Hybrid dense + BM25 search fused with RRF.
- `coderag/indexer.py`, `coderag/watch.py`: Incremental indexing and the debounced watcher.
- `coderag/_ignore.py`: Shared ignore-glob matching used by both the indexer and `fs_search`.
- `coderag/fs_search.py`: Exact regex/glob search (ripgrep-backed, Python fallback) — the literal-match complement to hybrid search; powers the MCP `search_files` tool.
- `coderag/install.py`: `coderag install` — registers the MCP server into Claude Code / Hermes / Codex.
- `coderag/surfaces/`: `cli.py`, `http_api.py` (FastAPI), `webui.py`, `mcp_server.py` (MCP, for AI agents) — thin adapters over the facade.
- `tests/`: pytest suite (offline by default via the `fake` provider; real model behind `-m integration`).
- `example.env` → copy to `.env`; CI lives in `.github/`.
32 changes: 29 additions & 3 deletions README.md
@@ -115,6 +115,7 @@ coderag watch # index, then keep it live as files change
coderag serve --port 8000 # run the HTTP API (needs [server])
coderag ui # launch the web UI (needs [ui])
coderag mcp # MCP server for AI agents (needs [mcp]); --all-text for any dir
coderag install [TARGET] # wire the MCP server into Claude Code / Hermes / Codex
coderag status # index stats (files, chunks, model, index type)
coderag eval --dataset d.jsonl --compare # retrieval quality: dense vs BM25 vs hybrid
```
@@ -185,10 +186,25 @@ coderag mcp --all-text # index ALL text files (docs/notes/config), not just

It auto-indexes the working directory on startup (in the **background**, so it's responsive
immediately) and keeps the index live with the watcher — zero manual steps. Tools exposed:
**`search_code`** (hybrid search, compact snippets + `path:line`), **`get_file`** (read a
precise range of an indexed file), **`index_status`** (coverage/freshness), and **`reindex`**.
**`search_code`** (hybrid semantic search, compact snippets + `path:line`), **`search_files`**
(exact regex/glob search, ripgrep-backed — the literal-match complement to `search_code`),
**`get_file`** (read a precise range of an indexed file, optional line numbers + "did you
mean?" hints), **`index_status`** (coverage/freshness), and **`reindex`**.

Wire it into an agent (the server defaults to the directory it's launched in):
#### One-command install (`coderag install`)

Register the server into an agent without hand-editing any config:

```bash
coderag install # auto-detect installed agents and wire them up
coderag install --wizard # interactive: pick agents, workspace, exposed tools
coderag install hermes --print # preview the exact config change without writing
```

Supported targets: **Claude Code** (`.mcp.json`), **Hermes** (`~/.hermes/config.yaml`, with
`tools.include`), and **Codex** (`~/.codex/config.toml`). It is idempotent and backs up any
file it changes to `*.bak`. The equivalent manual config (the server defaults to the
directory it's launched in):

```bash
# Claude Code
@@ -207,6 +223,16 @@ command = "coderag"
args = ["mcp"]
```

```yaml
# Hermes: ~/.hermes/config.yaml
mcp_servers:
  coderag:
    command: coderag
    args: [mcp]
    tools:
      include: [search_code, search_files, get_file, index_status, reindex]
```

> If `coderag` isn't on the launcher's PATH, use an absolute path (or `python -m coderag.surfaces.cli mcp`).
> To index a directory other than where the client launches, add `"--watched-dir", "/abs/path"` to `args`.
> Fast by default (local `bge-small`, no reranker); set `CODERAG_RERANK=1` to trade ~30 ms/query for sharper top results.
35 changes: 35 additions & 0 deletions coderag/_ignore.py
@@ -0,0 +1,35 @@
"""Shared ignore-glob matching for indexing and exact filesystem search.

Both the :class:`~coderag.indexer.Indexer` and the exact filesystem search
(:mod:`coderag.fs_search`) must skip the *same* set of paths — vendored deps, VCS
directories, build output — or the two would disagree about what "the workspace" is.
The matching rule lives here so both callers stay in lock-step instead of each
re-implementing it.
"""

from __future__ import annotations

import fnmatch
from typing import Iterable, Set


def ignore_dir_names(ignore_globs: Iterable[str]) -> Set[str]:
    """Top-level directory names that can be pruned wholesale during a walk.

    Derived from ``"<name>/*"`` globs (e.g. ``"node_modules/*"`` -> ``"node_modules"``)
    so ``os.walk`` can drop the whole subtree without visiting every entry, and so a
    *nested* ``node_modules`` is ignored too (matched by path component, not just prefix).
    """
    return {g[:-2] for g in ignore_globs if g.endswith("/*") and "/" not in g[:-2]}


def is_ignored(rel: str, ignore_globs: Iterable[str], ignore_dirs: Set[str]) -> bool:
    """True if the POSIX relative path ``rel`` should be skipped.

    A path is ignored if any of its components is an ignored directory name, or if the
    whole relative path matches one of ``ignore_globs``.
    """
    parts = rel.split("/")
    if ignore_dirs.intersection(parts):
        return True
    return any(fnmatch.fnmatch(rel, g) for g in ignore_globs)
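
A minimal sketch of how a caller might combine the two helpers with `os.walk`, assuming the pruning pattern the docstrings describe. The helpers are restated compactly so the block runs on its own; `iter_files` and `demo` are hypothetical names for illustration and are not part of this PR.

```python
import fnmatch
import os
import tempfile
from typing import Iterable, Iterator, List, Set


def ignore_dir_names(ignore_globs: Iterable[str]) -> Set[str]:
    # "node_modules/*" -> "node_modules"; globs with a nested "/" are left to fnmatch.
    return {g[:-2] for g in ignore_globs if g.endswith("/*") and "/" not in g[:-2]}


def is_ignored(rel: str, ignore_globs: Iterable[str], ignore_dirs: Set[str]) -> bool:
    if ignore_dirs.intersection(rel.split("/")):
        return True
    return any(fnmatch.fnmatch(rel, g) for g in ignore_globs)


def iter_files(root: str, ignore_globs: Iterable[str]) -> Iterator[str]:
    """Hypothetical walker: yield POSIX relative paths that survive the ignore rules."""
    globs = list(ignore_globs)  # materialize once so the globs can be reused per file
    skip_dirs = ignore_dir_names(globs)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into an ignored subtree.
        dirnames[:] = sorted(d for d in dirnames if d not in skip_dirs)
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            if not is_ignored(rel, globs, skip_dirs):
                yield rel


def demo() -> List[str]:
    """Build a throwaway tree and list what survives the ignore rules."""
    layout = ("src/app.py", "src/node_modules/pkg/index.js", "build/out.o", "notes.log")
    with tempfile.TemporaryDirectory() as root:
        for rel in layout:
            path = os.path.join(root, *rel.split("/"))
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, "w").close()
        return list(iter_files(root, ["node_modules/*", "build/*", "*.log"]))
```

In this sketch `demo()` returns `["src/app.py"]`: the nested `node_modules` and the top-level `build` directory are pruned by name, and `notes.log` is dropped by the `*.log` glob. Note that `fnmatch`'s `*` also matches `/`, so a glob such as `*.log` applies at any depth.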
32 changes: 31 additions & 1 deletion coderag/api.py
@@ -10,7 +10,7 @@
import logging
import threading
from pathlib import Path
from typing import TYPE_CHECKING, List, Optional, Union
from typing import TYPE_CHECKING, Any, List, Optional, Union

from coderag._lines import split_lines
from coderag.config import Config
@@ -128,6 +128,36 @@ def search(self, query: str, top_k: Optional[int] = None) -> List[SearchHit]:
        """Hybrid (dense + lexical) search over the indexed codebase."""
        return self.searcher.search(query, top_k or self.config.top_k)

    def search_files(self, pattern: str, **kwargs: Any) -> dict:
        """Exact regex/glob search over the workspace (the complement to ``search``).

        Thin pass-through to :func:`coderag.fs_search.search_files`, wired to the
        configured ``watched_dir`` and ``ignore_globs`` so it sees exactly the same
        files the indexer does. See that function for the keyword arguments.
        """
        from coderag.fs_search import search_files

        return search_files(
            self.config.watched_dir,
            pattern,
            ignore_globs=self.config.ignore_globs,
            **kwargs,
        )

    def suggest_paths(self, path: Union[str, Path], n: int = 3) -> List[str]:
        """Indexed paths whose name is closest to ``path`` — for "did you mean?" hints."""
        import difflib

        name = Path(str(path)).name
        candidates = self.store.all_file_paths()
        # Match on basename first (agents often pass a bare filename), then full path.
        by_name = {c: Path(c).name for c in candidates}
        close = difflib.get_close_matches(name, list(by_name.values()), n=n, cutoff=0.5)
        hits = [c for c, base in by_name.items() if base in close]
        if not hits:
            hits = difflib.get_close_matches(str(path), candidates, n=n, cutoff=0.4)
        return hits[:n]
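
A standalone sketch of the "did you mean?" lookup above, assuming the candidate paths are passed in as a plain list instead of coming from `self.store.all_file_paths()`. The free function signature and the `indexed` sample list are hypothetical and only serve the illustration.

```python
import difflib
from pathlib import PurePosixPath
from typing import List


def suggest_paths(path: str, candidates: List[str], n: int = 3) -> List[str]:
    """Return up to n candidate paths whose basename (or full path) resembles `path`."""
    name = PurePosixPath(path).name
    # First pass: compare basenames only, since callers often pass a bare filename.
    by_name = {c: PurePosixPath(c).name for c in candidates}
    close = difflib.get_close_matches(name, list(by_name.values()), n=n, cutoff=0.5)
    hits = [c for c, base in by_name.items() if base in close]
    if not hits:
        # Fallback: compare the whole path with a looser cutoff.
        hits = difflib.get_close_matches(path, candidates, n=n, cutoff=0.4)
    return hits[:n]


# Hypothetical sample of indexed paths.
indexed = ["coderag/fs_search.py", "coderag/install.py", "tests/test_fs_search.py"]
suggestions = suggest_paths("fs_serch.py", indexed)
no_match = suggest_paths("zzz", indexed)
```

With this sample, the misspelled `fs_serch.py` should surface `coderag/fs_search.py` but not `coderag/install.py`, and a name with no resemblance to any indexed path yields an empty list. Because the first pass keeps every path whose basename matched, two files with the same basename in different directories would both be suggested.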

    def get_file(
        self,
        path: Union[str, Path],