
feat: CodeRAG v1.0 — standalone, local-first semantic code-search engine#4

Merged
Neverdecel merged 5 commits into master from feat/v1-revamp on Jun 1, 2026

@Neverdecel
Owner

What & why

Reworks CodeRAG from a file-averaged, OpenAI-only POC into a standalone, local-first semantic code-search engine for large, custom, or private codebases. It runs with no API key (local ONNX embeddings by default) and is usable via CLI, a Python library, an HTTP/REST server, and a Streamlit UI.

Deliberately not an IDE/MCP plugin — those tools already do code RAG. This is for codebases too big/custom/private for off-the-shelf assistants, or for embedding search into your own tools.

Bugs fixed from the old design

  • Duplicate/stale vectors: the old watchdog monitor appended a new vector on every save with no delete (and IndexFlatIP can't delete). Now delete-before-add with a deletable index — a test proves the store and FAISS stay in lock-step.
  • Mushy retrieval: whole files were embedded and chunk vectors averaged into one. Now symbol-aware chunks (functions/classes/methods) with file:line citations.
  • Wasted cost: full re-embed on every boot → content-hash incremental indexing (unchanged files skipped).
  • Fragile storage: per-query disk reload + pickled metadata.npy → SQLite source of truth, FAISS as a rebuildable cache.
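
For reviewers who want the content-hash and delete-before-add idea in miniature, here is a hedged sketch. It is not the code in this PR: `TinyIndex`, `upsert`, and the blank-line splitter are hypothetical stand-ins, and plain dicts replace the SQLite store and FAISS index.

```python
import hashlib


class TinyIndex:
    """Toy stand-in for the chunk store (hypothetical, not CodeRAG's class)."""

    def __init__(self) -> None:
        self.hashes: dict[str, str] = {}        # path -> content hash
        self.chunks: dict[str, list[str]] = {}  # path -> chunk texts

    def upsert(self, path: str, text: str) -> bool:
        """Index one file; return True if it was (re)indexed, False if skipped."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self.hashes.get(path) == digest:
            return False  # unchanged content: nothing to re-embed
        # Delete-before-add: drop every old chunk for this path first,
        # so a re-save cannot leave duplicate or stale entries behind.
        self.chunks.pop(path, None)
        # Assumption: split on blank lines as a placeholder for real chunking.
        self.chunks[path] = [b for b in text.split("\n\n") if b.strip()]
        self.hashes[path] = digest
        return True


index = TinyIndex()
original = "def f():\n    return 1\n\ndef g():\n    return 2\n"
first = index.upsert("a.py", original)    # new file: indexed
second = index.upsert("a.py", original)   # same bytes: skipped
third = index.upsert("a.py", "def f():\n    return 3\n")  # changed: replaced
```

After the third call the path holds only the chunks from the latest content, which is the invariant the no-duplicate test is described as checking.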

New capabilities

  • Local-first embeddings (fastembed/ONNX), OpenAI opt-in; dim comes from the provider, model switch auto-rebuilds.
  • Hybrid dense + BM25 retrieval fused with Reciprocal Rank Fusion.
  • Multi-language: Python (ast) + JS/TS/Go/Rust/Java (tree-sitter), line-window fallback for the rest.
  • Built to scale: pluggable Flat→IVF vector index (auto-switch past a threshold).
  • One engine (CodeRAG facade) behind four thin surfaces.
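
To illustrate the fusion step, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. The function name `rrf_fuse`, the chunk ids, and the constant `k = 60` (the value commonly used in the RRF literature) are assumptions; the PR does not state which constant or tie-breaking CodeRAG uses.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the dense and BM25 retrievers.
dense = ["a.py:10", "b.py:3", "c.py:7"]
bm25 = ["b.py:3", "d.py:1", "a.py:10"]
fused = rrf_fuse([dense, bm25])
```

Because RRF uses only ranks, the dense similarity scores and BM25 scores never need to be put on a common scale.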

Surfaces

```text
coderag index | search "..." | watch | serve | ui | status   # CLI
from coderag import CodeRAG                                  # library
coderag serve -> GET /search /status /file, POST /index      # HTTP/REST ([server])
coderag ui                                                   # Streamlit ([ui])
```

Verification

  • black / isort / flake8 / mypy clean.
  • 54 tests pass (53 offline + 1 real-fastembed integration test). Offline tests use a deterministic fake embedder — no network, no downloads.
  • CI runs a Python 3.11 / 3.12 matrix, offline (-m "not integration").
  • Smoke-tested end-to-end on this repo: 45 files → 312 symbol-level chunks; re-index skips all unchanged files.
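
As an illustration of how an offline test double can be deterministic, here is a hedged sketch of a hash-based fake embedder. The PR names an EmbeddingProvider abstraction but does not show its interface, so the `dim` attribute, the `embed` method, and `FakeEmbedder` itself are assumptions made for this example.

```python
import hashlib
import math
from typing import Protocol


class EmbeddingProvider(Protocol):
    """Assumed interface: the provider reports its own dimension."""

    dim: int

    def embed(self, texts: list[str]) -> list[list[float]]: ...


class FakeEmbedder:
    """Deterministic unit vectors derived from a SHA-256 digest (no network)."""

    def __init__(self, dim: int = 8) -> None:
        self.dim = dim

    def embed(self, texts: list[str]) -> list[list[float]]:
        vectors = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Centre the bytes around zero; 127.5 keeps every value non-zero.
            raw = [digest[i % len(digest)] - 127.5 for i in range(self.dim)]
            norm = math.sqrt(sum(x * x for x in raw))
            vectors.append([x / norm for x in raw])
        return vectors


embedder: EmbeddingProvider = FakeEmbedder(dim=8)
vec_a, vec_b = embedder.embed(["def f(): pass", "def f(): pass"])
```

Identical text always maps to the identical vector, so tests built on such a double can assert exact results without downloading a model.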

Notes for reviewers

  • Clean break on the on-disk format (new SQLite store); old indexes should be rebuilt — just run coderag index.
  • Switched tree-sitter from tree-sitter-language-pack (shipped an incompatible bundled binding) to the official per-language grammar wheels on the modern 0.25 API.
  • Old entry points removed: main.py, app.py, prompt_flow.py, coderag/{index,search,monitor,embeddings,cli}.py, scripts/.

Commits are split: core engine → surfaces → tests → docs/CI.

🤖 Generated with Claude Code

ridel550 and others added 5 commits June 1, 2026 20:00
…bol-aware chunking, hybrid retrieval

Replace the file-averaged, OpenAI-only, append-only POC engine with a
chunk-level, incremental, hybrid-search engine:

- EmbeddingProvider abstraction (fastembed local default, OpenAI opt-in, fake for tests);
  embedding dim comes from the provider, not a hardcoded constant.
- SQLite is the source of truth (files/chunks/vectors/FTS5); FAISS is a rebuildable cache.
- Pluggable vector index: exact Flat, auto-switching to IVF past a scale threshold.
- Symbol-aware chunking: Python via ast, JS/TS/Go/Rust/Java via tree-sitter, line-window fallback.
- Incremental indexing with content hashing and delete-before-add (fixes the duplicate/stale
  vector bug from the old watchdog monitor).
- Hybrid dense + BM25 retrieval fused with Reciprocal Rank Fusion.
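
A minimal sketch of symbol-aware chunking for Python using the standard-library ast module, for illustration only. The function name `python_symbol_chunks` and the dict layout are hypothetical and are not taken from this commit.

```python
import ast


def python_symbol_chunks(source: str, path: str) -> list[dict]:
    """Return one chunk per function, method, or class, with a file:line citation."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start, end = node.lineno, node.end_lineno
            chunks.append(
                {
                    "symbol": node.name,
                    "start": start,
                    "citation": f"{path}:{start}",
                    "text": "\n".join(lines[start - 1 : end]),
                }
            )
    # ast.walk is breadth-first, so sort to get source order.
    return sorted(chunks, key=lambda c: c["start"])


SAMPLE = (
    "class Greeter:\n"
    "    def hello(self):\n"
    "        return 'hi'\n"
    "\n"
    "\n"
    "def main():\n"
    "    return Greeter().hello()\n"
)
chunks = python_symbol_chunks(SAMPLE, "demo.py")
```

In this sketch a method appears both inside its class chunk and as its own chunk; whether the real chunker deduplicates nested symbols is not stated here.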

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

All four surfaces are thin adapters over the CodeRAG facade:
- coderag CLI: index / search / watch / serve / ui / status (entry point 'coderag').
- FastAPI server (coderag serve): GET /search /status /file, POST /index ([server] extra).
- Streamlit UI (coderag ui): streamed answers, file:line citations, scores, reindex ([ui] extra).

Removes the old main.py / app.py / prompt_flow.py / cli.py / scripts entry points.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Covers config/providers, SQLite store + Flat/IVF vector index, chunking across
languages, incremental indexing + no-duplicate invariant, RRF + hybrid search,
and the CLI/HTTP/watcher surfaces. Default run is fully offline via a deterministic
fake embedder; the real fastembed model is exercised only under -m integration.

Removes the old smoke tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…OPMENT, modernize CI

- README: drop the 'made obsolete by Cursor' framing; present CodeRAG as a standalone,
  local-first semantic code-search engine for large/custom codebases, with CLI / library /
  HTTP / UI quickstarts and an architecture diagram.
- AGENTS.md / DEVELOPMENT.md: document the new module layout and design invariants.
- pyproject: v1.0.0, new deps (fastembed, tree-sitter grammars, tqdm), extras
  (server/ui/openai), single 'coderag' entry point, pytest integration marker.
- CI: 3.11/3.12 matrix running black/isort/flake8/mypy + offline pytest (-m 'not integration').
- example.env rewritten around CODERAG_* config; .flake8 added; tooling and .gitignore updated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CI runs 'pytest' directly (not 'python -m pytest'), so the repo root wasn't on
sys.path and 'from tests.conftest import ...' failed with ModuleNotFoundError.
Add pythonpath=['.'] to the pytest config so it resolves regardless of invocation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Neverdecel Neverdecel merged commit c685a2f into master Jun 1, 2026
2 checks passed
@Neverdecel Neverdecel deleted the feat/v1-revamp branch June 1, 2026 18:06