Merged
Commits
21 commits
de3e45d
feat(embed): gated split-identifier words in embed text + distillatio…
anvanster Jun 26, 2026
eb60736
refactor(embed): Embedder trait — swappable backend behind VectorEngine
anvanster Jun 26, 2026
a30df5e
feat(embed): StaticEmbedding backend — model2vec-format lookup, no ONNX
anvanster Jun 26, 2026
1da5309
test(embed): throughput example — static vs ONNX
anvanster Jun 26, 2026
f883283
docs(embed): record progress + 103x release throughput in the plan
anvanster Jun 26, 2026
86fa334
test(embed): retrieval quality micro-eval — static floor vs BGE
anvanster Jun 26, 2026
68b1c8e
feat(embed): F16 + SIF-weighted static loading + Jina-Code distill sc…
anvanster Jun 26, 2026
28e7e88
docs(embed): record jina-code-static A/B — 70x faster, micro-eval sat…
anvanster Jun 26, 2026
4083113
test(embed): real 965-way retrieval eval — static ~65% of BGE (pure s…
anvanster Jun 26, 2026
56ff954
test(embed): complete static lever sweep — gap is the contextualizati…
anvanster Jun 26, 2026
a651f79
test(embed): hybrid (BM25+semantic) eval — static is ~90% of BGE end-…
anvanster Jun 26, 2026
55fd388
test(embed): embed_throughput can load a real corpus (CODEGRAPH_THROU…
anvanster Jun 27, 2026
7e5d866
feat(embed): EmbeddingBackend — select static via --embedding-model s…
anvanster Jun 28, 2026
2db0f08
docs(embed): expose static mode in VS Code + MCP + READMEs
anvanster Jun 28, 2026
c14174c
test(embed): server-side eval driver + full-body corpus extractor
anvanster Jun 28, 2026
49289d0
docs(embed): server-side eval — static ~94% of BGE through real symbo…
anvanster Jun 28, 2026
a7a6c33
perf(parser): gate 6 zero-usage grammars behind `extra-languages` (-2…
anvanster Jun 29, 2026
1ce7c7e
docs: note COBOL/Fortran/Perl/Dart/Zig/R are extra-languages-gated
anvanster Jun 29, 2026
865c994
feat(embed): static-model telemetry + release-independent model fetch
anvanster Jun 29, 2026
f0a6dbb
chore: bump version to 0.19.0 (static embeddings feature)
anvanster Jun 30, 2026
1391ebc
docs(embed): note static model ships bundled in VS Code extension
anvanster Jun 30, 2026
23 changes: 20 additions & 3 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 1 addition & 1 deletion Cargo.toml
@@ -60,7 +60,7 @@ members = [
]

[workspace.package]
-version = "0.18.6"
+version = "0.19.0"
edition = "2021"
license = "Apache-2.0"
repository = "https://github.com/codegraph-ai/codegraph"
27 changes: 25 additions & 2 deletions README.md
@@ -77,13 +77,29 @@ one tool and exits without the MCP stdio handshake — ideal for scripting.
|------|---------|-------------|
| `--workspace <path>` | current dir | Directories to index (repeatable for multi-project) |
| `--exclude <dir>` | — | Directories to skip (repeatable) |
-| `--embedding-model <model>` | `bge-small` | `bge-small` (384d, fast), `jina-code-v2` (768d, 6× slower), or `granite-97m` (384d, 32K ctx, ~3× slower) |
+| `--embedding-model <model>` | `bge-small` | `bge-small` (384d, fast), `jina-code-v2` (768d, 6× slower), `granite-97m` (384d, 32K ctx, ~3× slower), or `static` (model2vec, 256d — ~100× faster indexing, no ONNX; needs a local model dir, see below) |
| `--full-body-embedding` | `true` | Embed full function body (~50 lines) for better semantic search and duplicate detection |
| `--max-files <n>` | 5000 | Maximum files to index |
| `--profile <name>` | `all` | Filter the exposed MCP tool surface to a named subset (see below) |
| `--graph-only` | off | Skip embedding generation — build the graph and serve structural tools only. No ONNX model load, 10-50× faster indexing. Semantic search unavailable. For CI / one-shot graph queries. |
| `--run-tool <name>` | — | One-shot mode: index, run a single tool, print its result, exit. No MCP handshake. Pair with `--tool-args '<json>'`. |
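
The `--embedding-model` values in the table above can be pictured as a small enum. This is a minimal sketch, not the PR's actual code: the type name `ModelChoice` and its methods are hypothetical, and only the model names, dimensions, and the "no ONNX" property of `static` come from the table.

```rust
/// Hypothetical mirror of the `--embedding-model` choices listed in the table.
/// The names and dimensions are taken from the README; the type is illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ModelChoice {
    BgeSmall,
    JinaCodeV2,
    Granite97m,
    Static,
}

impl ModelChoice {
    /// Parse the flag value; unknown names yield `None`.
    fn parse(name: &str) -> Option<Self> {
        match name {
            "bge-small" => Some(Self::BgeSmall),
            "jina-code-v2" => Some(Self::JinaCodeV2),
            "granite-97m" => Some(Self::Granite97m),
            "static" => Some(Self::Static),
            _ => None,
        }
    }

    /// Embedding dimensionality as documented in the table.
    fn dimensions(self) -> usize {
        match self {
            Self::BgeSmall | Self::Granite97m => 384,
            Self::JinaCodeV2 => 768,
            Self::Static => 256,
        }
    }

    /// Only the static (model2vec) choice avoids loading an ONNX model.
    fn needs_onnx(self) -> bool {
        !matches!(self, Self::Static)
    }
}
```

Keeping the dimension next to the parsed choice makes it easy to reject an index built with one model when the server is started with another, assuming the index records its vector width.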

#### `--embedding-model static` — model2vec fast indexing

Static (model2vec) embeddings replace the ONNX transformer with a token→vector
lookup table: indexing is **~100× faster** (this repo's 5,873 symbols embed in
~1 s vs ~3.4 min with BGE) and there's **no ONNX runtime or 1.5 GB RAM gate**.
Retrieval stays **hybrid (BM25 + semantic)**, so end-to-end quality is **~90% of
BGE**. The VS Code extension ships the model bundled, so `static` works there
with no setup. For the CLI/MCP server it needs a local model directory
(`config.json` + `tokenizer.json` + `model.safetensors`):

- Point at it with `CODEGRAPH_STATIC_MODEL=/path/to/model` (or the VS Code
`codegraph.staticModelPath` setting to override the bundled model). Default:
`~/.codegraph/static_models/jina-code-static-256`.
- Distill one from any sentence-transformer (Apache-2.0 Jina-Code by default) in
~30 s on CPU: `python scripts/distill_static_model.py`.
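
The lookup order described in the list above (explicit `CODEGRAPH_STATIC_MODEL`, otherwise the default under `~/.codegraph`) can be sketched as follows. The function names are hypothetical, and treating an empty override as "unset" is an assumption, not something the README states.

```rust
use std::path::{Path, PathBuf};

// Default location and required files as listed in the README.
const DEFAULT_MODEL_SUBDIR: &str = ".codegraph/static_models/jina-code-static-256";
const REQUIRED_FILES: [&str; 3] = ["config.json", "tokenizer.json", "model.safetensors"];

/// Pick the model directory: a non-empty override wins, otherwise fall back
/// to the default under the user's home directory.
/// (Assumption: an empty or whitespace-only override counts as unset.)
fn resolve_static_model_dir(env_override: Option<&str>, home: &Path) -> PathBuf {
    match env_override.map(str::trim).filter(|s| !s.is_empty()) {
        Some(dir) => PathBuf::from(dir),
        None => home.join(DEFAULT_MODEL_SUBDIR),
    }
}

/// Report which of the three required model files are absent from `dir`.
fn missing_model_files(dir: &Path) -> Vec<&'static str> {
    REQUIRED_FILES
        .iter()
        .copied()
        .filter(|name| !dir.join(name).is_file())
        .collect()
}
```

A caller would pass `std::env::var("CODEGRAPH_STATIC_MODEL").ok().as_deref()` as the override and surface the result of `missing_model_files` in the error message when it is non-empty.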

#### `--profile` — narrow the MCP tool surface

The full 32-tool surface is convenient but inflates the agent's prompt-context cost. A profile exposes only the slice you need (also settable via the `CODEGRAPH_TOOL_PROFILE` env var):
@@ -103,7 +119,8 @@ The full 32-tool surface is convenient but inflates the agent's prompt-context c
"codegraph.indexOnStartup": true,
"codegraph.indexPaths": ["/path/to/project-a", "/path/to/project-b"],
"codegraph.excludePatterns": ["**/cmake-build-debug/**", "**/generated/**"],
-"codegraph.embeddingModel": "bge-small",
+"codegraph.embeddingModel": "bge-small", // or "static" for ~100× faster indexing
"codegraph.staticModelPath": "", // model2vec model dir when embeddingModel is "static"
"codegraph.maxFileSizeKB": 1024,
"codegraph.debug": false
}
@@ -313,6 +330,12 @@ Additional tools available in [CodeGraph Pro](https://codegraph.astudioplus.com/

HTTP handler detection: Python (FastAPI/Flask/Django), TypeScript (NestJS), Java (Spring/JAX-RS), Go (stdlib/Gin/Echo/Fiber), C# (ASP.NET), Ruby (Rails), PHP (Laravel/Symfony).

> **Community vs full builds:** COBOL, Fortran, Perl, Dart, Zig, and R are
> compiled only with `--features extra-languages`. The default community binary
> omits them — they had zero usage in telemetry and their tree-sitter grammars
> add ~25 MB (COBOL's parse tables alone are 30 MB). The other 32 languages are
> always available.

---

## Architecture
6 changes: 6 additions & 0 deletions crates/codegraph-memory/Cargo.toml
@@ -46,6 +46,12 @@ anyhow = "1.0"
# Logging
log = "0.4"

# Static (lookup-table) embeddings: HuggingFace tokenizer + a safetensors
# token->vector matrix, mean-pooled. No ONNX — the fast indexing path.
tokenizers = "0.21"
safetensors = "0.4"
half = "2"

# Embeddings - fastembed with BGE-Small-EN-v1.5
# macOS/Linux: static link ONNX Runtime (ort-download-binaries)
# Windows: load onnxruntime DLL at runtime (ort-load-dynamic, avoids CRT /MT vs /MD mismatch)
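
The dependency comment in this hunk describes static embeddings as a token->vector matrix that is mean-pooled. A minimal std-only sketch of that idea follows. It is an illustration under stated assumptions, not the crate's implementation: the function name `embed_static` is hypothetical, the table is a plain `Vec<Vec<f32>>` rather than a safetensors tensor, and tokenization, F16 decoding, and SIF weighting are left out.

```rust
/// Mean-pool the rows of a token->vector table selected by `token_ids`.
/// Ids outside the table are skipped; returns `None` if no id matched
/// or the table is empty. Rows are assumed to share one length.
fn embed_static(table: &[Vec<f32>], token_ids: &[usize]) -> Option<Vec<f32>> {
    let dim = table.first()?.len();
    let mut sum = vec![0.0f32; dim];
    let mut count = 0usize;

    for &id in token_ids {
        if let Some(row) = table.get(id) {
            // Accumulate this token's vector into the running sum.
            for (acc, v) in sum.iter_mut().zip(row) {
                *acc += *v;
            }
            count += 1;
        }
    }

    if count == 0 {
        return None;
    }

    // Divide by the number of pooled tokens to get the mean vector.
    let inv = 1.0 / count as f32;
    for v in &mut sum {
        *v *= inv;
    }
    Some(sum)
}
```

Because the work per symbol is a handful of row reads and additions with no transformer forward pass, this shape of computation is consistent with the large indexing speedup the README reports.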