feat(embed): static (model2vec) embeddings + extra-languages gating — v0.19.0 by anvanster · Pull Request #7 · codegraph-ai/CodeGraph

anvanster · 2026-06-30T22:50:29Z

Summary

Adds static (model2vec) embeddings, selectable with --embedding-model static: a token→vector lookup table that replaces the ONNX transformer for indexing. Indexing is ~100× faster (this repo's ~5,900 symbols embed in ~1 s vs ~3.4 min with BGE) and needs no ONNX runtime or 1.5 GB RAM gate. Retrieval stays hybrid (BM25 + semantic), so end-to-end quality is ~90% of BGE (server-side eval: BGE R@1 0.457 / MRR 0.568 vs static 0.430 / 0.536 over 300 queries).

Also trims binary size by ~25 MB by gating 6 zero-usage tree-sitter grammars (COBOL/Fortran/Perl/Dart/Zig/R) behind an extra-languages cargo feature (decision driven by PostHog language-usage telemetry).

Version bumped to 0.19.0.

What's included

EmbeddingBackend enum { Fastembed | Static(dir) }; --embedding-model static (CLI), codegraph.embeddingModel / codegraph.staticModelPath (VS Code), CODEGRAPH_STATIC_MODEL env.
Static loader: model2vec format (config.json + tokenizer.json + model.safetensors), F16/F32/F64, optional SIF token weighting, mean-pool + L2-norm.
Distill script (scripts/distill_static_model.py): ~30 s on CPU from Apache-2.0 Jina-Code.
Model packaging: VSIX bundles jina-code-static-256 (extension auto-selects it, zero setup); npm package fetches it on postinstall from the release-independent GitHub model tag.
Telemetry: mcp.start now reports embeddingModel so static-adoption can be tracked before flipping the default.
extra-languages feature gate: 6 grammars optional, off by default (−25 MB).
READMEs + examples updated across root / VS Code / MCP.

Excluded from this release

Durability WIP (with_sync_writes/fsync for the canonical persist) is intentionally not in this branch — it'll land separately.

🤖 Generated with Claude Code

…n plan build_embed_text can prepend the camelCase/snake_case-split form of a symbol's name (reusing the BM25 tokenizer) so static lookup-table embedders get the sub-words they can't subword-recover. Gated by set_split_identifiers, default OFF — the transformer (BGE/Jina) path is unchanged until A/B'd via the eval. Also adds docs/static-embeddings-plan.md (distill Jina-Code -> static, validated at every step). Phase 1.2. Tests cover the splitter + the gated prepend. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
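The identifier-splitting step described above can be sketched as follows. This is a minimal illustration, not the project's code: `split_identifier` and the three-argument form of `build_embed_text` are hypothetical stand-ins (the commit reuses the BM25 tokenizer and gates the behavior through `set_split_identifiers`), and the boundary rules are assumed.

```rust
/// Split an identifier into lowercase sub-words.
/// Assumed boundaries: any non-alphanumeric separator, lower-to-Upper
/// transitions ("fromBackend"), and the end of an acronym ("HTTPResponse").
pub fn split_identifier(name: &str) -> Vec<String> {
    let chars: Vec<char> = name.chars().collect();
    let mut words = Vec::new();
    let mut cur = String::new();
    for (i, &c) in chars.iter().enumerate() {
        if !c.is_alphanumeric() {
            // A separator such as '_' closes the current word.
            if !cur.is_empty() {
                words.push(std::mem::take(&mut cur));
            }
            continue;
        }
        if c.is_uppercase() && !cur.is_empty() {
            let prev = chars[i - 1];
            let next_is_lower = chars.get(i + 1).map_or(false, |n| n.is_lowercase());
            if prev.is_lowercase() || prev.is_numeric() || (prev.is_uppercase() && next_is_lower) {
                words.push(std::mem::take(&mut cur));
            }
        }
        cur.extend(c.to_lowercase());
    }
    if !cur.is_empty() {
        words.push(cur);
    }
    words
}

/// Hypothetical signature: prepend the split form only when the flag is on
/// and the split actually produced more than one sub-word.
pub fn build_embed_text(name: &str, body: &str, split_identifiers: bool) -> String {
    let parts = split_identifier(name);
    if split_identifiers && parts.len() > 1 {
        format!("{} {name} {body}", parts.join(" "))
    } else {
        format!("{name} {body}")
    }
}
```

With the flag off the text is unchanged, which mirrors the default-OFF gating that leaves the transformer path untouched.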

Introduce a pub(crate) Embedder trait (embed / embed_batch / dimension / model_name); FastembedEmbedding implements it and VectorEngine now holds Arc<dyn Embedder>. Behavior-preserving — 53 codegraph-memory tests pass. This is the seam a static (lookup-table) backend plugs into next (Phase 1.1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Hand-rolled static embedder: loads config.json + tokenizer.json + model.safetensors (embeddings [vocab,dim] F32) and embeds as tokenize -> gather rows -> mean-pool -> L2-norm. Implements Embedder; VectorEngine::with_static_model loads a model dir. Deps: tokenizers (HF WordPiece) + safetensors — no ONNX, no 1.5GB RAM gate, no glibc shim. Validated against the real potion-base-8M (256d): loads, normalizes, ranks related code phrases above unrelated. Plus a pure mean_pool_l2 unit test (mean / OOV-skip / normalize / no-NaN). Phase 1.1. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
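A minimal sketch of the gather, mean-pool, L2-norm pipeline named in this commit, assuming the embedding table is already decoded into rows of `f32` and the tokenizer has produced row indices. The function name follows the `mean_pool_l2` test mentioned above, but the signature and the zero-vector fallback are assumptions.

```rust
/// Gather the rows for `token_ids`, skip out-of-vocabulary ids,
/// average the rows, then scale the result to unit length.
pub fn mean_pool_l2(table: &[Vec<f32>], token_ids: &[usize], dim: usize) -> Vec<f32> {
    let mut acc = vec![0.0f32; dim];
    let mut n = 0usize;
    for &id in token_ids {
        // OOV-skip: ids outside the table contribute nothing.
        if let Some(row) = table.get(id) {
            for (a, v) in acc.iter_mut().zip(row) {
                *a += *v;
            }
            n += 1;
        }
    }
    if n == 0 {
        // Assumed behavior: all-OOV input yields a zero vector, never NaN.
        return acc;
    }
    for a in acc.iter_mut() {
        *a /= n as f32;
    }
    let norm = acc.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        for a in acc.iter_mut() {
            *a /= norm;
        }
    }
    acc
}
```

Because there is no attention step, the cost per text is a handful of row reads and additions, which is the basis for the indexing-speed claims in this PR.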

examples/embed_throughput.rs compares static (potion-base-8M, 256d) vs ONNX BGE-small (384d) embedding throughput over 512 unique symbol texts. Debug floor: static 2882 vs BGE 335 texts/sec (8.6x); release widens it since the static path is pure Rust. Proves the indexing-speed premise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

examples/embed_quality.rs: 12 query->symbol pairs scored by recall@1/@3 + MRR. Generic static floor (potion-base-8M) R@1 0.92 / R@3 1.00 / MRR 0.958 vs BGE 1.00/1.00/1.000 — ~95% of BGE at the floor, 103x faster. Directional (small clean set), not the Phase-0 eval. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
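For readers unfamiliar with the metrics, here is a small sketch of recall@k and MRR over per-query ranks. The function names and the `Option<usize>` rank encoding are assumptions for illustration, not the contents of `examples/embed_quality.rs`.

```rust
/// `ranks[i]` is the 1-based rank of the correct symbol for query `i`,
/// or `None` when the correct symbol was not retrieved at all.
pub fn recall_at_k(ranks: &[Option<usize>], k: usize) -> f64 {
    if ranks.is_empty() {
        return 0.0;
    }
    let hits = ranks.iter().copied().flatten().filter(|&rank| rank <= k).count();
    hits as f64 / ranks.len() as f64
}

/// Mean reciprocal rank: the average of 1/rank, counting misses as 0.
pub fn mrr(ranks: &[Option<usize>]) -> f64 {
    if ranks.is_empty() {
        return 0.0;
    }
    let sum: f64 = ranks
        .iter()
        .map(|r| r.map_or(0.0, |rank| 1.0 / rank as f64))
        .sum();
    sum / ranks.len() as f64
}
```

The per-query ranks are not given in the commit, but one distribution consistent with the reported static numbers is 11 queries answered at rank 1 and one at rank 2: that gives R@1 = 11/12 ≈ 0.92, R@3 = 1.00 and MRR = (11 + 0.5)/12 ≈ 0.958.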

…ript StaticEmbedding now decodes F16/F32/F64 tensors and applies the optional per-token weights (model2vec >= 0.4 SIF), matching model2vec's encode exactly (emb*weights -> mean -> L2-norm). scripts/distill_static_model.py distills a teacher (default jinaai/jina-embeddings-v2-base-code, Apache-2.0) to a static model in ~30s on the M4 CPU. Examples take CODEGRAPH_STATIC_MODEL to A/B any dir. Validated: loads potion-base-8M (F32/no-weights) and the distilled jina-code-static-256 (F16/weighted), both semantically sane. Phase 2.2. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
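The weighted variant (emb*weights -> mean -> L2-norm) might look like the sketch below. It assumes one scalar weight per vocabulary row and a table already decoded to `f32`; `weighted_pool_l2` and its signature are hypothetical, and F16/F64 decoding is left out.

```rust
/// Scale each gathered row by its per-token weight (1.0 when no weights
/// are present), average over the tokens found, then L2-normalize.
pub fn weighted_pool_l2(
    table: &[Vec<f32>],
    weights: Option<&[f32]>,
    token_ids: &[usize],
    dim: usize,
) -> Vec<f32> {
    let mut acc = vec![0.0f32; dim];
    let mut n = 0usize;
    for &id in token_ids {
        if let Some(row) = table.get(id) {
            // A missing weight entry falls back to 1.0 (an assumption).
            let w = weights.and_then(|ws| ws.get(id).copied()).unwrap_or(1.0);
            for (a, v) in acc.iter_mut().zip(row) {
                *a += w * *v;
            }
            n += 1;
        }
    }
    if n > 0 {
        for a in acc.iter_mut() {
            *a /= n as f32;
        }
    }
    let norm = acc.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm > 0.0 {
        for a in acc.iter_mut() {
            *a /= norm;
        }
    }
    acc
}
```

Since the vector is normalized at the end, the weights only change its direction: a heavily weighted token pulls the pooled embedding toward its own row.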

…urated jina-code-static-256 ties the potion floor (R@1 0.92 / MRR 0.958) on the 12-query micro-eval — the set is too easy to show a code-teacher delta. Speed 70x BGE. Motivates the real Phase-0 eval (150+ queries on an indexed repo). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…emantic) scripts/extract_eval_corpus.py extracts 965 doc-commented symbols; examples/embed_eval.rs runs doc->symbol retrieval (recall@k). On this hard, unsaturated task: BGE R@1 0.591 / MRR 0.691; potion 0.378/0.488; jina-code-static-256 0.379/0.480. Static ~65% of BGE R@1, and the code teacher ties the generic potion at 256d — both against the saturated micro-eval. Pure semantic (no BM25); the real hybrid would close much of the gap. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…on ceiling 512d: no change (0.378). code teacher vs generic: no change. identifier-splitting (CODEGRAPH_SPLIT_IDS): +6% relative (static R@1 0.379->0.401, MRR 0.480->0.511; BGE 0.591->0.608). Static's ~65% of BGE on pure semantics is the no-attention ceiling, not dim/teacher. Real mitigant = the hybrid BM25+semantic system. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…to-end embed_eval now scores pure-semantic AND the real 40% BM25 + 60% semantic blend. BM25 recovers most of static's gap: jina-code-static-256 R@1 0.401->0.547, MRR 0.511->0.656 vs BGE 0.609/0.720 — static ~90% of BGE in hybrid (vs ~65% pure-semantic) at ~70-100x indexing speed. BGE barely uses BM25; static leans on it. Verdict: static is a viable default/opt-in. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
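To make the 40% BM25 + 60% semantic blend concrete, here is a hedged sketch of one way to combine the two scores. Only the 0.4/0.6 weights come from the commit; the max-normalization of BM25 scores, the function name and the input layout are assumptions.

```rust
const BM25_WEIGHT: f32 = 0.4;
const SEMANTIC_WEIGHT: f32 = 0.6;

/// Return candidate indices ordered by blended score, best first.
/// `bm25[i]` is a raw BM25 score and `semantic[i]` a cosine similarity
/// for candidate `i`. BM25 is divided by its maximum (assumed here) so
/// both terms are on a comparable scale.
pub fn hybrid_rank(bm25: &[f32], semantic: &[f32]) -> Vec<usize> {
    assert_eq!(bm25.len(), semantic.len());
    let max = bm25.iter().copied().fold(0.0f32, f32::max);
    let mut scored: Vec<(usize, f32)> = bm25
        .iter()
        .zip(semantic)
        .enumerate()
        .map(|(i, (&b, &s))| {
            let b_norm = if max > 0.0 { b / max } else { 0.0 };
            (i, BM25_WEIGHT * b_norm + SEMANTIC_WEIGHT * s)
        })
        .collect();
    // Sort descending by blended score.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().map(|(i, _)| i).collect()
}
```

For example, with BM25 scores `[10.0, 0.0, 5.0]` and similarities `[0.2, 0.7, 0.6]`, pure-semantic order is 1, 2, 0 while the blend gives 2, 0, 1: lexical evidence lifts candidates that the embedding alone ranks lower, which is the effect the static backend relies on.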

…GHPUT_CORPUS) On this project's 965 real symbols: static 46067 vs BGE 298 texts/sec (154x). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tatic Adds EmbeddingBackend { Fastembed(model) | Static(path) } + VectorEngine::from_backend. `--embedding-model static` (also via the LSP/MCP init path) resolves the model2vec dir from CODEGRAPH_STATIC_MODEL or ~/.codegraph/static_models/jina-code-static-256. Threaded through main.rs (3 duplicated parse blocks unified into EmbeddingBackend::parse), McpServer, EngineConfig, MemoryManager, and the LSP initialize path. ONNX models unchanged (default stays bge-small). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
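The resolution order described above could be sketched like this. The enum shape and the default directory come from the commit; the `parse` signature is an assumption, written as a pure function that receives the environment value and home directory as arguments so it can be exercised without touching the process environment.

```rust
use std::path::PathBuf;

#[derive(Debug, Clone, PartialEq)]
pub enum EmbeddingBackend {
    /// An ONNX model name such as "bge-small".
    Fastembed(String),
    /// A directory holding a model2vec-format static model.
    Static(PathBuf),
}

impl EmbeddingBackend {
    /// Hypothetical signature: `env_override` stands in for the value of
    /// CODEGRAPH_STATIC_MODEL and `home` for the user's home directory.
    pub fn parse(name: &str, env_override: Option<&str>, home: &str) -> Self {
        match name {
            "static" => {
                let dir = match env_override {
                    // An explicit, non-empty override wins.
                    Some(p) if !p.is_empty() => PathBuf::from(p),
                    // Otherwise fall back to the documented default dir.
                    _ => PathBuf::from(home)
                        .join(".codegraph")
                        .join("static_models")
                        .join("jina-code-static-256"),
                };
                Self::Static(dir)
            }
            other => Self::Fastembed(other.to_string()),
        }
    }
}
```

A caller would typically pass `std::env::var("CODEGRAPH_STATIC_MODEL").ok().as_deref()` as the override. Keeping one parse function is what lets the three duplicated blocks in main.rs collapse into a single code path.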

- vscode/package.json: add 'static' to codegraph.embeddingModel enum + a codegraph.staticModelPath setting.
- extension.ts: when embeddingModel=static, pass CODEGRAPH_STATIC_MODEL env to the spawned server; forward staticModelPath as an init option.
- README.md / vscode/README.md / mcp-package/README.md: document --embedding-model static (model2vec, ~100x faster indexing, ~90% of BGE in hybrid search) + how to distill/point at a model.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

scripts/server_eval.py drives the real codegraph-server MCP (per model: handshake, force-reindex, wait for embeddings, run symbol_search over doc->symbol queries, score recall@k): static vs BGE through the actual hybrid. scripts/extract_fullbody_corpus.py dumps full-body symbol texts for throughput tests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…l_search 300 doc->symbol queries via the actual hybrid: BGE R@1 0.457/MRR 0.568 vs static 0.430/0.536 (R@10 99%). Confirms viability at ~100x indexing speed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…5 MB) Telemetry (PostHog, 90d, 1,486 machines / 8,268 index events): cobol, perl, dart, zig, r, fortran appear in ZERO indexed workspaces. Gate them behind an `extra-languages` cargo feature (off by default): the community binary drops 141.2 -> 116.1 MB; `--features extra-languages` restores the full set (141.2 MB, identical to baseline). COBOL's parser.c alone was 30.7 MB of parse tables. parser_registry: optional deps + #[cfg] across imports / struct / new / get_parser / parser_for_path / supported_extensions / all_metrics / language_for_path + tests. Default build + 15 parser_registry tests pass; both feature configs compile. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
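The `#[cfg]` gating pattern can be illustrated with a reduced sketch. The feature name `extra-languages` and the six gated languages come from the commit; the function name and the three always-on languages are placeholders, and the real parser_registry applies the same attribute across imports, struct fields and lookups.

```rust
/// Languages compiled into this build. The six gated grammars are only
/// listed when the crate is built with `--features extra-languages`.
#[allow(unused_mut, unexpected_cfgs)]
pub fn supported_languages() -> Vec<&'static str> {
    // Placeholder core set, for illustration only.
    let mut langs = vec!["rust", "python", "typescript"];

    // This block is removed at compile time when the feature is off,
    // so the gated grammars add nothing to the default binary.
    #[cfg(feature = "extra-languages")]
    {
        langs.extend(["cobol", "fortran", "perl", "dart", "zig", "r"]);
    }

    langs
}
```

In a default build the function returns only the core set. Because the gated code is not compiled at all, the saving comes from never linking the grammars' parse tables, not from a runtime check.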

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Telemetry: mcp.start now emits `embeddingModel` (static / bge-small / jina-code-v2 / granite-97m) via EmbeddingBackend::telemetry_id, so static adoption is queryable in PostHog (and the extension config snapshot already reports the setting now that 'static' is valid).
- scripts/fetch-static-model.sh: fetch the distilled model from the release-independent `model` GitHub release (package-time bundle or manual).
- MCP postinstall: best-effort static-model fetch (skip via CODEGRAPH_SKIP_MODEL_FETCH; never fails install).
- VS Code: default CODEGRAPH_STATIC_MODEL to the bundled bin/jina-code-static-256 when embeddingModel=static and staticModelPath is unset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The VSIX bundles jina-code-static-256 and the extension auto-selects it (extension.ts), so `static` needs zero setup in VS Code. CLI/MCP keep the CODEGRAPH_STATIC_MODEL path instructions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-30T22:50:56Z

🔍 CodeGraph PR Review

34 files changed (+1845/−149, 67 functions) · Risk: 🔴 high

Blast radius

294 direct callers affected across CodeGraph/CodeGraph/scripts, CodeGraph/vscode/src, codegraph-memory/src/embedding, codegraph-server/src/ai_query, codegraph-server/src/handlers

⚠️ Test gaps (48 functions, 0 coverage)

with_static_model (crates/codegraph-memory/src/embedding/engine.rs) — signature_changed
model_name (crates/codegraph-memory/src/embedding/engine.rs) — signature_changed
from_backend (crates/codegraph-memory/src/embedding/engine.rs) — signature_changed
dimension (crates/codegraph-memory/src/embedding/fastembed_embed.rs) — signature_changed
embed (crates/codegraph-memory/src/embedding/fastembed_embed.rs) — signature_changed
embed_batch (crates/codegraph-memory/src/embedding/fastembed_embed.rs) — signature_changed
model_name (crates/codegraph-memory/src/embedding/fastembed_embed.rs) — signature_changed
default_static_model_dir (crates/codegraph-memory/src/embedding/mod.rs) — signature_changed
default (crates/codegraph-memory/src/embedding/mod.rs) — signature_changed
telemetry_id (crates/codegraph-memory/src/embedding/mod.rs) — signature_changed
…and 38 more

Suggested reviewers

Andrey Vasilevsky (227 lines), anvanster (1 line)

Suggested commit: feat(scripts): <describe the change> · 148 tests cover the changes
🤖 Generated by CodeGraph

anvanster and others added 21 commits June 25, 2026 23:42

docs(embed): record progress + 103x release throughput in the plan

f883283

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(embed): embed_throughput can load a real corpus (CODEGRAPH_THROU…

55fd388

…GHPUT_CORPUS) On this project's 965 real symbols: static 46067 vs BGE 298 texts/sec (154x). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: note COBOL/Fortran/Perl/Dart/Zig/R are extra-languages-gated

1ce7c7e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: bump version to 0.19.0 (static embeddings feature)

f0a6dbb

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

anvanster merged commit 29ee815 into main Jul 1, 2026
1 check passed

anvanster mentioned this pull request Jul 1, 2026

fix(pr-review): test-detection + risk calibration — v0.19.1 #8

Merged

anvanster merged 21 commits into main from feat/static-embeddings
