Skip to content

feat(embed): static (model2vec) embeddings + extra-languages gating — v0.19.0#7

Merged
anvanster merged 21 commits into
mainfrom
feat/static-embeddings
Jul 1, 2026
Merged

feat(embed): static (model2vec) embeddings + extra-languages gating — v0.19.0#7
anvanster merged 21 commits into
mainfrom
feat/static-embeddings

Conversation

@anvanster

Copy link
Copy Markdown
Member

Summary

Adds static (model2vec) embeddings as a selectable --embedding-model static: a token→vector lookup table that replaces the ONNX transformer for indexing. ~100× faster indexing (this repo's ~5,900 symbols embed in ~1 s vs ~3.4 min with BGE), no ONNX runtime or 1.5 GB RAM gate. Retrieval stays hybrid (BM25 + semantic), so end-to-end quality is ~90% of BGE (server-side eval: BGE R@1 0.457 / MRR 0.568 vs static 0.430 / 0.536 over 300 queries).

Also trims binary size by ~25 MB by gating 6 zero-usage tree-sitter grammars (COBOL/Fortran/Perl/Dart/Zig/R) behind an extra-languages cargo feature (decision driven by PostHog language-usage telemetry).

Version bumped to 0.19.0.

What's included

  • EmbeddingBackend enum { Fastembed | Static(dir) }; --embedding-model static (CLI), codegraph.embeddingModel / codegraph.staticModelPath (VS Code), CODEGRAPH_STATIC_MODEL env.
  • Static loader: model2vec format (config.json + tokenizer.json + model.safetensors), F16/F32/F64, optional SIF token weighting, mean-pool + L2-norm.
  • Distill script (scripts/distill_static_model.py): ~30 s on CPU from Apache-2.0 Jina-Code.
  • Model packaging: VSIX bundles jina-code-static-256 (extension auto-selects it, zero setup); npm package fetches it on postinstall from the release-independent GitHub model tag.
  • Telemetry: mcp.start now reports embeddingModel so static-adoption can be tracked before flipping the default.
  • extra-languages feature gate: 6 grammars optional, off by default (−25 MB).
  • READMEs + examples updated across root / VS Code / MCP.

Excluded from this release

Durability WIP (with_sync_writes/fsync for the canonical persist) is intentionally not in this branch — it'll land separately.

🤖 Generated with Claude Code

anvanster and others added 21 commits June 25, 2026 23:42
…n plan

build_embed_text can prepend the camelCase/snake_case-split form of a symbol's
name (reusing the BM25 tokenizer) so static lookup-table embedders get the
sub-words they can't subword-recover. Gated by set_split_identifiers, default
OFF — the transformer (BGE/Jina) path is unchanged until A/B'd via the eval.

Also adds docs/static-embeddings-plan.md (distill Jina-Code -> static, validated
at every step). Phase 1.2. Tests cover the splitter + the gated prepend.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Introduce a pub(crate) Embedder trait (embed / embed_batch / dimension /
model_name); FastembedEmbedding implements it and VectorEngine now holds
Arc<dyn Embedder>. Behavior-preserving — 53 codegraph-memory tests pass. This
is the seam a static (lookup-table) backend plugs into next (Phase 1.1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Hand-rolled static embedder: loads config.json + tokenizer.json +
model.safetensors (embeddings [vocab,dim] F32) and embeds as tokenize -> gather
rows -> mean-pool -> L2-norm. Implements Embedder; VectorEngine::with_static_model
loads a model dir. Deps: tokenizers (HF WordPiece) + safetensors — no ONNX, no
1.5GB RAM gate, no glibc shim.

Validated against the real potion-base-8M (256d): loads, normalizes, ranks
related code phrases above unrelated. Plus a pure mean_pool_l2 unit test
(mean / OOV-skip / normalize / no-NaN). Phase 1.1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
examples/embed_throughput.rs compares static (potion-base-8M, 256d) vs ONNX
BGE-small (384d) embedding throughput over 512 unique symbol texts. Debug floor:
static 2882 vs BGE 335 texts/sec (8.6x); release widens it since the static path
is pure Rust. Proves the indexing-speed premise.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
examples/embed_quality.rs: 12 query->symbol pairs scored by recall@1/@3 + MRR.
Generic static floor (potion-base-8M) R@1 0.92 / R@3 1.00 / MRR 0.958 vs BGE
1.00/1.00/1.000 — ~95% of BGE at the floor, 103x faster. Directional (small
clean set), not the Phase-0 eval.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ript

StaticEmbedding now decodes F16/F32/F64 tensors and applies the optional
per-token weights (model2vec >= 0.4 SIF), matching model2vec's encode exactly
(emb*weights -> mean -> L2-norm). scripts/distill_static_model.py distills a
teacher (default jinaai/jina-embeddings-v2-base-code, Apache-2.0) to a static
model in ~30s on the M4 CPU. Examples take CODEGRAPH_STATIC_MODEL to A/B any dir.

Validated: loads potion-base-8M (F32/no-weights) and the distilled
jina-code-static-256 (F16/weighted), both semantically sane. Phase 2.2.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…urated

jina-code-static-256 ties the potion floor (R@1 0.92 / MRR 0.958) on the 12-query
micro-eval — the set is too easy to show a code-teacher delta. Speed 70x BGE.
Motivates the real Phase-0 eval (150+ queries on an indexed repo).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…emantic)

scripts/extract_eval_corpus.py extracts 965 doc-commented symbols;
examples/embed_eval.rs runs doc->symbol retrieval (recall@k). On this hard,
unsaturated task: BGE R@1 0.591 / MRR 0.691; potion 0.378/0.488;
jina-code-static-256 0.379/0.480. Static ~65% of BGE R@1, and the code teacher
ties the generic potion at 256d — both against the saturated micro-eval. Pure
semantic (no BM25); the real hybrid would close much of the gap.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on ceiling

512d: no change (0.378). code teacher vs generic: no change. identifier-splitting
(CODEGRAPH_SPLIT_IDS): +6% relative (static R@1 0.379->0.401, MRR 0.480->0.511;
BGE 0.591->0.608). Static's ~65% of BGE on pure semantics is the no-attention
ceiling, not dim/teacher. Real mitigant = the hybrid BM25+semantic system.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…to-end

embed_eval now scores pure-semantic AND the real 40% BM25 + 60% semantic blend.
BM25 recovers most of static's gap: jina-code-static-256 R@1 0.401->0.547,
MRR 0.511->0.656 vs BGE 0.609/0.720 — static ~90% of BGE in hybrid (vs ~65%
pure-semantic) at ~70-100x indexing speed. BGE barely uses BM25; static leans on
it. Verdict: static is a viable default/opt-in.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…GHPUT_CORPUS)

On this project's 965 real symbols: static 46067 vs BGE 298 texts/sec (154x).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tatic

Adds EmbeddingBackend { Fastembed(model) | Static(path) } + VectorEngine::
from_backend. `--embedding-model static` (also via the LSP/MCP init path)
resolves the model2vec dir from CODEGRAPH_STATIC_MODEL or
~/.codegraph/static_models/jina-code-static-256. Threaded through main.rs (3
duplicated parse blocks unified into EmbeddingBackend::parse), McpServer,
EngineConfig, MemoryManager, and the LSP initialize path. ONNX models unchanged
(default stays bge-small).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- vscode/package.json: add 'static' to codegraph.embeddingModel enum + a
  codegraph.staticModelPath setting.
- extension.ts: when embeddingModel=static, pass CODEGRAPH_STATIC_MODEL env to
  the spawned server; forward staticModelPath as an init option.
- README.md / vscode/README.md / mcp-package/README.md: document --embedding-model
  static (model2vec, ~100x faster indexing, ~90% of BGE in hybrid search) + how
  to distill/point at a model.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
scripts/server_eval.py drives the real codegraph-server MCP (per model: handshake,
force-reindex, wait for embeddings, run symbol_search over doc->symbol queries,
score recall@k) — static vs BGE through the actual hybrid. scripts/
extract_fullbody_corpus.py dumps full-body symbol texts for throughput tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l_search

300 doc->symbol queries via the actual hybrid: BGE R@1 0.457/MRR 0.568 vs static
0.430/0.536 (R@10 99%). Confirms viability at ~100x indexing speed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…5 MB)

Telemetry (PostHog, 90d, 1,486 machines / 8,268 index events): cobol, perl, dart,
zig, r, fortran appear in ZERO indexed workspaces. Gate them behind an
`extra-languages` cargo feature (off by default): the community binary drops
141.2 -> 116.1 MB; `--features extra-languages` restores the full set (141.2 MB,
identical to baseline). COBOL's parser.c alone was 30.7 MB of parse tables.

parser_registry: optional deps + #[cfg] across imports / struct / new /
get_parser / parser_for_path / supported_extensions / all_metrics /
language_for_path + tests. Default build + 15 parser_registry tests pass; both
feature configs compile.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Telemetry: mcp.start now emits `embeddingModel` (static / bge-small /
  jina-code-v2 / granite-97m) via EmbeddingBackend::telemetry_id — static
  adoption is queryable in PostHog (and the extension config snapshot already
  reports the setting now that 'static' is valid).
- scripts/fetch-static-model.sh: fetch the distilled model from the
  release-independent `model` GitHub release (package-time bundle or manual).
- MCP postinstall: best-effort static-model fetch (skip via
  CODEGRAPH_SKIP_MODEL_FETCH; never fails install).
- VS Code: default CODEGRAPH_STATIC_MODEL to the bundled bin/jina-code-static-256
  when embeddingModel=static and staticModelPath is unset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The VSIX bundles jina-code-static-256 and the extension auto-selects it
(extension.ts), so `static` needs zero setup in VS Code. CLI/MCP keep the
CODEGRAPH_STATIC_MODEL path instructions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown

🔍 CodeGraph PR Review

34 files changed (+1845/−149, 67 functions) · Risk: 🔴 high

Blast radius

294 direct callers affected across CodeGraph/CodeGraph/scripts, CodeGraph/vscode/src, codegraph-memory/src/embedding, codegraph-server/src/ai_query, codegraph-server/src/handlers

⚠️ Test gaps (48 functions, 0 coverage)

  • with_static_model (crates/codegraph-memory/src/embedding/engine.rs) — signature_changed
  • model_name (crates/codegraph-memory/src/embedding/engine.rs) — signature_changed
  • from_backend (crates/codegraph-memory/src/embedding/engine.rs) — signature_changed
  • dimension (crates/codegraph-memory/src/embedding/fastembed_embed.rs) — signature_changed
  • embed (crates/codegraph-memory/src/embedding/fastembed_embed.rs) — signature_changed
  • embed_batch (crates/codegraph-memory/src/embedding/fastembed_embed.rs) — signature_changed
  • model_name (crates/codegraph-memory/src/embedding/fastembed_embed.rs) — signature_changed
  • default_static_model_dir (crates/codegraph-memory/src/embedding/mod.rs) — signature_changed
  • default (crates/codegraph-memory/src/embedding/mod.rs) — signature_changed
  • telemetry_id (crates/codegraph-memory/src/embedding/mod.rs) — signature_changed
  • …and 38 more

Suggested reviewers

Andrey Vasilevsky (227 lines), anvanster (1 lines)

Suggested commit: feat(scripts): <describe the change> · 148 tests cover the changes
🤖 Generated by CodeGraph

@anvanster anvanster merged commit 29ee815 into main Jul 1, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant