Existing LLM tools (Cursor, Claude Code, Copilot, and the rest) pick what to put in the context window with heuristics: grep for some symbols, embed and cosine-similarity search, or stuff as many files as will fit. Few of them treat it as an explicit selection problem.
This project is one attempt at treating it as one.
It runs entirely locally, with no LLM calls, API keys, or cloud dependencies, and supports Python, JavaScript, TypeScript, Go, Rust, Java, Ruby, C, and C++.
Think of it this way:
| Classic OS | LLM Equivalent | Current State |
|---|---|---|
| RAM | Context window | Manually managed |
| Virtual memory / page swaps | Context eviction + retrieval | Crude summarization |
| Process scheduler | Agent orchestration | Hand-coded loops |
| File system cache | Knowledge retrieval | Cosine similarity |
| Memory allocator | Token budget allocation | Nobody does this |
Early computers had programmers managing memory addresses by hand. Then virtual memory shipped, and it changed what was possible.
Right now LLM context is managed by hand: grep, RAG, cram-everything. Cognitive-cache is a stab at an algorithm in that gap. Whether it ends up being the right one is an open question; the point is that the gap is real.
Given a task (like a GitHub issue) and a codebase in any of the nine supported languages, cognitive-cache picks which files to include in the LLM's context window. It combines multiple signals (symbol matching, dependency graph distance, git recency, lexical similarity, redundancy penalties, file role awareness) into a scoring function and runs greedy submodular optimization to select the highest-value set of files that fits within a token budget.
The key insight is treating context selection as a constrained optimization problem rather than a retrieval problem. RAG systems ask "what's most similar to the query?" but what you actually want is "what maximizes the chance the model gets this right?" Those are different questions.
In practice, the honest job it does well is helping an agent orient faster on a cold start: a single call that returns a ranked set of likely-relevant files in an unfamiliar codebase, instead of several grep-and-read roundtrips to find where to begin. It is not a replacement for an agent's own search on tasks where you already know roughly where to look (see the benchmark for exactly where the line is).
An honest answer needs two things: a metric that isolates ranking quality, and a confidence interval so you can tell a real edge from noise. The harness here is LLM-free and deterministic. For each issue it clones the repo at the pre-fix commit, ranks every file, and scores the ranking against the files the fix actually modified, using rank-aware metrics (recall@k and mean reciprocal rank, MRR). Every head-to-head difference comes with a paired bootstrap 95% confidence interval; a CI that excludes zero means the difference is unlikely to be resampling noise.
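As a rough illustration of the two rank-aware metrics named above, here is a minimal sketch. It assumes recall@k is the fraction of gold (actually modified) files found in the top k and that each path appears once in the ranking; the harness may define either detail differently. The function names and the toy ranking are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold files that appear in the top-k of the ranking."""
    if not gold:
        return 0.0
    hits = sum(1 for path in ranked[:k] if path in gold)
    return hits / len(gold)


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / position of the first gold file, or 0.0 if none is ranked."""
    for position, path in enumerate(ranked, start=1):
        if path in gold:
            return 1.0 / position
    return 0.0


# Toy data (made up): the fix touched two files, one of which is ranked third.
ranked = ["src/auth.py", "src/views.py", "src/session.py", "README.md"]
gold = {"src/session.py", "src/middleware.py"}

print(recall_at_k(ranked, gold, k=1))  # 0.0
print(recall_at_k(ranked, gold, k=3))  # 0.5
print(reciprocal_rank(ranked, gold))   # first gold file is at rank 3
```

MRR is then the mean of `reciprocal_rank` over all issues in the dataset.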
The dataset is SWE-bench, the standard set of real GitHub issues paired with their gold solution patch (no API keys; pulled over the HuggingFace datasets-server).
| Strategy | recall@1 | recall@5 | recall@10 | MRR |
|---|---|---|---|---|
| cognitive-cache | 0.231 | 0.538 | 0.654 | 0.375 |
| grep (keyword count) | 0.103 | 0.359 | 0.487 | 0.236 |
| lexical (TF-IDF) | 0.141 | 0.410 | 0.462 | 0.253 |
cognitive-cache wins on every metric, and the edge over grep is statistically real: recall@5 difference +0.179, 95% CI [+0.064, +0.295]. Over TF-IDF: +0.128, CI [+0.013, +0.244].
On a harder set of multi-file fixes (n=91, two or more files changed) the edge holds: recall@5 0.365 vs grep 0.252 (CI [+0.056, +0.174]) and vs TF-IDF 0.184 (CI [+0.120, +0.247]).
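For readers unfamiliar with paired bootstrap intervals like the ones quoted above, the following is a minimal percentile-bootstrap sketch. It assumes one score per issue for each strategy, in the same issue order; the function name, resample count, and toy scores are illustrative and not taken from the harness.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_ci(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, CI low, CI high) for per-issue scores a - b."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    # Pairing: both strategies are scored on the same issue, so resample
    # the per-issue differences rather than each strategy independently.
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    rng = random.Random(seed)
    means: List[float] = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, low, high


# Toy per-issue recall@5 values (made up, not benchmark data).
ours = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
grep = [0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(ours, grep)
print(f"diff={mean_diff:+.3f}, 95% CI [{ci_low:+.3f}, {ci_high:+.3f}]")
```

If the interval stays above zero, the first strategy's edge survives resampling of the issue set.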
"Beats grep" means beats a naive keyword search, not a capable coding agent. In a head-to-head pilot, a fresh agent with only grep/read and no cognitive-cache localized the fix file in 5 of 5 single-file SWE-bench issues, ~3 tool calls each. cognitive-cache got 4 of 5. On self-contained tasks, an agent's own search is already excellent, and a static ranker adds no accuracy.
So the value is not "more accurate than the agent." It is cold-start orientation: one call that returns a ranked starting set in unfamiliar code, instead of several grep/read roundtrips to find your footing. It returns file contents, so it front-loads context rather than shrinking the prompt; the win, where there is one, is fewer roundtrips to get oriented, not fewer tokens.
Zeroing each signal and re-measuring showed symbol_overlap and lexical_sim carry the result. graph_distance (import-graph distance from task-mentioned files, formerly weighted 0.20) turned out inert on both datasets. A diagnostic found why: 87% of ground-truth fix files are themselves the files the issue names (graph distance 0), so import distance is redundant with symbol matching rather than additive. It is now weighted 0 by default, kept configurable for a future tree-sitter reference graph. Genuine multi-hop value, if it exists, belongs in an interactive traversal tool, not a static ranking signal.
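The zero-and-re-measure procedure can be sketched as a small loop. Everything below is a toy under stated assumptions: the signal values are invented, the metric is a simple top-1 hit rate rather than the harness's full metric set, and the helper names are hypothetical. The weights use the former 0.20 for graph distance so the ablation has something to zero.

```python
from typing import Dict, List, Set, Tuple

Signals = Dict[str, float]                    # signal name -> value for one file
Issue = Tuple[Dict[str, Signals], Set[str]]   # (per-file signals, gold files)


def rank(files: Dict[str, Signals], weights: Dict[str, float]) -> List[str]:
    """Rank files by weighted signal sum, ties broken by path."""
    def score(path: str) -> float:
        return sum(weights.get(name, 0.0) * v for name, v in files[path].items())
    return sorted(files, key=lambda p: (-score(p), p))


def top1_hit_rate(issues: List[Issue], weights: Dict[str, float]) -> float:
    hits = sum(1 for files, gold in issues if rank(files, weights)[0] in gold)
    return hits / len(issues)


def ablate(issues: List[Issue], weights: Dict[str, float]) -> Dict[str, float]:
    """Change in hit rate when each signal's weight is set to zero."""
    baseline = top1_hit_rate(issues, weights)
    deltas = {}
    for name in weights:
        zeroed = {**weights, name: 0.0}
        deltas[name] = top1_hit_rate(issues, zeroed) - baseline
    return deltas


weights = {"symbol_overlap": 0.45, "lexical_sim": 0.15, "graph_distance": 0.20}
issues: List[Issue] = [  # invented values, for illustration only
    ({"a.py": {"symbol_overlap": 1.0, "lexical_sim": 0.2, "graph_distance": 1.0},
      "b.py": {"symbol_overlap": 0.0, "lexical_sim": 0.6, "graph_distance": 0.5}}, {"a.py"}),
    ({"c.py": {"symbol_overlap": 1.0, "lexical_sim": 0.1, "graph_distance": 0.0},
      "d.py": {"symbol_overlap": 0.0, "lexical_sim": 0.5, "graph_distance": 0.5}}, {"c.py"}),
    ({"e.py": {"symbol_overlap": 0.0, "lexical_sim": 0.9, "graph_distance": 0.0},
      "f.py": {"symbol_overlap": 0.0, "lexical_sim": 0.1, "graph_distance": 0.5}}, {"e.py"}),
]
deltas = ablate(issues, weights)
print(deltas)
```

In this toy, zeroing `graph_distance` leaves the hit rate unchanged while zeroing either of the other two signals loses one issue, which is the shape of result the ablation above reports.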
- grep: counts issue symbols/keywords appearing in each file and ranks by count. A stand-in for naive symbol search.
- lexical (TF-IDF): scikit-learn TF-IDF cosine similarity between issue text and file content. A simpler stand-in for embedding RAG; a neural embedding model would likely score somewhat higher.
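To make the grep baseline concrete, here is a minimal keyword-count ranker in plain Python. It is a sketch of the idea described in the list above, not the benchmark's actual implementation: the tokenizer, the stopword list, and the tie-breaking rule are all assumptions.

```python
import re
from collections import Counter
from typing import Dict, List, Set, Tuple

# Assumption: identifiers and words of three or more characters count as keywords.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]{2,}")
STOPWORDS = {"the", "and", "for", "with", "when", "that", "this", "after"}


def keywords(issue_text: str) -> Set[str]:
    """Lowercased candidate keywords from the issue, minus stopwords."""
    return {t.lower() for t in TOKEN.findall(issue_text)} - STOPWORDS


def grep_rank(issue_text: str, files: Dict[str, str]) -> List[Tuple[str, int]]:
    """Rank files by total occurrences of issue keywords, highest first."""
    wanted = keywords(issue_text)
    results = []
    for path, content in files.items():
        counts = Counter(t.lower() for t in TOKEN.findall(content))
        results.append((path, sum(counts[w] for w in wanted)))
    return sorted(results, key=lambda item: (-item[1], item[0]))


# Toy repository (made up).
files = {
    "auth/session.py": "def refresh_session(token):\n    return validate_token(token)\n",
    "billing/invoice.py": "def total(invoice):\n    return sum(invoice.lines)\n",
}
ranking = grep_rank("refresh_session raises when token is expired", files)
print(ranking)  # [('auth/session.py', 3), ('billing/invoice.py', 0)]
```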
Six signals score each file, with configurable weights:
| Signal | Weight | What it does |
|---|---|---|
| symbol overlap | 0.45 | Does this file define or mention identifiers from the task? Dominant signal when it fires |
| graph distance | 0.0 | Import-graph distance from task-mentioned files (networkx, resolves imports across all languages incl. TS path aliases). Weighted 0 by default: benchmarking found it redundant with symbol overlap (87% of fix files are themselves the named files). Kept configurable for a future reference-graph upgrade |
| change recency | 0.03 | Recently changed in git? Only fires when the file also has structural relevance, to prevent recently-touched-but-unrelated files from flooding results |
| redundancy penalty | 0.10 | Already selected a file with similar symbols? This one is worth less, preventing budget waste on duplicate context |
| lexical similarity | 0.15 | TF-IDF cosine similarity between task and file content. Not neural embeddings; a neural upgrade is an open path gated on a benchmark win |
| file role prior | 0.07 | Source files score higher than test files by default. Test files get boosted only when the task mentions testing |
These get combined into a weighted score, and a two-phase greedy selector picks files: first by absolute score (core files), then by value-per-token (supporting context). Redundancy is re-evaluated after each pick. If a file is too large to fit (like a 13K token app.py when your budget is 12K), it gets chunked to extract just the relevant functions.
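The two-phase selection described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's selector: `Candidate`, `core_threshold`, the Jaccard-based redundancy measure, and the toy numbers are all hypothetical, `base_score` stands in for the weighted sum of the non-redundancy signals, and every candidate is assumed to have a positive token count.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Candidate:
    path: str
    tokens: int                # estimated token cost of including the file
    base_score: float          # weighted sum of the non-redundancy signals
    symbols: FrozenSet[str]    # identifiers the file defines or mentions


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def adjusted(c: Candidate, chosen: List[Candidate], penalty: float) -> float:
    """Base score minus a penalty for symbol overlap with already-chosen files."""
    overlap = max((jaccard(c.symbols, s.symbols) for s in chosen), default=0.0)
    return c.base_score - penalty * overlap


def select(candidates: List[Candidate], budget: int,
           core_threshold: float = 0.5, penalty: float = 0.10) -> List[Candidate]:
    chosen: List[Candidate] = []
    remaining = list(candidates)
    spent = 0
    for phase in ("core", "supporting"):
        while True:
            # Redundancy is re-evaluated on every pass because `chosen` grows.
            fitting = [c for c in remaining if spent + c.tokens <= budget]
            if phase == "core":   # phase 1: rank by absolute score
                fitting = [c for c in fitting
                           if adjusted(c, chosen, penalty) >= core_threshold]
                key = lambda c: adjusted(c, chosen, penalty)
            else:                 # phase 2: rank by value per token
                fitting = [c for c in fitting if adjusted(c, chosen, penalty) > 0]
                key = lambda c: adjusted(c, chosen, penalty) / c.tokens
            if not fitting:
                break
            best = max(fitting, key=key)
            chosen.append(best)
            remaining.remove(best)
            spent += best.tokens
    return chosen


candidates = [  # invented values
    Candidate("auth.py", 4000, 0.80, frozenset({"login", "redirect", "session"})),
    Candidate("auth_legacy.py", 4000, 0.55, frozenset({"login", "redirect", "session"})),
    Candidate("urls.py", 500, 0.20, frozenset({"redirect", "route"})),
    Candidate("utils.py", 3000, 0.30, frozenset({"helper"})),
]
picked = select(candidates, budget=8000)
print([c.path for c in picked])  # ['auth.py', 'urls.py', 'utils.py']
```

In this toy run `auth_legacy.py` would qualify as a core file on its own, but once `auth.py` is selected its identical symbol set drops it below the threshold, and the cheaper supporting files win on value per token.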
Test files are automatically excluded unless the task mentions testing keywords (test, spec, coverage, fixture, mock, stub), though you can override this with include_tests=True/False.
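The auto-detection rule can be pictured with a small sketch. The keyword list comes from the paragraph above; the matching rule (whole words with a few common suffixes) and the function name `should_include_tests` are assumptions, since the exact matching logic is not described here.

```python
import re
from typing import Optional

TEST_KEYWORDS = ("test", "spec", "coverage", "fixture", "mock", "stub")

# Assumption: match whole words plus simple suffixes, so "tests" and "mocked"
# count but "latest" and "specific" do not.
_TEST_PATTERN = re.compile(
    r"\b(?:" + "|".join(TEST_KEYWORDS) + r")(?:s|ing|ed)?\b", re.IGNORECASE
)


def should_include_tests(task: str, include_tests: Optional[bool] = None) -> bool:
    """An explicit True/False wins; None falls back to keyword detection."""
    if include_tests is not None:
        return include_tests
    return _TEST_PATTERN.search(task) is not None


print(should_include_tests("add test coverage for the auth module"))  # True
print(should_include_tests("fix the login redirect bug"))             # False
print(should_include_tests("fix the login redirect bug", True))       # True
```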
pip install cognitive-cache

Or with uv:

uv add cognitive-cache

from cognitive_cache import select_context_from_repo
result = select_context_from_repo(".", "fix the login redirect bug")
for item in result.selected:
    print(f"{item.source.path} (score: {item.score:.3f})")

For repeated queries against the same repo, build the index once and reuse it:
from cognitive_cache import RepoIndex, select_context
index = RepoIndex.build(".")
r1 = select_context(index, "fix the login bug")
r2 = select_context(index, "add rate limiting to the API")

All the scoring parameters are exposed if you need control over them:
from cognitive_cache import RepoIndex, select_context
from cognitive_cache.core.value_function import WeightConfig
index = RepoIndex.build(".")
result = select_context(
    index,
    "add test coverage for the auth module",
    budget=20000,
    include_tests=True,
    max_files=10,
    min_score=0.15,
    weights=WeightConfig(symbol_overlap=0.50, lexical_sim=0.20),
)

cognitive-cache select --repo . --task "fix the login redirect bug"
cognitive-cache select --repo . --task "fix login" --json # machine-readable
cognitive-cache select --repo . --task "fix login" --output ctx.txt # dump context to file
cognitive-cache select --repo . --task "fix login" --include-tests no # exclude test files
cognitive-cache select --repo . --task "fix login" --max-files 5 --min-score 0.2
Claude Code (registers at user scope, available in all projects):
claude mcp add --scope user cognitive-cache -- uvx --from "cognitive-cache[mcp]" cognitive-cache-mcp

Cursor / Windsurf / other editors (add to your MCP config file):
{
  "mcpServers": {
    "cognitive-cache": {
      "command": "uvx",
      "args": ["--from", "cognitive-cache[mcp]", "cognitive-cache-mcp"]
    }
  }
}

The tool is most useful when the relevant files aren't obvious: cross-cutting bugs, unfamiliar parts of a large codebase, or tasks that span multiple layers. For small repos or tasks where you already know which files to touch, it doesn't add much over grep.
Add this to your CLAUDE.md (or equivalent system prompt / rules file):
## context selection
When a task spans multiple files or you're not sure where to look, call `select_context_tool`
from the `cognitive-cache` MCP server before reading files manually:
- `repo_path`: absolute path to the repo root
- `task`: specific description of the task (the more precise, the better the results)
- `budget`: token budget for returned context (default 12000; raise for complex tasks)
- `include_tests`: true to always include test files, false to exclude, null for auto-detection
- `max_files`: cap on number of files returned (default 15)
- `min_score`: minimum relevance score threshold (default 0.0)
The tool returns file contents directly, so use them instead of separate file reads.
Call it at the start of investigation; the index is cached and follow-up calls are fast.
Skip it when you already know which files to read.

A precise task description outperforms a vague one because symbol overlap and lexical similarity both depend on the exact words used: "users get 401 after OIDC callback when session token is present" will score better than "fix auth bug". If nothing clears the internal confidence floor, the selector falls back to returning the top-scoring files with a stderr warning rather than an empty result.
The included workflow (.github/workflows/context-suggest.yml) automatically comments on new issues with the most relevant files. It runs entirely in CI with no API keys required.
uv sync --dev
uv run pytest tests/ # 159 tests
The primary benchmark is the LLM-free retrieval scoreboard: it clones each issue's repo (cached), ranks files, and reports recall@k, MRR, a per-signal ablation, and paired bootstrap confidence intervals. No API keys, no model server.
# import the dataset (SWE-bench, over the HuggingFace datasets-server, no token)
uv run python -m benchmark.import_swebench --max-per-repo 8 --out benchmark/dataset/swebench_subset.json
# score it
uv run python -m benchmark.retrieval_eval --dataset benchmark/dataset/swebench_subset.json
# sweep weight configs against the default, with significance
uv run python -m benchmark.tune_weights --dataset benchmark/dataset/swebench_subset.json
An older end-to-end harness that feeds the selected context to a local llama.cpp model and scores the generated patch also exists (benchmark/run_local.py, configured via LLAMACPP_BASE_URL / LLAMACPP_MODEL), but the retrieval scoreboard is faster, deterministic, and the one to trust for ranking changes.
src/cognitive_cache/
  api.py          # public API: RepoIndex, select_context
  cli.py          # CLI entry point
  mcp_server.py   # MCP server for Claude Code / Cursor
  models.py       # core types (Source, Task, ScoredSource, SelectionResult)
  indexer/        # turns a repo directory into a list of Source objects
  signals/        # the six scoring signals
  core/           # value function, greedy selector, file chunker
  baselines/      # the five baseline strategies we benchmark against
  llm/            # adapters for calling LLMs (claude, openai, llama.cpp)
benchmark/
  dataset/        # curated github issues with known fixes
  runner.py       # orchestrates benchmark runs
  evaluator.py    # computes recall and efficiency metrics
The benchmarking made the honest priorities clearer than the original roadmap:
- Tree-sitter symbol extraction to replace the current regex-based parsing. This is the foundational accuracy lever; the symbol signal is what carries the results, so better symbols help most.
- A signature-level repo-map output mode (return a token-budgeted skeleton of definitions instead of full file bodies). This is the one design where "saves context" is genuinely true, and it leans into the cold-start-orientation job. Prior art exists (Aider's repo-map and several MCP clones), so it is an execution bet, not a novel one.
- Multi-language benchmark coverage. SWE-bench is Python-heavy; the Go/Rust/Java indexing is implemented but not yet measured on real issues in those languages.
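To give a feel for the signature-level repo-map idea in the list above, here is a minimal sketch that uses Python's standard `ast` module in place of tree-sitter, so it only handles Python source. The function names, the whitespace-based token estimate, and the sample source are assumptions for illustration; none of this is existing project code.

```python
import ast
from typing import List

FUNCTION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef)


def skeleton(source: str) -> List[str]:
    """One line per top-level function, class, and method definition."""
    lines: List[str] = []
    for node in ast.parse(source).body:
        if isinstance(node, FUNCTION_NODES):
            lines.append(f"def {node.name}({ast.unparse(node.args)})")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}")
            for item in node.body:
                if isinstance(item, FUNCTION_NODES):
                    lines.append(f"    def {item.name}({ast.unparse(item.args)})")
    return lines


def budgeted_skeleton(source: str, max_tokens: int) -> str:
    """Keep skeleton lines in order until a crude token budget is used up."""
    kept: List[str] = []
    used = 0
    for line in skeleton(source):
        cost = len(line.split())  # whitespace split as a stand-in for a tokenizer
        if used + cost > max_tokens:
            break
        kept.append(line)
        used += cost
    return "\n".join(kept)


SOURCE = '''
class Session:
    def refresh(self, token: str) -> bool:
        return True

def login(user, password):
    return Session()
'''

print(budgeted_skeleton(SOURCE, max_tokens=6))
```

With a budget of 6 the output keeps the class and its method and drops `login`; a real implementation would presumably rank definitions by relevance before truncating rather than cutting in source order.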
Context windows keep growing (1M tokens, more soon) and the temptation is to assume that fixes the problem. It doesn't: more capacity means more choices, and stuffing everything in burns compute and dilutes attention. Selection gets more useful as windows grow, not less.
Apache 2.0. See LICENSE.
