Repo‑level RAG for GitHub repositories.
Clone any repo, parse it with Tree‑sitter, build semantic code indexes and graphs, then answer architectural questions via an LLM with file‑level citations.
git clone https://github.com/purvanshh/github-rag.git
cd github-rag
python3 -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt

Set your OpenAI key and (optionally) override defaults:
export OPENAI_API_KEY=sk-...
export EMBEDDING_MODEL=text-embedding-3-large # optional
export LLM_MODEL=gpt-4o # optional
export REPOS_DIR=./repos # optional
export CHROMA_PERSIST_DIR=./chroma_db # optional

python main.py ingest https://github.com/karpathy/nanoGPT

This will:
- clone the repo into REPOS_DIR/nanoGPT
- parse & chunk the code
- embed chunks & store them in Chroma
- build dependency and call graphs
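
To illustrate the last step, here is a minimal sketch of a file-level import graph for Python sources. It is not the project's implementation: it uses the standard-library `ast` module as a stand-in for Tree-sitter and a plain dict as a stand-in for a NetworkX graph, and the function name `build_import_graph` is hypothetical.

```python
import ast
from collections import defaultdict


def build_import_graph(sources: dict[str, str]) -> dict[str, set[str]]:
    """Map each file path to the set of in-repo files it imports.

    `sources` maps a repo-relative path (e.g. "pkg/model.py") to its text.
    Imports of modules outside the repo are ignored.
    """
    # Index modules by dotted name: "pkg/model.py" -> "pkg.model".
    module_to_path = {
        path[:-3].replace("/", "."): path
        for path in sources
        if path.endswith(".py")
    }
    graph: dict[str, set[str]] = defaultdict(set)
    for path, text in sources.items():
        graph[path]  # touch the key so every file appears as a node
        for node in ast.walk(ast.parse(text)):
            if isinstance(node, ast.Import):
                names = [alias.name for alias in node.names]
            elif (
                isinstance(node, ast.ImportFrom)
                and node.module
                and node.level == 0  # relative imports are skipped in this sketch
            ):
                names = [node.module]
            else:
                continue
            for name in names:
                target = module_to_path.get(name)
                if target and target != path:
                    graph[path].add(target)
    return dict(graph)
```

An edge `a.py -> b.py` here means "a.py imports b.py"; reversing the edges gives the dependents of a file.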
python main.py query "Where is the training loop implemented?"

python main.py serve
# or:
uvicorn api.server:app --host 0.0.0.0 --port 8000

Health check:
curl http://localhost:8000/health

streamlit run ui/streamlit_app.py

In the UI you can:
- paste a GitHub repo URL and click Analyze Repository
- ask questions like:
  - How does authentication work?
  - Where is authenticate_user used?
  - Explain file auth/service.py
- see an architecture dashboard (summary, dependency hubs, call‑graph hotspots, directory tree)
GitHub Repo URL
↓
RepoIngestionPipeline (ingestion/repo_pipeline.py)
1. Clone repo (GitPython)
2. Parse code (Tree-sitter)
3. Extract symbols (functions / classes / methods / imports)
4. Smart code chunking (by symbol, not token count)
5. Embeddings (OpenAI text-embedding-3-large)
6. Vector DB (ChromaDB)
7. Dependency graph (NetworkX)
8. Call graph (NetworkX)
↓
Repo ready for analysis
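
Step 4 (chunking by symbol rather than by token count) can be sketched as follows. This is an illustration under stated assumptions, not the code in ingestion/chunk_code.py: it handles Python only and uses the standard-library `ast` module in place of Tree-sitter, and the `Chunk` fields and the `chunk_by_symbol` name are hypothetical.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    symbol: str      # e.g. "Model.forward"
    kind: str        # "function", "class" or "method"
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    text: str


def chunk_by_symbol(source: str) -> list[Chunk]:
    """Emit one chunk per top-level function, class, and method."""
    lines = source.splitlines()
    chunks: list[Chunk] = []

    def add(node, symbol: str, kind: str) -> None:
        # Slice the exact source lines covered by the symbol.
        text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
        chunks.append(Chunk(symbol, kind, node.lineno, node.end_lineno, text))

    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    for node in ast.parse(source).body:
        if isinstance(node, func_types):
            add(node, node.name, "function")
        elif isinstance(node, ast.ClassDef):
            add(node, node.name, "class")
            for item in node.body:
                if isinstance(item, func_types):
                    add(item, f"{node.name}.{item.name}", "method")
    return chunks
```

Because each chunk carries its symbol name and line range, a retrieved chunk can be cited by file, symbol, and lines without re-parsing the file.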
User Query
↓
QueryRouter (intent classification)
↓
RepoAnalyzer (per-repo orchestrator)
↓
GraphAwareRetriever
- Vector similarity search
- Expand via dependency graph (imports & dependents)
- Expand via call graph (callers & callees)
- Cross-encoder reranking (bge-reranker-large)
↓
AnswerGenerator (LLM)
- Builds prompt with retrieved context
- Calls GPT-4o
- Normalizes sources for UI
↓
Answer + file/symbol/line citations
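
The graph-expansion stage of this flow can be sketched as below. It is a simplified illustration, not the project's GraphAwareRetriever: the graph is a plain dict of import edges instead of a NetworkX graph, only one hop is followed, and `expand_with_graph` and `max_extra` are hypothetical names.

```python
def expand_with_graph(
    seeds: list[str],
    imports: dict[str, set[str]],
    max_extra: int = 10,
) -> list[str]:
    """Return the seed files followed by their one-hop graph neighbours.

    `seeds` are file paths from vector search, best match first.
    `imports` maps a file to the files it imports.
    """
    # Invert the edges so files that import a seed (its dependents) are found too.
    dependents: dict[str, set[str]] = {}
    for src, targets in imports.items():
        for dst in targets:
            dependents.setdefault(dst, set()).add(src)

    ordered = list(dict.fromkeys(seeds))  # dedupe while keeping search order
    seen = set(ordered)
    extra: list[str] = []
    for seed in ordered:
        neighbours = imports.get(seed, set()) | dependents.get(seed, set())
        for path in sorted(neighbours):  # sorted for deterministic output
            if path not in seen and len(extra) < max_extra:
                seen.add(path)
                extra.append(path)
    return ordered + extra
```

The expanded list would then go to the reranker, which decides how many of the graph neighbours are relevant enough to reach the prompt.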
github-rag/
├── ingestion/
│ ├── clone_repo.py # Clone/pull GitHub repos via GitPython
│ ├── parse_code.py # Tree-sitter parsing & symbol extraction
│ ├── chunk_code.py # Semantic code chunking around symbols
│ └── repo_pipeline.py # End-to-end ingestion orchestration
│
├── indexing/
│ ├── embedder.py # OpenAI + local embedding backends
│ └── vector_store.py # ChromaDB vector store abstraction
│
├── retrieval/
│ ├── retriever.py # Basic hybrid retriever (vector + reranker)
│ ├── graph_aware_retriever.py # Graph-aware hybrid retriever
│ └── reranker.py # Cross-encoder reranking (bge-reranker-large)
│
├── graphs/
│ ├── dependency_graph.py # File-level import/dependency graph (NetworkX)
│ └── call_graph.py # Function-level call graph (NetworkX + Tree-sitter)
│
├── reasoning/
│ ├── prompt_templates.py # Structured prompts for QA & architecture
│ ├── answer_generator.py # GPT-4o answer generation + normalized sources
│ ├── architecture_summarizer.py # LLM-based repo architecture summaries
│ ├── repo_analyzer.py # High-level orchestration for a single repo
│ └── query_router.py # Intent classification & routing to RepoAnalyzer
│
├── api/
│ └── server.py # FastAPI REST API (ingest/query/overview/graphs)
│
├── ui/
│ └── streamlit_app.py # Streamlit UI: ingestion, QA, dashboard
│
├── graphs/__init__.py
├── ingestion/__init__.py
├── main.py # CLI entry point (ingest / query / serve)
├── config.py # Centralized configuration via env vars
├── requirements.txt # Python dependencies
└── README.md
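
The tree describes config.py as centralizing configuration via env vars. A minimal sketch of that pattern follows, using the variable names from the setup commands. The `Settings` class and `load_settings` function are hypothetical, and the defaults are an assumption taken from the example values shown in the setup commands; the real config.py may differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    embedding_model: str
    llm_model: str
    repos_dir: str
    chroma_persist_dir: str


def load_settings(env=None) -> Settings:
    """Read settings from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    return Settings(
        openai_api_key=api_key,
        # Defaults below are assumed, not confirmed by the project.
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-large"),
        llm_model=env.get("LLM_MODEL", "gpt-4o"),
        repos_dir=env.get("REPOS_DIR", "./repos"),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./chroma_db"),
    )
```

Passing a dict as `env` keeps the loader testable without touching the real process environment.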
| Layer | Technology |
|---|---|
| Language | Python 3.10+ |
| Code parsing | Tree-sitter (Python, JS, TS) |
| Embeddings | OpenAI text-embedding-3-large |
| Vector DB | ChromaDB (local) |
| LLM | GPT-4o |
| Reranker | BAAI/bge-reranker-large |
| Graphs | NetworkX |
| API | FastAPI + Uvicorn |
| UI | Streamlit |
| Git integration | GitPython |
- RepoIngestionPipeline
  - Single entrypoint to prepare a repo:
    - clone → parse → chunk → embed → index → build graphs → store metadata.
- RepoAnalyzer
  - Per‑repo orchestrator:
    - ask_question(query)
    - get_architecture_summary()
    - find_function_usage(function_name)
    - get_file_dependencies(file_path)
    - explain_file(file_path)
    - get_repo_overview()
- QueryRouter
  - Classifies queries into:
    - architecture, function_usage, file_dependencies, file_explanation, repo_overview, code_question
  - Routes to the appropriate RepoAnalyzer method.
- GraphAwareRetriever
  - Vector similarity search in Chroma.
  - Graph expansion via:
    - dependency graph (imported/importing files)
    - call graph (callers/callees)
  - Deduplicates candidates and reranks with bge-reranker-large.
- AnswerGenerator
  - Builds prompts from retrieved context.
  - Calls GPT‑4o via OpenAI SDK.
  - Returns:
    - answer (markdown)
    - sources (file/symbol/type/lines)
    - model
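
As an illustration of how QueryRouter and RepoAnalyzer fit together, here is a rule-based sketch of intent classification and dispatch. The intent labels and method names come from the list above; everything else is an assumption. The regex rules are a stand-in (the real router may classify differently, for example with an LLM), the intent-to-method mapping is inferred from the names, and `classify` and `route` are hypothetical.

```python
import re
from typing import Optional

# Ordered (pattern, intent) rules; the first match wins.
_RULES = [
    (re.compile(r"\bwhere is (\w+) (?:used|called)\b", re.I), "function_usage"),
    (re.compile(r"\bexplain (?:the )?file (\S+)", re.I), "file_explanation"),
    (re.compile(r"\b(?:dependencies|imports) of (\S+)", re.I), "file_dependencies"),
    (re.compile(r"\barchitecture\b", re.I), "architecture"),
    (re.compile(r"\boverview\b", re.I), "repo_overview"),
]


def classify(query: str) -> tuple[str, Optional[str]]:
    """Return (intent, argument); argument is a captured symbol or path, if any."""
    for pattern, intent in _RULES:
        match = pattern.search(query)
        if match:
            arg = match.group(1) if pattern.groups else None
            # Drop trailing sentence punctuation from a captured path or name.
            return intent, (arg.rstrip("?.") if arg else None)
    return "code_question", None  # fallback: general retrieval-based QA


def route(query: str, analyzer):
    """Dispatch to the analyzer method that matches the classified intent."""
    intent, arg = classify(query)
    if intent == "function_usage":
        return analyzer.find_function_usage(arg)
    if intent == "file_explanation":
        return analyzer.explain_file(arg)
    if intent == "file_dependencies":
        return analyzer.get_file_dependencies(arg)
    if intent == "architecture":
        return analyzer.get_architecture_summary()
    if intent == "repo_overview":
        return analyzer.get_repo_overview()
    return analyzer.ask_question(query)
```

With these rules, "Where is authenticate_user used?" classifies as function_usage with argument authenticate_user, while "How does authentication work?" matches no rule and falls through to code_question.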
- Project structure & CLI
- Tree-sitter AST symbol extraction (Python, JS, TS)
- Semantic chunking pipeline
- Embedding & indexing pipeline (Chroma)
- Hybrid retriever + cross-encoder reranker
- Graph-aware retrieval (dependency + call graphs)
- LLM answer generation with citations
- Dependency graph builder
- Function call graph builder
- API server wiring (FastAPI)
- Streamlit UI (ingestion, QA, dashboard)
- Architecture summary generation
MIT