🧠 GitHub Codebase Intelligence System

Repo‑level RAG for GitHub repositories.
Clone any repo, parse it with Tree‑sitter, build semantic code indexes and graphs, then answer architectural questions via an LLM with file‑level citations.

⚡ Quick Start

1. Clone & set up environment

git clone https://github.com/purvanshh/github-rag.git
cd github-rag

python3 -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

pip install -r requirements.txt

2. Configure environment

Set your OpenAI key and (optionally) override defaults:

export OPENAI_API_KEY=sk-...
export EMBEDDING_MODEL=text-embedding-3-large      # optional
export LLM_MODEL=gpt-4o                            # optional
export REPOS_DIR=./repos                           # optional
export CHROMA_PERSIST_DIR=./chroma_db              # optional
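
As a rough illustration of how these variables might be consumed, here is a minimal sketch of an env-driven settings loader. The `Settings` class and `load_settings` function are hypothetical names, not the actual contents of config.py, and the fallback values are assumed to match the examples shown above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; field names mirror the env vars above."""
    openai_api_key: str
    embedding_model: str
    llm_model: str
    repos_dir: str
    chroma_persist_dir: str


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "")
    if not api_key:
        # The key is the only variable without a fallback.
        raise RuntimeError("OPENAI_API_KEY is not set")
    return Settings(
        openai_api_key=api_key,
        # Fallbacks below are assumptions taken from the README examples.
        embedding_model=env.get("EMBEDDING_MODEL", "text-embedding-3-large"),
        llm_model=env.get("LLM_MODEL", "gpt-4o"),
        repos_dir=env.get("REPOS_DIR", "./repos"),
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "./chroma_db"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the real environment.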

3. Ingest a repository (CLI)

python main.py ingest https://github.com/karpathy/nanoGPT

This will:

- clone the repo into REPOS_DIR/nanoGPT
- parse and chunk the code
- embed the chunks and store them in Chroma
- build dependency and call graphs
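
The clone target in the first step can be derived from the URL alone. The helper below is a small sketch of that mapping; `local_repo_path` is a hypothetical name, and the real logic in ingestion/clone_repo.py may differ.

```python
from pathlib import Path
from urllib.parse import urlparse


def local_repo_path(repo_url: str, repos_dir: str = "./repos") -> Path:
    """Map a GitHub URL to <repos_dir>/<repo name> (hypothetical helper)."""
    path = urlparse(repo_url).path.rstrip("/")
    name = path.rsplit("/", 1)[-1]
    # Accept both ".../nanoGPT" and ".../nanoGPT.git"
    if name.endswith(".git"):
        name = name[: -len(".git")]
    if not name:
        raise ValueError(f"cannot derive a repo name from {repo_url!r}")
    return Path(repos_dir) / name


print(local_repo_path("https://github.com/karpathy/nanoGPT"))
```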

4. Ask questions (CLI)

python main.py query "Where is the training loop implemented?"

5. Start the API server

python main.py serve
# or:
uvicorn api.server:app --host 0.0.0.0 --port 8000

Health check:

curl http://localhost:8000/health

6. Launch the Streamlit UI

streamlit run ui/streamlit_app.py

In the UI you can:

- paste a GitHub repo URL and click Analyze Repository
- ask questions like:
  - How does authentication work?
  - Where is authenticate_user used?
  - Explain file auth/service.py
- see an architecture dashboard (summary, dependency hubs, call‑graph hotspots, directory tree)

🏗️ System Architecture

Ingestion pipeline

GitHub Repo URL
     ↓
RepoIngestionPipeline (ingestion/repo_pipeline.py)
  1. Clone repo (GitPython)
  2. Parse code (Tree-sitter)
  3. Extract symbols (functions / classes / methods / imports)
  4. Smart code chunking (by symbol, not token count)
  5. Embeddings (OpenAI text-embedding-3-large)
  6. Vector DB (ChromaDB)
  7. Dependency graph (NetworkX)
  8. Call graph (NetworkX)
     ↓
Repo ready for analysis
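
To make step 4 concrete, here is a minimal sketch of chunking by symbol rather than by token count. It uses Python's standard `ast` module as a stand-in for Tree-sitter so that it runs without extra dependencies, and therefore handles Python sources only. The function name and the chunk field names are assumptions, not the project's actual schema.

```python
import ast


def chunk_by_symbol(source: str, file_path: str) -> list[dict]:
    """Return one chunk per top-level function or class, plus one per method."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []

    def add(node, kind, parent=None):
        name = f"{parent}.{node.name}" if parent else node.name
        chunks.append({
            "file": file_path,
            "symbol": name,
            "type": kind,
            "start_line": node.lineno,      # 1-based, line of the def/class keyword
            "end_line": node.end_lineno,    # inclusive
            "text": "\n".join(lines[node.lineno - 1 : node.end_lineno]),
        })

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            add(node, "function")
        elif isinstance(node, ast.ClassDef):
            add(node, "class")
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    add(item, "method", parent=node.name)
    return chunks
```

Because each chunk carries its file, symbol name, and line range, the same metadata can later be surfaced as a citation.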

Query pipeline

User Query
     ↓
QueryRouter (intent classification)
     ↓
RepoAnalyzer (per-repo orchestrator)
     ↓
GraphAwareRetriever
  - Vector similarity search
  - Expand via dependency graph (imports & dependents)
  - Expand via call graph (callers & callees)
  - Cross-encoder reranking (bge-reranker-large)
     ↓
AnswerGenerator (LLM)
  - Builds prompt with retrieved context
  - Calls GPT-4o
  - Normalizes sources for UI
     ↓
Answer + file/symbol/line citations
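
The dependency-graph expansion step can be sketched with plain dictionaries. This is an illustrative approximation, not the code in retrieval/graph_aware_retriever.py: the project uses NetworkX, while `expand_with_graph`, its `imports` mapping, and the `max_extra` cap are hypothetical.

```python
def expand_with_graph(seed_files, imports, max_extra=5):
    """Add files the seeds import, and files that import the seeds.

    `imports` maps a file to the collection of files it imports.
    """
    # Invert the import map once so dependents can be looked up directly.
    dependents = {}
    for src, targets in imports.items():
        for target in targets:
            dependents.setdefault(target, set()).add(src)

    expanded = list(seed_files)
    seen = set(seed_files)
    for f in seed_files:
        neighbours = sorted(set(imports.get(f, ())) | dependents.get(f, set()))
        for n in neighbours:
            # Cap the number of added files so the context stays small.
            if n not in seen and len(expanded) - len(seed_files) < max_extra:
                seen.add(n)
                expanded.append(n)
    return expanded


imports = {"train.py": {"model.py", "config.py"}, "sample.py": {"model.py"}}
print(expand_with_graph(["model.py"], imports))
```

The same pattern applies to the call graph, with functions in place of files and callers/callees in place of importers/imports.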

📁 Project Structure

github-rag/
├── ingestion/
│   ├── clone_repo.py         # Clone/pull GitHub repos via GitPython
│   ├── parse_code.py         # Tree-sitter parsing & symbol extraction
│   ├── chunk_code.py         # Semantic code chunking around symbols
│   └── repo_pipeline.py      # End-to-end ingestion orchestration
│
├── indexing/
│   ├── embedder.py           # OpenAI + local embedding backends
│   └── vector_store.py       # ChromaDB vector store abstraction
│
├── retrieval/
│   ├── retriever.py          # Basic hybrid retriever (vector + reranker)
│   ├── graph_aware_retriever.py  # Graph-aware hybrid retriever
│   └── reranker.py           # Cross-encoder reranking (bge-reranker-large)
│
├── graphs/
│   ├── dependency_graph.py   # File-level import/dependency graph (NetworkX)
│   └── call_graph.py         # Function-level call graph (NetworkX + Tree-sitter)
│
├── reasoning/
│   ├── prompt_templates.py   # Structured prompts for QA & architecture
│   ├── answer_generator.py   # GPT-4o answer generation + normalized sources
│   ├── architecture_summarizer.py # LLM-based repo architecture summaries
│   ├── repo_analyzer.py      # High-level orchestration for a single repo
│   └── query_router.py       # Intent classification & routing to RepoAnalyzer
│
├── api/
│   └── server.py             # FastAPI REST API (ingest/query/overview/graphs)
│
├── ui/
│   └── streamlit_app.py      # Streamlit UI: ingestion, QA, dashboard
│
├── graphs/__init__.py
├── ingestion/__init__.py
├── main.py                   # CLI entry point (ingest / query / serve)
├── config.py                 # Centralized configuration via env vars
├── requirements.txt          # Python dependencies
└── README.md

🔧 Tech Stack

| Layer | Technology |
| --- | --- |
| Language | Python 3.10+ |
| Code parsing | Tree-sitter (Python, JS, TS) |
| Embeddings | OpenAI `text-embedding-3-large` |
| Vector DB | ChromaDB (local) |
| LLM | GPT-4o |
| Reranker | `BAAI/bge-reranker-large` |
| Graphs | NetworkX |
| API | FastAPI + Uvicorn |
| UI | Streamlit |
| Git integration | GitPython |

🧠 Core Components

RepoIngestionPipeline
- Single entrypoint to prepare a repo:
  - clone → parse → chunk → embed → index → build graphs → store metadata.
RepoAnalyzer
- Per‑repo orchestrator:
  - ask_question(query)
  - get_architecture_summary()
  - find_function_usage(function_name)
  - get_file_dependencies(file_path)
  - explain_file(file_path)
  - get_repo_overview()
QueryRouter
- Classifies queries into:
  - architecture, function_usage, file_dependencies, file_explanation, repo_overview, code_question
- Routes to the appropriate RepoAnalyzer method.
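
The README does not say how intents are classified. As one possible approach, the sketch below uses simple keyword rules and then looks up the RepoAnalyzer method name for the resulting intent. The rules and the intent-to-method mapping are assumptions made for illustration; only the intent labels and method names come from the lists above.

```python
import re

# Assumed pairing of intents with the RepoAnalyzer methods listed above.
INTENT_TO_METHOD = {
    "architecture": "get_architecture_summary",
    "function_usage": "find_function_usage",
    "file_dependencies": "get_file_dependencies",
    "file_explanation": "explain_file",
    "repo_overview": "get_repo_overview",
    "code_question": "ask_question",
}


def classify_intent(query: str) -> str:
    """Keyword-based intent guess; falls back to a general code question."""
    q = query.lower()
    if re.search(r"\bexplain (the )?file\b", q):
        return "file_explanation"
    if re.search(r"\bwhere is \w+ (used|called)\b", q) or "usage of" in q:
        return "function_usage"
    if "depend" in q or "imports" in q:
        return "file_dependencies"
    if "architecture" in q or "high-level design" in q:
        return "architecture"
    if "overview" in q:
        return "repo_overview"
    return "code_question"


intent = classify_intent("Where is authenticate_user used?")
print(intent, "->", INTENT_TO_METHOD[intent])
```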
GraphAwareRetriever
- Vector similarity search in Chroma.
- Graph expansion via:
  - dependency graph (imported/importing files)
  - call graph (callers/callees)
- Deduplicates candidates and reranks with bge-reranker-large.
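
The deduplicate-then-rerank step might look roughly like the sketch below. The scorer is passed in as a function so that a cross-encoder such as bge-reranker-large could be plugged in. The toy `token_overlap` scorer is only a placeholder for demonstration, and the candidate field names are assumed.

```python
def dedupe_and_rerank(query, candidates, score_fn, top_k=5):
    """Drop duplicate chunks, then sort by score_fn(query, text), best first."""
    unique = {}
    for c in candidates:
        # Vector search and graph expansion can surface the same chunk twice.
        unique.setdefault((c["file"], c["symbol"]), c)
    ranked = sorted(
        unique.values(),
        key=lambda c: score_fn(query, c["text"]),
        reverse=True,
    )
    return ranked[:top_k]


def token_overlap(query, text):
    """Toy scorer: fraction of query tokens that also appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0
```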
AnswerGenerator
- Builds prompts from retrieved context.
- Calls GPT‑4o via OpenAI SDK.
- Returns:
  - answer (markdown)
  - sources (file/symbol/type/lines)
  - model
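
The return shape described for AnswerGenerator can be illustrated with a short sketch. The helper names and the chunk field names (`start_line`, `end_line`) are hypothetical; only the three top-level keys and the file/symbol/type/lines fields are taken from the description above.

```python
def normalize_sources(chunks):
    """Collapse retrieved chunks into unique file/symbol/type/lines records."""
    sources, seen = [], set()
    for c in chunks:
        key = (c["file"], c.get("symbol"), c.get("start_line"), c.get("end_line"))
        if key in seen:
            continue
        seen.add(key)
        sources.append({
            "file": c["file"],
            "symbol": c.get("symbol"),
            "type": c.get("type"),
            "lines": f"{c.get('start_line')}-{c.get('end_line')}",
        })
    return sources


def build_response(answer, chunks, model="gpt-4o"):
    """Assemble the answer payload: markdown answer, sources, and model name."""
    return {"answer": answer, "sources": normalize_sources(chunks), "model": model}
```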

🧪 Recommended Test Repos

🗺️ Roadmap (High Level)

📄 License

MIT
