Production-grade local RAG system for chatting with any GitHub repository.
RepoRAG ingests a local repository, builds a semantic + structural index using AST-aware chunking and hybrid search, and supports conversational multi-turn querying, all fully offline.
Selected architecture: Microkernel + Plugin. Each component (parser, embedder, retriever, generator) implements an abstract interface and is swappable without re-plumbing the pipeline. This avoids both the rigidity of a monolith and the over-engineering of a full DAG framework for a local tool.
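As a minimal sketch of that plugin seam (the `Embedder` interface and `HashEmbedder` backend here are invented for illustration, not gitrag's actual class names):

```python
from abc import ABC, abstractmethod

class Embedder(ABC):
    """Swappable embedding backend; the pipeline only sees this interface."""
    @abstractmethod
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class HashEmbedder(Embedder):
    """Toy stand-in: deterministic pseudo-embeddings for testing the plumbing."""
    def embed(self, texts):
        return [[((hash(t) >> s) % 100) / 100 for s in (0, 8, 16)] for t in texts]

# The pipeline is wired against the interface, so swapping backends means
# changing one constructor argument, not rewriting the pipeline stages.
def build_pipeline(embedder: Embedder):
    return lambda texts: embedder.embed(texts)

run = build_pipeline(HashEmbedder())
vectors = run(["def foo(): pass"])
```

Any component obeying the interface (a real sentence-transformers embedder, a remote API, a mock for tests) slots in the same way.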
```
┌──────────────────────────────────────────────────────────────────────┐
│                           RepoRAG Pipeline                           │
│                                                                      │
│  ┌──────────┐   ┌──────────────┐   ┌────────────┐   ┌───────────┐    │
│  │  Ingest  │──▶│ AST Chunker  │──▶│  Embedder  │──▶│   Index   │    │
│  │ (Loader) │   │ (tree-sitter)│   │ (BGE/etc)  │   │ (Vector + │    │
│  └──────────┘   └──────────────┘   └────────────┘   │  BM25 +   │    │
│                                                     │  Graph)   │    │
│                                                     └─────┬─────┘    │
│                                                           │          │
│  ┌──────────┐   ┌──────────────┐   ┌────────────┐         │          │
│  │  Query   │──▶│    Hybrid    │──▶│  Reranker  │◀────────┘          │
│  │ Underst. │   │  Retriever   │   │ (CrossEnc) │                    │
│  └──────────┘   └──────────────┘   └──────┬─────┘                    │
│                                           │                          │
│                 ┌─────────────────────────▼────────┐                 │
│                 │          LLM Generator           │                 │
│  ┌──────────┐   │  (Ollama / OpenAI-compatible)    │                 │
│  │  Memory  │──▶│  + Context Compression           │                 │
│  └──────────┘   │  + Citation Extraction           │                 │
│                 └──────────────────────────────────┘                 │
└──────────────────────────────────────────────────────────────────────┘
```
```
Repository Files
        │
        ▼
File Loader (language detection, binary filtering)
        │
        ▼
AST Chunker (tree-sitter) ───▶ Fallback: Text Chunker
        │
        ├──▶ Vector Store (ChromaDB, cosine similarity)
        ├──▶ BM25 Store (rank_bm25, code-aware tokenization)
        └──▶ Dependency Graph (networkx, import resolution)
```
```
User Query
        │
        ▼
Intent Classification ──▶ Retrieval Depth Adjustment
        │
        ▼
Query Reformulation (for follow-ups)
        │
        ▼
Hybrid Retrieval
        ├── Dense: Vector similarity search
        ├── Sparse: BM25 keyword search
        └── Reciprocal Rank Fusion (RRF)
        │
        ▼
Cross-Encoder Reranking
        │
        ▼
Multi-Hop Graph Expansion (optional)
        │
        ▼
Context Compression + Deduplication
        │
        ▼
LLM Generation (with citations)
        │
        ▼
Answer + [file:line] Citations
```
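The intent-classification step that drives retrieval-depth adjustment can be approximated with simple heuristics. This sketch is illustrative only; the intent labels, keyword rules, and depth values are invented and do not reflect gitrag's actual `query/intent.py`:

```python
def classify_intent(query: str) -> str:
    """Heuristic intent classes (illustrative keyword rules, not gitrag's)."""
    q = query.lower()
    if any(w in q for w in ("where is", "defined", "declared")):
        return "locate"   # exact-symbol lookup: narrow retrieval suffices
    if any(w in q for w in ("how does", "explain", "why")):
        return "explain"  # explanatory question: needs broader context
    return "general"

# Wider retrieval for explanatory questions, narrow for symbol lookup.
RETRIEVAL_DEPTH = {"locate": 5, "explain": 20, "general": 10}

intent = classify_intent("How does authentication work?")
depth = RETRIEVAL_DEPTH[intent]
```

The point is that intent selects how many candidates the retriever fetches, trading recall against prompt size.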
Naive fixed-size chunking (e.g., 512-token windows with overlap) is fundamentally broken for code:
- Semantic boundary violation: A 512-token window can split a function in half, losing the connection between signature and body.
- Context loss: The chunk loses import context, class membership, and docstrings.
- Redundant overlap: Sliding-window overlap wastes embedding space on duplicated content.
- No structural metadata: Fixed chunks can't carry symbol names, file paths, or dependency references.
- Cross-language inconsistency: Code structure varies by language; fixed windows ignore this.
AST-aware chunking extracts function, class, and method boundaries directly from the parse tree, preserving:
- Complete function/method bodies as atomic units
- Docstrings attached to their parent symbols
- Import context for dependency resolution
- Structural metadata (symbol kind, parent class, line numbers)
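To illustrate the idea without a tree-sitter dependency, here is a sketch using Python's stdlib `ast` module (gitrag itself uses tree-sitter so the same approach works across languages):

```python
import ast

source = '''\
import math

class Circle:
    """A circle."""
    def area(self, r):
        """Return the area."""
        return math.pi * r ** 2
'''

def ast_chunks(code: str):
    """Yield one chunk per function/class definition, with structural metadata."""
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            yield {
                "symbol": node.name,
                "kind": type(node).__name__,
                "lines": (node.lineno, node.end_lineno),
                "docstring": ast.get_docstring(node),
                "text": ast.get_source_segment(code, node),
            }

chunks = list(ast_chunks(source))
```

Each chunk is an atomic symbol with its docstring, kind, and line range attached, exactly the metadata a fixed-size window throws away.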
Pure vector search fails on:
- Exact symbol names: "Where is `calculateTaxRate` defined?" needs an exact token match that vector search may miss.
- Rare identifiers: Unusual function names have poor embedding coverage.
- Boolean precision: "Find all files importing `redis`" needs keyword matching.
BM25 excels at exact matches; vector search excels at semantic similarity. Reciprocal Rank Fusion (RRF) combines them:
score(d) = Σ 1/(k + rank_i(d)) for each retriever i
This is rank-based (not score-based), making it robust to different score distributions.
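A compact sketch of RRF as defined above (the chunk IDs are invented; `k = 60` is the constant commonly used in the literature):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over retrievers of 1/(k + rank(d))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; only ranks matter, never raw retriever scores.
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["chunk_a", "chunk_b", "chunk_c"]   # vector-search ranking
sparse = ["chunk_b", "chunk_d", "chunk_a"]   # BM25 ranking
fused = rrf([dense, sparse])
```

Note that `chunk_b` wins: ranking well in both lists beats ranking first in only one, which is exactly the robustness rank fusion buys over score averaging.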
| Decision | Choice | Justification |
|---|---|---|
| AST Parser | tree-sitter | Uniform multi-language support, fast C bindings, consistent node metadata |
| Embedding Model | BAAI/bge-base-en-v1.5 | 768d, strong general performance, sentence-transformers compatible |
| Vector DB | ChromaDB | Local-first, built-in persistence, metadata filtering, adequate for repo-scale |
| BM25 | rank_bm25 | Simple, reliable, code-aware tokenization (camelCase/snake_case splitting) |
| Reranker | BAAI/bge-reranker-base | Cross-encoder quality boost, bounded candidate set (β€30) |
| Similarity | Cosine (normalized) | Standard for text embeddings, L2-normalized in store |
| LLM | Ollama (local) | Fully offline, OpenAI-compatible API, any local model |
| Graph | networkx DiGraph | Lightweight, file-level import resolution, BFS traversal |
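The code-aware tokenization mentioned for the BM25 store can be sketched as follows; the regex and function name are illustrative, not gitrag's actual `bm25_store.py` implementation:

```python
import re

def code_tokens(text: str) -> list[str]:
    """Split identifiers on snake_case/camelCase boundaries; keep originals too."""
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)
    tokens = []
    for w in words:
        tokens.append(w.lower())  # whole identifier, for exact matches
        parts = [p for chunk in w.split("_")
                 for p in re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", chunk)]
        if len(parts) > 1:
            tokens.extend(p.lower() for p in parts)  # sub-words, for partial matches
    return tokens

toks = code_tokens("calculateTaxRate")
```

Indexing both the whole identifier and its sub-words lets BM25 match "tax rate calculation" against `calculateTaxRate` while still supporting exact-symbol queries.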
```
# Clone and install
git clone <repo-url> && cd gitrag
pip install -e .

# Or with Docker
docker build -t gitrag .
```

Prerequisites:
- Python 3.10+
- Ollama running locally (for LLM generation)
```
# Pull an LLM model
ollama pull llama3.1:8b
```

```
# Index a repository
gitrag index /path/to/your/repo

# Single-shot query
gitrag query /path/to/your/repo "How does authentication work?"

# Interactive chat
gitrag chat /path/to/your/repo

# Check index status
gitrag status /path/to/your/repo
```

Inside `gitrag chat`:
- `/quit` – Exit
- `/clear` – Clear conversation history
- `/stats` – Show index statistics
```
gitrag index /path/to/repo --config my_config.yaml
```

```
# Start the API
uvicorn gitrag.api.server:create_app --factory --host 0.0.0.0 --port 8000

# Or with Docker
docker run -p 8000:8000 -v /path/to/repos:/repos gitrag
```

Endpoints:
| Method | Path | Description |
|---|---|---|
| POST | `/index` | Index a repository |
| POST | `/query` | Query with conversation support |
| GET | `/status/{repo_path}` | Index status |
| GET | `/health` | Health check |
```
# Example: Index
curl -X POST http://localhost:8000/index \
  -H "Content-Type: application/json" \
  -d '{"repo_path": "/repos/my-project"}'

# Example: Query
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"repo_path": "/repos/my-project", "question": "What does the main function do?"}'
```

All settings live in `config.yaml`. Key sections:
```yaml
embeddings:
  model_name: "BAAI/bge-base-en-v1.5"   # Local embedding model
  device: ""                            # auto-detect (cuda/mps/cpu)

generation:
  provider: "ollama"                    # or "openai_compatible"
  model: "llama3.1:8b"                  # any Ollama model
  base_url: "http://localhost:11434"
  temperature: 0.1                      # low for deterministic answers

retrieval:
  enable_reranking: true                # disable for faster, lower-quality retrieval
  rerank_top_k: 10                      # final results count
```

Any setting can be overridden with an environment variable of the form `GITRAG_<SECTION>_<KEY>`, e.g.:

```
export GITRAG_GENERATION_MODEL=codellama:13b
export GITRAG_EMBEDDINGS_DEVICE=cuda
```

```
gitrag/
├── core/
│   ├── types.py          # All data types (CodeChunk, RetrievalResult, etc.)
│   └── pipeline.py       # Main orchestrator
├── ingest/
│   ├── loader.py         # Repository file discovery
│   ├── language.py       # Language detection
│   └── filters.py        # Binary/ignore filtering
├── chunking/
│   ├── ast_chunker.py    # tree-sitter AST-aware chunking
│   └── text_chunker.py   # Fallback for docs/config
├── embeddings/
│   └── local.py          # Sentence-transformers embedder
├── index/
│   ├── vector_store.py   # ChromaDB vector index
│   ├── bm25_store.py     # BM25 sparse index
│   └── graph_store.py    # Dependency graph (networkx)
├── retrieval/
│   ├── hybrid.py         # Hybrid retriever orchestrator
│   ├── fusion.py         # Reciprocal Rank Fusion
│   └── reranker.py       # Cross-encoder reranking
├── query/
│   ├── intent.py         # Intent classification
│   ├── reformulator.py   # Follow-up query reformulation
│   └── multi_hop.py      # Graph-based context expansion
├── memory/
│   └── conversation.py   # Multi-turn conversation state
├── generation/
│   ├── llm.py            # LLM client (Ollama/OpenAI)
│   ├── prompts.py        # Intent-specific prompt templates
│   └── context.py        # Context compression
├── evaluation/
│   └── metrics.py        # Precision@k, MRR, NDCG, faithfulness
├── api/
│   └── server.py         # FastAPI REST API
└── cli.py                # Click CLI
```
Built-in metrics for measuring retrieval and generation quality:
Retrieval Metrics:
- Precision@k, Recall@k
- MRR (Mean Reciprocal Rank)
- NDCG@k (Normalized Discounted Cumulative Gain)
Generation Metrics:
- Faithfulness score (token-overlap with context)
- Citation coverage (paragraphs with citations)
- Hallucination score (1 - faithfulness + code-block penalty)
Benchmarking:
- Synthetic query generation from indexed chunks
- Latency measurement per pipeline stage
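The retrieval metrics above are standard; a minimal sketch of two of them (query data is invented for the example):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    """Mean Reciprocal Rank over (retrieved, relevant) pairs."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank  # reciprocal rank of the first hit
                break
    return total / len(queries)

p = precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4)
m = mrr([(["x", "a"], {"a"}), (["a"], {"a"})])
```

Here `p` is 0.5 (two of four top results relevant) and `m` is 0.75 (first hits at ranks 2 and 1).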
- Low temperature (0.1): Reduces creative/hallucinated responses.
- Strict system prompt: "You MUST base answers strictly on provided context."
- Citation requirement: Every claim must reference `[file:line]`.
- Context grounding: Only retrieved, verified code chunks are placed in the prompt.
- Faithfulness scoring: Post-hoc validation that answer tokens appear in context.
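The faithfulness check can be as simple as token overlap between answer and context. This is a hedged sketch of that idea; gitrag's actual scorer in `evaluation/metrics.py` may tokenize and weight differently:

```python
def faithfulness(answer: str, context: str) -> float:
    """Share of answer tokens that also appear in the retrieved context."""
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    if not answer_tokens:
        return 1.0  # an empty answer cannot contradict the context
    return len(answer_tokens & context_tokens) / len(answer_tokens)

score = faithfulness(
    "returns the user id",
    "def get_user(): returns the user id from session",
)
```

A low score flags answers whose wording is not grounded in any retrieved chunk, which is the post-hoc validation described above.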
Three-tier conversation memory:
- Short-term buffer: Last N turns kept verbatim (default: 20).
- Rolling summary: Older turns compressed into extractive summary every 5 turns.
- Context window optimization: Summary + recent turns fit within token budget; oldest dropped first.
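The buffer-plus-summary mechanics can be sketched as follows. This simplified version summarizes a turn at the moment it falls out of the short-term buffer rather than on a fixed 5-turn cadence, and the class is illustrative, not gitrag's actual `memory/conversation.py`:

```python
from collections import deque

class ConversationMemory:
    """Short-term verbatim buffer plus a rolling extractive summary (sketch)."""
    def __init__(self, max_turns: int = 20):
        self.buffer = deque(maxlen=max_turns)  # last N turns, kept verbatim
        self.summary = ""                      # compressed record of older turns

    def add_turn(self, question: str, answer: str) -> None:
        if len(self.buffer) == self.buffer.maxlen:
            evicted_q, _ = self.buffer[0]      # oldest turn about to fall out
            self.summary = (self.summary + " " + evicted_q).strip()
        self.buffer.append((question, answer))

    def context(self) -> str:
        recent = "\n".join(f"Q: {q}\nA: {a}" for q, a in self.buffer)
        return f"Summary: {self.summary}\n{recent}" if self.summary else recent

mem = ConversationMemory(max_turns=2)
for i in (1, 2, 3):
    mem.add_turn(f"q{i}", f"a{i}")
```

Oldest turns migrate into the summary first, so the prompt stays within the token budget while no turn is silently lost.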
Query reformulation detects follow-up queries (pronouns, short queries) and prepends context from prior turns to create standalone queries.
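A minimal sketch of that detect-and-prepend logic; the pronoun list, length threshold, and rewrite template are invented for illustration, not gitrag's actual `query/reformulator.py`:

```python
PRONOUNS = {"it", "that", "this", "they", "those"}

def is_follow_up(query: str) -> bool:
    """Heuristic: very short queries or ones leaning on pronouns are follow-ups."""
    words = query.lower().rstrip("?").split()
    return len(words) <= 4 or bool(PRONOUNS & set(words))

def reformulate(query: str, prior_topic: str) -> str:
    """Prepend the previous turn's topic so the query stands alone for retrieval."""
    if is_follow_up(query):
        return f"{query} (in the context of: {prior_topic})"
    return query

q = reformulate("How is it configured?", "the Ollama LLM client")
```

The retriever then sees a standalone query instead of a bare pronoun, which keeps hybrid search effective across turns.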
- Incremental indexing (only re-index changed files via content hash)
- Code-optimized embedding model (jina-embeddings-v2-base-code)
- Symbol-level call graph (not just file-level imports)
- Streaming LLM responses
- Web UI (React frontend)
- Multi-repo support (query across repositories)
- Git blame integration (who changed what)
- Learned fusion weights (instead of fixed RRF)
- FAISS backend option for larger repos
- IDE plugins (VS Code extension)
MIT