An accurate Retrieval-Augmented Generation (RAG) system that analyzes multi-language codebases using Tree-sitter, builds comprehensive knowledge graphs, and supports natural language querying of codebase structure and relationships, as well as code editing.
Use the Makefile for:
- make install: Install project dependencies with full language support.
- make python: Install dependencies for Python only.
- make dev: Set up dev environment (install deps + pre-commit hooks).
- make test: Run all tests.
- make clean: Clean up build artifacts and cache.
- make help: Show available commands.
- Multi-Language Support:

| Language | Status | Extensions | Functions | Classes/Structs | Modules | Package Detection | Additional Features |
|---|---|---|---|---|---|---|---|
| Python | Fully Supported | .py | ✓ | ✓ | ✓ | __init__.py | Type inference, decorators, nested functions |
| JavaScript | Fully Supported | .js, .jsx | ✓ | ✓ | ✓ | - | ES6 modules, CommonJS, prototype methods, object methods, arrow functions |
| TypeScript | Fully Supported | .ts, .tsx | ✓ | ✓ | ✓ | - | Interfaces, type aliases, enums, namespaces, ES6/CommonJS modules |
| C++ | Fully Supported | .cpp, .h, .hpp, .cc, .cxx, .hxx, .hh, .ixx, .cppm, .ccm | ✓ | ✓ (classes/structs/unions/enums) | ✓ | CMakeLists.txt, Makefile | Constructors, destructors, operator overloading, templates, lambdas, C++20 modules, namespaces |
| Lua | Fully Supported | .lua | ✓ | ✓ (tables/modules) | ✓ | - | Local/global functions, metatables, closures, coroutines |
| Rust | Fully Supported | .rs | ✓ | ✓ (structs/enums) | ✓ | - | impl blocks, associated functions |
| Java | Fully Supported | .java | ✓ | ✓ (classes/interfaces/enums) | ✓ | package declarations | Generics, annotations, modern features (records/sealed classes), concurrency, reflection |
| Go | In Development | .go | ✓ | ✓ (structs) | ✓ | - | Methods, type declarations |
| Scala | In Development | .scala, .sc | ✓ | ✓ (classes/objects/traits) | ✓ | package declarations | Case classes, objects |
| C# | In Development | .cs | - | - | - | - | Classes, interfaces, generics (planned) |

- Tree-sitter Parsing: Uses Tree-sitter for robust, language-agnostic AST parsing
- Knowledge Graph Storage: Uses Memgraph to store codebase structure as an interconnected graph
- Natural Language Querying: Ask questions about your codebase in plain English
- AI-Powered Cypher Generation: Supports cloud models (Google Gemini and OpenAI) and local models (Ollama) for natural language to Cypher translation
- OpenAI Integration: OpenAI models can be used in place of Gemini or Ollama models by setting OPENAI_API_KEY
- Code Snippet Retrieval: Retrieves actual source code snippets for found functions/methods
- Advanced File Editing: Surgical code replacement with AST-based function targeting, visual diff previews, and exact code block modifications
- Shell Command Execution: Can execute terminal commands for tasks like running tests or using CLI tools
- Interactive Code Optimization: AI-powered codebase optimization with language-specific best practices and interactive approval workflow
- Reference-Guided Optimization: Use your own coding standards and architectural documents to guide optimization suggestions
- Dependency Analysis: Parses pyproject.toml to understand external dependencies
- Nested Function Support: Handles complex nested functions and class hierarchies
- Language-Agnostic Design: Unified graph schema across all supported languages
The system consists of two main components:
- Multi-language Parser: Tree-sitter based parsing system that analyzes codebases and ingests data into Memgraph
- RAG System (codebase_rag/): Interactive CLI for querying the stored knowledge graph
- Python 3.12+
- Docker & Docker Compose (for Memgraph)
- cmake (required for building pymgclient dependency)
- For cloud models: Google Gemini API key
- For local models: Ollama installed and running
- uv package manager
On macOS:
brew install cmake
On Linux (Ubuntu/Debian):
sudo apt-get update
sudo apt-get install cmake
On Linux (CentOS/RHEL):
sudo yum install cmake
# or on newer versions:
sudo dnf install cmake
git clone https://github.com/vitali87/code-graph-rag.git
cd code-graph-rag
- Install dependencies:
For basic Python support:
uv sync
For full multi-language support:
uv sync --extra treesitter-full
This installs Tree-sitter grammars for all supported languages (see Multi-Language Support section).
For development (including tests and pre-commit hooks):
make dev
This installs all dependencies and sets up pre-commit hooks automatically.
- Set up environment variables:
cp .env.example .env
# Edit .env with your configuration (see options below)
For Google Gemini (cloud) models:
# .env file
GEMINI_API_KEY=your_gemini_api_key_here
Get your free API key from Google AI Studio.
For OpenAI models:
# .env file
OPENAI_API_KEY=your_openai_api_key_here
For local models (Ollama):
# .env file
LOCAL_MODEL_ENDPOINT=http://localhost:11434/v1
LOCAL_ORCHESTRATOR_MODEL_ID=llama3
LOCAL_CYPHER_MODEL_ID=llama3
LOCAL_MODEL_API_KEY=ollama
Install and run Ollama:
# Install Ollama (macOS/Linux)
curl -fsSL https://ollama.ai/install.sh | sh
# Pull required models
ollama pull llama3
# Or try other models like:
# ollama pull llama3.1
# ollama pull mistral
# ollama pull codellama
# Ollama will automatically start serving on localhost:11434
Note: Local models keep your code on your machine and incur no API costs, but may be less accurate than cloud models like Gemini.
- Start Memgraph database:
docker-compose up -d
The Graph-Code system offers five main modes of operation:
- Parse & Ingest: Build knowledge graph from your codebase
- Interactive Query: Ask questions about your code in natural language
- Export & Analyze: Export graph data for programmatic analysis
- AI Optimization: Get AI-powered optimization suggestions for your code.
- Editing: Perform surgical code replacements and modifications with precise targeting.
Parse and ingest a multi-language repository into the knowledge graph:
For the first repository (clean start):
python -m codebase_rag.main start --repo-path /path/to/repo1 --update-graph --clean
For additional repositories (preserve existing data):
python -m codebase_rag.main start --repo-path /path/to/repo2 --update-graph
python -m codebase_rag.main start --repo-path /path/to/repo3 --update-graph
The system automatically detects and processes files for all supported languages (see Multi-Language Support section).
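The clean-then-append pattern above can be scripted when several repositories need ingesting. The following is a minimal sketch that only assembles the command lines shown above; the helper name `build_ingest_commands` is hypothetical and not part of the project.

```python
import sys
from typing import List


def build_ingest_commands(repo_paths: List[str]) -> List[List[str]]:
    """Build one ingest command per repository.

    Only the first command carries --clean, so later repositories are
    added to the existing graph instead of wiping it.
    """
    commands = []
    for index, repo in enumerate(repo_paths):
        cmd = [
            sys.executable, "-m", "codebase_rag.main", "start",
            "--repo-path", repo,
            "--update-graph",
        ]
        if index == 0:
            cmd.append("--clean")  # clean start for the first repository only
        commands.append(cmd)
    return commands


# Print the commands; pass each list to subprocess.run(cmd, check=True) to execute.
for cmd in build_ingest_commands(["/path/to/repo1", "/path/to/repo2"]):
    print("python", " ".join(cmd[1:]))
```

Keeping command construction separate from execution makes it easy to review the commands before anything touches the database.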
Start the interactive RAG CLI:
python -m codebase_rag.main start --repo-path /path/to/your/repo
Specify Custom Models:
# Use specific local models
python -m codebase_rag.main start --repo-path /path/to/your/repo \
--orchestrator-model llama3.1 \
--cypher-model codellama
# Use specific Gemini models
python -m codebase_rag.main start --repo-path /path/to/your/repo \
--orchestrator-model gemini-2.0-flash-thinking-exp-01-21 \
--cypher-model gemini-2.5-flash-lite-preview-06-17
Example queries (these work across all supported languages):
- "Show me all classes that contain 'user' in their name"
- "Find functions related to database operations"
- "What methods does the User class have?"
- "Show me functions that handle authentication"
- "List all TypeScript components"
- "Find Rust structs and their methods"
- "Show me Go interfaces and implementations"
- "Find all C++ operator overloads in the Matrix class"
- "Show me C++ template functions with their specializations"
- "List all C++ namespaces and their contained classes"
- "Find C++ lambda expressions used in algorithms"
- "Add logging to all database connection functions"
- "Refactor the User class to use dependency injection"
- "Convert these Python functions to async/await pattern"
- "Add error handling to authentication methods"
- "Optimize this function for better performance"
For programmatic access and integration with other tools, you can export the entire knowledge graph to JSON:
Export during graph update:
python -m codebase_rag.main start --repo-path /path/to/repo --update-graph --clean -o my_graph.json
Export existing graph without updating:
python -m codebase_rag.main export -o my_graph.json
Working with exported data:
from codebase_rag.graph_loader import load_graph
# Load the exported graph
graph = load_graph("my_graph.json")
# Get summary statistics
summary = graph.summary()
print(f"Total nodes: {summary['total_nodes']}")
print(f"Total relationships: {summary['total_relationships']}")
# Find specific node types
functions = graph.find_nodes_by_label("Function")
classes = graph.find_nodes_by_label("Class")
# Analyze relationships
for func in functions[:5]:
    relationships = graph.get_relationships_for_node(func.node_id)
    print(f"Function {func.properties['name']} has {len(relationships)} relationships")
Example analysis script:
python examples/graph_export_example.py my_graph.json
This provides a reliable, programmatic way to access your codebase structure without going through an LLM, which is useful for:
- Integration with other tools
- Custom analysis scripts
- Building documentation generators
- Creating code metrics dashboards
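For custom analysis scripts, the export can also be read with the standard library alone. The sketch below assumes the JSON has a top-level `nodes` array and that each node carries a `labels` list; that layout, the `from_id`/`to_id` keys, and the sample data are assumptions for illustration, so check them against your own my_graph.json.

```python
import json
from collections import Counter


def count_labels(graph: dict) -> Counter:
    """Count nodes per label (assumed layout: graph["nodes"][i]["labels"])."""
    counts = Counter()
    for node in graph.get("nodes", []):
        for label in node.get("labels", []):
            counts[label] += 1
    return counts


# Tiny in-memory stand-in for an exported graph (hypothetical data).
sample = json.loads("""
{
  "nodes": [
    {"node_id": 1, "labels": ["Module"], "properties": {"name": "app.py"}},
    {"node_id": 2, "labels": ["Function"], "properties": {"name": "main"}},
    {"node_id": 3, "labels": ["Function"], "properties": {"name": "helper"}}
  ],
  "relationships": [
    {"from_id": 1, "to_id": 2, "type": "DEFINES"},
    {"from_id": 1, "to_id": 3, "type": "DEFINES"}
  ]
}
""")

for label, count in count_labels(sample).most_common():
    print(f"{label}: {count}")
```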
For AI-powered codebase optimization with best practices guidance:
Basic optimization for a specific language:
python -m codebase_rag.main optimize python --repo-path /path/to/your/repo
Optimization with reference documentation:
python -m codebase_rag.main optimize python \
--repo-path /path/to/your/repo \
--reference-document /path/to/best_practices.md
Using specific models for optimization:
python -m codebase_rag.main optimize javascript \
--repo-path /path/to/frontend \
--orchestrator-model gemini-2.0-flash-thinking-exp-01-21
Supported Languages for Optimization:
All supported languages: python, javascript, typescript, rust, go, java, scala, cpp
How It Works:
- Analysis Phase: The agent analyzes your codebase structure using the knowledge graph
- Pattern Recognition: Identifies common anti-patterns, performance issues, and improvement opportunities
- Best Practices Application: Applies language-specific best practices and patterns
- Interactive Approval: Presents each optimization suggestion for your approval before implementation
- Guided Implementation: Implements approved changes with detailed explanations
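The interactive approval step can be pictured as a simple review loop. This is a hedged sketch of the idea, not the agent's actual code: the `Suggestion` class and `review_suggestions` function are hypothetical, and the prompt function is injectable so the loop can be exercised without a terminal.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Suggestion:
    file: str
    issue: str
    proposal: str


def review_suggestions(
    suggestions: Iterable[Suggestion],
    ask: Callable[[str], str] = input,
) -> List[Suggestion]:
    """Show each suggestion and collect only the ones the user approves."""
    approved = []
    for number, item in enumerate(suggestions, start=1):
        print(f"Optimization Suggestion #{number}: {item.file}")
        print(f"  Issue: {item.issue}")
        print(f"  Suggestion: {item.proposal}")
        answer = ask("[y/n] Do you approve this optimization? ").strip().lower()
        if answer in ("exit", "quit"):
            break  # end the session early, keeping earlier approvals
        if answer == "y":
            approved.append(item)
    return approved


# Scripted answers stand in for a human reviewer in this demo.
demo = [
    Suggestion("src/a.py", "repeated query in loop", "batch the query"),
    Suggestion("src/b.py", "unused import", "remove it"),
]
answers = iter(["y", "n"])
print(len(review_suggestions(demo, ask=lambda prompt: next(answers))))
```

Nothing is applied until a suggestion is explicitly approved, which mirrors the approve-before-implement order described above.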
Example Optimization Session:
Starting python optimization session...
┌──────────────────────────────────────────────────────────────────┐
│ The agent will analyze your python codebase and propose specific │
│ optimizations. You'll be asked to approve each suggestion before │
│ implementation. Type 'exit' or 'quit' to end the session.        │
└──────────────────────────────────────────────────────────────────┘
Analyzing codebase structure...
Found 23 Python modules with potential optimizations
Optimization Suggestion #1:
File: src/data_processor.py
Issue: Using list comprehension in a loop can be optimized
Suggestion: Replace with generator expression for memory efficiency
[y/n] Do you approve this optimization?
Reference Document Support: You can provide reference documentation (like coding standards, architectural guidelines, or best practices documents) to guide the optimization process:
# Use company coding standards
python -m codebase_rag.main optimize python \
--reference-document ./docs/coding_standards.md
# Use architectural guidelines
python -m codebase_rag.main optimize java \
--reference-document ./ARCHITECTURE.md
# Use performance best practices
python -m codebase_rag.main optimize rust \
--reference-document ./docs/performance_guide.md
The agent will incorporate the guidance from your reference documents when suggesting optimizations, ensuring they align with your project's standards and architectural decisions.
Common CLI Arguments:
- --orchestrator-model: Specify model for main operations
- --cypher-model: Specify model for graph queries
- --repo-path: Path to repository (defaults to current directory)
- --reference-document: Path to reference documentation (optimization only)
The knowledge graph uses the following node types and relationships:
- Project: Root node representing the entire repository
- Package: Language packages (Python: __init__.py, etc.)
- Module: Individual source code files (.py, .js, .jsx, .ts, .tsx, .rs, .go, .scala, .sc, .java)
- Class: Class/Struct/Enum definitions across all languages
- Function: Module-level functions and standalone functions
- Method: Class methods and associated functions
- Folder: Regular directories
- File: All files (source code and others)
- ExternalPackage: External dependencies
Language-specific AST node types:
- Python: function_definition, class_definition
- JavaScript/TypeScript: function_declaration, arrow_function, class_declaration
- C++: function_definition, template_declaration, lambda_expression, class_specifier, struct_specifier, union_specifier, enum_specifier
- Rust: function_item, struct_item, enum_item, impl_item
- Go: function_declaration, method_declaration, type_declaration
- Scala: function_definition, class_definition, object_definition, trait_definition
- Java: method_declaration, class_declaration, interface_declaration, enum_declaration
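To show how per-language node types like these can drive a language-agnostic parser, here is a minimal sketch that classifies a Tree-sitter node type as function-like or class-like. The table covers only a subset of the languages listed, and the function name and the "function"/"class" return values are illustrative assumptions rather than the project's internals.

```python
from typing import Optional

# Node types copied from the list above (subset of languages for brevity).
FUNCTION_NODE_TYPES = {
    "python": {"function_definition"},
    "go": {"function_declaration", "method_declaration"},
    "scala": {"function_definition"},
    "java": {"method_declaration"},
}
CLASS_NODE_TYPES = {
    "python": {"class_definition"},
    "go": {"type_declaration"},
    "scala": {"class_definition", "object_definition", "trait_definition"},
    "java": {"class_declaration", "interface_declaration", "enum_declaration"},
}


def classify_node(language: str, node_type: str) -> Optional[str]:
    """Return "function", "class", or None for node types that are not tracked."""
    if node_type in FUNCTION_NODE_TYPES.get(language, set()):
        return "function"
    if node_type in CLASS_NODE_TYPES.get(language, set()):
        return "class"
    return None


print(classify_node("java", "method_declaration"))
print(classify_node("scala", "trait_definition"))
print(classify_node("python", "if_statement"))
```

Because the lookup is keyed by language, the same node type name (for example `function_definition`) can be reused across grammars without ambiguity.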
Relationships:
- CONTAINS_PACKAGE: Project or Package contains Package nodes
- CONTAINS_FOLDER: Project, Package, or Folder contains Folder nodes
- CONTAINS_FILE: Project, Package, or Folder contains File nodes
- CONTAINS_MODULE: Project, Package, or Folder contains Module nodes
- DEFINES: Module defines classes/functions
- DEFINES_METHOD: Class defines methods
- DEPENDS_ON_EXTERNAL: Project depends on external packages
- CALLS: Function or Method calls other functions/methods
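The schema above maps directly onto Cypher patterns. As an illustration, the sketch below builds a parameterized query that lists the methods of a class. It assumes nodes expose a `name` property (as the export example suggests); the helper name is hypothetical, and the query text is an example, not output from the project's Cypher generator.

```python
from typing import Dict, Tuple


def methods_of_class_query(class_name: str) -> Tuple[str, Dict[str, str]]:
    """Return a Cypher query and its parameters for a class's methods."""
    query = (
        "MATCH (c:Class {name: $class_name})-[:DEFINES_METHOD]->(m:Method) "
        "RETURN m.name AS method ORDER BY method"
    )
    # Passing the name as a parameter avoids splicing user input into the query.
    return query, {"class_name": class_name}


query, params = methods_of_class_query("User")
print(query)
print(params)
```

The query and parameter dictionary can then be handed to whichever Memgraph client you use.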
Configuration is managed through environment variables in the .env file:
- GEMINI_API_KEY: Required when using Google models.
- GEMINI_MODEL_ID: Main model for orchestration (default: gemini-2.5-pro)
- MODEL_CYPHER_ID: Model for Cypher generation (default: gemini-2.5-flash-lite-preview-06-17)
- LOCAL_MODEL_ENDPOINT: Ollama endpoint (default: http://localhost:11434/v1)
- LOCAL_ORCHESTRATOR_MODEL_ID: Model for main RAG orchestration (default: llama3)
- LOCAL_CYPHER_MODEL_ID: Model for Cypher query generation (default: llama3)
- LOCAL_MODEL_API_KEY: API key for local models (default: ollama)
- MEMGRAPH_HOST: Memgraph hostname (default: localhost)
- MEMGRAPH_PORT: Memgraph port (default: 7687)
- TARGET_REPO_PATH: Default repository path (default: .)
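For scripts that need the same settings, the variables and defaults listed above can be read with the standard library. This is a minimal sketch under the assumption that the variables are already exported (for example by a dotenv loader); the `Settings` class and `load_settings` function are hypothetical and not the project's configuration module.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    memgraph_host: str
    memgraph_port: int
    target_repo_path: str
    local_model_endpoint: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, falling back to the documented defaults."""
    return Settings(
        memgraph_host=env.get("MEMGRAPH_HOST", "localhost"),
        memgraph_port=int(env.get("MEMGRAPH_PORT", "7687")),
        target_repo_path=env.get("TARGET_REPO_PATH", "."),
        local_model_endpoint=env.get(
            "LOCAL_MODEL_ENDPOINT", "http://localhost:11434/v1"
        ),
    )


# An empty mapping exercises the defaults; a partial one overrides a single value.
print(load_settings({}))
print(load_settings({"MEMGRAPH_PORT": "7688"}).memgraph_port)
```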
- tree-sitter: Core Tree-sitter library for language-agnostic parsing
- tree-sitter-{language}: Language-specific grammars (Python, JS, TS, Rust, Go, Scala, Java)
- pydantic-ai: AI agent framework for RAG orchestration
- pymgclient: Memgraph Python client for graph database operations
- loguru: Advanced logging with structured output
- python-dotenv: Environment variable management
The agent is designed with a deliberate workflow to ensure it acts with context and precision, especially when modifying the file system.
The agent has access to a suite of tools to understand and interact with the codebase:
- query_codebase_knowledge_graph: The primary tool for understanding the repository. It queries the graph database to find files, functions, classes, and their relationships based on natural language.
- get_code_snippet: Retrieves the exact source code for a specific function or class.
- read_file_content: Reads the entire content of a specified file.
- create_new_file: Creates a new file with specified content.
- replace_code_surgically: Replaces a specific code block in a file. Requires the exact target code and its replacement, and modifies only that block, leaving the rest of the file unchanged.
- execute_shell_command: Executes a shell command in the project's environment.
The agent uses AST-based function targeting with Tree-sitter for precise code modifications. Features include:
- Visual diff preview before changes
- Surgical patching that only modifies target code blocks
- Multi-language support across all supported languages
- Security sandbox preventing edits outside project directory
- Smart function matching with qualified names and line numbers
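The ideas behind surgical patching, diff preview, and the project-directory sandbox can be sketched with the standard library. Note the swap: this example uses plain exact-string matching and `difflib`, not the AST-based targeting the agent uses, and all three function names are hypothetical.

```python
import difflib
from pathlib import Path


def is_inside_project(project_root: Path, candidate: Path) -> bool:
    """Sandbox check: True only if candidate resolves to a path under project_root."""
    root = project_root.resolve()
    target = (root / candidate).resolve()
    return target == root or root in target.parents


def replace_block(source: str, target: str, replacement: str) -> str:
    """Replace exactly one occurrence of target; refuse ambiguous or missing matches."""
    occurrences = source.count(target)
    if occurrences != 1:
        raise ValueError(f"expected exactly 1 match for target block, found {occurrences}")
    return source.replace(target, replacement, 1)


def preview_diff(before: str, after: str, filename: str) -> str:
    """Render a unified diff so the change can be reviewed before writing."""
    lines = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{filename}",
        tofile=f"b/{filename}",
    )
    return "".join(lines)


original = "def greet():\n    return 'hi'\n\ndef bye():\n    return 'bye'\n"
patched = replace_block(original, "    return 'hi'\n", "    return 'hello'\n")
print(preview_diff(original, patched, "example.py"))
```

Refusing to patch when the target appears zero or several times is what keeps the edit confined to one block; everything outside the matched text is left byte-for-byte unchanged.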
- Python: Full support including nested functions, methods, classes, decorators, type hints, and package structure
- JavaScript: ES6 modules, CommonJS modules, prototype-based methods, object methods, arrow functions, classes, and JSX support
- TypeScript: All JavaScript features plus interfaces, type aliases, enums, namespaces, generics, and advanced type inference
- C++: Comprehensive support including functions, classes, structs, unions, enums, constructors, destructors, operator overloading, templates, lambdas, namespaces, C++20 modules, inheritance, method calls, and modern C++ features
- Lua: Functions, local/global variables, tables, metatables, closures, coroutines, and object-oriented patterns
- Rust: Functions, structs, enums, impl blocks, traits, and associated functions
- Go: Functions, methods, type declarations, interfaces, and struct definitions
- Scala: Functions, methods, classes, objects, traits, case classes, implicits, and Scala 3 syntax
- Java: Methods, constructors, classes, interfaces, enums, annotations, generics, modern features (records, sealed classes, switch expressions), concurrency patterns, reflection, and enterprise frameworks
Graph-Code makes it easy to add support for any language that has a Tree-sitter grammar. The system automatically handles grammar compilation and integration.
Recommendation: While you can add languages yourself, we recommend waiting for official full support to ensure optimal parsing quality, comprehensive feature coverage, and robust integration. The languages marked as "In Development" above will receive dedicated optimization and testing.
Request Support: If you want a specific language to be officially supported, please submit an issue with your language request.
Use the built-in language management tool to add any Tree-sitter supported language:
# Add a language using the standard tree-sitter repository
python -m codebase_rag.tools.language add-grammar <language-name>
# Examples:
python -m codebase_rag.tools.language add-grammar c-sharp
python -m codebase_rag.tools.language add-grammar php
python -m codebase_rag.tools.language add-grammar ruby
python -m codebase_rag.tools.language add-grammar kotlin
For languages hosted outside the standard tree-sitter organization:
# Add a language from a custom repository
python -m codebase_rag.tools.language add-grammar --grammar-url https://github.com/custom/tree-sitter-mylang
When you add a language, the tool automatically:
- Downloads the Grammar: Clones the tree-sitter grammar repository as a git submodule
- Detects Configuration: Auto-extracts language metadata from tree-sitter.json
- Analyzes Node Types: Automatically identifies AST node types for:
  - Functions/methods (method_declaration, function_definition, etc.)
  - Classes/structs (class_declaration, struct_declaration, etc.)
  - Modules/files (compilation_unit, source_file, etc.)
  - Function calls (call_expression, method_invocation, etc.)
- Compiles Bindings: Builds Python bindings from the grammar source
- Updates Configuration: Adds the language to codebase_rag/language_config.py
- Enables Parsing: Makes the language immediately available for codebase analysis
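The node-type analysis step can be approximated by scanning a grammar's node-types.json, which Tree-sitter grammars ship as a list of entries with `type` and `named` fields. The keyword rules below are a crude, hypothetical heuristic written for illustration; the real tool's detection logic may differ.

```python
import json
from typing import Dict, List

# Hypothetical keyword heuristics. "calls" is checked first so that a name like
# "method_invocation" is treated as a call rather than a function definition.
CATEGORY_KEYWORDS = {
    "calls": ("call", "invocation"),
    "functions": ("function", "method", "constructor", "destructor"),
    "classes": ("class", "struct", "enum", "interface"),
}


def categorize_node_types(node_types_json: str) -> Dict[str, List[str]]:
    """Group named node types into calls, functions, and classes by keyword."""
    result: Dict[str, List[str]] = {name: [] for name in CATEGORY_KEYWORDS}
    for entry in json.loads(node_types_json):
        if not entry.get("named", False):
            continue  # skip anonymous tokens such as "(" or the keyword "class"
        node_type = entry["type"]
        for category, keywords in CATEGORY_KEYWORDS.items():
            if any(keyword in node_type for keyword in keywords):
                result[category].append(node_type)
                break  # first matching category wins
    return result


sample = json.dumps([
    {"type": "method_declaration", "named": True},
    {"type": "class_declaration", "named": True},
    {"type": "invocation_expression", "named": True},
    {"type": "class", "named": False},
])
print(categorize_node_types(sample))
```

A substring heuristic like this will over-match on some grammars, which is one reason the detected lists may need manual adjustment afterwards.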
$ python -m codebase_rag.tools.language add-grammar c-sharp
Using default tree-sitter URL: https://github.com/tree-sitter/tree-sitter-c-sharp
Adding submodule from https://github.com/tree-sitter/tree-sitter-c-sharp...
Successfully added submodule at grammars/tree-sitter-c-sharp
Auto-detected language: c-sharp
Auto-detected file extensions: ['cs']
Auto-detected node types:
Functions: ['destructor_declaration', 'method_declaration', 'constructor_declaration']
Classes: ['struct_declaration', 'enum_declaration', 'interface_declaration', 'class_declaration']
Modules: ['compilation_unit', 'file_scoped_namespace_declaration', 'namespace_declaration']
Calls: ['invocation_expression']
Language 'c-sharp' has been added to the configuration!
Updated codebase_rag/language_config.py
# List all configured languages
python -m codebase_rag.tools.language list-languages
# Remove a language (this also removes the git submodule unless --keep-submodule is specified)
python -m codebase_rag.tools.language remove-language <language-name>
The system uses a configuration-driven approach for language support. Each language is defined in codebase_rag/language_config.py with the following structure:
"language-name": LanguageConfig(
    name="language-name",
    file_extensions=[".ext1", ".ext2"],
    function_node_types=["function_declaration", "method_declaration"],
    class_node_types=["class_declaration", "struct_declaration"],
    module_node_types=["compilation_unit", "source_file"],
    call_node_types=["call_expression", "method_invocation"],
),
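To make the structure above concrete, the sketch below defines a stand-in `LanguageConfig` dataclass with the same field names and uses it to pick a language by file extension. The dataclass, the `LANGUAGE_CONFIGS` dictionary, and `config_for_file` are assumptions for illustration; the project's own class may carry more fields or behave differently.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Optional


@dataclass
class LanguageConfig:
    """Stand-in mirroring the fields shown above (not the project's class)."""
    name: str
    file_extensions: List[str]
    function_node_types: List[str] = field(default_factory=list)
    class_node_types: List[str] = field(default_factory=list)
    module_node_types: List[str] = field(default_factory=list)
    call_node_types: List[str] = field(default_factory=list)


LANGUAGE_CONFIGS: Dict[str, LanguageConfig] = {
    "c-sharp": LanguageConfig(
        name="c-sharp",
        file_extensions=[".cs"],
        function_node_types=["method_declaration", "constructor_declaration"],
        class_node_types=["class_declaration", "struct_declaration"],
        module_node_types=["compilation_unit"],
        call_node_types=["invocation_expression"],
    ),
}


def config_for_file(path: str) -> Optional[LanguageConfig]:
    """Return the config whose extensions include the file's suffix, if any."""
    suffix = Path(path).suffix.lower()
    for config in LANGUAGE_CONFIGS.values():
        if suffix in config.file_extensions:
            return config
    return None


match = config_for_file("src/Program.cs")
print(match.name if match else "no language configured")
```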
Grammar not found: If the automatic URL doesn't work, use a custom URL:
python -m codebase_rag.tools.language add-grammar --grammar-url https://github.com/custom/tree-sitter-mylang
Version incompatibility: If you get "Incompatible Language version" errors, update your tree-sitter package:
uv add tree-sitter@latest
Missing node types: The tool automatically detects common node patterns, but you can manually adjust the configuration in language_config.py if needed.
You can build a binary of the application using the build_binary.py script. This script uses PyInstaller to package the application and its dependencies into a single executable.
python build_binary.py
The resulting binary will be located in the dist directory.
- Check Memgraph connection:
  - Ensure Docker containers are running: docker-compose ps
  - Verify Memgraph is accessible on port 7687
- View database in Memgraph Lab:
  - Open http://localhost:3000
  - Connect to memgraph:7687
- For local models:
  - Verify Ollama is running: ollama list
  - Pull the model if it is not yet downloaded: ollama pull llama3
  - Test Ollama API: curl http://localhost:11434/v1/models
  - Check Ollama logs: journalctl -u ollama (Linux with systemd) or ~/.ollama/logs/server.log (macOS)
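The port checks above can also be run from Python, which is handy inside scripts or CI. This is a small sketch using only the standard library; `port_is_open` is a hypothetical helper, and a successful TCP connection only shows that something is listening, not that Memgraph or Ollama is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Ports taken from the documentation above: Memgraph (7687) and Ollama (11434).
for name, port in (("Memgraph", 7687), ("Ollama", 11434)):
    state = "reachable" if port_is_open("localhost", port) else "not reachable"
    print(f"{name} on localhost:{port} is {state}")
```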
Please see CONTRIBUTING.md for detailed contribution guidelines.
Good first PRs are from TODO issues.
For issues or questions:
- Check the logs for error details
- Verify Memgraph connection
- Ensure all environment variables are set
- Review the graph schema matches your expectations