A modular, production-ready example of using GEPA (Genetic-Pareto) to optimize prompts in DSPy.
```bash
# 1. Create and activate virtual environment (recommended)
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# 2. Install dependencies
pip install -r requirements.txt

# 3. Set your OpenAI API key
export OPENAI_API_KEY='your-api-key-here'

# 4. Run the sentiment classification example
python main.py --task sentiment

# Or run the question answering example
python main.py --task qa
```

```text
dspy-gepa-example/
├── config.py            # Language model configuration
├── datasets/            # Dataset definitions (per-task organization)
│   ├── __init__.py
│   ├── sentiment.py     # Sentiment classification data
│   └── qa.py            # Question answering data
├── models/              # Model signatures and modules (per-task)
│   ├── __init__.py
│   ├── sentiment.py     # Sentiment models
│   └── qa.py            # QA models
├── metrics/             # Evaluation metrics (per-task)
│   ├── __init__.py
│   ├── sentiment.py     # Sentiment metrics
│   ├── qa.py            # QA metrics
│   └── common.py        # Shared utilities
├── tasks.py             # Task registry (glues everything together)
├── main.py              # Main tutorial orchestration
└── requirements.txt     # Project dependencies
```
This project uses GEPA to optimize prompts for multiple tasks:

**Sentiment Classification**
- Classify text as positive or negative
- Single-input task demonstrating basic GEPA usage
- GEPA optimization level: "light"

**Question Answering**
- Answer questions based on context
- Multi-input task (question + context)
- GEPA optimization level: "medium"

For each task, the tutorial follows the same four-step workflow (sketched right after this list):

1. **Baseline Evaluation** - Test the unoptimized Chain of Thought model
2. **GEPA Optimization** - Automatically improve prompts through evolution
3. **Optimized Evaluation** - Measure performance gains
4. **Comparison** - Quantify the improvement
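To make the loop concrete, here is a minimal sketch of those four steps built on DSPy's `Evaluate` and `GEPA` utilities. The names `TaskModule`, `metric`, `train`, and `dev` stand in for the per-task files in this project; this is an illustrative sketch, not the exact code in `main.py`:

```python
import dspy

# Minimal sketch of the four-step loop (assumed names; not the exact main.py code).
# Assumes an LM has already been configured (see config.py).
evaluate = dspy.Evaluate(devset=dev, metric=metric, display_progress=True)

baseline = TaskModule()
baseline_score = evaluate(baseline)                    # 1. Baseline evaluation

gepa = dspy.GEPA(metric=metric, auto="light",
                 reflection_lm=dspy.LM("openai/gpt-4o"))
optimized = gepa.compile(baseline, trainset=train, valset=dev)  # 2. GEPA optimization

optimized_score = evaluate(optimized)                  # 3. Optimized evaluation
print(baseline_score, optimized_score)                 # 4. Comparison
```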
The per-task file organization makes it easy to:
- Understand what code belongs to which task
- Add new tasks without touching existing ones
- Experiment with different models and datasets
- Scale to production use cases
- Python 3.9 or higher
- OpenAI API key (or another LLM provider supported by DSPy)
  - Get one at: https://platform.openai.com/api-keys
1. Create a virtual environment (recommended):

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set your API key (required):

   ```bash
   export OPENAI_API_KEY='your-api-key-here'
   ```

   Or for other providers:

   ```bash
   export ANTHROPIC_API_KEY='your-api-key-here'
   ```

   Note: The script will not work without an API key set.
Note: Make sure you've activated your virtual environment and set your API key before running!

```bash
source venv/bin/activate                    # Activate virtual environment
export OPENAI_API_KEY='your-api-key-here'   # Set API key
python main.py
```

Or explicitly:

```bash
python main.py --task sentiment
python main.py --task qa
```

Edit `config.py` or modify the `get_default_lm()` function:
```python
from config import configure_lm

# Use Anthropic Claude
configure_lm(provider="anthropic", model="claude-3-5-sonnet-20241022")

# Use Together AI
configure_lm(provider="together", model="meta-llama/Llama-3-70b-chat-hf")
```
The per-task file organization makes adding new tasks straightforward. Each task needs 3 files (plus the small registry updates shown below).

Create `datasets/your_task.py`:

```python
"""Your task dataset."""
import dspy

YOUR_TASK_TRAIN_DATA = [
    ("input 1", "output 1"),
    ("input 2", "output 2"),
    # ...
]

YOUR_TASK_DEV_DATA = [
    ("input 1", "output 1"),
    # ...
]

def get_data():
    """Get your task train and dev datasets."""
    train = []
    for input_val, output_val in YOUR_TASK_TRAIN_DATA:
        ex = dspy.Example(input=input_val, output=output_val)
        train.append(ex.with_inputs("input"))

    dev = []
    for input_val, output_val in YOUR_TASK_DEV_DATA:
        ex = dspy.Example(input=input_val, output=output_val)
        dev.append(ex.with_inputs("input"))

    return train, dev
```
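The `with_inputs("input")` call marks which fields are model inputs; the remaining fields act as labels for the metric. A quick sanity check, assuming the file above:

```python
# .inputs() keeps only the fields marked as inputs.
train, dev = get_data()
print(train[0].inputs())  # Example with just the 'input' field
print(train[0].output)    # the label is still available for metrics
```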
Update `datasets/__init__.py`:

```python
from .your_task import get_data as get_your_task_data

__all__ = [..., "get_your_task_data"]
```

Create `models/your_task.py`:

```python
"""Your task models."""
import dspy
class YourTaskSignature(dspy.Signature):
"""Description of your task."""
input: str = dspy.InputField(desc="Input description")
output: str = dspy.OutputField(desc="Output description")
class YourTaskModule(dspy.Module):
"""Your task module with Chain of Thought reasoning."""
def __init__(self):
super().__init__()
self.predictor = dspy.ChainOfThought(YourTaskSignature)
def forward(self, input):
return self.predictor(input=input)Update models/__init__.py:
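Once an LM is configured, the module can be called directly. A brief usage sketch with the names above:

```python
# Assumes configure_lm() has already been called (see config.py).
module = YourTaskModule()
prediction = module(input="input 1")
print(prediction.reasoning)  # ChainOfThought adds a reasoning field
print(prediction.output)
```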
Update `models/__init__.py`:

```python
from .your_task import YourTaskSignature, YourTaskModule

__all__ = [..., "YourTaskSignature", "YourTaskModule"]
```

Create `metrics/your_task.py`:

```python
"""Your task metrics."""
def accuracy(gold, pred, trace=None, pred_name=None, pred_trace=None) -> bool:
"""Check if prediction is correct."""
return gold.output.lower() == pred.output.lower()Update metrics/__init__.py:
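The extra `pred_name` and `pred_trace` parameters exist because GEPA passes them when it asks the metric about a specific predictor during reflection. As I understand `dspy.GEPA`'s metric protocol, a metric may also return a `dspy.Prediction` with `score` and `feedback` fields to give the optimizer textual guidance; a hedged sketch:

```python
import dspy

# Sketch of a feedback-returning metric, assuming dspy.GEPA accepts either
# a plain score or a dspy.Prediction(score=..., feedback=...).
def accuracy_with_feedback(gold, pred, trace=None, pred_name=None, pred_trace=None):
    correct = gold.output.lower() == pred.output.lower()
    feedback = "Correct." if correct else f"Expected '{gold.output}', got '{pred.output}'."
    return dspy.Prediction(score=float(correct), feedback=feedback)
```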
Update `metrics/__init__.py`:

```python
from .your_task import accuracy as your_task_accuracy

__all__ = [..., "your_task_accuracy"]
```

Add to the `TASKS` dictionary in `tasks.py`:

```python
TASKS = {
    # ... existing tasks ...
    "your_task": {
        "name": "Your Task Name",
        "get_data": get_your_task_data,
        "model_class": YourTaskModule,
        "metric": your_task_accuracy,
        "gepa_auto": "medium",  # or "light", "heavy"
        "input_fields": ["input"],
        "output_field": "output",
    },
}
```

Then run:

```bash
python main.py --task your_task
# Or: python3 main.py --task your_task
```

GEPA's key parameters (see the sketch just below):

- `metric`: Function to evaluate prompt quality
- `auto`: Optimization intensity level ("light", "medium", "heavy")
- `reflection_lm`: Separate LM for generating instruction variations
- Higher `auto` levels = more exploration and refinement iterations
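Putting those parameters together, a GEPA invocation looks roughly like this (a sketch under the assumptions above, not the project's exact code):

```python
import dspy

# Rough sketch of a GEPA call; main.py may differ in details.
gepa = dspy.GEPA(
    metric=your_task_accuracy,               # evaluates each candidate prompt
    auto="medium",                           # "light" | "medium" | "heavy"
    reflection_lm=dspy.LM("openai/gpt-4o"),  # stronger LM proposes new instructions
)
optimized = gepa.compile(YourTaskModule(), trainset=train, valset=dev)
optimized.save("optimized_your_task.json")   # persist the optimized program
```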
| Task | Auto Level | Rationale |
|---|---|---|
| Sentiment | "light" | Simple task, single input field |
| QA | "medium" | Complex task, multiple inputs need more optimization |
`config.py` provides:

- `configure_lm()`: Configure DSPy with any LLM provider
- `get_default_lm()`: Quick setup with OpenAI GPT-4o-mini
- `PROVIDER_CONFIGS`: Pre-configured settings for common providers
Each task has its own dataset file:

- `sentiment.py`: Sentiment classification data and loader
- `qa.py`: Question answering data and loader
- Add new tasks by creating new files
Each task has its own model file:

- `sentiment.py`: `SentimentClassification` signature and `SentimentClassifier` module
- `qa.py`: `QuestionAnswering` signature and `QAModule`
- Add new tasks by creating new files
Each task has its own metrics file:

- `sentiment.py`: `accuracy()` metric
- `qa.py`: `accuracy()` metric
- `common.py`: Shared utilities (`exact_match()`, `evaluate_model()`)
`tasks.py`:

- Task configuration registry (`TASKS` dictionary)
- Imports and organizes all task components
- Generic evaluation functions that work with all tasks (sketched below)

`main.py`:

- Command-line interface for task selection
- Complete tutorial workflow
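To see why the registry keeps evaluation generic, here is a hypothetical helper in the spirit of `tasks.py` (names assumed; the real code may differ):

```python
import dspy
from tasks import TASKS  # the registry shown earlier

# Hypothetical task-agnostic evaluation helper; the real code may differ.
def evaluate_task(task_name: str, module=None):
    cfg = TASKS[task_name]
    _, dev = cfg["get_data"]()               # each task supplies its own data...
    module = module or cfg["model_class"]()  # ...and its own module
    evaluate = dspy.Evaluate(devset=dev, metric=cfg["metric"],
                             display_progress=True)
    return evaluate(module)
```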
Running the sentiment task will show:
- Baseline model performance on dev set
- GEPA optimization progress (breadth=2, depth=1)
- Optimized model performance on dev set
- Performance comparison and improvement metrics
- Demo predictions on new examples
Running the QA task will show the same workflow but with:
- Multi-field inputs (question + context)
- More intensive GEPA optimization (breadth=3, depth=2)
Make sure you've installed dependencies and activated your virtual environment:

```bash
source venv/bin/activate
pip install -r requirements.txt
```

If `python` isn't found, use `python3` instead:

```bash
python3 main.py --task sentiment
```

Ensure your API key is set as an environment variable:

```bash
# Check if it's set
echo $OPENAI_API_KEY

# Set it if needed
export OPENAI_API_KEY='your-api-key-here'
```

The GEPA optimization can take a few minutes. Be patient and watch for the progress bars.
If you encounter `RateLimitError` or quota-exceeded errors:

**Error: "You exceeded your current quota."** This means you've hit your OpenAI billing/usage cap:
- Check your usage at https://platform.openai.com/usage
- Verify you have credits or add more at https://platform.openai.com/settings/organization/billing
- Create a new API key if needed at https://platform.openai.com/api-keys
Error: "Rate limit exceeded" You're making too many requests per minute. Solutions:
1. The code already includes retry logic with exponential backoff (configured in `config.py` and `main.py`).

2. Reduce LLM call volume by optimizing task parameters:

   ```python
   # In models/math.py (or your task file)
   # Reduce ReAct iterations
   max_iters=2  # Instead of 5

   # In tasks.py
   # Use lighter GEPA optimization
   "gepa_auto": "light",  # Instead of "medium" or "heavy"

   # In datasets/your_task.py
   # Use fewer training examples
   TRAIN_DATA = [...]  # Reduce from 10 to 5 examples
   ```

3. Increase retry parameters in `config.py`:

   ```python
   configure_lm(
       model="gpt-5-mini",
       num_retries=10,   # Increase from 5
       timeout=120.0,    # Increase timeout
   )
   ```

4. Switch to faster/cheaper models:
   - Use `gpt-5-nano` instead of `gpt-5-mini` for even faster inference
   - Or use `gpt-4o-mini` for lower costs
Understanding LLM Call Volume:

- ReAct tasks (like math) make multiple calls per example (up to `max_iters` iterations)
- GEPA optimization tests multiple prompt variations on your training set
- Total calls ≈ (GEPA variations) × (training examples) × (max_iters)
- Example: 3 variations × 5 examples × 2 iters = 30 calls during optimization
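As a quick sanity check before an expensive run, you can plug your own numbers into that formula:

```python
# Back-of-the-envelope call estimate using the formula above.
variations, examples, max_iters = 3, 5, 2
print(variations * examples * max_iters)  # 30 calls during optimization
```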
This is a starter template. Feel free to:
- Add new tasks and datasets (just create 3 new files!)
- Experiment with different models (ReAct, ProgramOfThought, etc.)
- Try different optimizers (BootstrapFewShot, COPRO, MIPROv2)
- Extend evaluation metrics
- Share your improvements!