This system generates high-quality fine-tuning data for Azure OpenAI by analyzing the containerd repository. It uses GPT-4o to build comprehensive training datasets from both source code and GitHub issues, formatted specifically for supervised fine-tuning of models into containerd domain experts.
Note: While configured for containerd, this system can be easily adapted to work with any code repository by changing the repository path and adjusting the priority scoring logic.
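For example, adapting to another repository mostly means tuning how files are scored. The snippet below is a hypothetical illustration of that kind of heuristic; the weights and path patterns are made up for illustration, not the shipped logic:

```python
# Hypothetical priority-scoring heuristic: the weights and path patterns
# below are illustrative, not the shipped logic.
def priority_score(path: str) -> float:
    score = 1.0
    if path.endswith(".go"):
        score += 10.0   # core implementation files
    if path.startswith("client/"):
        score += 20.0   # public API surface scores highest
    if path.endswith("_test.go"):
        score -= 5.0    # tests contribute less training signal
    return max(score, 1.0)
```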
Prerequisites:
- Azure OpenAI resource with GPT-4o deployment
- Azure CLI installed and configured (`az login`)
- Python 3.8+
- GitHub token (for issue mining)
Install dependencies:
```bash
cd /workspace/containerd-agent
pip install -r requirements.txt
```
Configure environment:
```bash
cp .env.template .env
# Edit .env with your Azure OpenAI endpoint, deployment, and GitHub token
```
Test the setup:
```bash
python3 code-scanner/test_code_training_generator.py
```
Generate training data from containerd source code:
```bash
# Test run (10 Q&A pairs from 5 files)
python3 code-scanner/generate_code_training_data.py \
    --max-files 5 --max-qa-entries 10 --max-files-per-minute 3

# Production run (6,000 Q&A pairs from 855 files, ~$180, 5 hours)
python3 code-scanner/generate_code_training_data.py \
    --max-files 855 --max-qa-entries 6000 --max-files-per-minute 6

# Maximum coverage (12,000 Q&A pairs, ~$360, 10 hours)
python3 code-scanner/generate_code_training_data.py \
    --max-files 855 --max-qa-entries 12000 --max-files-per-minute 6
```

Generate training data from GitHub issues:
```bash
# Step 1: Fetch and prioritize issues (one-time)
python3 issue-miner/prioritize_github_issues.py

# Step 2: Generate training data (3,000 Q&A pairs, ~$25, 3 hours)
python3 issue-miner/generate_issue_training_data.py \
    --max-qa-entries 3000 --max-issues-per-minute 6
```

For long-running jobs:
```bash
# Code training (background)
nohup python3 -u code-scanner/generate_code_training_data.py \
    --max-files 855 --max-qa-entries 12000 --max-files-per-minute 6 \
    > output/code_generation.log 2>&1 &

# Issue training (background)
nohup python3 -u issue-miner/generate_issue_training_data.py \
    --max-qa-entries 3000 --max-issues-per-minute 6 \
    > output/issue_generation.log 2>&1 &
```

- Priority-based: Higher-priority items get more Q&A pairs (see the allocation sketch after this list)
- Quota management: Total Q&A pairs distributed according to global quota
- Fair distribution: Lower priority items still get at least 1 Q&A pair
- Per-file limits: Prevents any single file from dominating (max 20 Q&A per file)
- TPM-aware: Respects Azure OpenAI 100K TPM quotas
- Cost estimation: Real-time cost tracking and estimates
- Efficient processing: 6 files/issues per minute for stable throughput
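A minimal sketch of what this allocation policy looks like, assuming a mapping of items to priority scores. The function name and example weights are illustrative; the shipped logic lives in `utils/`:

```python
# Sketch of priority-weighted Q&A allocation (hypothetical; the real
# logic lives in utils/). Each item gets a share of the global quota
# proportional to its priority score, clamped to [1, 20].
def allocate_qa_pairs(priority_scores, total_quota, per_item_cap=20):
    total_priority = sum(priority_scores.values())
    allocation = {}
    for item, score in priority_scores.items():
        share = round(total_quota * score / total_priority)
        # Fair distribution: every item gets at least 1 pair; the
        # per-file cap keeps any single file from dominating.
        allocation[item] = max(1, min(per_item_cap, share))
    return allocation

files = {"client/client.go": 32.0,
         "snapshots/overlay/overlay.go": 18.5,
         "docs/README.md": 1.0}
print(allocate_qa_pairs(files, total_quota=60))
# {'client/client.go': 20, 'snapshots/overlay/overlay.go': 20, 'docs/README.md': 1}
```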
- Environment variables: All sensitive config in `.env` (git-excluded)
- Azure AD authentication: Uses Azure CLI credentials (see the sketch after this list)
- No hardcoded secrets: All endpoints and tokens via environment variables
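As a rough sketch, client construction might look like the following, assuming the `openai`, `azure-identity`, and `python-dotenv` packages and an `AZURE_OPENAI_ENDPOINT` variable in `.env` (variable names are illustrative):

```python
# Sketch: Azure OpenAI client using Azure CLI credentials instead of an
# API key (assumes `az login` has been run).
import os
from dotenv import load_dotenv
from azure.identity import AzureCliCredential, get_bearer_token_provider
from openai import AzureOpenAI

load_dotenv()  # pulls AZURE_OPENAI_ENDPOINT etc. from the git-excluded .env

token_provider = get_bearer_token_provider(
    AzureCliCredential(), "https://cognitiveservices.azure.com/.default"
)
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-06-01",
    azure_ad_token_provider=token_provider,  # no hardcoded secrets
)
```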
Project layout:
- `code-scanner/`: AI-powered training data generation from Go source files
- `issue-miner/`: GitHub issue mining and training data generation
- `utils/`: Shared Q&A allocation and processing utilities
- `output/`: Generated JSONL files and metadata
| Component | Q&A Pairs | Cost | Time |
|---|---|---|---|
| Issue data | 3,000 | ~$25 | 3 hours |
| Code data (balanced) | 6,000 | ~$180 | 5 hours |
| Code data (maximum) | 12,000 | ~$360 | 10 hours |
| Combined maximum | 15,000 | ~$385 | 13 hours |
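The code-data figures are dominated by prompt size, since whole source files go into each prompt. A back-of-envelope estimate, with assumed GPT-4o pay-as-you-go rates and per-pair token counts chosen only to land in the table's ballpark:

```python
# Back-of-envelope cost estimator matching the table above. The rates
# and token counts are assumptions; actual rates vary by region and date.
INPUT_PER_1K = 0.0025   # USD per 1K input tokens (assumption)
OUTPUT_PER_1K = 0.01    # USD per 1K output tokens (assumption)

def estimate_cost(n_pairs, in_tokens_per_pair, out_tokens_per_pair):
    per_pair = (in_tokens_per_pair / 1000 * INPUT_PER_1K
                + out_tokens_per_pair / 1000 * OUTPUT_PER_1K)
    return n_pairs * per_pair

# Code data: large source files in the prompt dominate the cost.
print(f"${estimate_cost(6000, 10000, 500):.0f}")   # ≈ $180
```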
Generated training data is in JSONL format compatible with Azure OpenAI fine-tuning:
```json
{
"messages": [
{
"role": "system",
"content": "You are an expert in containerd..."
},
{
"role": "user",
"content": "How do I configure containerd snapshotter?"
},
{
"role": "assistant",
"content": "To configure containerd snapshotter..."
}
],
"metadata": {
"source": "code_file",
"file_path": "client/client.go",
"priority_score": 32.0
}
}
```
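A quick way to sanity-check a generated file before uploading it for fine-tuning (the file name is illustrative):

```python
# Verify that each JSONL line parses and has the expected chat structure.
import json

with open("output/code_training_data.jsonl") as f:
    for n, line in enumerate(f, 1):
        if not line.strip():
            continue  # tolerate a trailing blank line
        record = json.loads(line)
        roles = [m["role"] for m in record["messages"]]
        assert roles == ["system", "user", "assistant"], f"line {n}: {roles}"
print("all records look well-formed")
```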
Check logs for real-time progress:

```bash
# Watch current generation
tail -f output/code_generation.log
tail -f output/issue_generation.log

# Check allocation summary
grep "Q&A Allocation Summary" output/*.log
```

This system creates comprehensive, high-quality training data for fine-tuning Azure OpenAI models on containerd expertise.