Commit 31efc0f

launch
0 parents  commit 31efc0f

654 files changed, +1102537 -0 lines changed

.env.example

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
# Example .env file for eval protocol development
# Copy this file to .env (in your project root) and fill in your actual values.
# IMPORTANT: Add .env to your .gitignore file to avoid committing secrets!

# Fireworks AI Credentials (for interacting with the Fireworks platform)
# These are used by eval protocol to deploy evaluators to Fireworks, preview, etc.
FIREWORKS_API_KEY="your_fireworks_api_key_here"
FIREWORKS_ACCOUNT_ID="your_fireworks_account_id_here" # e.g., "fireworks" or your specific account

# Optional: If targeting a non-production Fireworks API endpoint
# FIREWORKS_API_BASE="https://dev.api.fireworks.ai"

# GCP Configuration (for --target gcp-cloud-run if not using eval-protocol.yaml or CLI args)
# Note: It's generally recommended to set these in eval-protocol.yaml or pass via CLI for specific projects.
# However, you can set global defaults here if you frequently work with the same GCP setup.
# GCP_PROJECT_ID="your_default_gcp_project_id"
# GCP_REGION="your_default_gcp_region" # e.g., us-central1
# GCP_AR_REPO="your_default_artifact_registry_repo_name" # e.g., eval-protocol-evaluators

# E2B API Key (if working with E2B code execution features)
# E2B_API_KEY="your_e2b_api_key_here"

# Other environment variables your custom reward functions might need
# MY_CUSTOM_SERVICE_API_KEY="some_other_key"
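
The file above is a list of `KEY="value"` pairs with `#` comments. As a rough illustration of how such a file could be checked before running anything that needs the Fireworks credentials, here is a minimal standard-library sketch. The helper names `parse_env_text` and `missing_required` are hypothetical and not part of eval protocol, and treating the two uncommented variables as the required ones is an assumption drawn only from the example file.

```python
# Hypothetical helpers for illustration; not part of the eval protocol package.

# Assumption: the two variables left uncommented in .env.example are the required ones.
REQUIRED_KEYS = ("FIREWORKS_API_KEY", "FIREWORKS_ACCOUNT_ID")


def parse_env_text(text: str) -> dict:
    """Parse KEY=value lines, skipping blank lines and lines starting with '#'."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if value[:1] in ('"', "'"):
            # Quoted value: keep the text between the quotes, drop any trailing comment.
            quote = value[0]
            end = value.find(quote, 1)
            value = value[1:end] if end != -1 else value[1:]
        else:
            # Unquoted value: treat ' #' as the start of a trailing comment.
            value = value.split(" #", 1)[0].strip()
        values[key.strip()] = value
    return values


def missing_required(values: dict) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

A typical call would be `missing_required(parse_env_text(Path(".env").read_text()))` with `pathlib.Path` imported. In practice a dedicated dotenv loader handles more edge cases (escapes, multi-line values) than this sketch does.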

.flake8

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
[flake8]
max-line-length = 119
ignore = E203, W503
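
`.flake8` is an INI-style file. E203 (whitespace before `:`) and W503 (line break before a binary operator) are the two checks commonly ignored when code is formatted with black. The sketch below reads the three lines above with the standard-library `configparser` to show what the settings resolve to; `read_flake8_settings` is a hypothetical helper, not something flake8 provides. Note that the CI workflow later in this commit passes `--max-line-length=88` on the command line, and flake8 command-line options take precedence over the config file, so the 119 limit applies only to local runs without that flag.

```python
import configparser

# The .flake8 content from this commit, inlined so the example is self-contained.
FLAKE8_TEXT = """\
[flake8]
max-line-length = 119
ignore = E203, W503
"""


def read_flake8_settings(text: str):
    """Return (max_line_length, ignored_codes) from INI-formatted flake8 settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["flake8"]
    max_len = section.getint("max-line-length")
    # The ignore list is comma-separated; strip whitespace around each code.
    ignored = [code.strip() for code in section["ignore"].split(",") if code.strip()]
    return max_len, ignored
```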

.gitattributes

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
eval_protocol/_version.py export-subst
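
The `export-subst` attribute tells `git archive` to expand `$Format:...$` placeholders in the named file, which is how version-from-git tooling can recover commit information from a source tarball that has no `.git` directory. The contents of `eval_protocol/_version.py` are not shown in this excerpt, so the following is only a sketch of the general pattern: the placeholder string and the `resolve_commit` helper are assumptions for illustration.

```python
from typing import Optional

# Assumed placeholder: %H is git's pretty-format code for the full commit hash.
# In a plain checkout this string stays literal; `git archive` rewrites it.
GIT_FULL_PLACEHOLDER = "$Format:%H$"


def resolve_commit(raw: str) -> Optional[str]:
    """Return the commit hash if the placeholder was expanded, else None."""
    value = raw.strip()
    if value.startswith("$Format"):
        # Still the literal placeholder: the file did not come from `git archive`.
        return None
    return value or None
```

When the function returns `None`, version tooling typically falls back to asking git directly or to a static version string.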
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
---
name: Bug Report
about: Create a report to help us improve
title: "[BUG] Brief description of bug"
labels: bug
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots (if applicable)**
If applicable, add screenshots to help explain your problem.

**Environment (please complete the following information):**
- OS: [e.g. macOS, Windows, Linux]
- Python version: [e.g. 3.9, 3.10]
- Eval Protocol version: [e.g. 0.1.0, or commit SHA if from source]
- How installed: [e.g. pip, from source]

**Additional context**
Add any other context about the problem here. For example, are you using it with a specific LLM provider, or in a particular environment?
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
---
name: Feature Request
about: Suggest an idea for this project
title: "[FEAT] Brief description of feature"
labels: enhancement
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
---
name: Pull Request
about: Propose changes to the codebase
title: "Brief description of changes"
labels: ''
assignees: ''

---

## Description

Please include a summary of the change and which issue is fixed or feature is implemented. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes # (issue)
Implements # (issue)

## Type of change

Please delete options that are not relevant.

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] This change requires a documentation update
- [ ] Refactoring/Code cleanup
- [ ] Build/CI/CD related changes
- [ ] Other (please describe):

## How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

- [ ] Test A
- [ ] Test B

**Test Configuration**:
* Firmware version:
* Hardware:
* Toolchain:
* SDK:

## Checklist:

- [ ] My code follows the style guidelines of this project (ran `black .`, `isort .`, `flake8 .`)
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [ ] Any dependent changes have been merged and published in downstream modules
- [ ] I have checked my code and corrected any misspellings

## Screenshots (if applicable)

If applicable, add screenshots to help showcase your changes.

## Additional context

Add any other context about the PR here.

.github/workflows/ci.yml

Lines changed: 190 additions & 0 deletions
@@ -0,0 +1,190 @@
name: Python CI

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

on:
  push:
    branches: [main]
    paths-ignore:
      - "docs/**"
      - "*.md"
  pull_request:
    branches: [main]
    paths-ignore:
      - "docs/**"
      - "*.md"
  workflow_dispatch:

jobs:
  lint-and-type-check:
    name: Lint & Type Check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Fetch all history for all tags and branches

      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install uv
        uses: astral-sh/setup-uv@v6
        with:
          enable-cache: true

      - name: Install the project
        run: uv sync --locked --all-extras --dev

      - name: Install tau2 for testing
        run: uv pip install git+https://github.com/sierra-research/tau2-bench.git@main

      - name: Lint with flake8
        run: uv run flake8 eval_protocol tests examples scripts --count --exit-zero --max-complexity=10 --max-line-length=88 --statistics

      - name: Type check with mypy
        run: uv run mypy eval_protocol

  test-core:
    name: Core Tests (Python ${{ matrix.python-version }})
    runs-on: ubuntu-latest
    needs: lint-and-type-check
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Fetch all history for all tags and branches

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install uv
        uses: astral-sh/setup-uv@v6
        with:
          enable-cache: true

      - name: Install the project
        run: uv sync --locked --all-extras --dev

      - name: Install tau2 for testing
        run: uv pip install git+https://github.com/sierra-research/tau2-bench.git@main

      - name: Run Core Tests with pytest-xdist
        env:
          E2B_API_KEY: ${{ secrets.E2B_API_KEY }}
          FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
          FIREWORKS_ACCOUNT_ID: ${{ secrets.FIREWORKS_ACCOUNT_ID }}
          PYTHONWARNINGS: "ignore::DeprecationWarning,ignore::RuntimeWarning"
        run: |
          # Run most tests in parallel, but explicitly ignore tests that manage their own servers or are slow
          uv run pytest \
            -n auto \
            --ignore=tests/test_batch_evaluation.py \
            --ignore=tests/pytest/test_frozen_lake.py \
            --ignore=tests/pytest/test_lunar_lander.py \
            --ignore=tests/pytest/test_tau_bench_airline.py \
            --cov=eval_protocol --cov-append --cov-report=xml --cov-report=term-missing -v --durations=10

      - name: Store coverage file
        uses: actions/upload-artifact@v4
        with:
          name: coverage-core-${{ matrix.python-version }}
          path: coverage.xml
          retention-days: 1

  test-batch-evaluation:
    name: Batch Evaluation Tests
    runs-on: ubuntu-latest
    needs: lint-and-type-check
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Fetch all history for all tags and branches

      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"

      - name: Install uv
        uses: astral-sh/setup-uv@v6
        with:
          enable-cache: true

      - name: Install the project
        run: uv sync --locked --all-extras --dev

      - name: Install tau2 for testing
        run: uv pip install git+https://github.com/sierra-research/tau2-bench.git@main

      - name: Run Batch Evaluation Tests
        env:
          E2B_API_KEY: ${{ secrets.E2B_API_KEY }}
          FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
          FIREWORKS_ACCOUNT_ID: ${{ secrets.FIREWORKS_ACCOUNT_ID }}
          PYTHONWARNINGS: "ignore::DeprecationWarning,ignore::RuntimeWarning"
        run: |
          # Run only this specific test file, WITHOUT xdist
          uv run pytest tests/test_batch_evaluation.py --cov=eval_protocol --cov-append --cov-report=xml -v --durations=10
      - name: Store coverage file
        uses: actions/upload-artifact@v4
        with:
          name: coverage-batch-eval
          path: coverage.xml
          retention-days: 1

  test-mcp-e2e:
    name: MCP End-to-End Tests
    runs-on: ubuntu-latest
    needs: lint-and-type-check
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Fetch all history for all tags and branches
      - name: Set up Python 3.12
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install uv
        uses: astral-sh/setup-uv@v6
        with:
          enable-cache: true

      - name: Install the project
        run: uv sync --locked --all-extras --dev

      - name: Install tau2 for testing
        run: uv pip install git+https://github.com/sierra-research/tau2-bench.git@main

      - name: Store coverage file
        uses: actions/upload-artifact@v4
        with:
          name: coverage-mcp-e2e
          path: coverage.xml
          retention-days: 1

  upload-coverage:
    name: Upload Coverage
    runs-on: ubuntu-latest
    needs: [test-core, test-batch-evaluation, test-mcp-e2e]
    steps:
      - name: Download all coverage artifacts
        uses: actions/download-artifact@v4
        with:
          path: coverage-artifacts
      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          directory: ./coverage-artifacts/
          fail_ci_if_error: false
          verbose: true
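
The `upload-coverage` job downloads every artifact into `coverage-artifacts`. With `actions/download-artifact@v4` and no `name` input, each artifact lands in its own subdirectory named after the artifact (for example `coverage-artifacts/coverage-core-3.12/coverage.xml`). The sketch below shows one way to inspect such a directory locally and report per-artifact line coverage. The helper names are hypothetical, and it assumes the reports are coverage.py XML files whose root element carries `lines-covered` and `lines-valid` attributes. As shown, the `test-mcp-e2e` job has no step that runs tests, so its `coverage.xml` may not exist; the sketch only lists reports that are actually present.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def collect_reports(root: Path) -> dict:
    """Map each artifact directory name to the coverage.xml inside it."""
    return {path.parent.name: path for path in sorted(root.glob("*/coverage.xml"))}


def line_totals(report: Path):
    """Read (lines_covered, lines_valid) from the root element of a coverage XML report."""
    attrs = ET.parse(report).getroot().attrib
    return int(attrs["lines-covered"]), int(attrs["lines-valid"])


def summarize(root: Path) -> dict:
    """Return the line-coverage fraction for every artifact found under root."""
    summary = {}
    for name, report in collect_reports(root).items():
        covered, valid = line_totals(report)
        # Guard against an empty report to avoid dividing by zero.
        summary[name] = covered / valid if valid else 0.0
    return summary
```

Calling `summarize(Path("coverage-artifacts"))` after downloading the artifacts gives a quick per-job view. It does not merge the reports; combining them is left to Codecov in this workflow.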
