eval-protocol
diff --git a/‎.env.example‎
Lines changed: 24 additions & 0 deletions b/‎.env.example‎
diff --git a/‎.flake8‎
Lines changed: 3 additions & 0 deletions b/‎.flake8‎
diff --git a/‎.gitattributes‎
Lines changed: 1 addition & 0 deletions b/‎.gitattributes‎
diff --git a/‎.github/ISSUE_TEMPLATE/bug_report.md‎
Lines changed: 33 additions & 0 deletions b/‎.github/ISSUE_TEMPLATE/bug_report.md‎
diff --git a/‎.github/ISSUE_TEMPLATE/feature_request.md‎
Lines changed: 20 additions & 0 deletions b/‎.github/ISSUE_TEMPLATE/feature_request.md‎
diff --git a/‎.github/PULL_REQUEST_TEMPLATE.md‎
Lines changed: 60 additions & 0 deletions b/‎.github/PULL_REQUEST_TEMPLATE.md‎
diff --git a/‎.github/workflows/ci.yml‎
Lines changed: 190 additions & 0 deletions b/‎.github/workflows/ci.yml‎
@@ -0,0 +1,24 @@
+# Example .env file for eval protocol development
+# Copy this file to .env (in your project root) and fill in your actual values.
+# IMPORTANT: Add .env to your .gitignore file to avoid committing secrets!
+
+# Fireworks AI Credentials (for interacting with the Fireworks platform)
+# These are used by eval protocol to deploy evaluators to Fireworks, preview, etc.
+FIREWORKS_API_KEY="your_fireworks_api_key_here"
+FIREWORKS_ACCOUNT_ID="your_fireworks_account_id_here" # e.g., "fireworks" or your specific account
+
+# Optional: If targeting a non-production Fireworks API endpoint
+# FIREWORKS_API_BASE="https://dev.api.fireworks.ai"
+
+# GCP Configuration (for --target gcp-cloud-run if not using eval-protocol.yaml or CLI args)
+# Note: It's generally recommended to set these in eval-protocol.yaml or pass via CLI for specific projects.
+# However, you can set global defaults here if you frequently work with the same GCP setup.
+# GCP_PROJECT_ID="your_default_gcp_project_id"
+# GCP_REGION="your_default_gcp_region" # e.g., us-central1
+# GCP_AR_REPO="your_default_artifact_registry_repo_name" # e.g., eval-protocol-evaluators
+
+# E2B API Key (if working with E2B code execution features)
+# E2B_API_KEY="your_e2b_api_key_here"
+
+# Other environment variables your custom reward functions might need
+# MY_CUSTOM_SERVICE_API_KEY="some_other_key"
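Reviewer note on `.env.example`: the template treats `FIREWORKS_API_KEY` and `FIREWORKS_ACCOUNT_ID` as required and `FIREWORKS_API_BASE` as optional. A minimal sketch of how a consumer might validate those variables is shown below. The helper name `load_fireworks_settings` and the returned dict shape are hypothetical and are not taken from the eval protocol codebase; only the variable names come from the template.

```python
import os
from typing import Mapping, Optional


def load_fireworks_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the Fireworks settings named in .env.example (hypothetical helper)."""
    env = os.environ if env is None else env
    required = ("FIREWORKS_API_KEY", "FIREWORKS_ACCOUNT_ID")
    # Treat unset and empty values the same way: both count as missing.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    settings = {name: env[name] for name in required}
    # FIREWORKS_API_BASE is commented out in the template, so it is optional here;
    # None stands for "no override was configured".
    settings["FIREWORKS_API_BASE"] = env.get("FIREWORKS_API_BASE") or None
    return settings
```

Passing a plain dict instead of relying on `os.environ` keeps the check easy to exercise in tests without touching the real environment.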
@@ -0,0 +1,3 @@
+[flake8]
+max-line-length = 119
+ignore = E203, W503
@@ -0,0 +1 @@
+eval_protocol/_version.py export-subst
@@ -0,0 +1,33 @@
+---
+name: Bug Report
+about: Create a report to help us improve
+title: "[BUG] Brief description of bug"
+labels: bug
+assignees: ''
+
+---
+
+**Describe the bug**
+A clear and concise description of what the bug is.
+
+**To Reproduce**
+Steps to reproduce the behavior:
+1. Go to '...'
+2. Click on '....'
+3. Scroll down to '....'
+4. See error
+
+**Expected behavior**
+A clear and concise description of what you expected to happen.
+
+**Screenshots (if applicable)**
+If applicable, add screenshots to help explain your problem.
+
+**Environment (please complete the following information):**
+- OS: [e.g. macOS, Windows, Linux]
+- Python version: [e.g. 3.9, 3.10]
+- Eval Protocol version: [e.g. 0.1.0, or commit SHA if from source]
+- How installed: [e.g. pip, from source]
+
+**Additional context**
+Add any other context about the problem here. For example, are you using it with a specific LLM provider, or in a particular environment?
@@ -0,0 +1,20 @@
+---
+name: Feature Request
+about: Suggest an idea for this project
+title: "[FEAT] Brief description of feature"
+labels: enhancement
+assignees: ''
+
+---
+
+**Is your feature request related to a problem? Please describe.**
+A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+
+**Describe the solution you'd like**
+A clear and concise description of what you want to happen.
+
+**Describe alternatives you've considered**
+A clear and concise description of any alternative solutions or features you've considered.
+
+**Additional context**
+Add any other context or screenshots about the feature request here.
@@ -0,0 +1,60 @@
+---
+name: Pull Request
+about: Propose changes to the codebase
+title: "Brief description of changes"
+labels: ''
+assignees: ''
+
+---
+
+## Description
+
+Please include a summary of the change and which issue is fixed or feature is implemented. Please also include relevant motivation and context. List any dependencies that are required for this change.
+
+Fixes # (issue)
+Implements # (issue)
+
+## Type of change
+
+Please delete options that are not relevant.
+
+- [ ] Bug fix (non-breaking change which fixes an issue)
+- [ ] New feature (non-breaking change which adds functionality)
+- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
+- [ ] This change requires a documentation update
+- [ ] Refactoring/Code cleanup
+- [ ] Build/CI/CD related changes
+- [ ] Other (please describe):
+
+## How Has This Been Tested?
+
+Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.
+
+- [ ] Test A
+- [ ] Test B
+
+**Test Configuration**:
+*   OS:
+*   Python version:
+*   Eval Protocol version:
+*   How installed:
+
+## Checklist:
+
+- [ ] My code follows the style guidelines of this project (ran `black .`, `isort .`, `flake8 .`)
+- [ ] I have performed a self-review of my own code
+- [ ] I have commented my code, particularly in hard-to-understand areas
+- [ ] I have made corresponding changes to the documentation
+- [ ] My changes generate no new warnings
+- [ ] I have added tests that prove my fix is effective or that my feature works
+- [ ] New and existing unit tests pass locally with my changes
+- [ ] Any dependent changes have been merged and published in downstream modules
+- [ ] I have checked my code and corrected any misspellings
+
+## Screenshots (if applicable)
+
+If applicable, add screenshots to help showcase your changes.
+
+## Additional context
+
+Add any other context about the PR here.
@@ -0,0 +1,190 @@
+name: Python CI
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
+
+on:
+  push:
+    branches: [main]
+    paths-ignore:
+      - "docs/**"
+      - "*.md"
+  pull_request:
+    branches: [main]
+    paths-ignore:
+      - "docs/**"
+      - "*.md"
+  workflow_dispatch:
+
+jobs:
+  lint-and-type-check:
+    name: Lint & Type Check
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # Fetch all history for all tags and branches
+
+      - name: Set up Python 3.12
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v6
+        with:
+          enable-cache: true
+
+      - name: Install the project
+        run: uv sync --locked --all-extras --dev
+
+      - name: Install tau2 for testing
+        run: uv pip install git+https://github.com/sierra-research/tau2-bench.git@main
+
+      - name: Lint with flake8
+        run: uv run flake8 eval_protocol tests examples scripts --count --exit-zero --max-complexity=10 --statistics
+
+      - name: Type check with mypy
+        run: uv run mypy eval_protocol
+
+  test-core:
+    name: Core Tests (Python ${{ matrix.python-version }})
+    runs-on: ubuntu-latest
+    needs: lint-and-type-check
+    strategy:
+      fail-fast: false
+      matrix:
+        python-version: ["3.10", "3.11", "3.12"]
+
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # Fetch all history for all tags and branches
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v6
+        with:
+          enable-cache: true
+
+      - name: Install the project
+        run: uv sync --locked --all-extras --dev
+
+      - name: Install tau2 for testing
+        run: uv pip install git+https://github.com/sierra-research/tau2-bench.git@main
+
+      - name: Run Core Tests with pytest-xdist
+        env:
+          E2B_API_KEY: ${{ secrets.E2B_API_KEY }}
+          FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
+          FIREWORKS_ACCOUNT_ID: ${{ secrets.FIREWORKS_ACCOUNT_ID }}
+          PYTHONWARNINGS: "ignore::DeprecationWarning,ignore::RuntimeWarning"
+        run: |
+          # Run most tests in parallel, but explicitly ignore tests that manage their own servers or are slow
+          uv run pytest \
+            -n auto \
+            --ignore=tests/test_batch_evaluation.py \
+            --ignore=tests/pytest/test_frozen_lake.py \
+            --ignore=tests/pytest/test_lunar_lander.py \
+            --ignore=tests/pytest/test_tau_bench_airline.py \
+            --cov=eval_protocol --cov-append --cov-report=xml --cov-report=term-missing -v --durations=10
+
+      - name: Store coverage file
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-core-${{ matrix.python-version }}
+          path: coverage.xml
+          retention-days: 1
+
+  test-batch-evaluation:
+    name: Batch Evaluation Tests
+    runs-on: ubuntu-latest
+    needs: lint-and-type-check
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # Fetch all history for all tags and branches
+
+      - name: Set up Python 3.12
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+
+      - name: Install uv
+        uses: astral-sh/setup-uv@v6
+        with:
+          enable-cache: true
+
+      - name: Install the project
+        run: uv sync --locked --all-extras --dev
+
+      - name: Install tau2 for testing
+        run: uv pip install git+https://github.com/sierra-research/tau2-bench.git@main
+
+      - name: Run Batch Evaluation Tests
+        env:
+          E2B_API_KEY: ${{ secrets.E2B_API_KEY }}
+          FIREWORKS_API_KEY: ${{ secrets.FIREWORKS_API_KEY }}
+          FIREWORKS_ACCOUNT_ID: ${{ secrets.FIREWORKS_ACCOUNT_ID }}
+          PYTHONWARNINGS: "ignore::DeprecationWarning,ignore::RuntimeWarning"
+        run: |
+          # Run only this specific test file, WITHOUT xdist
+          uv run pytest tests/test_batch_evaluation.py --cov=eval_protocol --cov-append --cov-report=xml -v --durations=10
+      - name: Store coverage file
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-batch-eval
+          path: coverage.xml
+          retention-days: 1
+
+  test-mcp-e2e:
+    name: MCP End-to-End Tests
+    runs-on: ubuntu-latest
+    needs: lint-and-type-check
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # Fetch all history for all tags and branches
+      - name: Set up Python 3.12
+        uses: actions/setup-python@v5
+        with:
+          python-version: "3.12"
+      - name: Install uv
+        uses: astral-sh/setup-uv@v6
+        with:
+          enable-cache: true
+
+      - name: Install the project
+        run: uv sync --locked --all-extras --dev
+
+      - name: Install tau2 for testing
+        run: uv pip install git+https://github.com/sierra-research/tau2-bench.git@main
+
+      - name: Store coverage file
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-mcp-e2e
+          path: coverage.xml
+          retention-days: 1
+
+  upload-coverage:
+    name: Upload Coverage
+    runs-on: ubuntu-latest
+    needs: [test-core, test-batch-evaluation, test-mcp-e2e]
+    steps:
+      - name: Download all coverage artifacts
+        uses: actions/download-artifact@v4
+        with:
+          path: coverage-artifacts
+      - name: Upload coverage to Codecov
+        uses: codecov/codecov-action@v3
+        with:
+          token: ${{ secrets.CODECOV_TOKEN }}
+          directory: ./coverage-artifacts/
+          fail_ci_if_error: false
+          verbose: true
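Reviewer note on `upload-coverage`: the job downloads every artifact into `coverage-artifacts` and hands that directory to Codecov. A small sketch for inspecting the result locally is shown below. It assumes each artifact lands in its own subdirectory named after the artifact (for example `coverage-artifacts/coverage-batch-eval/coverage.xml`) and that the reports use the XML layout written by `--cov-report=xml`, whose root element carries a `line-rate` attribute. The function name `summarize_coverage` is hypothetical. Note that `test-mcp-e2e` has no test step in this workflow, so its `coverage.xml` may be absent; the sketch reports only the files it finds.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def summarize_coverage(artifacts_dir: str) -> dict:
    """Map each artifact directory name to the line rate in its coverage.xml."""
    rates = {}
    # Assumed layout: <artifacts_dir>/<artifact-name>/coverage.xml
    for report in sorted(Path(artifacts_dir).glob("*/coverage.xml")):
        root = ET.parse(report).getroot()
        # line-rate is a fraction between 0 and 1 on the root <coverage> element.
        rates[report.parent.name] = float(root.get("line-rate", "0"))
    return rates


if __name__ == "__main__":
    for name, rate in summarize_coverage("coverage-artifacts").items():
        print(f"{name}: {rate:.1%}")
```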