25 changes: 25 additions & 0 deletions eval_protocol/utils/evaluation_row_utils.py
@@ -9,6 +9,7 @@
from typing import List

from eval_protocol.models import EvaluationRow, Message
from eval_protocol.models import InputMetadata


def serialize_message(msg: Message) -> str:
@@ -134,3 +135,27 @@ def assistant_to_ground_truth(data: List[EvaluationRow]) -> List[EvaluationRow]:
)

return processed_rows


def create_rows_from_indices(count: int, **metadata) -> List[EvaluationRow]:
"""Create evaluation rows with sequential row_ids.

Useful for remote processors where the server determines content based on row_id.

Args:
count: Number of rows to create
**metadata: Additional metadata to include in each row

Returns:
List of EvaluationRows with row_id set to "0", "1", "2", ...
"""
rows = []
for idx in range(count):
row_metadata = {"row_id": str(idx), **metadata}
rows.append(
EvaluationRow(
messages=[],
input_metadata=InputMetadata(**row_metadata),
)
)
return rows
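
For reference, a minimal usage sketch of the helper added above:

```python
from eval_protocol.utils.evaluation_row_utils import create_rows_from_indices

# Three placeholder rows; a remote processor can resolve content by row_id.
rows = create_rows_from_indices(3)

for row in rows:
    print(row.input_metadata.row_id)  # "0", "1", "2"
```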
300 changes: 300 additions & 0 deletions examples/swebench/README.md
@@ -0,0 +1,300 @@
# SWE-bench Evaluation Example

This example shows how to evaluate LLMs on the SWE-bench software engineering benchmark using eval-protocol.

## Quick Start

### 1. Install Dependencies

```bash
# From the python-sdk repository root
cd python-sdk

# Install eval-protocol with swebench support
pip install -e ".[swebench]"
```

### 2. Set up mini-swe-agent

In this example, mini-swe-agent calls models through Fireworks, so it needs a Fireworks API key:

```bash
# Configure API key for mini-swe-agent
mini-extra config set FIREWORKS_API_KEY your_fireworks_api_key

# Verify it's set
mini-extra config get FIREWORKS_API_KEY
```

### 3. Install SWE-bench Harness

```bash
# Navigate to the swebench example directory
cd examples/swebench

# Clone and install SWE-bench
git clone https://github.com/princeton-nlp/SWE-bench
pip install -e SWE-bench
```

### 4. Set Environment Variables

```bash
export FIREWORKS_API_KEY="your_fireworks_api_key"
```

## Running the Evaluation

**IMPORTANT:** Always run both the server and tests from the `examples/swebench/` directory.

### Step 1: Start the Server

Open a terminal and run:

```bash
cd examples/swebench
python server.py
```

You should see:
```
INFO: Uvicorn running on http://127.0.0.1:3000 (Press CTRL+C to quit)
```

### Step 2: Configure Your Test

Edit `tests/test_swebench.py` to set your model and parameters:

```python
completion_params=[{
"model": "accounts/fireworks/models/your-model-name", # Edit this
"model_kwargs": {
"temperature": 0.2, # Optional
# "max_tokens": 2048, # Optional
# "reasoning": "high", # Optional
}
}],
max_concurrent_rollouts=3, # How many instances to run in parallel
```

To test different numbers of instances, edit line 26:
```python
def rows() -> List[EvaluationRow]:
return rows_from_indices(2) # Change 2 to desired number (max 500)
```

### Step 3: Run the Test

Open a second terminal:

```bash
cd examples/swebench
pytest tests/test_swebench.py -v -s
```

## What Happens During a Run

For each instance (row), the following happens (see the code sketch after this list):

1. **Server receives request** from pytest
2. **Wrapper script** (`run_swe_agent_fw.py`) is called with the instance index
3. **mini-swe-agent** runs in a Docker container for that specific repository
4. **Agent attempts to solve** the issue by editing code
5. **Patch is generated** and saved to `preds.json`
6. **SWE-bench harness** applies the patch and runs tests
7. **Results** are written to the row directory
8. **Test fetches results** and displays pass/fail in the UI
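
The real orchestration lives in `server.py`; as a rough sketch of the per-row flow above (the `--index` flag and paths here are illustrative, not the actual interface):

```python
import subprocess
from pathlib import Path

def run_instance(index: int) -> Path:
    """Illustrative per-row flow: run the wrapper for one instance and
    return the directory where its artifacts end up."""
    row_dir = Path(f"row_{index}")
    row_dir.mkdir(exist_ok=True)

    # Steps 2-5: the wrapper drives mini-swe-agent in Docker and writes preds.json
    subprocess.run(
        ["python", "run_swe_agent_fw.py", "--index", str(index)],  # illustrative flags
        check=True,
    )

    # Steps 6-7: the SWE-bench harness then applies the patch and writes
    # report.json under row_N/logs/run_evaluation/... (invocation omitted here)
    return row_dir
```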

## Understanding the Output

### Directory Structure

Each instance creates its own `row_N/` directory:

```
examples/swebench/
├── row_0/ # First instance
│ ├── preds.json # ← Model's generated patch
│ ├── astropy__astropy-12907/ # Instance-specific folder
│ │ └── astropy__astropy-12907.traj.json # Agent's execution trace
│ ├── logs/ # Harness execution logs
│ │ └── run_evaluation/
│ │ └── eval-run/
│ │ └── <safe_model_name>/
│ │ └── astropy__astropy-12907/
│ │ ├── report.json # ← Test results (pass/fail)
│ │ ├── test_output.txt # Test execution output
│ │ ├── patch.diff # Applied patch
│ │ └── eval.sh # Evaluation script
│ ├── agent_0.log # Agent console output
│ ├── exit_statuses_*.yaml # Exit status if failed
│ └── <model_name>.eval-run.json # Overall run summary
├── row_1/ # Second instance
│ └── ...
└── ...
```

### Key Files Explained

#### `preds.json` - Model Predictions
Location: `row_N/preds.json`

Contains the patch generated by the model:
```json
{
"astropy__astropy-12907": {
"model_name_or_path": "accounts/fireworks/models/...",
"instance_id": "astropy__astropy-12907",
"model_patch": "diff --git a/... (the actual patch)"
}
}
```

**If missing:** The agent failed before generating a patch (check `exit_statuses_*.yaml`).
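
To inspect a prediction programmatically, a small helper like this works (not part of the example code):

```python
import json
from pathlib import Path
from typing import Optional

def load_patch(row_dir: str) -> Optional[str]:
    """Return the model_patch from row_N/preds.json, or None if the agent
    never produced a prediction."""
    preds_path = Path(row_dir) / "preds.json"
    if not preds_path.exists():
        return None
    preds = json.loads(preds_path.read_text())
    # preds.json is keyed by instance_id; each row holds a single instance
    entry = next(iter(preds.values()))
    return entry.get("model_patch")

print(load_patch("row_0"))
```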

#### `report.json` - Test Results
Location: `row_N/logs/run_evaluation/eval-run/<model_name>/<instance_id>/report.json`

Contains pass/fail status after running tests:
```json
{
"astropy__astropy-12907": {
"patch_is_None": false,
"patch_exists": true,
"patch_successfully_applied": true,
"resolved": true, // ← Was the issue fixed?
"tests_status": {
"FAIL_TO_PASS": {"success": [...], "failure": []},
"PASS_TO_PASS": {"success": [...], "failure": []}
}
}
}
```

- `resolved: true` = Instance solved! All required tests pass.
- `resolved: false` = Instance not solved (tests still failing)

**If missing:** The agent didn't generate a patch, or the harness didn't run.
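
To pull the `resolved` flag out programmatically (a convenience sketch; the glob assumes the layout shown above):

```python
import json
from pathlib import Path

def is_resolved(row_dir: str) -> bool:
    """Return True if any report.json under the row reports resolved=True."""
    for report_path in Path(row_dir).glob("logs/run_evaluation/**/report.json"):
        report = json.loads(report_path.read_text())
        if any(entry.get("resolved") for entry in report.values()):
            return True
    return False

print(is_resolved("row_0"))
```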

#### `exit_statuses_*.yaml` - Why Runs Failed
Location: `row_N/exit_statuses_*.yaml`

```yaml
instances_by_exit_status:
Submitted: []
LimitsExceeded: ["astropy__astropy-12907"] # Hit step/cost limits
Error: []
```

Common statuses:
- `Submitted`: Completed normally
- `LimitsExceeded`: Agent hit max steps or cost limit
- `Error`: Unexpected error during execution
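
To tally exit statuses across all rows, a sketch along these lines works (assumes PyYAML is installed):

```python
from collections import Counter
from pathlib import Path

import yaml  # PyYAML

status_counts = Counter()
for status_file in Path(".").glob("row_*/exit_statuses_*.yaml"):
    data = yaml.safe_load(status_file.read_text()) or {}
    for status, instances in data.get("instances_by_exit_status", {}).items():
        status_counts[status] += len(instances)

print(status_counts)  # e.g. Counter({'Submitted': 4, 'LimitsExceeded': 1})
```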

#### `agent_N.log` - Agent Execution
Location: `row_N/agent_N.log`

Full console output from the agent run, including:
- Docker container startup
- Model API calls
- Commands executed
- Errors (if any)

#### `*.traj.json` - Agent Trajectory
Location: `row_N/<instance_id>/<instance_id>.traj.json`

Complete record of the agent's execution:
```json
{
"instance_id": "astropy__astropy-12907",
"info": {
"submission": "...", // The patch
"exit_status": "Submitted",
"model_stats": {
"instance_cost": 0.05,
"api_calls": 15
}
},
"messages": [...] // All agent messages
}
```
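
To aggregate cost and API-call counts across all trajectories (a sketch; field names taken from the example above):

```python
import json
from pathlib import Path

total_cost = 0.0
total_calls = 0
for traj_path in Path(".").glob("row_*/*/*.traj.json"):
    stats = json.loads(traj_path.read_text()).get("info", {}).get("model_stats", {})
    total_cost += stats.get("instance_cost", 0.0)
    total_calls += stats.get("api_calls", 0)

print(f"Total cost: ${total_cost:.2f} across {total_calls} API calls")
```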

## Viewing Results

### In the Terminal

The test output shows:
```
INFO:test_swebench:[Row 0] Found instance_id: astropy__astropy-12907
INFO:test_swebench:[Row 0] Report says resolved=True
INFO:test_swebench:[Row 0] Final: resolved=True, reason=harness_resolved=True
```

### In the Eval Protocol UI

If Elasticsearch is running, visit: `http://localhost:8000`
- View aggregate scores
- Inspect individual trajectories
- Filter by resolved/unresolved
- See cost and token usage

### Check Individual Files

```bash
# Check if instance was solved
cat row_0/logs/run_evaluation/eval-run/<model>/astropy__astropy-12907/report.json | jq '.["astropy__astropy-12907"].resolved'

# View the generated patch
cat row_0/preds.json | jq '.["astropy__astropy-12907"].model_patch'

# Check exit status
cat row_0/exit_statuses_*.yaml
```
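
To summarize pass/fail across every row in one pass, a short script along these lines works (it reuses the report layout shown earlier):

```python
import json
from pathlib import Path

resolved_count = 0
row_dirs = sorted(Path(".").glob("row_*"))
for row_dir in row_dirs:
    reports = row_dir.glob("logs/run_evaluation/**/report.json")
    ok = any(
        entry.get("resolved")
        for path in reports
        for entry in json.loads(path.read_text()).values()
    )
    resolved_count += ok
    print(f"{row_dir.name}: {'resolved' if ok else 'unresolved'}")

print(f"{resolved_count}/{len(row_dirs)} instances resolved")
```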

## Performance Notes

- **Small test (2 instances):** ~10-30 minutes
- **Full dataset (500 instances):** 24-48 hours on a 16-core machine
- **Concurrent runs:** 3-5 recommended, depending on available CPU and memory
- **Docker space:** ~100GB for all images (downloads happen automatically)

## Troubleshooting

### Docker container fails to start
```bash
# Check Docker is running
docker ps

# Check disk space
df -h
```

### Agent hits step limits
Instances that consistently hit limits may need:
- Higher step limit (edit mini-swe-agent config)
- Different prompting strategy
- More capable model

### Server not responding
```bash
# Check server is running
curl "http://127.0.0.1:3000/status?rollout_id=test"

# Check server logs for errors
# (shown in terminal where server.py is running)
```

## Next Steps

- Review results in `row_*/logs/.../report.json`
- Analyze failed instances to improve your model
- Run on larger subsets to get statistical significance
- Export results for further analysis

## Support

For issues:
- Check agent logs: `row_N/agent_N.log`
- Check exit statuses: `row_N/exit_statuses_*.yaml`
- Verify Docker has sufficient resources
- Ensure API key is valid and has credits
1 change: 1 addition & 0 deletions examples/swebench/SWE-bench
Submodule SWE-bench added at 5cd4be