A coding environment for agent evaluations. Provides bash and file editing tools.
⚠️ This is a template. Before building, customize `Dockerfile.hud` for your project.
```bash
uv sync
uv run imagectl4.py coding-template -bvr   # Build, validate, and run
```

These instructions are for running tasks locally. Telemetry from your runs will be uploaded to HUD's platform.
1. Clone and Initialize
```bash
git clone https://github.com/hud-evals/coding-template
cd coding-template
uv sync
```

2. Build, Validate, and Run
```bash
# Build the Docker image
uv run imagectl4.py coding-template -b

# Validate that your task branches and grading are correct
uv run imagectl4.py coding-template -v

# Run an agent against scenarios
uv run imagectl4.py coding-template -r

# Or combine all three
uv run imagectl4.py coding-template -bvr
```

Use `--ids` to target specific scenarios:

```bash
uv run imagectl4.py coding-template -bvr --ids my-task-1 my-task-2
```

These instructions are for running remote jobs. You only have access to these if you are a HUD enterprise customer: contact founders@hud.ai to learn more.
1. Deploy to Platform
Connect this repo to hud.ai:
- Push to GitHub (or clone this template repository)
- Go to hud.ai → Environments → New Environment
- Connect your GitHub repo
- (Optional) Set the `CODING_GITHUB_TOKEN` build arg if using a private repo
- Your environment builds automatically on each push

Once deployed, your environment is accessible by its slug (e.g., `my-org/coding`).
Deploy requires `REPO_URL`:

```bash
hud deploy . --build-arg REPO_URL=https://github.com/your-org/your-repo
```

For private repos, add: `--secret id=CODING_GITHUB_TOKEN,env=CODING_GITHUB_TOKEN`
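Putting those together, a deploy for a private repo might look like the following sketch. It only combines the two flags documented above; the token value is a placeholder that you would export in your shell first:

```bash
# Sketch: private-repo deploy using the documented REPO_URL build arg and --secret flag.
export CODING_GITHUB_TOKEN=ghp_your_token_here   # placeholder token
hud deploy . \
  --build-arg REPO_URL=https://github.com/your-org/your-private-repo \
  --secret id=CODING_GITHUB_TOKEN,env=CODING_GITHUB_TOKEN
```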
2. Create a Taskset and Add Your First Task
- Go to Tasksets → New Taskset and create a new taskset
- Go to your environment and enter `sample-json-bug` as the scenario. This corresponds to the task in `tasks/basic.py`.
- Click on "Create Task" and add this to your taskset.
3. Run Your First Task
- Go to your taskset. Under "Tasks", you should be able to see your `sample-json-bug` task.
- Click on "Run Taskset". Click on "Claude Sonnet 4.5" and set the "Max Steps" to 20 and "Group Size" to 5.
- Click on "Run [N] Tasks", which should open the page for the job you've just launched.
| Argument | Default | Description |
|---|---|---|
| `REPO_URL` | `https://github.com/hud-evals/coding-template-sample` | Repository to clone (override for your repo) |
| `FOLDER_NAME` | `project` | Destination folder in container |
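For example, to clone your own repo into a folder other than the default `project`, you could override both arguments. The folder name below is illustrative, and this assumes `--build-arg` may be passed once per argument, as with `docker build`:

```bash
hud deploy . \
  --build-arg REPO_URL=https://github.com/your-org/your-repo \
  --build-arg FOLDER_NAME=my-project
```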
Every task in `tasks/*.py` follows the 3-branch pattern, where each branch exists in the target repo:
| Branch | Purpose |
|---|---|
| `baseline` | Starting state the agent sees, where the agent makes changes |
| `test` | Contains tests that grade the agent's solution |
| `golden` | Correct solution for validation and/or training |
Git patches will automatically be generated for every branch defined in each task.
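As an illustration, here is one plausible way to lay out the three branches for a single task. The branch and file names match the `sample-json-bug` scenario shown below; your repo's actual history and commit contents will differ:

```bash
# Sketch only: derive the test and golden branches from the baseline.
git checkout -b server_fix_baseline main
# ...introduce the bug the agent must fix, then:
git commit -am "Introduce JSON serialization bug"

git checkout -b server_fix_test server_fix_baseline
# ...add the grading tests, then:
git add test_server.py && git commit -m "Add grading tests"

git checkout -b server_fix_golden server_fix_baseline
# ...apply the correct fix, then:
git commit -am "Fix JSON serialization"
```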
If you're not using git-based problems, comment out the git setup section in `Dockerfile.hud`.
```python
@env.tool()
async def bash(command: str) -> str:
    """Run a bash command."""

@env.tool()
async def editor(command: str, path: str, ...) -> str:
    """View, create, and edit files."""
```

```python
from env import env, setup_task, make_prompt
from grading import AgentPatchGrader, Grade, ValidateMode

@env.scenario("sample-json-bug")
async def sample_json_bug(hints_enabled: bool = False, validate_mode: ValidateMode | None = None):
    """Fix the JSON serialization bug in server.py."""
    setup_task(
        task_id="sample_json_bug",
        base="server_fix_baseline",
        test="server_fix_test",
        golden="server_fix_golden",
        validate_mode=validate_mode,
    )

    prompt = make_prompt("""Fix the JSON serialization bug in server.py.
The API server's responses are malformed. When you make a request to any endpoint,
the response body is not valid JSON - it looks like a Python dict representation
instead of proper JSON (e.g., single quotes instead of double quotes).
""")

    _ = yield prompt

    grade = Grade.from_subscores([
        AgentPatchGrader.grade(
            weight=1.0,
            problem_id="sample_json_bug",
            test_files=["test_server.py"],
            validate_mode=validate_mode,
        )
    ])
    yield grade.score
```

The `validate_mode` parameter is used by `imagectl4.py -v` to test that your branches and grading are correctly configured. It is passed automatically during validation -- you just need to thread it through to `setup_task()` and `AgentPatchGrader.grade()` as shown above.
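For example, to validate only the sample task against its branches, combine the documented `-v` and `--ids` flags:

```bash
uv run imagectl4.py coding-template -v --ids sample-json-bug
```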
Use `imagectl4.py -j` to generate `problem-metadata.json` and `remote_tasks.json`. The latter is used by `hud eval` for remote evaluation:
```bash
uv run imagectl4.py my-image -j                 # all scenarios
uv run imagectl4.py my-image -j --ids my-task   # specific scenarios only
```

Run evaluations at scale directly on hud.ai with parallel execution and automatic tracing. See the Remote getting started section above for setup.

Use `hud eval` with `remote_tasks.json` (generated by `imagectl4.py -j`) to run evaluations on the HUD platform:

```bash
hud eval ./remote_tasks.json --model gpt-4o --remote   # https://hud.ai/models
hud eval my-org/coding-tasks --model gpt-4o --remote --group 5
```

Use `imagectl4.py` to build, validate, and run scenarios locally. This is the recommended workflow for development and quick iteration:
```bash
# Build + validate + run all scenarios
uv run imagectl4.py my-image -bvr

# Target specific scenarios
uv run imagectl4.py my-image -bvr --ids my-task-1 my-task-2

# Run with a custom step limit
uv run imagectl4.py my-image -r --ids my-task --max-steps 30
```

Note: Local runs execute one task at a time in a single container. For parallel execution, use remote evaluation.
This environment requires Docker. Use `imagectl4.py` for local development and quick iteration:
```bash
uv sync

# -b: build
# -v: validate
# -r: run
uv run imagectl4.py -bvr
```

Edit tasks in `tasks/*.py` or grading in `grading/`, then rebuild and re-validate with `-bv`.
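A typical iteration loop when changing a task might look like this (the scenario id `my-task` is a placeholder):

```bash
uv run imagectl4.py coding-template -bv --ids my-task   # rebuild + re-validate just this task
uv run imagectl4.py coding-template -r --ids my-task    # run an agent against it
```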
When to rebuild: Any change to task definitions, grading logic, Dockerfile, system packages, or service configs.
```
coding-template/
├── env.py           # Tools + scenario registration
├── tools/           # bash, editor
├── grading/         # Grading logic and test runners
├── tasks/           # Problem definitions
├── local_test.py    # Dev testing
└── Dockerfile.hud   # Container config
```
- Customization Guide — Creating tasks, configuring builds, writing custom graders
- CLI Reference — `imagectl4.py` flags and `hud` commands
- Full Documentation — Platform documentation