Coding Environment Template

A coding environment for agent evaluations. Provides bash and file editing tools.

⚠️ This is a template. Before building, customize Dockerfile.hud for your project.

Quick Start

uv sync
uv run imagectl4.py coding-template -bvr  # Build, validate, and run

Getting Started

Local

These instructions are for running tasks locally. Telemetry from your runs will be uploaded to HUD's platform.

1. Clone and Initialize

git clone https://github.com/hud-evals/coding-template
cd coding-template
uv sync

2. Build, Validate, and Run

# Build the Docker image
uv run imagectl4.py coding-template -b

# Validate that your task branches and grading are correct
uv run imagectl4.py coding-template -v

# Run an agent against scenarios
uv run imagectl4.py coding-template -r

# Or combine all three
uv run imagectl4.py coding-template -bvr

Use --ids to target specific scenarios:

uv run imagectl4.py coding-template -bvr --ids my-task-1 my-task-2

Remote

These instructions are for running remote jobs. Remote jobs are available only to HUD enterprise customers; contact founders@hud.ai to learn more.

1. Deploy to Platform

Connect this repo to hud.ai:

  1. Push to GitHub (or clone this template repository)
  2. Go to hud.ai → Environments → New Environment
  3. Connect your GitHub repo
  4. (Optional) Set CODING_GITHUB_TOKEN build arg if using a private repo
  5. Your environment builds automatically on each push

Once deployed, your environment is accessible by its slug (e.g., my-org/coding).

Deploying requires the REPO_URL build argument:

hud deploy . --build-arg REPO_URL=https://github.com/your-org/your-repo

For private repos, add: --secret id=CODING_GITHUB_TOKEN,env=CODING_GITHUB_TOKEN
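
Putting both together for a private repository (the token value is a placeholder you supply from your own environment):

export CODING_GITHUB_TOKEN=<your-github-token>
hud deploy . \
  --build-arg REPO_URL=https://github.com/your-org/your-private-repo \
  --secret id=CODING_GITHUB_TOKEN,env=CODING_GITHUB_TOKEN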

2. Create a Taskset and Add Your First Task

  1. Go to Tasksets → New Taskset and create a new taskset.
  2. Go to your environment and enter sample-json-bug as the scenario. This corresponds to the task in tasks/basic.py.
  3. Click "Create Task" and add the task to your taskset.

3. Run Your First Task

  1. Go to your taskset. Under "Tasks", you should see your sample-json-bug task.
  2. Click "Run Taskset", select "Claude Sonnet 4.5", and set "Max Steps" to 20 and "Group Size" to 5.
  3. Click "Run [N] Tasks", which opens the page for the job you just launched.

Build Arguments

Argument      Default                                               Description
REPO_URL      https://github.com/hud-evals/coding-template-sample  Repository to clone (override for your repo)
FOLDER_NAME   project                                               Destination folder in container
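
To override these when building locally without imagectl4.py, a direct Docker build might look like this (a rough sketch; the image tag is arbitrary):

docker build -f Dockerfile.hud -t coding-template \
  --build-arg REPO_URL=https://github.com/your-org/your-repo \
  --build-arg FOLDER_NAME=my-project \
  .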

Key Concepts

3-Branch Pattern

Every task in tasks/*.py follows the 3-branch pattern, where each branch exists in the target repo:

Branch    Purpose
baseline  Starting state the agent sees, where the agent makes changes
test      Contains tests that grade the agent's solution
golden    Correct solution for validation and/or training

Git patches will automatically be generated for every branch defined in each task.
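
As a minimal sketch (using the branch names from the sample task below, and assuming your default branch is main), the layout in the target repo might look like:

git checkout -b server_fix_baseline main                # starting state the agent sees
git checkout -b server_fix_test server_fix_baseline     # add grading tests (e.g., test_server.py) here
git checkout -b server_fix_golden server_fix_baseline   # add the correct fix here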

If you're not using git-based problems, comment out the git setup section in Dockerfile.hud.

Tools (in env.py)

@env.tool()
async def bash(command: str) -> str:
    """Run a bash command."""

@env.tool()
async def editor(command: str, path: str, ...) -> str:
    """View, create, and edit files."""

Tasks (in tasks/*.py)

from env import env, setup_task, make_prompt
from grading import AgentPatchGrader, Grade, ValidateMode

@env.scenario("sample-json-bug")
async def sample_json_bug(hints_enabled: bool = False, validate_mode: ValidateMode | None = None):
    """Fix the JSON serialization bug in server.py."""
    
    setup_task(
        task_id="sample_json_bug",
        base="server_fix_baseline",
        test="server_fix_test",
        golden="server_fix_golden",
        validate_mode=validate_mode,
    )
    
    prompt = make_prompt("""Fix the JSON serialization bug in server.py.

The API server's responses are malformed. When you make a request to any endpoint,
the response body is not valid JSON - it looks like a Python dict representation
instead of proper JSON (e.g., single quotes instead of double quotes).
""")
    
    # Hand the prompt to the agent; the scenario resumes once the agent's run ends.
    _ = yield prompt
    
    grade = Grade.from_subscores([
        AgentPatchGrader.grade(
            weight=1.0,
            problem_id="sample_json_bug",
            test_files=["test_server.py"],
            validate_mode=validate_mode,
        )
    ])
    yield grade.score

The validate_mode parameter is used by imagectl4.py -v to test that your branches and grading are correctly configured. It is passed automatically during validation; you just need to thread it through to setup_task() and AgentPatchGrader.grade() as shown above.
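
The hints_enabled parameter is not used in the sample above; one hypothetical way to wire it up (purely illustrative, not part of the template) is to vary the prompt text:

    # Hypothetical sketch: append extra guidance only when hints are enabled.
    hint = "\nHint: check how the response body is serialized." if hints_enabled else ""
    prompt = make_prompt(f"Fix the JSON serialization bug in server.py.{hint}")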

Generate Task JSON

Use imagectl4.py -j to generate problem-metadata.json and remote_tasks.json. The latter is used by hud eval for remote evaluation:

uv run imagectl4.py my-image -j                # all scenarios
uv run imagectl4.py my-image -j --ids my-task  # specific scenarios only

Run Evaluations

Remote (Platform GUI)

Run evaluations at scale directly on hud.ai with parallel execution and automatic tracing. See the Remote section under Getting Started above for setup.

Remote (CLI)

Use hud eval with remote_tasks.json (generated by imagectl4.py -j) to run evaluations on the HUD platform:

hud eval ./remote_tasks.json --model gpt-4o --remote  # https://hud.ai/models
hud eval my-org/coding-tasks --model gpt-4o --remote --group 5

Local (CLI)

Use imagectl4.py to build, validate, and run scenarios locally. This is the recommended workflow for development and quick iteration:

# Build + validate + run all scenarios
uv run imagectl4.py my-image -bvr

# Target specific scenarios
uv run imagectl4.py my-image -bvr --ids my-task-1 my-task-2

# Run with a custom step limit
uv run imagectl4.py my-image -r --ids my-task --max-steps 30

Note: Local runs execute one task at a time in a single container. For parallel execution, use remote evaluation.

Local Development

This environment requires Docker. Use imagectl4.py for local development and quick iteration:

uv sync
# -b: build
# -v: validate
# -r: run
uv run imagectl4.py coding-template -bvr

Edit tasks in tasks/*.py or grading in grading/, then rebuild and re-validate with -bv.

When to rebuild: Any change to task definitions, grading logic, Dockerfile, system packages, or service configs.

Structure

coding-template/
├── env.py              # Tools + scenario registration
├── tools/              # bash, editor
├── grading/            # Grading logic and test runners
├── tasks/              # Problem definitions
├── local_test.py       # Dev testing
└── Dockerfile.hud      # Container config
