
Add benchmark-side Apptainer workspace support#509

Draft
neubig wants to merge 3 commits into main from docs/apptainer-benchmark-clarification

Conversation


@neubig neubig commented Mar 12, 2026

Summary

  • add benchmark-side --workspace apptainer support in the shared parser/models and the supported runners
  • introduce a reusable create_apptainer_workspace() helper for pre-built agent-server images, with configurable Apptainer runtime env vars
  • document Apptainer usage and limitations in the root and benchmark READMEs, plus add focused tests
  • clarify that Apptainer requires registry-pullable images built with --push, improve the error message for local-only builds, and reuse cached SIFs from APPTAINER_CACHE_DIR

Testing

  • uv run pre-commit run --files README.md benchmarks/utils/args_parser.py benchmarks/utils/models.py benchmarks/utils/image_utils.py benchmarks/gaia/run_infer.py benchmarks/commit0/run_infer.py benchmarks/multiswebench/run_infer.py benchmarks/swebench/run_infer.py benchmarks/swtbench/run_infer.py benchmarks/swebenchmultimodal/run_infer.py benchmarks/swefficiency/run_infer.py benchmarks/openagentsafety/run_infer.py benchmarks/swebench/README.md benchmarks/multiswebench/README.md benchmarks/swefficiency/README.md benchmarks/swebenchmultimodal/README.md tests/test_image_utils.py tests/test_workspace_types.py
  • uv run pytest tests/test_image_utils.py tests/test_workspace_types.py
  • uv run pre-commit run --files benchmarks/utils/image_utils.py benchmarks/swebench/README.md README.md tests/test_image_utils.py
  • uv run pytest tests/test_image_utils.py -q

Evidence

  • I attempted a minimal end-to-end benchmark run in this sandbox with a public dataset and a published agent image:
    • uv run swebench-infer .llm_config/example.json --dataset princeton-nlp/SWE-bench_Lite --split test --select <tmpfile containing astropy__astropy-12907> --workspace apptainer --num-workers 1 --max-iterations 1 --max-attempts 1
  • That run reached ApptainerWorkspace initialization and resolved a published image successfully:
    • ghcr.io/openhands/eval-agent-server:bde715c-sweb.eval.x86_64.astropy_1776_astropy-12907-source-minimal
  • The run then failed immediately in the current sandbox with:
    • [Errno 2] No such file or directory: 'apptainer'
  • Additional sandbox blockers remain:
    • apptainer is not installed
    • /dev/fuse is unavailable
    • /var/run/docker.sock is unavailable, so a local Docker fallback is not possible here either
  • There is still insufficient evidence to merge this PR, since it has not been run end-to-end.
  • End-to-end Apptainer validation remains blocked in this sandbox and is deferred to human QA.

Co-authored-by: openhands <openhands@all-hands.dev>

@all-hands-bot all-hands-bot left a comment


🟢 Good taste - Clear, honest documentation that solves a real problem.

This accurately reflects the current state: Apptainer is in the SDK but not wired into the benchmark CLI. The writing is pragmatic and gives users concrete paths forward on Docker-restricted systems. No bikeshedding, no pretending features exist that don't - just straightforward technical documentation.

Taste Rating: Elegant
Verdict: ✅ Ship it

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig changed the title Clarify Apptainer support in benchmark docs Add benchmark-side Apptainer workspace support Mar 13, 2026

neubig commented Mar 16, 2026

Following up: I tried the same validation path with a public benchmark dataset instead of GAIA.

Command attempted:

  • uv run swebench-infer .llm_config/example.json --dataset princeton-nlp/SWE-bench_Lite --split test --select <tmpfile containing astropy__astropy-12907> --workspace apptainer --num-workers 1 --max-iterations 1 --max-attempts 1

This is stronger evidence than the GAIA attempt because it used a public dataset and a published image:

  • ghcr.io/openhands/eval-agent-server:bde715c-sweb.eval.x86_64.astropy_1776_astropy-12907-source-minimal

The run reached ApptainerWorkspace initialization and then failed immediately with:

  • [Errno 2] No such file or directory: 'apptainer'

So at this point the blocker in this sandbox is no longer dataset access; it is the runtime environment itself. I still cannot complete Apptainer end-to-end validation here because:

  • apptainer is not installed
  • /dev/fuse is unavailable
  • /var/run/docker.sock is unavailable, so I cannot use a local Docker fallback here either
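The three blockers above are easy to check up front. A small preflight sketch (the prerequisite list reflects this sandbox's blockers, not anything in the PR itself, and the path parameters are illustrative):

```python
import os
import shutil


def apptainer_preflight(
    fuse_path: str = "/dev/fuse",
    docker_sock: str = "/var/run/docker.sock",
) -> list[str]:
    """Return human-readable blockers for Apptainer runs (or a Docker fallback)."""
    blockers = []
    if shutil.which("apptainer") is None:
        blockers.append("apptainer is not installed")
    if not os.path.exists(fuse_path):
        blockers.append(f"{fuse_path} is unavailable")
    if not os.path.exists(docker_sock):
        blockers.append(f"{docker_sock} is unavailable (no local Docker fallback)")
    return blockers
```

Running such a check before workspace initialization would turn the bare `[Errno 2] No such file or directory: 'apptainer'` into an actionable report.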

The PR remains a draft. There is still insufficient evidence to merge it, since it has not been run end-to-end; Apptainer validation remains blocked in this sandbox and is deferred to human QA.

@neubig neubig marked this pull request as draft March 16, 2026 14:36