
Add benchmark-side Apptainer workspace support#509

Draft
neubig wants to merge 3 commits into main from docs/apptainer-benchmark-clarification

Conversation


@neubig neubig commented Mar 12, 2026

Summary

  • add benchmark-side --workspace apptainer support in the shared parser/models and the supported runners
  • introduce a reusable create_apptainer_workspace() helper for pre-built agent-server images, with configurable Apptainer runtime env vars
  • document Apptainer usage and limitations in the root and benchmark READMEs, plus add focused tests
  • clarify that Apptainer requires registry-pullable images built with --push, improve the error message for local-only builds, and reuse cached SIFs from APPTAINER_CACHE_DIR

Testing

  • uv run pre-commit run --files README.md benchmarks/utils/args_parser.py benchmarks/utils/models.py benchmarks/utils/image_utils.py benchmarks/gaia/run_infer.py benchmarks/commit0/run_infer.py benchmarks/multiswebench/run_infer.py benchmarks/swebench/run_infer.py benchmarks/swtbench/run_infer.py benchmarks/swebenchmultimodal/run_infer.py benchmarks/swefficiency/run_infer.py benchmarks/openagentsafety/run_infer.py benchmarks/swebench/README.md benchmarks/multiswebench/README.md benchmarks/swefficiency/README.md benchmarks/swebenchmultimodal/README.md tests/test_image_utils.py tests/test_workspace_types.py
  • uv run pytest tests/test_image_utils.py tests/test_workspace_types.py
  • uv run pre-commit run --files benchmarks/utils/image_utils.py benchmarks/swebench/README.md README.md tests/test_image_utils.py
  • uv run pytest tests/test_image_utils.py -q

Evidence

  • I attempted a minimal end-to-end benchmark run in this sandbox with a public dataset and a published agent image:
    • uv run swebench-infer .llm_config/example.json --dataset princeton-nlp/SWE-bench_Lite --split test --select <tmpfile containing astropy__astropy-12907> --workspace apptainer --num-workers 1 --max-iterations 1 --max-attempts 1
  • That run reached ApptainerWorkspace initialization and resolved a published image successfully:
    • ghcr.io/openhands/eval-agent-server:bde715c-sweb.eval.x86_64.astropy_1776_astropy-12907-source-minimal
  • The run then failed immediately in the current sandbox with:
    • [Errno 2] No such file or directory: 'apptainer'
  • Additional sandbox blockers remain:
    • apptainer is not installed
    • /dev/fuse is unavailable
    • /var/run/docker.sock is unavailable, so a local Docker fallback is not possible here either
  • There is still insufficient evidence to merge this PR, since it has not been run end-to-end.
  • End-to-end Apptainer validation remains blocked in this sandbox and is deferred to human QA.

Co-authored-by: openhands <openhands@all-hands.dev>

@all-hands-bot all-hands-bot left a comment


🟢 Good taste - Clear, honest documentation that solves a real problem.

This accurately reflects the current state: Apptainer is in the SDK but not wired into the benchmark CLI. The writing is pragmatic and gives users concrete paths forward on Docker-restricted systems. No bikeshedding, no pretending features exist that don't - just straightforward technical documentation.

Taste Rating: Elegant
Verdict: ✅ Ship it

Co-authored-by: openhands <openhands@all-hands.dev>
@neubig neubig changed the title Clarify Apptainer support in benchmark docs Add benchmark-side Apptainer workspace support Mar 13, 2026

neubig commented Mar 16, 2026

Following up: I tried the same validation path with a public benchmark dataset instead of GAIA.

Command attempted:

  • uv run swebench-infer .llm_config/example.json --dataset princeton-nlp/SWE-bench_Lite --split test --select <tmpfile containing astropy__astropy-12907> --workspace apptainer --num-workers 1 --max-iterations 1 --max-attempts 1

This is stronger evidence than the GAIA attempt because it used a public dataset and a published image:

  • ghcr.io/openhands/eval-agent-server:bde715c-sweb.eval.x86_64.astropy_1776_astropy-12907-source-minimal

The run reached ApptainerWorkspace initialization and then failed immediately with:

  • [Errno 2] No such file or directory: 'apptainer'

So at this point the blocker in this sandbox is no longer dataset access; it is the runtime environment itself. I still cannot complete Apptainer end-to-end validation here because:

  • apptainer is not installed
  • /dev/fuse is unavailable
  • /var/run/docker.sock is unavailable, so I cannot use a local Docker fallback here either
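The three blockers above are easy to check up front. A small preflight sketch (the prerequisite list reflects this sandbox's blockers, not anything in the PR itself, and the path parameters are illustrative):

```python
import os
import shutil


def apptainer_preflight(
    fuse_path: str = "/dev/fuse",
    docker_sock: str = "/var/run/docker.sock",
) -> list[str]:
    """Return human-readable blockers for Apptainer runs (or a Docker fallback)."""
    blockers = []
    if shutil.which("apptainer") is None:
        blockers.append("apptainer is not installed")
    if not os.path.exists(fuse_path):
        blockers.append(f"{fuse_path} is unavailable")
    if not os.path.exists(docker_sock):
        blockers.append(f"{docker_sock} is unavailable (no local Docker fallback)")
    return blockers
```

Running such a check before workspace initialization would turn the bare `[Errno 2] No such file or directory: 'apptainer'` into an actionable report.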

The PR remains a draft. There is still insufficient evidence to merge it, since it has not been run end-to-end; Apptainer validation remains blocked in this sandbox and is deferred to human QA.

@neubig neubig marked this pull request as draft March 16, 2026 14:36