Allow configurable automation timeouts#212
Conversation
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
|
🔍 Review in progress… We are performing the review through OpenHands Cloud Automation. You can log in and view the conversation here. |
|
🔍 Review in progress… We are performing the review through OpenHands Cloud Automation. You can log in and view the conversation here. |
|
@OpenHands /codereview-roasted |
|
I'm on it! malhotra5 can track my progress at all-hands.dev |
malhotra5
left a comment
There was a problem hiding this comment.
🟡 Acceptable — core logic is sound, but there's one silent behavioral break that must be called out, a few comments that should die, and a coverage gap on the only "interesting" defense path.
[CRITICAL ISSUES]
- [
tests/test_router.py/router.py] Breaking API change — undocumented: Previously, creating an automation without atimeoutstoredNULLand returnedtimeout: null. Now it returnstimeout: 600. Any client checkingresponse.timeout is Noneto detect "using system default" is silently broken. The PR description calls this out only as an implementation detail, not as a behavior change. The migration note needs an explicit callout: "Automations created without an explicit timeout will now store and return the configureddefault_run_duration(600 s) instead ofnull."
[IMPROVEMENT OPPORTUNITIES]
-
[
openhands/automation/dispatcher.py, line ~377] Comment describes non-local internals — delete it:# Use the same effective timeout as the bash command so the # watchdog archives/cleans up stale sandboxes at the user-selected # deadline. Legacy runs without a stored timeout fall back to the # configured default.Three lines explaining what
resolve_automation_timeout_secondsalready says on the tin. The sentence "Legacy runs without a stored timeout fall back to the configured default" describes the internals of the helper from the call site — it will drift when the helper changes. Kill it. -
[
openhands/automation/schemas.py/preset_router.py]build_automation_timeout_descriptionis frozen at import time:Field(description=build_automation_timeout_description(...))is evaluated when the class body executes (module import time), not per-request. IfAUTOMATION_DEFAULT_RUN_DURATIONorAUTOMATION_MAX_RUN_DURATIONare absent from the environment at first import, OpenAPI docs will show stale values for the lifetime of that process. Runtime behavior is unaffected (it's doc-only metadata), but operators expecting live config reflection in API docs will be surprised. Acknowledge this is intentional in a comment, or compute it lazily.
[TESTING GAPS]
- [
openhands/automation/utils/timeout.py,resolve_automation_timeout_seconds] No test for the defense-in-depth cap: The docstring explicitly calls outmin(..., max_run_duration)as "defense in depth" — meant to silently cap stored timeouts that exceed the current configured max (e.g., operator lowersAUTOMATION_MAX_RUN_DURATIONafter automations were created). There is zero test coverage for this scenario. A stored value of2400withmax_run_duration=1800should resolve to1800. Add a two-line unit test.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
The behavioral change (timeout: null→timeout: 600for automations created without an explicit timeout) is a public API change that existing clients may depend on. Everything else is clean refactoring or additive. The removal of theconstants.__getattr__shim is fine — no remaining callers were found in the codebase.
VERDICT:
✅ Worth merging — with the API behavior change documented in the PR description / migration notes.
KEY INSIGHT:
Storing the explicit default at creation time rather than resolving it lazily at dispatch is the right call — it makes per-automation timeout auditable and consistent — but it's a quiet API contract change that needs to be surfaced in the changelog.
Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:
- Add a
.agents/skills/custom-codereview-guide.mdfile to your branch (or edit it if one already exists) with the/codereviewtrigger and the context the reviewer is missing. See the customization docs for the required frontmatter format.- Re-request a review — the reviewer reads guidelines from the PR branch, so your changes take effect immediately.
Resolve with AI? Install the iterate skill in your agent and run
/iterateto automatically drive this PR through CI, review, and QA until it's merge-ready.Was this review helpful? React with 👍 or 👎 to give feedback.
This review was generated by an AI agent (OpenHands) on behalf of the user.
This comment was marked as duplicate.
This comment was marked as duplicate.
|
@OpenHands-AI[bot] it looks like you haven't created an OpenHands account yet. Please sign up at OpenHands Cloud and try again. |
Co-authored-by: openhands <openhands@all-hands.dev>
|
Addressed the review feedback in
Validation: This comment was created by an AI agent (OpenHands) on behalf of the user. |
|
@OpenHands Explain to me in concise story or stories: if I’m a user on my local machine (local backend), what do I need to know from this PR’s behavior? clone yourself agent-canvas and agent-sdk to see how automations are used from user perspective. Reply here on PR. |
|
I'm on it! enyst can track my progress at all-hands.dev |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
|
@enyst I cloned Story 1 — I run Agent Canvas locally. Story 2 — I create a prompt/plugin automation from Canvas or via the local API. Story 3 — My automation starts an SDK conversation locally. Story 4 — I already have local automations. For local operators, the knobs are This PR comment was generated by an AI agent (OpenHands) on behalf of the user. |
This comment was marked as duplicate.
This comment was marked as duplicate.
all-hands-bot
left a comment
There was a problem hiding this comment.
🟢 Taste Rating: Good taste — The timeout policy is now centralized, create paths consistently materialize the configured default, and dispatch/watchdog execution use the same resolved deadline. The previous API behavior change is explicitly documented and the defense-in-depth cap is covered.
Verification
uv run ruff check ...on the changed Python files passed.- Targeted local pytest run reached 142 passed / 30 skipped before Docker-backed DB fixtures errored because the Docker socket is unavailable in this sandbox; GitHub CI shows the unit, backend, frontend, and image checks green at the current head.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟡 MEDIUM
This changes the public API response for newly-created automations that omittimeout(null→ configured default, 600s by default) and affects runtime execution deadlines, so compatibility risk is not zero. The implementation preserves legacyNULLrows at runtime, validates user-provided values, caps stored outliers defensively, and adds focused tests. No dependency or security-sensitive changes were introduced.
VERDICT:
✅ Worth merging: Core logic is sound and the earlier review feedback has been addressed.
KEY INSIGHT:
Centralizing timeout defaulting/validation in one helper removes the old implicit “system maximum as default” coupling and makes persisted automation behavior auditable.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
all-hands-bot
left a comment
There was a problem hiding this comment.
✅ QA Report: PASS
Configurable automation timeouts work through the real local API and dispatch path: omitted/null values default correctly, custom values up to the configured max are accepted, over-max values are rejected, and a dispatched run used the custom 1200s timeout.
Does this PR achieve its stated goal?
Yes. The PR set out to make automation timeouts configurable with a 600s default and 1800s user max, store defaults on newly-created automations including presets, and use the resolved timeout at dispatch/runtime. I verified this by running the installed service locally and making real HTTP requests: baseline main returned timeout: null and rejected 1200, while the PR returned 600 by default, accepted 1200/1800, rejected 1801, honored custom env config (45/75), and passed timeout:1200 to the agent-server bash start call during dispatch.
| Phase | Result |
|---|---|
| Environment Setup | ✅ Dependencies installed with uv; local API servers started from installed package targets with SQLite/local auth/local file storage. |
| CI Status | unit-tests, backend, frontend, and image build succeeded; qa-changes was still in progress when checked. |
| Functional Verification | ✅ Standard create, prompt/plugin presets, explicit null, custom env config, and dispatch runtime timeout were exercised via HTTP. |
Functional Verification
Test 1: Baseline main vs PR standard automation timeout API
Step 1 — Establish baseline on main:
Started the baseline package from /tmp/automation-qa-main with SQLite/local auth, then posted real create requests to http://127.0.0.1:8011/api/automation/v1.
Observed excerpts:
baseline_standard_no_timeout status=201 -> "timeout": null
baseline_standard_1200 status=422 -> "timeout must not exceed 600 seconds"
baseline_standard_600 status=201 -> "timeout": 600
This shows the old behavior: omitted timeouts were stored/returned as null, and users could not configure above 600s.
Step 2 — Apply the PRs changes:
Started the PR package from commit 30db55a with the same local settings.
Step 3 — Re-run with the PR:
Posted equivalent requests to http://127.0.0.1:8012/api/automation/v1.
Observed excerpts:
pr_standard_no_timeout status=201 -> "timeout": 600
pr_standard_1200 status=201 -> "timeout": 1200
pr_standard_1800 status=201 -> "timeout": 1800
pr_standard_1801 status=422 -> "timeout must not exceed 1800 seconds (30 minutes)"
This confirms the default is now stored/returned as 600s and the configurable max is now 1800s.
Test 2: Prompt/plugin presets and explicit null
Step 1 — Establish baseline on main:
Posted a prompt preset without timeout to the baseline server.
Observed excerpt:
baseline_prompt_no_timeout status=201 -> "timeout": null
This confirms preset creation also used the old null default behavior.
Step 2 — Apply the PRs changes and re-run preset flows:
Posted prompt and plugin preset requests, plus explicit "timeout": null, to the PR server.
Observed excerpts:
pr_prompt_no_timeout status=201 -> "timeout": 600
pr_prompt_1801 status=422 -> "timeout must not exceed 1800 seconds (30 minutes)"
pr_plugin_no_timeout status=201 -> "timeout": 600
pr_plugin_1801 status=422 -> "timeout must not exceed 1800 seconds (30 minutes)"
pr_standard_null status=201 -> timeout=600
pr_prompt_null status=201 -> timeout=600
pr_plugin_null status=201 -> timeout=600
This confirms newly-created standard, prompt, and plugin automations store the configured default when timeout is omitted or explicitly null, and presets enforce the new max.
Test 3: Config-backed default/max values
Step 1 — Baseline expectation:
On main, the max was effectively fixed at 600s as shown by the 1200 rejection above.
Step 2 — Apply PR with custom config:
Started the PR API with AUTOMATION_DEFAULT_RUN_DURATION=45 and AUTOMATION_MAX_RUN_DURATION=75.
Step 3 — Exercise the API:
Posted standard create requests to http://127.0.0.1:8013/api/automation/v1.
Observed excerpts:
pr_custom_default status=201 -> "timeout": 45
pr_custom_at_max status=201 -> "timeout": 75
pr_custom_over_max status=422 -> "timeout must not exceed 75 seconds (1 minutes)"
This confirms the timeout policy is actually config-backed, not hard-coded to 600/1800 in the API path.
Test 4: Dispatch/runtime timeout propagation
Step 1 — Baseline limitation:
On main, a 1200 second automation could not be created (422), so a user could not dispatch a run above the old 600s cap.
Step 2 — Apply PR and dispatch through the service:
Started a temporary fake local agent-server on 127.0.0.1:8020, then started the PR API with AUTOMATION_DISPATCHER_INTERVAL_SECONDS=1. Created an automation with timeout: 1200 and triggered /dispatch.
Observed excerpts:
create status=201 id=b6c0d3ce-d448-4715-8783-96d7f2000b8f -> "timeout": 1200
dispatch status=201 -> run status initially PENDING
runs response -> "status": "RUNNING", "bash_command_id": "cmd-qa-1"
computed duration_seconds=1200
Fake agent-server log excerpt:
{"path":"/api/bash/start_bash_command","body":"{... \"timeout\":1200}"}This confirms the dispatcher/runtime path used the configured automation timeout for both the run deadline (timeout_at - started_at = 1200) and the agent-server bash command timeout.
Issues Found
None.
This review was generated by an AI agent (OpenHands) on behalf of the user.
|
Q: A: The queue design is a database-backed queue, not a separate broker.
Cloud reliability: fairly good for a DB-backed queue, but not exactly-once. Why it is reliable in cloud:
Where it is not fully reliable / not exactly-once:
So I’d describe cloud semantics as: durable scheduling + single-consumer claiming + bounded failure detection, but duplicate event ingress is possible and execution is one-attempt unless manually retriggered. User automation code should still be idempotent when responding to external events. Local reliability: same state machine, weaker concurrency guarantees by design.
Net: cloud has the right Postgres primitives for multi-worker queue claiming, but the current design is not a complete exactly-once message queue. Local is intentionally simpler: single-process SQLite queue, good for dev/laptop, not something I’d call HA or multi-consumer safe. If we want automations to be robust under “things fire dupes, networks lie, agent runs vanish,” the next hardening step would be explicit idempotency/dedup keys for webhook deliveries, run attempt counters with retry/backoff/dead-letter states, and maybe a lease/heartbeat model for claimed This PR comment was generated by an AI agent (OpenHands) on behalf of the user. |
There was a problem hiding this comment.
There are some great things that this PR does to move the code in the direction of flexibility and stability at scale. A few of the things I like:
-
This PR limits the blast radius of the increased cap by making it opt-in on a case-by-case basis:
default_run_duration=600stays the default for every automation, and a run only reaches the highermax_run_duration=1800ceiling when it explicitly sets a longertimeout. -
Single source of truth: collapsing the duplicated schema + dispatcher logic into
utils/timeout.pyremoves drift and gives one chokepoint for future changes. -
Consistent timeout handling across all creation paths: the prompt and plugin presets get the same validation and defaulting as the standard create endpoint, so behavior no longer diverges by entry point.
-
Defense-in-depth preserved —
resolve_automation_timeout_secondsstill appliesmin(timeout, max)at dispatch, and all three enforcement points (API validate, bash-kill, watchdog deadline) stay intact. -
Cleans up dead weight by removing the deprecated
constants.pyshim and routing everything through config/helpers.
There are a few outstanding issues I have concerns about, and I'm going to mark this as request changes until we've talked them through. In summary:
-
Concern 1 (blocker) — raising the cap makes automations more likely to silently suspend a user's own live session. At the per-user sandbox cap the platform pauses the oldest running sandbox rather than rejecting, and automation never checks capacity before firing — so a longer ceiling means more time at the cap and more evictions of the user's own work. The graceful-skip path that would have caught this is dead code (its
CONCURRENCY_LIMIT_REACHED429 producer was reverted in #14877, leaving eviction as the deliberate behavior). -
Concern 2 (minor) — at the 30-min ceiling a dead run can keep showing as
Runningfor ~20 min longer than before. The watchdog closes a run's record off the wall-clocktimeout_atrather than the liveness the platform already tracks (idle_time/ sandbox status), so raising the ceiling stretches the mislabeled-Runningwindow from ~11 to ~31 min. Reporting-only; the sandbox itself is reclaimed independently. -
Concern 3 — a single automation on a tight schedule can chew through the user's whole sandbox budget by itself, evicting their live work and burning LLM spend on runs that get killed mid-flight. Nothing stops an automation from overlapping itself: the scheduler fires on cron timing with no in-flight check, no overlap guard, and no minimum interval, so a run that outlives its schedule stacks concurrent runs (up to ~30 for an every-minute schedule). The higher ceiling makes the stacking deeper, and it feeds the same cap-eviction churn as Concern 1. It doesn't take very many of these to destabilize an entire installation.
|
A HUMAN TOOK THE TIME TO RESEARCH AND WRITE THIS AND THE OTHER COMMENTS HERE Concern 1. — As written, raising the cap makes automations more likely to silently suspend a user's own live session Raising the ceiling makes a scheduled automation more likely to silently suspend the user's own live interactive session mid-task. At the per-user sandbox cap ( WHAT I NOTICED The graceful-skip path is an orphan. The consequence: the run isn't skipped, it proceeds, and the platform evicts the user's oldest sandbox to make room. Automation runs under the owner's own identity, so a user's scheduled automations and their interactive sessions share one cap, and the eviction picks its victim purely by age — a long-open debugging session is exactly the oldest. So the next scheduled run silently suspends their live work mid-task, with nothing connecting the stall to the automation that displaced it. THE CONNECTION TO THIS PR Once users start building long-running conversations via automation, their runs hold sandbox slots far longer — so a user sits at or near the cap a much larger fraction of the time, which is exactly the state where the next sandbox request (another run, or their own new interactive session) triggers an eviction. Automation doesn't check any of this before firing. So the higher ceiling doesn't merely allow longer runs — it raises how often a user is in the state where their own automation pauses their own live work. WHAT NEEDS TO HAPPEN The SKIPPED status and its "Skipped – Limit Reached" badge already exist and are the right UX — they just have no trigger today. Give them one, depending on the product direction:
Either way the end state is the same — a run at the cap is visibly skipped, never a silent eviction — and that decision should land before we raise the reachable ceiling. |
|
Concern 2. (minor) — At the 30-min ceiling, a dead run can keep showing as Running in the UI for ~20 min longer than before Minor and not a blocker: the agent-server exposes idle_time, and runtime-api uses it to pause/reap idle or dead runtimes (with k8s liveness for acute pod failures), so the sandbox is reclaimed regardless of the automation watchdog. That said, the watchdog closes a run's record only once timeout_at (= started_at + run_timeout) passes, on a 60s tick. So when the completion callback is lost (pod crash, network blip), the run keeps showing Running in the activity log until its own deadline, even though the run is effectively dead and the pod may already be gone. Raising the opt-in ceiling from 10 to 30 min stretches that mislabeled-Running window from ~11 min to ~31 min. WHAT NEEDS TO HAPPEN Nothing for this PR. Later, have the watchdog reconcile against the liveness the platform already exposes (idle_time / sandbox status) instead of the wall-clock timeout_at, so the run record closes promptly regardless of the ceiling. No new heartbeat needed — the signal already exists. |
| Environment variables (AUTOMATION_ prefix): | ||
| AUTOMATION_MAX_RUN_DURATION: Max run time in seconds (default: 600) | ||
| AUTOMATION_DEFAULT_RUN_DURATION: Default run time in seconds (default: 600) | ||
| AUTOMATION_MAX_RUN_DURATION: Max user-configurable run time in seconds |
There was a problem hiding this comment.
Since AUTOMATION_MAX_RUN_DURATION is the operator-facing knob, could we document its coupling to the runtime idle window right here in the env-var docs? Something like a Note: block after this list:
Note:
AUTOMATION_MAX_RUN_DURATION (the user-configurable ceiling) is coupled to
runtime-api's RUNTIME_IDLE_SECONDS. At 1800 the two match, so any run that
finishes within the window stays inside the runtime's idle grace period and
is never paused for inactivity. Raising AUTOMATION_MAX_RUN_DURATION above the
runtime idle window (OH_RUNTIME_IDLE_TIMEOUT_SECONDS) is only safe if
RUNTIME_IDLE_SECONDS is raised in lockstep — otherwise a run that goes idle
past the window (a stall, a long event-less LLM turn, an infrequently-polled
background job) is paused mid-execution and fails silently. (Local Agent
Canvas runs have no runtime idle cleanup, so this applies only to SaaS /
OpenHands Enterprise.)
|
Concern 3. — Nothing prevents a single automation from overlapping itself, and the higher ceiling makes it worse The scheduler fires on cron timing alone — there's no check that the previous run has finished — so an automation whose run outlives its own interval stacks concurrent runs. This isn't new, but raising the cap from 10 to 30 minutes triples the maximum overlap for a given schedule — a WHAT I NOTICED
THE CONNECTION TO THIS PR The stack isn't literally unbounded in sandboxes — the per-user cap (10) plus WHAT NEEDS TO HAPPEN
|
|
ALTERNATIVES
|
|
@malhotra5 help me understand the user-facing scenario for configuring this. Is it:
|
|
@jpelletier1 yes the user can configure the timeout limit for an automation (with a hard cap) |

Summary
default_run_durationfor the 10-minute default andmax_run_durationfor the 30-minute user-configurable limit.timeoutis omitted/null, including standard create plus prompt/plugin presets.max_run_durationand simplify dispatcher/runtime timeout resolution to a shared helper, with legacyNULLtimeout fallback for existing rows.constants.py; runtime code now uses config/helpers directly.Behavior change / migration note
automations.timeoutcolumn frommigrations/versions/001_initial_schema.py.timeoutnow store and return the configureddefault_run_duration(600seconds by default) instead ofnull. Existing rows withtimeout = NULLare still supported at runtime and resolve to the configured default.Testing
uv run ruff format openhands/automation/dispatcher.py openhands/automation/utils/timeout.py tests/test_config.pyuv run ruff check openhands/automation/dispatcher.py openhands/automation/utils/timeout.py tests/test_config.pyuv run pytest tests/test_config.py tests/test_execution.py -q— 71 passedNote: Docker-backed DB tests were attempted earlier but cannot run in this environment because testcontainers/Postgres requires a Docker daemon and the Docker socket is unavailable.
This PR was updated by an AI agent (OpenHands) on behalf of the user.