Allow configurable automation timeouts #212

Open
malhotra5 wants to merge 6 commits into main from configurable-automation-timeouts
@malhotra5 malhotra5 commented Jun 24, 2026

Member

Summary

  • Add config-backed automation timeout policy using default_run_duration for the 10-minute default and max_run_duration for the 30-minute user-configurable limit.
  • Store the configured default timeout on newly-created automations when timeout is omitted/null, including standard create plus prompt/plugin presets.
  • Keep schema validation capped at max_run_duration and simplify dispatcher/runtime timeout resolution to a shared helper, with legacy NULL timeout fallback for existing rows.
  • Remove deprecated timeout/field maps from constants.py; runtime code now uses config/helpers directly.

Behavior change / migration note

  • No new database migration is needed: the automation-level timeout is stored in the existing automations.timeout column from migrations/versions/001_initial_schema.py.
  • API behavior change: Automations created without an explicit timeout now store and return the configured default_run_duration (600 seconds by default) instead of null. Existing rows with timeout = NULL are still supported at runtime and resolve to the configured default.
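The create-time normalization described above can be sketched as follows. This is a minimal illustration, not the PR's code: `timeout_to_store` and `timeout_stored_before` are hypothetical names, and only the 600-second default comes from the description.

```python
from typing import Optional

DEFAULT_RUN_DURATION = 600  # seconds; the default stated in the PR description


def timeout_to_store(requested: Optional[int], default: int = DEFAULT_RUN_DURATION) -> int:
    """Value persisted on a newly created automation (hypothetical helper).

    An omitted or null timeout is materialized as the configured default
    instead of being stored as NULL.
    """
    return default if requested is None else requested


def timeout_stored_before(requested: Optional[int]) -> Optional[int]:
    """Old behavior, for comparison: the request value was stored as-is."""
    return requested
```

A client that used `response.timeout is None` to detect "using the system default" would need to compare against the configured default instead.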

Testing

  • uv run ruff format openhands/automation/dispatcher.py openhands/automation/utils/timeout.py tests/test_config.py
  • uv run ruff check openhands/automation/dispatcher.py openhands/automation/utils/timeout.py tests/test_config.py
  • uv run pytest tests/test_config.py tests/test_execution.py -q — 71 passed
  • Direct timeout helper/schema validation for defaulting, max accepted, and max+1 rejected

Note: Docker-backed DB tests were attempted earlier but cannot run in this environment because testcontainers/Postgres requires a Docker daemon and the Docker socket is unavailable.

This PR was updated by an AI agent (OpenHands) on behalf of the user.

Co-authored-by: openhands <openhands@all-hands.dev>
@github-actions

github-actions Bot commented Jun 24, 2026


Coverage

Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
Co-authored-by: openhands <openhands@all-hands.dev>
@malhotra5 malhotra5 requested a review from all-hands-bot June 24, 2026 22:36

Contributor

🔍 Review in progress…

We are performing the review through OpenHands Cloud Automation. You can log in and view the conversation here.

@malhotra5

Member Author

@OpenHands /codereview-roasted

@openhands-ai

openhands-ai Bot commented Jun 25, 2026


I'm on it! malhotra5 can track my progress at all-hands.dev

@malhotra5 malhotra5 left a comment

Member Author

🟡 Acceptable — core logic is sound, but there's one silent behavioral break that must be called out, a few comments that should die, and a coverage gap on the only "interesting" defense path.


[CRITICAL ISSUES]

  • [tests/test_router.py / router.py] Breaking API change — undocumented: Previously, creating an automation without a timeout stored NULL and returned timeout: null. Now it returns timeout: 600. Any client checking response.timeout is None to detect "using system default" is silently broken. The PR description calls this out only as an implementation detail, not as a behavior change. The migration note needs an explicit callout: "Automations created without an explicit timeout will now store and return the configured default_run_duration (600 s) instead of null."

[IMPROVEMENT OPPORTUNITIES]

  • [openhands/automation/dispatcher.py, line ~377] Comment describes non-local internals — delete it:

    # Use the same effective timeout as the bash command so the
    # watchdog archives/cleans up stale sandboxes at the user-selected
    # deadline. Legacy runs without a stored timeout fall back to the
    # configured default.
    

    Four lines explaining what resolve_automation_timeout_seconds already says on the tin. The sentence "Legacy runs without a stored timeout fall back to the configured default" describes the internals of the helper from the call site; it will drift when the helper changes. Kill it.

  • [openhands/automation/schemas.py / preset_router.py] build_automation_timeout_description is frozen at import time: Field(description=build_automation_timeout_description(...)) is evaluated when the class body executes (module import time), not per-request. If AUTOMATION_DEFAULT_RUN_DURATION or AUTOMATION_MAX_RUN_DURATION are absent from the environment at first import, OpenAPI docs will show stale values for the lifetime of that process. Runtime behavior is unaffected (it's doc-only metadata), but operators expecting live config reflection in API docs will be surprised. Acknowledge this is intentional in a comment, or compute it lazily.
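The import-time freeze can be reproduced with the standard library alone. In the sketch below, `build_timeout_description` and `CreateAutomationRequest` are hypothetical stand-ins for the PR's helper and schema class; a Pydantic `Field(description=...)` inside a class body would be evaluated at the same moment.

```python
import os


def build_timeout_description() -> str:
    # Hypothetical stand-in for build_automation_timeout_description(...).
    default = os.environ.get("AUTOMATION_DEFAULT_RUN_DURATION", "600")
    return f"Run timeout in seconds (default {default})"


os.environ["AUTOMATION_DEFAULT_RUN_DURATION"] = "600"


class CreateAutomationRequest:
    # Evaluated once, while the class body executes (module import time).
    timeout_description = build_timeout_description()


# The operator changes the setting after the module has been imported.
os.environ["AUTOMATION_DEFAULT_RUN_DURATION"] = "45"

frozen = CreateAutomationRequest.timeout_description  # still mentions 600
live = build_timeout_description()                    # mentions 45
```

Only the description string is frozen; code that calls the helper per request keeps seeing the current value.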


[TESTING GAPS]

  • [openhands/automation/utils/timeout.py, resolve_automation_timeout_seconds] No test for the defense-in-depth cap: The docstring explicitly calls out min(..., max_run_duration) as "defense in depth" — meant to silently cap stored timeouts that exceed the current configured max (e.g., operator lowers AUTOMATION_MAX_RUN_DURATION after automations were created). There is zero test coverage for this scenario. A stored value of 2400 with max_run_duration=1800 should resolve to 1800. Add a two-line unit test.
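A sketch of the requested test, assuming the helper behaves as the review describes. The signature of the real `resolve_automation_timeout_seconds` is not shown in this thread, so `resolve_timeout_seconds` below is a hypothetical equivalent that takes the config values as arguments.

```python
from typing import Optional


def resolve_timeout_seconds(
    stored: Optional[int], default: int, max_run_duration: int
) -> int:
    """Sketch of the resolution described in the review, not the PR's code."""
    # Legacy rows with timeout = NULL fall back to the configured default.
    effective = default if stored is None else stored
    # Defense in depth: never exceed the currently configured maximum.
    return min(effective, max_run_duration)


def test_stored_timeout_above_max_is_capped() -> None:
    assert resolve_timeout_seconds(2400, default=600, max_run_duration=1800) == 1800


def test_legacy_null_timeout_uses_default() -> None:
    assert resolve_timeout_seconds(None, default=600, max_run_duration=1800) == 600
```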

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM
    The behavioral change (timeout: null → timeout: 600 for automations created without an explicit timeout) is a public API change that existing clients may depend on. Everything else is clean refactoring or additive. The removal of the constants.__getattr__ shim is fine; no remaining callers were found in the codebase.

VERDICT:
Worth merging — with the API behavior change documented in the PR description / migration notes.

KEY INSIGHT:
Storing the explicit default at creation time rather than resolving it lazily at dispatch is the right call — it makes per-automation timeout auditable and consistent — but it's a quiet API contract change that needs to be surfaced in the changelog.


Improve this review? If any feedback above seems incorrect or irrelevant to this repository, you can teach the reviewer to do better:

  1. Add a .agents/skills/custom-codereview-guide.md file to your branch (or edit it if one already exists) with the /codereview trigger and the context the reviewer is missing. See the customization docs for the required frontmatter format.
  2. Re-request a review — the reviewer reads guidelines from the PR branch, so your changes take effect immediately.

Resolve with AI? Install the iterate skill in your agent and run /iterate to automatically drive this PR through CI, review, and QA until it's merge-ready.

Was this review helpful? React with 👍 or 👎 to give feedback.

This review was generated by an AI agent (OpenHands) on behalf of the user.

@openhands-ai

openhands-ai Bot commented Jun 25, 2026


@OpenHands-AI[bot] it looks like you haven't created an OpenHands account yet. Please sign up at OpenHands Cloud and try again.

Co-authored-by: openhands <openhands@all-hands.dev>

Member Author

Addressed the review feedback in 552bc9b:

  • Added the explicit API behavior-change callout to the PR description: omitted timeout now stores/returns default_run_duration (600 seconds by default) instead of null.
  • Removed the redundant dispatcher comment near resolve_automation_timeout_seconds(...).
  • Documented that timeout field descriptions are import-time OpenAPI metadata, while runtime validation/dispatch keeps reading current config.
  • Added coverage for the defense-in-depth cap where a stored timeout above max_run_duration resolves to the configured max.

Validation: ruff format, ruff check, and uv run pytest tests/test_config.py tests/test_execution.py -q all passed.

This comment was created by an AI agent (OpenHands) on behalf of the user.

@malhotra5 malhotra5 marked this pull request as ready for review June 25, 2026 01:07
@enyst

enyst commented Jun 25, 2026

Member

@OpenHands Explain to me in concise story or stories: if I’m a user on my local machine (local backend), what do I need to know from this PR’s behavior? clone yourself agent-canvas and agent-sdk to see how automations are used from user perspective.

Reply here on PR.

@enyst enyst removed the review-this label Jun 25, 2026
@openhands-ai

openhands-ai Bot commented Jun 25, 2026


I'm on it! enyst can track my progress at all-hands.dev

all-hands-bot commented Jun 25, 2026

Contributor

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

@enyst enyst added the qa-this Trigger OpenHands QA validation label Jun 25, 2026

enyst commented Jun 25, 2026

Member

@enyst I cloned OpenHands/agent-canvas and OpenHands/software-agent-sdk and read the local automation path. User-facing version:

Story 1 — I run Agent Canvas locally.
npm run dev starts a local agent-server plus the automation sidecar behind /api/automation, with the automation DB in local SQLite. This PR does not require a migration or a new local setup step. The thing you’ll notice is API shape: if a new automation is created without a timeout, GET/list now shows timeout: 600 instead of timeout: null.

Story 2 — I create a prompt/plugin automation from Canvas or via the local API.
Current Canvas mostly launches automation creation through chat/skills rather than a dedicated timeout field. If the agent/API request does not specify a timeout, the run gets the service default: 10 minutes. If you need longer, include "timeout": <seconds> in the automation request; by default the max accepted value is 1800 seconds / 30 minutes, and larger values are rejected.

Story 3 — My automation starts an SDK conversation locally.
The generated preset script detects local mode via AGENT_SERVER_URL and uses SDK RemoteWorkspace against your local agent-server, so it still uses your local settings, secrets, MCP config, skills, and workspace. The timeout is the outer automation-run budget: setup, SDK install, repo clone, skill loading, conversation.run(), and callback all have to finish inside it. The same effective timeout drives the local bash command and the watchdog deadline; if the run is still active past that, it is marked failed/timed out.

Story 4 — I already have local automations.
Existing SQLite rows with timeout = NULL keep working; at runtime they resolve to the configured default. Only newly-created omitted/null timeouts are normalized into a concrete stored default.

For local operators, the knobs are AUTOMATION_DEFAULT_RUN_DURATION and AUTOMATION_MAX_RUN_DURATION when starting the automation backend; Agent Canvas does not set them, so the defaults are 600s and 1800s.
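How the two knobs might be read can be sketched as below. The variable names and the 600/1800 defaults come from this thread; `TimeoutPolicy` and `load_timeout_policy` are hypothetical, and the service's actual settings class may be structured differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TimeoutPolicy:
    default_run_duration: int  # seconds
    max_run_duration: int      # seconds


def load_timeout_policy(env: Mapping[str, str] = os.environ) -> TimeoutPolicy:
    """Read the two settings, falling back to the defaults named in the PR."""
    default = int(env.get("AUTOMATION_DEFAULT_RUN_DURATION", "600"))
    maximum = int(env.get("AUTOMATION_MAX_RUN_DURATION", "1800"))
    return TimeoutPolicy(default_run_duration=default, max_run_duration=maximum)
```

With neither variable set, as in the Agent Canvas case, this yields 600 and 1800.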

This PR comment was generated by an AI agent (OpenHands) on behalf of the user.

@all-hands-bot all-hands-bot left a comment

Contributor

🟢 Taste Rating: Good taste — The timeout policy is now centralized, create paths consistently materialize the configured default, and dispatch/watchdog execution use the same resolved deadline. The previous API behavior change is explicitly documented and the defense-in-depth cap is covered.

Verification

  • uv run ruff check ... on the changed Python files passed.
  • Targeted local pytest run reached 142 passed / 30 skipped before Docker-backed DB fixtures errored because the Docker socket is unavailable in this sandbox; GitHub CI shows the unit, backend, frontend, and image checks green at the current head.

[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟡 MEDIUM
    This changes the public API response for newly-created automations that omit timeout (null → configured default, 600s by default) and affects runtime execution deadlines, so compatibility risk is not zero. The implementation preserves legacy NULL rows at runtime, validates user-provided values, caps stored outliers defensively, and adds focused tests. No dependency or security-sensitive changes were introduced.

VERDICT:
Worth merging: Core logic is sound and the earlier review feedback has been addressed.

KEY INSIGHT:
Centralizing timeout defaulting/validation in one helper removes the old implicit “system maximum as default” coupling and makes persisted automation behavior auditable.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

@all-hands-bot all-hands-bot left a comment

Contributor

✅ QA Report: PASS

Configurable automation timeouts work through the real local API and dispatch path: omitted/null values default correctly, custom values up to the configured max are accepted, over-max values are rejected, and a dispatched run used the custom 1200s timeout.

Does this PR achieve its stated goal?

Yes. The PR set out to make automation timeouts configurable with a 600s default and 1800s user max, store defaults on newly-created automations including presets, and use the resolved timeout at dispatch/runtime. I verified this by running the installed service locally and making real HTTP requests: baseline main returned timeout: null and rejected 1200, while the PR returned 600 by default, accepted 1200/1800, rejected 1801, honored custom env config (45/75), and passed timeout:1200 to the agent-server bash start call during dispatch.

Phase | Result
Environment Setup | ✅ Dependencies installed with uv; local API servers started from installed package targets with SQLite/local auth/local file storage.
CI Status | ⚠️ unit-tests, backend, frontend, and image build succeeded; qa-changes was still in progress when checked.
Functional Verification | ✅ Standard create, prompt/plugin presets, explicit null, custom env config, and dispatch runtime timeout were exercised via HTTP.
Functional Verification

Test 1: Baseline main vs PR standard automation timeout API

Step 1 — Establish baseline on main:
Started the baseline package from /tmp/automation-qa-main with SQLite/local auth, then posted real create requests to http://127.0.0.1:8011/api/automation/v1.

Observed excerpts:

baseline_standard_no_timeout status=201 -> "timeout": null
baseline_standard_1200 status=422 -> "timeout must not exceed 600 seconds"
baseline_standard_600 status=201 -> "timeout": 600

This shows the old behavior: omitted timeouts were stored/returned as null, and users could not configure above 600s.

Step 2 — Apply the PR's changes:
Started the PR package from commit 30db55a with the same local settings.

Step 3 — Re-run with the PR:
Posted equivalent requests to http://127.0.0.1:8012/api/automation/v1.

Observed excerpts:

pr_standard_no_timeout status=201 -> "timeout": 600
pr_standard_1200 status=201 -> "timeout": 1200
pr_standard_1800 status=201 -> "timeout": 1800
pr_standard_1801 status=422 -> "timeout must not exceed 1800 seconds (30 minutes)"

This confirms the default is now stored/returned as 600s and the configurable max is now 1800s.

Test 2: Prompt/plugin presets and explicit null

Step 1 — Establish baseline on main:
Posted a prompt preset without timeout to the baseline server.

Observed excerpt:

baseline_prompt_no_timeout status=201 -> "timeout": null

This confirms preset creation also used the old null default behavior.

Step 2 — Apply the PR's changes and re-run preset flows:
Posted prompt and plugin preset requests, plus explicit "timeout": null, to the PR server.

Observed excerpts:

pr_prompt_no_timeout status=201 -> "timeout": 600
pr_prompt_1801 status=422 -> "timeout must not exceed 1800 seconds (30 minutes)"
pr_plugin_no_timeout status=201 -> "timeout": 600
pr_plugin_1801 status=422 -> "timeout must not exceed 1800 seconds (30 minutes)"
pr_standard_null status=201 -> timeout=600
pr_prompt_null status=201 -> timeout=600
pr_plugin_null status=201 -> timeout=600

This confirms newly-created standard, prompt, and plugin automations store the configured default when timeout is omitted or explicitly null, and presets enforce the new max.

Test 3: Config-backed default/max values

Step 1 — Baseline expectation:
On main, the max was effectively fixed at 600s as shown by the 1200 rejection above.

Step 2 — Apply PR with custom config:
Started the PR API with AUTOMATION_DEFAULT_RUN_DURATION=45 and AUTOMATION_MAX_RUN_DURATION=75.

Step 3 — Exercise the API:
Posted standard create requests to http://127.0.0.1:8013/api/automation/v1.

Observed excerpts:

pr_custom_default status=201 -> "timeout": 45
pr_custom_at_max status=201 -> "timeout": 75
pr_custom_over_max status=422 -> "timeout must not exceed 75 seconds (1 minutes)"

This confirms the timeout policy is actually config-backed, not hard-coded to 600/1800 in the API path.
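A validator consistent with the observed 422 messages might look like the sketch below. The integer division by 60 is an inference from the "(1 minutes)" output for a 75-second max, not something read from the PR's code; `validate_timeout` and `rejection_message` are hypothetical names.

```python
from typing import Optional


def validate_timeout(timeout: int, max_run_duration: int) -> int:
    """Reject values above the configured max with the message seen in QA."""
    if timeout > max_run_duration:
        minutes = max_run_duration // 60  # assumed: floor division, no pluralization
        raise ValueError(
            f"timeout must not exceed {max_run_duration} seconds ({minutes} minutes)"
        )
    return timeout


def rejection_message(timeout: int, max_run_duration: int) -> Optional[str]:
    """Return the validation error text, or None when the value is accepted."""
    try:
        validate_timeout(timeout, max_run_duration)
    except ValueError as exc:
        return str(exc)
    return None
```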

Test 4: Dispatch/runtime timeout propagation

Step 1 — Baseline limitation:
On main, a 1200 second automation could not be created (422), so a user could not dispatch a run above the old 600s cap.

Step 2 — Apply PR and dispatch through the service:
Started a temporary fake local agent-server on 127.0.0.1:8020, then started the PR API with AUTOMATION_DISPATCHER_INTERVAL_SECONDS=1. Created an automation with timeout: 1200 and triggered /dispatch.

Observed excerpts:

create status=201 id=b6c0d3ce-d448-4715-8783-96d7f2000b8f -> "timeout": 1200
dispatch status=201 -> run status initially PENDING
runs response -> "status": "RUNNING", "bash_command_id": "cmd-qa-1"
computed duration_seconds=1200

Fake agent-server log excerpt:

{"path":"/api/bash/start_bash_command","body":"{... \"timeout\":1200}"}

This confirms the dispatcher/runtime path used the configured automation timeout for both the run deadline (timeout_at - started_at = 1200) and the agent-server bash command timeout.
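The deadline arithmetic behind `computed duration_seconds=1200` can be sketched as follows. `compute_timeout_at` is a hypothetical name and the start timestamp is arbitrary; only the relationship timeout_at = started_at + timeout comes from the report.

```python
from datetime import datetime, timedelta, timezone


def compute_timeout_at(started_at: datetime, timeout_seconds: int) -> datetime:
    """Watchdog deadline: the run's start time plus its resolved timeout."""
    return started_at + timedelta(seconds=timeout_seconds)


started_at = datetime(2026, 6, 25, 12, 0, 0, tzinfo=timezone.utc)  # arbitrary
timeout_at = compute_timeout_at(started_at, 1200)
duration_seconds = int((timeout_at - started_at).total_seconds())
```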

Issues Found

None.

This review was generated by an AI agent (OpenHands) on behalf of the user.

@enyst

enyst commented Jun 25, 2026

Member

😅
image

@enyst enyst left a comment

Member

Thank you!

enyst commented Jun 25, 2026

Member

Q:
You said SQLite. Check the cloud backend for automation implementation, it should have Postgresql I think? Here’s the thing: in automations, what matters a lot, maybe most, imo, is a correct message queueing design. Because things can fire dupes, network issues, etc; or can fail agent runs without notice; etc. How is message queueing designed and how reliable is it, and why, in cloud design and in local design?

A:
Yes — correction/clarification: cloud is PostgreSQL/Cloud SQL by default. SQLite is the local Agent Canvas / self-hosted local-mode path. I checked both the automation service code and deploy config: the cloud/default DB path is Postgres (postgresql+asyncpg, optional GCP Cloud SQL connector); Agent Canvas dev-with-automation explicitly starts automation with AUTOMATION_DB_URL=sqlite+aiosqlite:///.../automation/automations.db.

The queue design is a database-backed queue, not a separate broker. automation_runs is both queue and run history:

  • Producers create automation_runs(status=PENDING):
    • cron scheduler, after deciding an automation is due
    • manual dispatch endpoint
    • event/webhook receiver, one run per matched automation
  • Dispatcher consumes PENDING rows, marks them RUNNING, sets started_at and timeout_at, commits, then starts execution asynchronously.
  • Completion is async: the SDK script calls /runs/{id}/complete when it exits.
  • Watchdog scans RUNNING rows past timeout_at; it verifies the agent-server/bash state where possible, then marks COMPLETED or FAILED.
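The lifecycle above can be modeled in memory as a small state machine. This is a simplified sketch with hypothetical names: the real queue lives in the automation_runs table, and the real watchdog checks agent-server/bash state before choosing COMPLETED or FAILED, whereas this one always fails expired runs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class RunStatus(str, Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


@dataclass
class Run:
    id: int
    status: RunStatus = RunStatus.PENDING
    started_at: Optional[float] = None
    timeout_at: Optional[float] = None


def dispatch(runs: List[Run], now: float, timeout: int) -> List[Run]:
    """Claim every PENDING run: mark it RUNNING and stamp its deadline."""
    claimed = [r for r in runs if r.status is RunStatus.PENDING]
    for run in claimed:
        run.status = RunStatus.RUNNING
        run.started_at = now
        run.timeout_at = now + timeout
    return claimed


def complete(run: Run, success: bool) -> None:
    """Completion callback: only a RUNNING run may reach a terminal state."""
    if run.status is RunStatus.RUNNING:
        run.status = RunStatus.COMPLETED if success else RunStatus.FAILED


def watchdog(runs: List[Run], now: float) -> List[Run]:
    """Fail RUNNING runs whose deadline has passed (simplified)."""
    expired = [
        r for r in runs
        if r.status is RunStatus.RUNNING
        and r.timeout_at is not None
        and now > r.timeout_at
    ]
    for run in expired:
        run.status = RunStatus.FAILED
    return expired
```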

Cloud reliability: fairly good for a DB-backed queue, but not exactly-once.

Why it is reliable in cloud:

  • Postgres row locks are used for schedulers/dispatchers: FOR UPDATE SKIP LOCKED. Multiple workers can poll without claiming the same automation/run row.
  • The queue is durable: once a PENDING row is committed, it survives process restarts.
  • Dispatcher claim is transactional: selected PENDING rows are changed to RUNNING before execution starts, so another dispatcher should not also run them.
  • Terminal status is guarded by optimistic updates (WHERE status = RUNNING), so callback vs watchdog races do not flip a completed/failed run twice; late callbacks get conflict semantics.
  • Missing callbacks are bounded by timeout_at + watchdog. This PR matters there because the configured timeout becomes the watchdog deadline.
  • Cloud execution gets a fresh sandbox per run, so run side effects are isolated; the watchdog can use stored sandbox_id/bash_command_id to inspect the right command output.
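The guarded terminal update (WHERE status = RUNNING) can be demonstrated with an in-memory SQLite table. The two-column schema and the `finish_run` helper are hypothetical simplifications, and SQLite is used only to keep the sketch runnable; the cloud path described above is Postgres.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE automation_runs (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
)
conn.execute("INSERT INTO automation_runs (id, status) VALUES (1, 'RUNNING')")


def finish_run(conn: sqlite3.Connection, run_id: int, terminal_status: str) -> bool:
    """Move a run to a terminal status only if it is still RUNNING."""
    cursor = conn.execute(
        "UPDATE automation_runs SET status = ? WHERE id = ? AND status = 'RUNNING'",
        (terminal_status, run_id),
    )
    conn.commit()
    # rowcount is 0 when another writer already closed the run.
    return cursor.rowcount == 1


callback_won = finish_run(conn, 1, "COMPLETED")  # completion callback arrives first
watchdog_won = finish_run(conn, 1, "FAILED")     # late watchdog update is a no-op
final_status = conn.execute(
    "SELECT status FROM automation_runs WHERE id = 1"
).fetchone()[0]
```

Whichever writer arrives second sees zero affected rows, which is the signal a service could map to the conflict semantics mentioned above.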

Where it is not fully reliable / not exactly-once:

  • Webhook/event dedup is not implemented. If GitHub/custom webhook delivery is retried after the service committed the run but before the sender observed the 2xx response, the retry can create a second PENDING run. The code comments explicitly note replay/dedup as a consideration.
  • There is no idempotency key or unique constraint on provider delivery IDs.
  • Once a run is marked RUNNING, failures are generally terminal, not retried/requeued. If the process dies after claiming a run but before successfully starting execution, the watchdog will eventually mark it failed/timed out rather than automatically re-attempting it.
  • Sandbox/API creation has some targeted retries for rate limits/auth refresh, but the queue itself is not an at-least-once retry system with attempt counters/backoff/dead-lettering.

So I’d describe cloud semantics as: durable scheduling + single-consumer claiming + bounded failure detection, but duplicate event ingress is possible and execution is one-attempt unless manually retriggered. User automation code should still be idempotent when responding to external events.

Local reliability: same state machine, weaker concurrency guarantees by design.

  • Local Agent Canvas uses SQLite and a single automation sidecar process. In that intended setup, the DB-backed queue is durable enough for laptop use: committed PENDING rows survive restart, and the dispatcher/watchdog pick them up when the service is running again.
  • SQLite mode skips FOR UPDATE SKIP LOCKED; the code assumes single-process local mode. If a user starts multiple local automation services against the same SQLite DB, duplicate claiming/races are not protected the way Postgres cloud is.
  • Local execution uses a persistent local agent-server, not fresh cloud sandboxes. It does isolate each run into its own automation-runs/{run_id} workspace and records bash_command_id for verification, but it is not the same isolation/HA story as cloud.
  • If the local machine sleeps or the service is down, cron dispatch pauses. On restart, the scheduler may create a run for a missed due window, but it does not replay every missed interval as a full backlog.

Net: cloud has the right Postgres primitives for multi-worker queue claiming, but the current design is not a complete exactly-once message queue. Local is intentionally simpler: single-process SQLite queue, good for dev/laptop, not something I’d call HA or multi-consumer safe.

If we want automations to be robust under “things fire dupes, networks lie, agent runs vanish,” the next hardening step would be explicit idempotency/dedup keys for webhook deliveries, run attempt counters with retry/backoff/dead-letter states, and maybe a lease/heartbeat model for claimed RUNNING jobs rather than treating post-claim crashes as terminal timeout failures.
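One way the proposed webhook dedup key could work is a unique constraint on the provider's delivery ID, sketched here with SQLite. The `delivery_id` column, the schema, and `enqueue_once` are all hypothetical; nothing like this exists in the service today according to the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE automation_runs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        automation_id TEXT NOT NULL,
        delivery_id TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'PENDING',
        UNIQUE (automation_id, delivery_id)
    )
    """
)


def enqueue_once(conn: sqlite3.Connection, automation_id: str, delivery_id: str) -> bool:
    """Create a PENDING run unless this delivery was already enqueued."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO automation_runs (automation_id, delivery_id) "
        "VALUES (?, ?)",
        (automation_id, delivery_id),
    )
    conn.commit()
    return cursor.rowcount == 1


first = enqueue_once(conn, "auto-1", "delivery-abc")  # original webhook delivery
retry = enqueue_once(conn, "auto-1", "delivery-abc")  # sender retried the same event
pending_count = conn.execute("SELECT COUNT(*) FROM automation_runs").fetchone()[0]
```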

This PR comment was generated by an AI agent (OpenHands) on behalf of the user.

@jpshackelford jpshackelford left a comment

Member

There are some great things that this PR does to move the code in the direction of flexibility and stability at scale. A few of the things I like:

  • This PR limits the blast radius of the increased cap by making it opt-in on a case-by-case basis: default_run_duration=600 stays the default for every automation, and a run only reaches the higher max_run_duration=1800 ceiling when it explicitly sets a longer timeout.

  • Single source of truth: collapsing the duplicated schema + dispatcher logic into utils/timeout.py removes drift and gives one chokepoint for future changes.

  • Consistent timeout handling across all creation paths: the prompt and plugin presets get the same validation and defaulting as the standard create endpoint, so behavior no longer diverges by entry point.

  • Defense-in-depth preserved — resolve_automation_timeout_seconds still applies min(timeout, max) at dispatch, and all three enforcement points (API validate, bash-kill, watchdog deadline) stay intact.

  • Cleans up dead weight by removing the deprecated constants.py shim and routing everything through config/helpers.

There are a few outstanding issues I have concerns about, and I'm going to mark this as request changes until we've talked them through. In summary:

  • Concern 1 (blocker) — raising the cap makes automations more likely to silently suspend a user's own live session. At the per-user sandbox cap the platform pauses the oldest running sandbox rather than rejecting, and automation never checks capacity before firing — so a longer ceiling means more time at the cap and more evictions of the user's own work. The graceful-skip path that would have caught this is dead code (its CONCURRENCY_LIMIT_REACHED 429 producer was reverted in #14877, leaving eviction as the deliberate behavior).

  • Concern 2 (minor) — at the 30-min ceiling a dead run can keep showing as Running for ~20 min longer than before. The watchdog closes a run's record off the wall-clock timeout_at rather than the liveness the platform already tracks (idle_time / sandbox status), so raising the ceiling stretches the mislabeled-Running window from ~11 to ~31 min. Reporting-only; the sandbox itself is reclaimed independently.

  • Concern 3 — a single automation on a tight schedule can chew through the user's whole sandbox budget by itself, evicting their live work and burning LLM spend on runs that get killed mid-flight. Nothing stops an automation from overlapping itself: the scheduler fires on cron timing with no in-flight check, no overlap guard, and no minimum interval, so a run that outlives its schedule stacks concurrent runs (up to ~30 for an every-minute schedule). The higher ceiling makes the stacking deeper, and it feeds the same cap-eviction churn as Concern 1. It doesn't take very many of these to destabilize an entire installation.
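The "up to ~30" figure in Concern 3 follows from simple arithmetic, sketched below under the worst-case assumption that every run lives until its timeout and the scheduler never skips a tick. `max_overlapping_runs` is a hypothetical helper for illustration.

```python
import math


def max_overlapping_runs(timeout_seconds: int, interval_seconds: int) -> int:
    """Worst-case number of runs of one automation alive at the same time."""
    return math.ceil(timeout_seconds / interval_seconds)


at_old_ceiling = max_overlapping_runs(600, 60)   # every-minute cron, 10-minute cap
at_new_ceiling = max_overlapping_runs(1800, 60)  # every-minute cron, 30-minute cap
```

At the old 600-second cap the same schedule stacks at most 10 runs, which already matches the default per-user sandbox cap of 10 mentioned in Concern 1.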

@jpshackelford

jpshackelford commented Jun 26, 2026

Member

A HUMAN TOOK THE TIME TO RESEARCH AND WRITE THIS AND THE OTHER COMMENTS HERE

Concern 1. — As written, raising the cap makes automations more likely to silently suspend a user's own live session

Raising the ceiling makes a scheduled automation more likely to silently suspend the user's own live interactive session mid-task. At the per-user sandbox cap (max_num_sandboxes, default 10), the app-server doesn't reject a new sandbox once a user is at the limit — it pauses the user's oldest running one to make room, and automation creates its sandbox without checking. Longer runs hold slots longer, so users sit at the cap more of the time, so that eviction fires more often. The code looks like it handles the cap gracefully, but that path is dead — and the history is the interesting part.

WHAT I NOTICED

The graceful-skip path is an orphan. backends/cloud.py:_concurrency_limit_detail keys off a CONCURRENCY_LIMIT_REACHED 429 and marks the run SKIPPED, with a clear "Skipped – Limit Reached" badge already wired into the activity log. That was the automation half of a coordinated "enforce conversation limits" feature: automation #174 added this consumer on 2026-06-10, and OpenHands #14168 added the matching 429 producer on 2026-06-15 (Personal Org cap of 3, default 10). #14168 was reverted two days later in #14877 — and notably the revert wasn't wholesale: it deliberately kept pause_old_sandboxes (the eviction behavior) while removing the hard limit, describing it as "preserving sandbox-service-level cleanup." So eviction is the behavior the platform chose to keep, not an accident or an oversight. Today there's no producer for that 429 anywhere in OpenHands main, the SKIPPED path never fires, and the only service that serves POST /api/v1/sandboxes answers the cap by pausing the oldest sandbox, not rejecting.

The consequence: the run isn't skipped, it proceeds, and the platform evicts the user's oldest sandbox to make room. Automation runs under the owner's own identity, so a user's scheduled automations and their interactive sessions share one cap, and the eviction picks its victim purely by age — a long-open debugging session is exactly the oldest. So the next scheduled run silently suspends their live work mid-task, with nothing connecting the stall to the automation that displaced it.

THE CONNECTION TO THIS PR

Once users start building long-running conversations via automation, their runs hold sandbox slots far longer — so a user sits at or near the cap a much larger fraction of the time, which is exactly the state where the next sandbox request (another run, or their own new interactive session) triggers an eviction. Automation doesn't check any of this before firing. So the higher ceiling doesn't merely allow longer runs — it raises how often a user is in the state where their own automation pauses their own live work.

WHAT NEEDS TO HAPPEN

The SKIPPED status and its "Skipped – Limit Reached" badge already exist and are the right UX — they just have no trigger today. Give them one, depending on the product direction:

  • If the hard-limit/429 is coming back: coordinate the re-land so the SKIPPED path fires again, and reword "Skipped – Limit Reached" to name the actual limit (ideally the user's current count) — the cleanup we already flagged for that message.

  • If eviction is the settled model: add a pre-flight capacity check before dispatch. Automation can already get half of what it needs: it lists the user's sandboxes via GET /api/v1/sandboxes (which we call in get_sandbox_agent_url), so the running count is free. The other half, max_num_sandboxes, lives only inside the app-server (a config field, default 10) and isn't exposed on any route, so this needs a small app-server change: either surface the limit / remaining capacity on the sandbox API, or add a "skip-don't-evict" create mode that returns a limit signal instead of pausing the oldest sandbox. With either, automation marks the run SKIPPED with that message instead of creating a sandbox and evicting a live session. (We should not hardcode max_num_sandboxes in automation; that just recreates the lockstep-coupling problem we're already worried about.)

  • Another option is a separate pool for automation conversations and sandboxes, distinct from those started interactively or by another API call, so that automations only disrupt or limit other automations, not work the user manages directly.

Either way the end state is the same — a run at the cap is visibly skipped, never a silent eviction — and that decision should land before we raise the reachable ceiling.
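
A rough sketch of the pre-flight decision, to make the shape concrete. This is an illustration under assumptions, not the automation service's actual code: `preflight_capacity_check`, `PreflightResult`, and the `sandbox_limit` input are hypothetical names, and the limit is assumed to come from whatever the app-server ends up exposing rather than from a hardcoded constant.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RunStatus(str, Enum):
    # Only the two statuses this sketch needs; the real enum has more.
    PENDING = "PENDING"
    SKIPPED = "SKIPPED"


@dataclass
class PreflightResult:
    status: RunStatus
    reason: Optional[str] = None


def preflight_capacity_check(
    running_sandboxes: int, sandbox_limit: Optional[int]
) -> PreflightResult:
    """Decide whether a run can be dispatched without forcing an eviction.

    running_sandboxes: count taken from listing the user's sandboxes.
    sandbox_limit: limit reported by the app-server (hypothetical field).
        None means the limit is unknown, so the run proceeds as it does today.
    """
    if sandbox_limit is None:
        return PreflightResult(RunStatus.PENDING)
    if running_sandboxes >= sandbox_limit:
        # Name the actual limit and the current count in the message.
        reason = (
            f"Skipped: sandbox limit reached "
            f"({running_sandboxes}/{sandbox_limit} running)"
        )
        return PreflightResult(RunStatus.SKIPPED, reason)
    return PreflightResult(RunStatus.PENDING)
```

The check and the sandbox create are not atomic in this form, so a "skip-don't-evict" create mode on the app-server would still be the more robust of the two variants.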

@jpshackelford

jpshackelford commented Jun 26, 2026

Member

Concern 2. (minor) — At the 30-min ceiling, a dead run can keep showing as Running in the UI for ~20 min longer than before

Minor and not a blocker: the agent-server exposes idle_time, and runtime-api uses it to pause/reap idle or dead runtimes (with k8s liveness for acute pod failures), so the sandbox is reclaimed regardless of the automation watchdog.

That said, the watchdog closes a run's record only once timeout_at (= started_at + run_timeout) passes, on a 60s tick. So when the completion callback is lost (pod crash, network blip), the run keeps showing Running in the activity log until its own deadline, even though the run is effectively dead and the pod may already be gone. Raising the opt-in ceiling from 10 to 30 min stretches that mislabeled-Running window from ~11 min to ~31 min.

WHAT NEEDS TO HAPPEN

Nothing for this PR. Later, have the watchdog reconcile against the liveness the platform already exposes (idle_time / sandbox status) instead of the wall-clock timeout_at, so the run record closes promptly regardless of the ceiling. No new heartbeat needed — the signal already exists.
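
For whenever this gets picked up, a minimal sketch of what the reconciliation could look like. Everything here is assumed for illustration: `should_close_run`, the `"MISSING"` / `"ERROR"` status strings, and the `idle_threshold_seconds` default are hypothetical, and the wall-clock deadline is kept as a backstop rather than removed.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical status values that would mean the sandbox is no longer usable.
DEAD_SANDBOX_STATUSES = {"MISSING", "ERROR"}


def should_close_run(
    now: datetime,
    timeout_at: datetime,
    sandbox_status: Optional[str],
    idle_seconds: Optional[float],
    idle_threshold_seconds: float = 300.0,
) -> Optional[str]:
    """Return a reason to close a RUNNING run, or None to leave it open."""
    # Liveness first: a dead sandbox closes the record immediately.
    if sandbox_status in DEAD_SANDBOX_STATUSES:
        return "sandbox gone"
    # Then idleness, when the platform reports it.
    if idle_seconds is not None and idle_seconds >= idle_threshold_seconds:
        return "sandbox idle"
    # Wall-clock deadline stays as the last resort.
    if now >= timeout_at:
        return "timeout"
    return None
```

With this ordering, a run whose pod crashed two minutes in would be closed on the next watchdog tick rather than at `timeout_at`.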

  Environment variables (AUTOMATION_ prefix):
- AUTOMATION_MAX_RUN_DURATION: Max run time in seconds (default: 600)
+ AUTOMATION_DEFAULT_RUN_DURATION: Default run time in seconds (default: 600)
+ AUTOMATION_MAX_RUN_DURATION: Max user-configurable run time in seconds

@jpshackelford jpshackelford Jun 26, 2026

Member

Since AUTOMATION_MAX_RUN_DURATION is the operator-facing knob, could we document its coupling to the runtime idle window right here in the env-var docs? Something like a Note: block after this list:

    Note:
        AUTOMATION_MAX_RUN_DURATION (the user-configurable ceiling) is coupled to
        runtime-api's RUNTIME_IDLE_SECONDS. At 1800 the two match, so any run that
        finishes within the window stays inside the runtime's idle grace period and
        is never paused for inactivity. Raising AUTOMATION_MAX_RUN_DURATION above the
        runtime idle window (OH_RUNTIME_IDLE_TIMEOUT_SECONDS) is only safe if
        RUNTIME_IDLE_SECONDS is raised in lockstep — otherwise a run that goes idle
        past the window (a stall, a long event-less LLM turn, an infrequently-polled
        background job) is paused mid-execution and fails silently. (Local Agent
        Canvas runs have no runtime idle cleanup, so this applies only to SaaS /
        OpenHands Enterprise.)
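
If we wanted more than a docs note, the same coupling could be checked at startup. A small sketch, assuming the operator can hand automation the runtime idle window; `check_idle_window_coupling` and its `runtime_idle_seconds` parameter are hypothetical, and automation has no such input today.

```python
import warnings
from typing import Optional


def check_idle_window_coupling(
    max_run_duration: int, runtime_idle_seconds: Optional[int]
) -> bool:
    """Return True when the ceiling fits inside the runtime idle window.

    runtime_idle_seconds is None when there is no runtime idle cleanup
    (for example a local install), in which case any ceiling is fine.
    """
    if runtime_idle_seconds is None:
        return True
    if max_run_duration > runtime_idle_seconds:
        warnings.warn(
            f"max_run_duration ({max_run_duration}s) exceeds the runtime idle "
            f"window ({runtime_idle_seconds}s); idle runs may be paused "
            "mid-execution",
            RuntimeWarning,
        )
        return False
    return True
```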

@jpshackelford

Member

Concern 3. — Nothing prevents a single automation from overlapping itself, and the higher ceiling makes it worse

The scheduler fires on cron timing alone — there's no check that the previous run has finished — so an automation whose run outlives its own interval stacks concurrent runs. This isn't new, but raising the cap from 10 to 30 minutes triples the maximum overlap for a given schedule — a */10 schedule now yields up to three concurrent runs of the same automation, each holding its own sandbox. And it's worse than that, because nothing validates the schedule's frequency — a cron can fire as often as once a minute, so an every-minute schedule against a 30-minute run stacks up to ~30 concurrent runs.
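
The arithmetic behind those numbers, as a quick sketch. It assumes every run lasts the full duration and the schedule fires at a fixed interval; `max_concurrent_runs` is just a helper name for this comment.

```python
import math


def max_concurrent_runs(run_duration_seconds: int, interval_seconds: int) -> int:
    """Worst-case overlapping runs of one automation on a fixed interval.

    A run started at time t is still active at any fire time in
    [t, t + run_duration), so the count is ceil(duration / interval).
    """
    if run_duration_seconds <= 0 or interval_seconds <= 0:
        raise ValueError("durations must be positive")
    return max(1, math.ceil(run_duration_seconds / interval_seconds))


print(max_concurrent_runs(600, 600))    # */10 schedule, 10-minute cap
print(max_concurrent_runs(1800, 600))   # */10 schedule, 30-minute cap
print(max_concurrent_runs(1800, 60))    # every-minute schedule, 30-minute cap
```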

WHAT I NOTICED

is_automation_due (utils/cron.py) is purely cron-vs-clock: it returns due when prev_fire_time > last_triggered_at, and last_triggered_at is stamped at schedule time in create_pending_run, not at completion — so a still-running prior run never defers the next fire. Neither create_pending_run nor the dispatcher checks for an in-flight run of the same automation; the dispatcher claims any row with status == PENDING. So per-automation concurrency is unbounded. The only frequency-related limit is POLL_INTERVAL_SECONDS = 60 (a polling cadence) plus cron's 1-minute granularity, which together cap fires at ~once/minute, and validate_cron_schedule only checks croniter.is_valid — nothing rejects a tight schedule.

THE CONNECTION TO THIS PR

The stack isn't literally unbounded in sandboxes — the per-user cap (10) plus pause_old_sandboxes bounds concurrent sandboxes. But that bound is the eviction from Concern 1: a stacking automation hits the cap and the platform pauses the oldest sandbox — which can be the user's own live session, or an earlier run of the same automation. So instead of "unbounded sandboxes," stacking degrades into self-inflicted churn: create → evict → the evicted run loses its callback and fails, while every fire still burns a full agent run's LLM spend. Raising the ceiling lets a given schedule reach deeper into that churn — which is why the higher number turns a previously-shallow overlap into a real one.

WHAT NEEDS TO HAPPEN

  • Add an overlap guard: before creating a PENDING run, skip the fire if the automation already has a RUNNING or PENDING run, and mark it SKIPPED with a clear reason ("previous run still in progress"). This is the cheapest single change, it's fully within automation's control (no app-server dependency), and it bounds each automation to one run at a time — which also removes a whole class of the cap pressure feeding Concern 1.
  • Set a minimum allowed interval on cron automations, so validate_cron_schedule rejects schedules that fire faster than we're willing to support (the every-minute case above).
  • If we don't add the guard before raising the ceiling, at minimum validate or warn at config time when the schedule interval is shorter than max_run_duration, so users don't silently stack runs.
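
To make the first two bullets concrete, a minimal sketch of both checks. The names are illustrative only (`decide_next_run`, `violates_min_interval`, and this trimmed `RunStatus` are not the repo's actual code), and the fire times are assumed to be sampled from the cron expression by the caller.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Iterable, Optional, Sequence, Tuple


class RunStatus(str, Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    SKIPPED = "SKIPPED"


IN_FLIGHT = {RunStatus.PENDING, RunStatus.RUNNING}


def decide_next_run(
    existing_statuses: Iterable[RunStatus],
) -> Tuple[RunStatus, Optional[str]]:
    """Overlap guard: skip the fire if a prior run is still in flight."""
    if any(status in IN_FLIGHT for status in existing_statuses):
        return RunStatus.SKIPPED, "previous run still in progress"
    return RunStatus.PENDING, None


def violates_min_interval(
    fire_times: Sequence[datetime], minimum: timedelta
) -> bool:
    """True if any two consecutive sampled fire times are closer than minimum."""
    ordered = sorted(fire_times)
    return any(later - earlier < minimum for earlier, later in zip(ordered, ordered[1:]))
```

In the real service the in-flight lookup and the insert of the new run would need to happen in one transaction (or behind a lock), otherwise two scheduler ticks could both pass the guard.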

@jpshackelford

jpshackelford commented Jun 26, 2026

Member

ALTERNATIVES

  • Consider gating this setting so that it applies only to local installs and not to SaaS or OpenHands Enterprise.

  • For long-running conversations, a queue-and-next-tick approach, rather than a cron-type scheduler, sidesteps most of the issues we have with cron-based automations and lets users build long-running, complex workflows with a very simple and scale-safe primitive.
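
One possible reading of "queue and next tick", sketched to show why it avoids self-overlap. This is my interpretation, not an existing design: `NextTickQueue` and its methods are hypothetical, and times are plain seconds to keep the example small. The key property is that an automation's next tick is only enqueued when its current run completes.

```python
import heapq
from typing import Dict, List, Tuple


class NextTickQueue:
    """Each automation has at most one outstanding tick at any time."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, str]] = []
        self._intervals: Dict[str, float] = {}

    def register(self, automation_id: str, interval_seconds: float, now: float) -> None:
        # The first tick is due immediately.
        self._intervals[automation_id] = interval_seconds
        heapq.heappush(self._heap, (now, automation_id))

    def pop_due(self, now: float) -> List[str]:
        # Popped automations are "running" and have no tick in the queue.
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due

    def complete(self, automation_id: str, now: float) -> None:
        # Only completion schedules the next tick, so runs cannot stack.
        next_at = now + self._intervals[automation_id]
        heapq.heappush(self._heap, (next_at, automation_id))
```

A 30-minute run on a 10-minute interval then yields one run at a time, with the next one starting 10 minutes after the previous one finishes.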

@jpelletier1


@malhotra5 help me understand the user-facing scenario for configuring this. Is it:

  • As a user who is prompting OpenHands to create an Automation, I can include the desired automation timeout duration in my prompt and OpenHands will know how to configure that.

@malhotra5

Member Author

@jpelletier1 Yes, the user can configure the timeout limit for an automation (with a hard cap).

Labels

qa-this Trigger OpenHands QA validation review-this

6 participants