fix(copilot): add tool call circuit breakers and intermediate persistence #12604
Conversation
…ence

- Add WebSearch call cap (15/session) to prevent runaway research loops
- Add total tool call cap (100/turn) as hard circuit breaker
- Add web search best practices guidance to system prompt
- Add intermediate session persistence (every 30s or 10 messages)
- Add tests for WebSearch cap and total tool call cap

Addresses findings from session d2f7cba3: 179 WebSearch calls, $20.66 cost, 82 minutes for a single user message.
Zamil Majdy seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. Have you already signed the CLA but the status is still pending? Let us recheck it.
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in your CodeRabbit settings.
Walkthrough

Adds a "Web search best practices" prompt subsection, enforces per-turn WebSearch and total tool-call caps in security hooks with denial logic, expands tests for those caps, and adds intermediate periodic persistence of streaming session messages during stream attempts.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant ServiceLayer as Service Layer
    participant SecurityHooks as Security Hooks
    participant ToolExecution as Tool Execution
    participant SessionStore as Session Store
    Client->>ServiceLayer: start_stream_attempt()
    ServiceLayer->>ServiceLayer: init _last_flush_time, _msgs_since_flush
    loop Stream Processing
        ToolExecution->>SecurityHooks: invoke_tool(name, args)
        activate SecurityHooks
        SecurityHooks->>SecurityHooks: increment total_tool_call_count
        alt total_tool_call_count > max_tool_calls
            SecurityHooks-->>ToolExecution: deny(reason: exceeds total tool calls)
        else name == "WebSearch"
            SecurityHooks->>SecurityHooks: increment web_search_count
            alt web_search_count > max_web_searches
                SecurityHooks-->>ToolExecution: deny(reason: web search cap exceeded)
            else
                SecurityHooks-->>ToolExecution: allow()
            end
        else
            SecurityHooks-->>ToolExecution: allow()
        end
        deactivate SecurityHooks
        ServiceLayer->>ServiceLayer: increment _msgs_since_flush / check time
        alt flush threshold reached (>=10 msgs or >=30s)
            ServiceLayer->>SessionStore: upsert_chat_session (asyncio.shield)
            SessionStore-->>ServiceLayer: success / failure
            ServiceLayer->>ServiceLayer: reset _last_flush_time, _msgs_since_flush
        end
    end
    ServiceLayer->>SessionStore: final upsert_chat_session (finally)
```
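The intermediate-persistence thresholds described in the PR (flush every 30 seconds or every 10 messages) can be sketched as follows. `FlushTracker` and its method names are illustrative stand-ins, not the actual `service.py` implementation:

```python
import time

# Thresholds taken from the PR description; names are hypothetical.
FLUSH_INTERVAL_SECONDS = 30.0
FLUSH_MESSAGE_THRESHOLD = 10


class FlushTracker:
    """Tracks when an intermediate session flush is due."""

    def __init__(self) -> None:
        self._last_flush_time = time.monotonic()
        self._msgs_since_flush = 0

    def record_message(self) -> bool:
        """Count a streamed message; return True when a flush is due."""
        self._msgs_since_flush += 1
        elapsed = time.monotonic() - self._last_flush_time
        return (
            self._msgs_since_flush >= FLUSH_MESSAGE_THRESHOLD
            or elapsed >= FLUSH_INTERVAL_SECONDS
        )

    def mark_flushed(self) -> None:
        """Reset both counters after a successful upsert."""
        self._last_flush_time = time.monotonic()
        self._msgs_since_flush = 0
```

Either threshold alone is enough to trigger a flush, so a slow trickle of messages still gets persisted at least every 30 seconds.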
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 2
🧹 Nitpick comments (1)
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py (1)
327-327: Move fixture imports to module top-level.
`create_security_hooks` is imported inside fixtures at Line 327 and Line 384; this should be a top-level import for backend Python files unless it is a heavy optional dependency. As per coding guidelines, "Import only at the top level; no local/inner imports except for lazy imports of heavy optional dependencies like `openpyxl`".
openpyxl".Also applies to: 384-384
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py` at line 327, Move the local imports of create_security_hooks out of the fixtures and into the module top-level: remove any inner/local imports of create_security_hooks inside the fixtures and add a single top-level import statement for create_security_hooks at the top of security_hooks_test.py so the fixtures (referencing create_security_hooks at the locations flagged) use the module-level symbol; ensure you don't introduce a heavy optional dependency—only keep it top-level if it's lightweight or keep a conditional lazy import with a clear comment if it is heavy.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@autogpt_platform/backend/backend/copilot/sdk/security_hooks.py`:
- Around line 164-169: The web_search_count and total_tool_call_count counters
defined inside create_security_hooks are closure-local and therefore reset
whenever hooks are recreated (so per-stream rather than per-session); change
them to be persisted per session_id (e.g., store/retrieve counters in Redis or
the app's session model keyed by session_id) and update create_security_hooks to
read the counters at creation and atomically increment/store updates on each
tool call; apply the same change for the other counters referenced around the
later block (the section at the other occurrence you noted) so all
circuit-breaker limits are session-persistent rather than closure-scoped.
In `@autogpt_platform/backend/backend/copilot/sdk/service.py`:
- Around line 1547-1574: The intermediate DB flush currently writes the live
ctx.session via upsert_chat_session whenever _msgs_since_flush/_last_flush_time
trigger, which can persist messages from attempts that may later be rolled back
in memory; change this so rollbackable attempt state is not durably persisted:
either stop calling upsert_chat_session for messages belonging to the current
in-flight attempt (guard the flush using the attempt/turn state or
_attempt_is_rollbackable flag) or, if you must persist, ensure the rollback is
made durable by persisting an explicit attempt checkpoint/transaction marker and
the corresponding rollback operation before retrying (update the flush logic
around _msgs_since_flush/upsert_chat_session and coordinate with the in-memory
rollback path that removes attempt messages so DB state remains consistent).
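The `asyncio.shield` around the intermediate upsert shown in the sequence diagram can be sketched like this; `flush_session` and `upsert` are illustrative stand-ins for the real `upsert_chat_session` call in `service.py`:

```python
import asyncio


async def flush_session(upsert, session) -> bool:
    """Shield an intermediate flush so cancelling the stream task does
    not abort an in-flight DB write. Failures are swallowed because the
    final upsert in the `finally` block is the durability backstop."""
    try:
        await asyncio.shield(upsert(session))
        return True
    except Exception:
        return False
```

`asyncio.shield` lets the awaited write finish even if the surrounding task is cancelled mid-flush, which matters when a client disconnects during streaming.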
---
Nitpick comments:
In `@autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py`:
- Line 327: Move the local imports of create_security_hooks out of the fixtures
and into the module top-level: remove any inner/local imports of
create_security_hooks inside the fixtures and add a single top-level import
statement for create_security_hooks at the top of security_hooks_test.py so the
fixtures (referencing create_security_hooks at the locations flagged) use the
module-level symbol; ensure you don't introduce a heavy optional dependency—only
keep it top-level if it's lightweight or keep a conditional lazy import with a
clear comment if it is heavy.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8236af4e-fcf5-4868-b440-b362786a130f
📒 Files selected for processing (4)
- autogpt_platform/backend/backend/copilot/prompting.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
- autogpt_platform/backend/backend/copilot/sdk/service.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: check API types
- GitHub Check: Seer Code Review
- GitHub Check: test (3.12)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
- GitHub Check: end-to-end tests
- GitHub Check: Check PR Status
- GitHub Check: Analyze (python)
🧰 Additional context used
📓 Path-based instructions (4)
autogpt_platform/backend/**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
autogpt_platform/backend/**/*.py: Use Python 3.11 (required; managed by Poetry via pyproject.toml) for backend development
Always run 'poetry run format' (Black + isort) before linting in backend development
Always run 'poetry run lint' (ruff) after formatting in backend development. Refer to @backend/CLAUDE.md for backend-specific commands, architecture, and development tasks
autogpt_platform/backend/**/*.py: Import only at the top level; no local/inner imports except for lazy imports of heavy optional dependencies like `openpyxl`
Use absolute imports with `from backend.module import ...` for cross-package imports; single-dot relative imports (`from .sibling import ...`) are acceptable for sibling modules within the same package; avoid double-dot relative imports (`from ..parent import ...`)
Do not use duck typing with `hasattr()`, `getattr()`, or `isinstance()` for type dispatch; use typed interfaces, unions, or protocols instead
Use Pydantic models for structured data instead of dataclasses, namedtuples, or dicts
Do not use linter suppressors; no `# type: ignore`, `# noqa`, or `# pyright: ignore` comments — fix the underlying type/code issue instead
Use list comprehensions instead of manual loop-and-append patterns
Use early return guard clauses to avoid deep nesting
Use `%s` for deferred interpolation in `debug` log statements; use f-strings for readability in other log levels (e.g., `logger.debug("Processing %s items", count)`, `logger.info(f"Processing {count} items")`)
Sanitize error paths using `os.path.basename()` in error messages to avoid leaking directory structure
Avoid TOCTOU (time-of-check-time-of-use) patterns; do not use check-then-act patterns for file access and credit charging operations
Use Redis pipelines with `transaction=True` for atomicity on multi-step Redis operations
Use `max(0, value)` guards for computed values that should never be negative
Keep files under ~300 lines; if a file grows beyond this, split by responsibility (extract h...
Files:
- autogpt_platform/backend/backend/copilot/prompting.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
- autogpt_platform/backend/backend/copilot/sdk/service.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
autogpt_platform/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Format Python code with `poetry run format`
Files:
- autogpt_platform/backend/backend/copilot/prompting.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
- autogpt_platform/backend/backend/copilot/sdk/service.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
autogpt_platform/backend/**/*test*.py
📄 CodeRabbit inference engine (AGENTS.md)
Run `poetry run test` for backend testing (runs pytest with docker based postgres + prisma)
Files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
autogpt_platform/backend/**/*_test.py
📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)
autogpt_platform/backend/**/*_test.py: Use pytest with snapshot testing for API responses; test files should be colocated with source files using the `*_test.py` naming pattern
Mock at boundaries by mocking where the symbol is used, not where it is defined. After refactoring, update mock targets to match new module paths
Use `AsyncMock` from `unittest.mock` for mocking async functions
Files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
🧠 Learnings (13)
📓 Common learnings
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12439
File: autogpt_platform/backend/backend/blocks/autogpt_copilot.py:0-0
Timestamp: 2026-03-16T17:00:02.827Z
Learning: In autogpt_platform/backend/backend/blocks/autogpt_copilot.py, the recursion guard uses two module-level ContextVars: `_copilot_recursion_depth` (tracks current nesting depth) and `_copilot_recursion_limit` (stores the chain-wide ceiling). On the first invocation, `_copilot_recursion_limit` is set to `max_recursion_depth`; nested calls use `min(inherited_limit, max_recursion_depth)`, so they can only lower the cap, never raise it. The entry/exit logic is extracted into module-level helper functions. This is the approved pattern for preventing runaway sub-agent recursion in AutogptCopilotBlock (PR `#12439`, commits 348e9f8e2 and 3b70f61b1).
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12440
File: autogpt_platform/backend/backend/copilot/workflow_import/converter.py:0-0
Timestamp: 2026-03-17T10:57:12.953Z
Learning: In Significant-Gravitas/AutoGPT PR `#12440`, `autogpt_platform/backend/backend/copilot/workflow_import/converter.py` was fully rewritten (commit 732960e2d) to no longer make direct LLM/OpenAI API calls. The converter now builds a structured text prompt for AutoPilot/CoPilot instead. There is no `response.choices` access or any direct LLM client usage in this file. Do not flag `response.choices` access or LLM client initialization patterns as issues in this file.
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12356
File: autogpt_platform/backend/backend/copilot/constants.py:9-12
Timestamp: 2026-03-10T08:39:22.025Z
Learning: In Significant-Gravitas/AutoGPT PR `#12356`, the `COPILOT_SYNTHETIC_ID_PREFIX = "copilot-"` check in `create_auto_approval_record` (human_review.py) is intentional and safe. The `graph_exec_id` passed to this function comes from server-side `PendingHumanReview` DB records (not from user input); the API only accepts `node_exec_id` from users. Synthetic `copilot-*` IDs are only ever created server-side in `run_block.py`. The prefix skip avoids a DB lookup for a `AgentGraphExecution` record that legitimately does not exist for CoPilot sessions, while `user_id` scoping is enforced at the auth layer and on the resulting auto-approval record.
📚 Learning: 2026-03-17T10:57:12.953Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12440
File: autogpt_platform/backend/backend/copilot/workflow_import/converter.py:0-0
Timestamp: 2026-03-17T10:57:12.953Z
Learning: In Significant-Gravitas/AutoGPT PR `#12440`, `autogpt_platform/backend/backend/copilot/workflow_import/converter.py` was fully rewritten (commit 732960e2d) to no longer make direct LLM/OpenAI API calls. The converter now builds a structured text prompt for AutoPilot/CoPilot instead. There is no `response.choices` access or any direct LLM client usage in this file. Do not flag `response.choices` access or LLM client initialization patterns as issues in this file.
Applied to files:
- autogpt_platform/backend/backend/copilot/prompting.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
📚 Learning: 2026-03-27T08:39:43.869Z
Learnt from: ntindle
Repo: Significant-Gravitas/AutoGPT PR: 12592
File: autogpt_platform/frontend/AGENTS.md:1-3
Timestamp: 2026-03-27T08:39:43.869Z
Learning: In Significant-Gravitas/AutoGPT, Claude is the primary coding agent. AGENTS.md files intentionally retain Claude-specific wording (e.g., "CLAUDE.md - Frontend", "This file provides guidance to Claude Code") even though AGENTS.md is the canonical cross-agent instruction source. Do not flag Claude-specific titles or phrasing in AGENTS.md files as issues.
Applied to files:
autogpt_platform/backend/backend/copilot/prompting.py
📚 Learning: 2026-02-26T17:02:22.448Z
Learnt from: Pwuts
Repo: Significant-Gravitas/AutoGPT PR: 12211
File: .pre-commit-config.yaml:160-179
Timestamp: 2026-02-26T17:02:22.448Z
Learning: Keep the pre-commit hook pattern broad for autogpt_platform/backend to ensure OpenAPI schema changes are captured. Do not narrow to backend/api/ alone, since the generated schema depends on Pydantic models across multiple directories (backend/data/, backend/blocks/, backend/copilot/, backend/integrations/, backend/util/). Narrowing could miss schema changes and cause frontend type desynchronization.
Applied to files:
- autogpt_platform/backend/backend/copilot/prompting.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
- autogpt_platform/backend/backend/copilot/sdk/service.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
📚 Learning: 2026-03-04T08:04:35.881Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12273
File: autogpt_platform/backend/backend/copilot/tools/workspace_files.py:216-220
Timestamp: 2026-03-04T08:04:35.881Z
Learning: In the AutoGPT Copilot backend, ensure that SVG images are not treated as vision image types by excluding 'image/svg+xml' from INLINEABLE_MIME_TYPES and MULTIMODAL_TYPES in tool_adapter.py; the Claude API supports PNG, JPEG, GIF, and WebP for vision. SVGs (XML text) should be handled via the text path instead, not the vision path.
Applied to files:
- autogpt_platform/backend/backend/copilot/prompting.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
- autogpt_platform/backend/backend/copilot/sdk/service.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
📚 Learning: 2026-03-05T15:42:08.207Z
Learnt from: ntindle
Repo: Significant-Gravitas/AutoGPT PR: 12297
File: .claude/skills/backend-check/SKILL.md:14-16
Timestamp: 2026-03-05T15:42:08.207Z
Learning: In Python files under autogpt_platform/backend (recursively), rely on poetry run format to perform formatting (Black + isort) and linting (ruff). Do not run poetry run lint as a separate step after poetry run format, since format already includes linting checks.
Applied to files:
- autogpt_platform/backend/backend/copilot/prompting.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
- autogpt_platform/backend/backend/copilot/sdk/service.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
📚 Learning: 2026-03-16T16:35:40.236Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12440
File: autogpt_platform/backend/backend/api/features/workflow_import.py:54-63
Timestamp: 2026-03-16T16:35:40.236Z
Learning: Avoid using the word 'competitor' in public-facing identifiers and text. Use neutral naming for API paths, model names, function names, and UI text. Examples: rename 'CompetitorFormat' to 'SourcePlatform', 'convert_competitor_workflow' to 'convert_workflow', '/competitor-workflow' to '/workflow'. Apply this guideline to files under autogpt_platform/backend and autogpt_platform/frontend.
Applied to files:
- autogpt_platform/backend/backend/copilot/prompting.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
- autogpt_platform/backend/backend/copilot/sdk/service.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
📚 Learning: 2026-03-18T14:03:32.534Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12473
File: autogpt_platform/backend/backend/copilot/tools/agent_browser_integration_test.py:86-195
Timestamp: 2026-03-18T14:03:32.534Z
Learning: In Significant-Gravitas/AutoGPT, the integration tests in `autogpt_platform/backend/backend/copilot/tools/agent_browser_integration_test.py` intentionally use real external URLs (example.com, httpbin.org). They are gated with `pytest.mark.skipif(shutil.which("agent-browser") is None, ...)`, so they are automatically skipped in standard CI where agent-browser is not installed. They are designed to be run explicitly inside the Docker environment to verify that system Chromium (AGENT_BROWSER_EXECUTABLE_PATH=/usr/bin/chromium) actually launches and can fetch pages end-to-end. Do not flag the use of real network calls in these tests as a flakiness concern.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
📚 Learning: 2026-03-25T06:59:27.324Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: autogpt_platform/backend/CLAUDE.md:0-0
Timestamp: 2026-03-25T06:59:27.324Z
Learning: Applies to autogpt_platform/backend/**/*_test.py : Mock at boundaries by mocking where the symbol is used, not where it is defined. After refactoring, update mock targets to match new module paths
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
📚 Learning: 2026-03-15T16:52:15.463Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12426
File: autogpt_platform/backend/backend/copilot/sdk/service.py:0-0
Timestamp: 2026-03-15T16:52:15.463Z
Learning: In Significant-Gravitas/AutoGPT (copilot backend), GitHub tokens (GH_TOKEN / GITHUB_TOKEN) for the `gh` CLI are injected lazily per-command in `autogpt_platform/backend/backend/copilot/tools/bash_exec._execute_on_e2b()` by calling `integration_creds.get_integration_env_vars(user_id)`, not on the global SDK subprocess environment in `sdk/service.py`. This scopes credentials to individual E2B sandbox command invocations and prevents token leakage into tool output streams or uploaded transcripts.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
📚 Learning: 2026-03-17T06:48:26.471Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12445
File: autogpt_platform/backend/backend/copilot/sdk/service.py:1071-1072
Timestamp: 2026-03-17T06:48:26.471Z
Learning: In Significant-Gravitas/AutoGPT (autogpt_platform), the AI SDK enforces `z.strictObject({type, errorText})` on SSE `StreamError` responses, so additional fields like `retryable: bool` cannot be added to `StreamError` or serialized via `to_sse()`. Instead, retry signaling for transient Anthropic API errors is done via the `COPILOT_RETRYABLE_ERROR_PREFIX` constant prepended to persisted session messages (in `ChatMessage.content`). The frontend detects retryable errors by checking `markerType === "retryable_error"` from `parseSpecialMarkers()` — no SSE schema changes and no string matching on error text. This pattern was established in PR `#12445`, commit 64d82797b.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/service.py
📚 Learning: 2026-03-13T15:49:44.961Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12385
File: autogpt_platform/backend/backend/copilot/rate_limit.py:0-0
Timestamp: 2026-03-13T15:49:44.961Z
Learning: In `autogpt_platform/backend/backend/copilot/rate_limit.py`, the original per-session token window (with a TTL-based reset) was replaced with fixed daily and weekly windows. `resets_at` is now derived from `_daily_reset_time()` (midnight UTC) and `_weekly_reset_time()` (next Monday 00:00 UTC) — deterministic fixed-boundary calculations that require no Redis TTL introspection.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/service.py
📚 Learning: 2026-03-16T17:00:02.827Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12439
File: autogpt_platform/backend/backend/blocks/autogpt_copilot.py:0-0
Timestamp: 2026-03-16T17:00:02.827Z
Learning: In autogpt_platform/backend/backend/blocks/autogpt_copilot.py, the recursion guard uses two module-level ContextVars: `_copilot_recursion_depth` (tracks current nesting depth) and `_copilot_recursion_limit` (stores the chain-wide ceiling). On the first invocation, `_copilot_recursion_limit` is set to `max_recursion_depth`; nested calls use `min(inherited_limit, max_recursion_depth)`, so they can only lower the cap, never raise it. The entry/exit logic is extracted into module-level helper functions. This is the approved pattern for preventing runaway sub-agent recursion in AutogptCopilotBlock (PR `#12439`, commits 348e9f8e2 and 3b70f61b1).
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
🔇 Additional comments (2)
autogpt_platform/backend/backend/copilot/prompting.py (1)
110-117: Prompt guidance addition looks solid. The new web-search guardrails are clear and consistent with the circuit-breaker behavior introduced elsewhere in this PR.
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py (1)
340-432: Great cap-behavior test coverage. The allow-under-cap and deny-at-cap+1 assertions for both WebSearch and total tool calls are exactly the right regression surface for this change.
…pe labels, reorder checks, top-level imports

- Guard intermediate DB flush with is_final_attempt flag to prevent persisting messages from attempts that may be rolled back on retry
- Fix 'per session' → 'per turn' in comments/docstrings/denial messages since hooks are recreated per stream invocation
- Reorder circuit breaker checks: WebSearch cap before total counter increment so denied searches don't consume total budget slots
- Move create_security_hooks import to module top-level in tests per CLAUDE.md coding guidelines
The is_final_attempt guard disabled intermediate flush for the common case (first attempt succeeds, which is 99%+ of turns). Retries only fire on context-too-long errors with events_yielded==0, meaning the stream barely started and flush threshold was almost certainly not reached. Keep flush always enabled and document the theoretical edge case.
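The accepted per-turn design (closure-local counters, WebSearch cap checked before the total counter is touched) can be sketched as a closure; names and return shapes here are illustrative, not the actual `security_hooks.py` code:

```python
def create_security_hooks(max_web_searches: int = 15,
                          max_tool_calls: int = 100):
    """Sketch of per-turn circuit-breaker hooks. Hooks are recreated
    per stream invocation, so both budgets reset each turn."""
    web_search_count = 0
    total_tool_call_count = 0

    def pre_tool_use(tool_name: str) -> dict:
        nonlocal web_search_count, total_tool_call_count
        # Check the WebSearch cap first so denied searches do not
        # consume slots from the total tool-call budget.
        if tool_name == "WebSearch":
            if web_search_count + 1 > max_web_searches:
                return {"decision": "deny",
                        "reason": "web search cap exceeded"}
            web_search_count += 1
        if total_tool_call_count + 1 > max_tool_calls:
            return {"decision": "deny",
                    "reason": "total tool call cap exceeded"}
        total_tool_call_count += 1
        return {"decision": "allow"}

    return pre_tool_use
```

Because the counters live in the closure, a fresh `create_security_hooks()` call (i.e., a new turn) starts with full budgets, which is exactly the per-turn semantics the follow-up commit documents.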
🧹 Nitpick comments (1)
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py (1)
379-431: Total tool call cap tests are well-structured. Good test isolation by setting a high `max_web_searches` (100) to focus on the total cap behavior. The tests verify the cap is enforced and the denial message instructs synthesis.

Optional enhancement: Consider adding a test that verifies denied WebSearches don't consume total tool call budget (e.g., hit WebSearch cap at 3, then confirm 5 non-WebSearch calls still succeed with `max_tool_calls=5`). This would provide explicit coverage of the ordering guarantee.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py` around lines 379 - 431, Add a new test ensuring denied WebSearch attempts don't count toward the total tool call cap: create security hooks via create_security_hooks with max_web_searches=3 and max_tool_calls=5, call the PreToolUse hook to perform three WebSearch tool calls (expect them to be denied using _is_denied), then perform five non-WebSearch tool calls (e.g., tool_name "SomeTool") and assert they are all allowed with _is_denied false; place this as an async pytest test (e.g., test_websearch_denials_do_not_consume_total_cap) alongside the existing tests and reuse the PreToolUse hook retrieval pattern used in _hooks_tool_cap.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py`:
- Around line 379-431: Add a new test ensuring denied WebSearch attempts don't
count toward the total tool call cap: create security hooks via
create_security_hooks with max_web_searches=3 and max_tool_calls=5, call the
PreToolUse hook to perform three WebSearch tool calls (expect them to be denied
using _is_denied), then perform five non-WebSearch tool calls (e.g., tool_name
"SomeTool") and assert they are all allowed with _is_denied false; place this as
an async pytest test (e.g., test_websearch_denials_do_not_consume_total_cap)
alongside the existing tests and reuse the PreToolUse hook retrieval pattern
used in _hooks_tool_cap.
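A pytest-style sketch of that suggested test; `make_hooks` is a self-contained stand-in for the real `create_security_hooks`, and the total cap is set to 8 rather than 5 so the arithmetic works out (the three allowed searches also consume three total-budget slots):

```python
def make_hooks(max_web_searches: int, max_tool_calls: int):
    """Minimal stand-in for create_security_hooks (illustrative only)."""
    web, total = 0, 0

    def pre_tool_use(tool_name: str) -> bool:
        nonlocal web, total
        if tool_name == "WebSearch":
            if web + 1 > max_web_searches:
                return False  # denied: does not touch the total counter
            web += 1
        if total + 1 > max_tool_calls:
            return False
        total += 1
        return True

    return pre_tool_use


def test_websearch_denials_do_not_consume_total_cap():
    hook = make_hooks(max_web_searches=3, max_tool_calls=8)
    assert all(hook("WebSearch") for _ in range(3))  # under cap: allowed
    assert not hook("WebSearch")                     # cap hit: denied
    # Denials did not consume total budget; 5 slots remain (8 - 3 used).
    assert all(hook("SomeTool") for _ in range(5))
```

The key assertion is the last one: if denied searches incremented the total counter, the non-WebSearch calls would start getting denied early.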
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6190152c-6340-4456-a19a-38180664f239
📒 Files selected for processing (4)
- autogpt_platform/backend/backend/copilot/prompting.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
- autogpt_platform/backend/backend/copilot/sdk/service.py
✅ Files skipped from review due to trivial changes (1)
- autogpt_platform/backend/backend/copilot/prompting.py
🚧 Files skipped from review as they are similar to previous changes (1)
- autogpt_platform/backend/backend/copilot/sdk/service.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
- GitHub Check: check API types
- GitHub Check: end-to-end tests
- GitHub Check: test (3.12)
- GitHub Check: test (3.11)
- GitHub Check: test (3.13)
- GitHub Check: Seer Code Review
- GitHub Check: Check PR Status
🧰 Additional context used
📓 Path-based instructions (4)
autogpt_platform/backend/**/*.py
📄 CodeRabbit inference engine (.github/copilot-instructions.md)
Files:
- autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
autogpt_platform/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Format Python code with `poetry run format`
Files:
- autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
- autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
autogpt_platform/backend/**/*test*.py
📄 CodeRabbit inference engine (AGENTS.md)
Run `poetry run test` for backend testing (runs pytest with docker based postgres + prisma)
Files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
autogpt_platform/backend/**/*_test.py
📄 CodeRabbit inference engine (autogpt_platform/backend/CLAUDE.md)
autogpt_platform/backend/**/*_test.py: Use pytest with snapshot testing for API responses; test files should be colocated with source files using the `*_test.py` naming pattern
Mock at boundaries by mocking where the symbol is used, not where it is defined. After refactoring, update mock targets to match new module paths
Use `AsyncMock` from `unittest.mock` for mocking async functions
Files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
🧠 Learnings (17)
📓 Common learnings
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12604
File: autogpt_platform/backend/backend/copilot/sdk/security_hooks.py:165-171
Timestamp: 2026-03-30T11:49:36.382Z
Learning: In `autogpt_platform/backend/backend/copilot/sdk/security_hooks.py`, the `web_search_count` and `total_tool_call_count` circuit-breaker counters in `create_security_hooks` are intentionally per-turn (closure-local), not per-session. Hooks are recreated per stream invocation in `service.py`, so counters reset each turn. This is an accepted v1 design: it caps a single runaway turn (incident d2f7cba3: 179 WebSearch calls, $20.66). True per-session persistence via Redis is deferred to a later iteration. Do not flag these as a per-session vs. per-turn mismatch bug.
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12439
File: autogpt_platform/backend/backend/blocks/autogpt_copilot.py:0-0
Timestamp: 2026-03-16T17:00:02.827Z
Learning: In autogpt_platform/backend/backend/blocks/autogpt_copilot.py, the recursion guard uses two module-level ContextVars: `_copilot_recursion_depth` (tracks current nesting depth) and `_copilot_recursion_limit` (stores the chain-wide ceiling). On the first invocation, `_copilot_recursion_limit` is set to `max_recursion_depth`; nested calls use `min(inherited_limit, max_recursion_depth)`, so they can only lower the cap, never raise it. The entry/exit logic is extracted into module-level helper functions. This is the approved pattern for preventing runaway sub-agent recursion in AutogptCopilotBlock (PR `#12439`, commits 348e9f8e2 and 3b70f61b1).
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12356
File: autogpt_platform/backend/backend/copilot/constants.py:9-12
Timestamp: 2026-03-10T08:39:22.025Z
Learning: In Significant-Gravitas/AutoGPT PR `#12356`, the `COPILOT_SYNTHETIC_ID_PREFIX = "copilot-"` check in `create_auto_approval_record` (human_review.py) is intentional and safe. The `graph_exec_id` passed to this function comes from server-side `PendingHumanReview` DB records (not from user input); the API only accepts `node_exec_id` from users. Synthetic `copilot-*` IDs are only ever created server-side in `run_block.py`. The prefix skip avoids a DB lookup for a `AgentGraphExecution` record that legitimately does not exist for CoPilot sessions, while `user_id` scoping is enforced at the auth layer and on the resulting auto-approval record.
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12440
File: autogpt_platform/backend/backend/copilot/workflow_import/converter.py:0-0
Timestamp: 2026-03-17T10:57:12.953Z
Learning: In Significant-Gravitas/AutoGPT PR `#12440`, `autogpt_platform/backend/backend/copilot/workflow_import/converter.py` was fully rewritten (commit 732960e2d) to no longer make direct LLM/OpenAI API calls. The converter now builds a structured text prompt for AutoPilot/CoPilot instead. There is no `response.choices` access or any direct LLM client usage in this file. Do not flag `response.choices` access or LLM client initialization patterns as issues in this file.
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12445
File: autogpt_platform/backend/backend/copilot/sdk/service.py:1071-1072
Timestamp: 2026-03-17T06:48:26.471Z
Learning: In Significant-Gravitas/AutoGPT (autogpt_platform), the AI SDK enforces `z.strictObject({type, errorText})` on SSE `StreamError` responses, so additional fields like `retryable: bool` cannot be added to `StreamError` or serialized via `to_sse()`. Instead, retry signaling for transient Anthropic API errors is done via the `COPILOT_RETRYABLE_ERROR_PREFIX` constant prepended to persisted session messages (in `ChatMessage.content`). The frontend detects retryable errors by checking `markerType === "retryable_error"` from `parseSpecialMarkers()` — no SSE schema changes and no string matching on error text. This pattern was established in PR `#12445`, commit 64d82797b.
📚 Learning: 2026-03-30T11:49:36.382Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12604
File: autogpt_platform/backend/backend/copilot/sdk/security_hooks.py:165-171
Timestamp: 2026-03-30T11:49:36.382Z
Learning: In `autogpt_platform/backend/backend/copilot/sdk/security_hooks.py`, the `web_search_count` and `total_tool_call_count` circuit-breaker counters in `create_security_hooks` are intentionally per-turn (closure-local), not per-session. Hooks are recreated per stream invocation in `service.py`, so counters reset each turn. This is an accepted v1 design: it caps a single runaway turn (incident d2f7cba3: 179 WebSearch calls, $20.66). True per-session persistence via Redis is deferred to a later iteration. Do not flag these as a per-session vs. per-turn mismatch bug.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
📚 Learning: 2026-03-16T17:00:02.827Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12439
File: autogpt_platform/backend/backend/blocks/autogpt_copilot.py:0-0
Timestamp: 2026-03-16T17:00:02.827Z
Learning: In autogpt_platform/backend/backend/blocks/autogpt_copilot.py, the recursion guard uses two module-level ContextVars: `_copilot_recursion_depth` (tracks current nesting depth) and `_copilot_recursion_limit` (stores the chain-wide ceiling). On the first invocation, `_copilot_recursion_limit` is set to `max_recursion_depth`; nested calls use `min(inherited_limit, max_recursion_depth)`, so they can only lower the cap, never raise it. The entry/exit logic is extracted into module-level helper functions. This is the approved pattern for preventing runaway sub-agent recursion in AutogptCopilotBlock (PR `#12439`, commits 348e9f8e2 and 3b70f61b1).
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
📚 Learning: 2026-03-12T14:42:40.552Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12385
File: autogpt_platform/backend/backend/copilot/rate_limit.py:141-170
Timestamp: 2026-03-12T14:42:40.552Z
Learning: In Significant-Gravitas/AutoGPT, `check_rate_limit` in `autogpt_platform/backend/backend/copilot/rate_limit.py` is intentionally a "pre-turn soft check" (not a hard atomic reservation). Because LLM token counts are unknown before generation completes, a strict check-and-reserve is impractical. The TOCTOU race (two concurrent turns both passing the pre-check and both committing via `record_token_usage`) is an accepted trade-off. If stricter enforcement is ever needed, the approach is a Lua script doing GET+INCRBY atomically in Redis.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
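The stricter alternative mentioned in this learning, an atomic check-and-reserve, could look roughly like the sketch below. Both the Lua script and the pure-Python model of its semantics are illustrative assumptions, not the actual `rate_limit.py` code:

```python
# A Lua script runs atomically inside Redis, closing the TOCTOU gap
# between the pre-check (GET) and the commit (INCRBY).
CHECK_AND_RESERVE_LUA = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
local cost = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
if current + cost > limit then
  return 0
end
redis.call('INCRBY', KEYS[1], cost)
return 1
"""

def check_and_reserve(store: dict, key: str, cost: int, limit: int) -> bool:
    """Pure-Python model of the script's semantics (single-threaded stand-in)."""
    current = store.get(key, 0)
    if current + cost > limit:
        return False
    store[key] = current + cost
    return True
```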
📚 Learning: 2026-03-10T08:39:22.025Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12356
File: autogpt_platform/backend/backend/copilot/constants.py:9-12
Timestamp: 2026-03-10T08:39:22.025Z
Learning: In Significant-Gravitas/AutoGPT PR `#12356`, the `COPILOT_SYNTHETIC_ID_PREFIX = "copilot-"` check in `create_auto_approval_record` (human_review.py) is intentional and safe. The `graph_exec_id` passed to this function comes from server-side `PendingHumanReview` DB records (not from user input); the API only accepts `node_exec_id` from users. Synthetic `copilot-*` IDs are only ever created server-side in `run_block.py`. The prefix skip avoids a DB lookup for a `AgentGraphExecution` record that legitimately does not exist for CoPilot sessions, while `user_id` scoping is enforced at the auth layer and on the resulting auto-approval record.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
📚 Learning: 2026-03-17T06:48:26.471Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12445
File: autogpt_platform/backend/backend/copilot/sdk/service.py:1071-1072
Timestamp: 2026-03-17T06:48:26.471Z
Learning: In Significant-Gravitas/AutoGPT (autogpt_platform), the AI SDK enforces `z.strictObject({type, errorText})` on SSE `StreamError` responses, so additional fields like `retryable: bool` cannot be added to `StreamError` or serialized via `to_sse()`. Instead, retry signaling for transient Anthropic API errors is done via the `COPILOT_RETRYABLE_ERROR_PREFIX` constant prepended to persisted session messages (in `ChatMessage.content`). The frontend detects retryable errors by checking `markerType === "retryable_error"` from `parseSpecialMarkers()` — no SSE schema changes and no string matching on error text. This pattern was established in PR `#12445`, commit 64d82797b.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
📚 Learning: 2026-03-13T15:49:44.961Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12385
File: autogpt_platform/backend/backend/copilot/rate_limit.py:0-0
Timestamp: 2026-03-13T15:49:44.961Z
Learning: In `autogpt_platform/backend/backend/copilot/rate_limit.py`, the original per-session token window (with a TTL-based reset) was replaced with fixed daily and weekly windows. `resets_at` is now derived from `_daily_reset_time()` (midnight UTC) and `_weekly_reset_time()` (next Monday 00:00 UTC) — deterministic fixed-boundary calculations that require no Redis TTL introspection.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
📚 Learning: 2026-03-10T08:38:36.655Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12356
File: autogpt_platform/backend/backend/copilot/tools/run_block.py:349-370
Timestamp: 2026-03-10T08:38:36.655Z
Learning: In the AutoGPT CoPilot HITL (Human-In-The-Loop) flow (`autogpt_platform/backend/backend/copilot/tools/run_block.py`), the review card presented to users sets `editable: false`, meaning reviewers cannot modify the input payload. Therefore, credentials resolved before `is_block_exec_need_review()` remain valid and do not need to be recomputed after the review step — the original `input_data` is unchanged through the review lifecycle.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
📚 Learning: 2026-03-17T10:57:12.953Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12440
File: autogpt_platform/backend/backend/copilot/workflow_import/converter.py:0-0
Timestamp: 2026-03-17T10:57:12.953Z
Learning: In Significant-Gravitas/AutoGPT PR `#12440`, `autogpt_platform/backend/backend/copilot/workflow_import/converter.py` was fully rewritten (commit 732960e2d) to no longer make direct LLM/OpenAI API calls. The converter now builds a structured text prompt for AutoPilot/CoPilot instead. There is no `response.choices` access or any direct LLM client usage in this file. Do not flag `response.choices` access or LLM client initialization patterns as issues in this file.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
📚 Learning: 2026-02-26T17:02:22.448Z
Learnt from: Pwuts
Repo: Significant-Gravitas/AutoGPT PR: 12211
File: .pre-commit-config.yaml:160-179
Timestamp: 2026-02-26T17:02:22.448Z
Learning: Keep the pre-commit hook pattern broad for autogpt_platform/backend to ensure OpenAPI schema changes are captured. Do not narrow to backend/api/ alone, since the generated schema depends on Pydantic models across multiple directories (backend/data/, backend/blocks/, backend/copilot/, backend/integrations/, backend/util/). Narrowing could miss schema changes and cause frontend type desynchronization.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
📚 Learning: 2026-03-04T08:04:35.881Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12273
File: autogpt_platform/backend/backend/copilot/tools/workspace_files.py:216-220
Timestamp: 2026-03-04T08:04:35.881Z
Learning: In the AutoGPT Copilot backend, ensure that SVG images are not treated as vision image types by excluding 'image/svg+xml' from INLINEABLE_MIME_TYPES and MULTIMODAL_TYPES in tool_adapter.py; the Claude API supports PNG, JPEG, GIF, and WebP for vision. SVGs (XML text) should be handled via the text path instead, not the vision path.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
📚 Learning: 2026-03-05T15:42:08.207Z
Learnt from: ntindle
Repo: Significant-Gravitas/AutoGPT PR: 12297
File: .claude/skills/backend-check/SKILL.md:14-16
Timestamp: 2026-03-05T15:42:08.207Z
Learning: In Python files under autogpt_platform/backend (recursively), rely on poetry run format to perform formatting (Black + isort) and linting (ruff). Do not run poetry run lint as a separate step after poetry run format, since format already includes linting checks.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
📚 Learning: 2026-03-16T16:35:40.236Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12440
File: autogpt_platform/backend/backend/api/features/workflow_import.py:54-63
Timestamp: 2026-03-16T16:35:40.236Z
Learning: Avoid using the word 'competitor' in public-facing identifiers and text. Use neutral naming for API paths, model names, function names, and UI text. Examples: rename 'CompetitorFormat' to 'SourcePlatform', 'convert_competitor_workflow' to 'convert_workflow', '/competitor-workflow' to '/workflow'. Apply this guideline to files under autogpt_platform/backend and autogpt_platform/frontend.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
📚 Learning: 2026-03-25T06:59:27.324Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: autogpt_platform/backend/CLAUDE.md:0-0
Timestamp: 2026-03-25T06:59:27.324Z
Learning: Applies to autogpt_platform/backend/**/*_test.py : Mock at boundaries by mocking where the symbol is used, not where it is defined. After refactoring, update mock targets to match new module paths
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
📚 Learning: 2026-03-23T06:36:25.447Z
Learnt from: majdyz
Repo: Significant-Gravitas/AutoGPT PR: 12440
File: autogpt_platform/frontend/src/app/(platform)/library/components/LibraryImportWorkflowDialog/useLibraryImportWorkflowDialog.ts:0-0
Timestamp: 2026-03-23T06:36:25.447Z
Learning: In Significant-Gravitas/AutoGPT PR `#12440`, the `LibraryImportWorkflowDialog` (previously `LibraryImportCompetitorDialog`) and its associated generated API hook (`usePostV2ImportACompetitorWorkflowN8nMakeComZapier` / `usePostV2ImportAWorkflowFromAnotherToolN8nMakeComZapier`) were removed in a subsequent refactor. Workflow import from external platforms (n8n, Make.com, Zapier) now uses a server action `fetchWorkflowFromUrl` instead of direct API calls or generated orval hooks. Do not expect or flag missing generated hook usage for workflow import in `autogpt_platform/frontend/src/app/(platform)/library/components/LibraryImportWorkflowDialog/`.
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
📚 Learning: 2026-02-04T16:49:42.490Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2026-02-04T16:49:42.490Z
Learning: Applies to autogpt_platform/frontend/**/*.{tsx,ts} : Use generated API hooks from '@/app/api/__generated__/endpoints/' instead of deprecated 'BackendAPI' or 'src/lib/autogpt-server-api/*'
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
📚 Learning: 2026-03-25T06:59:27.324Z
Learnt from: CR
Repo: Significant-Gravitas/AutoGPT PR: 0
File: autogpt_platform/backend/CLAUDE.md:0-0
Timestamp: 2026-03-25T06:59:27.324Z
Learning: Applies to autogpt_platform/backend/backend/api/**/*.py : Use `Security()` instead of `Depends()` for authentication dependencies to ensure proper OpenAPI security specification
Applied to files:
autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py
🔇 Additional comments (7)
autogpt_platform/backend/backend/copilot/sdk/security_hooks.py (5)
131-152: Circuit breaker parameters and documentation are well-designed. The new parameters with sensible defaults (15 WebSearches, 100 total tool calls) directly address the runaway-turn incident. The docstring correctly documents the per-turn scope and explains that counters reset each stream invocation, which aligns with the accepted v1 design.
165-171: Counter implementation using mutable closure containers is appropriate. The `list[int]` pattern for closure-mutable state is idiomatic Python. The comments clearly document the per-turn scope and reference the incident this addresses. Based on learnings, this per-turn design is intentional for v1 — it caps runaway turns while deferring Redis-based session persistence.
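As a minimal illustration of that idiom (the factory and hook names here are invented, not the real hook API): a bare `int` in the enclosing scope would need a `nonlocal` declaration to rebind, whereas mutating a one-element list works from any nested hook:

```python
def create_hooks(max_calls: int = 3):
    call_count = [0]  # one-element list: mutable state shared across hook calls

    def on_tool_call(tool_name: str) -> bool:
        call_count[0] += 1  # in-place mutation, so no `nonlocal` rebinding needed
        return call_count[0] <= max_calls  # True = allow, False = deny

    return on_tool_call
```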
209-227: WebSearch circuit breaker logic is correct. The ordering (WebSearch check before total counter increment) ensures denied searches don't consume slots from the total budget—matching the commit description. The denial message appropriately instructs the LLM to synthesize from gathered information rather than retry.
229-244: Total tool call circuit breaker provides a hard cap. This serves as the ultimate safety net to bound cost and latency regardless of tool type. The 100-call default is reasonable for complex research tasks while preventing runaway turns.
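A hedged sketch of the ordering both comments describe: the WebSearch cap check early-returns a denial before the total counter is incremented, so denied searches never consume total budget. The factory name, hook signature, and denial dict shape are assumptions for illustration, not the actual `security_hooks.py` API:

```python
def make_pre_tool_hook(max_web_searches: int = 15, max_total_tool_calls: int = 100):
    # Per-turn, closure-local counters (intentionally reset each stream invocation).
    web_search_count = [0]
    total_tool_call_count = [0]

    def pre_tool_use(tool_name: str) -> dict:
        if tool_name == "WebSearch":
            web_search_count[0] += 1
            if web_search_count[0] > max_web_searches:
                # Early return: a denied search does not touch the total counter.
                return {
                    "allow": False,
                    "reason": (
                        f"WebSearch cap reached ({max_web_searches} per turn); "
                        "synthesize an answer from information already gathered."
                    ),
                }
        total_tool_call_count[0] += 1
        if total_tool_call_count[0] > max_total_tool_calls:
            return {"allow": False, "reason": "total tool call cap reached"}
        return {"allow": True, "reason": ""}

    return pre_tool_use, web_search_count, total_tool_call_count
```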
367-371: Logging format cleanup looks good. The single-line format with `%s` placeholders is appropriate for `info`-level logs.

autogpt_platform/backend/backend/copilot/sdk/security_hooks_test.py (2)
14-18: Import moved to module top-level — compliant with guidelines. This addresses the previous feedback about avoiding local imports.
323-377: WebSearch cap tests provide good coverage. The tests correctly verify:
- Exactly `max_web_searches` calls are allowed (3 in fixture)
- The next call is denied
- The denial message contains actionable guidance ("synthesize", "per turn")
Otto-AGPT left a comment
Reviewed the diff. Overall this is solid and addresses both issues from the d2f7cba3 incident. A few observations:
Circuit breakers (security_hooks.py) ✅
- Ordering is correct: denied WebSearch calls don't consume total tool budget (early return before total counter increment)
- Counter uses `list[int]` for closure mutability — works fine, consistent with how `task_tool_use_ids` is tracked in the same closure
- Cap values (15 searches, 100 total) seem reasonable. For reference, the incident session had 179 searches — this would have cut it to 15 and saved $15+ and ~70 minutes
One thing to verify: Where is create_security_hooks called relative to retries in _run_stream_attempt? If hooks are shared across retry attempts, the counters persist — so after a failed attempt that used 10 searches, the retry only gets 5 more. This is probably fine since retries only happen on context-too-long with events_yielded==0 (meaning counters are barely incremented), but worth being aware of.
Intermediate persistence (service.py) ✅
- `asyncio.shield` protects the flush from cancellation — good
- Rollback note for orphaned messages is accurate and the risk is minimal
- Minor observation: `_msgs_since_flush` increments for every SDK stream event (including text deltas), not just meaningful UI messages. During rapid text streaming, the 10-message threshold could trigger frequent flushes. In practice the 30s time interval is the effective throttle, so this isn't a real issue — but if you wanted to tighten it, you could count only tool outputs or assistant message completions
Prompt guidance (prompting.py) ✅
- Clear and actionable
Tests ✅
- Good coverage for both caps (under cap allowed, at cap denied)
- Tests correctly verify denial messages contain 'synthesize' and 'per turn'
Not addressed (separate follow-up):
- The GCP logs confirmed WebSearch calls are genuinely 29-54s each — moving to our own Brave Search MCP tool would be the real perf fix (20x improvement). This PR is the right short-term mitigation though.
- Redis TTL expiring before long turns complete (tracked in SECRT-2178)
…Search For research-heavy tasks (5+ searches), the prompt now directs the LLM to use run_block with PerplexityBlock (sonar-pro) instead of repeated WebSearch calls. Perplexity returns synthesized, cited answers in a single call — avoiding the 29-54s per-call latency of SDK WebSearch and reducing total search count significantly.
Perplexity guidance is managed in the Langfuse prompt (source of truth), not in the code supplement. Reverts the redundant addition from fc13c30. The Langfuse prompt has been updated with stronger, actionable guidance including block ID, model names, and clear trigger (3+ searches).
21bffb1 to b3f52ce
Root cause of empty session on reload: the session meta key in Redis has a 1h TTL set once at create_session time. Turns exceeding 1h (like session d2f7cba3 at 82min) cause the meta key to expire, making get_active_session return False. The resume endpoint then returns 204 and the frontend shows an empty session. Fix: publish_chunk now periodically refreshes the meta key TTL (every 60s) when session_id is provided. stream_and_publish already has session_id and passes it through. This keeps the meta key alive for as long as chunks are being published. GCP logs confirmed the bug: at 09:49 (73 min into the turn), GET_SESSION returned active_session=False, msg_count=1 — the meta key had expired 13 minutes earlier.
During long sub-agent runs, the SDK sends task_progress SystemMessages that were previously silent (no stream chunks produced). This meant publish_chunk was never called during those gaps, causing BOTH the meta key and stream key to expire in Redis. Fix: - response_adapter: emit StreamHeartbeat for task_progress events, so publish_chunk is called even during sub-agent gaps - stream_registry: refresh stream key TTL alongside meta key in the periodic keepalive block (every 60s) This ensures that as long as the SDK is producing any events (including task_progress), both Redis keys stay alive. Confirmed via live reproduction: session d2f7cba3 T13 ran for 1h45min+ with both keys expired because only task_progress events were arriving.
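The fix can be sketched as a small adapter rule: map otherwise-silent `task_progress` system messages to a heartbeat chunk so `publish_chunk` keeps firing. The classes and function below are simplified stand-ins, not the real adapter types:

```python
from dataclasses import dataclass

@dataclass
class StreamHeartbeat:
    """Minimal heartbeat chunk; publishing it refreshes the Redis key TTLs."""
    session_id: str

def adapt_system_message(session_id: str, subtype: str):
    """Return a chunk to publish for an SDK SystemMessage, or None if silent."""
    if subtype == "task_progress":
        # Previously produced no stream chunk; now yields a heartbeat so the
        # meta and stream keys stay alive during long sub-agent gaps.
        return StreamHeartbeat(session_id=session_id)
    return None
```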
- Disable the SDK built-in Task (sub-agent) tool by adding it to SDK_DISALLOWED_TOOLS. The AutoPilotBlock via run_block is the preferred delegation mechanism — it has full Langfuse observability, unlike the SDK Task tool which runs opaquely. The Task tool was the root cause of the d2f7cba3 incident: it spawned 5 sub-agents with no shared context, each independently hammering WebSearch with overlapping queries. - Bump WebSearch cap from 15 to 30 per turn — less restrictive while still preventing the worst-case runaway. - Update prompt to reflect Task tool is disabled, point to AutoPilotBlock for sub-agent delegation.
…, WebSearch denial budget - tool_adapter_test: verify Task, Bash, WebFetch are in SDK_DISALLOWED_TOOLS - response_adapter_test: verify task_progress emits StreamHeartbeat - security_hooks_test: verify denied WebSearches don't consume total tool budget
Task was added to SDK_DISALLOWED_TOOLS so all Task tests now expect denial. Removed concurrency slot tests since they're unreachable when the tool is blocked at the access level.
Remove session entries from the module-level _meta_ttl_refresh_at dict when mark_session_completed is called, preventing unbounded memory growth over the lifetime of the backend process.
Code Review -- PR #12604

Summary

This PR adds: (1) WebSearch + total tool call circuit breakers in security hooks, (2) intermediate session persistence (flush to DB every 30s/10 msgs), (3) StreamHeartbeat on task_progress events to keep Redis TTLs alive, (4) Task tool fully blocked via SDK_DISALLOWED_TOOLS + BLOCKED_TOOLS, and (5) web search best practice prompting guidance.

Findings
Verdict

Clean PR. All circuit breaker tests pass, ordering is correct, memory leak fixed. No bugs found.
Why

CoPilot session `d2f7cba3` took 82 minutes and cost $20.66 for a single user message. Root causes:

What
Circuit breakers
Redis TTL fixes (root cause of empty session on reload)

`publish_chunk()` now periodically refreshes both the session meta key AND stream key TTL (every 60s). `task_progress` SDK events now emit `StreamHeartbeat` chunks, ensuring `publish_chunk` is called even during long sub-agent gaps where no real chunks are produced. Without this, turns exceeding `stream_ttl` lose their "running" status and stream data, making `get_active_session()` return False.

Intermediate DB persistence

Flushes use `asyncio.shield(upsert_chat_session())`, matching the existing `finally` block pattern.

SDK Task tool disabled

Added `"Task"` to `SDK_DISALLOWED_TOOLS`. The AutoPilotBlock via `run_block` is the preferred sub-agent delegation mechanism — it has full Langfuse observability. The block in `security_hooks.py` is kept as defense-in-depth.
How

Files changed

- `security_hooks.py` — WebSearch + total tool call circuit breakers with configurable caps
- `stream_registry.py` — Redis meta + stream key TTL refresh in `publish_chunk()`, module-level keepalive tracker
- `response_adapter.py` — `task_progress` SystemMessage → StreamHeartbeat emission
- `service.py` — Intermediate DB persistence in `_run_stream_attempt` stream loop
- `tool_adapter.py` — Task added to `SDK_DISALLOWED_TOOLS`
- `prompting.py` — Web search best practices, Task tool disabled note
- `security_hooks_test.py` — Tests for WebSearch cap, total tool cap, task_progress heartbeat, Task disallowed

GCP log evidence
Test plan