diff --git a/CLAUDE.md b/CLAUDE.md index 411675e..10805ac 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -7,614 +7,257 @@ Python-based async orchestrator that polls Yandex Tracker for tasks tagged `ai-task` and dispatches Claude Agent SDK agents to execute them. Includes a real-time web dashboard for monitoring agent output and task status. ## Stack -- Python 3.12+ -- claude-agent-sdk (Claude Agent SDK for Python) -- FastAPI + uvicorn (web dashboard) -- requests (Tracker REST API) -- PyYAML (repos.yaml config) -- React 19 + Vite + TypeScript + TailwindCSS (frontend) -- xterm.js (agent terminal streaming) -- pytest + pytest-asyncio (testing) +- Python 3.12+, claude-agent-sdk, FastAPI + uvicorn, requests, PyYAML +- React 19 + Vite + TypeScript + TailwindCSS, xterm.js +- pytest + pytest-asyncio ## Architecture -- **Agent SDK** — in-process agent execution via `ClaudeSDKClient` +- **Agent SDK** — in-process execution via `ClaudeSDKClient` - **Orchestrator Agent** — `OrchestratorAgent` handles worker result decisions (track PR, complete, fail, epic child events) - **Agent-driven completion** — agent decides when task is done; PR tracking is informational - **In-process MCP tools** — Tracker tools scoped per-issue (no external MCP processes) - **Git worktrees** — per-task workspace isolation - **Error recovery** — typed error classification + exponential backoff retry - **Async concurrency** — `asyncio` with semaphore-controlled parallel agents -- **Epic coordination** — supervisor-driven epic management: auto-discovery of children (with auto-decomposition when no children found), then supervisor sets dependency graph and activation order via MCP tools (`awaiting_plan` → `executing`); supervisor oversees lifecycle events (preflight skips, epic completion) and can reset falsely-terminated children via `epic_reset_child` -- **Workpad** — persistent structured comment on Tracker issue; idempotent creation via hidden HTML marker (``); auto-discovery on session resume; agent updates progress after each milestone via `tracker_create_workpad` / `tracker_update_workpad` MCP tools -- **Tracker status reconciliation** — periodic check for externally closed/cancelled tasks; phase-aware cleanup (running → TASK_FAILED, PR-tracked → cancel only, needs-info → TASK_FAILED); `removed` flag on tracked objects prevents stale reference races; `record_pr_cancelled()` with `cancelled_at` column in `pr_lifecycle` -- **Multi-turn continuation** — retries agent up to `MAX_CONTINUATION_TURNS` (default 3) when it completes without PR and task is still open; `tracker_mark_complete` tool for explicit no-PR completion signal; `continuation_exhausted` flag triggers retry on 2nd+ attempt; cost cap guard (`MAX_CONTINUATION_COST`); Tracker status check detects external resolution during continuation -- **Merge conflict retry** — SHA-gated retry for merge conflicts (up to `MERGE_CONFLICT_MAX_RETRIES`, default 2); only re-prompts when agent pushes new commit but conflict persists; resets on resolution -- **Heartbeat monitor** — periodic health checks (every 5 min) detecting stuck agents, long-running tasks, and stale reviews; alerts supervisor with actionable diagnostics and cooldown-based deduplication; enriched with cost, tokens, output snippets, and Tracker status -- **PR auto-merge** — opt-in automatic merge when CI green + reviews approved + no conflicts; uses GitHub GraphQL `enablePullRequestAutoMerge` with REST fallback; supervisor can also merge manually via `github_merge_pr` tool -- **Pre-merge code review** — one-shot Sonnet sub-agent reviews PR diff against task requirements, project conventions, and OWASP security checklist before auto-merge; fail-close by default (rejects on error/timeout, configurable via `PRE_MERGE_REVIEW_FAIL_OPEN`); posts REQUEST_CHANGES on reject so worker agent self-corrects; sends rejection prompt directly to worker session; notifies supervisor on rejection; auto-resets on new commits (head SHA tracking) for fresh review cycles; on re-review approve, posts APPROVE to dismiss stale REQUEST_CHANGES so auto-merge can proceed; configurable timeout (`pre_merge_review_timeout_seconds`) -- **Human gate** — blocks auto-merge for PRs exceeding diff size threshold (`HUMAN_GATE_MAX_DIFF_LINES`, default 0 = disabled) or touching sensitive file paths (`HUMAN_GATE_SENSITIVE_PATHS`, comma-separated globs); posts PR comment explaining the block; publishes `HUMAN_GATE_TRIGGERED` event -- **Post-merge verification** — after PR merge, watches CI pipeline and K8s rollout, then spawns a one-shot verification sub-agent to test the deployed change on dev; on pass publishes `TASK_VERIFIED`; on fail auto-creates hotfix Bug task; fire-and-forget (doesn't block merge flow); configurable via `POST_MERGE_VERIFICATION_ENABLED` -- **Environment config** — SQLite-backed key-value store for per-environment connection details (API URLs, test credentials); supervisor writes via `env_set` MCP tool, worker agents read via `env_get`; used by verification sub-agent to connect to dev/staging +- **Epic coordination** — supervisor-driven: auto-discovery of children (with auto-decomposition), dependency graph via MCP tools (`awaiting_plan` → `executing`), lifecycle events, `epic_reset_child` +- **Workpad** — persistent structured comment on Tracker issue; idempotent via hidden HTML marker; agent updates progress via MCP tools +- **Tracker status reconciliation** — periodic check for externally closed/cancelled tasks; phase-aware cleanup; `removed` flag prevents stale reference races +- **Multi-turn continuation** — retries agent up to `MAX_CONTINUATION_TURNS` (3) when it completes without PR and task is still open; `tracker_mark_complete` for explicit no-PR completion; cost cap guard (`MAX_CONTINUATION_COST`) +- **Merge conflict retry** — SHA-gated retry (up to `MERGE_CONFLICT_MAX_RETRIES` = 2); resets on resolution +- **Heartbeat monitor** — periodic health checks (every 5 min) detecting stuck agents, long-running tasks, stale reviews; cooldown-based deduplication +- **PR auto-merge** — opt-in when CI green + reviews approved + no conflicts; GitHub GraphQL `enablePullRequestAutoMerge` with REST fallback +- **Pre-merge code review** — one-shot Sonnet sub-agent reviews PR diff; fail-close by default (`PRE_MERGE_REVIEW_FAIL_OPEN`); posts REQUEST_CHANGES on reject; auto-resets on new commits for fresh review cycles +- **Human gate** — blocks auto-merge for large diffs (`HUMAN_GATE_MAX_DIFF_LINES`) or sensitive paths (`HUMAN_GATE_SENSITIVE_PATHS`) +- **Post-merge verification** — watches CI + K8s rollout, spawns verification sub-agent on dev; on fail auto-creates hotfix Bug task; configurable via `POST_MERGE_VERIFICATION_ENABLED` +- **Environment config** — SQLite key-value store for per-environment connection details; supervisor writes via `env_set`, workers read via `env_get` - **Event bus** — async pub/sub for real-time streaming to web dashboard - **Web dashboard** — FastAPI REST + WebSocket, React frontend with xterm.js -- **Supervisor chat** — interactive + autonomous streaming chat with full authority (`bypassPermissions`); auto-sends epic planning requests via `auto_send()` -- **Supervisor memory** — SQLite + FTS5 hybrid search over markdown files (`data/memory/`); see "Supervisor Memory" section below -- **Inter-agent communication** — centralized mailbox with interrupt-based message delivery between concurrent worker agents; see "Inter-Agent Communication" section below -- **Task dependency management** — auto-defers tasks with unresolved dependencies detected via Tracker links (`depends on`, `is blocked by`) and LLM-based text extraction from descriptions (Haiku); rechecks on every poll; supervisor can override via MCP tools (`approve_task_dispatch`, `defer_task`); epic children excluded (use EpicCoordinator); fail-open on API/LLM errors -- **K8s diagnostics** — optional Kubernetes pod log/status inspection via in-cluster ServiceAccount auth; feature-gated via `K8S_LOGS_ENABLED` env var -- **Persistent stats** — SQLite-backed storage for task runs, errors, PR lifecycles via EventBus subscriber -- **Auto-compaction** — summarizes context via Haiku when approaching token limit, recreates session to continue work without losing progress -- **Session resumption** — captures `session_id` from Claude SDK, persists in SQLite; on restart, resumes PR monitoring and needs-info sessions with `resume` + `fork_session=True` for full conversation history restoration (graceful fallback to fresh session on failure); fresh fallback sessions receive context prompt with task description, recent comments, and inter-agent message history to enable effective continuation -- **Dead session recovery** — on-demand sessions that fail (crash, timeout, `success=False`) are automatically recreated with context preserved via session_id resume; triggers on next message delivery or interrupt attempt; publishes `SESSION_RECREATED` event for observability - -### Supervisor Memory - -Long-term memory for the supervisor agent, persisted across sessions. - -**Storage:** Markdown files in `data/memory/` are the source of truth. An SQLite index provides fast retrieval via hybrid search (BM25 keyword + cosine vector similarity). - -**How it works:** -1. Supervisor writes decisions, learnings, and context to markdown files via MCP tools (`memory_write`, `memory_read`, `memory_search`) -2. On startup, `SupervisorRunner` indexes all markdown files: splits into overlapping chunks (~400 tokens each), computes embeddings via Gemini API, stores in SQLite with FTS5 -3. On search, queries run both BM25 (keyword) and vector similarity, results merged with configurable weights (0.3 BM25 / 0.7 vector) -4. Index auto-refreshes when file content changes (content hash check) - -**Key modules:** -- `orchestrator/supervisor_memory.py` — `MemoryIndex` (indexing, search, CRUD), chunking, embedding, hybrid ranking -- `orchestrator/supervisor.py` — `SupervisorRunner` (initialization, index rebuild on startup) -- `orchestrator/supervisor_tools.py` — MCP tools (`memory_write`, `memory_read`, `memory_search`) exposed to supervisor agent - -### Inter-Agent Communication - -Enables concurrent worker agents to coordinate when working on related tasks (e.g., frontend + backend for the same feature). - -**How it works:** -1. Agent X calls `list_running_agents` to discover peers (includes `component`/`repo` metadata for informed targeting) -2. Agent X calls `send_message_to_agent("QR-Y", "What API endpoint are you creating?")` (non-blocking, delivery status returned) -3. Message is delivered to Agent Y via `session.interrupt_with_message()` — Y sees it immediately -4. Agent Y calls `reply_to_message(msg_id, "POST /api/v1/auth/login")` -5. Reply is delivered back to Agent X via interrupt -6. For blocking coordination, Agent X can use `send_request_to_agent("QR-Y", "...", timeout_seconds=60)` which waits for a reply - -**Message types:** `REQUEST` (expects reply), `RESPONSE` (reply to REQUEST), `NOTIFICATION` (informational), `ARTIFACT` (data transfer) - -**Delivery statuses:** `DELIVERED` (interrupt reached live session), `QUEUED` (interrupt failed, message in inbox), `OVERFLOW_DROPPED` (evicted due to MAX_INBOX_SIZE=50; only set if message was never delivered) - -**Architecture:** -- `AgentMailbox` — centralized singleton owned by `Orchestrator`, manages per-agent inboxes (deques with MAX_INBOX_SIZE=50), message lifecycle (pending→read→replied/expired), delivery tracking, stats, periodic cleanup -- `comm_tools.py` — 5 MCP tools (`list_running_agents`, `send_message_to_agent`, `send_request_to_agent`, `reply_to_message`, `check_messages`) scoped per-agent via closure -- Delivery uses existing `AgentSession.interrupt_with_message()` — no new transport -- Supervisor has read-only access via `view_agent_messages` and `get_comm_stats` tools -- **Mailbox lifecycle** — agents stay registered while their session is alive (including PR monitor, needs-info monitor, and on-demand session phases). Unregistration happens only at terminal paths: task completion, failure, PR merge/close, or graceful shutdown. This ensures agents are discoverable and messageable throughout their full lifecycle. -- **Periodic cleanup** — `_periodic_mailbox_cleanup()` in `main.py` removes terminal messages (REPLIED/EXPIRED) older than 1 hour every 30 minutes - -**Key modules:** -- `orchestrator/agent_mailbox.py` — `AgentMailbox`, `AgentMessage` (with `msg_type`, `delivery_status`, `created_at`), `AgentInfo` (with `component`, `repo`), `MessageType`, `DeliveryStatus`, `request_and_wait()`, `get_stats()`, `cleanup_terminal_messages()` -- `orchestrator/comm_tools.py` — `build_comm_server()` returns per-agent MCP server with 5 tools -- `orchestrator/main.py` — wires mailbox callbacks (`_list_agent_info`, `_interrupt_agent_for_comm`) - -## Structure -- `orchestrator/compaction.py` — context compaction: `should_compact()`, `summarize_output()`, `build_continuation_prompt()` -- `orchestrator/config.py` — env-based config with SDK fields -- `orchestrator/constants.py` — shared enums and type aliases (EventType, PRState, MAX_COMPACTION_CYCLES, MAX_CONTINUATION_TURNS) -- `orchestrator/tracker_client.py` — Yandex Tracker REST client -- `orchestrator/tracker_enums.py` — enums and helpers for Tracker status keyword matching -- `orchestrator/tracker_types.py` — TypedDict type definitions for Tracker API responses -- `orchestrator/tracker_tools.py` — in-process MCP @tool wrappers for worker agents -- `orchestrator/heartbeat.py` — periodic agent health monitoring (HeartbeatMonitor, AgentHealthReport, HeartbeatResult), stuck/long-running/stale detection, supervisor alerting with cooldown -- `orchestrator/github_client.py` — GitHub GraphQL client for PR review thread and CI monitoring, auto-merge (enablePullRequestAutoMerge), merge readiness checks (MergeReadiness) -- `orchestrator/orchestrator_agent.py` — Opus-level agent that decides on worker results (track PR, complete, fail, epic child events) -- `orchestrator/orchestrator_tools.py` — MCP tools for orchestrator agent (track_pr, retry_task, escalate, fail_task, complete_task, create_follow_up, get_task_history, get_recent_events) -- `orchestrator/supervisor_tools.py` — in-process MCP @tool wrappers + `build_supervisor_allowed_tools()` for supervisor chat (Tracker, GitHub, stats, memory, epic management, Bash, Write, Edit) -- `orchestrator/supervisor_chat.py` — interactive + autonomous streaming chat with supervisor (SupervisorChatManager, `auto_send()`, bypassPermissions) -- `orchestrator/supervisor_memory.py` — supervisor memory system (SQLite + FTS5 hybrid search, markdown files, chunking, embeddings) -- `orchestrator/supervisor.py` — supervisor memory system initialization (SupervisorRunner, memory_index, embedder) -- `orchestrator/dependency_manager.py` — task dependency deferral: `DependencyManager` (check/recheck/approve/defer), `DeferredTask` dataclass, Tracker link parsing, LLM-based text blocker extraction via Haiku (`extract_blocker_keys_from_text`) -- `orchestrator/supervisor_prompt_builder.py` — `build_supervisor_system_prompt` (re-export) + `build_epic_plan_prompt`, `build_preflight_skip_prompt`, `build_epic_completion_prompt`, `build_task_deferred_prompt`, `build_task_unblocked_prompt`, `build_heartbeat_prompt`, `build_epic_decompose_prompt`, `build_escalation_prompt` for autonomous supervisor notifications -- `orchestrator/alertmanager_webhook.py` — Alertmanager webhook dataclasses, parsing, prompt formatting, and auto-task creation helpers (AlertmanagerAlert, AlertmanagerPayload, parse_payload, format_alert_prompt, build_issue_summary, build_issue_description, map_component) -- `orchestrator/adr.py` — Architecture Decision Records: `create_adr`, `list_adrs`, `read_adr`, `slugify` -- `orchestrator/escalation.py` — supervisor uncertainty escalation: `escalate_to_human`, `build_escalation_comment` -- `orchestrator/repo_resolver.py` — git repo cloning/pulling -- `orchestrator/workspace.py` — git worktree management -- `orchestrator/workspace_tools.py` — MCP workspace tools (lazy worktree creation) -- `orchestrator/recovery.py` — error classification + retry logic + state persistence -- `orchestrator/prompt_builder.py` — task prompt construction (`build_task_prompt`, `build_review_prompt`, `build_needs_info_response_prompt`, `build_fallback_context_prompt` for fresh on-demand sessions, `build_merge_conflict_prompt`, `build_pipeline_failure_prompt`, `build_pre_merge_rejection_prompt`) -- `orchestrator/agent_mailbox.py` — inter-agent communication mailbox (message routing, inbox management, interrupt-based delivery) -- `orchestrator/comm_tools.py` — MCP tools for worker agent communication (list peers, send/reply/check messages) -- `orchestrator/k8s_client.py` — Kubernetes client for pod logs/status inspection (in-cluster ServiceAccount auth, graceful degradation) -- `orchestrator/agent_runner.py` — SDK client wrapper -- `orchestrator/event_bus.py` — async pub/sub event bus -- `orchestrator/stats_collector.py` — EventBus subscriber that persists statistics to storage -- `orchestrator/stats_models.py` — data models for persistent statistics (TaskRun, SupervisorRun, etc.) -- `orchestrator/metrics.py` — Prometheus metrics registry with text format serialization (Counter, Gauge, Histogram) for VictoriaMetrics export via /metrics endpoint -- `orchestrator/storage.py` — abstract Storage Protocol interface for persistence backends -- `orchestrator/sqlite_storage.py` — SQLite-backed storage implementation -- `orchestrator/_persistence.py` — mixin for background asyncio task-based persistence -- `orchestrator/task_dispatcher.py` — Tracker polling + agent dispatch -- `orchestrator/epic_coordinator.py` — epic child discovery (`discover_children` → `awaiting_plan` or `needs_decomposition`), supervisor-driven activation (`activate_child`, `set_child_dependencies`), dependency-aware child sequencing, `register_child`, `rediscover_children` -- `orchestrator/post_merge_verifier.py` — post-merge verification: CI watch, K8s rollout watch, one-shot verification sub-agent, hotfix task creation on failure (PostMergeVerifier, VerificationResult, VerificationIssue) -- `orchestrator/pr_monitor.py` — PR review/CI monitoring + merge conflict detection + auto-merge processing + pre-merge review gate + human gate for large/sensitive PRs + post-merge verification trigger -- `orchestrator/pre_merge_reviewer.py` — one-shot sub-agent for semantic code review before auto-merge (ReviewVerdict, ReviewIssue, context assembly, OWASP security checklist, configurable fail-open/fail-close) -- `orchestrator/needs_info_monitor.py` — needs-info status monitoring -- `orchestrator/proposal_manager.py` — improvement proposal lifecycle -- `orchestrator/web.py` — FastAPI REST + WebSocket server -- `orchestrator/main.py` — async orchestrator loop + web server + epic event watcher (auto-triggers supervisor for epic planning, decomposition, heartbeat) -- `prompts/workflow.md` — workflow instructions for worker agents -- `prompts/supervisor_workflow.md` — workflow instructions for supervisor agent -- `prompts/plan_agent.md` — planning sub-agent prompt (read-only codebase exploration) -- `prompts/critic_agent.md` — critic sub-agent prompt; reviews the plan for wrong abstractions, missing edge cases, side effects on related systems, and convention violations; iterates with plan_agent until approved (max 3 rounds) -- `prompts/simplify_agent.md` — simplify sub-agent prompt; 3 parallel instances review changes for code reuse, code quality, and efficiency before commit; directly fixes issues found -- `prompts/code_quality_gate.md` — code quality gate sub-agent prompt; read-only compliance check against CLAUDE.md Code Quality Rules; returns approve/revise verdict; worker iterates until approved (max 3 rounds) -- `frontend/` — React + Vite + TypeScript dashboard -- `tests/` — pytest tests -- `Taskfile.yml` — quality check tasks (Docker-based, synced with CI) + security tasks (Semgrep, Gitleaks) -- `.semgrep.yml` — custom Semgrep rules for OWASP security scanning (Python + TypeScript) -- `docs/decisions/` — Architecture Decision Records (ADRs) -- `ci/python.Dockerfile` — Python CI image (ruff, mypy, pytest, pip-audit) -- `ci/frontend.Dockerfile` — Frontend CI image (Node 22, npm deps) +- **Supervisor chat** — interactive + autonomous streaming with `bypassPermissions`; `auto_send()` for epic planning +- **Supervisor memory** — SQLite + FTS5 hybrid search (BM25 0.3 / vector 0.7) over markdown files in `data/memory/`; Gemini embeddings; auto-refresh on content change +- **Inter-agent communication** — centralized `AgentMailbox` with interrupt-based message delivery; message types: REQUEST, RESPONSE, NOTIFICATION, ARTIFACT; delivery statuses: DELIVERED, QUEUED, OVERFLOW_DROPPED (MAX_INBOX_SIZE=50); 5 MCP tools per agent; supervisor has read-only access +- **Task dependency management** — auto-defers tasks with unresolved deps (Tracker links + LLM text extraction via Haiku); supervisor can override; fail-open on errors +- **K8s diagnostics** — optional pod log/status inspection; feature-gated via `K8S_LOGS_ENABLED` +- **Persistent stats** — SQLite-backed via EventBus subscriber +- **Auto-compaction** — Haiku summarization when approaching token limit, session recreation +- **Session resumption** — persists `session_id` in SQLite; resumes with `fork_session=True`; fresh fallback with context prompt on failure +- **Dead session recovery** — auto-recreates failed sessions with context preserved; publishes `SESSION_RECREATED` event ## Python Style Guide (based on Google Python Style Guide) -All Python code in this project follows the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html) with project-specific additions below. +Follows [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html) with additions below. ### Imports +**Ordering** (blank-line separated, lexicographic within group): +1. `from __future__` 2. stdlib 3. third-party 4. local -**Ordering** (separated by blank lines, each group sorted lexicographically by full package path): -1. `from __future__` imports -2. Standard library (`import os`, `import asyncio`) -3. Third-party packages (`import fastapi`, `import yaml`) -4. Local project imports (`from orchestrator.config import Settings`) - -**Rules:** -- `import x` for packages and modules only — never for individual classes or functions -- `from x import y` where `x` is the package prefix and `y` is the module name -- `from x import y as z` when names conflict or are too long -- Never use relative imports (`from . import foo`) — always use the full package path -- Each import on its own line (exception: `typing` and `collections.abc` imports may be combined) -- Symbols from `typing`, `collections.abc`, `typing_extensions` may be imported directly +- `import x` for packages/modules only — `from x import y` for submodules +- Never use relative imports — always full package path +- `typing`, `collections.abc`, `typing_extensions` symbols may be imported directly ### Naming - | Type | Convention | Example | |------|-----------|---------| -| Packages/Modules | `lower_with_under` | `agent_runner.py` | -| Classes | `CapWords` | `TaskDispatcher` | -| Exceptions | `CapWords` + `Error` | `TrackerApiError` | -| Functions/Methods | `lower_with_under` | `dispatch_task()` | -| Constants (module-level) | `ALL_CAPS` | `MAX_COMPACTION_CYCLES` | -| Instance variables | `lower_with_under` | `self.issue_key` | -| Protected/internal | `_single_leading_underscore` | `_parse_response()` | -| Type variables | `CapWords` | `T`, `EventT` | - -- Never use `__double_leading_underscore` for class members — prefer single `_` -- Never encode type in variable name (`id_to_name_dict` → `id_to_name`) -- Test methods: `test__` pattern +| Modules | `lower_with_under` | `agent_runner.py` | +| Classes/Exceptions | `CapWords` (`Error` suffix) | `TrackerApiError` | +| Functions/Methods/Vars | `lower_with_under` | `dispatch_task()` | +| Constants | `ALL_CAPS` | `MAX_COMPACTION_CYCLES` | +| Protected | `_single_underscore` | `_parse_response()` | -### Type Annotations +- Never `__double_underscore` for class members +- Test methods: `test__` -- Required on all public functions and methods -- Update annotations when modifying code -- Do not annotate `self` or `cls` unless necessary -- Use `is None` / `is not None` — never `== None` -- Define type aliases for complex types (shared aliases go in `constants.py`) -- Use `# type: ignore` sparingly with an explanatory comment -- For conditional imports (circular deps): - ```python - from __future__ import annotations - import typing - if typing.TYPE_CHECKING: - from orchestrator.expensive_module import HeavyClass - ``` +### Type Annotations +- Required on all public functions/methods +- Shared type aliases in `constants.py` +- Conditional imports: `if typing.TYPE_CHECKING:` with `from __future__ import annotations` ### Docstrings +- `"""` always. Required for modules, public API, classes, nontrivial private functions +- Sections in order: summary, description, Args, Returns, Raises +- Class docstring: what an instance represents + `Attributes:` section -- Triple double-quotes `"""` always -- Required for: all modules, all public functions/methods, all classes, nontrivial private functions -- Style: pick descriptive (`"""Fetches rows."""`) or imperative (`"""Fetch rows."""`) — be consistent within a file - -**Function/method docstring sections** (in order): +### Exception Handling +- **Never bare `except:`** or catch `Exception` unless re-raising or at isolation point +- **Never `assert` for runtime logic** — only in tests +- Every module defines structured error types with context fields (status codes, entity IDs) ```python -def dispatch_task(issue_key: str, priority: int = 0) -> TaskRun: - """Dispatches a task to the next available agent. - - Validates the issue, creates a worktree, and starts an agent session. - - Args: - issue_key: Yandex Tracker issue key (e.g., 'QR-123'). - priority: Dispatch priority. Higher values run first. - - Returns: - A TaskRun instance with the agent session result. - - Raises: - TrackerApiError: If the Tracker API is unreachable. - WorkspaceError: If worktree creation fails. - """ +class TrackerApiError(OrchestratorError): + def __init__(self, status_code: int, message: str, issue_key: str | None = None): + self.status_code = status_code + self.issue_key = issue_key + super().__init__(f"Tracker API error {status_code}: {message}") ``` -**Class docstring:** describe what an instance represents + `Attributes:` section for public attributes. - -### Comments - -- Explain *why*, not *what* — assume the reader knows Python -- Inline comments: at least 2 spaces from code, `# ` followed by text -- TODO format: `# TODO: link/context - Description.` — never reference individuals - -### Exception Handling - -- Use built-in exceptions when appropriate (`ValueError`, `TypeError`, etc.) -- Custom exceptions: inherit from existing exception class, name ends with `Error` -- **Never use bare `except:`** or catch `Exception` unless re-raising or at an isolation point -- Minimize `try` block size -- Use `finally` for cleanup -- **Never use `assert` for runtime logic** — asserts can be stripped with `-O`. Only use in tests - -### Line Length and Formatting - -- Maximum **80 characters** per line -- Exceptions: long imports, URLs in comments, long string constants -- **No backslash continuation** — use implicit joining inside `()`, `[]`, `{}` -- **4 spaces** per indent level, never tabs -- **Two blank lines** between top-level definitions -- **One blank line** between methods within a class -- **Trailing commas** when closing token is on a separate line -- **No semicolons** — never terminate lines with `;`, never two statements on one line - -### True/False Evaluations - -- Use implicit false: `if not items:` instead of `if len(items) == 0:` -- None checks: always `if x is None:` / `if x is not None:` -- Never compare booleans: `if flag:` not `if flag == True:` - -### Default Arguments - -- **Never use mutable defaults** (lists, dicts, sets): - ```python - # BAD - def foo(items: list[str] = []) -> None: ... - - # GOOD - def foo(items: list[str] | None = None) -> None: - if items is None: - items = [] - ``` +### Formatting +- **80 chars** max (except long imports, URLs, string constants) +- 4 spaces indent, no tabs, no backslash continuation, no semicolons +- Trailing commas when closing token is on separate line ### String Formatting - -- Prefer f-strings for general formatting -- **Logging: always use `%`-style**, never f-strings: - ```python - # GOOD - logger.info("Processing %d items for %s", count, user_id) - - # BAD - logger.info(f"Processing {count} items for {user_id}") - ``` -- Never use `+`/`+=` for string accumulation in loops — use `"".join()` - -### Functions and Methods - -- Prefer small, focused functions (~40 lines soft limit) -- **Never use `staticmethod`** — use a module-level function instead -- Use `classmethod` only for named constructors or class-specific routines - -### Properties - -- Use `@property` for trivial computed access that feels like attribute access -- Must be cheap and unsurprising — no I/O, no expensive computation - -### Comprehensions - -- Use for simple cases only -- No multiple `for` clauses or complex filter expressions — use a regular loop instead - -### Lambda - -- Only for one-liners (< 60-80 chars). Otherwise use a named `def` -- Prefer `operator` module over lambdas for common operations - -### Global State - -- Avoid mutable global state -- Module-level constants are fine: `ALL_CAPS` naming, `_` prefix if internal - -### Files and Resources - -- Always explicitly close files/sockets/connections -- Use `with` statements (or `contextlib.closing()`) - -### Threading and Async - -- Don't rely on atomicity of built-in types -- Use `asyncio` primitives for concurrency (locks, queues, semaphores) -- Use `queue.Queue` for inter-thread communication - -### General Conventions - -- Tests use pytest with mocking (`unittest.mock`) -- Async tests use pytest-asyncio (auto mode) -- Config from environment variables with sensible defaults -- `if __name__ == "__main__": main()` guard for executable modules +- f-strings for general use +- **Logging: always `%`-style** — `logger.info("Processing %d items for %s", count, uid)` +- Never `+`/`+=` for string accumulation — use `"".join()` + +### Structured Logging +- Task-scoped: include `task_key`. PR-scoped: `task_key` + `pr_url`. Session-scoped: `session_id` +- `DEBUG` = routine, `INFO` = business events, `WARNING` = recoverable anomalies, `ERROR` = unrecoverable failures + +### Size Limits +- **Module: ~500 lines.** Split into sub-modules if exceeded +- **Class: ~15 instance vars.** Split into collaborators via constructor injection +- **Function: ~40 lines.** Prefer small, focused functions +- No nested function factories with 5+ inner functions — convert to class +- One responsibility per module + +### Concurrency Patterns +- **Lock cleanup on error** — `self._locks.pop(key, None)` in `except` block +- **Atomic multi-record persistence** — wrap related writes in single transaction +- **Side effects after commit** — persist first, then publish events/notifications + +### HTTP Resilience +- Timeouts on every request: `(connect, read)` tuple +- Retry with exponential backoff + jitter on 429, 500, 502, 503, 504 +- Never retry non-idempotent operations without idempotency keys + +### Input Validation +All FastAPI endpoints: validate with `Query`, `Path`, `Body` + type constraints. + +### Other Conventions +- Never use mutable default arguments — use `None` + `if X is None: X = []` +- Use implicit false: `if not items:` not `if len(items) == 0:` +- Never `staticmethod` — use module-level function +- Always use `with` for files/sockets/connections +- `if __name__ == "__main__": main()` guard ## Code Quality Rules -These rules are derived from recurring PR review issues. Follow them strictly. +Derived from recurring PR review issues. Follow strictly. ### No cross-module private imports -Never import `_private` names from other modules. If you need a function/constant from another module, make it public or create a shared module. -```python -# BAD: fragile coupling to internal details -from orchestrator.needs_info_monitor import _NEEDS_INFO_STATUSES +Never import `_private` names from other modules. Make it public or create a shared module. -# GOOD: import public API -from orchestrator.needs_info_monitor import is_needs_info_status -``` - -### No function duplication across modules -If two modules need the same utility, define it once and import it. Never copy-paste a function. -```python -# BAD: copy-paste of build_system_prompt_append -def build_supervisor_system_prompt(path): ... # same logic - -# GOOD: re-export or direct import -from orchestrator.prompt_builder import build_system_prompt_append as build_supervisor_system_prompt -``` +### No function duplication +If two modules need the same utility, define once and import. ### Types belong in the module that owns the concept -Shared type aliases (Literal types, type unions) used by multiple modules go in `constants.py`. Feature-specific types stay in their module. Core modules must NEVER import from feature-specific modules. -```python -# BAD: core module imports from feature module -# pr_monitor.py -from orchestrator.epic_coordinator import ChildStatus - -# GOOD: shared type lives in constants.py -from orchestrator.constants import EventType -``` +Shared types → `constants.py`. Core modules must NEVER import from feature modules. ### Propagate data through all paths -When adding a field to a dataclass (e.g. `issue_summary`), trace ALL paths where the dataclass is created or passed via callbacks. Update every callback type signature, every creation site, and every consumer. Common miss: the needs-info → PR monitor path vs the direct dispatch → PR monitor path. +When adding a dataclass field, trace ALL creation sites, callback signatures, and consumers. ### Consistent locking in paired operations -If `approve()` uses a lock, `reject()` must also use a lock. Symmetric operations require symmetric concurrency protection. +If `approve()` uses a lock, `reject()` must also use a lock. ### Wall-clock vs monotonic time -- **`time.time()`** — for external-facing timestamps (API responses, event history, user display) -- **`time.monotonic()`** — for internal cooldowns, rate limiting, elapsed time checks - -Never mix them. If a property is exposed via API or stored for later display, use wall-clock. +- `time.time()` — external-facing timestamps (API, events, display) +- `time.monotonic()` — internal cooldowns, rate limiting, elapsed checks ### Comments must match behavior -When commenting queue/eviction semantics ("drops oldest", "evicts newest"), verify the actual behavior matches. `asyncio.Queue` is FIFO — `get_nowait()` removes the oldest, `put_nowait()` on a full queue raises `QueueFull` (drops newest). - -### Consistent overflow strategies -Use the same overflow handling pattern within a subsystem. If supervisor trigger queue evicts oldest with a warning, document why other queues (event bus) silently drop instead. +Verify queue/eviction semantics comments match actual code behavior. ### Never discard async results silently -Every `await session.send()` returns `AgentResult`. Never discard the result without handling. If the result is not needed — document why explicitly. When draining/resuming — merge results (accumulate costs, prefer latest pr_url/needs_info). -```python -# BAD: result silently discarded -await session.send(msg) +Every `await session.send()` result must be handled. When draining — merge results: +- **Sticky flags** (`success`) — `base.X or update.X` +- **Latest-wins** (`output`, `pr_url`) — `update.X or base.X` +- **Accumulators** (`cost_usd`) — sum -# GOOD: merge into base result -drain_result = await session.send(msg) -result = _merge_results(result, drain_result) -``` +### Event lifecycle contract +Every TASK_STARTED must end with exactly one of: TASK_COMPLETED, TASK_FAILED, or PR_TRACKED. -### Sticky flags in merge/reduce functions -When merging dataclass instances (e.g. `_merge_results`), classify each field's merge semantics: -- **Sticky flags** (`success`) — once `True`, must stay `True`. Use `base.X or update.X`. -- **Latest-wins** (`output`, `pr_url`, `needs_info`) — prefer update, fallback to base. Use `update.X or base.X`. -- **Accumulators** (`cost_usd`, `duration_seconds`, `proposals`) — sum or concatenate. +| Outcome | Events | Records | +|---------|--------|---------| +| Success with PR | STARTED → PR_TRACKED | task_run + pr_lifecycle | +| Success without PR | STARTED → COMPLETED | task_run | +| Failure | STARTED → FAILED | task_run + error_log | +| PR merged + verified | PR_TRACKED → MERGED → VERIFIED | task_run + pr_lifecycle (verified_at) | -Never apply "latest-wins" to a sticky flag — a failed drain must not downgrade a successful base result. +**QR-247:** TASK_COMPLETED from `complete_task_impl` includes `has_pr: False`. Suspicious output → `no_pr_warning: True`. -### Event lifecycle contract -Every task dispatch path must produce exactly one terminal event. Invariant: **every TASK_STARTED must end with exactly one of**: TASK_COMPLETED, TASK_FAILED, or PR_TRACKED. - -| Outcome | Required events | Records | -|---------|----------------|---------| -| Success with PR | TASK_STARTED → PR_TRACKED | 1 task_run + 1 pr_lifecycle | -| Success without PR | TASK_STARTED → TASK_COMPLETED | 1 task_run | -| Failure | TASK_STARTED → TASK_FAILED | 1 task_run + 1 error_log | -| Needs-info → PR | TASK_STARTED → NEEDS_INFO → PR_TRACKED | 1 task_run + 1 pr_lifecycle | -| PR merged + verified | PR_TRACKED → PR_MERGED → TASK_VERIFIED | 1 task_run + 1 pr_lifecycle (verified_at set) | -| PR merged + verification failed | PR_TRACKED → PR_MERGED → VERIFICATION_FAILED | 1 task_run + 1 pr_lifecycle + 1 error_log | - -**Note (QR-247):** TASK_COMPLETED events from `complete_task_impl` always include `has_pr: False`. If the agent's output appears suspicious (too short or contains waiting/blocked patterns), the event also includes `no_pr_warning: True` to flag potential false completions. - -### Test async consumers with queue.join() -Don't compete with `_run()` for queue items. Instead of `get_nowait()` in tests, use `queue.join()` + `task_done()` to wait for processing completion. - -### Test hygiene: parametrize and deduplicate -Periodically review tests for bloat. Apply these rules: - -- **Table-driven tests** — when multiple test methods differ only in inputs/expected values, collapse them into one `@pytest.mark.parametrize`. This is the Python equivalent of Go's table-driven tests: - ```python - @pytest.mark.parametrize( - ("input_val", "expected"), - [ - ("rate limit exceeded", ErrorCategory.RATE_LIMIT), - ("401 Unauthorized", ErrorCategory.AUTH), - ("some unknown error", ErrorCategory.PERMANENT), - ], - ) - def test_classify(self, input_val, expected) -> None: - assert classify_error(input_val) == expected - ``` -- **Remove redundant assertions** — if `test_initial_state` checks `backoff == 0.0`, don't repeat that assertion at the start of `test_backoff_increases`. -- **Merge overlapping tests** — two tests that both verify "non-retryable category → `should_retry is False`" for different categories should be one parametrized test. -- **Hoist repeated imports** — `import time` or `from module import X` scattered inside test methods should be at module level. -- **Delete useless tests** — trivial constant-membership checks (`X in SET`) can be folded into behavioral tests that already exercise that logic. +### Event data contracts +Document required `event.data` fields at `EventType` definition. Verify ALL publishers provide new fields. ### Fail-open vs fail-closed must be explicit -When changing error handling from fail-closed (reject/block on error) to fail-open (allow/proceed on error) or vice versa — add a comment at the point of change explaining the reasoning. This is a safety-critical decision that must be documented in code, not just in the PR description. -```python -# BAD: silent inversion of safety behavior -except Exception: - logger.warning("Error checking PR for %s", task_key) - # (previously returned error to caller — now silently continues) - -# GOOD: explicit reasoning -except Exception: - # Fail-open: supervisor explicitly decided to skip this task, - # so infrastructure errors should not block their intent. - # The old fail-closed behavior caused QR-266 where stale cache - # blocked legitimate skips for 5+ retries. - logger.warning("Error checking PR for %s - allowing skip (fail-open)", task_key) -``` +Always add a comment explaining the reasoning when changing between fail-open and fail-closed behavior. -### Event data contracts -When an event handler requires specific fields in `event.data`, document the required fields next to the `EventType` definition or in the handler's docstring. When adding a new required field to `event.data`, search for ALL publishers of that event type and verify they provide it. -```python -# BAD: handler silently requires pr_url but contract is undocumented -async def _on_pr_merged(self, event: Event) -> None: - pr_url = event.data.get("pr_url") # where is this documented? +### Self-review before commit +Re-read full diff: no duplicated code, updated helpers/fixtures, parametrized similar tests, no dead mocks. -# GOOD: contract documented at the event type -class EventType(str, Enum): - PR_MERGED = "pr_merged" # data: {"pr_url": str} -``` +### Testing Rules -### Self-review before commit -Before creating a commit, re-read the full diff and verify: -- No duplicated code blocks (including SQL queries, mock setup, test boilerplate) -- All helper functions and fixtures updated to match new logic (no dead parameters) -- Tests that differ only in inputs collapsed via `@pytest.mark.parametrize` -- No mocks/parameters that no longer affect the code path under test +**Mock at boundaries only.** External boundaries: HTTP clients, Claude SDK, filesystem, clock. Use real instances/fakes for EventBus, Storage, etc. -### Lifecycle tests for stateful multi-cycle flows -Unit tests that verify individual steps (reject, approve, merge) catch isolated bugs but miss state-transition failures. Any feature with a multi-step lifecycle (reject → fix → re-review → merge, dispatch → CI fail → fix → CI pass → merge, etc.) MUST have at least one end-to-end scenario test that walks through the full cycle. +**Never test private methods.** Test the public method that calls them. -Common pattern: a flag set in step N blocks step M, but no test ever runs steps N and M sequentially. +**Never assert mock call sequences** unless testing the external boundary itself. Assert observable behavior instead. -**Rule:** For every stateful flow, write a scenario test that exercises the full lifecycle — not just each step in isolation. -```python -# BAD: only tests reject and approve separately -def test_reject_posts_request_changes(self): ... -def test_approve_enables_auto_merge(self): ... - -# GOOD: tests the full reject → fix → approve → merge lifecycle -async def test_reject_then_fix_then_approve_merges(self): - # 1. Review rejects — posts REQUEST_CHANGES - # 2. Worker fixes — new commit resets flags - # 3. Re-review approves — dismisses stale review - # 4. Auto-merge succeeds -``` +**Mock specs mandatory.** Always `MagicMock(spec=ClassName)`. -## Development Process (TDD) +**Fixture factories in conftest.py.** 3+ lines of setup → use a factory fixture. -Follow test-driven development: +**Table-driven tests.** Multiple tests differing only in inputs → `@pytest.mark.parametrize`. -1. **Write a failing test** — define the expected behavior before writing implementation -2. **Make it pass** — write the minimum code to pass the test -3. **Refactor** — clean up while keeping tests green +**Lifecycle tests.** Every stateful multi-step flow must have an end-to-end scenario test. -This applies to new features, bug fixes, and refactors alike. +**Time-dependent tests.** Use `freezegun`, never `asyncio.sleep(0)`. -## Documentation +**Test async consumers.** Use `queue.join()` + `task_done()`, not `get_nowait()`. -After any significant changes (logic, architecture, new modules, changed interfaces) — update the README and relevant sections of this CLAUDE.md (Structure, Architecture, etc.) to keep documentation in sync with the codebase. +## Development Process (TDD) -## Bug Fix Process (Review Comments) +1. **Write a failing test** — define expected behavior first +2. **Make it pass** — minimum code +3. **Refactor** — clean up, keep tests green -When handling review comments (from Cursor Bugbot, human reviewers, etc.): +## Documentation -1. **Create a test case first** — write a test that reproduces the reported bug -2. **Run the test** — if the test fails, the bug is confirmed -3. **Fix the bug** — make the test pass -4. **If the test passes without changes** — the bug report is incorrect; delete the test +After significant changes — update README and CLAUDE.md to keep docs in sync. -This applies to ALL review comments — any comment may be wrong. Never fix blindly. +## Bug Fix Process (Review Comments) -5. **Search for similar issues across the project** — every bug or review comment is a signal that the same pattern may exist elsewhere. After fixing the reported instance, search the entire codebase for analogous problems and fix them all in the same commit. +1. **Write a test** that reproduces the reported bug +2. **Run it** — if it fails, bug confirmed +3. **Fix** — make the test pass +4. **If test passes without changes** — bug report is incorrect; delete the test +5. **Search for similar issues** across the entire codebase; fix all in same commit ## Quality Checks (Taskfile + Docker) -All code quality checks run in disposable Docker containers via [Task](https://taskfile.dev). This ensures reproducibility regardless of the local environment. The same checks run in GitHub Actions CI. +All checks run in Docker containers via [Task](https://taskfile.dev). Same checks in CI. -### Prerequisites -- Docker -- [Task](https://taskfile.dev) (`brew install go-task` on macOS) +**Prerequisites:** Docker + [Task](https://taskfile.dev) (`brew install go-task`) -### Commands ```bash -# Run ALL quality checks (Python + Frontend in parallel) -task quality - -# Python only -task python:quality # lint -> format:check -> typecheck -> test - -# Frontend only -task frontend:quality # typecheck -> lint -> test -> build - -# Individual checks -task lint # ruff check -task format:check # ruff format --check -task typecheck # mypy orchestrator/ -task test # pytest with 75% coverage threshold -task audit # pip audit (non-blocking) -task frontend:typecheck -task frontend:lint -task frontend:test -task frontend:build - -# Dev convenience (modifies files on host via volume mount) -task format # ruff format (auto-fix) -task lint:fix # ruff check --fix (auto-fix) +task quality # ALL checks (Python + Frontend parallel) +task python:quality # lint -> format:check -> typecheck -> test +task frontend:quality # typecheck -> lint -> test -> build +task lint # ruff check +task format:check # ruff format --check +task typecheck # mypy orchestrator/ +task test # pytest with 75% coverage +task audit # pip-audit (non-blocking) +task format # ruff format (auto-fix) +task lint:fix # ruff check --fix (auto-fix) ``` -### How it works -- `ci/python.Dockerfile` -- Python 3.12 image with dev deps (ruff, mypy, pytest, pip-audit) -- `ci/frontend.Dockerfile` -- Node 22 image with npm deps -- Docker images are rebuilt only when source files change (checksum-based) -- `task quality` builds images once, then runs all checks in `--rm` containers -- `task format` / `task lint:fix` mount `orchestrator/` and `tests/` back to host for write-back - -### Sync with CI -GitHub Actions workflow (`.github/workflows/quality.yml`) uses the same Taskfile: -- `task python:quality` + `task audit` (continue-on-error) -- `task frontend:quality` - -To add or change a check -- edit `Taskfile.yml`. Both local and CI pick it up automatically. - -## Virtual Environment - -For running the orchestrator locally (not quality checks), use the virtual environment: - -```bash -# Create venv (once) -python3 -m venv .venv - -# Activate -source .venv/bin/activate -``` +CI uses same Taskfile (`.github/workflows/quality.yml`). Edit `Taskfile.yml` to change checks. ## Running -```bash -# Activate venv first -source .venv/bin/activate - -# Install -pip install -e ".[dev]" -# Tests (prefer `task test` for Docker-isolated run) -pytest tests/ -v - -# Run orchestrator -python -m orchestrator.main +```bash +source .venv/bin/activate && pip install -e ".[dev]" +pytest tests/ -v # or `task test` for Docker run +python -m orchestrator.main # run orchestrator ``` ## Frontend ```bash -cd frontend -npm install +cd frontend && npm install npm run dev # dev server with proxy to :8080 npm run build # production build to frontend/dist/ ``` ## Docker ```bash -docker compose build -docker compose up -d +docker compose build && docker compose up -d # Dashboard at http://localhost:8080 ``` diff --git a/orchestrator/main.py b/orchestrator/main.py index 7f1cd0a..b613bf6 100644 --- a/orchestrator/main.py +++ b/orchestrator/main.py @@ -206,6 +206,7 @@ def __init__(self) -> None: preflight_checker=self._preflight_checker, mark_dispatched_callback=self._dispatched.add, remove_dispatched_callback=self._dispatched.discard, + clear_recovery_callback=self._recovery.clear, get_state_callback=self.get_state, get_task_events_callback=self._get_task_events, ) diff --git a/orchestrator/supervisor_chat.py b/orchestrator/supervisor_chat.py index da21d58..7c78446 100644 --- a/orchestrator/supervisor_chat.py +++ b/orchestrator/supervisor_chat.py @@ -104,6 +104,7 @@ def __init__( preflight_checker: PreflightChecker | None = None, mark_dispatched_callback: Callable[[str], None] | None = None, remove_dispatched_callback: Callable[[str], None] | None = None, + clear_recovery_callback: Callable[[str], None] | None = None, # Diagnostic callbacks get_state_callback: Callable[[], dict[str, Any]] | None = None, get_task_events_callback: (Callable[[str], list[dict[str, Any]]] | None) = None, @@ -135,6 +136,7 @@ def __init__( self._preflight_checker = preflight_checker self._mark_dispatched = mark_dispatched_callback self._remove_dispatched = remove_dispatched_callback + self._clear_recovery = clear_recovery_callback self._get_state_callback = get_state_callback self._get_task_events_callback = get_task_events_callback self._heartbeat_monitor = heartbeat_monitor @@ -209,6 +211,7 @@ def _on_task_created(key: str) -> None: preflight_checker=self._preflight_checker, mark_dispatched_callback=self._mark_dispatched, remove_dispatched_callback=self._remove_dispatched, + clear_recovery_callback=self._clear_recovery, event_bus=self._event_bus, heartbeat_monitor=self._heartbeat_monitor, get_state_callback=self._get_state_callback, diff --git a/orchestrator/supervisor_tools.py b/orchestrator/supervisor_tools.py index 2e57993..e229822 100644 --- a/orchestrator/supervisor_tools.py +++ b/orchestrator/supervisor_tools.py @@ -65,6 +65,7 @@ def build_supervisor_server( preflight_checker: PreflightChecker | None = None, mark_dispatched_callback: Callable[[str], None] | None = None, remove_dispatched_callback: Callable[[str], None] | None = None, + clear_recovery_callback: Callable[[str], None] | None = None, event_bus: EventBus | None = None, # Heartbeat monitor (optional — for get_agent_health tool) heartbeat_monitor: HeartbeatMonitor | None = None, @@ -1400,6 +1401,9 @@ async def requeue_task( task_key = args["task_key"] if remove_dispatched_callback is not None: remove_dispatched_callback(task_key) + if clear_recovery_callback is not None: + clear_recovery_callback(task_key) + logger.info("Cleared recovery state for %s", task_key) if preflight_checker is not None: preflight_checker.approve_for_dispatch( task_key, diff --git a/tests/test_supervisor_tools.py b/tests/test_supervisor_tools.py index 689e540..3e1eda2 100644 --- a/tests/test_supervisor_tools.py +++ b/tests/test_supervisor_tools.py @@ -33,6 +33,7 @@ def _build_tools_and_deps( preflight_checker: object | None = None, mark_dispatched_callback: object | None = None, remove_dispatched_callback: object | None = None, + clear_recovery_callback: object | None = None, event_bus: object | None = None, ) -> tuple[dict[str, Any], MagicMock, list[dict[str, str]], list[dict[str, object]]]: """Build supervisor tools and return (tools_by_name, tracker_mock, proposals, events). @@ -84,6 +85,7 @@ def get_events(count: int = 50) -> list[dict[str, object]]: preflight_checker=preflight_checker, mark_dispatched_callback=mark_dispatched_callback, remove_dispatched_callback=remove_dispatched_callback, + clear_recovery_callback=clear_recovery_callback, event_bus=event_bus, ) @@ -3086,6 +3088,73 @@ def mock_get_pr_status(owner: str, repo: str, pr_number: int): dep_mgr.remove_deferred.assert_not_called() +# =================================================================== +# requeue_task +# =================================================================== + + +class TestRequeueTask: + """Tests for requeue_task tool.""" + + def _make_requeue_deps( + self, + clear_recovery: object | None = None, + ) -> tuple[dict[str, Any], MagicMock, MagicMock, MagicMock]: + """Build tools with requeue_task support. + + Returns (tools, preflight, remove_dispatched, dep_mgr). + """ + preflight = MagicMock() + preflight.approve_for_dispatch = MagicMock() + remove_dispatched = MagicMock() + + dep_mgr = MagicMock() + dep_mgr.approve_dispatch = AsyncMock() + dep_mgr.remove_deferred = AsyncMock() + + storage = MagicMock() + storage.load_pr_tracking = AsyncMock(return_value=[]) + + event_bus = MagicMock() + event_bus.publish = AsyncMock() + + tools, _, _, _ = _build_tools_and_deps( + preflight_checker=preflight, + dependency_manager=dep_mgr, + remove_dispatched_callback=remove_dispatched, + clear_recovery_callback=clear_recovery, + storage=storage, + event_bus=event_bus, + ) + return tools, preflight, remove_dispatched, dep_mgr + + @pytest.mark.asyncio + async def test_requeue_clears_recovery_state(self) -> None: + """requeue_task must call clear_recovery_callback to reset retry counter.""" + clear_recovery = MagicMock() + tools, preflight, remove_dispatched, _ = self._make_requeue_deps( + clear_recovery=clear_recovery, + ) + + result = await tools["requeue_task"]({"task_key": "QR-273"}) + + clear_recovery.assert_called_once_with("QR-273") + remove_dispatched.assert_called_once_with("QR-273") + preflight.approve_for_dispatch.assert_called_once_with("QR-273") + assert "re-queued" in result["content"][0]["text"] + + @pytest.mark.asyncio + async def test_requeue_without_recovery_callback(self) -> None: + """requeue_task works when clear_recovery_callback is None.""" + tools, preflight, remove_dispatched, _ = self._make_requeue_deps() + + result = await tools["requeue_task"]({"task_key": "QR-100"}) + + remove_dispatched.assert_called_once_with("QR-100") + preflight.approve_for_dispatch.assert_called_once_with("QR-100") + assert "re-queued" in result["content"][0]["text"] + + # =================================================================== # Structural: every @tool in supervisor_tools must be in allowed_tools # ===================================================================