fix(providers): honour request_timeout for CLI providers with clear timeout errors and fallback#1847
Open
securityguy wants to merge 45 commits intosipeed:mainfrom
Open
fix(providers): honour request_timeout for CLI providers with clear timeout errors and fallback#1847securityguy wants to merge 45 commits intosipeed:mainfrom
securityguy wants to merge 45 commits intosipeed:mainfrom
Conversation
Some providers (via OpenRouter) reject assistant messages with "content": "" alongside tool_calls. The OpenAI spec permits content to be absent when tool_calls is set. Switch openaiMessage.Content from string to *string with omitempty and introduce msgContent() to return nil when content is empty and tool calls are present. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ields Some OpenAI-compatible providers (e.g. OpenRouter routing to strict backends) reject non-standard fields in the request body such as reasoning_content in messages and extra_content / thought_signature in tool calls. Add a per-model strict_compat: true config option that strips these fields before serialization. Implementation: - Add StrictCompat bool to config.ModelConfig - Add WithStrictCompat option to openai_compat.Provider - Refactor HTTPProvider constructors into a single NewHTTPProviderWithOptions using variadic openai_compat.Option, eliminating the growing list of named constructors - Thread StrictCompat through CreateProviderFromConfig via composed options Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the claude CLI exits with a non-zero status, the previous error handler only checked stderr. However, the CLI writes its output (including error details) to stdout, especially when invoked with --output-format json. This left the caller with only "exit status 1" and no actionable information. Now includes both stderr and stdout in the error message so the actual failure reason is visible in logs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add claude-cli and codex-cli to the supported vendors table and include vendor-specific configuration examples explaining: - No API key is required (uses existing CLI subscription) - The claude-code sentinel model ID skips --model flag so the CLI uses its own configured default model Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add channels.telegram_bots config allowing multiple Telegram bot tokens to be configured, each mapped to a separate channel (e.g. telegram-amber, telegram-karen). Each channel can be independently bound to an agent via the bindings config, enabling distinct AI personas behind separate bots. Backward compatibility is preserved: the existing channels.telegram single-entry config continues to work unchanged. On load it is normalized into telegram_bots as an entry with id "default", which produces the channel name "telegram" so all existing bindings remain valid. Key changes: - config: add TelegramBotConfig struct with ChannelName/AsTelegramConfig helpers; add TelegramBots field to ChannelsConfig; normalize legacy single entry into list on load - telegram: add NewTelegramChannelFromConfig constructor accepting TelegramConfig + explicit channel name (avoids import cycle) - channels: add TelegramBotFactory registry; add injectChannelDependencies helper to eliminate injection code duplication; add duplicate channel name guard in initTelegramBot; update initChannels to iterate over TelegramBots; add prefix-based rate limit fallback for named bots Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…le.json Add two disabled example bots (alice, bob) under channels.telegram_bots and corresponding top-level bindings to illustrate how multiple Telegram bots map to separate named channels and agents. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds GeminiCliProvider that wraps the Gemini CLI as a subprocess,
following the same pattern as the existing claude-cli and codex-cli
providers.
The provider invokes:
gemini --yolo --output-format json --prompt ""
with the prompt sent via stdin. The --prompt "" flag enables
non-interactive (headless) mode, reading the full prompt from stdin.
Key details:
- Model sentinel: "gemini-cli" skips --model flag (uses CLI default)
- Explicit model: "gemini-cli/gemini-2.5-pro" passes --model gemini-2.5-pro
- System messages prepended to stdin (no --system-prompt flag in gemini)
- Parses JSON response format: {"response": "...", "stats": {"models": {...}}}
- Token usage summed across all models in stats.models (gemini uses
multiple internal models per request)
- Tool calls extracted from response text using shared extractToolCallsFromText
- New protocol: "gemini-cli" / alias "geminicli"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add PR sipeed#1633 (gemini-cli provider) to contributions table - Add configuration guide section covering: - claude-cli, codex-cli, and gemini-cli providers with model_list examples - Multiple Telegram bots with bindings and per-agent config - Agent workspace and personality file notes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds GeminiCliProvider (PR sipeed#1633 against sipeed/picoclaw). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace Amber/Karen with Alice/Bob in all README examples for consistency. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Previously all agents shared a single LLMProvider instance created from agents.defaults.model_name. Per-agent model config (agents.list[].model) only changed the model string passed to Chat() — it never changed which provider binary was invoked. This caused cross-provider fallback chains (e.g. gemini-cli falling back to claude-cli) to fail, and made it impossible to assign different CLI providers to different agents. Introduces ProviderDispatcher which lazily creates and caches provider instances keyed by "protocol/modelID". The fallback chain's run closure now resolves the correct provider via the dispatcher before falling back to agent.Provider for backward compatibility. References sipeed#1634 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings in ProviderDispatcher fix (PR sipeed#1637 against sipeed/picoclaw). References sipeed#1634. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ndling
Tool call detection previously relied on a literal strings.Index for
'{"tool_calls"' which failed whenever the LLM returned pretty-printed
JSON (newline after '{') or wrapped the output in markdown code fences.
Arguments typed as a JSON object instead of an encoded string also
caused a silent parse failure and leaked the raw JSON block to the user.
Changes:
- Strip markdown code fences (` + "```json" + ` / ` + "```" + `) before parsing
- Locate JSON candidate via first '{' / last '}' instead of literal match
- Unmarshal directly and check for top-level "tool_calls" key
- Accept arguments as either a JSON-encoded string or a plain JSON object
- Remove dead findMatchingBrace function and its tests
- Publish response.Content to the user immediately when a response
contains both text and tool calls (previously the text was silently
discarded into session history)
- Fix pre-existing test bug: TestCreateProvider_GeminiCliDefaultWorkspace
now clears Agents.Defaults.Workspace before testing the '.' fallback
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
requiresRuntimeProbe and probeLocalModelAvailability handled claude-cli and codex-cli but not gemini-cli, causing the launcher to report "default model has no credentials configured" and skip auto-start when gemini-cli was set as the default model. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
TryAutoStartGateway only checked gateway.cmd, which tracks processes the
launcher itself spawned. A gateway managed externally (e.g. via systemd)
was invisible to this check, causing two problems:
1. The launcher started a duplicate gateway instance on every launch.
2. The WebUI showed "Gateway Not Running" even when it was healthy.
Fix: probe the gateway health endpoint in two places:
- TryAutoStartGateway: skip auto-start if the health endpoint responds.
- gatewayStatusData: report "running" when the launcher has no owned
process but the health endpoint is responding. Launcher-owned
transition states (restarting/error) take precedence over the probe.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Brings in launcher external gateway detection fix (PR sipeed#1811 against sipeed/picoclaw). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The --system-prompt flag exposed agent instructions and tool definitions in the process argument list (visible via ps to all users on the host) and risked hitting OS ARG_MAX limits when many tools are registered. System prompt content is now prepended to the stdin payload, separated from the user message by a --- delimiter. This is consistent with how the gemini-cli and codex-cli providers already handle all input. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…nd mixed response handling
… allowlist on self-spawn Previously SubagentManager was initialised with the global provider and the calling agent's model name string (e.g. "openrouter-gpt-5.4"). When the global provider happened to be claude-cli this caused it to be invoked with --model openrouter-gpt-5.4, which claude does not recognise, blowing up every spawn attempt. Two problems fixed together: 1. Provider dispatch: SubagentManager now holds the ProviderDispatcher and the calling agent's model candidates. When a subagent is spawned it resolves the correct provider through the same per-candidate dispatch used by the main agent loop. When agent_id names a different agent, that agent's candidates are resolved via a registry callback so the subagent runs with the target agent's configured model (e.g. spawning "karen" uses karen's claude-cli, not amber's openrouter). 2. Self-spawn allowlist: the allowlist check previously only ran when agent_id was explicitly set. Empty agent_id (self-spawn) now resolves to the caller's own ID before the check, so allow_agents: ["karen"] on amber correctly rejects an unqualified spawn attempt. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a named subagent (e.g. agent_id="karen") completes a task, its response is now prefixed with the agent's name in bold so the user can tell which agent produced the result. Self-spawns (no agent_id) are unaffected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lowlist self-spawn, response attribution
The list command previously showed only name, ID, schedule, status, and next run time. Channel, recipient, deliver mode, and message/command were hidden, making it impossible to tell from the CLI what a job would do or which agent would handle it. All payload fields are now displayed. Messages longer than 80 characters are truncated with an ellipsis. Jobs with a command show the command instead of a message. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ectly
Cron jobs fired via ProcessDirectWithChannel had no Peer set on the
InboundMessage, so the route resolver never matched peer-based bindings
and always fell through to the default agent.
Setting Peer{Kind: "channel", ID: chatID} when a real chatID is present
means a cron job targeted at a specific Slack channel ID will now be
routed to whichever agent has a matching peer binding for that channel,
consistent with how live inbound messages are routed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When a cron job with deliver=false runs, the agent response was silently discarded with a misleading comment "Will be sent by AgentLoop". The Run loop is never involved in cron execution — ProcessDirectWithChannel bypasses it entirely. As a result, Karen's response to scheduled tasks was computed but never delivered to Slack (or any other channel). Fix: explicitly publish the response via msgBus.PublishOutbound after ProcessDirectWithChannel returns, consistent with how command and deliver=true jobs already handle their output. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…lish response to channel
… changes Two related bugs caused cron jobs added via the CLI to be silently lost while picoclaw was running: 1. checkJobs() called saveStoreUnsafe() unconditionally every second, overwriting the file with the running service's in-memory state. Any job written by the CLI was clobbered within at most one tick. 2. The running service never reloaded the store from disk, so CLI-added jobs were invisible in memory and would never execute even if the file was not overwritten. Fix: track the store file's mtime in fileModTime (updated after every load and save). At the start of each checkJobs() tick, stat the file; if its mtime is newer than fileModTime, reload from disk. Only call saveStoreUnsafe() when there are actually due jobs — no state changes means no write, which stops the clobbering entirely. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…only save when state changes
claude-cli, codex-cli, and gemini-cli ignored the request_timeout config field entirely — the factory created them without any timeout and the subprocesses ran indefinitely. This is particularly problematic for agentic tasks (e.g. cron jobs) where a runaway Claude Code session could block the agent loop forever. Each CLI provider gains a timeout field and a WithTimeout constructor. When request_timeout > 0 in the model config, Chat() wraps the incoming context with context.WithTimeout before passing it to exec.CommandContext, so the subprocess is killed and an error is returned if the deadline is exceeded. A value of 0 (the default) leaves behaviour unchanged — no timeout is applied. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…der timeout
When a CLI provider subprocess was killed due to request_timeout, the error
reported was the raw subprocess signal ("signal: killed"), giving the user
no indication that a timeout was the cause. Additionally, the fallback chain
would not trigger because ClassifyError used == to match context.DeadlineExceeded,
which does not match wrapped errors.
- Each CLI provider now checks ctx.Err() == context.DeadlineExceeded after
cmd.Run() fails and returns a descriptive error (e.g. "claude cli timed out
after 30s") that wraps context.DeadlineExceeded
- ClassifyError updated to use errors.Is instead of == when checking for
context.DeadlineExceeded, so wrapped timeout errors correctly classify as
FailoverTimeout and trigger fallback to the next candidate in the chain
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
securityguy
added a commit
to securityguy/picoclaw
that referenced
this pull request
Mar 20, 2026
This was referenced Mar 21, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The three CLI providers (
claude-cli,codex-cli,gemini-cli) silently ignored therequest_timeoutfield inmodel_listconfig. The factory created them without any timeout, so subprocesses ran indefinitely — a serious problem for agentic tasks (e.g. cron jobs) where a runaway session could block the agent loop forever.Additionally,
ClassifyErrorused==to matchcontext.DeadlineExceeded, which does not match wrapped errors, so even if a timeout had been applied elsewhere it would not have triggered the fallback chain.Fix
pkg/providers/factory_provider.goWhen
request_timeout > 0in the model config, the factory now calls the newWithTimeoutconstructors for each CLI provider instead of the timeout-less defaults. A value of0(the default) leaves behaviour unchanged.pkg/providers/{claude,codex,gemini}_cli_provider.goEach provider gains a
timeout time.Durationfield and aNewXxxWithTimeoutconstructor. InChat(), iftimeout > 0the incoming context is wrapped withcontext.WithTimeoutbefore being passed toexec.CommandContext. Whencmd.Run()fails andctx.Err() == context.DeadlineExceeded, a descriptive error is returned (e.g."claude cli timed out after 30s") that wrapscontext.DeadlineExceededso the fallback chain can classify it.pkg/providers/error_classifier.goChanged
err == context.DeadlineExceededtoerrors.Is(err, context.DeadlineExceeded)so that wrapped timeout errors correctly classify asFailoverTimeoutand trigger fallback to the next candidate in the chain.Result
With
request_timeout: 10onclaude-cliandcodex-clias a fallback:claude-cliis killed after 10 seconds"claude cli timed out after 10s"codex-cli"signal: killed"Configuration
{ "model_name": "claude-cli", "model": "claude-cli/claude-code", "request_timeout": 1200 }Omitting
request_timeoutor setting it to0leaves behaviour unchanged — no timeout is applied.