feat: autobrowse skill — self-improving browser automation#71

Open
shubh24 wants to merge 11 commits into main from feat/autobrowse-skill

Conversation

Contributor

@shubh24 shubh24 commented Apr 7, 2026

Summary

  • Adds the autobrowse skill to the skills library
  • Auto-research loop for building reliable browser navigation skills: inner agent browses → outer agent reads trace → improves strategy.md → repeat
  • Supports single task (interactive) and parallel multi-task mode via Claude Code sub-agents
  • Inspired by Karpathy's autoresearch pattern applied to browser automation

What's included

| File | Purpose |
| --- | --- |
| SKILL.md | /autobrowse skill: single entry point for everything |
| scripts/evaluate.ts | Inner agent harness (Anthropic API + browse CLI) |
| references/example-task.md | Template for writing task.md |
| references/example-skill.md | Template for a graduated skill.md |
| README.md | Setup + project structure guide |
| EXAMPLES.md | Usage examples |
| REFERENCE.md | CLI flags, env vars, trace artifacts |

How it works

/autobrowse --task my-portal          # single task loop
/autobrowse --all --env remote        # parallel via sub-agents

The customer creates tasks/<name>/task.md in their project, runs the skill, and gets back a skill.md they can drop into any Stagehand or browser-use agent.

🤖 Generated with Claude Code


Note

Medium Risk
Adds a new Node-based harness that executes browse CLI commands and writes traces to disk; while it restricts execution to browse and avoids shell expansion, it still introduces new code that runs external processes and handles API credentials.

Overview
Introduces a new skills/autobrowse package that adds the /autobrowse Claude Code skill for iterating on website-specific strategy.md files and optionally running multiple tasks in parallel via sub-agents.

Adds an inner-agent runner (scripts/evaluate.mjs) that calls the Anthropic API, executes only browse CLI commands (no shell), and records per-run artifacts (summary.md, trace.json, messages.json, screenshots) under traces/<task>/ with a latest symlink, plus supporting docs/templates (README.md, REFERENCE.md, EXAMPLES.md, references/) and Node deps/env examples.
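To illustrate the "browse only, no shell" restriction described above, here is a minimal sketch of how a harness might validate a command string before running it. The function name `parseBrowseCommand`, the whitespace-based splitting, and the exact metacharacter set are assumptions for illustration and are not taken from the PR's code.

```typescript
// Hypothetical helper: the name and the metacharacter set are assumptions.
const SHELL_METACHARACTERS = /[;|&`$()<>]/;

function parseBrowseCommand(command: string): string[] {
  const trimmed = command.trim();
  // Reject anything a shell could interpret as chaining, substitution, or redirection.
  if (SHELL_METACHARACTERS.test(trimmed)) {
    throw new Error("command contains shell metacharacters");
  }
  // Naive whitespace split; quoted arguments are not handled in this sketch.
  const [binary, ...args] = trimmed.split(/\s+/);
  if (binary !== "browse") {
    throw new Error("only browse commands are allowed");
  }
  return args;
}
```

The returned argument array could then be passed to a non-shell process API such as Node's `child_process.execFile("browse", args, ...)`, which does not spawn a shell, so the arguments are never re-parsed.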

Reviewed by Cursor Bugbot for commit 70d9ad0. Bugbot is set up for automated code reviews on this repo.

shubh24 and others added 3 commits April 6, 2026 18:12
Auto-research loop for building reliable browser navigation skills.
Inner agent browses the site, outer agent reads the trace and improves
strategy.md. Repeat until it passes consistently.

- SKILL.md: /autobrowse skill (single + parallel task modes via sub-agents)
- scripts/evaluate.ts: inner agent harness (Anthropic API + browse CLI)
- references/: example-task.md and example-skill.md templates
- README, EXAMPLES, REFERENCE docs

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
17 iterations, 3/3 consecutive passes, remote env.
Shows real site-specific gotchas: named Browserbase sessions,
ESRI map location, Verint form patterns, XPath vs ref tradeoffs.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- structured skill.md graduation template (not just copying strategy.md)
- graduate at max iterations too, not just on pass rate
- persistent session reports in reports/ with per-iteration cost table
- richer sub-agent prompt with key learnings output
- cost column in multi-task report table

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
- remove --session flags (skills must be portable, not session-coupled)
- add Known Failure Point section (turn budget exhaustion pattern)
- correct wait syntax docs (browse wait load / browse wait timeout N)
- 22 iterations, updated gotchas order

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@shubh24 shubh24 requested a review from shrey150 April 7, 2026 01:27
- [high] block shell metacharacters (;|&|`$()<>) to prevent injection bypass
- [medium] fix npm evaluate script path: evaluate.ts → scripts/evaluate.ts
- [low] fix tsconfig include: *.ts → scripts/**/*.ts
- [low] lastAssistantText += (accumulate, not overwrite)
- [low] remove unused deps: @browserbasehq/stagehand, zod

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Autofix Details

Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Opus 4.6 pricing rates are 3x too high
    • Updated the pricing table to use Claude Opus 4.6 at $5/$25 and Haiku 4.5 at $1/$5 per million tokens (input/output).
  • ✅ Fixed: Broken symlink silently prevents latest link updates
    • Reworked latest-link rotation to always attempt unlink with ENOENT-only suppression before creating the symlink and warn on failures.
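As a worked example of the pricing fix, the sketch below reproduces the cost formula with the corrected rates quoted in this comment. The function name `estimateCostUsd` and the token counts are made up for illustration; the rates and the fallback of [3, 15] mirror the values shown in the autofix preview.

```typescript
// Rates in USD per million tokens as [input, output], as quoted in this autofix.
const PRICING: Record<string, [number, number]> = {
  "claude-opus-4-6": [5, 25],
  "claude-sonnet-4-6": [3, 15],
  "claude-haiku-4-5-20251001": [1, 5],
};

// Hypothetical helper name; the arithmetic matches the harness's cost formula.
function estimateCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  // Unknown models fall back to the Sonnet rates.
  const [inputRate, outputRate] = PRICING[model] ?? [3, 15];
  return (inputTokens * inputRate + outputTokens * outputRate) / 1_000_000;
}

// Illustrative run: 200k input tokens and 10k output tokens on Opus 4.6.
const opusCost = estimateCostUsd("claude-opus-4-6", 200_000, 10_000);
```

With the old [15, 75] rates the same run would have been reported as $3.75 rather than $1.25, which is the 3x overstatement the issue title refers to.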

Preview (b8ac79b21f)
diff --git a/skills/autobrowse/scripts/evaluate.ts b/skills/autobrowse/scripts/evaluate.ts
--- a/skills/autobrowse/scripts/evaluate.ts
+++ b/skills/autobrowse/scripts/evaluate.ts
@@ -483,9 +483,9 @@
   const durationSec = (Date.now() - startTime) / 1000;
   // Pricing per million tokens (input/output)
   const pricing: Record<string, [number, number]> = {
-    "claude-opus-4-6": [15, 75],
+    "claude-opus-4-6": [5, 25],
     "claude-sonnet-4-6": [3, 15],
-    "claude-haiku-4-5-20251001": [0.80, 4],
+    "claude-haiku-4-5-20251001": [1, 5],
   };
   const [inputRate, outputRate] = pricing[model] ?? [3, 15];
   const costUsd = (totalInputTokens * inputRate + totalOutputTokens * outputRate) / 1_000_000;
@@ -534,7 +534,16 @@
 
   // Update latest symlink
   const latestLink = path.join(tracesDir, "latest");
-  try { if (fs.existsSync(latestLink)) fs.unlinkSync(latestLink); fs.symlinkSync(runId, latestLink); } catch {}
+  try {
+    try {
+      fs.unlinkSync(latestLink);
+    } catch (err: unknown) {
+      if ((err as NodeJS.ErrnoException).code !== "ENOENT") throw err;
+    }
+    fs.symlinkSync(runId, latestLink);
+  } catch (err: unknown) {
+    console.warn(`Warning: failed to update latest symlink: ${(err as Error).message}`);
+  }
 
   console.log(`\n${summary}`);
   console.log(`\n${"=".repeat(60)}`);

}
}

lastAssistantText = assistantText;
Agent output lost when last turn lacks text

Medium Severity

lastAssistantText is unconditionally overwritten with assistantText on every turn, even when assistantText is empty. If the agent exhausts MAX_TURNS and the final API response contains only tool_use blocks (no text blocks), lastAssistantText becomes "", wiping out any meaningful text the agent produced on prior turns. The summary then omits the "Agent Final Output" section entirely, which the outer agent depends on to understand the inner agent's result.
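One possible way to address this, sketched under assumptions: only replace the remembered text when the current turn actually produced some. The `ContentBlock` type and the function `lastNonEmptyAssistantText` are simplified stand-ins for illustration, not the harness's real types or the fix the author chose.

```typescript
// Hypothetical minimal block shapes; the real harness uses the Anthropic SDK's types.
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; name: string };
type ContentBlock = TextBlock | ToolUseBlock;

// Each inner array stands for the content of one assistant turn.
function lastNonEmptyAssistantText(turns: ContentBlock[][]): string {
  let lastAssistantText = "";
  for (const content of turns) {
    const assistantText = content
      .filter((block): block is TextBlock => block.type === "text")
      .map((block) => block.text)
      .join("");
    // Keep the previous value when this turn contained only tool_use blocks.
    if (assistantText.trim() !== "") {
      lastAssistantText = assistantText;
    }
  }
  return lastAssistantText;
}
```

With this guard, a final turn made up solely of tool calls leaves the earlier text in place, so the "Agent Final Output" section would still have something to show.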

Reviewed by Cursor Bugbot for commit ef21429.

for (const block of response.content) {
if (block.type === "text") {
reasoningText += block.text;
assistantText += block.text;
Redundant identical variables obscure intent

Low Severity

reasoningText and assistantText are populated by the exact same text blocks in the same loop iteration and always hold identical values. The distinct names imply a semantic difference that doesn't exist, which could mislead future contributors into updating one but not the other.

Reviewed by Cursor Bugbot for commit ef21429.

shubh24 and others added 2 commits April 9, 2026 18:37
Captures full pipeline from storytelling to render:
- IDEA ENGINE narrative structure
- Remotion gotchas (extrapolateLeft, staticFile, durationInFrames)
- Browserbase brand colors and typography
- UI mocking patterns with real reference video analysis
- Audio production (background, SFX, ffmpeg generation)
- Content authenticity (real failures vs invented ones)
- Render quality settings (CRF 8 for Twitter masters)

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
console.log(` [${turn}] ❌ error: ${output.slice(0, 100)}`);
} else {
console.log(` [${turn}] ✓ ${output.slice(0, 100)} (${duration_ms}ms)`);
}
Snapshot/screenshot errors masked in summary and logs

Medium Severity

The if-else chain checks isSnapshot/isScreenshot before checking error, so a failed browse snapshot is reported as "📸 snapshot: 0 refs" instead of "❌ error: ...". The same pattern is repeated in the summary.md generation. Since the outer agent reads the summary to diagnose failures and improve strategy.md, a dead session error like "No page available" being misreported as "0 refs" leads it to the wrong hypothesis, wasting improvement iterations.
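A sketch of the reordering this implies: test the error flag before the snapshot and screenshot branches. The `TurnResult` shape, the `formatTurnLine` name, and the screenshot wording are hypothetical. Only the error, snapshot, and success line formats are modelled on the strings quoted in this thread.

```typescript
// Hypothetical result shape; field names are illustrative, not the harness's own.
interface TurnResult {
  kind: "snapshot" | "screenshot" | "other";
  error: boolean;
  output: string;
  refCount: number;
  durationMs: number;
}

function formatTurnLine(turn: number, r: TurnResult): string {
  // Check the error flag first so a failed snapshot is never reported as "0 refs".
  if (r.error) return ` [${turn}] ❌ error: ${r.output.slice(0, 100)}`;
  if (r.kind === "snapshot") return ` [${turn}] 📸 snapshot: ${r.refCount} refs`;
  if (r.kind === "screenshot") return ` [${turn}] 📸 screenshot saved`;
  return ` [${turn}] ✓ ${r.output.slice(0, 100)} (${r.durationMs}ms)`;
}
```

Sharing one formatter like this between the console log and the summary.md generation would also keep the two code paths from drifting apart, though whether the harness is structured that way is an assumption.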

Reviewed by Cursor Bugbot for commit 0734f7d.

….io conventions

- Converted scripts/evaluate.ts → scripts/evaluate.mjs (plain ESM JavaScript)
- Removed tsx, typescript, @types/node devDependencies
- Removed tsconfig.json (no longer needed)
- Added --help flag with full usage documentation
- Moved diagnostics to stderr, structured JSON result to stdout
- Added license field to SKILL.md frontmatter
- Updated all references from tsx/evaluate.ts to node/evaluate.mjs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit a0d48ee.

const result = {
task: taskName,
run: runId,
status: turn < MAX_TURNS ? "completed" : "max_turns",
Status misreported when agent completes on final turn

Medium Severity

The status field is computed as turn < MAX_TURNS ? "completed" : "max_turns", but turn equals MAX_TURNS both when the agent successfully completes on the final turn (via break) and when it exhausts all turns without completing. If the agent finishes with stop_reason === "end_turn" on turn 30, the status is incorrectly reported as "max_turns" instead of "completed". The outer agent uses this status to decide pass/fail, so a successful run can be misclassified as an incomplete one.
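One way to remove the ambiguity is to track completion explicitly instead of inferring it from the turn counter. The sketch below simulates the loop with a list of stop reasons. `runLoop`, the `StopReason` type, and the simulated inputs are assumptions for illustration, not the harness's actual control flow.

```typescript
// Simplified stand-in for the API's stop reasons; only these two matter here.
type StopReason = "end_turn" | "tool_use";

interface LoopResult {
  status: "completed" | "max_turns";
  turn: number;
}

// Hypothetical simulation: stopReasons[i] is the stop reason returned on turn i + 1.
function runLoop(maxTurns: number, stopReasons: StopReason[]): LoopResult {
  let completed = false;
  let turn = 0;
  while (turn < maxTurns) {
    turn++;
    const stopReason = stopReasons[turn - 1] ?? "tool_use";
    if (stopReason === "end_turn") {
      // Record completion explicitly rather than comparing turn to maxTurns later.
      completed = true;
      break;
    }
  }
  return { status: completed ? "completed" : "max_turns", turn };
}
```

Here a run that ends with `end_turn` on the last allowed turn reports "completed", while a run that uses every turn without finishing still reports "max_turns".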

Reviewed by Cursor Bugbot for commit a0d48ee.

Graduated skills now install as Claude Code slash commands at
~/.claude/skills/<task-name>/SKILL.md instead of committing a local
skill.md file. Removed all git commit references from the loop —
the working directory (tasks/, traces/, strategy.md) doesn't need
to be a git repo.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>