
Add data-extractor skill for structured web scraping #66

Open
aq17 wants to merge 1 commit into main from add-data-extractor-skill

Conversation

@aq17
Contributor

aq17 commented Apr 2, 2026

Summary

  • Adds a new data-extractor skill that teaches agents how to extract structured JSON from websites using browse CLI primitives (eval, get text, snapshot)
  • Covers 5 extraction patterns: single-page, list-page, paginated, search-then-extract, and authenticated extraction
  • Includes SKILL.md, REFERENCE.md (pattern specs, selector strategies, error reference), and EXAMPLES.md (7 worked examples)
  • Registers the skill in marketplace.json and adds it to README.md

Why

Over the last 30 days, at least six customers (Vivian Health, Drata, Rippling, Zogo, Freshsauce, HappyStack) have been trying to extract structured data from websites. The existing browser skill only shows the basic browse get text "body" command. This skill fills the gap by teaching the browse eval "JSON.stringify(...)" pattern and pagination workflows.

The browse CLI stays LLM-free and deterministic — the agent is the intelligence that reads page structure and decides what to extract.
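As a rough illustration of the eval pattern described above, the sketch below shows the kind of JavaScript an agent might build and hand to browse eval. It is not taken from the skill files: the helper name toRecords and the selectors ".item" and ".title a" are hypothetical placeholders, and the agent would pick real selectors after inspecting the page.

```javascript
// Hypothetical helper: turn a list of DOM elements into plain records.
// The ".title a" selector is an assumption made for illustration only.
function toRecords(nodes) {
  return Array.from(nodes).map((node) => {
    const link = node.querySelector(".title a");
    return {
      // Fall back to null so a missing field stays visible in the JSON output.
      title: link ? link.textContent.trim() : null,
      url: link ? link.getAttribute("href") : null,
    };
  });
}

// Inside the page, the evaluated expression would end in something like:
//   JSON.stringify(toRecords(document.querySelectorAll(".item")))
// so that the CLI prints a single JSON string the agent can parse.
```

Keeping the mapping in a small pure function means the same logic can be checked outside a browser by passing in stand-in objects that expose querySelector.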

Stress tested against

  • Hacker News (list extraction with querySelectorAll + .map())
  • GitHub Trending (list extraction with varied selectors)
  • Wikipedia (table extraction with IIFE headers-as-keys pattern)
  • Single-field extraction with browse get text
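The headers-as-keys table pattern mentioned in the list above could look roughly like the following sketch. It assumes the first table row holds th header cells and every later row holds data cells; the function names rowsToObjects and extractTable and the "table.wikitable" selector in the comment are illustrative assumptions, not the contents of EXAMPLES.md.

```javascript
// Pair each row's cells with the header at the same index.
// Missing cells become null rather than being dropped.
function rowsToObjects(headers, rows) {
  return rows.map((cells) =>
    Object.fromEntries(headers.map((header, i) => [header, cells[i] ?? null]))
  );
}

// Assumption: the first <tr> is the header row; all later rows are data.
function extractTable(table) {
  const cellText = (cell) => cell.textContent.trim();
  const [headerRow, ...bodyRows] = Array.from(table.querySelectorAll("tr"));
  if (!headerRow) return []; // empty table: nothing to extract
  const headers = Array.from(headerRow.querySelectorAll("th")).map(cellText);
  const rows = bodyRows.map((tr) =>
    Array.from(tr.querySelectorAll("th, td")).map(cellText)
  );
  return rowsToObjects(headers, rows);
}

// In the page this would be wrapped in an IIFE so locals do not leak, e.g.:
//   (() => JSON.stringify(extractTable(document.querySelector("table.wikitable"))))()
```

Tables with merged cells or multi-row headers would need extra handling that this sketch does not attempt.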

Test plan

  • Verify SKILL.md frontmatter parses correctly
  • Verify marketplace.json is valid JSON
  • Run Example 1 (single-page extract) against a real URL
  • Run Example 3 (paginated extract) to test pagination flow
  • Run Example 7 (table extract) against a Wikipedia table
  • Confirm skill appears in Claude Code when installed

🤖 Generated with Claude Code


Note

Low Risk
Low risk: adds new documentation-only data-extractor skill content and registers it in the marketplace/README without changing runtime code paths.

Overview
Adds a new data-extractor skill (docs + examples) that guides agents to extract structured JSON from web pages using browse primitives (notably eval/JSON.stringify, snapshot, waiting, and pagination patterns).

Registers the skill in .claude-plugin/marketplace.json under the browse plugin and lists it in README.md, including new SKILL.md, REFERENCE.md, EXAMPLES.md, and an MIT LICENSE.txt.

Written by Cursor Bugbot for commit 119df5b. This will update automatically on new commits.

Teaches agents how to extract structured JSON from websites using
browse CLI primitives (eval, get text, snapshot). Covers five patterns:
single-page, list-page, paginated, search-then-extract, and
authenticated extraction — mapped to top customer use cases.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aq17 force-pushed the add-data-extractor-skill branch from 119df5b to e4283b0 on April 2, 2026 20:01
aq17 requested a review from shubh24 on April 2, 2026 20:22
aq17 closed this on Apr 3, 2026
aq17 reopened this on Apr 3, 2026


1 participant