Browser Scraper 🚀

A multi-agent browser automation framework built on page-agent. It interacts with pages through direct DOM manipulation rather than screenshots, which keeps web interaction fast and reliable.

✨ Features

  • 🤖 Multi-Agent Workflow: Planner → Scraper → Extractor pipeline
  • 📄 DOM-Based: Manipulates the DOM directly (no screenshots needed)
  • 🎯 Flexible: Runs in the user's real browser or in headless mode
  • ⚡ Fast: Direct DOM operations, no image processing
  • 🔄 Caching: Built-in result caching for performance
  • 🔌 MCP Server: External control via the Model Context Protocol
  • 🎨 Full Toolset: All page-agent tools enabled (JavaScript execution, user interaction, etc.)

🏗️ Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   PLANNER    │────▶│   SCRAPER    │────▶│  EXTRACTOR   │
│  (qwen3.5)   │     │  (glm-5.1)   │     │  (gemini)    │
└──────────────┘     └──────────────┘     └──────────────┘
                         │
                         ▼
                  ┌──────────────┐
                  │  PAGE-AGENT  │
                  │  (DOM-based) │
                  └──────────────┘
                         │
                         ▼
                  ┌──────────────┐
                  │   BROWSER    │
                  │  (User's)    │
                  └──────────────┘
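
The diagram can be read as three stages chained in sequence. The sketch below models only that shape: the `Plan` and `PageSnapshot` types, the stage signatures, and the stub stages are all hypothetical and are not this package's API. It is kept synchronous for brevity; real LLM-backed stages would be asynchronous.

```typescript
// Hypothetical types for illustration; not the package's real interfaces.
interface Plan { steps: string[] }
interface PageSnapshot { url: string; text: string }

type Planner = (task: string) => Plan;
type Scraper = (url: string, plan: Plan) => PageSnapshot;
type Extractor<T> = (page: PageSnapshot) => T;

// Chain the three stages: plan the task, run it against the page, extract data.
function runPipeline<T>(
    task: string,
    url: string,
    planner: Planner,
    scraper: Scraper,
    extractor: Extractor<T>,
): { plan: Plan; data: T } {
    const plan = planner(task);
    const page = scraper(url, plan);
    return { plan, data: extractor(page) };
}

// Stub stages standing in for the model-backed agents.
const stubPlanner: Planner = (task) => ({ steps: ['open page', `do: ${task}`] });
const stubScraper: Scraper = (url, plan) => ({ url, text: `ran ${plan.steps.length} steps` });
const stubExtractor: Extractor<string[]> = (page) => page.text.split(' ');

const pipelineResult = runPipeline(
    'list links',
    'https://example.com',
    stubPlanner,
    stubScraper,
    stubExtractor,
);
console.log(pipelineResult.plan.steps, pipelineResult.data);
```

Passing the stages in as functions makes each one easy to swap for a stub when testing the others.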

📦 Installation

# Clone repository
git clone https://github.com/mamidevs/browser-scraper.git
cd browser-scraper

# Install dependencies
npm install

# Setup environment variables
cp .env.example .env
# Edit .env and add your OpenRouter API key

🔧 Configuration

Create a .env file:

# OpenRouter API Key
OPENROUTER_API_KEY=your_api_key_here

# Model Configuration (optional - defaults shown)
PLANNER_MODEL=qwen/qwen3.5-27b
SCRAPER_MODEL=z-ai/glm-5.1
EXTRACTOR_MODEL=google/gemini-2.5-flash

# Base URL (optional)
OPENROUTER_BASE_URL=https://openrouter.ai/api/v1
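
As a rough illustration of how these variables could be resolved, the sketch below builds a config object and falls back to the values shown above when a variable is unset. `loadConfig` and `ScraperConfig` are hypothetical names, and treating the base URL shown above as a default is an assumption; the package may read its configuration differently.

```typescript
// Hypothetical helper; the package may resolve its configuration differently.
interface ScraperConfig {
    apiKey: string;
    baseUrl: string;
    plannerModel: string;
    scraperModel: string;
    extractorModel: string;
}

type Env = Record<string, string | undefined>;

// Build a config from environment-style variables, falling back to the
// values listed in the .env example. Throws if the API key is missing.
function loadConfig(env: Env): ScraperConfig {
    const apiKey = env.OPENROUTER_API_KEY;
    if (!apiKey) {
        throw new Error('OPENROUTER_API_KEY is not set');
    }
    return {
        apiKey,
        baseUrl: env.OPENROUTER_BASE_URL ?? 'https://openrouter.ai/api/v1',
        plannerModel: env.PLANNER_MODEL ?? 'qwen/qwen3.5-27b',
        scraperModel: env.SCRAPER_MODEL ?? 'z-ai/glm-5.1',
        extractorModel: env.EXTRACTOR_MODEL ?? 'google/gemini-2.5-flash',
    };
}

// In a Node.js program you would typically pass process.env here.
const loadedConfig = loadConfig({
    OPENROUTER_API_KEY: 'test-key',
    SCRAPER_MODEL: 'custom/model',
});
console.log(loadedConfig.scraperModel, loadedConfig.plannerModel);
```

Failing early on a missing key tends to give a clearer error than a rejected API request later on.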

🚀 Usage

Basic Scraping (Single Agent)

import { ScrapingAgent } from 'browser-scraper';

const agent = new ScrapingAgent({
    model: 'z-ai/glm-5.1',
    apiKey: process.env.OPENROUTER_API_KEY,
});

await agent.initialize();

// Navigate and extract
await agent.navigate('https://example.com');
const result = await agent.execute('Extract all product names and prices');

console.log(result.data);

Multi-Agent Workflow

import { ScrapingOrchestrator } from 'browser-scraper';

const orchestrator = new ScrapingOrchestrator({
    planner: { model: 'qwen/qwen3.5-27b', apiKey: '...' },
    scraper: { model: 'z-ai/glm-5.1', apiKey: '...' },
    extractor: { model: 'google/gemini-2.5-flash', apiKey: '...' },
});

await orchestrator.initialize();

const result = await orchestrator.scrape({
    task: 'Find all laptops under $1000 with ratings above 4.0',
    url: 'https://amazon.com',
    schema: {
        type: 'array',
        items: {
            type: 'object',
            properties: {
                name: { type: 'string' },
                price: { type: 'number' },
                rating: { type: 'number' }
            }
        }
    }
});

console.log('Plan:', result.plan);
console.log('Data:', result.data);
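
Because the extractor is model-driven, it can be worth checking the returned data against the schema before using it. The sketch below is a minimal validator for the small JSON Schema subset used in this README (`string`, `number`, `integer`, `array`, `object`). `matchesSchema` is a hypothetical helper, not part of the package, and the sample records are made up.

```typescript
// Minimal subset of JSON Schema, enough for the schemas shown in this README.
type Schema =
    | { type: 'string' | 'number' | 'integer' }
    | { type: 'array'; items: Schema }
    | { type: 'object'; properties: Record<string, Schema> };

// Return true when `value` matches `schema`. As in JSON Schema without a
// `required` list, properties missing from an object are tolerated.
function matchesSchema(value: unknown, schema: Schema): boolean {
    if (schema.type === 'array') {
        const itemSchema = schema.items;
        return Array.isArray(value) && value.every((item) => matchesSchema(item, itemSchema));
    }
    if (schema.type === 'object') {
        if (typeof value !== 'object' || value === null || Array.isArray(value)) {
            return false;
        }
        const record = value as Record<string, unknown>;
        const properties = schema.properties;
        return Object.keys(properties).every(
            (key) => !(key in record) || matchesSchema(record[key], properties[key]),
        );
    }
    if (schema.type === 'string') {
        return typeof value === 'string';
    }
    if (schema.type === 'integer') {
        return typeof value === 'number' && Number.isInteger(value);
    }
    return typeof value === 'number' && Number.isFinite(value);
}

const laptopSchema: Schema = {
    type: 'array',
    items: {
        type: 'object',
        properties: {
            name: { type: 'string' },
            price: { type: 'number' },
            rating: { type: 'number' },
        },
    },
};

// Made-up sample records for illustration only.
const validRows = [{ name: 'Laptop A', price: 899, rating: 4.5 }];
const invalidRows = [{ name: 'Laptop B', price: '899', rating: 4.5 }];
console.log(matchesSchema(validRows, laptopSchema), matchesSchema(invalidRows, laptopSchema));
```

For production use, a full JSON Schema validator would cover keywords this sketch ignores, such as `required` and `enum`.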

Batch Scraping

const urls = [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3',
];

const results = await orchestrator.scrapeBatch(
    urls.map(url => ({
        task: 'Extract product names and prices',
        url,
        schema: { /* ... */ }
    })),
    3 // concurrency
);
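
To make the concurrency argument concrete, here is one common way to cap the number of tasks in flight. This is a generic sketch, not the implementation behind `scrapeBatch`; `mapWithConcurrency` and the timer-based worker are hypothetical stand-ins.

```typescript
// Run `worker` over `items` with at most `concurrency` calls in flight,
// preserving input order in the returned array.
async function mapWithConcurrency<T, R>(
    items: readonly T[],
    concurrency: number,
    worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
    const results: R[] = new Array(items.length);
    let nextIndex = 0;

    // Each runner claims the next unprocessed index until none remain.
    async function runner(): Promise<void> {
        while (nextIndex < items.length) {
            const index = nextIndex++;
            results[index] = await worker(items[index], index);
        }
    }

    const runnerCount = Math.max(1, Math.min(concurrency, items.length));
    await Promise.all(Array.from({ length: runnerCount }, () => runner()));
    return results;
}

// Demo worker: a short timer stands in for a real scrape.
let inFlight = 0;
let peakInFlight = 0;
const batchDemo = mapWithConcurrency(['a', 'b', 'c', 'd', 'e'], 3, async (item) => {
    inFlight++;
    peakInFlight = Math.max(peakInFlight, inFlight);
    await new Promise<void>((resolve) => setTimeout(resolve, 10));
    inFlight--;
    return item.toUpperCase();
});
batchDemo.then((out) => console.log(out, 'peak in flight:', peakInFlight));
```

In this sketch a single rejected task rejects the whole batch; collecting per-task errors instead is a reasonable alternative when partial results are useful.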

MCP Server

import { ScrapingMCPServer } from 'browser-scraper';

const server = new ScrapingMCPServer();

await server.initialize({
    planner: { model: 'qwen/qwen3.5-27b', apiKey: '...' },
    scraper: { model: 'z-ai/glm-5.1', apiKey: '...' },
    extractor: { model: 'google/gemini-2.5-flash', apiKey: '...' },
});

// Available tools:
// - scrape-url: Single URL scraping
// - scrape-batch: Multiple URLs in parallel
// - extract-schema: Schema-based extraction
// - clear-cache: Clear all caches
// - get-stats: Get statistics

// Invoke a tool (an external MCP client would send the equivalent request)
const result = await server.callTool({
    name: 'scrape-url',
    arguments: {
        url: 'https://example.com',
        task: 'Extract all links',
    }
});
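
For reference, an external MCP client expresses the same call as a JSON-RPC 2.0 message with the method `tools/call`. The sketch below only builds that message; `buildToolCall` is a hypothetical helper, and how this server is exposed to clients (stdio, HTTP, or otherwise) is not stated here.

```typescript
// Shape of an MCP tool invocation as a JSON-RPC 2.0 request.
interface ToolCallRequest {
    jsonrpc: '2.0';
    id: number;
    method: 'tools/call';
    params: { name: string; arguments: Record<string, unknown> };
}

let nextRequestId = 1;

// Hypothetical helper: wrap a tool name and its arguments in a request envelope.
function buildToolCall(toolName: string, args: Record<string, unknown>): ToolCallRequest {
    return {
        jsonrpc: '2.0',
        id: nextRequestId++,
        method: 'tools/call',
        params: { name: toolName, arguments: args },
    };
}

const toolRequest = buildToolCall('scrape-url', {
    url: 'https://example.com',
    task: 'Extract all links',
});
console.log(JSON.stringify(toolRequest, null, 2));
```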

📚 Examples

Run the examples:

# Basic scraping
npm run example:basic

# Multi-agent workflow
npm run example:multi

# MCP client
npm run example:mcp

🎯 Use Cases

1. E-commerce Price Monitoring

const result = await orchestrator.scrape({
    task: 'Extract product name, price, availability, and ratings',
    url: 'https://amazon.com/dp/B08N5KWB9H',
    schema: {
        type: 'object',
        properties: {
            name: { type: 'string' },
            price: { type: 'number' },
            currency: { type: 'string' },
            availability: { type: 'string' },
            rating: { type: 'number' },
            reviews: { type: 'integer' }
        }
    }
});
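
Monitoring means comparing each new result against an earlier one. The sketch below computes the percentage change between two snapshots; `PriceSnapshot`, `priceChangePercent`, the 10% threshold, and the sample values are all hypothetical and not part of the package.

```typescript
// Hypothetical snapshot shape, mirroring part of the schema above.
interface PriceSnapshot {
    name: string;
    price: number;
    currency: string;
}

// Percentage change from `previous` to `current`; negative means a price drop.
function priceChangePercent(previous: PriceSnapshot, current: PriceSnapshot): number {
    if (previous.currency !== current.currency) {
        throw new Error('Cannot compare prices in different currencies');
    }
    if (previous.price <= 0) {
        throw new Error('Previous price must be positive');
    }
    return ((current.price - previous.price) * 100) / previous.price;
}

// Made-up sample values for illustration only.
const earlierSnapshot: PriceSnapshot = { name: 'Sample item', price: 100, currency: 'USD' };
const latestSnapshot: PriceSnapshot = { name: 'Sample item', price: 80, currency: 'USD' };
const changePercent = priceChangePercent(earlierSnapshot, latestSnapshot);
console.log(changePercent <= -10 ? `Price dropped ${-changePercent}%` : 'No significant drop');
```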

2. Job Board Aggregation

const result = await orchestrator.scrape({
    task: `
        1. Search for "software engineer" jobs
        2. Filter by location "Remote"
        3. Extract job title, company, salary, and application URL
    `,
    url: 'https://linkedin.com/jobs',
    schema: {
        type: 'array',
        items: {
            type: 'object',
            properties: {
                title: { type: 'string' },
                company: { type: 'string' },
                salary: { type: 'string' },
                location: { type: 'string' },
                applyUrl: { type: 'string' }
            }
        }
    }
});

3. News Article Collection

const urls = [
    'https://techcrunch.com',
    'https://theverge.com',
    'https://arstechnica.com',
];

const results = await orchestrator.scrapeBatch(
    urls.map(url => ({
        task: 'Extract article title, author, publish date, and summary',
        url,
        schema: {
            type: 'array',
            items: {
                type: 'object',
                properties: {
                    title: { type: 'string' },
                    author: { type: 'string' },
                    date: { type: 'string' },
                    summary: { type: 'string' },
                    url: { type: 'string' }
                }
            }
        }
    })),
    3 // concurrency
);

🔧 API Reference

ScrapingAgent

Single-agent wrapper around page-agent.

Methods:

  • initialize(): Initialize the agent (must be called first)
  • navigate(url): Navigate to a URL
  • execute(task): Execute a natural language task
  • extractData(schema): Extract structured data using a schema
  • clearCache(): Clear cached results
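
To illustrate what result caching and `clearCache()` might involve, here is a minimal in-memory cache with a time-to-live. It is a sketch under assumptions: the `ResultCache` class, the URL-plus-task key, and the TTL policy are hypothetical, and the package's actual cache may work differently.

```typescript
// Hypothetical in-memory cache with expiry; not the package's cache.ts.
class ResultCache<V> {
    private entries = new Map<string, { value: V; expiresAt: number }>();
    private ttlMs: number;
    private now: () => number;

    // `now` is injectable so that expiry can be tested without real waiting.
    constructor(ttlMs: number, now: () => number = () => Date.now()) {
        this.ttlMs = ttlMs;
        this.now = now;
    }

    // Build a key from the URL and the task so different tasks do not collide.
    static key(url: string, task: string): string {
        return JSON.stringify([url, task]);
    }

    get(key: string): V | undefined {
        const entry = this.entries.get(key);
        if (!entry) return undefined;
        if (entry.expiresAt <= this.now()) {
            this.entries.delete(key); // drop stale entries lazily
            return undefined;
        }
        return entry.value;
    }

    set(key: string, value: V): void {
        this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    }

    clear(): void {
        this.entries.clear();
    }
}

// Demo with a fake clock.
let fakeTime = 0;
const resultCache = new ResultCache<string[]>(1000, () => fakeTime);
const cacheKey = ResultCache.key('https://example.com', 'Extract all links');
resultCache.set(cacheKey, ['/a', '/b']);
const freshHit = resultCache.get(cacheKey);
fakeTime = 1500; // advance past the 1000 ms TTL
const afterExpiry = resultCache.get(cacheKey);
console.log(freshHit, afterExpiry);
```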

ScrapingOrchestrator

Multi-agent orchestration.

Methods:

  • initialize(): Initialize all agents
  • scrape(taskConfig): Execute scraping with multi-agent workflow
  • scrapeBatch(tasks, concurrency): Execute multiple scraping tasks in parallel
  • clearCache(): Clear all caches

ScrapingMCPServer

MCP server for external control.

Tools:

  • scrape-url: Single URL scraping
  • scrape-batch: Batch scraping
  • extract-schema: Schema-based extraction
  • clear-cache: Clear cache
  • get-stats: Get statistics

🏃 Running

# Build
npm run build

# Run examples
npm run example:basic
npm run example:multi
npm run example:mcp

# Test
npm test

📁 Project Structure

browser-scraper/
├── src/
│   ├── agents/
│   │   └── orchestrator.ts    # Multi-agent coordinator
│   ├── page-agent/
│   │   └── wrapper.ts         # Page-agent integration
│   ├── storage/
│   │   └── cache.ts           # Caching layer
│   ├── mcp/
│   │   └── server.ts          # MCP server
│   └── index.ts               # Main entry
├── examples/
│   ├── basic.ts               # Basic example
│   ├── multi-agent.ts         # Multi-agent workflow
│   └── mcp-client.ts          # MCP client
├── tests/
├── package.json
├── tsconfig.json
└── README.md

🔒 Privacy & Ethics

  • Respect robots.txt: Check the site's robots.txt before scraping and honor its rules
  • Rate limiting: Throttle requests and use caching to avoid overloading sites
  • Terms of Service: Confirm that the site's ToS permits scraping
  • Data usage: Use scraped data responsibly and legally
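
As a starting point for the robots.txt check, the sketch below tests a path against the `Disallow` rules in the `User-agent: *` groups. It is deliberately simplified: it matches by prefix only and ignores `Allow` rules, wildcards, and agent-specific groups, so a complete parser is preferable for real use. Both function names are hypothetical.

```typescript
// Collect Disallow path prefixes from the `User-agent: *` groups of a robots.txt body.
function disallowedPrefixes(robotsTxt: string): string[] {
    const prefixes: string[] = [];
    let groupApplies = false;  // does the current group include "User-agent: *"?
    let readingAgents = false; // are we in the run of User-agent lines opening a group?

    for (const rawLine of robotsTxt.split(/\r?\n/)) {
        const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
        const colon = line.indexOf(':');
        if (colon === -1) continue;
        const field = line.slice(0, colon).trim().toLowerCase();
        const value = line.slice(colon + 1).trim();

        if (field === 'user-agent') {
            if (!readingAgents) groupApplies = false; // a new group starts here
            readingAgents = true;
            if (value === '*') groupApplies = true;
        } else if (field === 'disallow' || field === 'allow') {
            readingAgents = false;
            // An empty Disallow value means "nothing is disallowed".
            if (field === 'disallow' && groupApplies && value !== '') {
                prefixes.push(value);
            }
        }
    }
    return prefixes;
}

// Prefix-only check: true when no collected rule is a prefix of `path`.
function isPathAllowed(robotsTxt: string, path: string): boolean {
    return !disallowedPrefixes(robotsTxt).some((prefix) => path.startsWith(prefix));
}

// Made-up robots.txt content for illustration only.
const sampleRobots = [
    'User-agent: *',
    'Disallow: /private/  # internal pages',
    '',
    'User-agent: otherbot',
    'Disallow: /',
].join('\n');

console.log(isPathAllowed(sampleRobots, '/products'), isPathAllowed(sampleRobots, '/private/data'));
```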

🤝 Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

📄 License

MIT License - see LICENSE file for details.

Built with ❤️ using page-agent and Hermes Agent