scrapy-plasmate

Scrapy downloader middleware powered by Plasmate's Semantic Object Model.

Stop writing fragile CSS selectors. Let Plasmate parse pages into clean, structured data — then use simple Python to extract what you need.

The Problem

Scraping Hacker News with vanilla Scrapy means fighting table-based HTML:

# Without Plasmate — brittle, verbose, breaks when HN changes layout
def parse(self, response):
    for row in response.css('tr.athing'):
        title_el = row.css('td.title > span.titleline > a')
        yield {
            'title': title_el.css('::text').get(),
            'url': title_el.attrib.get('href', ''),
            'rank': row.css('td.title > span.rank::text').get('').strip('.'),
        }

Raw HTML from HN: ~42 KB → ~11,000 tokens
Plasmate SOM output: ~6 KB → ~1,500 tokens (86% reduction)

The Solution

# With Plasmate — clean, structural, resilient
from scrapy_plasmate import extract_links

def parse(self, response):
    som = response.meta['plasmate_som']
    for link in extract_links(som):
        yield {'title': link['text'], 'url': link['url']}

Installation

pip install scrapy-plasmate

You also need the Plasmate CLI:

# macOS
brew install nicholasgasior/plasmate/plasmate

# From source
cargo install plasmate

# Or download from https://github.com/nicholasgasior/plasmate/releases

Quick Start

1. Enable the middleware

In your Scrapy project's settings.py:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_plasmate.PlasmateDownloaderMiddleware': 543,
}

2. Use SOM in your spider

import scrapy
from scrapy_plasmate import extract_text, extract_links, extract_headings

class MySpider(scrapy.Spider):
    name = 'example'
    start_urls = ['https://example.com']

    def parse(self, response):
        som = response.meta['plasmate_som']

        # Extract all text
        text = extract_text(som)

        # Extract all links
        links = extract_links(som)

        # Extract headings
        headings = extract_headings(som)

        yield {
            'url': response.url,
            'title': som.get('title', ''),
            'text': text,
            'links': links,
            'headings': headings,
        }
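The heading dicts can be post-processed with plain Python before they are yielded. The sketch below assumes the `{'level': ..., 'text': ...}` shape listed under Utility Functions; `format_outline` is a hypothetical helper, not part of scrapy-plasmate.

```python
def format_outline(headings, indent='  '):
    """Render heading dicts as an indented plain-text outline.

    Assumes each item looks like {'level': 1, 'text': '...'}.
    Hypothetical helper for illustration only.
    """
    lines = []
    for heading in headings:
        text = (heading.get('text') or '').strip()
        if not text:
            # Skip headings with no visible text
            continue
        # Treat a missing or invalid level as a top-level heading
        level = max(int(heading.get('level') or 1), 1)
        lines.append(f"{indent * (level - 1)}{text}")
    return '\n'.join(lines)
```

In the spider above, `'outline': format_outline(headings)` could be added to the yielded item.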

Settings

Setting	Type	Default	Description
`PLASMATE_ENABLED`	bool	`True`	Enable/disable the middleware
`PLASMATE_TIMEOUT`	int	`30`	Timeout in seconds
`PLASMATE_JAVASCRIPT`	bool	`True`	Enable JS rendering
`PLASMATE_FORMAT`	str	`'json'`	Output format: `'json'` or `'text'`
`PLASMATE_BINARY`	str	`'plasmate'`	Path to plasmate binary
`PLASMATE_EXTRA_ARGS`	list	`[]`	Additional CLI arguments
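As a rough illustration, a settings.py that overrides a few of these defaults might look like the following. The setting names come from the table above; the values (a longer timeout, JS rendering off, an absolute binary path) are illustrative assumptions, not recommendations.

```python
# settings.py (sketch; values are illustrative assumptions)
DOWNLOADER_MIDDLEWARES = {
    'scrapy_plasmate.PlasmateDownloaderMiddleware': 543,
}

PLASMATE_ENABLED = True
PLASMATE_TIMEOUT = 60          # seconds; the default is 30
PLASMATE_JAVASCRIPT = False    # turn off JS rendering for static sites
PLASMATE_FORMAT = 'json'       # 'json' or 'text'
PLASMATE_BINARY = '/usr/local/bin/plasmate'  # assumed install path
PLASMATE_EXTRA_ARGS = []       # extra CLI arguments, if any
```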

Per-Request Control

Skip Plasmate for specific requests:

yield scrapy.Request(url, meta={'plasmate_skip': True})

Access the SOM after the middleware runs:

def parse(self, response):
    som = response.meta.get('plasmate_som')
    if som is None:
        # Plasmate was skipped or failed — response is raw HTML
        pass
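The plasmate_skip flag can also be set conditionally. The sketch below skips Plasmate for URLs that are unlikely to be HTML pages, such as sitemaps or JSON endpoints. `request_meta` and the suffix list are hypothetical; only the `plasmate_skip` meta key comes from this README.

```python
from urllib.parse import urlparse

# Hypothetical suffixes for resources that are not HTML pages
SKIP_SUFFIXES = ('.xml', '.json', '.txt', '.csv')


def request_meta(url, skip_suffixes=SKIP_SUFFIXES):
    """Return request meta that skips Plasmate for non-HTML-looking URLs."""
    # Only the path is inspected, so query strings do not trigger a skip
    path = urlparse(url).path.lower()
    if path.endswith(skip_suffixes):
        return {'plasmate_skip': True}
    return {}
```

A spider could then issue `yield scrapy.Request(url, meta=request_meta(url))`.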

Utility Functions

All utilities work with the parsed SOM dict from response.meta['plasmate_som']:

from scrapy_plasmate import (
    extract_text,      # All text content as a string
    extract_links,     # [{'url': '...', 'text': '...'}]
    extract_headings,  # [{'level': 1, 'text': '...'}]
    extract_tables,    # Table regions/elements from the SOM
)
from scrapy_plasmate.utils import (
    extract_images,    # [{'src': '...', 'alt': '...'}]
    extract_by_role,   # Filter elements by SOM role
)
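Because the utilities return plain lists and dicts, their output can be cleaned up with ordinary Python. The sketch below resolves relative URLs and removes duplicates from a list shaped like the extract_links result shown above. `normalize_links` is a hypothetical helper, and whether extract_links already returns absolute URLs is not stated here, so treat the resolution step as a precaution.

```python
from urllib.parse import urldefrag, urljoin


def normalize_links(links, base_url):
    """Resolve relative URLs, drop #fragments, and remove duplicate links.

    Assumes each item looks like {'url': '...', 'text': '...'}.
    Hypothetical helper for illustration only.
    """
    seen = set()
    result = []
    for link in links:
        raw = link.get('url')
        if not raw:
            # Skip entries without a usable URL
            continue
        absolute, _fragment = urldefrag(urljoin(base_url, raw))
        if absolute in seen:
            continue
        seen.add(absolute)
        result.append({'url': absolute, 'text': (link.get('text') or '').strip()})
    return result
```

In a spider, this might be called as `normalize_links(extract_links(som), response.url)`.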

Comparison

Feature	Raw Scrapy	scrapy-plasmate
Setup	CSS/XPath per site	Same utils everywhere
Resilience	Breaks on layout changes	Semantic = stable
Token efficiency	Full HTML (~11K tokens/page)	SOM (~1.5K tokens/page)
JS rendering	Needs scrapy-splash or playwright	Built-in
Learning curve	CSS selectors, XPath	`extract_text(som)`

Examples

See examples/example_spider.py for a complete Hacker News spider.

# Run the example
scrapy runspider examples/example_spider.py -o stories.json

Fallback Behavior

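If Plasmate fails (timeout, binary not found, or non-zero exit), the middleware returns None and Scrapy falls back to its default downloader. Your spider still runs, but it receives raw HTML and no SOM, so read the SOM with response.meta.get('plasmate_som') and handle the None case as shown under Per-Request Control.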
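The failure modes listed above map onto a common subprocess pattern. The sketch below is not the middleware's actual code. It shows one way a wrapper could run an external command and return None for each of those failures. `run_som_command` is a hypothetical name, and the command line is passed in by the caller because the plasmate CLI's arguments are not documented here.

```python
import json
import subprocess


def run_som_command(cmd, timeout=30):
    """Run `cmd` and return its stdout parsed as JSON, or None on failure.

    Hypothetical helper: `cmd` is whatever argument list the caller builds.
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired):
        # Binary not found (or not executable), or the timeout expired
        return None
    if proc.returncode != 0:
        # Non-zero exit status
        return None
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        # Output was not valid JSON
        return None
```

A downloader middleware that gets None from a helper like this can return None from its own hook, which tells Scrapy to continue with the normal download.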

License

Apache 2.0 — see LICENSE.

Part of the Plasmate Ecosystem


Engine	plasmate - The browser engine for agents
MCP	plasmate-mcp - Claude Code, Cursor, Windsurf
Extension	plasmate-extension - Chrome cookie export
SDKs	Python / Node.js / Go / Rust
Frameworks	LangChain / CrewAI / AutoGen / Smolagents
Tools	Scrapy / Audit / A11y / GitHub Action
Resources	Awesome Plasmate / Notebooks / Benchmarks
Docs	docs.plasmate.app
W3C	Web Content Browser for AI Agents
