A MarkItDown plugin that converts live URLs via Plasmate instead of BeautifulSoup — returning 10-100x fewer tokens with no API key required.
MarkItDown's built-in HTML converter fetches a URL, strips <script> tags, and converts whatever remains with BeautifulSoup. For a typical news article that means ~60,000 tokens of navigation menus, cookie banners, sidebar widgets, and footer links wrapped around ~2,000 tokens of actual content.
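To make the problem concrete, here is a minimal sketch of the "strip scripts, keep everything else" approach. It uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra dependencies, and it is not MarkItDown's actual code; `TextKeeper`, `naive_text`, and the sample page are hypothetical names and data chosen for illustration.

```python
from html.parser import HTMLParser


class TextKeeper(HTMLParser):
    """Collects all text except what sits inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep every non-empty text node that is not inside a skipped tag.
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def naive_text(html: str) -> list:
    """Return the text nodes that survive script/style stripping."""
    parser = TextKeeper()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical page: one article surrounded by typical boilerplate.
PAGE = """
<html><body>
<nav>Home | Tech | Startups | Login</nav>
<script>trackVisitor();</script>
<article><h1>Headline</h1><p>The actual story.</p></article>
<footer>Cookie settings | Privacy | Careers</footer>
</body></html>
"""

chunks = naive_text(PAGE)
for chunk in chunks:
    print(chunk)
```

The script body is dropped, but the navigation and footer text come through alongside the article, which is the overhead described above.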
Plasmate is an open-source Rust browser engine that renders the page properly and returns only the meaningful content as clean Markdown. The token difference is significant:
| Site | Raw HTML (BeautifulSoup) | Plasmate | Reduction |
|---|---|---|---|
| TechCrunch article | ~75,000 tokens | ~975 tokens | 77× |
| Average (45 sites) | ~45,000 tokens | ~2,500 tokens | 17.7× |
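The Reduction column is the raw token count divided by the Plasmate token count. A small sketch of that arithmetic, using the rounded figures from the table (`reduction_factor` is a hypothetical helper, not part of the plugin):

```python
def reduction_factor(tokens_before: int, tokens_after: int) -> float:
    """Return how many times smaller the output is than the input."""
    if tokens_after <= 0:
        raise ValueError("tokens_after must be positive")
    return tokens_before / tokens_after


# TechCrunch row: ~75,000 tokens down to ~975 tokens.
techcrunch = reduction_factor(75_000, 975)
print(f"{techcrunch:.0f}x")  # 77x

# Average row, using the rounded counts shown in the table.
average = reduction_factor(45_000, 2_500)
print(f"{average:.1f}x")  # 18.0x
```

The rounded averages give 18.0×, close to the listed 17.7×; the listed figure was presumably computed from unrounded measurements.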
The plugin slots in specifically for http:// and https:// URL inputs — local files (PDF, Word, Excel, etc.) continue to use MarkItDown's native converters unchanged.
```bash
pip install markitdown-plasmate
pip install plasmate  # the Rust browser engine
```

Or with cargo:

```bash
cargo install plasmate
```

From the command line:

```bash
markitdown --use-plugins https://techcrunch.com/2025/04/08/some-article/
```

From Python:

```python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)
result = md.convert("https://blog.cloudflare.com/ai-crawler-traffic-by-purpose-and-industry/")
print(result.markdown)
# → clean article content, ~2,000 tokens instead of ~60,000
```

Pass plugin options via MarkItDown kwargs:
```python
md = MarkItDown(
    enable_plugins=True,
    plasmate_format="markdown",   # markdown | text | som | links
    plasmate_timeout=30,          # seconds
    plasmate_selector="article",  # CSS selector to scope extraction
)
```

Or use PlasmateConverter directly:
```python
from markitdown import MarkItDown
from markitdown_plasmate import PlasmateConverter

md = MarkItDown()
md.register_converter(PlasmateConverter(output_format="markdown", selector="main"))
result = md.convert("https://example.com")
```

| Format | Description |
|---|---|
| `markdown` | Clean Markdown (default) |
| `text` | Plain text, no markup |
| `som` | Structured Object Model, a semantic JSON tree |
| `links` | Extracted hyperlinks only |
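If you build the format name from user input or configuration, it can help to check it against the four documented values before constructing the converter. The sketch below is an assumption about how a caller might do that; `SUPPORTED_FORMATS` and `validate_format` are hypothetical names, and how the plugin itself reacts to an unknown format is not documented here.

```python
# The four format names listed in the table above.
SUPPORTED_FORMATS = ("markdown", "text", "som", "links")


def validate_format(name: str = "markdown") -> str:
    """Return the format name unchanged, or raise if it is not a documented one."""
    if name not in SUPPORTED_FORMATS:
        allowed = ", ".join(SUPPORTED_FORMATS)
        raise ValueError(f"unknown format {name!r}; expected one of: {allowed}")
    return name


chosen = validate_format("som")
print(chosen)  # som
```

The validated value can then be passed as `plasmate_format` or `output_format` in the examples above.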
The plugin only intercepts http:// and https:// URLs. All other MarkItDown input types (PDF, Word, Excel, images, audio, local HTML files) are unaffected.
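The routing rule can be pictured as a scheme check on the input string. This is a minimal sketch of that idea using `urllib.parse`, not the plugin's actual acceptance logic; `is_remote_url` is a hypothetical helper.

```python
from urllib.parse import urlparse


def is_remote_url(source: str) -> bool:
    """True only for inputs whose scheme is http or https."""
    # urlparse lowercases the scheme, so "HTTP://..." is handled too.
    return urlparse(source).scheme in ("http", "https")


samples = [
    "https://example.com/post",
    "http://example.com",
    "report.pdf",
    "file:///tmp/page.html",
]
for sample in samples:
    print(sample, is_remote_url(sample))
```

Under this rule, local paths and `file://` URLs fall through to MarkItDown's native converters.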
- Python 3.10+
- `markitdown >= 0.1.0`
- `plasmate` binary on PATH (`pip install plasmate` or `cargo install plasmate`)
The plugin can be constructed without the binary; an ImportError with clear install instructions is raised on the first conversion attempt.
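One way to get this "construct now, fail on first use" behaviour is to defer the PATH lookup until the binary is actually needed. The sketch below illustrates the pattern with `shutil.which`; `LazyBinary` is a hypothetical class, not the plugin's implementation, and the binary name in the demo is deliberately one that should not exist.

```python
import shutil
from typing import Optional


class LazyBinary:
    """Hypothetical helper that looks up an executable only when first used."""

    def __init__(self, name: str = "plasmate") -> None:
        self.name = name  # no PATH lookup here, so construction never fails
        self._path: Optional[str] = None

    def resolve(self) -> str:
        """Return the executable's path, raising ImportError if it is missing."""
        if self._path is None:
            found = shutil.which(self.name)
            if found is None:
                raise ImportError(
                    f"{self.name!r} was not found on PATH. Install it with "
                    "`pip install plasmate` or `cargo install plasmate`."
                )
            self._path = found
        return self._path


# Construction succeeds even though this (made-up) binary is not installed.
binary = LazyBinary("plasmate-binary-that-does-not-exist-12345")
try:
    binary.resolve()
except ImportError as exc:
    print(f"Deferred failure: {exc}")
```

Deferring the check keeps `MarkItDown(enable_plugins=True)` usable for local files on machines where the binary is not installed.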
- Plasmate — the open-source Rust browser engine
- somspec.org — Structured Object Model specification
- MarkItDown — the Python file-to-Markdown converter this plugin extends