feat(scrapegraph): migrate tool to scrapegraph-py v2 SDK #135
The scrapegraph-py 2.x SDK replaces the old `Client` with `ScrapeGraphAI`
and returns `ApiResult` objects instead of raising exceptions. The old
capability surface (smartscraper, markdownify, searchscraper, smartcrawler,
sitemap) no longer exists upstream, so this is a full rewrite of the tool
against the new endpoints.
Capabilities exposed:
- scrape(url, format) — markdown/html/links/summary
- extract(prompt, url, schema) — AI structured extraction
- search(query, num_results) — web search + optional extraction
- crawl(url, max_pages, ...) — start a crawl job
- get_crawl_result(crawl_id) — poll crawl status/result
- monitor(url, interval, ...) — schedule a page monitor (cron)
- credits() — plan / remaining credits
- health() — API health check
Also:
- Bump `scrapegraph-py` optional dep to `>=2.1.0`
- Accept `SGAI_API_KEY` (new SDK default), with fallback to legacy
`SCRAPEGRAPH_API_KEY` so existing users aren't broken
- Tests cover success, API-level error (ApiResult.status=="error"),
exception paths, missing-dep guard, and env-var resolution
- Example rewritten to exercise the new surface
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
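The commit message above notes that the v2 SDK returns `ApiResult` objects instead of raising, and that the tests separate an API-level error (`ApiResult.status == "error"`) from an exception. A minimal sketch of how such a result could be turned into one consistent string, assuming a stand-in `ApiResult` with `status`, `data`, and `error` fields (the `data` field name and the body of `format_result` are assumptions for illustration, not the SDK's or this PR's actual code):

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ApiResult:
    """Hypothetical stand-in for the SDK's result object (field names assumed)."""

    status: str
    data: Any = None
    error: Optional[str] = None


def format_result(result: ApiResult, capability: str) -> str:
    """Return a JSON string on success, or an 'Error in <capability>: ...' string."""
    if result.status == "error":
        return f"Error in {capability}: {result.error}"
    # default=str keeps values that are not JSON-native (dates, objects) from raising.
    return json.dumps(result.data, default=str)


print(format_result(ApiResult(status="success", data={"title": "Example"}), "scrape"))
print(format_result(ApiResult(status="error", error="invalid URL"), "scrape"))
```

Returning a string in both branches means the calling agent never has to handle two different return types.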
📝 Walkthrough
The ScrapeGraphAI integration has undergone a comprehensive upgrade from SDK v1 to v2. The tool's capability set has been reorganized, replacing legacy methods.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
The refactoring spans multiple files with heterogeneous logic changes: new method signatures across the tool class, SDK integration details, format configuration builders, and corresponding test suite updates. While the changes follow a consistent pattern, each method requires separate verification of argument forwarding, result handling, and error management.
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
            prompt: Optional extraction prompt applied to the results.
        """
        try:
            result = self.client.search(query, num_results=num_results, prompt=prompt)
            return _format_result(result, "search")
        except Exception as e:
            logger.exception("ScrapeGraphAI SearchScraper Error")
            return f"Error in searchscraper: {str(e)}"
            logger.exception("ScrapeGraphAI search error")
JsonFormatConfig imported but never used
JsonFormatConfig is imported (and nulled out in the fallback) but never referenced in _FORMAT_BUILDERS, in crawl/monitor, or anywhere else. Ruff will flag this as F401 and fail the lint pre-commit hook / CI linter step. Either add "json" as a supported format in _FORMAT_BUILDERS/ScrapeFormat, or drop the import entirely.
from scrapegraph_py import (
    HtmlFormatConfig,
    LinksFormatConfig,
    MarkdownFormatConfig,
    SummaryFormatConfig,
)
        url: str,
        max_pages: int = 10,
        max_depth: int = 2,
        include_patterns: Optional[List[str]] = None,
        exclude_patterns: Optional[List[str]] = None,
    ) -> str:
JsonFormatConfig also missing from fallback
The except ImportError fallback block does not assign JsonFormatConfig = None. If the import is kept and the library is absent, any reference to JsonFormatConfig would raise a NameError rather than degrade gracefully. If the import is removed from the try-block (see above), also remove it from the fallback.
    _SGAIClient = None
    MarkdownFormatConfig = None
    HtmlFormatConfig = None
    LinksFormatConfig = None
    SummaryFormatConfig = None
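Both comments above come down to the try-block imports and the `except ImportError` fallback drifting apart. One way to make that drift impossible is to drive both from a single tuple of names. The sketch below is an alternative pattern, not what the PR does; `load_optional` and `OPTIONAL_NAMES` are hypothetical names introduced here, and the class names are the ones shown in the snippet above.

```python
import importlib
from typing import Any, Dict, Iterable

# Names the tool needs from the optional dependency (taken from the snippet above).
OPTIONAL_NAMES = (
    "HtmlFormatConfig",
    "LinksFormatConfig",
    "MarkdownFormatConfig",
    "SummaryFormatConfig",
)


def load_optional(module_name: str, names: Iterable[str]) -> Dict[str, Any]:
    """Import `names` from `module_name`; map every name to None if that fails."""
    names = tuple(names)
    try:
        module = importlib.import_module(module_name)
        return {name: getattr(module, name) for name in names}
    except (ImportError, AttributeError):
        # Missing package, or an installed version that lacks one of the names.
        return dict.fromkeys(names)


configs = load_optional("scrapegraph_py", OPTIONAL_NAMES)
sdk_available = all(value is not None for value in configs.values())
print("SDK available:", sdk_available)
```

Because the fallback is built from the same tuple as the import, adding or dropping a name (such as `JsonFormatConfig`) is a one-line change. The trade-off is that static linters can no longer see the individual imports.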
        Args:
            website_url: The URL of the website to crawl
            user_prompt: Prompt describing what to extract (used when extraction_mode=True)
            max_depth: Maximum depth of crawling (default: 1)
            max_pages: Maximum number of pages to crawl (default: 3)
            sitemap: Whether to use sitemap for crawling (default: True)
            extraction_mode: Whether to use extraction mode (requires data_schema if True, default: False)
            data_schema: Data schema for extraction (required if extraction_mode=True)
            url: Page to monitor.
            interval: Cron expression, e.g. "0 * * * *" for hourly.
            name: Optional monitor name.
            webhook_url: Optional webhook to receive change notifications.
        """
        try:
            crawl_params = {
                "url": website_url,
                "depth": max_depth,
                "max_pages": max_pages,
                "sitemap": sitemap,
                "extraction_mode": extraction_mode,
            }

            # Include prompt and data_schema only when extraction_mode=True
            if extraction_mode:
                if data_schema is None:
                    raise ValueError(
                        "data_schema is required when extraction_mode=True"
                    )
                crawl_params["prompt"] = user_prompt
                crawl_params["data_schema"] = data_schema
            response = self.client.crawl(**crawl_params)
            return str(response)
            result = self.client.monitor.create(
self.api_key stores unresolved value
super().__init__(api_key) is called before resolved_key is computed, so self.api_key ends up holding None (or the raw, unresolved argument) even when the key was actually read from an environment variable. Any code that later reads tool.api_key to inspect the active credential will see None. Computing resolved_key first and then passing it to super().__init__ would keep the stored attribute consistent with what self.client was initialised with.
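A minimal sketch of the ordering this comment suggests, assuming a base class that simply stores whatever key it receives. `BaseTool`, `FakeClient`, and `ScrapeTool` are hypothetical stand-ins and not the project's real classes; only the env-var names come from the PR.

```python
import os
from typing import Optional


class BaseTool:
    """Hypothetical stand-in for the tool base class; it only stores the key."""

    def __init__(self, api_key: Optional[str] = None) -> None:
        self.api_key = api_key


class FakeClient:
    """Hypothetical stand-in for the SDK client."""

    def __init__(self, api_key: Optional[str]) -> None:
        self.api_key = api_key


class ScrapeTool(BaseTool):
    def __init__(self, api_key: Optional[str] = None) -> None:
        # Resolve first, so the base class and the client receive the same value.
        resolved_key = (
            api_key
            or os.environ.get("SGAI_API_KEY")
            or os.environ.get("SCRAPEGRAPH_API_KEY")
        )
        super().__init__(resolved_key)
        self.client = FakeClient(api_key=resolved_key)


os.environ["SGAI_API_KEY"] = "key-from-env"  # demo value only
tool = ScrapeTool()
print(tool.api_key == tool.client.api_key)
```

With the resolution moved ahead of `super().__init__`, `tool.api_key` and the client's key cannot disagree.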
Actionable comments posted: 1
🧹 Nitpick comments (4)
examples/scrapegraphai_example.py (1)
24-24: Small thing: the explicit `os.environ.get(...)` is redundant.
The constructor already resolves `SGAI_API_KEY` (and the legacy one) on its own, so passing it in from `os.environ.get` is belt-and-braces. Not wrong, just noise. You could simply write `ScrapeGraphAI()` here and let the tool do its job. Keep it if you prefer explicitness; no harm done.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/scrapegraphai_example.py` at line 24, the example instantiates ScrapeGraphAI by explicitly passing os.environ.get("SGAI_API_KEY"), which is redundant because the ScrapeGraphAI constructor already resolves SGAI_API_KEY (and the legacy key) internally; update the instantiation to call ScrapeGraphAI() with no arguments (i.e., remove the os.environ.get(...) argument) so the constructor handles env var resolution itself, leaving the rest of the example unchanged.
src/agentor/tools/scrapegraphai.py (3)
74-92: The parameter name `format` is shadowing a Python builtin; tidy it up.
Ruff's already whistling at us (A002) on line 74. It won't break anything today, but any code inside `scrape` that reaches for the builtin `format()` will be in for a surprise. Rename it, and the Literal type stays just as tight.
♻️ Proposed rename
    -    def scrape(self, url: str, format: ScrapeFormat = "markdown") -> str:
    +    def scrape(self, url: str, output_format: ScrapeFormat = "markdown") -> str:
             """Fetch a webpage and return its content in the requested format.

             Args:
                 url: The URL to scrape.
    -            format: One of "markdown", "html", "links", "summary". Defaults to markdown.
    +            output_format: One of "markdown", "html", "links", "summary". Defaults to markdown.
             """
             try:
    -            builder = _FORMAT_BUILDERS.get(format)
    +            builder = _FORMAT_BUILDERS.get(output_format)
                 if builder is None:
                     return (
    -                    f"Error in scrape: unsupported format '{format}'. "
    +                    f"Error in scrape: unsupported format '{output_format}'. "
                         "Use one of: markdown, html, links, summary."
                     )
                 result = self.client.scrape(url, formats=[builder()])
                 return _format_result(result, "scrape")
Mind you, this is a public capability signature, and the tests in `tests/tools/test_scrapegraphai.py` (lines 32, 43) and the example docstring currently call it as `format=...`. If you take this route, update those too, or slap a `# noqa: A002` on the line and leave the signature alone. Your call.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agentor/tools/scrapegraphai.py` around lines 74 - 92, The method scrape currently uses the parameter name format which shadows the built-in format() causing linter A002; rename the parameter (for example to out_format or fmt) in the scrape signature (def scrape(self, url: str, out_format: ScrapeFormat = "markdown") -> str), update all internal uses (the lookup _FORMAT_BUILDERS.get(format) -> _FORMAT_BUILDERS.get(out_format) and any references to format within the function such as the client.scrape call and result formatting), and update the public/API callers and tests (tests/tools/test_scrapegraphai.py, example docstring) to pass the new parameter name or call positionally; if you prefer to keep the public name, remove the change and instead add a `# noqa: A002` comment to the original parameter to silence the linter.
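To make the "surprise" concrete, here is a small standalone illustration of what shadowing does. `describe` and `describe_fixed` are hypothetical functions written for this example and are unrelated to the tool's code.

```python
def describe(value: float, format: str = "markdown") -> str:
    # Inside this function, `format` is the string argument, not the builtin.
    try:
        return format(value, ".2f")
    except TypeError as exc:
        return f"TypeError: {exc}"


def describe_fixed(value: float, output_format: str = "markdown") -> str:
    # With a non-shadowing name, the builtin format() is still reachable.
    return f"{output_format}: {format(value, '.2f')}"


print(describe(3.14159))
print(describe_fixed(3.14159))
```

The first call fails because a `str` is not callable; the second works because the parameter no longer hides the builtin.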
81-225: Same try/except/log/format dance repeated eight times; worth a little decorator.
Every capability does the same thing: call the SDK, format the result, catch `Exception`, log, return `f"Error in <name>: ..."`. It's clean enough now, but the next capability you add will copy-paste the same seven lines. A thin wrapper keeps the intent obvious and the surface tight.
♻️ Sketch of a wrapper
    from functools import wraps

    def _safe_capability(name: str):
        def deco(fn):
            @wraps(fn)
            def inner(self, *args, **kwargs):
                try:
                    result = fn(self, *args, **kwargs)
                    return _format_result(result, name)
                except Exception as e:
                    logger.exception("ScrapeGraphAI %s error", name)
                    return f"Error in {name}: {e}"
            return inner
        return deco
Then each capability just returns the raw SDK result (or the unsupported-format string, which would need a small tweak). Not a blocker; file as "next time you're in here."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agentor/tools/scrapegraphai.py` around lines 81 - 225, Introduce a small decorator (e.g. _safe_capability) and apply it to each capability method (scrape, extract, search, crawl, get_crawl_result, monitor, credits, health) to centralize the try/except/log/_format_result pattern: the decorator should call the wrapped method, if the return is a string (existing error message like the unsupported format case in scrape) return it unchanged, otherwise call _format_result(result, name); on exception log with logger.exception("ScrapeGraphAI %s error", name) and return f"Error in {name}: {e}". Update each capability to return the raw SDK result (or the existing string error) and remove the repeated try/except blocks so the decorator handles them.
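A runnable version of that wrapper, including the string pass-through the prompt describes, might look like the sketch below. `_format_result` here is a simplified stand-in (it just dumps JSON), and `DemoTool` is a hypothetical class used only to exercise the decorator.

```python
import json
import logging
from functools import wraps

logger = logging.getLogger(__name__)


def _format_result(result, name: str) -> str:
    """Hypothetical stand-in for the tool's formatter."""
    return json.dumps(result, default=str)


def _safe_capability(name: str):
    def deco(fn):
        @wraps(fn)
        def inner(self, *args, **kwargs):
            try:
                result = fn(self, *args, **kwargs)
                # Pre-built error strings (e.g. unsupported format) pass through unchanged.
                if isinstance(result, str):
                    return result
                return _format_result(result, name)
            except Exception as e:
                logger.exception("ScrapeGraphAI %s error", name)
                return f"Error in {name}: {e}"
        return inner
    return deco


class DemoTool:
    @_safe_capability("scrape")
    def scrape(self, url: str, fmt: str = "markdown"):
        if fmt not in ("markdown", "html"):
            return f"Error in scrape: unsupported format '{fmt}'."
        if not url.startswith("http"):
            raise ValueError("invalid URL")
        return {"url": url, "format": fmt}


tool = DemoTool()
print(tool.scrape("https://example.com"))
print(tool.scrape("https://example.com", fmt="pdf"))
print(tool.scrape("not-a-url"))
```

All three paths (success, pre-built error string, raised exception) end in a plain string, which is the property the repeated try/except blocks were providing.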
186-205: The `monitor` method locks formats to Markdown, same concern as `crawl`. Worth discussing.
Right, listen. You've spotted something worth noting here. The `scrape` method lets callers pick their format using that `ScrapeFormat` knob. Markdown, HTML, links, summary: the lot. But `monitor` and `crawl` both hardcode `MarkdownFormatConfig()` with no way round it. The web confirms `monitor.create` supports the `formats` parameter, so the capability's there; it's just not wired up.
It's workable as is, mind you. Markdown's a sensible default for scheduled monitors. But if an agent needs to ask for HTML or a summary on a scheduled run, they've got nothing. The `_FORMAT_BUILDERS` mapping already exists and handles all four formats cleanly.
The suggested implementation follows the `scrape` pattern directly: add `format: ScrapeFormat = "markdown"` to the signature, use the builder, handle unsupported formats properly. Straightforward piece of work, no complications. Low priority for now, but worth considering when you next touch this code.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agentor/tools/scrapegraphai.py` around lines 186 - 205, The monitor method currently hardcodes Markdown by passing MarkdownFormatConfig() to self.client.monitor.create; change the signature of monitor (the monitor method) to accept a format: ScrapeFormat = "markdown" parameter, use the existing _FORMAT_BUILDERS mapping to build the appropriate format config (like scrape does), replace the hardcoded MarkdownFormatConfig() with the builder output, and raise/handle an error if the provided format is unsupported before calling self.client.monitor.create so monitors can be scheduled in HTML/links/summary as well as markdown.
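One possible shape for that shared lookup is sketched below, so `scrape`, `crawl`, and `monitor` could all validate the format the same way. The four config classes are empty stand-ins for the SDK's classes, and `build_format` is a hypothetical helper, not something in the PR.

```python
from typing import Callable, Dict, Optional, Tuple


# Empty stand-ins for the SDK's format config classes (assumed to take no arguments).
class MarkdownFormatConfig:
    pass


class HtmlFormatConfig:
    pass


class LinksFormatConfig:
    pass


class SummaryFormatConfig:
    pass


_FORMAT_BUILDERS: Dict[str, Callable[[], object]] = {
    "markdown": MarkdownFormatConfig,
    "html": HtmlFormatConfig,
    "links": LinksFormatConfig,
    "summary": SummaryFormatConfig,
}


def build_format(fmt: str, capability: str) -> Tuple[Optional[object], Optional[str]]:
    """Return (config, None) for a known format, or (None, error_string) otherwise."""
    builder = _FORMAT_BUILDERS.get(fmt)
    if builder is None:
        supported = ", ".join(_FORMAT_BUILDERS)
        return None, (
            f"Error in {capability}: unsupported format '{fmt}'. "
            f"Use one of: {supported}."
        )
    return builder(), None


config, error = build_format("html", "monitor")
print(type(config).__name__, error)
config, error = build_format("pdf", "monitor")
print(config, error)
```

A capability would return `error` straight to the agent when it is set, and otherwise pass `formats=[config]` to the SDK call, as `scrape` already does in the proposed diff above.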
📒 Files selected for processing (4)
examples/scrapegraphai_example.py
pyproject.toml
src/agentor/tools/scrapegraphai.py
tests/tools/test_scrapegraphai.py
        resolved_key = (
            api_key
            or os.environ.get("SGAI_API_KEY")
            or os.environ.get("SCRAPEGRAPH_API_KEY")
        )
        self.client = _SGAIClient(api_key=resolved_key)
Silent fallback to None when no API key is resolved.
If the caller passes nothing and neither env var is set, resolved_key quietly becomes None and gets shovelled into the SDK. The SDK will eventually bark, but the error won't be as clean as the one we raise for missing deps just above. A quick guard here saves a confusing stack trace down the road.
🛡️ Proposed guard
             resolved_key = (
                 api_key
                 or os.environ.get("SGAI_API_KEY")
                 or os.environ.get("SCRAPEGRAPH_API_KEY")
             )
    +        if not resolved_key:
    +            raise ValueError(
    +                "ScrapeGraphAI API key not provided. Pass `api_key=...` or set "
    +                "SGAI_API_KEY (or legacy SCRAPEGRAPH_API_KEY) in the environment."
    +            )
             self.client = _SGAIClient(api_key=resolved_key)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/agentor/tools/scrapegraphai.py` around lines 66 - 71, The code silently
passes a None API key into _SGAIClient by assigning resolved_key from api_key or
env vars; add a guard after computing resolved_key (before calling _SGAIClient)
to check if resolved_key is falsy and raise a clear exception (e.g., ValueError
or RuntimeError) with a message instructing the caller to provide api_key or set
SGAI_API_KEY/SCRAPEGRAPH_API_KEY; update the instantiation site where
self.client = _SGAIClient(api_key=resolved_key) to run only after the check so
the error is explicit and not a downstream SDK stack trace.
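The guard and the precedence order are easier to test as a pure function. The sketch below assumes a hypothetical helper `resolve_api_key` that takes the environment as an argument; the error message and env-var names are the ones from the proposed guard.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the first non-empty key: explicit argument, SGAI_API_KEY, then legacy."""
    env = os.environ if env is None else env
    resolved = api_key or env.get("SGAI_API_KEY") or env.get("SCRAPEGRAPH_API_KEY")
    if not resolved:
        raise ValueError(
            "ScrapeGraphAI API key not provided. Pass `api_key=...` or set "
            "SGAI_API_KEY (or legacy SCRAPEGRAPH_API_KEY) in the environment."
        )
    return resolved


print(resolve_api_key(env={"SCRAPEGRAPH_API_KEY": "legacy-key"}))
try:
    resolve_api_key(env={})
except ValueError as exc:
    print("raised:", exc)
```

Passing `env` explicitly lets a test cover each fallback step without patching `os.environ`.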
Summary
- Migrates the `ScrapeGraphAI` tool to the new scrapegraph-py 2.x SDK. The old `Client` class and its methods (`smartscraper`, `markdownify`, `searchscraper`, `smartcrawler`, `sitemap`) no longer exist upstream, so this is a full rewrite against the new endpoint surface.
- Bumps the `scrapegraph` extra to `scrapegraph-py>=2.1.0`.
- Accepts `SGAI_API_KEY` (the new SDK default) and falls back to the legacy `SCRAPEGRAPH_API_KEY` so existing users aren't broken.

New capabilities
- `scrape(url, format)`: `client.scrape(...)` with `Markdown/Html/Links/SummaryFormatConfig`
- `extract(prompt, url, schema)`: `client.extract(...)`, AI structured extraction
- `search(query, num_results, prompt)`: `client.search(...)`
- `crawl(url, max_pages, max_depth, include/exclude)`: `client.crawl.start(...)`
- `get_crawl_result(crawl_id)`: `client.crawl.get(...)`
- `monitor(url, interval, name, webhook_url)`: `client.monitor.create(...)`
- `credits()`: `client.credits()`
- `health()`: `client.health()`

Responses from the SDK are `ApiResult` objects; the tool turns successful results into a JSON string and surfaces `result.error` as `"Error in <capability>: ..."` so the LLM gets a consistent string return.

Breaking change note
The previous capability names (`smartscraper`, `markdownify`, etc.) are removed. Any agent prompt that hard-coded those names needs to be updated; see `examples/scrapegraphai_example.py` for the new surface.

Test plan
- `pytest tests/tools/test_scrapegraphai.py`: 16/16 pass (success paths, API error paths, exception paths, missing-dep guard, env-var resolution including legacy fallback)
- `health`, `credits`, `scrape`, `extract`, `search`, `crawl` (start + `get_crawl_result` poll to completion), and error path for invalid URL

Summary by CodeRabbit
Release Notes
Documentation
`SCRAPEGRAPH_API_KEY` to `SGAI_API_KEY`.
Greptile Summary
This PR rewrites the `ScrapeGraphAI` tool against the scrapegraph-py v2 SDK, replacing the five v1 capabilities (`smartscraper`, `markdownify`, etc.) with eight new ones (`scrape`, `extract`, `search`, `crawl`, `get_crawl_result`, `monitor`, `credits`, `health`). The new env-var fallback chain and `_format_result` helper are well-structured, and the test suite is thorough.

`JsonFormatConfig` is imported from `scrapegraph_py` but never referenced in `_FORMAT_BUILDERS` or any capability method. This will fail the project's ruff linter (F401) in CI. The import (and its `None` assignment in the fallback block) should be removed unless "json" format support is intended.

Confidence Score: 4/5
Safe to merge after fixing the unused `JsonFormatConfig` import, which will fail ruff linting in CI.
One P1 finding (unused import that breaks ruff/CI linting) keeps this at 4. The remaining findings are P2 style/consistency items that do not affect runtime correctness.
`src/agentor/tools/scrapegraphai.py`: remove or wire up the `JsonFormatConfig` import and keep the fallback block in sync.

Important Files Changed
- `JsonFormatConfig` is imported but unused (ruff F401/CI fail), and the fallback `except` block must stay in sync with the try-block imports. Additionally `self.api_key` is set to the unresolved value before env-var lookup.
- `scrapegraph-py` from `>=1.46.0` to `>=2.1.0` in both the `scrapegraph` extra and the `all` extra, correctly paired.
- `SGAI_API_KEY` env var; covers scrape, extract, search, crawl/get_crawl_result, and credits.

Sequence Diagram
    sequenceDiagram
        participant Agent as LLM Agent
        participant Tool as ScrapeGraphAI Tool
        participant SDK as scrapegraph-py v2 SDK
        participant API as ScrapeGraphAI API
        Agent->>Tool: scrape(url, format)
        Tool->>Tool: _FORMAT_BUILDERS[format]()
        Tool->>SDK: client.scrape(url, formats=[...])
        SDK->>API: HTTP POST /scrape
        API-->>SDK: ApiResult
        SDK-->>Tool: ApiResult
        Tool->>Tool: _format_result(result, scrape)
        Tool-->>Agent: JSON string or Error in scrape
        Agent->>Tool: extract(prompt, url, schema)
        Tool->>SDK: client.extract(prompt, url, schema)
        SDK->>API: HTTP POST /extract
        API-->>SDK: ApiResult
        SDK-->>Tool: ApiResult
        Tool-->>Agent: JSON string or Error in extract
        Agent->>Tool: crawl(url, max_pages, max_depth)
        Tool->>SDK: client.crawl.start(url, formats, ...)
        SDK->>API: HTTP POST /crawl
        API-->>SDK: ApiResult with crawl_id
        SDK-->>Tool: ApiResult
        Tool-->>Agent: JSON string with crawl_id
        Agent->>Tool: get_crawl_result(crawl_id)
        Tool->>SDK: client.crawl.get(crawl_id)
        SDK->>API: HTTP GET /crawl/id
        API-->>SDK: ApiResult
        SDK-->>Tool: ApiResult
        Tool-->>Agent: JSON string with status and results

Reviews (1): Last reviewed commit: "feat(scrapegraph): migrate to scrapegrap..."
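The last two exchanges in the diagram (start a crawl, then call `get_crawl_result` until it finishes) imply a polling loop on the caller's side. A rough sketch of such a loop follows, assuming the capability returns either an `"Error in ..."` string or a JSON object with a `status` field. The status names `"completed"` and `"failed"`, the payload shape, and the `poll_crawl` helper are all assumptions made for illustration.

```python
import json
import time
from typing import Callable, Dict

TERMINAL_STATUSES = {"completed", "failed"}  # assumed status names


def poll_crawl(
    get_crawl_result: Callable[[str], str],
    crawl_id: str,
    interval_s: float = 2.0,
    max_attempts: int = 30,
) -> Dict:
    """Poll until the crawl reports a terminal status or the attempt budget runs out."""
    for attempt in range(max_attempts):
        raw = get_crawl_result(crawl_id)
        if raw.startswith("Error in"):
            # The tool reports failures as plain strings, so stop polling here.
            raise RuntimeError(raw)
        payload = json.loads(raw)
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        if attempt < max_attempts - 1:
            time.sleep(interval_s)
    raise TimeoutError(f"crawl {crawl_id} did not finish after {max_attempts} polls")


# Fake backend standing in for the tool's get_crawl_result capability.
_responses = iter(
    ['{"status": "running"}', '{"status": "running"}', '{"status": "completed", "pages": 3}']
)
final = poll_crawl(lambda _id: next(_responses), "demo-id", interval_s=0.0)
print(final)
```

Bounding the loop with `max_attempts` keeps an agent from spinning forever on a crawl that never reaches a terminal state.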