feat: add HTML to markdown conversion for http_request tool #63

Open: mkmeral wants to merge 3 commits into base: main

@mkmeral mkmeral commented May 30, 2025

Description

This PR enhances the http_request tool with HTML to markdown conversion capabilities, making web content more readable and suitable for AI processing.

Key Features:

  • New Parameter: convert_to_markdown boolean parameter to enable conversion
  • Smart Detection: Automatically detects HTML content by checking Content-Type headers and document structure
  • Clean Conversion: Uses readabilipy to extract main content and markdownify to convert to clean markdown
  • Graceful Fallback: Returns original content if conversion fails
  • User Feedback: Shows success notification when conversion occurs
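The detection, conversion, and fallback steps listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the helper names `looks_like_html` and `html_to_markdown` are hypothetical, and the exact detection rules used by the tool may differ.

```python
from typing import Mapping


def looks_like_html(headers: Mapping[str, str], body: str) -> bool:
    """Heuristic check (hypothetical helper): Content-Type first, then document structure."""
    content_type = ""
    for name, value in headers.items():
        # Header names are case-insensitive, so normalise before comparing.
        if name.lower() == "content-type":
            content_type = value.lower()
            break
    if "text/html" in content_type or "application/xhtml+xml" in content_type:
        return True
    # Fall back to sniffing the start of the body for an HTML doctype or root tag.
    head = body.lstrip()[:1024].lower()
    return head.startswith("<!doctype html") or "<html" in head


def html_to_markdown(html: str) -> str:
    """Extract the main content and convert it to markdown (hypothetical helper).

    Returns the input unchanged if the libraries are missing or conversion fails.
    """
    try:
        from markdownify import ATX, markdownify
        from readabilipy import simple_json_from_html_string

        # readabilipy returns a dict; "content" holds the simplified article HTML.
        article = simple_json_from_html_string(html, use_readability=False)
        content = article.get("content")
        if not content:
            return html
        return markdownify(content, heading_style=ATX)
    except Exception:
        # Graceful fallback: any failure leaves the original response body intact.
        return html
```

A caller would convert only when both the flag is set and the body looks like HTML, for example `body = html_to_markdown(body) if convert_to_markdown and looks_like_html(headers, body) else body`.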

Use Cases:

  • Scraping articles and blog posts for better readability
  • Converting HTML documentation to markdown format
  • Processing web content for AI analysis
  • Creating clean text versions of web pages

Example Usage:

# Convert an HTML webpage to markdown
# (assumes `agent` is already configured with the http_request tool)
response = agent.tool.http_request(
    method="GET",
    url="https://example.com/article",
    convert_to_markdown=True
)

Related Issues

N/A

Documentation PR

N/A - Documentation updated in this PR

Type of Change

  • Bug fix
  • New Tool
  • Enhancement to existing tool
  • Breaking change
  • Other (please describe):

Testing

Automated Testing:

  • hatch fmt --linter
  • hatch fmt --formatter
  • hatch test --all ✅ (540 passed, 5 skipped)

Test Coverage:

  • Added unit tests for HTML conversion functionality
  • Manually tested; Claude's summary of the raw HTML and markdown responses:
## Results Summary:

**First request (without markdown conversion):**
- Retrieved the raw HTML content of the blog post
- Shows the complete HTML structure with all tags, CSS, JavaScript, and metadata
- Content is in its original HTML format with full page structure

**Second request (with markdown conversion):**
- Retrieved the same content but converted to clean, readable markdown format
- Stripped out all the HTML boilerplate, navigation, headers, footers, and styling
- Focused only on the main article content in an easy-to-read markdown format

Checklist

  • I have read the CONTRIBUTING document

  • I have added tests that prove my fix is effective or my feature works

  • I have updated the documentation accordingly

  • I have added an appropriate example to the documentation to outline the feature

  • My changes generate no new warnings

  • Any dependent changes have been merged and published

  • By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

- Add markdownify and readabilipy dependencies
- Add convert_to_markdown parameter to http_request tool
- Automatically detect and convert HTML responses to markdown
- Add tests and documentation with usage examples
@mkmeral mkmeral requested a review from a team as a code owner May 30, 2025 22:23
Member

@awsarron awsarron left a comment

Awesome PR, thank you @mkmeral!

One small comment on the test coverage and then I think this is good to merge.

result_text = extract_result_text(result)
assert "Status Code: 200" in result_text
# The exact markdown format depends on whether the optional packages are installed
# So we just verify that the request succeeded with the parameter

Would be worth testing that the markdown parsing worked I think. Could we assert that HTML tags are not present, and that expected text content is present?

Which optional packages are being referred to in this comment?
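One way the suggested assertions might look is sketched below. The helper names `contains_html_tags` and `check_markdown_result` are hypothetical, and the tag pattern is a simple heuristic rather than a full HTML parser.

```python
import re
from typing import Iterable

# Matches things like <p>, </div>, <br/>, <a href="x">; ignores "1 < 2" and <https://...> autolinks.
HTML_TAG = re.compile(r"</?[a-zA-Z][a-zA-Z0-9]*(\s[^<>]*)?/?>")


def contains_html_tags(text: str) -> bool:
    """Return True if the text still contains something that looks like an HTML tag."""
    return HTML_TAG.search(text) is not None


def check_markdown_result(result_text: str, expected_snippets: Iterable[str]) -> None:
    """Assert that conversion stripped the HTML tags and kept the expected text."""
    assert not contains_html_tags(result_text), "HTML tags leaked into the converted output"
    for snippet in expected_snippets:
        assert snippet in result_text, f"missing expected text: {snippet!r}"


# Example with a hand-written converted response (illustrative data only).
sample = "Status Code: 200\n\n# Article Title\n\nSome paragraph text."
check_markdown_result(sample, ["Status Code: 200", "Article Title", "Some paragraph text."])
```

In the test under review, `result_text` from `extract_result_text(result)` would take the place of `sample`, with snippets taken from the mocked HTML body.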
