Parse an HTML document and extract the main content, similar to the reader mode in web browsers.

lamasters/html-reader-mode

HTML Reader Mode

A Python library to extract the main content from an HTML document, similar to the "Reader Mode" feature found in web browsers. It filters out navigation, ads, sidebars, and other non-content elements.

Installation

pip install html-reader-mode

Usage

from html_reader_mode import HTMLReaderMode

html_content = """
<html>
    <body>
        <div id="header">Header content</div>
        <div id="content">
            <h1>Article Title</h1>
            <p>This is the main content of the article.</p>
        </div>
        <div id="footer">Footer content</div>
    </body>
</html>
"""

reader = HTMLReaderMode()
content = reader.sanitize(html_content)

print(content)
# Output:
# [{'tag': 'h1', 'content': 'Article Title'}, {'tag': 'p', 'content': 'This is the main content of the article.'}]
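
The output shown above is a list of dictionaries with 'tag' and 'content' keys. As a minimal sketch of how such a list might be turned into readable plain text, the helper below renders heading blocks with leading '#' marks and joins everything with blank lines. The function name blocks_to_text is hypothetical and not part of the library; it only assumes the output shape shown in the example.

```python
def blocks_to_text(blocks):
    """Render a list of {'tag': ..., 'content': ...} blocks as plain text.

    Hypothetical helper, not part of html-reader-mode. It assumes only the
    output shape shown in the usage example above.
    """
    lines = []
    for block in blocks:
        tag = block.get("tag", "")
        text = block.get("content", "").strip()
        if not text:
            continue  # skip blocks with no visible text
        # Prefix h1..h6 with a matching number of '#' characters.
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            lines.append("#" * int(tag[1]) + " " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)


# Sample data copied from the usage example's output.
blocks = [
    {"tag": "h1", "content": "Article Title"},
    {"tag": "p", "content": "This is the main content of the article."},
]
print(blocks_to_text(blocks))
```

Because the helper works on plain dictionaries, it can be tried without installing the library.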

Features

  • Content Extraction: Identifies and extracts the main text blocks.
  • Noise Reduction: Removes scripts, styles, and high-link-density blocks (like navigation menus).
  • Customizable: Configure block tags, script tags, and filtering thresholds.
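
To illustrate what "high-link-density" means in the noise reduction feature, here is a rough sketch of one common way to measure it: the share of a block's visible text that sits inside <a> tags. This is not the library's actual implementation. The class LinkDensityParser, the function link_density, and the 0.5 threshold are all assumptions made for illustration, built only on Python's standard html.parser module.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts visible text characters and how many of them are inside <a> tags.

    Hypothetical illustration; not taken from html-reader-mode.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.total_chars = 0
        self.link_chars = 0
        self._link_depth = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._link_depth += 1
        elif tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth > 0:
            self._link_depth -= 1
        elif tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style bodies
        n = len(data.strip())
        self.total_chars += n
        if self._link_depth:
            self.link_chars += n


def link_density(html_fragment):
    """Return the fraction of visible text that is link text (0.0 to 1.0)."""
    parser = LinkDensityParser()
    parser.feed(html_fragment)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.link_chars / parser.total_chars


# Assumed cutoff for this sketch; the library's real threshold is not documented here.
LINK_DENSITY_THRESHOLD = 0.5

nav = '<div><a href="/">Home</a> <a href="/about">About</a></div>'
article = '<p>This is the main content of the article, with <a href="#">one link</a>.</p>'

for name, fragment in (("nav", nav), ("article", article)):
    density = link_density(fragment)
    verdict = "drop" if density > LINK_DENSITY_THRESHOLD else "keep"
    print(f"{name}: density={density:.2f} -> {verdict}")
```

In this sketch a navigation menu scores close to 1.0 because nearly all of its text is link text, while an article paragraph with an occasional link scores much lower.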

License

MIT
