fix(#1184): preserve button text in web read; remove button from STRIPPED_TAGS #1186

Open
SokandeSujal wants to merge 1 commit into jackwener:main from SokandeSujal:fix/issue-1184-button-stripped-tags
Summary

Fixes #1184

opencli web read was silently stripping all <button> elements because 'button' was listed in the STRIPPED_TAGS array, which is fed to TurndownService.remove() before HTML→Markdown conversion. On many real-world pages (exam archives, dashboards, e-commerce sites) buttons carry meaningful labels such as Download All that both humans and AI agents need to see in the output. This also made web read inconsistent with opencli browser state, which does surface button text.

Root Cause

src/download/article-download.ts lines 95-100:

```ts
const STRIPPED_TAGS = [
  'script', 'style', 'noscript', 'canvas',
  'form', 'button', 'dialog', // ← unconditionally removes ALL buttons
  'header', 'footer', 'nav', 'aside',
];
```

Changes

src/download/article-download.ts

  • Removed 'button' from STRIPPED_TAGS
  • Added a dedicated buttonElement Turndown rule that:
    • Preserves the trimmed textContent of every <button> as inline Markdown text
    • Silently drops icon-only / decorative buttons whose trimmed text is empty
  • Buttons inside <form> are unaffected: the parent <form> is still in STRIPPED_TAGS, so the whole form subtree is removed and its children are never visited
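
For reviewers who want a concrete picture of the rule described above, here is a minimal sketch. It is not the code from this PR: the `NodeLike` and `RuleLike` types are hypothetical stand-ins that mirror the `filter` / `replacement` shape Turndown's `addRule()` accepts, so the snippet runs without installing Turndown or a DOM. The exact whitespace the real rule emits around the label is an assumption.

```typescript
// Hypothetical structural types so this sketch runs without turndown or a DOM.
// They mirror the { filter, replacement } shape that addRule() accepts.
interface NodeLike {
  textContent: string | null;
}

interface RuleLike {
  filter: string;
  replacement: (content: string, node: NodeLike) => string;
}

// Assumed reconstruction of the `buttonElement` rule; details may differ
// from the actual implementation in src/download/article-download.ts.
const buttonElementRule: RuleLike = {
  filter: 'button',
  replacement: (_content, node) => {
    // Use the button's own text, trimmed of surrounding whitespace.
    const label = (node.textContent ?? '').trim();
    // Icon-only / decorative buttons have empty text, so this returns ''
    // and nothing is written to the Markdown output.
    return label;
  },
};

// With the real library, registration would look roughly like:
//   turndownService.addRule('buttonElement', buttonElementRule);

console.log(buttonElementRule.replacement('', { textContent: '  Download All ' }));
```

Reading `textContent` instead of the already-converted `content` argument means nested markup inside the button cannot leak Markdown syntax into the label, which is presumably why the description calls out `textContent` specifically.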

src/download/article-download.test.ts

  • Updated the existing form-stripping test to use a clearer button label (click-in-form) to make the assertion unambiguous
  • Added: 'preserves standalone button text content as inline Markdown' - covers the exact Download All scenario from the issue
  • Added: 'drops icon-only buttons that have no visible text' - covers the empty-button edge case
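
The three scenarios these tests cover can be illustrated with a toy model. This is a sketch only: `ToyNode`, `textOf`, and `render` are hypothetical names, the tree walk is a simplified stand-in for the real Turndown pass, and the actual tests in src/download/article-download.test.ts are not reproduced here.

```typescript
// Toy element tree; NOT the project's converter and NOT Turndown itself.
type ToyNode = string | { tag: string; children: ToyNode[] };

// The list from the PR with 'button' removed.
const STRIPPED_TAGS = [
  'script', 'style', 'noscript', 'canvas',
  'form', 'dialog',
  'header', 'footer', 'nav', 'aside',
];

// Concatenate the text of a node and all of its descendants.
function textOf(node: ToyNode): string {
  return typeof node === 'string' ? node : node.children.map(textOf).join('');
}

// Hypothetical stand-in for the conversion pass.
function render(node: ToyNode): string {
  if (typeof node === 'string') return node;
  // A stripped tag removes its whole subtree, so children are never visited.
  if (STRIPPED_TAGS.indexOf(node.tag) !== -1) return '';
  // Buttons contribute their trimmed text; empty ones contribute nothing.
  if (node.tag === 'button') return textOf(node).trim();
  return node.children.map(render).join('');
}

const page: ToyNode = {
  tag: 'div',
  children: [
    { tag: 'button', children: ['  Download All '] },                            // standalone: kept
    { tag: 'button', children: [{ tag: 'svg', children: [] }] },                 // icon-only: dropped
    { tag: 'form', children: [{ tag: 'button', children: ['click-in-form'] }] }, // inside form: stripped
  ],
};

console.log(render(page)); // prints "Download All"
```

The `click-in-form` label never reaches the output because the walk returns at the `form` node before looking at its children, which is the path the updated form-stripping test is meant to pin down.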

Testing

All new and existing tests in src/download/article-download.test.ts cover the fix. The test environment on this machine runs Node 20, which cannot load undici (requires Node ≥ 22); this is a pre-existing environment limitation unrelated to this change - confirmed identical error on main before any changes.

…from STRIPPED_TAGS

opencli web read was silently dropping all <button> elements because
'button' was unconditionally listed in STRIPPED_TAGS, which is passed
to TurndownService.remove() before HTML→Markdown conversion.

On many real-world pages (exam archives, dashboards, e-commerce) buttons
carry meaningful labels such as 'Download All' that agents and humans
need to see.  The old behaviour also made web read inconsistent with
opencli browser state, which does surface button text.

Fix:
- Remove 'button' from STRIPPED_TAGS.
- Add a dedicated 'buttonElement' Turndown rule that preserves the
  trimmed textContent of every <button> as inline Markdown text.
- Icon-only / purely decorative buttons whose trimmed text is empty are
  silently dropped, keeping output clean.
- Buttons inside <form> are unaffected: the parent <form> is still in
  STRIPPED_TAGS, so the entire form subtree is removed before Turndown
  ever visits its children.

Tests:
- Updated the existing form-stripping test to use a more specific marker
  ('click-in-form') so it clearly tests the form-strips-children path.
- Added 'preserves standalone button text content as inline Markdown'
  covering the reported Download All / Copy scenario.
- Added 'drops icon-only buttons that have no visible text' for the
  empty-button edge case.

Fixes jackwener#1184

Development

Successfully merging this pull request may close these issues.

[Bug]: opencli web read strips all <button> elements via STRIPPED_TAGS, losing meaningful content like 'Download All'
