Problem when converting HTML strings when php is compiled with the new libxml2 (2.14) #263

@renanrodrigo

Version(s) affected

5.1.1 (but potentially earlier too)

Description

When PHP is built with the recently(ish) released libxml2 version 2.14, it uses HTML5 standards when running loadHTML and similar functions.
In src/HtmlConverter.php:121, the famous "XML Hack" is used to make sure that the HTML is loaded as UTF-8. The sanitize function on line 236 of the same file has logic to exclude the XML node by checking the resulting markdown text.
However, due to the libxml2 change, this tag may no longer be present at the start of the document as expected. We have found cases where it appears as a comment (<!--?xml encoding="UTF-8"-->), either:

  • At the start of the document, just like the regular node was before, or
  • Inside the <body> of the document

Because of that, the resulting markdown for some examples may contain this comment (which is never sanitized) together with some <html> or <head> tags, which the sanitize function does not remove because the HTML no longer matches what it expects. Please check the bugs below for specific failures.
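
To illustrate the kind of cleanup that is currently missing, here is a minimal sketch of a helper that removes the XML hack in both its original form and the comment form described above. `stripXmlEncodingHack` is a hypothetical name, not the library's sanitize function, and the sketch assumes the prefix is always spelled `encoding="UTF-8"`. It does not deal with the leftover <html> or <head> tags.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, not part of html-to-markdown.
 * Removes the XML-hack marker wherever it appears in the converted text,
 * whether libxml2 kept it as a processing instruction or turned it into
 * an HTML comment.
 */
function stripXmlEncodingHack(string $markdown): string
{
    // First alternative: the processing-instruction form.
    // Second alternative: the comment form seen with libxml2 2.14.
    $pattern = '/<\?xml encoding="UTF-8"\??>|<!--\?xml encoding="UTF-8"\??-->/i';

    // preg_replace() returns null on failure; fall back to the input then.
    return trim(preg_replace($pattern, '', $markdown) ?? $markdown);
}
```

For example, `stripXmlEncodingHack('<!--?xml encoding="UTF-8"-->Hello **world**')` returns `Hello **world**`, regardless of whether the marker sits at the start or in the middle of the text.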

This bug was found when packaging html-to-markdown in Debian/Ubuntu:

The sanitize function can be patched to fix this test, but maybe taking a look into DOM\HTMLDocument instead of DOMDocument would be nice for the future, to use HTML5 and UTF-8 by default everywhere.
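
As a rough idea of what the DOM\HTMLDocument route could look like, the sketch below parses a string with the HTML5 parser available since PHP 8.4, with no XML prefix involved. `parseHtml5Utf8` is a hypothetical name, and forcing the encoding through the third argument is my assumption about how the library would want to use it.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper, requires PHP >= 8.4.
 * Parses an HTML string with the HTML5 parser and treats it as UTF-8,
 * so no encoding prefix has to be injected into the input.
 */
function parseHtml5Utf8(string $html): Dom\HTMLDocument
{
    // LIBXML_NOERROR suppresses parse warnings, e.g. for input
    // that has no doctype.
    return Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR, 'UTF-8');
}
```

Since nothing is prepended to the input, there is no marker for the sanitize step to look for afterwards. Note that this returns the new Dom\ classes rather than DOMDocument, so code that type-hints the old classes would need adjusting.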

How to reproduce

  • Build php8.4 using libxml2 >= 2.14
  • Run the test suite
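
For a quicker check than the full test suite, something like the following standalone script may help. It is only a sketch: `dumpWithXmlHack` is a hypothetical name, and the prefix string is my assumption of what the XML hack prepends, inferred from the comment shown above. It prints the libxml2 version PHP was built against and how the prefixed input is serialized after parsing.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: parses $html with the XML-hack prefix and returns
 * the document as libxml2 serializes it, so the fate of the prefix is visible.
 */
function dumpWithXmlHack(string $html): string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // saveHTML() returns false on failure; cast keeps the return type simple.
    return (string) $document->saveHTML();
}

echo 'libxml2 ', LIBXML_DOTTED_VERSION, PHP_EOL;
echo dumpWithXmlHack('<p>héllo</p>');
```

Comparing the output between a build with libxml2 older than 2.14 and one with 2.14 or newer should show where the prefix ends up in each case.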
