Version(s) affected
5.1.1 (but potentially earlier too)
Description
When PHP is built with the recently(ish) released libxml2 version 2.14, loadHTML and similar functions parse their input according to HTML5 rules.
In src/HtmlConverter.php:121, the famous "XML Hack" is used to make sure that the HTML is loaded as UTF-8. The sanitize function at line 236 of the same file has logic to exclude the XML node by checking the resulting Markdown text.
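For readers unfamiliar with the hack, here is a minimal sketch of the idea. It assumes the dom extension is loaded; the function name loadHtmlAsUtf8 is hypothetical and this is not the library's actual code.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper illustrating the "XML hack": an XML-style encoding
 * declaration is prepended with the intent that DOMDocument treats the
 * input as UTF-8.
 */
function loadHtmlAsUtf8(string $html): DOMDocument
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML('<?xml encoding="UTF-8">' . $html);

    // Restore the previous error-handling state.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $document;
}

$document = loadHtmlAsUtf8('<p>Hello</p>');
echo $document->getElementsByTagName('p')->length, PHP_EOL;
```

How the prepended declaration is represented in the resulting tree depends on the libxml2 version PHP was built with.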
However, due to the libxml2 change, this tag may not be present at the start of the document as expected. We have found cases where it appears as a comment (<!--?xml encoding="UTF-8"-->), either:
- At the start of the document, just like the regular node was before, or
- Inside the <body> of the document
Because of that, the resulting Markdown for some inputs may contain this comment (which is never sanitized) together with some <html> or <head> tags, which the sanitize function does not remove because the HTML no longer matches what it expects. Please check the bugs below for specific failures.
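As a rough illustration of what a more tolerant cleanup could look like, the sketch below removes the declaration in both forms and then peels wrapper tags off the edges of the output. It is an assumption-laden example: stripXmlHackLeftovers is a hypothetical name, it needs PHP 8.0+ for str_starts_with, and it is not the library's sanitize function nor a proposed patch.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not part of html-to-markdown: removes leftovers of
 * the "XML hack" from converted Markdown, in both the old and comment form.
 */
function stripXmlHackLeftovers(string $markdown): string
{
    // Both spellings of the encoding declaration: the processing-instruction
    // form and the comment form reported in this issue.
    $declarations = [
        '<?xml encoding="UTF-8">',
        '<!--?xml encoding="UTF-8"-->',
    ];
    $markdown = str_replace($declarations, '', $markdown);

    // Wrapper tags that may be left at the edges of the output.
    $wrappers = ['<html>', '<head>', '</head>', '<body>', '</body>', '</html>'];

    // Repeat until a full pass changes nothing, so nested wrappers are
    // removed regardless of the order in which they appear.
    do {
        $before = $markdown;
        $markdown = trim($markdown);
        foreach ($wrappers as $tag) {
            if (str_starts_with($markdown, $tag)) {
                $markdown = substr($markdown, strlen($tag));
            }
            if (str_ends_with($markdown, $tag)) {
                $markdown = substr($markdown, 0, -strlen($tag));
            }
        }
    } while ($markdown !== $before);

    return $markdown;
}
```

Note that this removes the declaration wherever it occurs, which could also strip it from legitimate content such as a code sample quoting it; a real patch would need to weigh that.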
This bug was found when packaging html-to-markdown in Debian/Ubuntu:
The sanitize function can be patched to fix these failures, but it may be worth looking into DOM\HTMLDocument instead of DOMDocument for the future, to use HTML5 and UTF-8 by default everywhere.
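A minimal sketch of the DOM\HTMLDocument route, assuming PHP 8.4+ where that class is available. The helper name parseAsHtml5 is hypothetical, and the encoding is passed explicitly here only to make the intent visible.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: parses HTML with the HTML5 parser that ships with
 * PHP 8.4+, without prepending any encoding declaration.
 * Returns null on older PHP versions where the class does not exist.
 */
function parseAsHtml5(string $html): ?\Dom\HTMLDocument
{
    if (!class_exists(\Dom\HTMLDocument::class)) {
        return null;
    }

    // LIBXML_NOERROR silences warnings such as a missing doctype.
    return \Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR, 'UTF-8');
}

$document = parseAsHtml5('<p>café</p>');
if ($document !== null) {
    echo $document->querySelector('p')?->textContent, PHP_EOL;
}
```

Since the class only exists from PHP 8.4 on, the library would need a fallback to DOMDocument for as long as it supports older PHP versions.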
How to reproduce
- Build PHP 8.4 against libxml2 >= 2.14
- Run the test suite
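To check quickly whether a given PHP build is affected before running the whole suite, something like the following diagnostic may help. It is only a sketch: inspectXmlHack and the array keys are hypothetical names, and it needs PHP 8.0+ for str_contains.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical diagnostic helper: reports the libxml2 version PHP was built
 * with and how the prepended declaration is serialized back.
 */
function inspectXmlHack(string $html): array
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $document->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $serialized = (string) $document->saveHTML();

    return [
        'libxml' => LIBXML_DOTTED_VERSION,
        // True when the libxml2 version is one this issue applies to.
        'libxml_2_14_or_later' => version_compare(LIBXML_DOTTED_VERSION, '2.14.0', '>='),
        // True when the declaration came back in the comment form.
        'declaration_as_comment' => str_contains($serialized, '<!--?xml'),
        'serialized' => $serialized,
    ];
}

print_r(inspectXmlHack('<p>Hello</p>'));
```

The serialized output shows where the declaration ended up in the document for the build under test.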