feat(omml): add OMML equation extraction utility for DOCX documents (closes #259) #262
Abdeltoto wants to merge 2 commits into HKUDS:main
Thanks for the detailed implementation. The opt-in utility direction makes sense, but I found a few issues that should be addressed or explicitly documented before merge.
The overall approach is useful, and I would be happy to re-review after the crash path and unknown n-ary fallback are fixed or covered by tests/limitations.
Add `raganything/omml_extractor.py`, a zero-dependency, pure-stdlib utility
that extracts Office Math Markup Language (OMML) equations from DOCX files
and converts them to LaTeX so the structured math content survives the
DOCX → PDF conversion that both the MinerU and Docling parsers rely on.
Why
---
Word stores inline and display math as `<m:oMath>` elements inside
`word/document.xml`. When a DOCX is converted to PDF (via LibreOffice for
MinerU, or natively for Docling), equations are typically rasterized to
images or replaced with placeholder glyphs. The structured math text is
lost, which silently degrades retrieval quality on technical documents
(papers, lecture notes, engineering reports) — exactly the workloads where
the current parsers are weakest.
What
----
Two public functions:
- `extract_omml_equations(docx_path)`: opens the DOCX as a ZIP archive,
parses `word/document.xml` with the stdlib `xml.etree`, walks every
`<m:oMath>` element in document order, and returns a list of dicts with
`text` (LaTeX), `text_format`, `index`, and `raw_omml` (the source XML
for callers that want to plug in their own converter).
- `enrich_content_list_with_docx_equations(content_list, docx_path)`:
takes a parsed MinerU-compatible content list and appends one
`{"type": "equation", "text": "<latex>", "text_format": "latex", ...}`
block per extracted equation. Deduplicates against equations already
present in the list (e.g. those the parser already produced as image
placeholders) by default, with an opt-out flag for callers that want the
raw extraction.
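The extraction step described above can be sketched with the stdlib alone. This is a minimal illustration, not the code in `raganything/omml_extractor.py`: the helper names `find_omath_elements`, `plain_text`, and `make_docx` are hypothetical, and `plain_text` only concatenates `m:t` runs as a stand-in for the real LaTeX converter.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# OMML namespace that the m: prefix is bound to in word/document.xml.
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def find_omath_elements(docx_bytes: bytes) -> list:
    """Return every <m:oMath> element of word/document.xml in document order."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    # Element.iter() walks the tree depth-first, which is document order.
    return list(root.iter(f"{{{M_NS}}}oMath"))


def plain_text(omath: ET.Element) -> str:
    """Concatenate <m:t> runs (hypothetical stand-in for a LaTeX converter)."""
    return "".join(t.text or "" for t in omath.iter(f"{{{M_NS}}}t"))


def make_docx(*equations: str) -> bytes:
    """Build a minimal in-memory DOCX holding one <m:oMath> per string."""
    body = "".join(
        f"<w:p><m:oMath><m:r><m:t>{eq}</m:t></m:r></m:oMath></w:p>"
        for eq in equations
    )
    xml = (
        f'<w:document xmlns:w="{W_NS}" xmlns:m="{M_NS}">'
        f"<w:body>{body}</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


# One dict per equation, mirroring the keys named in the description above.
equations = [
    {
        "index": i,
        "text": plain_text(el),
        "text_format": "latex",
        "raw_omml": ET.tostring(el, encoding="unicode"),
    }
    for i, el in enumerate(find_omath_elements(make_docx("a+b", "x=1")))
]
```

Building the archive in memory with `zipfile` keeps the sketch free of fixture files, which is also how the test suite is described as working.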
A recursive `omml_to_latex(element)` transformer handles the most common
OMML constructs: runs (`m:r`/`m:t`), fractions (`m:f`), super/subscripts
(`m:sSup`/`m:sSub`/`m:sSubSup`/`m:sPre`), radicals (`m:rad`), n-ary
operators (`m:nary` with the standard sum/prod/integral characters),
applied functions (`m:func`, with a whitelist for `\sin`, `\cos`, `\log`,
etc.), delimiters (`m:d`), matrices (`m:m`), accents (`m:acc`, `m:bar`),
limits (`m:limLow`/`m:limUpp`), grouping characters (`m:groupChr`), and
phantoms (`m:phant`). Unknown elements gracefully fall back to their text
content rather than raising — for RAG indexing, recall beats fidelity.
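The handler-dispatch idea behind such a recursive transformer can be sketched as follows. This is an illustrative reduction covering only runs, fractions, and superscripts; the names `to_latex`, `convert_children`, and `HANDLERS` are assumptions for this sketch and may not match the module's internals.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def m(tag: str) -> str:
    """Qualify a local name with the OMML namespace, ElementTree style."""
    return f"{{{M_NS}}}{tag}"


def convert_children(element) -> str:
    """Convert each child in order; a missing (None) parent yields ''."""
    if element is None:
        return ""
    return "".join(to_latex(child) for child in element)


def h_run(el) -> str:
    # m:r holds its literal text in m:t children.
    return "".join(t.text or "" for t in el.findall(m("t")))


def h_frac(el) -> str:
    num = convert_children(el.find(m("num")))
    den = convert_children(el.find(m("den")))
    return f"\\frac{{{num}}}{{{den}}}"


def h_sup(el) -> str:
    base = convert_children(el.find(m("e")))
    sup = convert_children(el.find(m("sup")))
    return f"{{{base}}}^{{{sup}}}"


HANDLERS = {m("r"): h_run, m("f"): h_frac, m("sSup"): h_sup}


def to_latex(el) -> str:
    handler = HANDLERS.get(el.tag)
    if handler is not None:
        return handler(el)
    # Unknown element: recurse so that nested text still surfaces.
    return convert_children(el)


sample = ET.fromstring(
    f'<m:oMath xmlns:m="{M_NS}">'
    "<m:f><m:num><m:sSup><m:e><m:r><m:t>x</m:t></m:r></m:e>"
    "<m:sup><m:r><m:t>2</m:t></m:r></m:sup></m:sSup></m:num>"
    "<m:den><m:r><m:t>y</m:t></m:r></m:den></m:f>"
    "</m:oMath>"
)
latex = to_latex(sample)  # \frac{{x}^{2}}{y}
```

In this sketch, supporting a new construct means writing one function and registering it in the dispatch table, while anything unregistered degrades to the text of its descendants instead of raising.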
Tests
-----
`tests/test_omml_extractor.py` covers 26 cases:
- direct OMML → LaTeX conversion of every supported construct;
- end-to-end extraction from in-memory DOCX archives built with `zipfile`;
- enrichment behavior (append, deduplicate, copy semantics, empty input);
- error handling for missing files, invalid ZIPs, and DOCX archives that
lack `word/document.xml`.
All tests pass with the stdlib alone — no `python-docx`, no `lxml`, no
`pandoc` required.
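The error-handling cases in the list above map onto stdlib exceptions fairly directly. The sketch below shows one way to get the behaviour the PR summary reports (missing file raises `FileNotFoundError`, non-ZIP and missing `word/document.xml` raise `ValueError`); `read_document_xml` is a hypothetical helper name, and the messages are illustrative.

```python
import io
import zipfile


def read_document_xml(docx):
    """Return the bytes of word/document.xml from a path or binary file object."""
    try:
        # A missing path raises FileNotFoundError here and is left to propagate.
        with zipfile.ZipFile(docx) as archive:
            return archive.read("word/document.xml")
    except zipfile.BadZipFile as exc:
        raise ValueError(f"not a valid DOCX (ZIP) archive: {docx!r}") from exc
    except KeyError as exc:
        # ZipFile.read() raises KeyError when the member is absent.
        raise ValueError(f"word/document.xml is missing from {docx!r}") from exc


# Usage: a buffer that is not a ZIP archive is reported as ValueError.
try:
    read_document_xml(io.BytesIO(b"plainly not a zip"))
except ValueError as err:
    outcome = str(err)
```

Translating both archive-level failures into `ValueError` gives callers a single exception type for "this is not a usable DOCX", separate from the file simply not existing.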
Notes for reviewers
-------------------
- The helper currently appends equations at the tail of the content list
with the last block's `page_idx`. Positional placement inside a
PDF-derived list is intrinsically lossy (DOCX paragraphs do not carry
page numbers); precise positioning could be a follow-up that uses the
`raw_omml` field as a join key against the parsed image placeholders.
- The `omml_to_latex` converter is intentionally compact (~400 LoC). It
covers the high-frequency cases observed in academic and engineering
documents; rare constructs (e.g. `m:eqArr` equation arrays) fall through
to text concatenation. Extending coverage is straightforward — add a new
handler to `_HANDLERS` and a test case.
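The tail-append behaviour from the first note can be sketched as below. This is a simplified stand-in, not the PR's `enrich_content_list_with_docx_equations`: the function name `enrich_with_equations`, returning a deep copy, exact-text matching for deduplication, and defaulting `page_idx` to 0 for an empty list are all assumptions made for illustration.

```python
import copy


def enrich_with_equations(content_list, equations, deduplicate=True):
    """Append equation blocks at the tail of a copy of content_list."""
    enriched = copy.deepcopy(content_list)  # leave the caller's list untouched
    # Tail-append: every new block inherits the last block's page_idx.
    page_idx = enriched[-1].get("page_idx", 0) if enriched else 0
    # Exact-text deduplication against equation blocks already present.
    seen = {
        block.get("text", "").strip()
        for block in enriched
        if block.get("type") == "equation"
    }
    for eq in equations:
        latex = eq["text"].strip()
        if deduplicate and latex in seen:
            continue
        seen.add(latex)
        enriched.append(
            {
                "type": "equation",
                "text": latex,
                "text_format": "latex",
                "page_idx": page_idx,
            }
        )
    return enriched


parsed = [
    {"type": "text", "text": "Intro", "page_idx": 0},
    {"type": "equation", "text": "E=mc^2", "text_format": "latex", "page_idx": 3},
]
extracted = [{"text": "E=mc^2"}, {"text": "\\frac{a}{b}"}]
result = enrich_with_equations(parsed, extracted)
```

Here `result` gains only the fraction, placed on page 3 because that is the `page_idx` of the last parsed block, which illustrates why the note calls positional placement lossy.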
Made-with: Cursor
…tors
Address review feedback on PR HKUDS#262:
- _convert_children() now accepts None and returns "" so that fraction,
script, and radical handlers degrade to an empty string when a required
child (m:num, m:den, m:e, m:deg, m:sub, m:sup, ...) is missing, instead of
raising from list iteration on None. Keeps the module on its
recall-over-correctness contract for malformed DOCX.
- _h_nary() now preserves the original Unicode character when the m:chr
value is not in our LaTeX table (was silently rewritten to \int). The
default-to-\int fallback is kept for the no-chr case per ECMA-376
§22.1.2.74.
- Documented limitations of enrich_content_list_with_docx_equations:
exact-text-only deduplication (does not match parser placeholders or
image-OCR equations), and tail-append with inherited page_idx.
- Added robustness tests under TestRobustness covering the new fallback
paths and the unknown-operator preservation.
Made-with: Cursor
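The n-ary fallback rule from this commit message can be illustrated in isolation. The sketch below is an assumption-laden reduction: `nary_operator`, `make_nary`, and the three-entry `NARY_LATEX` table are hypothetical, and the real `_h_nary()` also has to handle limits and the operand, which are omitted here.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
M = f"{{{M_NS}}}"

# Deliberately partial table (assumption): sum, product, integral.
NARY_LATEX = {"\u2211": "\\sum", "\u220f": "\\prod", "\u222b": "\\int"}


def nary_operator(nary: ET.Element) -> str:
    """Pick the LaTeX operator for an m:nary element."""
    chr_el = nary.find(f"{M}naryPr/{M}chr")
    if chr_el is None:
        # No m:chr at all: keep the integral default described above.
        return "\\int"
    char = chr_el.get(f"{M}val", "")
    # Known character: map to LaTeX. Unknown: preserve it verbatim
    # rather than rewriting it to \int.
    return NARY_LATEX.get(char, char)


def make_nary(char=None) -> ET.Element:
    """Build a bare m:nary element, optionally carrying an m:chr value."""
    pr = f'<m:naryPr><m:chr m:val="{char}"/></m:naryPr>' if char is not None else ""
    return ET.fromstring(f'<m:nary xmlns:m="{M_NS}">{pr}</m:nary>')


known = nary_operator(make_nary("\u2211"))    # summation sign
unknown = nary_operator(make_nary("\u2a01"))  # circled plus, not in the table
default = nary_operator(make_nary())          # no m:chr element
```

Passing the unknown character through keeps the operator searchable as text, whereas rewriting it to `\int` would index a different equation from the one in the document.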
2b0e840 to 8363607
Thanks @LarFii — addressed all four:
A few small tests added for the new fallbacks.
Summary
Closes #259.
Adds `raganything/omml_extractor.py`, a zero-dependency, pure-stdlib utility that extracts OMML (Office Math Markup Language) equations from `.docx` files and converts them to LaTeX, so the structured math content survives the DOCX → PDF conversion that both the MinerU and Docling parsers rely on.
Why
Word stores inline and display math as `<m:oMath>` elements inside `word/document.xml`. When a DOCX is ingested today, both parsers convert it to PDF first (LibreOffice for MinerU, native conversion for Docling). During that conversion, equations are typically rasterized to images or replaced with placeholder glyphs, so the structured math text is lost before it ever reaches `EquationModalProcessor`. This silently degrades retrieval quality for academic papers, lecture notes, and engineering reports, exactly the workloads where structured math matters most.
See #259 for the full motivation and the proposed approach.
What this PR adds
Two public functions and one transformer:
- `extract_omml_equations(docx_path)`: opens the DOCX as a ZIP, parses `word/document.xml` with `xml.etree`, walks every `<m:oMath>` element in document order, and returns a list of dicts with `text` (LaTeX), `text_format`, `index`, and `raw_omml` (the source XML, useful for callers that want to plug in their own converter).
- `enrich_content_list_with_docx_equations(content_list, docx_path)`: takes a parsed MinerU-compatible content list and appends `{"type": "equation", "text": "<latex>", "text_format": "latex", ...}` blocks for each extracted equation. Deduplicates against equations already present in the list (e.g. those the parser already produced as image placeholders) by default.
- `omml_to_latex(element)`: recursive transformer covering the most common OMML constructs: runs (`m:r`/`m:t`), fractions (`m:f`), super/subscripts (`m:sSup`/`m:sSub`/`m:sSubSup`/`m:sPre`), radicals (`m:rad`), n-ary operators (`m:nary` with the standard sum/prod/integral characters), applied functions (`m:func`, with a whitelist for `\sin`, `\cos`, `\log`, etc.), delimiters (`m:d`), matrices (`m:m`), accents (`m:acc`, `m:bar`), limits (`m:limLow`/`m:limUpp`), grouping characters (`m:groupChr`), and phantoms (`m:phant`). Unknown elements gracefully fall back to their text content rather than raising; for RAG indexing, recall beats fidelity.
Tests
`tests/test_omml_extractor.py` covers 26 cases, all passing locally:
- … `zipfile` (single equation, multiple equations preserve document order).
- … `FileNotFoundError`, non-ZIP → `ValueError`, ZIP missing `word/document.xml` → `ValueError`.
All 26 tests pass with the stdlib alone: no `python-docx`, no `lxml`, no `pandoc`.
Backward compatibility
- … `extract_omml_equations()` or `enrich_content_list_with_docx_equations()`.
- … `EquationModalProcessor` without code changes anywhere else.
Out of scope (possible follow-ups)
- … `page_idx`. Precise positional weaving could be a follow-up using `raw_omml` as a join key against parser-produced image placeholders.
- … `process_document_complete()` for `.docx` inputs, deliberately punted to a second PR so the utility lands first and can be reviewed in isolation.
Test plan
- `ruff format` and `ruff check --ignore=E402` pass on the new files.
- … `.docx` with mixed inline and display math and confirm the LaTeX is searchable inside the resulting knowledge graph.
Note: the test file is added with `git add -f` because `.gitignore` includes a broad `test_*` pattern that would otherwise hide it; the existing `tests/test_*.py` files are tracked the same way, so this matches the established convention.