feat(omml): add OMML equation extraction utility for DOCX documents (closes #259) #262

Open
Abdeltoto wants to merge 2 commits into HKUDS:main from Abdeltoto:feat/docx-omml-equation-extraction
@Abdeltoto
Summary

Closes #259.

Adds raganything/omml_extractor.py, a zero-dependency, pure-stdlib utility that extracts OMML (Office Math Markup Language) equations from .docx files and converts them to LaTeX, so the structured math content survives the DOCX → PDF conversion that both the MinerU and Docling parsers rely on.

Why

Word stores inline and display math as <m:oMath> elements inside word/document.xml. When a DOCX is ingested today, both parsers convert it to PDF first (LibreOffice for MinerU, native conversion for Docling). During that conversion, equations are typically rasterized to images or replaced with placeholder glyphs — the structured math text is lost before it ever reaches EquationModalProcessor. This silently degrades retrieval quality for academic papers, lecture notes, and engineering reports — exactly the workloads where structured math matters most.

See #259 for the full motivation and the proposed approach.

What this PR adds

Two public functions and one transformer:

  • extract_omml_equations(docx_path) — opens the DOCX as a ZIP, parses word/document.xml with xml.etree, walks every <m:oMath> element in document order, and returns a list of dicts with text (LaTeX), text_format, index, and raw_omml (the source XML, useful for callers that want to plug in their own converter).

  • enrich_content_list_with_docx_equations(content_list, docx_path) — takes a parsed MinerU-compatible content list and appends {"type": "equation", "text": "<latex>", "text_format": "latex", ...} blocks for each extracted equation. Deduplicates against equations already present in the list (e.g. those the parser already produced as image placeholders) by default.

  • omml_to_latex(element) — recursive transformer covering the most common OMML constructs: runs (m:r/m:t), fractions (m:f), super/subscripts (m:sSup/m:sSub/m:sSubSup/m:sPre), radicals (m:rad), n-ary operators (m:nary with the standard sum/prod/integral characters), applied functions (m:func, with whitelist for \sin, \cos, \log, etc.), delimiters (m:d), matrices (m:m), accents (m:acc, m:bar), limits (m:limLow/m:limUpp), grouping characters (m:groupChr), and phantoms (m:phant). Unknown elements gracefully fall back to their text content rather than raising — for RAG indexing, recall beats fidelity.
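
To make the extraction step above concrete, here is a minimal sketch of the ZIP + `xml.etree` walk. It is an illustration only, not the code in `raganything/omml_extractor.py`: the function name `sketch_extract_omml` is hypothetical, and `text` holds plain concatenated run text rather than LaTeX, so `text_format` is set to `"plain"`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# OMML namespace that Word uses for math elements.
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def sketch_extract_omml(docx_file):
    """Return one dict per <m:oMath> element, in document order.

    `docx_file` may be a path or a binary file-like object.
    Hypothetical helper: `text` is plain run text, NOT LaTeX.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    equations = []
    # Element.iter() walks the tree depth-first, i.e. in document order.
    for index, omath in enumerate(root.iter(f"{{{M_NS}}}oMath")):
        text = "".join(t.text or "" for t in omath.iter(f"{{{M_NS}}}t"))
        equations.append(
            {
                "text": text,
                "text_format": "plain",
                "index": index,
                "raw_omml": ET.tostring(omath, encoding="unicode"),
            }
        )
    return equations
```

A real converter would replace the plain-text join with a structural OMML → LaTeX pass; keeping `raw_omml` alongside the text is what lets callers swap in their own converter later.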

Tests

tests/test_omml_extractor.py covers 26 cases, all passing locally:

  • Direct OMML → LaTeX conversion for every supported construct (fractions, sup/sub, radicals, summation with bounds, integrals with default operator, known and unknown functions, delimiters with default and explicit braces, 2×2 matrices, overline/underline, Unicode symbol substitution).
  • End-to-end extraction from in-memory DOCX archives built with zipfile (single equation, multiple equations preserve document order).
  • Enrichment behavior: append, deduplicate against existing equations, opt-out of dedup, empty input returns a defensive copy.
  • Error handling: missing file → FileNotFoundError, non-ZIP → ValueError, ZIP missing word/document.xml → ValueError.

All 26 tests pass with the stdlib alone — no python-docx, no lxml, no pandoc.
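
For reviewers who want to see the error mapping from the last bullet in isolation, the sketch below shows one way to get those three outcomes with the stdlib. The helper name `sketch_read_document_xml` and the message strings are assumptions made for illustration; the module's actual internals may differ.

```python
import os
import zipfile


def sketch_read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a DOCX file.

    Hypothetical helper mirroring the documented error contract:
    missing file -> FileNotFoundError, anything else -> ValueError.
    """
    if not os.path.isfile(docx_path):
        raise FileNotFoundError(f"DOCX not found: {docx_path}")
    try:
        with zipfile.ZipFile(docx_path) as archive:
            # ZipFile.read raises KeyError when the member is absent.
            return archive.read("word/document.xml")
    except zipfile.BadZipFile as exc:
        raise ValueError(f"Not a valid DOCX/ZIP archive: {docx_path}") from exc
    except KeyError as exc:
        raise ValueError(f"word/document.xml missing in: {docx_path}") from exc
```

Collapsing both archive-level failures into ValueError keeps the caller's handling simple: one exception for "the file is not there", one for "the file is not a usable DOCX".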

Backward compatibility

  • New module — no changes to existing parsers or processors.
  • Strictly opt-in: callers explicitly invoke extract_omml_equations() or enrich_content_list_with_docx_equations().
  • Output dicts use the same schema as existing equation blocks, so they slot into EquationModalProcessor without code changes anywhere else.
  • No new required dependencies.

Out of scope (possible follow-ups)

  • Positional placement: DOCX paragraphs do not carry page numbers, so this PR appends extracted equations at the tail of the content list with the last block's page_idx. Precise positional weaving could be a follow-up using raw_omml as a join key against parser-produced image placeholders.
  • Auto-integration into process_document_complete() for .docx inputs — deliberately punted to a second PR so the utility lands first and can be reviewed in isolation.
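
The tail-append behavior described in the first item can be sketched as follows. This is a simplified stand-in, not the PR's `enrich_content_list_with_docx_equations`: the name `sketch_enrich`, the `deduplicate` parameter name, and the default `page_idx` of 0 for an empty list are all assumptions, and deduplication here is exact-text matching against existing equation blocks only.

```python
def sketch_enrich(content_list, equations, deduplicate=True):
    """Append extracted equations to a copy of a MinerU-style content list.

    Hypothetical helper. Every appended block inherits the page_idx of the
    last existing block, so page-level position is NOT precise.
    """
    enriched = list(content_list)  # shallow copy; the input list is not mutated
    if not equations:
        return enriched

    # Assumption: fall back to page 0 when there is no block to inherit from.
    last_page = enriched[-1].get("page_idx", 0) if enriched else 0

    # Exact-text matching only: parser placeholders will not be recognized.
    existing = {
        block.get("text") for block in enriched if block.get("type") == "equation"
    }

    for eq in equations:
        if deduplicate and eq["text"] in existing:
            continue
        enriched.append(
            {
                "type": "equation",
                "text": eq["text"],
                "text_format": "latex",
                "page_idx": last_page,
            }
        )
    return enriched
```

Because matching is by exact text, an existing block whose text is a placeholder (an image reference, say) does not suppress the LaTeX version of the same formula; both end up in the list.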

Test plan

  • All 26 unit tests pass locally on Python 3.12 / Windows.
  • ruff format and ruff check --ignore=E402 pass on the new files.
  • Maintainer-side: try it on a real .docx with mixed inline and display math and confirm the LaTeX is searchable inside the resulting knowledge graph.

Note: the test file is added with git add -f because .gitignore includes a broad test_* pattern that would otherwise hide it; the existing tests/test_*.py files are tracked the same way, so this matches the established convention.

@Abdeltoto Abdeltoto marked this pull request as ready for review April 22, 2026 03:45
@LarFii
Collaborator

LarFii commented Apr 25, 2026

Thanks for the detailed implementation. The opt-in utility direction makes sense, but I found a few issues that should be addressed or explicitly documented before merge.

  1. P1: several OMML conversion paths can crash on malformed or partial OMML. For example, fraction/script/radical helpers call child conversion even when a required child like num or e is missing. Since the extractor is intended to prefer recall and tolerate imperfect documents, these should return an empty string / fallback text rather than raising from _convert_children(None).

  2. P1: unknown m:nary operators appear to fall back to \int. That silently turns an unsupported operator into an integral, which is worse than preserving the original Unicode/operator text for retrieval. Please preserve the unknown operator or use an explicit unknown placeholder instead of defaulting to integral.

  3. P1/P2: deduplicate_existing_equations is exact-text-only. That is fine for an opt-in utility, but it will not deduplicate parser-produced formula placeholders or image-derived equations from Docling/MinerU. Please document that limitation clearly, or add a test showing the expected behavior when existing equation blocks contain non-LaTeX placeholder text.

  4. P2: all appended equations inherit the last content block's page_idx and are appended at the end. Please document this as a known limitation so users do not treat page-level filtering/order as precise for the enriched equations.

The overall approach is useful, and I would be happy to re-review after the crash path and unknown n-ary fallback are fixed or covered by tests/limitations.

Add `raganything/omml_extractor.py`, a zero-dependency, pure-stdlib utility
that extracts Office Math Markup Language (OMML) equations from DOCX files
and converts them to LaTeX so the structured math content survives the
DOCX → PDF conversion that both the MinerU and Docling parsers rely on.

Why
---
Word stores inline and display math as `<m:oMath>` elements inside
`word/document.xml`. When a DOCX is converted to PDF (via LibreOffice for
MinerU, or natively for Docling), equations are typically rasterized to
images or replaced with placeholder glyphs. The structured math text is
lost, which silently degrades retrieval quality on technical documents
(papers, lecture notes, engineering reports) — exactly the workloads where
the current parsers are weakest.

What
----
Two public functions:

- `extract_omml_equations(docx_path)`: opens the DOCX as a ZIP archive,
  parses `word/document.xml` with the stdlib `xml.etree`, walks every
  `<m:oMath>` element in document order, and returns a list of dicts with
  `text` (LaTeX), `text_format`, `index`, and `raw_omml` (the source XML
  for callers that want to plug in their own converter).

- `enrich_content_list_with_docx_equations(content_list, docx_path)`:
  takes a parsed MinerU-compatible content list and appends one
  `{"type": "equation", "text": "<latex>", "text_format": "latex", ...}`
  block per extracted equation. Deduplicates against equations already
  present in the list (e.g. those the parser already produced as image
  placeholders) by default, with an opt-out flag for callers that want the
  raw extraction.

A recursive `omml_to_latex(element)` transformer handles the most common
OMML constructs: runs (`m:r`/`m:t`), fractions (`m:f`), super/subscripts
(`m:sSup`/`m:sSub`/`m:sSubSup`/`m:sPre`), radicals (`m:rad`), n-ary
operators (`m:nary` with the standard sum/prod/integral characters),
applied functions (`m:func`, with whitelist for `\sin`, `\cos`, `\log`,
etc.), delimiters (`m:d`), matrices (`m:m`), accents (`m:acc`, `m:bar`),
limits (`m:limLow`/`m:limUpp`), grouping characters (`m:groupChr`), and
phantoms (`m:phant`). Unknown elements gracefully fall back to their text
content rather than raising — for RAG indexing, recall beats fidelity.
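
As a rough picture of the dispatch-table design described above, the sketch below handles only runs, fractions, and superscripts, and falls back to text for everything else. It is a reduced illustration under assumed names (`SKETCH_HANDLERS`, `sketch_omml_to_latex`, `_children`), not the ~400-line converter in this commit.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def _tag(name):
    """Build the namespaced tag ElementTree uses for an m:* element."""
    return f"{{{M_NS}}}{name}"


def _children(element):
    """Convert and concatenate all child elements; None yields ''."""
    if element is None:
        return ""
    return "".join(sketch_omml_to_latex(child) for child in element)


def _h_run(el):
    # m:r -> the text of its m:t children (run properties are ignored).
    return "".join(t.text or "" for t in el.findall(_tag("t")))


def _h_frac(el):
    num = _children(el.find(_tag("num")))
    den = _children(el.find(_tag("den")))
    return "\\frac{%s}{%s}" % (num, den)


def _h_sup(el):
    base = _children(el.find(_tag("e")))
    sup = _children(el.find(_tag("sup")))
    return "%s^{%s}" % (base, sup)


# Hypothetical dispatch table: one handler per supported OMML tag.
SKETCH_HANDLERS = {_tag("r"): _h_run, _tag("f"): _h_frac, _tag("sSup"): _h_sup}


def sketch_omml_to_latex(element):
    handler = SKETCH_HANDLERS.get(element.tag)
    if handler is not None:
        return handler(element)
    # Unknown element: keep its own text and recurse instead of raising.
    return (element.text or "").strip() + _children(element)
```

Adding a construct in this layout means writing one small handler and registering it in the table, which keeps each case independently testable.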

Tests
-----
`tests/test_omml_extractor.py` covers 26 cases:
- direct OMML → LaTeX conversion of every supported construct;
- end-to-end extraction from in-memory DOCX archives built with `zipfile`;
- enrichment behavior (append, deduplicate, copy semantics, empty input);
- error handling for missing files, invalid ZIPs, and DOCX archives that
  lack `word/document.xml`.

All tests pass with the stdlib alone — no `python-docx`, no `lxml`, no
`pandoc` required.

Notes for reviewers
-------------------
- The helper currently appends equations at the tail of the content list
  with the last block's `page_idx`. Positional placement inside a
  PDF-derived list is intrinsically lossy (DOCX paragraphs do not carry
  page numbers); precise positioning could be a follow-up that uses the
  `raw_omml` field as a join key against the parsed image placeholders.
- The `omml_to_latex` converter is intentionally compact (~400 LoC). It
  covers the high-frequency cases observed in academic and engineering
  documents; rare constructs (e.g. `m:eqArr` equation arrays) fall through
  to text concatenation. Extending coverage is straightforward — add a new
  handler to `_HANDLERS` and a test case.

Made-with: Cursor
…tors

Address review feedback on PR HKUDS#262:

- _convert_children() now accepts None and returns "" so that fraction,
  script, and radical handlers degrade to an empty string when a
  required child (m:num, m:den, m:e, m:deg, m:sub, m:sup, ...) is
  missing, instead of raising from list iteration on None. Keeps the
  module on its recall-over-correctness contract for malformed DOCX.

- _h_nary() now preserves the original Unicode character when the
  m:chr value is not in our LaTeX table (was silently rewritten to
  \int). The default-to-\int fallback is kept for the no-chr case
  per ECMA-376 §22.1.2.74.

- Documented limitations of enrich_content_list_with_docx_equations:
  exact-text-only deduplication (does not match parser placeholders or
  image-OCR equations), and tail-append with inherited page_idx.

- Added robustness tests under TestRobustness covering the new
  fallback paths and the unknown-operator preservation.

Made-with: Cursor
@Abdeltoto Abdeltoto force-pushed the feat/docx-omml-equation-extraction branch from 2b0e840 to 8363607 Compare April 25, 2026 22:28
@Abdeltoto
Author

Thanks @LarFii — addressed all four:

  1. _convert_children now tolerates None, so fraction/script/radical handlers degrade to "" on missing children instead of crashing.
  2. _h_nary preserves the original Unicode character when m:chr is unknown (was silently rewritten to \int). Default-to-\int kept for the no-chr case per ECMA-376 §22.1.2.74.
  3. & 4. Added a Limitations block to the enrich helper docstring covering exact-text dedup scope and the tail-append / inherited page_idx.

A few small tests added for the new fallbacks.
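
For reference, the n-ary fallback in point 2 behaves roughly like the sketch below. This is a simplified stand-in for `_h_nary`, not the committed code: the names `sketch_nary`, `NARY_LATEX`, and `_text_of` are illustrative, the table lists only three operators, and bounds and body are flattened to plain text rather than converted recursively.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"

# Illustrative subset of the operator table.
NARY_LATEX = {"\u2211": "\\sum", "\u220f": "\\prod", "\u222b": "\\int"}


def _tag(name):
    return f"{{{M_NS}}}{name}"


def _text_of(element):
    """Concatenate all text under an element; a missing element yields ''."""
    if element is None:
        return ""
    return "".join(element.itertext()).strip()


def sketch_nary(nary):
    chr_el = nary.find(f"{_tag('naryPr')}/{_tag('chr')}")
    if chr_el is None:
        # No m:chr at all: keep the integral default.
        operator = "\\int"
    else:
        value = chr_el.get(_tag("val"), "")
        # Unknown operator: preserve the original character as-is.
        operator = NARY_LATEX.get(value, value)

    lower = _text_of(nary.find(_tag("sub")))
    upper = _text_of(nary.find(_tag("sup")))
    body = _text_of(nary.find(_tag("e")))

    out = operator
    if lower:
        out += "_{%s}" % lower
    if upper:
        out += "^{%s}" % upper
    return f"{out} {body}".strip()
```

With this shape, an operator such as U+2A01 (circled plus) stays in the output verbatim and remains searchable, while a bare `m:nary` with no `m:chr` still renders as an integral.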

Development

Successfully merging this pull request may close these issues.

[Feature Request]: Extract OMML equations from DOCX before parser converts them away

2 participants