fix(structure): fix table cell matching, batch formula inference, and improve markdown output #100
… improve markdown output

- Fix IoA space mismatch in wired table stitching by always using structure cell bboxes; add cross-row OCR deduplication for large cells
- Add `predict_images()` for cross-page formula batching into a single ONNX inference call, reducing overhead for multi-page documents
- Improve markdown: downgrade ABSTRACT/REFERENCES to h2, require text on both sides for inline formulas, add bullet list formatting, fix paragraph continuation across figures/tables
- Speed up formula preprocessing with bilinear resize (~4x faster)
- Remove premature `dedup_by` in `cluster_positions` to match PaddleX
- Use `<br/>` instead of space for multi-line OCR content in table cells
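To make the first bullet concrete, here is a minimal sketch of IoA-based cell assignment, reading IoA as "intersection over area" of the OCR box. The `BBox` type, the `best_cell` helper, and the threshold are hypothetical and are not taken from this PR. The point is only that the score is meaningful when the OCR box and the cell boxes live in the same coordinate space.

```rust
/// Axis-aligned box (x1, y1, x2, y2). All boxes are assumed to share one pixel space.
#[derive(Debug, Clone, Copy)]
pub struct BBox {
    pub x1: f32,
    pub y1: f32,
    pub x2: f32,
    pub y2: f32,
}

impl BBox {
    /// Box area, clamped to zero for degenerate boxes.
    pub fn area(&self) -> f32 {
        (self.x2 - self.x1).max(0.0) * (self.y2 - self.y1).max(0.0)
    }
}

/// Intersection over area: the fraction of `inner` that is covered by `outer`.
pub fn ioa(inner: &BBox, outer: &BBox) -> f32 {
    let w = (inner.x2.min(outer.x2) - inner.x1.max(outer.x1)).max(0.0);
    let h = (inner.y2.min(outer.y2) - inner.y1.max(outer.y1)).max(0.0);
    let area = inner.area();
    if area <= 0.0 {
        0.0
    } else {
        (w * h) / area
    }
}

/// Hypothetical helper: index of the cell covering the largest share of the
/// OCR box, provided that share reaches `threshold`.
pub fn best_cell(ocr: &BBox, cells: &[BBox], threshold: f32) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, cell) in cells.iter().enumerate() {
        let score = ioa(ocr, cell);
        if score >= threshold && best.map_or(true, |(_, s)| score > s) {
            best = Some((i, score));
        }
    }
    best.map(|(i, _)| i)
}
```

If the cell boxes came from a different space than the OCR boxes (for example, crop-relative versus page-relative coordinates), the same function would return misleading scores, which is one plausible reading of the mismatch described above.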
Summary of Changes

This pull request significantly refines the document structure analysis pipeline, focusing on improving accuracy and performance for complex elements like tables and mathematical formulas. It introduces a more robust table cell matching mechanism with OCR deduplication, optimizes formula recognition through cross-page batching, and enhances the quality of markdown output for various document elements. These changes collectively lead to more precise content extraction and a more faithful representation of document structure in the generated output.
Code Review
This pull request introduces several significant improvements and fixes. The main changes include adding batch formula inference via predict_images(), which should improve performance for multi-page documents. The markdown output is enhanced by downgrading some headers, improving inline formula detection, and adding bullet list formatting. There are also important bug fixes in table cell matching and a performance boost in formula preprocessing. The code is well-structured, especially the refactoring to support batch processing. I've found one area for improvement in the new bullet list formatting logic to make it more robust.
Note: Security Review did not run due to the size of the PR.
Pull request overview
This PR improves document-structure postprocessing for OCR results, focusing on more reliable table cell ↔ OCR matching, cross-page batching for formula recognition, and higher-quality Markdown generation.
Changes:
- Adjust table cell clustering/matching and table-cell text joining (including `<br/>` for multi-line cell content).
- Add `predict_images()` to batch formula crops across pages and refactor single-page processing into `prepare_page` + `complete_page`.
- Improve Markdown output rules (heading levels for ABSTRACT/REFERENCES, inline-formula detection tightening, bullet list formatting, and paragraph continuation behavior).
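The cross-page batching idea in the second item can be sketched as follows. This illustrates the general flatten, infer once, regroup pattern only. `Crop`, `batch_across_pages`, and the `infer` closure are hypothetical names, and the actual `predict_images()`, `prepare_page`, and `complete_page` signatures may differ.

```rust
/// Hypothetical stand-in for one formula crop; a real crop would be an image tensor.
pub type Crop = Vec<f32>;

/// Flatten per-page crops into one batch, run a single inference call,
/// then hand each page back exactly the outputs for its own crops.
pub fn batch_across_pages<F>(pages: &[Vec<Crop>], infer: F) -> Vec<Vec<String>>
where
    F: FnOnce(&[Crop]) -> Vec<String>,
{
    // Remember how many crops each page contributed so results can be regrouped.
    let counts: Vec<usize> = pages.iter().map(|p| p.len()).collect();
    let batch: Vec<Crop> = pages.iter().flatten().cloned().collect();

    // One call for the whole document instead of one call per page.
    let outputs = infer(&batch);
    assert_eq!(outputs.len(), batch.len(), "expected one output per crop");

    // Scatter the flat outputs back to pages in their original order.
    let mut iter = outputs.into_iter();
    counts
        .iter()
        .map(|&n| iter.by_ref().take(n).collect())
        .collect()
}
```

Pages without formulas simply receive an empty result list, so the caller can zip the output with its page list without special cases.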
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/oarocr/table_analyzer.rs | Removes early coordinate dedup in `cluster_positions` to align clustering behavior with upstream expectations. |
| src/oarocr/structure.rs | Refactors page processing and adds `predict_images()` with cross-page formula batching. |
| src/oarocr/stitching.rs | Fixes table cell matching space mismatch and changes multi-line cell joins to use `<br/>`. |
| oar-ocr-core/src/processors/formula_preprocess.rs | Switches resize filter to a faster option for formula preprocessing. |
| oar-ocr-core/src/domain/structure.rs | Updates markdown heading semantics and improves paragraph/bullet/inline-formula formatting logic. |
| examples/structure.rs | Updates example CLI flow to use the new batch API. |
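As a small illustration of the `<br/>` join described for src/oarocr/stitching.rs, the sketch below joins the OCR lines of one cell. The function name is hypothetical, and trimming and skipping empty lines is an assumption of this sketch, not something the PR states.

```rust
/// Join the OCR lines that belong to a single table cell.
/// Assumption: lines are trimmed and empty lines are skipped before joining.
pub fn join_cell_lines(lines: &[&str]) -> String {
    lines
        .iter()
        .map(|line| line.trim())
        .filter(|line| !line.is_empty())
        .collect::<Vec<_>>()
        .join("<br/>")
}
```

Joining with a space would keep a Markdown table row on one line as well, but it would lose the line structure inside the cell; `<br/>` keeps the row valid while preserving the breaks.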