fix(structure): fix table cell matching, batch formula inference, and improve markdown output #100

Merged: GreatV merged 2 commits into main from fix_2 on Mar 8, 2026

Conversation

@GreatV (Owner) commented Mar 8, 2026

  • Fix IoA space mismatch in wired table stitching by always using structure cell bboxes; add cross-row OCR deduplication for large cells
  • Add predict_images() for cross-page formula batching into a single ONNX inference call, reducing overhead for multi-page documents
  • Improve markdown: downgrade ABSTRACT/REFERENCES to h2, require text on both sides for inline formulas, add bullet list formatting, fix paragraph continuation across figures/tables
  • Speed up formula preprocessing with bilinear resize (~4x faster)
  • Remove premature dedup_by in cluster_positions to match PaddleX
  • Use <br/> instead of space for multi-line OCR content in table cells
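
The inline-formula rule in the list above ("require text on both sides") can be illustrated with a minimal sketch. The `Kind` enum and `is_inline_formula` function are hypothetical names invented for this example and are not taken from the oar-ocr code base. The sketch assumes the elements of one line are already sorted left to right.

```rust
/// Hypothetical element kinds for one line of a page, sorted left to right.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Text,
    Formula,
    Other,
}

/// Returns true when the element at `idx` is a formula that has a text
/// element immediately before AND immediately after it.
pub fn is_inline_formula(line: &[Kind], idx: usize) -> bool {
    if line.get(idx) != Some(&Kind::Formula) {
        return false;
    }
    let has_text_before = idx > 0 && line[idx - 1] == Kind::Text;
    let has_text_after = line.get(idx + 1) == Some(&Kind::Text);
    has_text_before && has_text_after
}
```

Under this rule a formula at the start or end of a line, or one next to a figure, would be treated as a display formula rather than an inline one.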

@GreatV GreatV requested a review from Copilot March 8, 2026 02:37
@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines the document structure analysis pipeline, focusing on improving accuracy and performance for complex elements like tables and mathematical formulas. It introduces a more robust table cell matching mechanism with OCR deduplication, optimizes formula recognition through cross-page batching, and enhances the quality of markdown output for various document elements. These changes collectively lead to more precise content extraction and a more faithful representation of document structure in the generated output.

Highlights

  • Table Cell Matching & OCR Deduplication: Improved table cell content extraction by consistently using structure cell bounding boxes for Intersection-over-Area (IoA) matching and added cross-row OCR deduplication for large cells to prevent content duplication.
  • Batch Formula Inference: Implemented predict_images() to enable batch processing of formulas across multiple pages, significantly reducing ONNX inference overhead for multi-page documents.
  • Markdown Output Enhancements: Enhanced markdown generation by downgrading "ABSTRACT" and "REFERENCES" titles to H2, requiring text on both sides for inline formula classification, adding bullet list formatting, and ensuring correct paragraph continuation across figures and tables.
  • Formula Preprocessing Speed: Accelerated formula preprocessing by switching to bilinear resize (FilterType::Triangle), resulting in approximately 4x faster execution.
  • Table Analysis Alignment: Adjusted table position clustering by removing a premature deduplication step to better align with PaddleX's behavior.
  • Multi-line Table Content: Changed the separator for multi-line OCR content within table cells from a space to <br/> for better markdown rendering.
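
The Intersection-over-Area (IoA) matching named in the first highlight can be sketched as follows. This is an illustration only: `BBox`, `ioa`, and `best_cell` are hypothetical names, the choice of the OCR box area as the denominator is an assumption, and the threshold is a free parameter. The sketch also assumes both boxes are expressed in the same coordinate space, which is the precondition the PR description says was previously violated.

```rust
/// Axis-aligned box (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy)]
pub struct BBox {
    pub x1: f32,
    pub y1: f32,
    pub x2: f32,
    pub y2: f32,
}

impl BBox {
    pub fn area(&self) -> f32 {
        (self.x2 - self.x1).max(0.0) * (self.y2 - self.y1).max(0.0)
    }
}

/// Intersection area divided by the area of `inner` (assumed: the OCR box).
pub fn ioa(inner: &BBox, outer: &BBox) -> f32 {
    let w = (inner.x2.min(outer.x2) - inner.x1.max(outer.x1)).max(0.0);
    let h = (inner.y2.min(outer.y2) - inner.y1.max(outer.y1)).max(0.0);
    let a = inner.area();
    if a > 0.0 { (w * h) / a } else { 0.0 }
}

/// Index of the cell with the highest IoA for `ocr`, if it reaches `min_ioa`.
pub fn best_cell(ocr: &BBox, cells: &[BBox], min_ioa: f32) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, cell) in cells.iter().enumerate() {
        let score = ioa(ocr, cell);
        if score >= min_ioa && best.map_or(true, |(_, s)| score > s) {
            best = Some((i, score));
        }
    }
    best.map(|(i, _)| i)
}
```

Picking a single best cell per OCR box is one simple way to avoid the same text appearing in several rows; the PR's own cross-row deduplication for large cells may work differently.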
Changelog
  • examples/structure.rs
    • Refactored image processing loop to collect images for batch processing via predict_images.
    • Updated result processing to iterate over batch results.
  • oar-ocr-core/src/domain/structure.rs
    • Downgraded semantic title level for "ABSTRACT" and "REFERENCES" from H1 to H2.
    • Modified markdown generation to track prev_text_element for improved paragraph continuation across non-text elements.
    • Refined inline formula detection logic to require surrounding text on both sides.
    • Added helper functions has_bullet_markers and format_as_bullet_list to support bullet list formatting in markdown.
    • Updated test cases to reflect the new title level for keywords.
  • oar-ocr-core/src/processors/formula_preprocess.rs
    • Changed the image resize filter from Lanczos3 to Triangle for faster formula preprocessing.
  • src/oarocr/stitching.rs
    • Replaced space with <br/> as the separator for multi-line OCR content in table cells.
    • Revised table cell matching to consistently use structure cell bounding boxes for IoA calculations.
    • Introduced optional cross-row OCR deduplication for large cells when cell_bboxes_override is active.
    • Adjusted mapping of table data (td) positions to original cell indices using cell_aligned.
  • src/oarocr/structure.rs
    • Introduced a PreparedPage struct to encapsulate intermediate page processing results.
    • Refactored predict_image into prepare_page (for initial processing) and complete_page (for final assembly).
    • Added predict_images function to enable batch processing of multiple images, optimizing formula recognition across pages.
  • src/oarocr/table_analyzer.rs
    • Removed the dedup_by call from cluster_positions to align with PaddleX's clustering behavior.
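
The cross-page batching described for src/oarocr/structure.rs follows a flatten, infer once, regroup pattern. The sketch below shows only that pattern. `batch_across_pages` is a hypothetical helper and the `infer` closure stands in for a batched model call; it does not reproduce the real signatures of predict_images, prepare_page, complete_page, or PreparedPage.

```rust
/// Runs `infer` once over the items of all pages and regroups the outputs
/// per page. `infer` must return exactly one output per input, in order.
pub fn batch_across_pages<T, R>(
    pages: Vec<Vec<T>>,
    infer: impl FnOnce(Vec<T>) -> Vec<R>,
) -> Vec<Vec<R>> {
    // Remember how many items each page contributed.
    let counts: Vec<usize> = pages.iter().map(Vec::len).collect();
    // Flatten all pages into a single batch.
    let flat: Vec<T> = pages.into_iter().flatten().collect();
    let total = flat.len();
    let outputs = infer(flat);
    assert_eq!(outputs.len(), total, "infer must return one output per input");
    // Hand the outputs back to their pages in the original order.
    let mut iter = outputs.into_iter();
    counts
        .into_iter()
        .map(|n| iter.by_ref().take(n).collect())
        .collect()
}
```

Because the per-page counts are recorded before flattening, pages without any formula crops still receive an (empty) result slot, so page indices stay aligned.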
Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces several significant improvements and fixes. The main changes include adding batch formula inference via predict_images(), which should improve performance for multi-page documents. The markdown output is enhanced by downgrading some headers, improving inline formula detection, and adding bullet list formatting. There are also important bug fixes in table cell matching and a performance boost in formula preprocessing. The code is well-structured, especially the refactoring to support batch processing. I've found one area for improvement in the new bullet list formatting logic to make it more robust.
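
For readers unfamiliar with the bullet list formatting discussed in this review, here is a minimal sketch of the general idea. It is not the PR's has_bullet_markers or format_as_bullet_list: the function names, the marker set, and the handling of wrapped lines are all assumptions made for illustration.

```rust
/// Marker characters treated as bullets in this sketch (an assumption).
const BULLETS: [char; 3] = ['•', '◦', '▪'];

/// True if any line starts with one of the bullet markers.
pub fn has_bullets(text: &str) -> bool {
    text.lines()
        .any(|l| l.trim_start().starts_with(|c: char| BULLETS.contains(&c)))
}

/// Rewrites marker-led lines as Markdown "- item" lines. A line without a
/// marker is appended to the previous item as a wrapped continuation.
pub fn to_markdown_bullets(text: &str) -> String {
    let mut items: Vec<String> = Vec::new();
    for line in text.lines() {
        let line = line.trim();
        if line.is_empty() {
            continue;
        }
        match line.strip_prefix(|c: char| BULLETS.contains(&c)) {
            Some(rest) => items.push(format!("- {}", rest.trim_start())),
            None => match items.last_mut() {
                Some(last) => {
                    last.push(' ');
                    last.push_str(line);
                }
                None => items.push(line.to_string()),
            },
        }
    }
    items.join("\n")
}
```

Robustness concerns of the kind the review mentions typically arise around the marker set (for example, whether `-` or `*` should count) and around text that precedes the first bullet.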

Note: Security Review did not run due to the size of the PR.

Copilot AI (Contributor) left a comment

Pull request overview

This PR improves document-structure postprocessing for OCR results, focusing on more reliable table cell ↔ OCR matching, cross-page batching for formula recognition, and higher-quality Markdown generation.

Changes:

  • Adjust table cell clustering/matching and table-cell text joining (including <br/> for multi-line cell content).
  • Add predict_images() to batch formula crops across pages and refactor single-page processing into prepare_page + complete_page.
  • Improve Markdown output rules (heading levels for ABSTRACT/REFERENCES, inline-formula detection tightening, bullet list formatting, and paragraph continuation behavior).
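
The <br/> joining mentioned in the first item can be sketched in a few lines. `join_cell_lines` is a hypothetical helper, and trimming and skipping empty lines are assumptions of this sketch rather than documented behavior of the PR.

```rust
/// Joins the OCR lines of one table cell with `<br/>`, skipping empty lines.
pub fn join_cell_lines(lines: &[&str]) -> String {
    lines
        .iter()
        .map(|l| l.trim())
        .filter(|l| !l.is_empty())
        .collect::<Vec<_>>()
        .join("<br/>")
}
```

Joining with a space would merge visually separate lines into one run of text, while a raw newline would end the row in a Markdown pipe table, so an inline break tag is a natural middle ground.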

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/oarocr/table_analyzer.rs | Removes early coordinate dedup in cluster_positions to align clustering behavior with upstream expectations. |
| src/oarocr/structure.rs | Refactors page processing and adds predict_images() with cross-page formula batching. |
| src/oarocr/stitching.rs | Fixes table cell matching space mismatch and changes multi-line cell joins to use <br/>. |
| oar-ocr-core/src/processors/formula_preprocess.rs | Switches resize filter to a faster option for formula preprocessing. |
| oar-ocr-core/src/domain/structure.rs | Updates markdown heading semantics and improves paragraph/bullet/inline-formula formatting logic. |
| examples/structure.rs | Updates example CLI flow to use the new batch API. |
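
As background for the resize-filter change in formula_preprocess.rs, the sketch below shows plain bilinear interpolation on a row-major grayscale buffer. It is a from-scratch illustration, not the `image` crate's Triangle filter implementation and not the PR's code; `resize_bilinear` and its parameters are hypothetical. Each output sample reads at most a 2×2 neighbourhood of source pixels, whereas a Lanczos3 kernel has a support of three pixels on each side, which gives an intuition for why the simpler filter is cheaper.

```rust
/// Resizes a row-major grayscale image (`sw` x `sh`) to `dw` x `dh`
/// using bilinear interpolation with pixel-centre alignment.
pub fn resize_bilinear(src: &[f32], sw: usize, sh: usize, dw: usize, dh: usize) -> Vec<f32> {
    assert!(sw > 0 && sh > 0 && src.len() == sw * sh);
    let mut dst = Vec::with_capacity(dw * dh);
    for dy in 0..dh {
        // Map the destination pixel centre back into source coordinates.
        let fy = ((dy as f32 + 0.5) * sh as f32 / dh as f32 - 0.5).clamp(0.0, (sh - 1) as f32);
        let y0 = fy.floor() as usize;
        let y1 = (y0 + 1).min(sh - 1);
        let ty = fy - y0 as f32;
        for dx in 0..dw {
            let fx = ((dx as f32 + 0.5) * sw as f32 / dw as f32 - 0.5).clamp(0.0, (sw - 1) as f32);
            let x0 = fx.floor() as usize;
            let x1 = (x0 + 1).min(sw - 1);
            let tx = fx - x0 as f32;
            // Blend horizontally on the two rows, then vertically.
            let top = src[y0 * sw + x0] * (1.0 - tx) + src[y0 * sw + x1] * tx;
            let bot = src[y1 * sw + x0] * (1.0 - tx) + src[y1 * sw + x1] * tx;
            dst.push(top * (1.0 - ty) + bot * ty);
        }
    }
    dst
}
```

Note that this naive form can alias when shrinking an image by a large factor, so it should be read as an explanation of the interpolation itself rather than a drop-in replacement for a library filter.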

@GreatV GreatV merged commit 66ee9c0 into main Mar 8, 2026
3 checks passed
@GreatV GreatV deleted the fix_2 branch March 8, 2026 05:01

2 participants