
fix(structure): improve layout sorting, title levels, and inline formula handling #99

Merged
GreatV merged 5 commits into main from fix
Mar 7, 2026

@GreatV GreatV commented Mar 5, 2026

This pull request introduces several improvements and fixes across the OCR core and structure modules. The most significant changes include the addition of a label field to TextRegion for better semantic classification, improved table cell assignment logic using Intersection over Area (IoA), and more robust handling of table detection thresholds. There are also several bug fixes to bounding box rotation logic and test updates to ensure correctness. Below are the most important changes grouped by theme:

Semantic Labeling and API Improvements

  • Added an optional label field to TextRegion (with helper methods), enabling distinction between normal text, formulas, seals, etc., and updated constructors and usages to support this field. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
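
A minimal sketch of what an optional label on a text region could look like. The struct below is a simplified, hypothetical stand-in: the real `TextRegion` has more fields, and the exact signatures of its helpers may differ. Only the helper names `with_label`, `has_label`, and `is_formula` come from this PR.

```rust
/// Simplified, hypothetical stand-in for the crate's `TextRegion`.
#[derive(Debug, Clone, PartialEq)]
pub struct TextRegion {
    /// Recognized text, if any.
    pub text: Option<String>,
    /// Optional semantic label such as "formula" or "seal"; `None` means plain text.
    pub label: Option<String>,
}

impl TextRegion {
    /// Create an unlabeled (plain text) region.
    pub fn new(text: impl Into<String>) -> Self {
        Self { text: Some(text.into()), label: None }
    }

    /// Builder-style setter that attaches a semantic label.
    pub fn with_label(mut self, label: impl Into<String>) -> Self {
        self.label = Some(label.into());
        self
    }

    /// True when the region carries exactly this label.
    pub fn has_label(&self, label: &str) -> bool {
        self.label.as_deref() == Some(label)
    }

    /// Convenience check for formula regions (label string assumed to be "formula").
    pub fn is_formula(&self) -> bool {
        self.has_label("formula")
    }
}
```

Keeping the label as an `Option` means existing call sites that never set it keep treating regions as plain text.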

Table Detection and Assignment Logic

  • Changed table cell assignment from Intersection over Union (IoU) to Intersection over Area (IoA) of the detected box, improving robustness for table cells with row/colspan and better matching PaddleX behavior. [1] [2]
  • Lowered the default box_threshold for table detection to 0.4 in both general OCR and structure pipelines, matching PaddleX defaults for better table extraction, and ensured related config propagation. [1] [2] [3] [4]
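
To illustrate why IoA of the detected box is more forgiving than IoU for spanning cells, here is a small sketch. `Rect`, `iou`, and `ioa` are hypothetical names for illustration, not the crate's actual API.

```rust
/// Axis-aligned box in continuous coordinates (hypothetical helper type).
#[derive(Debug, Clone, Copy)]
pub struct Rect {
    pub x1: f32,
    pub y1: f32,
    pub x2: f32,
    pub y2: f32,
}

impl Rect {
    /// Area of the box; degenerate boxes have area 0.
    pub fn area(&self) -> f32 {
        (self.x2 - self.x1).max(0.0) * (self.y2 - self.y1).max(0.0)
    }
}

/// Area of the overlap between two boxes (0 when they do not intersect).
pub fn intersection_area(a: &Rect, b: &Rect) -> f32 {
    let w = (a.x2.min(b.x2) - a.x1.max(b.x1)).max(0.0);
    let h = (a.y2.min(b.y2) - a.y1.max(b.y1)).max(0.0);
    w * h
}

/// Intersection over Union: penalizes any size mismatch between the boxes.
pub fn iou(a: &Rect, b: &Rect) -> f32 {
    let inter = intersection_area(a, b);
    let union = a.area() + b.area() - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}

/// Intersection over Area of the detected box: the fraction of `det`
/// that lies inside `cell`, regardless of how large `cell` is.
pub fn ioa(det: &Rect, cell: &Rect) -> f32 {
    let area = det.area();
    if area > 0.0 { intersection_area(det, cell) / area } else { 0.0 }
}
```

For a 10×10 detected box fully inside a 40×10 cell that spans four columns, `iou` gives 0.25 while `ioa` gives 1.0, so the detection still matches the spanning cell.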

Bounding Box Rotation Bug Fixes

  • Fixed off-by-one errors in bounding box inverse rotation logic for 90°, 180°, and 270° angles, and updated corresponding tests to reflect the corrected calculations. [1] [2] [3] [4]
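
The PR text does not say which direction the off-by-one went, so the following is only a sketch under an assumed convention: box edges are continuous coordinates, rotation is clockwise, and `w`/`h` are the original image size. Under that assumption no `- 1` term appears and a forward rotation followed by the inverse returns the original box. `BBox`, `rotate_forward`, and `rotate_back` are hypothetical names.

```rust
/// Axis-aligned box with continuous edge coordinates (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x1: f32,
    pub y1: f32,
    pub x2: f32,
    pub y2: f32,
}

/// Map a box from the original image (w x h) into the image rotated
/// clockwise by `angle` degrees. Other angles leave the box unchanged.
pub fn rotate_forward(b: BBox, angle: u32, w: f32, h: f32) -> BBox {
    match angle {
        // (x, y) -> (h - y, x)
        90 => BBox { x1: h - b.y2, y1: b.x1, x2: h - b.y1, y2: b.x2 },
        // (x, y) -> (w - x, h - y)
        180 => BBox { x1: w - b.x2, y1: h - b.y2, x2: w - b.x1, y2: h - b.y1 },
        // (x, y) -> (y, w - x)
        270 => BBox { x1: b.y1, y1: w - b.x2, x2: b.y2, y2: w - b.x1 },
        _ => b,
    }
}

/// Inverse mapping: take a box found in the rotated image back to the
/// original image. `w` and `h` are still the ORIGINAL image size.
pub fn rotate_back(b: BBox, angle: u32, w: f32, h: f32) -> BBox {
    match angle {
        // (xr, yr) -> (yr, h - xr)
        90 => BBox { x1: b.y1, y1: h - b.x2, x2: b.y2, y2: h - b.x1 },
        // 180 degrees is its own inverse
        180 => BBox { x1: w - b.x2, y1: h - b.y2, x2: w - b.x1, y2: h - b.y1 },
        // (xr, yr) -> (w - yr, xr)
        270 => BBox { x1: w - b.y2, y1: b.x1, x2: w - b.y1, y2: b.x2 },
        _ => b,
    }
}
```

A round-trip test over all three angles is a cheap way to catch this class of bug, since a stray `- 1` in only one direction breaks the identity.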

Sorting and Layout Enhancements

  • Made projection_by_bboxes and split_projection_profile functions public within the crate to improve testability and modularity, and updated sorting logic to use the new SortableElement struct for clearer code. [1] [2] [3] [4] [5]
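
For readers unfamiliar with XY-cut, the two helpers can be sketched roughly as follows. The function names match those mentioned above, but the signatures, integer box layout `[x1, y1, x2, y2]`, and return types here are assumptions for illustration, not the crate's actual definitions.

```rust
/// Count how many boxes cover each coordinate along one axis.
/// `axis` is 0 to project onto x, 1 to project onto y.
pub fn projection_by_bboxes(boxes: &[[usize; 4]], axis: usize) -> Vec<u32> {
    assert!(axis < 2, "axis must be 0 (x) or 1 (y)");
    let len = boxes.iter().map(|b| b[axis + 2]).max().unwrap_or(0);
    let mut profile = vec![0u32; len];
    for b in boxes {
        let end = b[axis + 2];
        let start = b[axis].min(end); // guard against inverted boxes
        for v in &mut profile[start..end] {
            *v += 1;
        }
    }
    profile
}

/// Split a projection profile into half-open segments `[start, end)` of
/// bins whose value exceeds `min_value`. Two occupied bins stay in the
/// same segment when their index difference is at most `min_gap`.
pub fn split_projection_profile(
    profile: &[u32],
    min_value: u32,
    min_gap: usize,
) -> Vec<(usize, usize)> {
    let mut segments: Vec<(usize, usize)> = Vec::new();
    for (i, &v) in profile.iter().enumerate() {
        if v <= min_value {
            continue;
        }
        if let Some(last) = segments.last_mut() {
            // `last.1 - 1` is the last occupied index of the current segment.
            if i - (last.1 - 1) <= min_gap {
                last.1 = i + 1;
                continue;
            }
        }
        segments.push((i, i + 1));
    }
    segments
}
```

Two columns of boxes separated by an empty vertical strip produce two segments on the x projection, which is the cut that XY-cut recurses on.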

Other Minor Improvements

  • Marked the standalone calculate_iou function as #[allow(dead_code)] to clarify its usage and avoid compiler warnings.

These changes collectively improve the flexibility, correctness, and maintainability of the OCR pipeline, especially for documents with complex table layouts and semantic region labeling.

GreatV added 2 commits March 5, 2026 14:32
…ula handling

- Port PaddleX's xycut_enhanced algorithm faithfully with proper constants, cross-layout detection, overlapping box shrinking, and weighted distance insertion for reading order sorting
- Add multi-signal title level inference using semantic numbering, relative indentation, and font-size k-means clustering
- Inject inline formulas into text regions before stitching to prevent duplicate rendering, wrapping LaTeX with $...$ delimiters
- Move markdown generation from examples into core StructureResult::to_markdown()
- Fix cell detection to also run when use_cells_trans_to_html is enabled
- Add max_side_len=4000 default and lower box_threshold for table pipelines
- Add line_height_iou_threshold for improved line grouping in stitching
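
As an illustration of the semantic-numbering signal only (the indentation and font-size clustering signals are not shown), a title such as "2.3.1 Details" can be mapped to a heading depth by counting its numeric components. `level_from_numbering` and `format_heading` are hypothetical names, and this is a sketch rather than the logic in this PR.

```rust
/// Infer a heading level from a leading section number:
/// "1 Intro" -> 1, "2.3 Method" -> 2, "2.3.1. Details" -> 3.
/// Returns `None` when the title does not start with a dotted number.
pub fn level_from_numbering(title: &str) -> Option<usize> {
    let first = title.split_whitespace().next()?;
    // Accept a trailing dot, as in "1." or "2.3.1."
    let token = first.trim_end_matches('.');
    if token.is_empty() {
        return None;
    }
    let parts: Vec<&str> = token.split('.').collect();
    let all_numeric = parts
        .iter()
        .all(|p| !p.is_empty() && p.chars().all(|c| c.is_ascii_digit()));
    if all_numeric { Some(parts.len()) } else { None }
}

/// Render a markdown heading, falling back to `fallback_level` when the
/// numbering gives no hint. Levels are clamped to markdown's 1..=6 range.
pub fn format_heading(title: &str, fallback_level: usize) -> String {
    let level = level_from_numbering(title)
        .unwrap_or(fallback_level)
        .clamp(1, 6);
    format!("{} {}", "#".repeat(level), title.trim())
}
```

Numbering alone misfires on titles like "2024 Annual Report", which is one reason to combine it with other signals.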
@GreatV GreatV requested a review from Copilot March 5, 2026 14:35
Copilot AI left a comment

Pull request overview

This PR enhances OCR structure extraction correctness and output formatting by adding semantic labeling to text regions, improving table cell matching (IoA + row-aware stitching), refining layout sorting, and expanding markdown generation/post-processing behaviors.

Changes:

  • Added TextRegion.label and propagated it through OCR/structure/stitching to support formula handling and richer semantics.
  • Improved table processing: conditional cell detection usage, detected-cell bbox propagation, row-aware IoA-based matching, and checkbox symbol normalization.
  • Upgraded layout sorting inputs (SortableElement + num_lines) and expanded markdown generation (title-level inference, paragraph continuity flags, formula/table handling).

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/oarocr/table_analyzer.rs | Adjusts cell-detection selection, regenerates structure tokens from detected cells, and propagates detected cell bboxes + E2E flag into TableResult. |
| src/oarocr/structure.rs | Updates structure pipeline defaults (table-friendly thresholds), uses enhanced sortable input, injects formulas into OCR pool as labeled TextRegions. |
| src/oarocr/stitching.rs | Adds inline-formula absorption, row-aware IoA matching with optional detected bboxes, enhanced line grouping/spacing, and table postprocessing. |
| src/oarocr/ocr.rs | Adds text_type="table" defaults and propagates TextRegion.label. |
| oar-ocr-vl/src/doc_parser.rs | Updates enhanced layout sorting call site to use SortableElement + num_lines. |
| oar-ocr-core/src/processors/sorting.rs | Makes projection helpers visible for testing and marks calculate_iou as potentially dead code. |
| oar-ocr-core/src/processors/layout_utils.rs | Switches cell assignment to IoA(det) and removes local IoU helper. |
| oar-ocr-core/src/processors/layout_sorting.rs | Expands enhanced sorting algorithm and changes API to accept SortableElement { bbox, element_type, num_lines }. |
| oar-ocr-core/src/processors/geometry.rs | Fixes bbox inverse-rotation off-by-one errors and updates tests. |
| oar-ocr-core/src/domain/text_region.rs | Introduces label field and helper methods (is_formula, with_label, etc.). |
| oar-ocr-core/src/domain/structure.rs | Adds title-level inference, paragraph continuity metadata, table/formula markdown adjustments, and formula-related postprocessing changes. |
| examples/utils/markdown.rs | Simplifies example utilities to rely on core markdown + concatenation + postprocessing. |

Comments suppressed due to low confidence (1)

src/oarocr/stitching.rs:46

  • The doc comment contradicts the actual exclusion list: it says 'only Table and Seal are excluded' but the constant also excludes Formula. Update the comment to reflect the true behavior (e.g., formulas are excluded from stitch_layout_elements after inline injection to avoid duplication).
/// Labels excluded from OCR text matching in `stitch_layout_elements`.
/// PaddleX: formula results are injected into the OCR pool (via
/// `convert_formula_res_to_ocr_format`), so formula blocks participate
/// in normal OCR matching — only Table and Seal are excluded.
///
/// NOTE: After inline formula injection, formula elements have been absorbed
/// into text regions and should be excluded from stitching to prevent duplication.
const EXCLUDED_FROM_OCR_LABELS: [LayoutElementType; 3] = [
    LayoutElementType::Table,
    LayoutElementType::Seal,
    LayoutElementType::Formula, // Exclude formulas to prevent duplicate rendering after injection
];

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines the document structure analysis and markdown generation capabilities. It introduces a new semantic labeling feature for text regions, enhances table processing by improving cell assignment and detection thresholds, and fixes critical bounding box rotation issues. Furthermore, the layout sorting algorithm has been upgraded to provide a more robust reading order, and the markdown export now offers more faithful reproduction of document structure, including better handling of titles and inline formulas.

Highlights

  • Semantic Labeling: Introduced a 'label' field to 'TextRegion' for better semantic classification (e.g., "formula", "seal").
  • Table Cell Assignment: Switched table cell assignment from Intersection over Union (IoU) to Intersection over Area (IoA) of the detected box, improving robustness for complex table layouts.
  • Table Detection Thresholds: Lowered the default 'box_threshold' for table detection to 0.4 in both general OCR and structure pipelines, aligning with PaddleX defaults.
  • Bounding Box Rotation Fixes: Corrected off-by-one errors in inverse rotation logic for 90°, 180°, and 270° angles.
  • Enhanced Layout Sorting: Implemented an 'xycut_enhanced' algorithm with cross-layout detection, overlapping box shrinking, weighted distance insertion, and child block association for more accurate reading order.
  • Markdown Generation Improvements: Refactored markdown export to leverage core library logic, including advanced title level inference, improved inline formula handling, and paragraph continuity detection.
Changelog
  • examples/utils/markdown.rs
    • Removed local markdown formatting functions, delegating to the core library.
    • Updated markdown export functions to utilize the core library's enhanced generation and post-processing.
  • oar-ocr-core/src/domain/structure.rs
    • Added advanced title level inference functions and modified 'format_title_with_level' to use clustered level hints.
    • Updated 'StructureResult::to_markdown' for new title formatting, paragraph continuity, and improved formula/image rendering.
    • Added 'seg_start_x', 'seg_end_x', 'num_lines' fields to 'LayoutElement' for paragraph continuity detection.
    • Added 'detected_cell_bboxes' and 'is_e2e' fields to 'TableResult'.
    • Refined 'dehyphenate' logic to match PaddleX's behavior (hyphen-newline patterns only).
    • Introduced 'format_first_line' for abstract/reference formatting.
    • Enhanced 'postprocess_markdown' to handle dollar signs within formula blocks.
  • oar-ocr-core/src/domain/text_region.rs
    • Added a 'label' field to 'TextRegion' struct for semantic classification.
    • Provided 'has_label', 'is_formula', and 'with_label' helper methods to 'TextRegion'.
  • oar-ocr-core/src/processors/geometry.rs
    • Corrected off-by-one errors in 'rotate_back_to_original' for 90°, 180°, and 270° rotations.
    • Updated corresponding tests to reflect corrected calculations.
  • oar-ocr-core/src/processors/layout_sorting.rs
    • Updated 'OrderLabel' enum with 'CrossReference' and removed 'is_header'/'is_footer' methods.
    • Modified 'SortableBlock' to include 'element_type', 'num_lines', and 'text_line_height', and added helper methods.
    • Introduced 'SortableElement' struct for input to sorting.
    • Refactored 'sort_layout_enhanced' to implement PaddleX's 'xycut_enhanced' algorithm, including cross-layout detection, box shrinking, and weighted distance insertion.
    • Updated 'weighted_distance_insert' to use new constants and logic.
    • Changed 'calculate_projection_overlap' to 'calculate_projection_overlap_ratio'.
  • oar-ocr-core/src/processors/layout_utils.rs
    • Modified 'reconcile_table_cells' to use Intersection over Area (IoA) of the detected box instead of IoU for cell assignment.
    • Removed the standalone 'calculate_iou' function.
  • oar-ocr-core/src/processors/sorting.rs
    • Made 'projection_by_bboxes' and 'split_projection_profile' functions public.
    • Added '#[allow(dead_code)]' attribute to 'calculate_iou'.
  • oar-ocr-vl/src/doc_parser.rs
    • Updated the 'sort_layout_enhanced' function call to pass 'SortableElement' instances.
  • src/oarocr/ocr.rs
    • Added a 'table' text type to 'OAROCRBuilder' with specific text detection defaults ('box_threshold = 0.4').
    • Initialized 'label' field to 'None' in 'TextRegion' creation.
  • src/oarocr/stitching.rs
    • Updated 'EXCLUDED_FROM_OCR_LABELS' to exclude 'Formula' elements after inline injection.
    • Added 'line_height_iou_threshold' to 'StitchConfig'.
    • Modified 'ResultStitcher::stitch' to include 'fill_formula_elements' and 'inject_inline_formulas' before 'stitch_layout_elements'.
    • Updated orphan text region handling to filter out text inside image/chart regions and inferred figure regions, and to create 'LayoutElementType::Formula' for formula-labeled text regions.
    • Replaced 'sort_layout_elements' with 'sort_layout_elements_enhanced'.
    • Removed 'fill_formula_content' (now 'fill_formula_elements').
    • Modified 'stitch_tables' to inject formula results into the OCR candidate pool, use 'is_e2e' flag, and pass 'detected_cell_bboxes' to row-aware matching.
    • Added 'normalize_checkbox_symbols_in_table' for post-processing checkbox-style tables.
    • Introduced 'collect_cell_texts_for_tokens' for robust cell text collection.
    • Updated 'match_table_ocr_to_cells_row_aware' to use 'detected_cell_bboxes' override and IoA for matching.
    • Refined 'sort_and_join_texts' for better line grouping, paragraph break detection, dehyphenation, and inline/display formula wrapping.
    • Added 'is_same_text_line_bbox', 'last_non_whitespace_char', and 'is_non_break_line_end_punctuation' helper functions.
  • src/oarocr/structure.rs
    • Adjusted text detection thresholds for table processing, setting 'box_threshold = 0.4' when table pipelines are enabled.
    • Added 'max_side_len = 4000' default for text detection.
    • Initialized 'label' field to 'None' in 'TextRegion' creation.
    • Modified 'sort_layout_elements' to use 'SortableElement' for enhanced sorting.
    • Updated 'create_expanded_ocr_for_table' to use 'is_e2e' flag.
    • Injected formula results into the overall OCR pool with 'label="formula"'.
  • src/oarocr/table_analyzer.rs
    • Refined 'cell_adapter' logic to use cell detection when 'use_cells_trans_to_html' is enabled, even in E2E mode.
    • Added 'detected_cell_bboxes' to 'TableResult' for the stitcher.
    • Added logic to generate structure tokens from cell positions if missing.
    • Set 'is_e2e' flag in 'TableResult'.
    • Fixed bounding box rotation tests to remove off-by-one errors.
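
To make the "hyphen-newline patterns only" dehyphenation concrete, here is a rough sketch. The exact conditions used in the PR are not shown in this summary, so the rule below (a letter, then `-`, then a newline, then a lowercase letter) is an assumption, and this `dehyphenate` is a hypothetical stand-in.

```rust
/// Join words that were split across a line break with a hyphen, e.g.
/// "recog-\nnition" -> "recognition". Hyphens that are not directly
/// followed by a newline (such as "well-known") are left untouched.
pub fn dehyphenate(text: &str) -> String {
    let chars: Vec<char> = text.chars().collect();
    let mut out = String::with_capacity(text.len());
    let mut i = 0;
    while i < chars.len() {
        // Assumed pattern: <letter> '-' '\n' <lowercase letter>
        let is_line_break_hyphen = chars[i] == '-'
            && i > 0
            && chars[i - 1].is_alphabetic()
            && chars.get(i + 1) == Some(&'\n')
            && chars.get(i + 2).map_or(false, |c| c.is_lowercase());
        if is_line_break_hyphen {
            i += 2; // drop both the hyphen and the newline
        } else {
            out.push(chars[i]);
            i += 1;
        }
    }
    out
}
```

Restricting the rule to hyphen-newline pairs avoids merging genuinely hyphenated compounds that happen to sit mid-line.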
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive set of improvements to the document structure analysis, significantly enhancing layout sorting, title level detection, and the handling of formulas. While these changes greatly improve the correctness and maintainability of the OCR structure pipeline, a critical security vulnerability has been identified. Specifically, the markdown generation process in to_markdown and the text joining logic in sort_and_join_texts lack proper sanitization of OCR-recognized text and table HTML, which can lead to HTML injection and Cross-Site Scripting (XSS) vulnerabilities if the generated markdown is rendered in a web environment. It is recommended to implement HTML escaping for all untrusted OCR content before including it in the final output to address this.
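
A minimal sketch of the kind of HTML escaping recommended above, applied to OCR text before it is embedded in markdown or table HTML. `escape_html` is a hypothetical helper, not a function from this PR. Where exactly to call it (per cell, per paragraph) would depend on the pipeline.

```rust
/// Escape the five characters that are significant in HTML text and
/// attribute contexts so untrusted OCR output cannot inject markup.
pub fn escape_html(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(c),
        }
    }
    out
}
```

Escaping should happen once, at the point where recognized text is written into the output. Escaping already-escaped text would double-encode entities such as `&amp;`.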

@GreatV GreatV requested a review from Copilot March 5, 2026 16:28

GreatV commented Mar 5, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a substantial set of improvements to the OCR structure analysis and markdown generation. Key changes include a more sophisticated layout sorting algorithm that better handles complex documents, improved title level detection using semantic analysis and clustering, and more robust handling of inline formulas. The logic for table cell assignment has been improved by using Intersection over Area (IoA), and several bugs in bounding box rotation have been fixed. Additionally, the markdown generation utilities have been significantly refactored, centralizing logic into the core library for better maintainability. I have one suggestion regarding the sanitization of formulas to make it less destructive. Overall, these are excellent changes that enhance the correctness, robustness, and maintainability of the system.

Note: Security Review did not run due to the size of the PR.

Copilot AI left a comment

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.


…; adjust usage in PDF processing and ONNX session handling

GreatV commented Mar 7, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant improvements to the document structure analysis, particularly in layout sorting, title level detection, and formula handling. It refactors a lot of logic from examples into the core library, which is a great move for maintainability. The changes include a more sophisticated xycut_enhanced sorting algorithm, bug fixes for bounding box rotation, using IoA instead of IoU for more robust table cell assignment, and better handling of inline formulas and paragraph continuity. The code quality is high, and the changes are well-documented with comments explaining the alignment with PaddleX. I have one suggestion for a potential performance improvement.

Note: Security Review did not run due to the size of the PR.

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.



GreatV commented Mar 7, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This is a substantial pull request that significantly enhances the layout analysis and markdown generation capabilities of the library, bringing it more in line with PaddleX's advanced structure analysis. The introduction of the xycut_enhanced sorting algorithm, a more sophisticated title-level inference system, and improved inline formula handling are major improvements. The refactoring to centralize markdown logic in the core library is a great simplification. I've also noted important bug fixes, including corrections to bounding box rotation logic and the move from IoU to IoA for more robust table cell assignment.

I have a couple of minor suggestions: one regarding a potentially misleading #[allow(dead_code)] attribute on a function that appears to be in use, and another suggesting a safer way to handle special characters within formula blocks to prevent data loss. Overall, this is an excellent set of changes that greatly improves the library's functionality.
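
The review summary does not spell out the suggested approach for formula blocks. One non-destructive option, sketched here as an assumption, is to escape bare `$` characters inside the LaTeX body instead of deleting them, then wrap the result in `$...$`. `escape_dollars` and `wrap_inline_formula` are hypothetical names.

```rust
/// Escape every unescaped `$` as `\$`, leaving already-escaped ones alone.
pub fn escape_dollars(latex: &str) -> String {
    let mut out = String::with_capacity(latex.len());
    // True when the previous character is a backslash that itself is not
    // escaped, i.e. it escapes whatever comes next.
    let mut prev_escapes = false;
    for c in latex.chars() {
        if c == '$' && !prev_escapes {
            out.push('\\');
        }
        out.push(c);
        prev_escapes = c == '\\' && !prev_escapes;
    }
    out
}

/// Wrap a LaTeX snippet in inline-math delimiters after escaping stray
/// dollar signs, so no characters from the recognized formula are lost.
pub fn wrap_inline_formula(latex: &str) -> String {
    format!("${}$", escape_dollars(latex.trim()))
}
```

Compared with stripping `$`, this keeps the recognized content intact while still preventing a stray dollar sign from closing the math span early.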

Note: Security Review did not run due to the size of the PR.

@GreatV GreatV merged commit 56767b0 into main Mar 7, 2026
3 checks passed
@GreatV GreatV deleted the fix branch March 7, 2026 10:52