…ula handling
- Port PaddleX's xycut_enhanced algorithm faithfully with proper constants, cross-layout detection, overlapping box shrinking, and weighted distance insertion for reading order sorting
- Add multi-signal title level inference using semantic numbering, relative indentation, and font-size k-means clustering
- Inject inline formulas into text regions before stitching to prevent duplicate rendering, wrapping LaTeX with $...$ delimiters
- Move markdown generation from examples into core StructureResult::to_markdown()
- Fix cell detection to also run when use_cells_trans_to_html is enabled
- Add max_side_len=4000 default and lower box_threshold for table pipelines
- Add line_height_iou_threshold for improved line grouping in stitching
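To make the "semantic numbering" signal concrete, here is a minimal sketch of how a title level could be read off a leading section number. The function name `level_from_numbering` is hypothetical and is not taken from the PR; the actual inference reportedly combines this signal with indentation and font-size clustering, which this sketch does not attempt.

```rust
/// Hypothetical helper (not the PR's code): infer a heading level from a
/// leading section number such as "2.3.1". Returns `None` when the title
/// does not start with a dotted numeric prefix.
fn level_from_numbering(title: &str) -> Option<usize> {
    // Take the first whitespace-delimited token, e.g. "2.3.1" or "3.".
    let prefix = title.split_whitespace().next()?;
    // Allow a trailing dot, as in "3. Methods".
    let prefix = prefix.trim_end_matches('.');
    if prefix.is_empty() {
        return None;
    }
    let parts: Vec<&str> = prefix.split('.').collect();
    let all_numeric = parts
        .iter()
        .all(|p| !p.is_empty() && p.chars().all(|c| c.is_ascii_digit()));
    // The number of dotted components is used as the level.
    if all_numeric {
        Some(parts.len())
    } else {
        None
    }
}
```

On its own this signal is fragile: a title such as "2024 Annual Report" would be read as level 1, which is presumably one reason to combine it with other signals.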
Pull request overview
This PR enhances OCR structure extraction correctness and output formatting by adding semantic labeling to text regions, improving table cell matching (IoA + row-aware stitching), refining layout sorting, and expanding markdown generation/post-processing behaviors.
Changes:
- Added `TextRegion.label` and propagated it through OCR/structure/stitching to support formula handling and richer semantics.
- Improved table processing: conditional cell detection usage, detected-cell bbox propagation, row-aware IoA-based matching, and checkbox symbol normalization.
- Upgraded layout sorting inputs (`SortableElement` + `num_lines`) and expanded markdown generation (title-level inference, paragraph continuity flags, formula/table handling).
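As a rough illustration of what a labeled text region enables, the sketch below uses a stripped-down stand-in for `TextRegion`. The method names `with_label` and `is_formula` appear in the PR's file summary, but the field types, the `"formula"` label string, and the `render_inline` helper are assumptions made for this example; the real struct carries more fields (bounding boxes, scores, and so on).

```rust
/// Simplified stand-in for the real `TextRegion` (assumed shape).
#[derive(Debug, Clone, PartialEq)]
struct TextRegion {
    text: String,
    /// Optional semantic label; the string values are an assumption.
    label: Option<String>,
}

impl TextRegion {
    fn new(text: impl Into<String>) -> Self {
        Self { text: text.into(), label: None }
    }

    /// Builder-style setter for the label.
    fn with_label(mut self, label: impl Into<String>) -> Self {
        self.label = Some(label.into());
        self
    }

    /// True when the region was labeled as a formula.
    fn is_formula(&self) -> bool {
        self.label.as_deref() == Some("formula")
    }
}

/// Hypothetical renderer: wraps formula text in `$...$`, leaves other text as is.
fn render_inline(region: &TextRegion) -> String {
    if region.is_formula() {
        format!("${}$", region.text.trim())
    } else {
        region.text.clone()
    }
}
```

With the label carried on the region itself, downstream stitching can decide how to render a region without consulting the layout elements again.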
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/oarocr/table_analyzer.rs | Adjusts cell-detection selection, regenerates structure tokens from detected cells, and propagates detected cell bboxes + E2E flag into TableResult. |
| src/oarocr/structure.rs | Updates structure pipeline defaults (table-friendly thresholds), uses enhanced sortable input, injects formulas into OCR pool as labeled TextRegions. |
| src/oarocr/stitching.rs | Adds inline-formula absorption, row-aware IoA matching with optional detected bboxes, enhanced line grouping/spacing, and table postprocessing. |
| src/oarocr/ocr.rs | Adds text_type="table" defaults and propagates TextRegion.label. |
| oar-ocr-vl/src/doc_parser.rs | Updates enhanced layout sorting call site to use SortableElement + num_lines. |
| oar-ocr-core/src/processors/sorting.rs | Makes projection helpers visible for testing and marks calculate_iou as potentially dead code. |
| oar-ocr-core/src/processors/layout_utils.rs | Switches cell assignment to IoA(det) and removes local IoU helper. |
| oar-ocr-core/src/processors/layout_sorting.rs | Expands enhanced sorting algorithm and changes API to accept SortableElement { bbox, element_type, num_lines }. |
| oar-ocr-core/src/processors/geometry.rs | Fixes bbox inverse-rotation off-by-one errors and updates tests. |
| oar-ocr-core/src/domain/text_region.rs | Introduces label field and helper methods (is_formula, with_label, etc.). |
| oar-ocr-core/src/domain/structure.rs | Adds title-level inference, paragraph continuity metadata, table/formula markdown adjustments, and formula-related postprocessing changes. |
| examples/utils/markdown.rs | Simplifies example utilities to rely on core markdown + concatenation + postprocessing. |
Comments suppressed due to low confidence (1)
src/oarocr/stitching.rs:46
- The doc comment contradicts the actual exclusion list: it says 'only Table and Seal are excluded' but the constant also excludes `Formula`. Update the comment to reflect the true behavior (e.g., formulas are excluded from `stitch_layout_elements` after inline injection to avoid duplication).
/// Labels excluded from OCR text matching in `stitch_layout_elements`.
/// PaddleX: formula results are injected into the OCR pool (via
/// `convert_formula_res_to_ocr_format`), so formula blocks participate
/// in normal OCR matching — only Table and Seal are excluded.
///
/// NOTE: After inline formula injection, formula elements have been absorbed
/// into text regions and should be excluded from stitching to prevent duplication.
const EXCLUDED_FROM_OCR_LABELS: [LayoutElementType; 3] = [
    LayoutElementType::Table,
    LayoutElementType::Seal,
    LayoutElementType::Formula, // Exclude formulas to prevent duplicate rendering after injection
];
Summary of Changes
This pull request significantly refines the document structure analysis and markdown generation capabilities. It introduces a new semantic labeling feature for text regions, enhances table processing by improving cell assignment and detection thresholds, and fixes critical bounding box rotation issues. Furthermore, the layout sorting algorithm has been upgraded to provide a more robust reading order, and the markdown export now offers more faithful reproduction of document structure, including better handling of titles and inline formulas.
Code Review
This pull request introduces a comprehensive set of improvements to the document structure analysis, significantly enhancing layout sorting, title level detection, and the handling of formulas. While these changes greatly improve the correctness and maintainability of the OCR structure pipeline, a critical security vulnerability has been identified. Specifically, the markdown generation process in to_markdown and the text joining logic in sort_and_join_texts lack proper sanitization of OCR-recognized text and table HTML, which can lead to HTML injection and Cross-Site Scripting (XSS) vulnerabilities if the generated markdown is rendered in a web environment. It is recommended to implement HTML escaping for all untrusted OCR content before including it in the final output to address this.
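The escaping recommended in this review could look roughly like the sketch below. `escape_html` is a hypothetical helper, not a function from the PR, and where it would be called inside `to_markdown` or `sort_and_join_texts` is left open here.

```rust
/// Hypothetical helper: escape the five characters that are significant in
/// HTML text and attribute contexts so OCR output cannot inject markup.
fn escape_html(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for ch in input.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            '\'' => out.push_str("&#39;"),
            _ => out.push(ch),
        }
    }
    out
}
```

Table HTML is intentional markup, so a fix along these lines would presumably escape the recognized cell text before it is placed into the table rather than escaping the whole table string.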
/gemini review
Code Review
This pull request introduces a substantial set of improvements to the OCR structure analysis and markdown generation. Key changes include a more sophisticated layout sorting algorithm that better handles complex documents, improved title level detection using semantic analysis and clustering, and more robust handling of inline formulas. The logic for table cell assignment has been improved by using Intersection over Area (IoA), and several bugs in bounding box rotation have been fixed. Additionally, the markdown generation utilities have been significantly refactored, centralizing logic into the core library for better maintainability. I have one suggestion regarding the sanitization of formulas to make it less destructive. Overall, these are excellent changes that enhance the correctness, robustness, and maintainability of the system.
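The difference between IoU and IoA for cell assignment can be seen in a small sketch. The `Rect` type and function names below are illustrative assumptions, not the crate's types; IoA is taken here to mean intersection area divided by the area of the detected box, which is how the file summary's "IoA(det)" reads.

```rust
/// Axis-aligned box with (x1, y1) as top-left and (x2, y2) as bottom-right.
#[derive(Debug, Clone, Copy)]
struct Rect {
    x1: f32,
    y1: f32,
    x2: f32,
    y2: f32,
}

impl Rect {
    fn area(&self) -> f32 {
        (self.x2 - self.x1).max(0.0) * (self.y2 - self.y1).max(0.0)
    }
}

fn intersection_area(a: &Rect, b: &Rect) -> f32 {
    let w = (a.x2.min(b.x2) - a.x1.max(b.x1)).max(0.0);
    let h = (a.y2.min(b.y2) - a.y1.max(b.y1)).max(0.0);
    w * h
}

/// Intersection over Union: symmetric, penalizes size mismatch.
fn iou(a: &Rect, b: &Rect) -> f32 {
    let inter = intersection_area(a, b);
    let union = a.area() + b.area() - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}

/// Intersection over Area of the detected box: how much of `det` lies in `other`.
fn ioa(det: &Rect, other: &Rect) -> f32 {
    let area = det.area();
    if area > 0.0 { intersection_area(det, other) / area } else { 0.0 }
}
```

A short text box that sits entirely inside a wide cell scores 1.0 under IoA but only a small IoU, so an IoU threshold could reject a match that is clearly correct.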
Note: Security Review did not run due to the size of the PR.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 7 comments.
…; adjust usage in PDF processing and ONNX session handling
/gemini review
Code Review
This pull request introduces significant improvements to the document structure analysis, particularly in layout sorting, title level detection, and formula handling. It refactors a lot of logic from examples into the core library, which is a great move for maintainability. The changes include a more sophisticated xycut_enhanced sorting algorithm, bug fixes for bounding box rotation, using IoA instead of IoU for more robust table cell assignment, and better handling of inline formulas and paragraph continuity. The code quality is high, and the changes are well-documented with comments explaining the alignment with PaddleX. I have one suggestion for a potential performance improvement.
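For readers unfamiliar with XY-cut style sorting, the core building block is a projection profile that is split at gaps. The sketch below shows that idea on one axis only; `project_intervals` and `split_profile` are hypothetical names with simplified signatures and are not the crate's `projection_by_bboxes` or `split_projection_profile`, and none of the enhanced algorithm's extra handling is modeled.

```rust
/// Count how many boxes cover each coordinate along one axis.
/// Each interval is half-open: `start..end`.
fn project_intervals(intervals: &[(usize, usize)], len: usize) -> Vec<u32> {
    let mut profile = vec![0u32; len];
    for &(start, end) in intervals {
        // Visits indices start..min(end, len); empty if start >= end.
        for slot in profile.iter_mut().take(end.min(len)).skip(start) {
            *slot += 1;
        }
    }
    profile
}

/// Split a profile into occupied segments, merging segments that are
/// separated by fewer than `min_gap` empty positions.
fn split_profile(profile: &[u32], min_gap: usize) -> Vec<(usize, usize)> {
    let mut segments: Vec<(usize, usize)> = Vec::new();
    let mut run_start: Option<usize> = None;
    for (i, &v) in profile.iter().enumerate() {
        match (v > 0, run_start) {
            (true, None) => run_start = Some(i),
            (false, Some(s)) => {
                segments.push((s, i));
                run_start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = run_start {
        segments.push((s, profile.len()));
    }

    let mut merged: Vec<(usize, usize)> = Vec::new();
    for seg in segments {
        if let Some(last) = merged.last_mut() {
            if seg.0 - last.1 < min_gap {
                last.1 = seg.1;
                continue;
            }
        }
        merged.push(seg);
    }
    merged
}
```

Applying the split alternately along the x and y axes, and recursing into each segment, is the basic recursive XY-cut that an enhanced variant builds on.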
Note: Security Review did not run due to the size of the PR.
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
/gemini review
Code Review
This is a substantial pull request that significantly enhances the layout analysis and markdown generation capabilities of the library, bringing it more in line with PaddleX's advanced structure analysis. The introduction of the xycut_enhanced sorting algorithm, a more sophisticated title-level inference system, and improved inline formula handling are major improvements. The refactoring to centralize markdown logic in the core library is a great simplification. I've also noted important bug fixes, including corrections to bounding box rotation logic and the move from IoU to IoA for more robust table cell assignment.
I have a couple of minor suggestions: one regarding a potentially misleading #[allow(dead_code)] attribute on a function that appears to be in use, and another suggesting a safer way to handle special characters within formula blocks to prevent data loss. Overall, this is an excellent set of changes that greatly improves the library's functionality.
Note: Security Review did not run due to the size of the PR.
This pull request introduces several improvements and fixes across the OCR core and structure modules. The most significant changes include the addition of a `label` field to `TextRegion` for better semantic classification, improved table cell assignment logic using Intersection over Area (IoA), and more robust handling of table detection thresholds. There are also several bug fixes to bounding box rotation logic and test updates to ensure correctness. Below are the most important changes grouped by theme:

Semantic Labeling and API Improvements
- Added a `label` field to `TextRegion` (with helper methods), enabling distinction between normal text, formulas, seals, etc., and updated constructors and usages to support this field.

Table Detection and Assignment Logic
- Lowered `box_threshold` for table detection to 0.4 in both general OCR and structure pipelines, matching PaddleX defaults for better table extraction, and ensured related config propagation.

Bounding Box Rotation Bug Fixes

Sorting and Layout Enhancements
- Made the `projection_by_bboxes` and `split_projection_profile` functions public within the crate to improve testability and modularity, and updated sorting logic to use the new `SortableElement` struct for clearer code.

Other Minor Improvements
- Marked the `calculate_iou` function as `#[allow(dead_code)]` to clarify its usage and avoid compiler warnings.

These changes collectively improve the flexibility, correctness, and maintainability of the OCR pipeline, especially for documents with complex table layouts and semantic region labeling.