Feature/mineru improvements #11938

concertdictate · 2025-12-12T22:02:11Z

I have repeated this description in Chinese in the comment below.

What problem does this PR solve?

Summary

This PR enhances the MinerU document parser with additional configuration options, giving users more control over PDF parsing behavior and improving support for multilingual documents.

Changes

Backend (`deepdoc/parser/mineru_parser.py`)

Added configurable parsing options:
- Parse Method: `auto`, `txt`, or `ocr`; lets users choose the extraction strategy
- Formula Recognition: toggle to enable or disable formula extraction (disabling it helps with Cyrillic documents, where it may cause issues)
- Table Recognition: toggle to enable or disable table extraction

Added a language code mapping (`LANGUAGE_TO_MINERU_MAP`) that translates RAGFlow language settings into MinerU-compatible language codes for better OCR accuracy.

Improved parser configuration handling so that these options are passed through the processing pipeline.
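
To make the backend options concrete, here is a minimal sketch of how they could be modeled. Only the name `LANGUAGE_TO_MINERU_MAP` and the three parse methods come from this description; the `MinerUOptions` class, the `to_mineru_lang` helper, the field names, the fallback to `en`, and the mapping entries shown are illustrative assumptions and may not match the code in this PR.

```python
from dataclasses import dataclass

# Illustrative subset only: the actual mapping in the PR may contain
# different keys and values.
LANGUAGE_TO_MINERU_MAP = {
    "English": "en",
    "Chinese": "ch",
    "Russian": "east_slavic",
    "Japanese": "japan",
    "Korean": "korean",
}

VALID_PARSE_METHODS = ("auto", "txt", "ocr")


@dataclass(frozen=True)
class MinerUOptions:
    """Hypothetical container for the options described above."""

    parse_method: str = "auto"
    formula_enable: bool = True
    table_enable: bool = True
    lang: str = "en"

    def __post_init__(self):
        # Reject anything other than the three documented parse methods.
        if self.parse_method not in VALID_PARSE_METHODS:
            raise ValueError(f"Unsupported parse method: {self.parse_method!r}")


def to_mineru_lang(ragflow_language: str, default: str = "en") -> str:
    """Translate a RAGFlow language name to a MinerU code (assumed fallback: 'en')."""
    return LANGUAGE_TO_MINERU_MAP.get(ragflow_language, default)
```

For a Cyrillic document, a caller might build `MinerUOptions(parse_method="ocr", formula_enable=False, lang=to_mineru_lang("Russian"))`.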

Frontend (`web/`)

Created a new `MinerUOptionsFormField` component that renders only when MinerU is selected as the layout recognition engine.

Added UI controls for:
- Parse method selection (dropdown)
- Formula recognition toggle (switch)
- Table recognition toggle (switch)

Added i18n translations for English and Chinese.

Integrated the options into both the dataset creation dialog and the dataset settings page.

Integration

Updated `rag/app/naive.py` to forward MinerU options to the parser
Updated the task service to handle the new configuration parameters
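
The forwarding step can be sketched roughly as follows. This is an assumption-laden illustration, not the code in `rag/app/naive.py`: the function name `extract_mineru_kwargs` and the `parser_config` keys (`layout_recognize`, `mineru_parse_method`, `mineru_formula_enable`, `mineru_table_enable`) are hypothetical, and the values are assumed to be stored as JSON booleans and strings.

```python
def extract_mineru_kwargs(parser_config: dict) -> dict:
    """Collect MinerU options from a dataset's parser_config (hypothetical keys)."""
    # Only forward options when MinerU is the selected layout engine.
    if parser_config.get("layout_recognize") != "MinerU":
        return {}
    return {
        # Defaults mirror the "everything on, auto method" behavior.
        "parse_method": parser_config.get("mineru_parse_method", "auto"),
        "formula_enable": bool(parser_config.get("mineru_formula_enable", True)),
        "table_enable": bool(parser_config.get("mineru_table_enable", True)),
    }
```

Returning an empty dict for other layout engines means the caller can unpack the result with `**kwargs` without special-casing MinerU.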

Why

MinerU is a powerful document parser, but the default settings don't work well for all document types. This PR allows users to:

- Choose the best parsing method for their documents
- Disable formula recognition for Cyrillic/non-Latin scripts where it causes issues
- Control table extraction based on document needs
- Benefit from automatic language detection for better OCR results

Testing

- Tested MinerU parsing with different parse methods
- Verified UI renders correctly when MinerU is selected/deselected
- Confirmed settings persist correctly in dataset configuration

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):


concertdictate · 2025-12-12T22:07:07Z

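What problem does this PR solve?

Summary

This PR adds more configuration options to the MinerU document parser, giving users finer control over PDF parsing behavior and improving support for multilingual documents.

Changes

Backend (deepdoc/parser/mineru_parser.py)

Added configurable parsing options:

- Parse Method: auto, txt, or ocr; lets users choose the extraction strategy
- Formula Recognition: enables or disables formula extraction (especially useful for Cyrillic documents, where it may cause issues)
- Table Recognition: enables or disables table extraction

Added a language code mapping (LANGUAGE_TO_MINERU_MAP) that converts RAGFlow language settings into MinerU-compatible language codes to improve OCR accuracy.

Improved the parser's configuration handling so that these options are passed correctly through the processing pipeline.

Frontend (web/)

Added a MinerUOptionsFormField component that renders conditionally when MinerU is selected as the layout recognition engine
Added UI controls:
- Parse method selection (dropdown)
- Formula recognition toggle (Switch)
- Table recognition toggle (Switch)
Added i18n translations for English and Chinese
Integrated these options into the dataset creation dialog and the dataset settings page

Integration

Updated rag/app/naive.py to pass MinerU options to the parser
Updated the task service to support the new configuration parameters

Why?

MinerU is a powerful document parser, but its default settings do not suit every document type. This PR lets users:

Choose the most suitable parsing method for their documents
Disable formula recognition for Cyrillic/non-Latin scripts, where it may cause issues
Control whether tables are extracted, depending on document needs
Get better OCR results through automatic language detection

Testing

Tested MinerU parsing with different parse methods
Verified that the UI renders correctly when MinerU is selected/deselected
Confirmed that the configuration persists correctly in the dataset settings

Type of change

Bug Fix (non-breaking change which fixes an issue)
New Feature (non-breaking change which adds functionality)
Documentation Update
Refactoring
Performance Improvement
Other (please describe):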


yongtenglei · 2025-12-15T09:11:30Z

Hi, @concertdictate

Thank you for your contribution. Have you tested all the backends you added here?

from enum import StrEnum  # available in Python 3.11+

class MinerUBackend(StrEnum):
    PIPELINE = "pipeline"  # Traditional multi-model pipeline (default)
    VLM_TRANSFORMERS = "vlm-transformers"  # Vision-language model using HuggingFace Transformers
    VLM_MLX_ENGINE = "vlm-mlx-engine"  # Faster, requires Apple Silicon and macOS 13.5+
    VLM_VLLM_ENGINE = "vlm-vllm-engine"  # Local vLLM engine, requires local GPU
    VLM_VLLM_ASYNC_ENGINE = "vlm-vllm-async-engine"  # Asynchronous vLLM engine, new in MinerU API
    VLM_LMDEPLOY_ENGINE = "vlm-lmdeploy-engine"  # LMDeploy engine
    VLM_HTTP_CLIENT = "vlm-http-client"  # HTTP client for remote vLLM server (CPU only)

At the moment, vlm-vllm-async-engine and vlm-lmdeploy-engine are not included because they haven't been tested yet. No worries, I just want to make sure that what we add is runnable.
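
One way to keep untested values in the enum while still guarding against accidental use is sketched below. This is only an illustration of the concern raised here, not how RAGFlow handles it: `UNTESTED_BACKENDS`, `resolve_backend`, and the `allow_untested` flag are hypothetical names, and the sketch uses `(str, Enum)` instead of `StrEnum` so that it also runs on Python versions before 3.11.

```python
from enum import Enum


class MinerUBackend(str, Enum):
    PIPELINE = "pipeline"
    VLM_TRANSFORMERS = "vlm-transformers"
    VLM_MLX_ENGINE = "vlm-mlx-engine"
    VLM_VLLM_ENGINE = "vlm-vllm-engine"
    VLM_VLLM_ASYNC_ENGINE = "vlm-vllm-async-engine"
    VLM_LMDEPLOY_ENGINE = "vlm-lmdeploy-engine"
    VLM_HTTP_CLIENT = "vlm-http-client"


# Backends named in this comment as not yet tested.
UNTESTED_BACKENDS = frozenset(
    {MinerUBackend.VLM_VLLM_ASYNC_ENGINE, MinerUBackend.VLM_LMDEPLOY_ENGINE}
)


def resolve_backend(name: str, allow_untested: bool = False) -> MinerUBackend:
    """Convert a config string to a backend, rejecting unknown or untested values."""
    backend = MinerUBackend(name)  # raises ValueError for unknown names
    if backend in UNTESTED_BACKENDS and not allow_untested:
        raise ValueError(f"Backend {name!r} has not been tested yet")
    return backend
```

With this shape, moving a backend out of `UNTESTED_BACKENDS` once it has been verified is a one-line change.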

Cheers.

concertdictate · 2025-12-15T09:38:26Z

I have only tested VLM_MLX_ENGINE, VLM_VLLM_ENGINE, and PIPELINE (especially the last one).
I saw vlm-vllm-async-engine and vlm-lmdeploy-engine in the latest MinerU update (screenshot attached).
I added them for compatibility with upcoming updates.

concertdictate · 2025-12-15T12:21:39Z

Let me know if anything needs to be fixed; I’m happy to take care of it.

yongtenglei · 2025-12-16T02:24:46Z

Hi, @concertdictate

I tested the code and observed that MinerU is not actually used during file parsing, since it also needs to be configured for files. Or, did I miss anything?

Cheers.

P.S. For your information, we will be refactoring MinerU shortly (potentially today). The goal is to move away from a "batteries-included" local deployment model, as maintaining it has become a burden. Instead, we will only maintain the MinerU-API and the vllm-http-client for remote use.
Please expect some changes that may affect your workflow. Thank you in advance for your patience and great work on this feature!

Update:

I used the wrong file to test this feature; that was my mistake. However, we still need to offer a place to configure these options for individual files. I will handle it. Thank you!

concertdictate · 2025-12-16T07:50:26Z

> P.S. For your information, we will be refactoring MinerU shortly (potentially today). The goal is to move away from a "batteries-included" local deployment model, as maintaining it has become a burden. Instead, we will only maintain the MinerU-API and the vllm-http-client for remote use. Please expect some changes that may affect your workflow. Thank you in advance for your patience and great work on this feature!

That would be very appropriate, thank you for your work.

user210 and others added 2 commits December 12, 2025 20:07

Feat: Enhance MinerU options with language selection and parsing configurations

03d22b9

Feat: Introduce MinerU parsing options with backend, language, and method configurations

4353de3

dosubot bot added the size:L and 💞 feature labels Dec 12, 2025

KevinHuSh requested a review from yongtenglei December 15, 2025 03:42

Feat: Update MinerU parser to use dynamic formula and table enable options

5dd9a36

KevinHuSh added the ci label Dec 15, 2025

KevinHuSh marked this pull request as draft December 15, 2025 11:21

KevinHuSh marked this pull request as ready for review December 15, 2025 11:21

dosubot bot added the 🌈 python label Dec 15, 2025

Merge branch 'main' into feature/mineru-improvements

0ca8f19

KevinHuSh merged commit 49c74d0 into infiniflow:main Dec 16, 2025
1 check passed

concertdictate deleted the feature/mineru-improvements branch December 16, 2025 07:51

This was referenced Dec 30, 2025

[Bug]: Incorrect Chunk Thumbnails and Positioning for Chinese PDFs (ragflow 0.23.0 + mineru 2.6.6) — Regression from 0.22.1 #12309

Open

[Question]: The sliced data cannot be parsed as standard LaTeX formulas. #12450

Open
