
@concertdictate
Contributor

@concertdictate concertdictate commented Dec 12, 2025

I have repeated this description in Chinese in the comment below.

What problem does this PR solve?

Summary

This PR enhances the MinerU document parser with additional configuration options, giving users more control over PDF parsing behavior and improving support for multilingual documents.

Changes

Backend (deepdoc/parser/mineru_parser.py)

  • Added configurable parsing options:
    • Parse Method: auto, txt, or ocr — allows users to choose the extraction strategy
    • Formula Recognition: Toggle for enabling/disabling formula extraction (useful to disable for Cyrillic documents where it may cause issues)
    • Table Recognition: Toggle for enabling/disabling table extraction
  • Added language code mapping (LANGUAGE_TO_MINERU_MAP) to translate RAGFlow language settings to MinerU-compatible language codes for better OCR accuracy
  • Improved parser configuration handling to pass these options through the processing pipeline
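As a rough illustration of the language mapping described above, the sketch below assumes a small lookup table with a fallback. The entries, the `DEFAULT_MINERU_LANG` constant, and the `to_mineru_lang` helper are hypothetical; the actual `LANGUAGE_TO_MINERU_MAP` in `deepdoc/parser/mineru_parser.py` may use different keys and codes.

```python
from typing import Optional

# Illustrative subset only; NOT the table shipped in this PR.
# Both the RAGFlow-side keys and the MinerU-side codes are assumptions.
LANGUAGE_TO_MINERU_MAP = {
    "english": "en",
    "chinese": "ch",
    "japanese": "japan",
    "korean": "korean",
    "russian": "east_slavic",
}

DEFAULT_MINERU_LANG = "en"  # hypothetical fallback for unmapped languages


def to_mineru_lang(ragflow_language: Optional[str]) -> str:
    """Map a RAGFlow language name to a MinerU OCR language code.

    Lookup is case-insensitive; unknown or missing languages fall back
    to DEFAULT_MINERU_LANG instead of raising.
    """
    if not ragflow_language:
        return DEFAULT_MINERU_LANG
    key = ragflow_language.strip().lower()
    return LANGUAGE_TO_MINERU_MAP.get(key, DEFAULT_MINERU_LANG)
```

Falling back to a default rather than raising keeps parsing working for languages that have no mapping yet, at the cost of possibly weaker OCR for those documents.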

Frontend (web/)

  • Created new MinerUOptionsFormField component that conditionally renders when MinerU is selected as the layout recognition engine
  • Added UI controls for:
    • Parse method selection (dropdown)
    • Formula recognition toggle (switch)
    • Table recognition toggle (switch)
  • Added i18n translations for English and Chinese
  • Integrated the options into both the dataset creation dialog and dataset settings page

Integration

  • Updated rag/app/naive.py to forward MinerU options to the parser
  • Updated task service to handle the new configuration parameters
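The forwarding step can be sketched as follows. This is a minimal sketch that assumes the options live in the dataset's parser configuration dictionary; the key names (`mineru_parse_method`, `mineru_formula_enable`, `mineru_table_enable`), the `MinerUOptions` dataclass, the `extract_mineru_options` helper, and the defaults are all hypothetical and are not taken from the PR's code.

```python
from dataclasses import dataclass
from typing import Any, Mapping

# The three parse methods named in this PR description.
VALID_PARSE_METHODS = ("auto", "txt", "ocr")


@dataclass(frozen=True)
class MinerUOptions:
    """Hypothetical container for the options forwarded to the parser."""

    parse_method: str = "auto"
    formula_enable: bool = True
    table_enable: bool = True


def extract_mineru_options(parser_config: Mapping[str, Any]) -> MinerUOptions:
    """Read MinerU options from a parser config dict, applying defaults.

    An unrecognized parse method falls back to "auto" rather than failing.
    """
    method = str(parser_config.get("mineru_parse_method", "auto")).lower()
    if method not in VALID_PARSE_METHODS:
        method = "auto"
    return MinerUOptions(
        parse_method=method,
        formula_enable=bool(parser_config.get("mineru_formula_enable", True)),
        table_enable=bool(parser_config.get("mineru_table_enable", True)),
    )
```

For example, `extract_mineru_options({"mineru_parse_method": "ocr", "mineru_formula_enable": False})` would yield OCR parsing with formula recognition off and table recognition left at its default.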

Why

MinerU is a powerful document parser, but the default settings don't work well for all document types. This PR allows users to:

  1. Choose the best parsing method for their documents
  2. Disable formula recognition for Cyrillic/non-Latin scripts where it causes issues
  3. Control table extraction based on document needs
  4. Benefit from automatic mapping of the RAGFlow language setting to a MinerU language code, for better OCR results

Testing

  • Tested MinerU parsing with different parse methods
  • Verified UI renders correctly when MinerU is selected/deselected
  • Confirmed settings persist correctly in dataset configuration

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@dosubot (bot) added the size:L and 💞 feature labels on Dec 12, 2025
@concertdictate
Contributor Author

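What problem does this PR solve?

Summary

This PR adds more configuration options to the MinerU document parser, giving users finer control over PDF parsing behavior and improving support for multilingual documents.


Changes

Backend (deepdoc/parser/mineru_parser.py)

Added configurable parsing options:

  • Parse Method: auto, txt, or ocr; lets users choose the extraction strategy
  • Formula Recognition: enables or disables formula extraction (especially useful for Cyrillic documents, where it may cause issues)
  • Table Recognition: enables or disables table extraction

Added a language code mapping (LANGUAGE_TO_MINERU_MAP) that converts RAGFlow language settings to MinerU-compatible language codes, improving OCR accuracy.

Improved the parser's configuration handling so that these options are passed correctly through the processing pipeline.


Frontend (web/)

  • Added a MinerUOptionsFormField component that renders conditionally when MinerU is selected as the layout recognition engine
  • Added UI controls:
    • Parse method selection (dropdown)
    • Formula recognition toggle (switch)
    • Table recognition toggle (switch)
  • Added i18n translations for English and Chinese
  • Integrated these options into the dataset creation dialog and the dataset settings page

Integration

  • Updated rag/app/naive.py to pass MinerU options to the parser
  • Updated the task service to support the new configuration parameters

Why?

MinerU is a powerful document parser, but the default settings do not suit every document type. This PR lets users:

  • Choose the most suitable parsing method for their documents
  • Disable formula recognition for Cyrillic/non-Latin scripts, where it may cause issues
  • Control whether tables are extracted, based on document needs
  • Get better OCR results through automatic language detection

Testing

  • Tested MinerU parsing with different parse methods
  • Verified that the UI renders correctly when MinerU is selected or deselected
  • Confirmed that the settings persist correctly in the dataset configuration

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):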

@yongtenglei
Member

Hi, @concertdictate

Thank you for your contribution. Have you tested all the backends you added here?

from enum import StrEnum  # standard library, Python 3.11+

class MinerUBackend(StrEnum):
    PIPELINE = "pipeline"  # Traditional multimodel pipeline (default)
    VLM_TRANSFORMERS = "vlm-transformers"  # Vision-language model using HuggingFace Transformers
    VLM_MLX_ENGINE = "vlm-mlx-engine"  # Faster, requires Apple Silicon and macOS 13.5+
    VLM_VLLM_ENGINE = "vlm-vllm-engine"  # Local vLLM engine, requires local GPU
    VLM_VLLM_ASYNC_ENGINE = "vlm-vllm-async-engine"  # Asynchronous vLLM engine, new in MinerU API
    VLM_LMDEPLOY_ENGINE = "vlm-lmdeploy-engine"  # LMDeploy engine
    VLM_HTTP_CLIENT = "vlm-http-client"  # HTTP client for remote vLLM server (CPU only)

At the moment, vlm-vllm-async-engine and vlm-lmdeploy-engine are not included because they haven't been tested yet. No worries, I just want to make sure that what we add is runnable.

Cheers.

@concertdictate
Contributor Author

I have tested only VLM_MLX_ENGINE, VLM_VLLM_ENGINE, and PIPELINE (especially the latter).
I saw VLM_VLLM_ASYNC_ENGINE and vlm-lmdeploy-engine in the latest MinerU update (I’m attaching it in a screenshot).
This is for compatibility with upcoming updates.

image

@KevinHuSh added the ci label on Dec 15, 2025
@KevinHuSh KevinHuSh marked this pull request as draft December 15, 2025 11:21
@KevinHuSh KevinHuSh marked this pull request as ready for review December 15, 2025 11:21
@dosubot (bot) added the 🌈 python label on Dec 15, 2025
@concertdictate
Contributor Author

Let me know if anything needs to be fixed; I’m happy to take care of it.

@yongtenglei
Member

yongtenglei commented Dec 16, 2025

Hi, @concertdictate

I tested the code and observed that MinerU is not actually used during file parsing, since it also needs to be configured for files. Or, did I miss anything?

image image

Cheers.

P.S. For your information, we will be refactoring MinerU shortly (potentially today). The goal is to move away from a "batteries-included" local deployment model, as maintaining it has become a burden. Instead, we will only maintain the MinerU-API and the vlm-http-client for remote use.
Please expect some changes that may affect your workflow. Thank you in advance for your patience and great work on this feature!


Update:

I used the wrong file to test this feature; that is my bad. However, we still need to offer a place to configure these options for files. I will handle it. Thank you!

@KevinHuSh KevinHuSh merged commit 49c74d0 into infiniflow:main Dec 16, 2025
1 check passed
@concertdictate
Contributor Author

P.S. For your information, we will be refactoring MinerU shortly (potentially today). The goal is to move away from a "batteries-included" local deployment model, as maintaining it has become a burden. Instead, we will only maintain the MinerU-API and the vlm-http-client for remote use. Please expect some changes that may affect your workflow. Thank you in advance for your patience and great work on this feature!

That would be very appropriate, thank you for your work.
