Fix nemotron_parse: disable KV cache and suppress pdfium text by randerzander · Pull Request #1746 · NVIDIA/NeMo-Retriever

randerzander · 2026-03-28T13:42:23Z

Previously, nemotron-parse inference was not actually running because of a transformers version incompatibility, and fallback logic hid the failure by returning pdfium-extracted text instead.

This PR fixes several things:

Enable nemotron_parse inference by disabling the KV cache
Update handling to use nemotron_parse text instead of pdfium text
Exclude the x, y, and content-label headers from extracted text
Convert LaTeX table format to markdown

Claude description:

Disable KV cache in NemotronParseV12.invoke (use_cache=False) to fix AttributeError: 'tuple' object has no attribute 'update' with newer transformers versions that use DynamicCache objects
Suppress pdfium text extraction when text_extraction_method="nemotron_parse" so pdfium text is not duplicated alongside nemotron_parse output

…patibility The installed transformers version uses DynamicCache objects for past_key_values, but the custom MBart decoder code in hf_nemotron_parse_modeling.py passes raw tuples, causing AttributeError crashes during generation. Passing use_cache=False bypasses the KV cache entirely and unblocks inference. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
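The failure mode described above can be reproduced without transformers. The following is a minimal sketch, not the code from hf_nemotron_parse_modeling.py: `CacheStandIn` and `attention_step` are hypothetical names that only mimic a layer calling `.update()` on whatever it receives as `past_key_values`.

```python
class CacheStandIn:
    """Hypothetical stand-in for a cache object that exposes update()."""

    def __init__(self):
        self.entries = []

    def update(self, key, value, layer_idx):
        self.entries.append((layer_idx, key, value))
        return key, value


def attention_step(key, value, past_key_values=None, use_cache=True, layer_idx=0):
    """Mimic a decoder layer that updates the cache only when caching is on."""
    if use_cache and past_key_values is not None:
        key, value = past_key_values.update(key, value, layer_idx)
    return key, value


# Tuple-style cache, the shape the commit message says the custom decoder passes.
legacy_cache = (("k0",), ("v0",))

try:
    attention_step("k1", "v1", past_key_values=legacy_cache, use_cache=True)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# With use_cache=False the cache object is never touched, so the call succeeds.
result = attention_step("k1", "v1", past_key_values=legacy_cache, use_cache=False)

# A cache object with update() works with caching enabled.
cached = attention_step("k1", "v1", past_key_values=CacheStandIn(), use_cache=True)
```

Disabling the cache is a workaround with a cost: without cached keys and values, each generation step recomputes attention over all previous tokens, so decoding is slower than it would be with a working cache.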

When text_extraction_method is "nemotron_parse", mark the page as needing OCR so pdfium-extracted text is not included alongside the nemotron_parse output. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
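A minimal sketch of the suppression logic, assuming a per-page record with a needs-OCR flag. `PageResult`, `mark_page`, and `collect_text` are hypothetical names for illustration and are not taken from the repository.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageResult:
    """Hypothetical per-page record."""
    pdfium_text: str
    needs_ocr: bool = False


def mark_page(page: PageResult, text_extraction_method: str) -> PageResult:
    # Flag the page so downstream code ignores the pdfium text layer.
    if text_extraction_method == "nemotron_parse":
        page.needs_ocr = True
    return page


def collect_text(page: PageResult, parsed_text: Optional[str] = None) -> str:
    # A page flagged for OCR contributes only the model output.
    if page.needs_ocr:
        return parsed_text or ""
    return page.pdfium_text


page = mark_page(PageResult(pdfium_text="pdfium layer"), "nemotron_parse")
text = collect_text(page, parsed_text="model output")
```

In this sketch a failed parse yields an empty string rather than silently falling back to pdfium text, which is the kind of hidden fallback the PR description says masked the original failure.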

copy-pr-bot · 2026-03-28T13:42:27Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Remove <x_...>, <y_...>, <class_...> spatial annotation tags and convert \begin{tabular}...\end{tabular} blocks to pipe-delimited markdown tables. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
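One way to sketch this cleanup with regular expressions. The tag patterns follow the shapes named in the commit message; the exact tag payloads, the helper names `tabular_to_markdown` and `clean_output`, and the treatment of the first row as the header are assumptions.

```python
import re

# Spatial annotation tags such as <x_...>, <y_...>, <class_...>.
TAG_RE = re.compile(r"<(?:x|y|class)_[^>]*>")
# A tabular environment with an optional column spec, e.g. {cc}.
TABULAR_RE = re.compile(
    r"\\begin\{tabular\}(?:\{[^}]*\})?(.*?)\\end\{tabular\}", re.DOTALL
)


def tabular_to_markdown(match: re.Match) -> str:
    body = match.group(1).replace(r"\hline", "")
    rows = []
    for raw_row in body.split(r"\\"):  # LaTeX row separator
        if raw_row.strip():
            rows.append([cell.strip() for cell in raw_row.split("&")])
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]  # pad short rows
    lines = ["| " + " | ".join(rows[0]) + " |"]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines.extend("| " + " | ".join(row) + " |" for row in rows[1:])
    return "\n".join(lines)


def clean_output(text: str) -> str:
    text = TABULAR_RE.sub(tabular_to_markdown, text)
    return TAG_RE.sub("", text).strip()


sample = (
    "<x_0.1><y_0.2>"
    r"\begin{tabular}{cc} A & B \\ 1 & 2 \\ \end{tabular}"
    "<x_0.9><y_0.8><class_Table>"
)
markdown = clean_output(sample)
```

This sketch does not handle nested environments, `\multicolumn`, or escaped ampersands inside cells.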

Replace post-hoc tag removal with a structured parser that iterates detection spans (<x><y>CONTENT<x><y><class_LABEL>), extracting only the content text and skipping Picture entries. LaTeX tables are still converted to pipe-delimited markdown. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
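A structured parser along these lines could look like the following sketch. The span layout comes from the commit message; the regular expression, the function name `extract_text`, and the coordinate values in the sample are assumptions.

```python
import re

# One detection span: <x><y>CONTENT<x><y><class_LABEL>
SPAN_RE = re.compile(
    r"<x_[^>]*><y_[^>]*>(?P<content>.*?)"
    r"<x_[^>]*><y_[^>]*><class_(?P<label>[^>]*)>",
    re.DOTALL,
)


def extract_text(raw: str, skip_labels=("Picture",)) -> str:
    """Return the content of each span, skipping unwanted labels."""
    parts = []
    for match in SPAN_RE.finditer(raw):
        if match.group("label") in skip_labels:
            continue
        content = match.group("content").strip()
        if content:
            parts.append(content)
    return "\n".join(parts)


raw_output = (
    "<x_0.1><y_0.1>Title text<x_0.9><y_0.2><class_Title>"
    "<x_0.1><y_0.3><x_0.5><y_0.6><class_Picture>"
    "<x_0.1><y_0.7>Body text<x_0.9><y_0.9><class_Text>"
)
extracted = extract_text(raw_output)
```

Compared with stripping tags after the fact, iterating spans keeps each label next to its content, which is what makes it possible to drop Picture entries entirely.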

randerzander and others added 2 commits March 28, 2026 09:20

Suppress pdfium text extraction when using nemotron_parse method

c825fe4


randerzander requested review from a team as code owners March 28, 2026 13:42

randerzander requested a review from jperez999 March 28, 2026 13:42

Strip position tags and convert LaTeX tables in NemotronParseV12 output

e603b0e


randerzander requested a review from edknv March 28, 2026 13:56

randerzander wants to merge 4 commits into NVIDIA:main from randerzander:nemotron_parse_fixes
