Fix nemotron_parse: disable KV cache and suppress pdfium text #1746

Open
randerzander wants to merge 4 commits into NVIDIA:main from randerzander:nemotron_parse_fixes

randerzander (Collaborator) commented Mar 28, 2026

Previously, nemotron-parse inference was not actually running because of a transformers version incompatibility, but the fallback logic hid the failure by silently using pdfium-extracted text instead.

This PR fixes several things:

  1. Enable nemotron_parse inference by disabling the KV cache
  2. Update handling to use nemotron_parse text instead of pdfium text
  3. Don't include the x, y, and content label headers in extracted text
  4. Convert LaTeX table format to markdown

Claude description:

  • Disable KV cache in NemotronParseV12.invoke (use_cache=False) to fix AttributeError: 'tuple' object has no attribute 'update' with newer transformers versions that use DynamicCache objects
  • Suppress pdfium text extraction when text_extraction_method="nemotron_parse" so pdfium text is not duplicated alongside nemotron_parse output
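
For reviewers unfamiliar with the failure mode, here is a toy reproduction in plain Python. It is only a sketch: `ToyDynamicCache`, `toy_attention`, and `toy_generate_step` are hypothetical stand-ins, not the transformers classes or the code in `hf_nemotron_parse_modeling.py`. It shows why code that calls `.update()` on the cache breaks when handed a raw tuple, and why skipping the cache avoids the call entirely.

```python
class ToyDynamicCache:
    """Hypothetical stand-in for a cache object that exposes update()."""

    def __init__(self):
        self.store = {}

    def update(self, key, value, layer_idx):
        # Record the new key/value pair for this layer and hand it back.
        self.store.setdefault(layer_idx, []).append((key, value))
        return key, value


def toy_attention(key, value, past_key_values=None, layer_idx=0):
    """Mimics attention code that expects a cache object whenever a cache is given."""
    if past_key_values is not None:
        # A raw tuple has no update() method, so this line raises AttributeError.
        key, value = past_key_values.update(key, value, layer_idx)
    return key, value


def toy_generate_step(use_cache, legacy_tuple_cache=True):
    """Hypothetical decode step; legacy_tuple_cache models the old tuple format."""
    if not use_cache:
        past = None  # no cache at all: the update() branch is never reached
    elif legacy_tuple_cache:
        past = ((None, None),)  # raw tuple, as passed by the custom decoder code
    else:
        past = ToyDynamicCache()
    return toy_attention("k", "v", past_key_values=past)
```

Calling `toy_generate_step(use_cache=True)` raises `AttributeError: 'tuple' object has no attribute 'update'`, while `toy_generate_step(use_cache=False)` returns normally. The trade-off of `use_cache=False` is that each decoding step recomputes attention over all previous tokens, so generation is slower than with a working cache.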

randerzander and others added 2 commits March 28, 2026 09:20
…patibility

The installed transformers version uses DynamicCache objects for past_key_values,
but the custom MBart decoder code in hf_nemotron_parse_modeling.py passes raw
tuples, causing AttributeError crashes during generation. Passing use_cache=False
bypasses the KV cache entirely and unblocks inference.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When text_extraction_method is "nemotron_parse", mark the page as
needing OCR so pdfium-extracted text is not included alongside the
nemotron_parse output.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
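
The intent of this commit can be sketched roughly as follows. The function names and the fallback rule for other extraction methods are assumptions made for illustration; the actual pipeline code is not shown in this conversation.

```python
def needs_ocr(text_extraction_method: str, pdfium_text: str) -> bool:
    """Hypothetical helper: decide whether pdfium text should be ignored for a page."""
    if text_extraction_method == "nemotron_parse":
        # Always treat the page as needing OCR so pdfium text is never mixed in.
        return True
    # Assumed behavior for other methods: fall back to OCR only if pdfium found nothing.
    return not pdfium_text.strip()


def page_text(text_extraction_method: str, pdfium_text: str, model_text: str) -> str:
    """Hypothetical helper: pick exactly one text source per page."""
    if needs_ocr(text_extraction_method, pdfium_text):
        return model_text
    return pdfium_text
```

With this shape, a page processed with `text_extraction_method="nemotron_parse"` yields only the model output, even when pdfium could extract native text from the same page.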
randerzander requested review from a team as code owners March 28, 2026 13:42
randerzander requested a review from jperez999 March 28, 2026 13:42
copy-pr-bot bot commented Mar 28, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Remove <x_...>, <y_...>, <class_...> spatial annotation tags and convert
\begin{tabular}...\end{tabular} blocks to pipe-delimited markdown tables.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
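
A minimal sketch of the post-processing this commit describes, assuming tags look like `<x_0.1>`, `<y_0.2>`, `<class_Table>` and that tables are simple `tabular` blocks with `&` cell separators and `\\` row terminators. The names `clean_output` and `tabular_to_markdown` are hypothetical, and the first table row is treated as the markdown header because pipe tables require one.

```python
import re

# Spatial annotation tags such as <x_0.1>, <y_0.2>, <class_Table>.
TAG_RE = re.compile(r"<(?:x|y|class)_[^>]*>")
# A tabular block with an optional, non-nested column spec such as {cc}.
TABULAR_RE = re.compile(
    r"\\begin\{tabular\}(?:\{[^}]*\})?(.*?)\\end\{tabular\}", re.DOTALL
)


def tabular_to_markdown(body: str) -> str:
    """Convert the inside of a tabular block to a pipe-delimited markdown table."""
    body = body.replace("\\hline", "")
    rows = []
    for raw_row in body.split("\\\\"):  # rows end with a double backslash
        if raw_row.strip():
            rows.append([cell.strip() for cell in raw_row.split("&")])
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]  # pad short rows
    lines = ["| " + " | ".join(rows[0]) + " |"]
    lines.append("| " + " | ".join(["---"] * width) + " |")
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


def clean_output(text: str) -> str:
    """Strip spatial tags, then rewrite every tabular block as markdown."""
    text = TAG_RE.sub("", text)
    return TABULAR_RE.sub(lambda m: tabular_to_markdown(m.group(1)), text)
```

This sketch does not handle nested column specs such as `{p{3cm}}`, `\multicolumn`, or escaped `\&` inside cells.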
randerzander requested a review from edknv March 28, 2026 13:56
Replace post-hoc tag removal with a structured parser that iterates
detection spans (<x><y>CONTENT<x><y><class_LABEL>), extracting only
the content text and skipping Picture entries. LaTeX tables are still
converted to pipe-delimited markdown.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
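
The structured approach could look something like the sketch below. The exact tag syntax (`<x_0.1>` and so on) and the function name `extract_text` are assumptions; only the overall span shape `<x><y>CONTENT<x><y><class_LABEL>` comes from the commit message. LaTeX table conversion is left out here to keep the example focused on span iteration.

```python
import re

# One detection span: opening <x><y>, content, closing <x><y>, then the class label.
SPAN_RE = re.compile(
    r"<x_[^>]*><y_[^>]*>(.*?)<x_[^>]*><y_[^>]*><class_([^>]*)>",
    re.DOTALL,
)


def extract_text(raw: str, skip_labels=frozenset({"Picture"})) -> str:
    """Return only the content of each detection span, skipping unwanted labels."""
    parts = []
    for match in SPAN_RE.finditer(raw):
        content = match.group(1).strip()
        label = match.group(2)
        if label in skip_labels or not content:
            continue  # drop Picture entries and empty spans
        parts.append(content)
    return "\n".join(parts)
```

Compared with removing tags after the fact, iterating over spans keeps each piece of content paired with its label, which is what makes it possible to drop `Picture` entries rather than just their tags.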