
Fix parquet loading crash from datasets version mismatch#1140

Open
yeyu-nvidia wants to merge 3 commits into main from yeyu/fix-parquet-datasets-compat

Conversation

@yeyu-nvidia
Contributor

@yeyu-nvidia yeyu-nvidia commented Mar 30, 2026

Summary

  • When local parquet files contain HF datasets metadata written by a different library version, load_dataset("parquet") raises a TypeError during feature deserialization
  • Added a fallback that catches the TypeError and reads parquet files directly via PyArrow, bypassing the incompatible metadata

Test plan

  • Run specdec_bench with EAGLE config against local parquet dataset files
  • Verify normal (compatible) parquet loading still works via the primary load_dataset path

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes
    • Improved robustness of the parquet dataset loader with a safer fallback path to ensure reliable loading across environments.
  • Chores
    • Broadened the supported version range for the datasets dependency to increase compatibility.

When local parquet files contain HF datasets metadata written by a
different version of the `datasets` library, `load_dataset("parquet")`
can raise a TypeError during feature deserialization. Fall back to
reading via PyArrow directly in that case.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
@yeyu-nvidia yeyu-nvidia requested a review from a team as a code owner March 30, 2026 16:24
@yeyu-nvidia yeyu-nvidia requested a review from h-guo18 March 30, 2026 16:24
@coderabbitai
Contributor

coderabbitai bot commented Mar 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: b43c5829-7e39-4b28-afb5-014daf5152f8

📥 Commits

Reviewing files that changed from the base of the PR and between 45a33f8 and 94fadbd.

📒 Files selected for processing (2)
  • examples/specdec_bench/requirements_speed.txt
  • examples/specdec_bench/specdec_bench/datasets/speed.py
✅ Files skipped from review due to trivial changes (1)
  • examples/specdec_bench/specdec_bench/datasets/speed.py

📝 Walkthrough

Walkthrough

Added a try/except around parquet dataset loading in SPEEDBench._load_dataset; on TypeError the code falls back to reading parquet files with pyarrow, optionally strips Arrow schema metadata, concatenates tables, and constructs a HuggingFace Dataset from the resulting Arrow table.

Changes

Cohort / File(s) | Summary
Parquet load fallback (examples/specdec_bench/specdec_bench/datasets/speed.py) | Wrap datasets.load_dataset("parquet", ...) in try/except TypeError. On error, read files with pyarrow.parquet.read_table, concatenate via pyarrow.concat_tables, strip the b"huggingface" schema metadata if present, and build a datasets.Dataset from the resulting Arrow table.
Dependency constraint update (examples/specdec_bench/requirements_speed.txt) | Relaxed the datasets version constraint from >=4.4.0,<5.0.0 to >=3.1.0 (upper bound removed).

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant SPEEDBench
    participant DatasetsLib as "datasets.load_dataset"
    participant PyArrow as "pyarrow.parquet"
    participant HF_Dataset as "datasets.Dataset"

    Caller->>SPEEDBench: request dataset load
    SPEEDBench->>DatasetsLib: load_dataset("parquet", data_files, split="test")
    alt success
        DatasetsLib-->>SPEEDBench: Dataset
        SPEEDBench-->>Caller: return Dataset (possibly truncated)
    else TypeError
        DatasetsLib--xSPEEDBench: raises TypeError
        SPEEDBench->>PyArrow: read_table(file1), read_table(fileN)
        PyArrow-->>SPEEDBench: Table(s)
        SPEEDBench->>PyArrow: concat_tables(tables)
        PyArrow-->>SPEEDBench: concatenated Table
        SPEEDBench->>SPEEDBench: strip b"huggingface" metadata if present
        SPEEDBench->>HF_Dataset: Dataset(concatenated Table)
        HF_Dataset-->>SPEEDBench: Dataset
        SPEEDBench-->>Caller: return Dataset (possibly truncated)
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
Title Check | ✅ Passed | The PR title 'Fix parquet loading crash from datasets version mismatch' directly identifies the problem (a parquet loading crash caused by a version mismatch) and aligns with the code changes and PR objectives.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, above the required 80.00% threshold.
Security Anti-Patterns | ✅ Passed | The parquet loading fallback via PyArrow introduces no security anti-patterns from SECURITY.md.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch yeyu/fix-parquet-datasets-compat

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Contributor

github-actions bot commented Mar 30, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1140/

Built to branch gh-pages at 2026-03-31 17:00 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/specdec_bench/specdec_bench/datasets/speed.py`:
- Around line 721-730: The except clause currently catching TypeError around the
parquet fallback (the block that imports pyarrow, pq.read_table,
pyarrow.concat_tables and constructs HFDataset(table) from data_files["test"])
should be changed so the fallback actually runs for Hugging Face metadata
incompatibility errors: replace `except TypeError:` with `except ValueError:`
(or `except (TypeError, ValueError):` if you want to handle both) so the
PyArrow-to-HFDataset fallback triggers when datasets raises the ValueError about
unknown feature types.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: faeefce7-a99c-401f-af42-d6d7bd1addf3

📥 Commits

Reviewing files that changed from the base of the PR and between a3f5c46 and 45a33f8.

📒 Files selected for processing (1)
  • examples/specdec_bench/specdec_bench/datasets/speed.py

Comment on lines +721 to +730
except TypeError:
# Fallback: parquet metadata may be incompatible with the installed
# ``datasets`` version. Read via PyArrow and convert directly.
import pyarrow
import pyarrow.parquet as pq
from datasets import Dataset as HFDataset

tables = [pq.read_table(f) for f in data_files["test"]]
table = pyarrow.concat_tables(tables) if len(tables) > 1 else tables[0]
dataset = HFDataset(table)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

What exact exception types/messages does Hugging Face `datasets` raise when loading Parquet files whose embedded `features` metadata is incompatible across `datasets` versions?

💡 Result:

When datasets reads a Parquet file that contains embedded 🤗 Datasets schema metadata (stored under the Parquet/Arrow schema metadata key b"huggingface"), it reconstructs Features from the serialized dict. If that dict contains a feature _type that doesn’t exist in the installed datasets version, the load fails with:

  • Exception type: ValueError
  • Exception message (verbatim example seen when loading Parquet written by datasets>=4.0.0—which embeds _type: "List"—using datasets==3.6.0):
    ValueError: Feature type 'List' not found. Available feature types: ['Value', 'ClassLabel', 'Translation', 'TranslationVariableLanguages', 'LargeList', 'Sequence', 'Array2D', 'Array3D', 'Array4D', 'Array5D', 'Audio', 'Image', 'Video', 'Pdf'] [1][2]

This is raised from datasets.features.features.generate_from_dict(...) when _FEATURE_TYPES (in that installed version) has no entry for the serialized _type string. [1]

Sources: [1] huggingface/datasets#7676 [2] verl-project/verl#2723


🏁 Script executed:

# Find the full context of the affected function to understand the intent
fd -t f speed.py

Repository: NVIDIA/Model-Optimizer

Length of output: 167


🏁 Script executed:

# Read the file around lines 710-740 to understand the full context
cat -n examples/specdec_bench/specdec_bench/datasets/speed.py | sed -n '710,740p'

Repository: NVIDIA/Model-Optimizer

Length of output: 1672


🏁 Script executed:

# Check if there are any comments or tests that clarify what TypeError is expected
rg -B5 -A10 'except TypeError' examples/specdec_bench/specdec_bench/datasets/speed.py

Repository: NVIDIA/Model-Optimizer

Length of output: 857


Change caught exception type from TypeError to ValueError to match the actual metadata incompatibility error from datasets.

At line 721, the code documents a fallback for parquet metadata incompatibility but catches TypeError. However, Hugging Face datasets raises ValueError when loading a Parquet file with embedded feature metadata (_type key) not available in the installed version. Example error: "Feature type 'List' not found. Available feature types: [...]" raised from datasets.features.features.generate_from_dict(...).

As written, the PyArrow fallback would never execute for the documented use case. Change except TypeError: to except ValueError: to properly trigger the fallback for metadata incompatibility, or clarify what TypeError scenario the current catch is intended to handle.
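The reviewer's suggestion can be illustrated with a small, self-contained sketch. All names here are hypothetical: `deserialize_features` stands in for the feature-deserialization step inside `datasets` (e.g. generate_from_dict), and `fallback` stands in for the PyArrow path.

```python
def load_features_tolerantly(deserialize_features, raw, fallback):
    """Deserialize embedded feature metadata, falling back on either
    failure mode: TypeError (the PR's original catch) or ValueError
    (raised by datasets for unknown feature types, e.g.
    "Feature type 'List' not found")."""
    try:
        return deserialize_features(raw)
    except (TypeError, ValueError):
        return fallback(raw)
```

Catching the tuple `(TypeError, ValueError)` covers both the scenario the PR author observed and the error the reviewer documented, at the cost of masking unrelated ValueErrors from the loader; narrowing to `except ValueError:` alone is the reviewer's stricter alternative.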


@codecov

codecov bot commented Mar 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 70.19%. Comparing base (a3f5c46) to head (94fadbd).
⚠️ Report is 7 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1140      +/-   ##
==========================================
+ Coverage   70.14%   70.19%   +0.04%     
==========================================
  Files         230      230              
  Lines       26053    26073      +20     
==========================================
+ Hits        18276    18302      +26     
+ Misses       7777     7771       -6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

yeyu-nvidia and others added 2 commits March 31, 2026 09:54
The PyArrow fallback still failed because HFDataset(table) parses
the huggingface metadata embedded in the arrow schema, hitting the
same TypeError. Strip that metadata before constructing the Dataset.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>
The tensorrt_llm 1.3.0rc5 container pins datasets==3.1.0. The previous
pin (>=4.4.0) caused concurrent pip installs across ranks to race and
corrupt the datasets package, breaking tensorrt_llm imports entirely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Ye Yu <yeyu@nvidia.com>