2 changes: 1 addition & 1 deletion examples/specdec_bench/requirements_speed.txt
@@ -1,4 +1,4 @@
-datasets>=4.4.0,<5.0.0
+datasets>=3.1.0
 rich>=14.2.0
 seaborn>=0.13.2
 tiktoken>=0.12.0
22 changes: 21 additions & 1 deletion examples/specdec_bench/specdec_bench/datasets/speed.py
@@ -716,7 +716,27 @@ def _load_dataset(self, config_name_or_dataset_path: config_type | str) -> "Data
             }
         else:
             data_files = {"test": [str(config_name_or_dataset_path_path)]}
-        dataset = load_dataset("parquet", data_files=data_files, split="test")
+        try:
+            dataset = load_dataset("parquet", data_files=data_files, split="test")
+        except TypeError:
+            # Fallback: parquet metadata may be incompatible with the installed
+            # ``datasets`` version. Read via PyArrow and convert directly.
+            import pyarrow
+            import pyarrow.parquet as pq
+            from datasets import Dataset as HFDataset
+
+            tables = [pq.read_table(f) for f in data_files["test"]]
+            table = pyarrow.concat_tables(tables) if len(tables) > 1 else tables[0]
+            # Strip HF metadata from the schema to avoid Feature parsing errors
+            schema = table.schema
+            if schema.metadata and b"huggingface" in schema.metadata:
+                new_meta = {
+                    k: v
+                    for k, v in schema.metadata.items()
+                    if k != b"huggingface"
+                }
+                table = table.replace_schema_metadata(new_meta or None)
+            dataset = HFDataset(table)
Comment on lines +721 to +739
Contributor
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

What exact exception types/messages does Hugging Face `datasets` raise when loading Parquet files whose embedded `features` metadata is incompatible across `datasets` versions?

💡 Result:

When datasets reads a Parquet file that contains embedded 🤗 Datasets schema metadata (stored under the Parquet/Arrow schema metadata key b"huggingface"), it reconstructs Features from the serialized dict. If that dict contains a feature _type that doesn’t exist in the installed datasets version, the load fails with:

  • Exception type: ValueError
  • Exception message (verbatim example seen when loading Parquet written by datasets>=4.0.0, which embeds _type: "List", using datasets==3.6.0):
    ValueError: Feature type 'List' not found. Available feature types: ['Value', 'ClassLabel', 'Translation', 'TranslationVariableLanguages', 'LargeList', 'Sequence', 'Array2D', 'Array3D', 'Array4D', 'Array5D', 'Audio', 'Image', 'Video', 'Pdf'] [1][2]

This is raised from datasets.features.features.generate_from_dict(...) when _FEATURE_TYPES (in that installed version) has no entry for the serialized _type string. [1]

Sources: [1] huggingface/datasets#7676 [2] verl-project/verl#2723
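The schema-metadata filtering that the patch applies before constructing the dataset can be illustrated without PyArrow. A minimal pure-Python sketch of the same logic (the function name is illustrative, not from the repository):

```python
def strip_hf_metadata(metadata):
    """Drop the b"huggingface" key from Arrow schema metadata,
    mirroring the filtering in the patched PyArrow fallback."""
    if metadata and b"huggingface" in metadata:
        new_meta = {k: v for k, v in metadata.items() if k != b"huggingface"}
        # An empty dict becomes None, matching `new_meta or None` in the
        # patch so replace_schema_metadata clears the metadata entirely.
        return new_meta or None
    return metadata

print(strip_hf_metadata({b"huggingface": b'{"info": {}}', b"pandas": b"{}"}))
# -> {b'pandas': b'{}'}
print(strip_hf_metadata({b"huggingface": b'{"info": {}}'}))
# -> None
```

Removing the `b"huggingface"` key prevents the installed `datasets` version from attempting to deserialize feature types (such as `List`) that it does not know about.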


🏁 Script executed:

# Find the full context of the affected function to understand the intent
fd -t f speed.py

Repository: NVIDIA/Model-Optimizer

Length of output: 167


🏁 Script executed:

# Read the file around lines 710-740 to understand the full context
cat -n examples/specdec_bench/specdec_bench/datasets/speed.py | sed -n '710,740p'

Repository: NVIDIA/Model-Optimizer

Length of output: 1672


🏁 Script executed:

# Check if there are any comments or tests that clarify what TypeError is expected
rg -B5 -A10 'except TypeError' examples/specdec_bench/specdec_bench/datasets/speed.py

Repository: NVIDIA/Model-Optimizer

Length of output: 857


Change caught exception type from `TypeError` to `ValueError` to match the actual metadata incompatibility error from `datasets`.

At line 721, the code documents a fallback for parquet metadata incompatibility but catches `TypeError`. However, Hugging Face `datasets` raises `ValueError` when loading a Parquet file with embedded feature metadata (`_type` key) not available in the installed version. Example error: "Feature type 'List' not found. Available feature types: [...]" raised from `datasets.features.features.generate_from_dict(...)`.

As written, the PyArrow fallback would never execute for the documented use case. Change `except TypeError:` to `except ValueError:` to properly trigger the fallback for metadata incompatibility, or clarify what `TypeError` scenario the current catch is intended to handle.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/specdec_bench/specdec_bench/datasets/speed.py` around lines 721 -
730, The except clause currently catching TypeError around the parquet fallback
(the block that imports pyarrow, pq.read_table, pyarrow.concat_tables and
constructs HFDataset(table) from data_files["test"]) should be changed so the
fallback actually runs for Hugging Face metadata incompatibility errors: replace
`except TypeError:` with `except ValueError:` (or `except (TypeError,
ValueError):` if you want to handle both) so the PyArrow-to-HFDataset fallback
triggers when datasets raises the ValueError about unknown feature types.
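The effect of broadening the except clause can be sketched in isolation. A minimal, self-contained example (loader names are hypothetical stand-ins, not the repository's code) showing that catching both exception types lets the fallback fire on the `ValueError` that `datasets` actually raises:

```python
def load_with_fallback(primary, fallback):
    """Try the primary loader; fall back on metadata-incompatibility errors."""
    try:
        return primary()
    except (TypeError, ValueError):
        # ValueError covers "Feature type 'List' not found ..." raised by
        # datasets.features.features.generate_from_dict on older versions;
        # TypeError preserves whatever case the original catch targeted.
        return fallback()

def incompatible_loader():
    # Stand-in for load_dataset("parquet", ...) on a file written by a
    # newer `datasets` whose embedded feature _type is unknown locally.
    raise ValueError("Feature type 'List' not found.")

print(load_with_fallback(incompatible_loader, lambda: "pyarrow fallback"))
# -> pyarrow fallback
```

With only `except TypeError:`, the `ValueError` above would propagate and the PyArrow fallback would never run.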

         if self.num_samples is not None:
             dataset = dataset.select(range(self.num_samples))
         return dataset