Skip to content

Conversation

@savitha-eng
Copy link
Collaborator

Description

Fix streaming dataset column removal for OpenGenome2's inconsistent schema

This PR addresses a bug discovered when training with OpenGenome2 on SLURM where dataset.column_names returns None for streaming datasets with inconsistent schemas across shards.

The Issue:

  • OpenGenome2 has inconsistent schemas: some shards contain ["text", "record"], others only ["text"]
  • When using streaming=True, HuggingFace's IterableDataset returns dataset.column_names = None due to this inconsistency
  • The original code used remove_columns=dataset.column_names, which failed silently, leaving raw text/record columns in the tokenized dataset
  • This caused dataloader collation errors when training on SLURM

The Fix:

  • Detect IterableDataset and explicitly list columns to remove: [sequence_column, "record"]
  • IterableDataset.map() handles missing columns gracefully, so it's safe to list "record" even when absent
  • For regular Dataset (non-streaming or consistent schema), continue using dataset.column_names
  • Added comprehensive comments and TODO to remove workaround once Arc Institute fixes the schema

Usage

The fix is transparent to users. Streaming datasets now work correctly:

from dataset import create_bshd_dataloader
from distributed_config import DistributedConfig

# This now works with OpenGenome2 streaming
load_dataset_kwargs = {
    "path": "arcinstitute/opengenome2",
    "split": "train",
    "streaming": True,  # Fixed: column_names can be None
    "data_dir": "json/pretraining_or_both_phases"
}

dataloader, _ = create_bshd_dataloader(
    distributed_config=DistributedConfig(),
    tokenizer_path="./example_checkpoint",
    load_dataset_kwargs=load_dataset_kwargs,
    micro_batch_size=2,
    num_workers=4,
    max_seq_length=8192,
    sequence_column="text",  # OpenGenome2 uses "text" not "sequence"
)

# Tokenized dataset now only contains tokenizer outputs:
# ["input_ids", "attention_mask", "token_type_ids", "overflow_to_sample_mapping"]
# No raw "text" or "record" columns

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

No special CI configuration needed. Default unit tests are sufficient.

  • ciflow:skip
  • ciflow:notebooks
  • ciflow:slow
  • ciflow:all
  • ciflow:all-recipes

Pre-submit Checklist

  • I have tested these changes locally
    • All 10 dataset tests pass (including 2 new regression tests)
    • Pre-commit checks pass (ruff, formatting, trailing whitespace)
    • Tested with both streaming and non-streaming datasets
  • I have updated the documentation accordingly
    • Added comprehensive inline comments explaining the workaround
    • Documented why ESM2 doesn't need this fix
    • Added TODO to remove once Arc Institute fixes schema
  • I have added/updated tests as needed
    • test_streaming_dataset_removes_columns_correctly - Verifies column removal
    • test_streaming_dataset_handles_missing_record_column - Verifies graceful handling
  • All existing tests pass successfully
    • 10/10 dataset tests passing
    • No regressions in existing functionality

Additional Notes

Future Work: This workaround should be removed once Arc Institute fixes OpenGenome2 schema consistency across all shards. When all shards have identical columns, dataset.column_names will work correctly for streaming datasets.

Validation: This fix was validated on actual SLURM training runs with OpenGenome2, where the original bug was discovered.

…streaming dataset to work with training

Signed-off-by: savitha-eng <[email protected]>
@copy-pr-bot
Copy link

copy-pr-bot bot commented Nov 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@savitha-eng savitha-eng marked this pull request as ready for review November 21, 2025 07:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants