Temporary workaround for the Arc Opengenome2 dataset inconsistency to enable training with streaming datasets to work #1340

savitha-eng · 2025-11-21T07:00:44Z

Description

Fix streaming dataset column removal for OpenGenome2's inconsistent schema

This PR addresses a bug discovered when training with OpenGenome2 on SLURM where dataset.column_names returns None for streaming datasets with inconsistent schemas across shards.

The Issue:

OpenGenome2 has inconsistent schemas: some shards contain ["text", "record"], others only ["text"]
When using streaming=True, HuggingFace's IterableDataset returns dataset.column_names = None due to this inconsistency
The original code used remove_columns=dataset.column_names, which failed silently, leaving raw text/record columns in the tokenized dataset
This caused dataloader collation errors when training on SLURM

The Fix:

Detect IterableDataset and explicitly list columns to remove: [sequence_column, "record"]
IterableDataset.map() handles missing columns gracefully, so it's safe to list "record" even when absent
For regular Dataset (non-streaming or consistent schema), continue using dataset.column_names
Added comprehensive comments and TODO to remove workaround once Arc Institute fixes the schema

Usage

The fix is transparent to users. Streaming datasets now work correctly:

from dataset import create_bshd_dataloader
from distributed_config import DistributedConfig

# This now works with OpenGenome2 streaming
load_dataset_kwargs = {
    "path": "arcinstitute/opengenome2",
    "split": "train",
    "streaming": True,  # Fixed: column_names can be None
    "data_dir": "json/pretraining_or_both_phases"
}

dataloader, _ = create_bshd_dataloader(
    distributed_config=DistributedConfig(),
    tokenizer_path="./example_checkpoint",
    load_dataset_kwargs=load_dataset_kwargs,
    micro_batch_size=2,
    num_workers=4,
    max_seq_length=8192,
    sequence_column="text",  # OpenGenome2 uses "text" not "sequence"
)

# Tokenized dataset now only contains tokenizer outputs:
# ["input_ids", "attention_mask", "token_type_ids", "overflow_to_sample_mapping"]
# No raw "text" or "record" columns

Type of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Refactor
Documentation update
Other (please describe):

CI Pipeline Configuration

No special CI configuration needed. Default unit tests are sufficient.

Pre-submit Checklist

I have tested these changes locally
- All 10 dataset tests pass (including 2 new regression tests)
- Pre-commit checks pass (ruff, formatting, trailing whitespace)
- Tested with both streaming and non-streaming datasets
I have updated the documentation accordingly
- Added comprehensive inline comments explaining the workaround
- Documented why ESM2 doesn't need this fix
- Added TODO to remove once Arc Institute fixes schema
I have added/updated tests as needed
- test_streaming_dataset_removes_columns_correctly - Verifies column removal
- test_streaming_dataset_handles_missing_record_column - Verifies graceful handling
All existing tests pass successfully
- 10/10 dataset tests passing
- No regressions in existing functionality

Additional Notes

Future Work: This workaround should be removed once Arc Institute fixes OpenGenome2 schema consistency across all shards. When all shards have identical columns, dataset.column_names will work correctly for streaming datasets.

Validation: This fix was validated on actual SLURM training runs with OpenGenome2, where the original bug was discovered.

…streaming dataset to work with training Signed-off-by: savitha-eng <[email protected]>

copy-pr-bot · 2025-11-21T07:00:48Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: savitha-eng <[email protected]>

Temporary workaround for Opegenome2 data inconsistency to enable the …

2b426ab

…streaming dataset to work with training Signed-off-by: savitha-eng <[email protected]>

remove unneccesarry batch size change

4248983

Signed-off-by: savitha-eng <[email protected]>

savitha-eng marked this pull request as ready for review November 21, 2025 07:09

savitha-eng requested review from cspades, dorotat-nv, jomitchellnv, jstjohn, jwilber, pstjohn and trvachov as code owners November 21, 2025 07:09

pstjohn approved these changes Nov 21, 2025

View reviewed changes

Merge branch 'main' into savitha/llama3-dataset-column-fix

5458cc7

savitha-eng enabled auto-merge November 21, 2025 18:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Temporary workaround for the Arc Opengenome2 dataset inconsistency to enable training with streaming datasets to work #1340

Temporary workaround for the Arc Opengenome2 dataset inconsistency to enable training with streaming datasets to work #1340

savitha-eng commented Nov 21, 2025

Uh oh!

copy-pr-bot bot commented Nov 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Temporary workaround for the Arc Opengenome2 dataset inconsistency to enable training with streaming datasets to work #1340

Are you sure you want to change the base?

Temporary workaround for the Arc Opengenome2 dataset inconsistency to enable training with streaming datasets to work #1340

Conversation

savitha-eng commented Nov 21, 2025

Description

Usage

Type of changes

CI Pipeline Configuration

Pre-submit Checklist

Additional Notes

Uh oh!

copy-pr-bot bot commented Nov 21, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants