Temporary workaround for the Arc OpenGenome2 dataset inconsistency to enable training with streaming datasets #1340
+153
−1
Description
Fix streaming dataset column removal for OpenGenome2's inconsistent schema
This PR addresses a bug discovered when training with OpenGenome2 on SLURM, where `dataset.column_names` returns `None` for streaming datasets with inconsistent schemas across shards.

The Issue:
- Some OpenGenome2 shards have columns `["text", "record"]`, others only `["text"]`
- With `streaming=True`, HuggingFace's `IterableDataset` returns `dataset.column_names = None` due to this inconsistency
- The code passed `remove_columns=dataset.column_names`, which failed silently, leaving raw text/record columns in the tokenized dataset

The Fix:
- Detect `IterableDataset` and explicitly list columns to remove: `[sequence_column, "record"]`
- `IterableDataset.map()` handles missing columns gracefully, so it's safe to list "record" even when absent
- For a regular `Dataset` (non-streaming or consistent schema), continue using `dataset.column_names`

Usage
The fix is transparent to users. Streaming datasets now work correctly.
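A minimal sketch of the column-selection logic described in this PR, written in plain Python so it runs without the `datasets` library. The helper names `columns_to_remove` and `drop_columns` and the `is_streaming` flag are hypothetical, chosen for illustration; the actual change presumably checks `isinstance(dataset, IterableDataset)` instead of taking a flag, and relies on `map()` for the removal.

```python
from typing import Dict, List, Optional, Sequence


def columns_to_remove(
    column_names: Optional[Sequence[str]],
    is_streaming: bool,
    sequence_column: str = "text",
) -> List[str]:
    """Pick the raw columns to drop after tokenization (hypothetical helper)."""
    if is_streaming or column_names is None:
        # A streaming dataset with inconsistent shards may report
        # column_names as None, so name the known raw columns explicitly.
        return [sequence_column, "record"]
    # A regular dataset reports a reliable schema, so use it as-is.
    return list(column_names)


def drop_columns(example: Dict[str, object], remove: Sequence[str]) -> Dict[str, object]:
    """Drop the listed columns from one example, ignoring absent ones."""
    return {key: value for key, value in example.items() if key not in remove}


# Streaming case: the schema is unknown, and this shard has no "record" column.
remove = columns_to_remove(None, is_streaming=True)
tokenized = drop_columns({"text": "ACGT", "input_ids": [1, 2, 3, 4]}, remove)
```

Listing `"record"` unconditionally keeps the streaming path independent of which shard an example came from, which is the property the two unit tests in this PR check.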
Type of changes
CI Pipeline Configuration
No special CI configuration needed. Default unit tests are sufficient.
Pre-submit Checklist
- `test_streaming_dataset_removes_columns_correctly` - Verifies column removal
- `test_streaming_dataset_handles_missing_record_column` - Verifies graceful handling

Additional Notes
Future Work: This workaround should be removed once Arc Institute fixes OpenGenome2 schema consistency across all shards. When all shards have identical columns, `dataset.column_names` will work correctly for streaming datasets.

Validation: This fix was validated on actual SLURM training runs with OpenGenome2, where the original bug was discovered.
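As a rough aid for deciding when the workaround can be dropped, the sketch below collects the distinct column sets seen in a sample of examples. The function name `distinct_schemas` is hypothetical and not part of this PR. A sample can only reveal an inconsistency; it cannot prove that every shard agrees.

```python
from typing import FrozenSet, Iterable, Mapping, Set


def distinct_schemas(examples: Iterable[Mapping[str, object]]) -> Set[FrozenSet[str]]:
    """Return the set of distinct column-name sets seen across the examples."""
    return {frozenset(example.keys()) for example in examples}


# Toy records mimicking the two shard layouts described in this PR.
sample = [
    {"text": "ACGT", "record": "r1"},
    {"text": "GGCA"},
]
schemas = distinct_schemas(sample)
is_consistent = len(schemas) == 1
```

In practice the examples would come from iterating over a bounded number of records from the streaming dataset; more than one distinct schema means the explicit column list is still needed.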