
Iterable Dataset #2852


Open
wants to merge 27 commits into base: impl-step-based-ckpt

Conversation

@felipemello1 (Contributor) commented on Jun 26, 2025

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Enable Iterable datasets in torchtune.

CONTEXT: built on top of ongoing PR step-based-ckpt: #2384

Tips when reviewing this PR

Follow this order:

  1. recipes/configs/llama3_2/3B_full.yaml: see the configs
  2. torchtune/datasets/_iterable_base.py: base class for iterable dataset
  3. torchtune/datasets/_hf_iterable.py: dataset built on HF -- can be replaced easily; downstream code does not expect HF.
  4. torchtune/datasets/_interleaved.py: interleave the datasets
  5. torchtune/data/_metrics.py: metrics transform to create the metrics
  6. torchtune/data/_aggregator.py: aggregate the metrics at the recipe level
  7. recipes/full_finetune_distributed.py: everything put together
  8. unit tests


Changelog

  1. Datasets are infinite (see the base-class sketch below)
  2. Users no longer define epochs, but training steps (how many times we update the optimizer)
  3. Support for dataset mixing -- follow-up PRs will enable curriculum learning
  4. Support for dataset metric logging -- users can track epochs per dataset, the distribution of token lengths, etc. It is easy to add new metrics.
  5. HF agnostic: even though the current dataset is HF-based, the dataloader, packing, data mixing, and metric logging are agnostic to it
  6. Well tested in a distributed setting -- WARNING: the multiprocess dataloader needs better testing. It doesn't guarantee determinism, so I postponed testing this setting
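To make the shared contract concrete, here is a minimal sketch of what the base classes in torchtune/datasets/_iterable_base.py might look like. Only TuneIterableDataset, InfiniteTuneIterableDataset, and dataset_name appear elsewhere in this thread; the checkpointing hooks and exact signatures below are illustrative assumptions, not the PR's actual API.

from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator

from torch.utils.data import IterableDataset


class TuneIterableDataset(IterableDataset, ABC):
    """Base contract for torchtune iterable datasets (sketch)."""

    @property
    @abstractmethod
    def dataset_name(self) -> str:
        """Unique name, used to namespace metrics and checkpoint state."""

    @abstractmethod
    def __iter__(self) -> Iterator[Dict[str, Any]]:
        """Yield tokenized samples one at a time."""

    # Assumed API: hooks so dataloader state can be checkpointed and resumed.
    @abstractmethod
    def state_dict(self) -> Dict[str, Any]:
        ...

    @abstractmethod
    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        ...


class InfiniteTuneIterableDataset(TuneIterableDataset):
    """Marker base class: the dataset is expected to never exhaust."""

    pass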

Config and builder design based on the discussions after this RFC: #2785

Next steps:
1. Gather feedback on metric logging. E.g. we can add more aggregation types.
2. Polish the code a little bit
3. Add packing from this RFC: #2819
4. Add curriculum learning
5. Docs?

Test plan


UNTESTED: resume from ckpt in the recipe. However, we have plenty of tests showing that resuming works for these iterable datasets.


pytorch-bot bot commented Jun 26, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2852

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Cancelled Jobs

As of commit f89eefe with merge base 3d73591:

NEW FAILURE - The following job has failed:

  • GPU tests / gpu_test (3.11, stable) (gh)
    tests/recipes/test_qat_lora_finetune_distributed.py::TestQATLoRAFinetuneDistributedRecipe::test_training_state_on_resume_with_async_checkpointing[llama3/8B_qat_lora-llama3-tune-False]

CANCELLED JOBS - The following jobs were cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jun 26, 2025
@felipemello1 changed the title from "first commit" to "Iterable Dataset" on Jun 26, 2025
@@ -94,3 +95,72 @@ def slimorca_dataset(
)
return PackedDataset(ds, max_seq_len=tokenizer.max_seq_len)
return ds


def slimorca_iterable_dataset(
Contributor Author (@felipemello1):

Added here to demonstrate the datamix iterable dataset with this example. Personally, I dislike exposing all of the args and defaults. I would prefer to expose only what's specific to this builder.

Comment on lines 101 to 104
logger.warning(
    f"Child dataset {self._datasets[ds_name].dataset_name} was exhausted. "
    "This is unexpected for an infinite dataset. Re-initializing its iterator."
)
Contributor Author (@felipemello1):

not 100% sure I like this

Contributor (@Darktex):

Let's do this: simply have a subclass for InfiniteIterable so this is super explicit

Contributor:

Where did this one land? I don't see InfiniteIterable anywhere (personally I don't know enough yet to have a strong preference here, just wanna understand where things currently stand)

Contributor Author (@felipemello1):

I made changes but didn't push them yet. I added a dummy class that does nothing:

class InfiniteTuneIterableDataset(TuneIterableDataset):
    """Abstract base class for infinite datasets, which yield samples indefinitely.
    Its only purpose is to make it explicit that the dataset is expected to be infinite, i.e.
    it never exhausts. This is helpful to avoid complexity due to some rank hanging because
    of lack of data."""
    pass

and replaced this logger.warning with raise ValueError.

I think it's better to have zero tolerance. Datasets that are not infinite need work to make sure no rank hangs.
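For concreteness, the zero-tolerance fetch could look roughly like the sketch below; the function shape and the child_iters/ds_name names are illustrative assumptions, and only the error message mirrors the warning quoted above.

from typing import Any, Dict, Iterator


def next_sample(child_iters: Dict[str, Iterator[Any]], ds_name: str) -> Any:
    """Fetch the next sample from a child dataset, failing loudly if it exhausts."""
    try:
        return next(child_iters[ds_name])
    except StopIteration:
        # An exhausted child means one rank runs out of data and can hang the
        # distributed job, so raise instead of silently re-initializing.
        raise ValueError(
            f"Child dataset {ds_name} was exhausted. "
            "This is unexpected for an infinite dataset."
        ) from None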

@@ -101,3 +102,64 @@ def alpaca_dataset(
original Alpaca dataset, `yahma/alpaca-cleaned <https://huggingface.co/datasets/yahma/alpaca-cleaned>`_.
See the dataset page and :func:`~torchtune.datasets.alpaca_dataset` for more details.
"""


def alpaca_iterable_dataset(
Contributor Author (@felipemello1):

Added here to demonstrate the datamix iterable dataset with this example. Personally, I dislike exposing all of the args and defaults. I would prefer to expose only what's specific to this builder.

Contributor (@Darktex):

But you are doing this with `load_dataset_kwargs`, right? Or did you mean something else?

Contributor:

nit: it's a function, so... get_alpaca_iterable_dataset?

Contributor Author (@felipemello1):

the get makes sense, but it's not the pattern we have in tune :/

@Darktex (Contributor) left a comment:

Great PR! I mainly had a question on the interaction with packing and on the SFT transform


from torch.utils.data import IterableDataset


class TuneIterableDataset(IterableDataset, ABC):
Contributor (@Darktex):

We need this guy to interact with packing and IIUC I don't believe this is currently happening?

The algo we should implement is this:

  1. One batch can be made of multiple calls to next. We keep taking until we exceed the max seq len. When we do, we put the last one aside (we'll use it to start the next batch), pad the current one to max len and return (a rough sketch of this loop follows below).
  2. The calls to next will go to the interleaved dataset, therefore we automatically construct mixed batches from multiple datasets without much effort
  3. Also, every time we call next we should make space for logging transforms (which we are, you already wrote them). I think it's ok to make your metrics transforms and aggregators an optional property here so the semantics are clearer
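A minimal sketch of that greedy packing loop, purely to illustrate the proposal; the function name, the "tokens" key, max_seq_len, and pad_id are illustrative assumptions (the actual packing work lives in RFC #2819).

from typing import Any, Dict, Iterator, List


def pack_samples(
    sample_iter: Iterator[Dict[str, Any]], max_seq_len: int, pad_id: int
) -> Iterator[List[int]]:
    """Greedily pack token sequences until the next sample would overflow max_seq_len.

    sample_iter must be a (possibly infinite) iterator of dicts with a "tokens" list;
    samples longer than max_seq_len are not handled in this sketch.
    """
    leftover: List[int] = []
    while True:
        pack: List[int] = leftover  # start from the sample set aside last time
        leftover = []
        for sample in sample_iter:
            tokens = sample["tokens"]
            if len(pack) + len(tokens) > max_seq_len:
                leftover = tokens  # put this one aside to start the next pack
                break
            pack.extend(tokens)
        else:
            # Iterator exhausted (only relevant for finite datasets).
            if not pack:
                return
        yield pack + [pad_id] * (max_seq_len - len(pack))  # pad to max_seq_len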

Contributor Author (@felipemello1):

we have packing here: #2819

@ebsmothers (Contributor) left a comment:

I read every line of this PR. (Kidding but I tried to at least look at most of the important stuff.) Thanks for taking on this massive set of changes, I think the dataset classes are a big improvement

Comment on lines 127 to 132
self.new_metric(
    name="tokens_seen", value=token_len, agg_type=AggregationType.SUM
),
self.new_metric(
    name="seq_len", value=token_len, agg_type=AggregationType.DISTRIBUTION
),
Contributor (@ebsmothers):

A minor thing, but to me metrics having the same value but different aggregation types should not actually be represented as distinct metrics. Like I should be able to just define how a metric is computed for a given sample, then separately choose different types of aggregation as needed

Contributor Author @felipemello1 commented on Jul 4, 2025:

hmm, so agg_type being a List[AggregationType]?

self.new_metric(
    name="tokens_seen", value=token_len,
    agg_type=[AggregationType.SUM, AggregationType.MEAN]
),

I don't know if the extra complexity is worth it. Adding two metrics is cheap. wdyt?

Today the user can just do:
self.new_metric(
    name="tokens_seen_sum", value=token_len,
    agg_type=AggregationType.SUM
),
self.new_metric(
    name="tokens_seen_mean", value=token_len,
    agg_type=AggregationType.MEAN
),

from torchtune.data.metrics._metric_transform import AggregationType, Metric


class MetricsAggregator:
Contributor (@ebsmothers):

A high level comment: the relationship between this and the agg handlers is not super clear to me. It seems like we are using a registry pattern where the handlers are responsible for defining the actual aggregation logic. But then the all-gather happens in here. (Separately I stand by my claim that it would be better to hold off on more complex cases like distribution aggregators so as not to boil the ocean here.)

Contributor Author (@felipemello1):

the handlers are responsible for defining the actual aggregation logic. But then the all-gather happens in here.

why is that a contradiction?

  1. The MetricsAggregator calls the handler.finalize_local_agg
  2. then does a single all_gather to get the results from all ranks for all metrics
  3. Then calls handler._finalize_dist_agg([aggregated_results_per_rank]*n_ranks)

Do you wanna suggest a different way of doing it? Or is it hard to spot this pattern in the code?
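To make that flow concrete, here is a sketch of the three steps; the handler method names follow the comment above, while the metric_states structure and the use of torch.distributed.all_gather_object are illustrative assumptions.

import torch.distributed as dist


def aggregate_metrics(metric_states: dict) -> dict:
    """Two-phase aggregation: local reduction per rank, then one all-gather."""
    # 1. Each handler reduces its own local state (sums, counts, percentiles, ...).
    local_results = {
        name: handler.finalize_local_agg(state)
        for name, (handler, state) in metric_states.items()
    }

    # 2. A single all-gather collects the per-rank results for all metrics.
    if dist.is_available() and dist.is_initialized():
        gathered = [None] * dist.get_world_size()
        dist.all_gather_object(gathered, local_results)
    else:
        gathered = [local_results]  # single-process fallback

    # 3. Each handler merges the per-rank results into the final metric value.
    return {
        name: handler._finalize_dist_agg([per_rank[name] for per_rank in gathered])
        for name, (handler, _) in metric_states.items()
    }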

Contributor Author @felipemello1 commented on Jul 4, 2025:

my claim that it would be better to hold off on more complex cases like distribution aggregators

If we don't do aggregation across ranks, we wouldn't be able to count things like "tokens_seen", right? :/

Or do you mean that we should delete DistributionAggHandler? To clarify, this distribution has nothing to do with multiple GPUs. It's just stats, e.g. std, percentiles, max, min, etc. Maybe I should rename it if it's causing confusion.

Comment on lines 312 to 315
if cfg.get("dataset_val") is not None:
raise NotImplementedError(
"Validation is not supported yet with iterable datasets."
)
Contributor (@ebsmothers):

Is there a specific technical reason here? Or we just haven't gotten to it yet

Contributor Author (@felipemello1):

validation datasets are not infinite!!! Need to figure out how to solve this one, but it won't be on this PR

@codecov-commenter commented on Jul 7, 2025

Codecov Report

Attention: Patch coverage is 74.18831% with 318 lines in your changes missing coverage. Please review.

Please upload report for BASE (impl-step-based-ckpt@54a48bb). Learn more about missing BASE report.

Files with missing lines | Patch % | Missing lines
recipes/full_finetune_distributed.py | 0.00% | 92 ⚠️
tests/torchtune/datasets/test_interleaved.py | 78.18% | 53 ⚠️
tests/torchtune/data/test_metrics_aggregator.py | 71.42% | 44 ⚠️
tests/torchtune/datasets/test_hf_iterable.py | 73.75% | 37 ⚠️
torchtune/data/metrics/_metric_agg_handlers.py | 84.45% | 23 ⚠️
torchtune/data/metrics/_metric_aggregator.py | 77.66% | 23 ⚠️
torchtune/datasets/_hf_iterable.py | 84.76% | 16 ⚠️
torchtune/datasets/_sft.py | 33.33% | 16 ⚠️
torchtune/datasets/_iterable_base.py | 88.88% | 4 ⚠️
...htune/training/checkpointing/_checkpoint_client.py | 0.00% | 3 ⚠️
... and 4 more
Additional details and impacted files
@@                   Coverage Diff                   @@
##             impl-step-based-ckpt    #2852   +/-   ##
=======================================================
  Coverage                        ?   60.64%           
=======================================================
  Files                           ?      449           
  Lines                           ?    28224           
  Branches                        ?        0           
=======================================================
  Hits                            ?    17116           
  Misses                          ?    11108           
  Partials                        ?        0           


# HuggingFace datasets bug where .map() causes incorrect checkpoint resumption.
# See: https://github.com/huggingface/datasets/issues/7630
# This ensures transforms are applied fresh on each sample during iteration.
sample = self._apply_transforms(sample)


Applying transformations inside the dataset before returning every sample would mean no possibility of parallelizing them (either within every dataset or across datasets). Is that expected?


Synced offline. I had the wrong impression. We shard the dataset, so if a user turned on num_workers>0 it would lead to multiple processes all reading the same dataset but different shards of it, and so transformations are applied in each process.

Comment on lines +203 to +205
# Shuffle the dataset
if self._shuffle_buffer_size and self._shuffle_buffer_size > 0:
    ds = ds.shuffle(seed=self._seed, buffer_size=self._shuffle_buffer_size)


Would it be better if we shuffled before sharding?

Contributor Author @felipemello1 commented on Jul 7, 2025:

I believe the sharding happens when we call .to_iterable_dataset. If we shuffle after split_by_node, then I guess the shuffle would only happen inside the node, and not across nodes.

e.g.
shuffling before:
[0,1,2,3,4,5] -> [4,1,0,5,3,2]

shuffling after:
[0,1], [2,3], [4,5] -> [1,0], [2,3], [5,4]


Right. So we should shuffle first, then shard with .to_iterable_dataset call, then split_by_node.

Contributor Author (@felipemello1):

hmmm, why not this?

  1. .to_iterable_dataset(num_shards)
  2. shuffle
  3. split_by_node

Sharding happens at 1. Shouldn't we shuffle after sharding?

Contributor Author @felipemello1 commented on Jul 8, 2025:

btw, thanks for helping me double check this


We can shuffle after sharding, but then the shuffle will only be within the shard (unless HF's ds.shuffle is doing something non-obvious here). The example you had above depicts that clearly. The smaller the sample size, the less optimal the shuffle. But obviously this is something that can be tweaked based on perf.

Contributor Author @felipemello1 commented on Jul 8, 2025:

We can shuffle after sharding, but then the shuffle will only be within the shard (

oh, I see. I guess the shuffle will still be across all shards, because the shards are not assigned to any rank yet. We just need to make sure that every rank uses the same seed. The issue is if we shuffle after split_by_node. But I need to double-check that in their docs/forum. Last time I saw that was a few weeks ago.
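For reference, the ordering discussed in this thread, sketched with the HF datasets calls mentioned above; the dataset name comes from the alpaca builder earlier in the PR, and the seed/buffer/shard values are illustrative assumptions.

from datasets import load_dataset
from datasets.distributed import split_by_node

seed, buffer_size, num_shards = 42, 1000, 64  # illustrative values
rank, world_size = 0, 8

ds = load_dataset("yahma/alpaca-cleaned", split="train")
ds = ds.to_iterable_dataset(num_shards=num_shards)         # 1. shard
ds = ds.shuffle(seed=seed, buffer_size=buffer_size)        # 2. shuffle; same seed on every rank
ds = split_by_node(ds, rank=rank, world_size=world_size)   # 3. assign shards to this rank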
