Iterable Dataset #2852
base: impl-step-based-ckpt
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2852
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Cancelled Jobs
As of commit f89eefe with merge base 3d73591:
NEW FAILURE - The following job has failed:
CANCELLED JOBS - The following jobs were cancelled. Please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -94,3 +95,72 @@ def slimorca_dataset(
            )
        return PackedDataset(ds, max_seq_len=tokenizer.max_seq_len)
    return ds


def slimorca_iterable_dataset(
Added here to demonstrate the datamix iterable dataset with this example. Personally, I dislike exposing all of the args and defaults. I would prefer to expose only what's specific to this builder.
torchtune/datasets/_interleaved.py
Outdated
logger.warning(
    f"Child dataset {self._datasets[ds_name].dataset_name} was exhausted. "
    "This is unexpected for an infinite dataset. Re-initializing its iterator."
)
Not 100% sure I like this.
Let's do this: simply have a subclass for InfiniteIterable so this is super explicit.
Where did this one land? I don't see InfiniteIterable anywhere (personally I don't know enough yet to have a strong preference here, just wanna understand where things currently stand)
I made changes but didn't push them yet. I added a dummy class that does nothing:
class InfiniteTuneIterableDataset(TuneIterableDataset):
    """Abstract base class for infinite datasets, which yield samples indefinitely.

    Its only purpose is to make it explicit that the dataset is expected to be infinite, i.e.
    it never exhausts. This helps avoid the complexity caused by some rank hanging for
    lack of data.
    """

    pass
and replaced this logger.warning with raise ValueError.
I think it's better to have zero tolerance. Datasets that are not infinite need work to make sure no rank hangs.
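A minimal sketch of the "zero tolerance" behavior described above, assuming an interleaver that samples child datasets by weight. The names `CyclingDataset`, `FiniteDataset`, and `interleave` are hypothetical stand-ins for illustration and are not the classes in this PR:

```python
import random
from typing import Dict, Iterable, Iterator, List, Tuple


class CyclingDataset:
    """Hypothetical infinite child dataset: cycles over a fixed list forever."""

    def __init__(self, items: List[int]) -> None:
        self._items = list(items)

    def __iter__(self) -> Iterator[int]:
        while True:
            yield from self._items


class FiniteDataset:
    """Hypothetical finite child dataset, used only to trigger the error path."""

    def __init__(self, items: List[int]) -> None:
        self._items = list(items)

    def __iter__(self) -> Iterator[int]:
        return iter(self._items)


def interleave(
    datasets: Dict[str, Iterable[int]], weights: Dict[str, float], seed: int = 0
) -> Iterator[Tuple[str, int]]:
    """Yield (dataset_name, sample) pairs, picking a child by weight each step."""
    rng = random.Random(seed)
    names = list(datasets)
    iterators = {name: iter(datasets[name]) for name in names}
    probs = [weights[name] for name in names]
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            sample = next(iterators[name])
        except StopIteration:
            # Fail loudly instead of silently re-initializing the iterator.
            raise ValueError(
                f"Child dataset {name} was exhausted, but it is expected to be infinite."
            ) from None
        yield name, sample
```

Raising here turns a silent restart into an immediate, debuggable failure, which is the trade-off argued for in the comment above.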
@@ -101,3 +102,64 @@ def alpaca_dataset(
    original Alpaca dataset, `yahma/alpaca-cleaned <https://huggingface.co/datasets/yahma/alpaca-cleaned>`_.
    See the dataset page and :func:`~torchtune.datasets.alpaca_dataset` for more details.
    """


def alpaca_iterable_dataset(
Added here to demonstrate the datamix iterable dataset with this example. Personally, I dislike exposing all of the args and defaults. I would prefer to expose only what's specific to this builder.
But you are doing this with `load_dataset_kwargs`, right? Or did you mean something else?
nit: it's a function, so... get_alpaca_iterable_dataset?
The get makes sense, but it's not the pattern we have in tune :/
Great PR! I mainly had a question on the interaction with packing and on the SFT transform
torchtune/datasets/_iterable_base.py
Outdated
from torch.utils.data import IterableDataset


class TuneIterableDataset(IterableDataset, ABC):
We need this guy to interact with packing and IIUC I don't believe this is currently happening?
The algo we should implement is this:
- One batch can be made of multiple calls to next. We keep taking until we exceed the max seq len. When we do, we put the last one aside (we'll use it to start the next batch), pad the current one to max len and return.
- The calls to next will go to the interleaved dataset, therefore we automatically construct mixed batches from multiple datasets without much effort
- Also, every time we call next we should make space for logging transforms (which we are, you already wrote them). I think it's ok to make your metrics transforms and aggregators an optional property here so the semantics are clearer
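The packing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the packing implementation from #2819: `pack_greedy` and `pad_id` are hypothetical names, samples are plain lists of token ids, and a single sample longer than `max_seq_len` is truncated (an assumption the comment does not cover).

```python
from typing import Dict, Iterable, Iterator, List, Union

Pack = Dict[str, Union[List[int], int]]


def pack_greedy(
    samples: Iterable[List[int]], max_seq_len: int, pad_id: int = 0
) -> Iterator[Pack]:
    """Greedily concatenate samples into fixed-length packs.

    Keeps taking samples until the next one would exceed max_seq_len, then
    pads and yields the current pack and starts the next pack with the
    sample that was set aside.
    """
    buffer: List[int] = []
    for tokens in samples:
        tokens = tokens[:max_seq_len]  # assumption: truncate oversized samples
        if buffer and len(buffer) + len(tokens) > max_seq_len:
            num_pad = max_seq_len - len(buffer)
            yield {"tokens": buffer + [pad_id] * num_pad, "num_pad": num_pad}
            buffer = []
        buffer = buffer + tokens
    # Only reached for finite inputs; an infinite stream never flushes here.
    if buffer:
        num_pad = max_seq_len - len(buffer)
        yield {"tokens": buffer + [pad_id] * num_pad, "num_pad": num_pad}
```

If `samples` is the interleaved dataset's iterator, each pack mixes sources automatically, which is the second point in the list above.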
we have packing here: #2819
…htune into iterable_dataset_final
I read every line of this PR. (Kidding but I tried to at least look at most of the important stuff.) Thanks for taking on this massive set of changes, I think the dataset classes are a big improvement
self.new_metric(
    name="tokens_seen", value=token_len, agg_type=AggregationType.SUM
),
self.new_metric(
    name="seq_len", value=token_len, agg_type=AggregationType.DISTRIBUTION
),
A minor thing, but to me metrics having the same value but different aggregation types should not actually be represented as distinct metrics. Like I should be able to just define how a metric is computed for a given sample, then separately choose different types of aggregation as needed
Hmm, so agg_type being a List[AggregationType]?
self.new_metric(
    name="tokens_seen", value=token_len,
    agg_type=[AggregationType.SUM, AggregationType.MEAN]
),
I don't know if the extra complexity is worth it. Adding two metrics is cheap. wdyt?
Today the user can just do:
self.new_metric(
    name="tokens_seen_sum", value=token_len,
    agg_type=AggregationType.SUM
),
self.new_metric(
    name="tokens_seen_mean", value=token_len,
    agg_type=AggregationType.MEAN
),
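For comparison, here is a rough sketch of the alternative the reviewer describes: record a metric value once per sample and choose the aggregations separately. `SimpleAggregator` and this small `AggregationType` enum are hypothetical stand-ins written for this example and are not the PR's classes.

```python
from collections import defaultdict
from enum import Enum
from statistics import mean
from typing import Callable, Dict, List, Sequence


class AggregationType(Enum):
    """Stand-in enum for this sketch only."""

    SUM = "sum"
    MEAN = "mean"
    MAX = "max"


_REDUCERS: Dict[AggregationType, Callable[[Sequence[float]], float]] = {
    AggregationType.SUM: sum,
    AggregationType.MEAN: mean,
    AggregationType.MAX: max,
}


class SimpleAggregator:
    """Stores raw per-sample values and applies the configured aggregations."""

    def __init__(self, agg_types: Dict[str, List[AggregationType]]) -> None:
        self._agg_types = agg_types
        self._values: Dict[str, List[float]] = defaultdict(list)

    def update(self, name: str, value: float) -> None:
        """Record one per-sample value for the metric `name`."""
        self._values[name].append(value)

    def compute(self) -> Dict[str, float]:
        """Return one entry per (metric, aggregation) pair, e.g. tokens_seen_sum."""
        results: Dict[str, float] = {}
        for name, values in self._values.items():
            for agg in self._agg_types.get(name, []):
                results[f"{name}_{agg.value}"] = _REDUCERS[agg](values)
        return results
```

The cost of this design is that raw values must be kept (or each reducer needs running state), which is part of the "extra complexity" weighed in the reply above.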
from torchtune.data.metrics._metric_transform import AggregationType, Metric


class MetricsAggregator:
A high level comment: the relationship between this and the agg handlers is not super clear to me. It seems like we are using a registry pattern where the handlers are responsible for defining the actual aggregation logic. But then the all-gather happens in here. (Separately I stand by my claim that it would be better to hold off on more complex cases like distribution aggregators so as not to boil the ocean here.)
> the handlers are responsible for defining the actual aggregation logic. But then the all-gather happens in here.

Why is that a contradiction?
- The MetricsAggregator calls handler.finalize_local_agg
- Then it does a single all_gather to get the results from all ranks for all metrics
- Then it calls handler._finalize_dist_agg([aggregated_results_per_rank]*n_ranks)
Do you want to suggest a different way of doing it? Or is it hard to spot this pattern in the code?
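The three-step pattern above can be sketched without a distributed backend by simulating the ranks in one process. `MeanHandler` and `fake_all_gather` are hypothetical names for this illustration; in a real job the gather step would be a collective such as `torch.distributed.all_gather_object`, which is an assumption about the implementation rather than something shown in this thread.

```python
from typing import Dict, List


class MeanHandler:
    """Hypothetical handler that owns the aggregation logic for a mean metric."""

    def __init__(self) -> None:
        self._total = 0.0
        self._count = 0

    def update(self, value: float) -> None:
        self._total += value
        self._count += 1

    def finalize_local_agg(self) -> Dict[str, float]:
        """Step 1: reduce this rank's samples to a small summary."""
        return {"total": self._total, "count": float(self._count)}

    @staticmethod
    def finalize_dist_agg(per_rank: List[Dict[str, float]]) -> float:
        """Step 3: merge the gathered per-rank summaries into a global value."""
        total = sum(r["total"] for r in per_rank)
        count = sum(r["count"] for r in per_rank)
        return total / count if count else 0.0


def fake_all_gather(local_results: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Step 2 stand-in: every rank ends up with the list of all local summaries."""
    return list(local_results)


def aggregate(values_per_rank: List[List[float]]) -> float:
    """Aggregator role: drive the handlers and perform the single gather."""
    handlers = []
    for values in values_per_rank:
        handler = MeanHandler()
        for value in values:
            handler.update(value)
        handlers.append(handler)
    gathered = fake_all_gather([h.finalize_local_agg() for h in handlers])
    return MeanHandler.finalize_dist_agg(gathered)
```

Note that the global mean is computed from totals and counts, not by averaging per-rank means, so ranks with different sample counts are weighted correctly.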
> my claim that it would be better to hold off on more complex cases like distribution aggregators

If we don't do aggregation across ranks, we wouldn't be able to count things like "tokens_seen", right? :/
Or do you mean that we should delete DistributionAggHandler? To clarify, this distribution has nothing to do with multiple GPUs. It's just stats, e.g. std, percentiles, max, min, etc. Maybe I should rename it if it's causing confusion.
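To make "distribution" concrete, here is a small sketch of the kind of summary statistics meant above, using only the standard library. `distribution_stats` is a hypothetical helper, and which statistics the real DistributionAggHandler reports is not shown in this thread.

```python
import statistics
from typing import Dict, Sequence


def distribution_stats(values: Sequence[float]) -> Dict[str, float]:
    """Summarize a list of per-sample values (e.g. sequence lengths)."""
    if len(values) < 2:
        raise ValueError("Need at least two values to compute distribution stats.")
    # 99 cut points; index i - 1 holds the i-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "p50": cuts[49],
        "p95": cuts[94],
    }
```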
recipes/full_finetune_distributed.py
Outdated
if cfg.get("dataset_val") is not None:
    raise NotImplementedError(
        "Validation is not supported yet with iterable datasets."
    )
Is there a specific technical reason here? Or have we just not gotten to it yet?
Validation datasets are not infinite!!! We need to figure out how to solve this one, but it won't be in this PR.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@                Coverage Diff                @@
##   impl-step-based-ckpt    #2852    +/-   ##
=======================================================
  Coverage     ?    60.64%
=======================================================
  Files        ?    449
  Lines        ?    28224
  Branches     ?    0
=======================================================
  Hits         ?    17116
  Misses       ?    11108
  Partials     ?    0
☔ View full report in Codecov by Sentry.
# HuggingFace datasets bug where .map() causes incorrect checkpoint resumption.
# See: https://github.com/huggingface/datasets/issues/7630
# This ensures transforms are applied fresh on each sample during iteration.
sample = self._apply_transforms(sample)
Applying transformations inside the dataset before returning every sample would mean no possibility of parallelizing them (either within every dataset or across datasets). Is that expected?
Synced offline. I had the wrong impression. We shard the dataset so if a user turned on num_workers>0 it would lead to multiple processes all reading the same dataset but different shards of it; and so apply transformations in each process.
# Shuffle the dataset
if self._shuffle_buffer_size and self._shuffle_buffer_size > 0:
    ds = ds.shuffle(seed=self._seed, buffer_size=self._shuffle_buffer_size)
Would it be better if we shuffled before sharding?
I believe the sharding happens when we call .to_iterable_dataset. If we shuffle after split_by_node, then I guess the shuffle would only happen inside each node, and not across nodes.
e.g.
shuffling before:
[0,1,2,3,4,5] -> [4,1,0,5,3,2]
shuffling after:
[0,1], [2,3], [4,5] -> [1,0], [2,3], [5,4]
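The difference in the example above can be reproduced with plain lists. This sketch only illustrates the ordering argument; it does not model how Hugging Face `datasets` shards or shuffles internally, and the helper names are hypothetical.

```python
import random
from typing import List, Sequence


def shard(items: Sequence[int], num_shards: int) -> List[List[int]]:
    """Split items into contiguous shards (assumes len(items) % num_shards == 0)."""
    size = len(items) // num_shards
    return [list(items[i * size : (i + 1) * size]) for i in range(num_shards)]


def shuffle_then_shard(items: Sequence[int], num_shards: int, seed: int) -> List[List[int]]:
    """Global shuffle first: any item can land in any shard."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shard(shuffled, num_shards)


def shard_then_shuffle(items: Sequence[int], num_shards: int, seed: int) -> List[List[int]]:
    """Shard first: items only move around inside their own shard."""
    rng = random.Random(seed)
    shards = shard(items, num_shards)
    for one_shard in shards:
        rng.shuffle(one_shard)
    return shards
```

With `shard_then_shuffle`, the set of items in each shard never changes between epochs or seeds, which is why the shuffle is weaker when shards are small.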
Right. So we should shuffle first, then shard with .to_iterable_dataset call, then split_by_node.
Hmm, why not this?
1. .to_iterable_dataset(num_shards)
2. shuffle
3. split_by_node
Sharding happens at step 1. Shouldn't we shuffle after sharding?
btw, thanks for helping me double check this
We can shuffle after sharding, but then the shuffle will only be within the shard (unless HF's ds.shuffle is doing something non-obvious here). The example you had above clearly depicts that. The smaller the sample size, the less optimal the shuffle. But obviously this is something that can be tweaked based on perf.
> We can shuffle after sharding, but then the shuffle will only be within the shard (

Oh, I see. I guess the shuffle will still be across all shards, because the shards are not assigned to any rank yet. We just need to make sure that every rank uses the same seed. The issue is if we shuffle after split_by_node. But I need to double check that in their docs/forum. The last time I looked at that was a few weeks ago.
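A toy illustration of why every rank needs the same seed when the shuffle happens before the per-rank split. `rank_view` is a hypothetical helper operating on a plain list, and the strided split is an assumption made for the example, not a description of `split_by_node`.

```python
import random
from typing import List, Sequence


def rank_view(items: Sequence[int], rank: int, world_size: int, seed: int) -> List[int]:
    """Shuffle the full order with `seed`, then keep this rank's strided slice."""
    order = list(items)
    random.Random(seed).shuffle(order)
    return order[rank::world_size]


# Same seed on every rank: the slices are disjoint and together cover the data.
views = [rank_view(range(8), rank, world_size=2, seed=42) for rank in range(2)]
covered = sorted(views[0] + views[1])
```

If the ranks used different seeds, each would slice a different permutation, so some samples could be seen twice and others not at all.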
Context
What is the purpose of this PR? Is it to
Enable Iterable datasets in torchtune.
CONTEXT: built on top of ongoing PR step-based-ckpt: #2384
Tips when reviewing this PR
Follow this order:
torchtune/datasets/_hf_iterable.py
Changelog
Config and builder design based on the discussions after this RFC: #2785
Next steps:
7. Gather feedback on metric logging. E.g. we can add more aggregation types.
8. Polish the code a little bit
9. Add packing from this RFC: #2819
10. Add curriculum learning
11. Docs?
Test plan
UNTESTED: resume from ckpt in the recipe. However, we have plenty of tests showing that resuming works for these iterable datasets.