Replies: 12 comments
@Serjio42 Use IterableDataset; refer to https://huggingface.co/docs/datasets/en/about_mapstyle_vs_iterable
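A minimal sketch of what that looks like (the dataset name is a placeholder for your own data):

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset: samples are fetched on demand
# instead of being materialized in RAM. "user/my-vision-dataset" is a
# placeholder for your own dataset.
dataset = load_dataset("user/my-vision-dataset", split="train", streaming=True)

for sample in dataset:  # iterates lazily, one sample at a time
    print(sample.keys())
    break
```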
@rupaut98 When I use map-style datasets (of comparable or larger size) in pure PyTorch with CNNs, everything is fine. It looks like a correctly initialized map-style dataset would work here too.
@Serjio42 For now, you can only use a dataset where each sample contains both a text and an image. A sample can't be text-only or image-only (basically, no mixed datasets).
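For illustration, a valid sample in the notebooks' conversation format looks roughly like this (the instruction and caption strings are placeholders):

```python
from PIL import Image

pil_image = Image.new("RGB", (64, 64))  # stands in for a real image

# One valid sample: the user turn must contain both a text part
# and an image part.
sample = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image", "image": pil_image},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "A caption for the image."}],
        },
    ]
}
```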
@rupaut98 I know that.
You're looking for streaming datasets. I don't think they are currently implemented.
@Oseltamivir Why streaming? I just want an ordinary map-style dataset that loads samples (images) from disk by the provided index instead of storing all the samples in RAM simultaneously, like an ordinary torch.utils.data.Dataset does.
Oh, I assumed you were looking for this, which is what streaming does.
@Oseltamivir No, streaming means the data arrives continuously or is too large to fit on disk or in memory; that case is indeed usually handled by an IterableDataset.
Oh, I see. Then you'll need to write your own dataset class. Maybe something like the following; up to you.
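A minimal sketch of the idea (the field names, default instruction, and conversation shape are assumptions borrowed from the notebooks' convert_to_conversation):

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class LazyVisionDataset(Dataset):
    """Map-style dataset: each image is opened from disk on access,
    so only the samples in the current batch live in RAM."""

    def __init__(self, image_paths, captions, instruction="Describe this image."):
        assert len(image_paths) == len(captions)
        self.image_paths = [Path(p) for p in image_paths]
        self.captions = list(captions)
        self.instruction = instruction

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Lazy load: the image is read here, not in __init__.
        image = Image.open(self.image_paths[idx]).convert("RGB")
        # Same conversation shape that convert_to_conversation produces.
        return {
            "messages": [
                {"role": "user",
                 "content": [
                     {"type": "text", "text": self.instruction},
                     {"type": "image", "image": image},
                 ]},
                {"role": "assistant",
                 "content": [{"type": "text", "text": self.captions[idx]}]},
            ]
        }
```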
@Oseltamivir That is just a guess, and mine is the same. I am just not sure that Unsloth uses the same interface.
Unsloth's trainer inherits from HF's SFTTrainer, so it should. You can try experimenting with it and submit your notebook to unslothai/notebooks if @shimmyshimmer thinks the example is appropriate.
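An untested sketch of the wiring (the UnslothVisionDataCollator import and the SFTConfig vision flags follow the notebooks; model, tokenizer, image_paths, and captions are assumed from your own setup, and LazyVisionDataset is the sketch above):

```python
from trl import SFTConfig, SFTTrainer
from unsloth.trainer import UnslothVisionDataCollator

# Assumptions: model/tokenizer come from FastVisionModel.from_pretrained(...)
# as in the notebooks; image_paths and captions are your own lists.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=LazyVisionDataset(image_paths, captions),  # lazy, not a list
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=30,
        learning_rate=2e-4,
        output_dir="outputs",
        # Required for vision finetuning, per the notebooks:
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=2048,
    ),
)
trainer.train()
```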
Hi Team,
I am reproducing the Qwen2-VL finetuning notebooks from the documentation (1, 2, 3) on my own image dataset and I am running into a problem:
when the dataset grows to 20-30k images, all my RAM fills up (training even crashes once the dataset exceeds a critical size). It also depends on the image size (the bigger the images, the fewer fit in my RAM).
I think the issue is connected to this line:
`converted_dataset = [convert_to_conversation(sample) for sample in dataset]`, where the whole dataset is materialized as a list of samples and then passed to the trainer's constructor.
How can I create a dataset with my images that reads them from disk as files when they are needed, instead of storing all the images in RAM?
Thanks.