Replies: 12 comments
@Serjio42 Use IterableDataset; refer to https://huggingface.co/docs/datasets/en/about_mapstyle_vs_iterable
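A minimal sketch of what that looks like (the dataset name is a placeholder for your own data):

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset: samples are fetched on demand
# instead of being materialized in RAM. "user/my-vision-dataset" is a
# placeholder for your own dataset.
dataset = load_dataset("user/my-vision-dataset", split="train", streaming=True)

for sample in dataset:  # iterates lazily, one sample at a time
    print(sample.keys())
    break
```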
@rupaut98 When I use map-style datasets (of comparable or larger size) in pure PyTorch with CNNs, everything is fine. It looks like a correctly initialized map-style dataset would work here too.
@Serjio42 For now, you can only use a dataset where each sample contains both a text and an image. A sample can't be text-only or image-only (basically, no mixed datasets).
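For illustration, a valid sample in the notebooks' conversation format looks roughly like this (the instruction and caption strings are placeholders):

```python
from PIL import Image

pil_image = Image.new("RGB", (64, 64))  # stands in for a real image

# One valid sample: the user turn must contain both a text part
# and an image part.
sample = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image", "image": pil_image},
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "A caption for the image."}],
        },
    ]
}
```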
@rupaut98 I know that.
You're looking for streaming datasets. I don't think they are currently implemented.
@Oseltamivir Why streaming? I just want an ordinary map-style dataset that loads samples (images) from disk by the provided index instead of storing all the samples in RAM simultaneously, like an ordinary torch.utils.data.Dataset does.
Oh, I assumed you were looking for this, which is what streaming does.
@Oseltamivir No, streaming means the data arrives continuously or is too large to fit on disk or in memory; that case is indeed usually handled by an IterableDataset.
Oh, I see. Then you'll need to write your own dataset class. Maybe something like the following; up to you.
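A minimal sketch of the idea (the field names, default instruction, and conversation shape are assumptions borrowed from the notebooks' convert_to_conversation):

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class LazyVisionDataset(Dataset):
    """Map-style dataset: each image is opened from disk on access,
    so only the samples in the current batch live in RAM."""

    def __init__(self, image_paths, captions, instruction="Describe this image."):
        assert len(image_paths) == len(captions)
        self.image_paths = [Path(p) for p in image_paths]
        self.captions = list(captions)
        self.instruction = instruction

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Lazy load: the image is read here, not in __init__.
        image = Image.open(self.image_paths[idx]).convert("RGB")
        # Same conversation shape that convert_to_conversation produces.
        return {
            "messages": [
                {"role": "user",
                 "content": [
                     {"type": "text", "text": self.instruction},
                     {"type": "image", "image": image},
                 ]},
                {"role": "assistant",
                 "content": [{"type": "text", "text": self.captions[idx]}]},
            ]
        }
```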
@Oseltamivir That is just a guess, and mine is the same. I am just not sure that Unsloth uses the same interface.
Unsloth's trainer inherits from HF's SFTTrainer, so it should. You can try experimenting with it and submit your notebook to unslothai/notebooks if @shimmyshimmer thinks the example is appropriate.
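An untested sketch of the wiring (the UnslothVisionDataCollator import and the SFTConfig vision flags follow the notebooks; model, tokenizer, image_paths, and captions are assumed from your own setup, and LazyVisionDataset is the sketch above):

```python
from trl import SFTConfig, SFTTrainer
from unsloth.trainer import UnslothVisionDataCollator

# Assumptions: model/tokenizer come from FastVisionModel.from_pretrained(...)
# as in the notebooks; image_paths and captions are your own lists.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=LazyVisionDataset(image_paths, captions),  # lazy, not a list
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=30,
        learning_rate=2e-4,
        output_dir="outputs",
        # Required for vision finetuning, per the notebooks:
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=2048,
    ),
)
trainer.train()
```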
Hi Team,
I am reproducing the Qwen2-VL finetuning notebooks from the documentation (1, 2, 3) on my own image dataset and I am running into a problem:
when the dataset grows to 20-30k images, all my RAM fills up (training even crashes once the dataset exceeds a critical size). It also depends on the image size (the bigger the images, the fewer fit in my RAM).
I think the issue is connected to this line:
`converted_dataset = [convert_to_conversation(sample) for sample in dataset]`, where the whole dataset is materialized as a list of samples and then passed to the trainer's constructor.
How can I create a dataset with my images that reads them from disk as files when they are needed, instead of storing all the images in RAM?
Thanks.