[QEff. Finetune] Updated handling of custom dataset in FT. Updated finetune.md readme file. #520


Draft: wants to merge 5 commits into `main`
Conversation

**quic-meetkuma** (Contributor)
No description provided.

@quic-meetkuma quic-meetkuma changed the title Updated handling of custom dataset in FT. Updated finetune.md readme file. [QEff. Finetune] Updated handling of custom dataset in FT. Updated finetune.md readme file. Jul 24, 2025
@quic-meetkuma quic-meetkuma marked this pull request as ready for review July 24, 2025 08:00
```python
logger.raise_error(
    "For 'custom_dataset', please provide dataset config file via 'custom_dataset_config' flag.",
    RuntimeError,
)
```
**quic-swatia** (Contributor) commented on Jul 28, 2025:
Why aren't the changes done in this method handled in `update_config()` instead?

**quic-meetkuma** (Author) replied:
Here, the `dataset_config` defined at L108 is not a dict but a dataclass. To update the fields of the existing dataclass, we need to convert it into a dict, add and update keys from the `custom_dataset_config` file, and then reconstruct the dataclass. That is what this code is doing.
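The round-trip the author describes (dataclass to dict, merge, back to dataclass) can be sketched roughly as follows. The `DatasetConfig` class, its fields, and the merge helper are hypothetical stand-ins, not the repository's actual definitions:

```python
import dataclasses
import json

@dataclasses.dataclass
class DatasetConfig:
    # Hypothetical stand-in for the dataset config dataclass
    train_split: str = "train"
    test_split: str = "test"

def update_dataset_config(config, custom_config_path):
    """Convert the dataclass to a dict, merge keys from the user's JSON
    file, and rebuild the dataclass, attaching any extra keys."""
    config_dict = dataclasses.asdict(config)
    with open(custom_config_path) as f:
        config_dict.update(json.load(f))
    # Rebuild using only the fields the dataclass declares...
    known = {f.name for f in dataclasses.fields(config)}
    new_config = type(config)(**{k: v for k, v in config_dict.items() if k in known})
    # ...and attach any extra user-supplied keys as plain attributes.
    for k, v in config_dict.items():
        if k not in known:
            setattr(new_config, k, v)
    return new_config
```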

4. **Passing Additional Configuration Parameters**
You can add custom arguments in `data_config.json`, which will be accessible via the `dataset_config` argument inside your `get_custom_dataset()` function.

5. **Example `data_config.json` File**
**Contributor** commented:
Since there's already a `sample_dataset_config.json`, we can remove the sample from here to keep the documentation crisp. Either of the two should be enough.

**quic-meetkuma** (Author) replied:
It is fine to keep it here, to avoid back-and-forth while reading the documentation.

Comment on lines +94 to +108
- Your preprocessing function must follow this structure:
```python
def get_custom_dataset(dataset_config, tokenizer, split, context_length=None):
    def apply_prompt_template(sample):
        # Apply prompt formatting to each datapoint
        ...

    def tokenize(sample):
        # Tokenize the formatted datapoint
        ...

    # Apply both functions to the dataset using map
    dataset = dataset.map(apply_prompt_template, ...)
    dataset = dataset.map(tokenize, ...)

    return dataset
```
**Contributor** commented:

The existing template looks better to me, as it briefly explains what each function does, followed by the function call/definition.

**quic-meetkuma** (Author) replied:
It is the same, but I will still refine it along with the other documentation points to make it more descriptive.

Comment on lines +135 to +141
6. **Implementing Custom Preprocessing Logic**
Within your dataset loader function, define `apply_prompt_template()` to manipulate raw data into desired prompt format, and `tokenize()` to convert it into token IDs using the tokenizer.

7. **Reference for Dataset Utilities**
You can refer to existing implementations in the [dataset directory of this repository](https://github.com/quic/efficient-transformers/tree/main/QEfficient/finetune/dataset).

---
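The two helpers named in point 6 could look like the following sketch. The column names (`question`, `answer`), the prompt format, and the tokenizer interface are illustrative assumptions, not the PR's actual implementation; a plain list of dicts stands in for a real dataset object and its `.map()` calls:

```python
def apply_prompt_template(sample):
    # Format one raw datapoint into a prompt string.
    # "question"/"answer" are hypothetical column names.
    sample["text"] = f"Question: {sample['question']}\nAnswer: {sample['answer']}"
    return sample

def tokenize(sample, tokenizer, context_length=None):
    # Convert the formatted prompt into token IDs. The tokenizer is
    # assumed to be callable and to return a dict with an "input_ids"
    # list, as Hugging Face tokenizers do.
    encoded = tokenizer(
        sample["text"],
        truncation=context_length is not None,
        max_length=context_length,
    )
    sample["input_ids"] = encoded["input_ids"]
    return sample

def get_custom_dataset(dataset_config, tokenizer, split, context_length=None):
    # Stand-in loader: a list of dicts plays the role of a dataset,
    # and list comprehensions play the role of dataset.map().
    raw = [{"question": "What is 2+2?", "answer": "4"}]
    data = [apply_prompt_template(dict(s)) for s in raw]
    data = [tokenize(s, tokenizer, context_length) for s in data]
    return data
```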
**Contributor** commented:

It's better to first explain these points (6 and 7) above, and then follow with the sample templates.

**quic-meetkuma** (Author) replied:
I will refine the documentation.


To run fine-tuning on any user-specific dataset, prepare the dataset using the following steps:

**Contributor** commented:
Most of these points should not be removed, especially 4-7.

**quic-meetkuma** (Author) replied:
I will refine the documentation.

```json
"train_split": "train",
"test_split": "test",
"test_split_ratio": 0.15,
"preproc_file": "disc_preprocd.py:get_preprocessed_disc",
```
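The `"file.py:function_name"` convention used by `preproc_file` could be resolved with a small helper like the one below. This is a sketch of the idea only; the helper name and the repository's actual loading code are assumptions:

```python
import importlib.util
from pathlib import Path

def load_preproc_fn(spec):
    """Resolve a "path/to/file.py:function_name" spec into a callable."""
    file_path, _, func_name = spec.partition(":")
    module_name = Path(file_path).stem
    # Load the user's file as a standalone module...
    module_spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    # ...and pull out the requested preprocessing function.
    return getattr(module, func_name)
```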
**Contributor** commented:
A better approach from the user's perspective would be the existing one, in which the user only has to give the file path of the preprocessing file. Both calls, `get_custom_dataset()` and `get_data_collator()`, are then handled directly by the existing code.

This reduces the number of parameters requested from the user.

A default path for the file is also given in the code (i.e. `efficient-transformers/dataset/custom_dataset.py`), so the user can skip giving the path if they want to reuse it.

**quic-meetkuma** (Author) replied:

As discussed offline, that would not satisfy all the use cases. Where the user has their own preprocessing steps, it is better to take the preprocessing file from the user. If the user wants to use their own preprocessing, they can point to it in `sample_dataset_config.json`, and it will be available via the `dataset_config` object.

@quic-meetkuma quic-meetkuma marked this pull request as draft August 4, 2025 06:22
3 participants