[QEff. Finetune] Updated handling of custom dataset in FT. Updated finetune.md readme file. #520


Draft: wants to merge 5 commits into `main`
Conversation

**quic-meetkuma** (Contributor)
No description provided.

@quic-meetkuma quic-meetkuma changed the title Updated handling of custom dataset in FT. Updated finetune.md readme file. [QEff. Finetune] Updated handling of custom dataset in FT. Updated finetune.md readme file. Jul 24, 2025
@quic-meetkuma quic-meetkuma marked this pull request as ready for review July 24, 2025 08:00
```python
logger.raise_error(
    "For 'custom_dataset', please provide dataset config file via 'custom_dataset_config' flag.",
    RuntimeError,
)
```
**quic-swatia** (Contributor) commented on Jul 28, 2025:
Why aren't the changes done in this method handled in `update_config()` instead?

**quic-meetkuma** (Author) replied:
Here, the `dataset_config` defined at L108 is not a dict but a dataclass. To update the fields of the existing dataclass, we need to convert it into a dict, add and update keys from the `custom_dataset_config` file, and then reconstruct the dataclass. That is what this code is doing.
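The round-trip the author describes (dataclass to dict, merge, back to dataclass) can be sketched roughly as follows. The `DatasetConfig` class, its fields, and the merge helper are hypothetical stand-ins, not the repository's actual definitions:

```python
import dataclasses
import json

@dataclasses.dataclass
class DatasetConfig:
    # Hypothetical stand-in for the dataset config dataclass
    train_split: str = "train"
    test_split: str = "test"

def update_dataset_config(config, custom_config_path):
    """Convert the dataclass to a dict, merge keys from the user's JSON
    file, and rebuild the dataclass, attaching any extra keys."""
    config_dict = dataclasses.asdict(config)
    with open(custom_config_path) as f:
        config_dict.update(json.load(f))
    # Rebuild using only the fields the dataclass declares...
    known = {f.name for f in dataclasses.fields(config)}
    new_config = type(config)(**{k: v for k, v in config_dict.items() if k in known})
    # ...and attach any extra user-supplied keys as plain attributes.
    for k, v in config_dict.items():
        if k not in known:
            setattr(new_config, k, v)
    return new_config
```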

4. **Passing Additional Configuration Parameters**
You can add custom arguments in `data_config.json`, which will be accessible via the `dataset_config` argument inside your `get_custom_dataset()` function.

5. **Example `data_config.json` File**
**Contributor** commented:
Since there's already a `sample_dataset_config.json`, we can remove the sample from here to keep the documentation crisp. Either of the two should be enough.

**quic-meetkuma** (Author) replied:
It is fine to keep it here, to avoid back-and-forth while reading the documentation.

Comment on lines +94 to +108
- Your preprocessing function must follow this structure:
```python
def get_custom_dataset(dataset_config, tokenizer, split, context_length=None):
    def apply_prompt_template(sample):
        # Apply prompt formatting to each datapoint
        ...

    def tokenize(sample):
        # Tokenize the formatted datapoint
        ...

    # Apply both functions to the dataset using map
    dataset = dataset.map(apply_prompt_template, ...)
    dataset = dataset.map(tokenize, ...)

    return dataset
```
**Contributor** commented:

The existing template looks better to me, as it briefly explains what each function does, followed by the function call/definition.

**quic-meetkuma** (Author) replied:
It is the same, but I will still refine it along with the other documentation points to make it more descriptive.

Comment on lines +135 to +141
6. **Implementing Custom Preprocessing Logic**
Within your dataset loader function, define `apply_prompt_template()` to manipulate raw data into desired prompt format, and `tokenize()` to convert it into token IDs using the tokenizer.

7. **Reference for Dataset Utilities**
You can refer to existing implementations in the [dataset directory of this repository](https://github.com/quic/efficient-transformers/tree/main/QEfficient/finetune/dataset).

---
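The two helpers named in point 6 could look like the following sketch. The column names (`question`, `answer`), the prompt format, and the tokenizer interface are illustrative assumptions, not the PR's actual implementation; a plain list of dicts stands in for a real dataset object and its `.map()` calls:

```python
def apply_prompt_template(sample):
    # Format one raw datapoint into a prompt string.
    # "question"/"answer" are hypothetical column names.
    sample["text"] = f"Question: {sample['question']}\nAnswer: {sample['answer']}"
    return sample

def tokenize(sample, tokenizer, context_length=None):
    # Convert the formatted prompt into token IDs. The tokenizer is
    # assumed to be callable and to return a dict with an "input_ids"
    # list, as Hugging Face tokenizers do.
    encoded = tokenizer(
        sample["text"],
        truncation=context_length is not None,
        max_length=context_length,
    )
    sample["input_ids"] = encoded["input_ids"]
    return sample

def get_custom_dataset(dataset_config, tokenizer, split, context_length=None):
    # Stand-in loader: a list of dicts plays the role of a dataset,
    # and list comprehensions play the role of dataset.map().
    raw = [{"question": "What is 2+2?", "answer": "4"}]
    data = [apply_prompt_template(dict(s)) for s in raw]
    data = [tokenize(s, tokenizer, context_length) for s in data]
    return data
```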
**Contributor** commented:

It's better to first explain these points (6 and 7) above, and then follow with the sample templates.

**quic-meetkuma** (Author) replied:
I will refine the documentation.


To run fine-tuning on any user-specific dataset, prepare the dataset using the following steps:

**Contributor** commented:
Most of these points should not be removed, especially 4-7.

**quic-meetkuma** (Author) replied:
I will refine the documentation.

```json
"train_split": "train",
"test_split": "test",
"test_split_ratio": 0.15,
"preproc_file": "disc_preprocd.py:get_preprocessed_disc",
```
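The `"file.py:function_name"` convention used by `preproc_file` could be resolved with a small helper like the one below. This is a sketch of the idea only; the helper name and the repository's actual loading code are assumptions:

```python
import importlib.util
from pathlib import Path

def load_preproc_fn(spec):
    """Resolve a "path/to/file.py:function_name" spec into a callable."""
    file_path, _, func_name = spec.partition(":")
    module_name = Path(file_path).stem
    # Load the user's file as a standalone module...
    module_spec = importlib.util.spec_from_file_location(module_name, file_path)
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)
    # ...and pull out the requested preprocessing function.
    return getattr(module, func_name)
```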
**Contributor** commented:
A better approach from the user's perspective would be the existing one, in which the user only has to give the file path of the preprocessing file. Both calls, `get_custom_dataset()` and `get_data_collator()`, are then handled directly by the existing code.

This reduces the number of parameters requested from the user.

A default path for the file is also given in the code (i.e. `efficient-transformers/dataset/custom_dataset.py`), so the user can skip giving the path if they want to reuse it.

**quic-meetkuma** (Author) replied:

As discussed offline, that would not satisfy all the use cases. Where the user has their own preprocessing steps, it is better to take the preprocessing file from the user. If the user wants to use their own preprocessing, they can point to it in `sample_dataset_config.json`, and it will be available via the `dataset_config` object.

@quic-meetkuma quic-meetkuma marked this pull request as draft August 4, 2025 06:22
3 participants