[QEff. Finetune] Updated handling of custom dataset in FT. Updated finetune.md readme file. #520
Conversation
logger.raise_error(
    "For 'custom_dataset', please provide dataset config file via 'custom_dataset_config' flag.",
    RuntimeError,
)
Why are we not handling the changes done in this method in `update_config()` instead?
Here, the `dataset_config` defined at L108 is not a dict but a dataclass. To update the params of the existing dataclass, we need to convert it into a dict, add the new keys from the `custom_dataset_config` file, and then construct the dataclass again. This code is doing that.
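A minimal sketch of that dict-round-trip pattern, using hypothetical names (`DatasetConfig`, `update_dataset_config`); the PR's actual field and function names may differ:

```python
import json
from dataclasses import asdict, dataclass, make_dataclass


@dataclass
class DatasetConfig:  # hypothetical stand-in for the real dataclass
    dataset: str = "custom_dataset"
    train_split: str = "train"


def update_dataset_config(config, config_json_path):
    # Dataclass -> dict, merge in the keys from the user's JSON file,
    # then rebuild a dataclass so downstream code keeps attribute access.
    config_dict = asdict(config)
    with open(config_json_path) as f:
        config_dict.update(json.load(f))
    Updated = make_dataclass(
        "DatasetConfig", [(k, type(v)) for k, v in config_dict.items()]
    )
    return Updated(**config_dict)
```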
4. **Passing Additional Configuration Parameters**
   You can add custom arguments in `data_config.json`, which will be accessible via the `dataset_config` argument inside your `get_custom_dataset()` function.

5. **Example `data_config.json` File**
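A sample of what such a file could look like, based on the config keys quoted later in this review (`train_split`, `test_split`, `test_split_ratio`, `preproc_file`); `custom_arg` is a hypothetical extra parameter that would surface via `dataset_config`:

```json
{
    "train_split": "train",
    "test_split": "test",
    "test_split_ratio": 0.15,
    "preproc_file": "disc_preprocd.py:get_preprocessed_disc",
    "custom_arg": "some_value"
}
```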
Since there's already a sample_dataset_config.json, we can remove the sample from here to keep the documentation crisp. Either of the two should be enough.
It is fine to keep it here, to avoid back-and-forth while reading the documentation.
- Your preprocessing function must follow this structure:
```python
def get_custom_dataset(dataset_config, tokenizer, split, context_length=None):
    def apply_prompt_template():
        # Apply prompt formatting to each datapoint
        ...

    def tokenize():
        # Tokenize the formatted datapoint
        ...

    # Apply functions to dataset using map
    dataset = dataset.map(apply_prompt_template, ...)
    dataset = dataset.map(tokenize, ...)

    return dataset
```
The existing template looks better to me, as it briefly describes what each function does, followed by the function definition/call.
It is the same, but I will still refine it along with the other documentation points to make it more descriptive.
6. **Implementing Custom Preprocessing Logic**
   Within your dataset loader function, define `apply_prompt_template()` to transform the raw data into the desired prompt format, and `tokenize()` to convert it into token IDs using the tokenizer (see the sketch after this list).

7. **Reference for Dataset Utilities**
   You can refer to existing implementations in the [dataset directory of this repository](https://github.com/quic/efficient-transformers/tree/main/QEfficient/finetune/dataset).

---
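A minimal, self-contained sketch of points 6 and 7 combined, assuming a Hugging Face dataset with `instruction` and `response` columns; `dataset_config.dataset_name` and the column names are illustrative assumptions, not code from this PR:

```python
from datasets import load_dataset


def get_custom_dataset(dataset_config, tokenizer, split, context_length=None):
    # Load the raw split; in a real setup the dataset name/path would come
    # from dataset_config (the attribute name here is an assumption).
    dataset = load_dataset(dataset_config.dataset_name, split=split)

    def apply_prompt_template(example):
        # Reshape raw columns into the desired prompt format.
        example["text"] = (
            f"Instruction: {example['instruction']}\nResponse: {example['response']}"
        )
        return example

    def tokenize(example):
        # Convert the formatted prompt into token IDs.
        out = tokenizer(
            example["text"],
            max_length=context_length,
            truncation=context_length is not None,
        )
        out["labels"] = out["input_ids"].copy()
        return out

    dataset = dataset.map(apply_prompt_template)
    dataset = dataset.map(tokenize, remove_columns=dataset.column_names)
    return dataset
```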
It's better to first explain these points (6 and 7) and then follow them with the sample templates.
I will refine the documentation.
To run fine-tuning for any user-specific dataset, prepare the dataset using the following steps:
Most of these points should not be removed, especially 4-7.
I will refine the documentation.
"train_split": "train", | ||
"test_split": "test", | ||
"test_split_ratio": 0.15, | ||
"preproc_file": "disc_preprocd.py:get_preprocessed_disc", |
A better approach from the user's perspective would be the existing one, in which the user is required to give only the file path of the preprocessing file. Both calls, get_custom_dataset() and get_data_collator(), are taken care of directly by the existing code.
This reduces the number of parameters requested from the user.
A default path for the file is also given in the code (i.e. efficient-transformers/dataset/custom_dataset.py), so the user can skip giving the path if they want to reuse it.
As discussed offline, that would not satisfy all the use cases. Where the user has their own preprocessing steps, it is better to take those from the user. If the user wants to use their own preprocessing, they can point to it in sample_dataset_config.json, and that config will be available via the dataset_config object.