Add pre-tokenized Delta to MDS conversion script #1680
base: main
Conversation
```python
    'attention_mask': 'ndarray',
    'labels': 'ndarray',
}
convert_x = lambda x: (
```
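For context on the excerpt above: the columns spec maps each field to streaming's `ndarray` MDS encoding. Below is a minimal sketch of how such a spec is typically passed to `MDSWriter`; it is not the script's actual code, and the output path, sample values, and the presence of an `input_ids` column (truncated out of the excerpt) are assumptions.

```python
import numpy as np
from streaming import MDSWriter

# Mirrors the column spec quoted above; 'input_ids' is assumed to be part of
# the full spec that the excerpt cuts off.
columns = {
    'input_ids': 'ndarray',
    'attention_mask': 'ndarray',
    'labels': 'ndarray',
}

# Hypothetical output location and sample, purely for illustration.
with MDSWriter(out='/tmp/mds-out', columns=columns, compression='zstd') as writer:
    writer.write({
        'input_ids': np.array([1, 2, 3], dtype=np.int64),
        'attention_mask': np.array([1, 1, 1], dtype=np.int64),
        'labels': np.array([-100, 2, 3], dtype=np.int64),
    })
```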
why does this assume single turn?
Forge reduces/joins multi-turn CHAT into a single turn. I thought this was intended, but after looking at the code I don't think it is; PySpark might coerce the data into that format 🙃. Will do some more debugging.
IFT regression tests might be necessary after all.
Just following up that we confirmed multi-turn is preserved, and so this script should support it too.
Last few commits address comments and add support for multi-turn CHAT data. Provided testing plan + YAMLs in Slack thread. Looking for guidance on how to test with PEFT enabled.
```python
with open(json_full_filepath, 'r') as f:
    for line in f:
        turns = convert_x(json.loads(line))
        for turn in turns:
```
I don't think this is right? See llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py, lines 244 to 250 at 1e997df:
```python
sample_to_write = {'turns': []}
for turn in sample['turns']:
    turn_to_write = {}
    for key in ['input_ids', 'labels']:
        turn_to_write[key] = list(turn[key])
    sample_to_write['turns'].append(turn_to_write)
out.write(sample_to_write)
```
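As a point of comparison, here is a sketch of a conversion that preserves the per-turn layout written by the referenced code above instead of flattening to a single turn. The raw record's `turns` field, the helper name `convert_sample`, and the JSONL path are assumptions for illustration, not the PR's actual implementation.

```python
import json
import numpy as np

def convert_sample(raw: dict) -> dict:
    """Keep one entry per turn, mirroring the 'turns' structure that
    convert_finetuning_dataset.py writes out."""
    turns_out = []
    for turn in raw['turns']:  # assumes the raw JSON line already carries a 'turns' list
        turns_out.append({
            'input_ids': np.asarray(turn['input_ids'], dtype=np.int64),
            'labels': np.asarray(turn['labels'], dtype=np.int64),
        })
    return {'turns': turns_out}

# Hypothetical usage over a JSONL export:
with open('export.jsonl', 'r') as f:
    samples = [convert_sample(json.loads(line)) for line in f]
```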
This PR
Adds a conversion script for pre-tokenized data in a Delta table.
Testing
MCLI IFT and CPT runs trained successfully.