Contributor

@mattyding mattyding commented Nov 30, 2024

This PR

Adds a conversion script for pre-tokenized data stored in a Delta table.

Testing

MCLI IFT and CPT runs trained successfully.

@mattyding mattyding force-pushed the matt/split-mds-script branch from 750a241 to ef37a3f Compare March 4, 2025 20:01
@mattyding mattyding marked this pull request as ready for review March 4, 2025 20:08
@mattyding mattyding requested a review from a team as a code owner March 4, 2025 20:08
'attention_mask': 'ndarray',
'labels': 'ndarray',
}
convert_x = lambda x: (
Contributor

Why does this assume a single turn?

Contributor Author

Forge reduces/joins multi-turn CHAT data into a single turn. I thought this was intended, but after looking at the code I don't think it is; PySpark might be coercing the data into that format 🙃. Will do some more debugging.

IFT regression tests might be necessary after all.
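
One way to check whether multi-turn records were collapsed somewhere upstream is to count turns per record in the exported JSON lines. The sketch below is only an illustration: it assumes each line is a JSON object with a `turns` list, and `turn_count_histogram` is a hypothetical helper, not part of the script under review.

```python
import json
from collections import Counter


def turn_count_histogram(lines):
    """Return {number_of_turns: number_of_records} for JSONL input.

    Assumption: multi-turn records carry a 'turns' list. Records
    without one are counted as single-turn.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        turns = record.get('turns')
        n_turns = len(turns) if isinstance(turns, list) else 1
        counts[n_turns] += 1
    return dict(counts)


example_lines = [
    '{"turns": [{"input_ids": [1]}, {"input_ids": [2]}]}',
    '{"turns": [{"input_ids": [3]}]}',
    '',
    '{"input_ids": [4]}',
]
histogram = turn_count_histogram(example_lines)
print(histogram)
```

If every record lands in the single-turn bucket even though the source table contains multi-turn conversations, the collapse is happening before the write step.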

Contributor

Just following up: we confirmed that multi-turn is preserved, so this script should support it too.

@mattyding
Contributor Author

Last few commits address comments and add support for multi-turn CHAT data. Provided testing plan + YAMLs in Slack thread. Looking for guidance on how to test with PEFT enabled.

@mattyding mattyding requested a review from dakinggg March 17, 2025 14:21
with open(json_full_filepath, 'r') as f:
    for line in f:
        turns = convert_x(json.loads(line))
        for turn in turns:
Contributor

I don't think this is right? see

sample_to_write = {'turns': []}
for turn in sample['turns']:
    turn_to_write = {}
    for key in ['input_ids', 'labels']:
        turn_to_write[key] = list(turn[key])
    sample_to_write['turns'].append(turn_to_write)
out.write(sample_to_write)
for another example of writing turns data to MDS
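
A minimal sketch of the pattern the referenced snippet suggests, applied to a per-line JSON loop: each conversation is written once, with all of its turns kept together, rather than one write per turn. It assumes each JSON line holds a `turns` list of dicts with `input_ids` and `labels`. `ListWriter` and `write_samples` are hypothetical names, and `ListWriter` is a stand-in that only records samples in memory; it is not the real MDS writer.

```python
import io
import json


class ListWriter:
    """Hypothetical stand-in for an MDS writer: it only records samples."""

    def __init__(self):
        self.samples = []

    def write(self, sample):
        self.samples.append(sample)


def write_samples(lines, out):
    """Write one sample per JSON line, keeping all of its turns together."""
    for line in lines:
        record = json.loads(line)
        sample_to_write = {'turns': []}
        for turn in record['turns']:
            turn_to_write = {}
            for key in ['input_ids', 'labels']:
                turn_to_write[key] = list(turn[key])
            sample_to_write['turns'].append(turn_to_write)
        # One write per conversation, not one per turn.
        out.write(sample_to_write)


two_turn = {'turns': [
    {'input_ids': [1, 2], 'labels': [-100, 2]},
    {'input_ids': [3, 4, 5], 'labels': [-100, -100, 5]},
]}
one_turn = {'turns': [{'input_ids': [6], 'labels': [6]}]}
jsonl = io.StringIO(json.dumps(two_turn) + '\n' + json.dumps(one_turn) + '\n')

out = ListWriter()
write_samples(jsonl, out)
print(len(out.samples), [len(s['turns']) for s in out.samples])
```

Writing inside the inner loop instead would emit three samples here and lose the grouping of turns into conversations.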

@mattyding mattyding force-pushed the matt/split-mds-script branch from cf378e0 to 919aacd Compare March 19, 2025 02:06
@mattyding mattyding marked this pull request as draft May 12, 2025 14:26
3 participants