Add pre-tokenized Delta to MDS conversion script #1680

mattyding · 2024-11-30T06:36:54Z

This PR

Adds conversion script for pre-tokenized data in a Delta table.

Testing

MCLI IFT and CPT runs trained successfully.

dakinggg · 2025-03-05T00:46:13Z

llmfoundry/command_utils/data_prep/convert_delta_to_mds.py

+            'attention_mask': 'ndarray',
+            'labels': 'ndarray',
+        }
+        convert_x = lambda x: (


why does this assume single turn?

mattyding · Mar 5, 2025

Forge reduces/joins multi-turn CHAT data into a single turn. I thought this was intended, but after looking at the code, I don't think it is; PySpark might be coercing the data into that format 🙃. Will do some more debugging.

IFT regression tests might be necessary after all.

dakinggg · Mar 14, 2025

Just following up: we confirmed that multi-turn is preserved, so this script should support it too.

mattyding · 2025-03-17T14:21:11Z

Last few commits address comments and add support for multi-turn CHAT data. Provided testing plan + YAMLs in Slack thread. Looking for guidance on how to test with PEFT enabled.

dakinggg · 2025-03-17T19:20:54Z

llmfoundry/command_utils/data_prep/convert_delta_to_mds.py

+                with open(json_full_filepath, 'r') as f:
+                    for line in f:
+                        turns = convert_x(json.loads(line))
+                        for turn in turns:


I don't think this is right? see

llm-foundry/llmfoundry/command_utils/data_prep/convert_finetuning_dataset.py

Lines 244 to 250 in 1e997df

    sample_to_write = {'turns': []}
    for turn in sample['turns']:
        turn_to_write = {}
        for key in ['input_ids', 'labels']:
            turn_to_write[key] = list(turn[key])
        sample_to_write['turns'].append(turn_to_write)
    out.write(sample_to_write)

for another example of writing turns data to MDS
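
To make the difference concrete, here is a minimal sketch of the pattern the referenced snippet follows: each sample is written as one record whose `turns` list holds every turn, rather than one record per turn. The names `build_sample_to_write`, `write_samples`, and `ListWriter` are hypothetical and used only for illustration. `ListWriter` is a stand-in that collects records in memory; it is not the real MDS writer, and the input row layout is an assumption based on the snippet above.

```python
from typing import Any, Dict, Iterable, List


def build_sample_to_write(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Return one record that keeps all turns of a sample together."""
    sample_to_write: Dict[str, Any] = {'turns': []}
    for turn in sample['turns']:
        # Cast each field to a plain list so the record does not depend on
        # the input container type (tuples, arrays, and so on).
        turn_to_write = {
            key: list(turn[key]) for key in ('input_ids', 'labels')
        }
        sample_to_write['turns'].append(turn_to_write)
    return sample_to_write


class ListWriter:
    """Hypothetical stand-in for a dataset writer; it only collects records."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def write(self, record: Dict[str, Any]) -> None:
        self.records.append(record)


def write_samples(samples: Iterable[Dict[str, Any]], out: ListWriter) -> int:
    """Write one record per sample (not one per turn); return the count."""
    count = 0
    for sample in samples:
        out.write(build_sample_to_write(sample))
        count += 1
    return count


if __name__ == '__main__':
    # Assumed input shape: a sample with two pre-tokenized turns.
    example = {
        'turns': [
            {'input_ids': (1, 2), 'labels': (-100, 2)},
            {'input_ids': (3,), 'labels': (3,)},
        ],
    }
    writer = ListWriter()
    write_samples([example], writer)
    print(len(writer.records), len(writer.records[0]['turns']))
```

With this shape, a two-turn sample produces a single record containing two turns, so the turn grouping survives the conversion. Looping over `turns` and calling `write` on each turn would instead produce two unrelated records.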

mattyding added 12 commits November 21, 2024 22:45

delta to mds script v1

ed15c03

remove open folder

5379d5b

debug

48d26e4

added intermediate jsonl

aa9edbf

update script

fd54b59

cast to ndarray

2095115

nit

6a75da5

revert delta->jsonl refactor

a1a5274

nit

02dfcb5

update col name

4cf6d23

use dtypes

2932a9b

Merge remote-tracking branch 'origin' into matt/split-mds-script

23635c4

mattyding commented Nov 30, 2024

mattyding and others added 16 commits December 2, 2024 15:58

dbugging message

04e628e

test bugfix

08bc526

logic is hard

21abada

more testing

819c112

Merge remote-tracking branch 'origin/main' into matt/split-mds-script

accb12b

remove debug msg

19bf0a4

assume single turn input

b5bf28c

reuse convert_ft_dataset fn

9372f48

update for ft

408b96f

fix split

f47cfab

revert a few commits to not break

46fc2d0

rename file to train.jsonl

bb757c7

add debugging statement

6c3e0a7

change debugging statement

2712e2b

Merge branch 'main' into matt/split-mds-script

bdcd3c0

remove debugging statements

ef37a3f

mattyding force-pushed the matt/split-mds-script branch from 750a241 to ef37a3f Compare March 4, 2025 20:01

remove diff

919aacd

mattyding marked this pull request as ready for review March 4, 2025 20:08

mattyding requested a review from a team as a code owner March 4, 2025 20:08

dakinggg reviewed Mar 5, 2025

mattyding requested a review from dakinggg March 17, 2025 14:21

dakinggg reviewed Mar 17, 2025

mattyding force-pushed the matt/split-mds-script branch from cf378e0 to 919aacd Compare March 19, 2025 02:06

mattyding added 10 commits April 11, 2025 09:54

re-add fix for multi-turn

7a2aaaa

bump

654d084

Merge branch 'main' into matt/split-mds-script

5b918c4

update naming

38652da

debug

f8640ea

debug

92c5886

convert ndarray to bytes

512a165

debug

f47093f

first try specifying bytes type

d10e89e

attempt

d87886e

mattyding marked this pull request as draft May 12, 2025 14:26

mattyding added 7 commits May 12, 2025 07:38

print out dtype for debugging

2d7dbde

do the thing

01c953b

fix

084a2e4

i love debugging

dda4ec7

fix

5120abd

fix

ee8e274

fix

0ae048c
