# add FIM dataset support #2291
# Data Pipeline

## FIM dataset
**Contributor:** Maybe this doesn't necessarily belong to this PR, but we should at least note in the README that, in order to use FIM training, your pretraining dataset needs to be preprocessed with the special tokens. We might need to add support for a data preprocessing script as well (could be a separate PR).

**Contributor (Author):** Agreed, let's add it later in a separate PR.

**Contributor:** Yeah, something like:
`GPTFIMDataset` extends Megatron-Core's `GPTDataset` to support **Fill-in-the-Middle (FIM)** data augmentation.
It probabilistically converts samples into FIM format using configurable rates, with support for both PSM and SPM patterns, fragment-level splitting, and length-preserving output.
`GPTFIMDatasetConfig` is a configuration object extending `GPTDatasetConfig` that provides everything needed to enable FIM preprocessing.
**Attributes**
- `rate`: Probability of converting a sample into a FIM example. A value of `1.0` means FIM is always applied; a value of `0.0` means FIM is never applied.
- `spm_rate`: Probability of using the SPM FIM pattern (vs. PSM). The remaining probability (`1 - spm_rate`) selects the PSM (prefix-suffix-middle) pattern instead. For example, if `spm_rate = 0.3`: 30% SPM, 70% PSM.
- `extra_tokens`: Dictionary containing the FIM special tokens: `{"prefix", "middle", "suffix", "pad", "eod"}`.
- `split_sample`: Optional token around which samples are split before applying FIM. If provided, the input sequence is divided at every occurrence of this token, and FIM is applied independently to each fragment: `A B C <SPLIT_SAMPLE> D E F <SPLIT_SAMPLE> G H` -> `FIM(Fragment 1) <SPLIT_SAMPLE> FIM(Fragment 2) <SPLIT_SAMPLE> FIM(Fragment 3)`.
- `fragment_rate`: Probability of applying FIM to each fragment when `split_sample` is used.
- `no_prefix`: If the decoded sequence starts with this prefix, FIM is skipped.
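As an illustration of how `rate` and `spm_rate` interact, here is a minimal, self-contained sketch; the helper `choose_fim_pattern` is hypothetical and not part of this PR:

```python
import random

def choose_fim_pattern(rng: random.Random, rate: float, spm_rate: float):
    """Decide whether to apply FIM to a sample and, if so, which pattern.

    Returns "spm", "psm", or None (sample left unchanged).
    """
    if rng.random() >= rate:  # with probability (1 - rate), skip FIM entirely
        return None
    # with probability spm_rate use the SPM pattern, otherwise PSM
    return "spm" if rng.random() < spm_rate else "psm"

rng = random.Random(0)
assert choose_fim_pattern(rng, rate=1.0, spm_rate=0.0) == "psm"  # always FIM, always PSM
assert choose_fim_pattern(rng, rate=0.0, spm_rate=1.0) is None   # FIM never applied
```

With `rate=1.0` the first draw can never reach the skip branch, matching the "FIM is always applied" semantics described above.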
**Contributor (Author):** I'd leave it as it is, since it's the implementation from NeMo 1.
`GPTFIMDataset` is a dataset class that loads token sequences from an `IndexedDataset` and applies FIM transformations before returning each sample.
|
**Contributor:** How will the seq length change in this case? In many cases `IndexedDataset` has a constant seq length.

**Contributor (Author):** Seq length will not be changed.
**PSM Format**
```
[prefix_tok] prefix [suffix_tok] suffix [middle_tok] middle
```
**SPM Format**
```
[prefix_tok, suffix_tok] suffix [middle_tok] prefix middle
```
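The two layouts above can be sketched as a standalone illustration. This is not the PR's implementation: the sentinel token IDs and the `fim_permute` helper are made up for the example; in practice the sentinels would come from `extra_tokens` via the tokenizer:

```python
import numpy as np

# Hypothetical sentinel token IDs, standing in for the tokenizer's
# prefix/middle/suffix special tokens from extra_tokens.
PREFIX_TOK, MIDDLE_TOK, SUFFIX_TOK = 50253, 50254, 50255

def fim_permute(sample: np.ndarray, rng: np.random.Generator, spm: bool) -> np.ndarray:
    """Split a token sequence at two random boundaries and reassemble it
    in SPM or PSM order with sentinel tokens inserted."""
    lo, hi = sorted(rng.integers(0, len(sample) + 1, size=2))
    prefix, middle, suffix = sample[:lo], sample[lo:hi], sample[hi:]
    if spm:
        # SPM: [prefix_tok, suffix_tok] suffix [middle_tok] prefix middle
        parts = [[PREFIX_TOK, SUFFIX_TOK], suffix, [MIDDLE_TOK], prefix, middle]
    else:
        # PSM: [prefix_tok] prefix [suffix_tok] suffix [middle_tok] middle
        parts = [[PREFIX_TOK], prefix, [SUFFIX_TOK], suffix, [MIDDLE_TOK], middle]
    return np.concatenate([np.asarray(p, dtype=sample.dtype) for p in parts])
```

Note that the permutation adds three sentinel tokens, which is why the dataset later truncates or pads to keep the sequence length constant.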
**Special cases:**
- If the sequence starts with `no_prefix`, FIM is skipped.
- If FIM is not applied, the sample is returned unchanged.
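Combined with the length-preserving behavior noted earlier, the skip-and-restore logic might look roughly like this sketch; every name here, `maybe_fim` included, is hypothetical:

```python
import numpy as np

def maybe_fim(sample, rng, rate, no_prefix_ids, pad_id, transform):
    """Apply a FIM `transform` with probability `rate`, skipping sequences
    that start with `no_prefix_ids`; truncate or pad the result so the
    sequence length is unchanged."""
    n = len(sample)
    skip = len(no_prefix_ids) > 0 and list(sample[:len(no_prefix_ids)]) == list(no_prefix_ids)
    if skip or rng.random() >= rate:
        return sample  # special cases: the sample is returned unchanged
    out = transform(sample)
    if len(out) >= n:
        return out[:n]  # sentinel tokens lengthen the sequence; truncate back
    # if the transform shortened the sequence, pad back to the original length
    return np.concatenate([out, np.full(n - len(out), pad_id, dtype=sample.dtype)])
```

This mirrors the invariant discussed in the review thread: the seq length seen by the trainer never changes, regardless of whether FIM was applied.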