Commit 7a793e7

add readme
Signed-off-by: dimapihtar <[email protected]>
1 parent 26145dd commit 7a793e7

1 file changed: +41 -0 lines changed

@@ -0,0 +1,41 @@
# Data Pipeline

## FIM dataset

`GPTFIMDataset` extends Megatron-Core’s `GPTDataset` to support **Fill-in-the-Middle (FIM)** data augmentation.
It probabilistically converts samples into FIM format at configurable rates, and supports both the PSM (prefix-suffix-middle) and SPM (suffix-prefix-middle) patterns, fragment-level splitting, and length-preserving output.

`GPTFIMDatasetConfig` is a configuration object that extends `GPTDatasetConfig` with the settings needed to enable FIM preprocessing.

**Attributes**

- `rate`: Probability of converting a sample into a FIM example.

- `spm_rate`: Probability of using the SPM pattern instead of PSM when FIM is applied.

- `extra_tokens`: Dictionary of the FIM special tokens, with the keys `"prefix"`, `"middle"`, `"suffix"`, `"pad"`, and `"eod"`.

- `split_sample`: Optional token around which samples are split into fragments before FIM is applied.

- `fragment_rate`: Probability of applying FIM to each fragment when `split_sample` is used.

- `no_prefix`: If the decoded sequence starts with this prefix, FIM is skipped.
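
As a rough illustration of how these attributes fit together, the sketch below models them as a standalone dataclass. It is not the Megatron-Core class: `FIMConfigSketch`, its default values, its validation rules, and the token strings are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Keys the README lists for `extra_tokens`.
REQUIRED_TOKEN_KEYS = {"prefix", "middle", "suffix", "pad", "eod"}


@dataclass
class FIMConfigSketch:
    """Hypothetical stand-in that mirrors the documented attributes."""

    rate: float = 0.5                    # probability of converting a sample to FIM
    spm_rate: float = 0.5                # probability of SPM (otherwise PSM)
    extra_tokens: Dict[str, str] = field(default_factory=dict)
    split_sample: Optional[str] = None   # optional token to split samples on
    fragment_rate: float = 0.5           # per-fragment FIM probability when splitting
    no_prefix: Optional[str] = None      # skip FIM if decoded text starts with this

    def __post_init__(self) -> None:
        # Assumed sanity checks: all rates are probabilities.
        for name in ("rate", "spm_rate", "fragment_rate"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")
        missing = REQUIRED_TOKEN_KEYS - set(self.extra_tokens)
        if missing:
            raise ValueError(f"extra_tokens is missing keys: {sorted(missing)}")


# Placeholder token strings; use whatever the tokenizer actually defines.
config = FIMConfigSketch(
    rate=0.5,
    spm_rate=0.5,
    extra_tokens={
        "prefix": "<fim_prefix>",
        "middle": "<fim_middle>",
        "suffix": "<fim_suffix>",
        "pad": "<fim_pad>",
        "eod": "<|endoftext|>",
    },
)
```

Validating the rates and token keys up front is a design choice of this sketch; the real `GPTFIMDatasetConfig` may check these differently or not at all.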

`GPTFIMDataset` is the dataset class that loads token sequences from an `IndexedDataset` and applies FIM transformations before returning each sample.

**PSM Format**
```
[prefix_tok] prefix [suffix_tok] suffix [middle_tok] middle
```

**SPM Format**
```
[prefix_tok, suffix_tok] suffix [middle_tok] prefix middle
```
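
The two layouts above can be sketched in a few lines of Python. This is a minimal illustration, not the `GPTFIMDataset` code: the token ids `PRE`, `MID`, and `SUF`, the helper names, and the choice to pick cut points directly on the token list are all assumptions made for the example.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical special-token ids; real ids come from the tokenizer.
PRE, MID, SUF = -1, -2, -3


def split_three(tokens: Sequence[int], rng: random.Random) -> Tuple[List[int], List[int], List[int]]:
    """Cut a sequence into prefix, middle, suffix at two random boundaries."""
    lo, hi = sorted(rng.randint(0, len(tokens)) for _ in range(2))
    return list(tokens[:lo]), list(tokens[lo:hi]), list(tokens[hi:])


def to_psm(prefix: List[int], middle: List[int], suffix: List[int]) -> List[int]:
    """[prefix_tok] prefix [suffix_tok] suffix [middle_tok] middle"""
    return [PRE, *prefix, SUF, *suffix, MID, *middle]


def to_spm(prefix: List[int], middle: List[int], suffix: List[int]) -> List[int]:
    """[prefix_tok, suffix_tok] suffix [middle_tok] prefix middle"""
    return [PRE, SUF, *suffix, MID, *prefix, *middle]


def fim_permute(tokens: Sequence[int], spm_rate: float, rng: random.Random) -> List[int]:
    """Choose SPM with probability spm_rate, otherwise PSM."""
    prefix, middle, suffix = split_three(tokens, rng)
    if rng.random() < spm_rate:
        return to_spm(prefix, middle, suffix)
    return to_psm(prefix, middle, suffix)


print(to_psm([1, 2], [3], [4, 5]))  # [-1, 1, 2, -3, 4, 5, -2, 3]
print(to_spm([1, 2], [3], [4, 5]))  # [-1, -3, 4, 5, -2, 1, 2, 3]
```

Both layouts add exactly three special tokens, so a length-preserving implementation has to truncate or pad the result afterwards; that step is left out of this sketch.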

**Special cases:**

- If the decoded sequence starts with `no_prefix`, FIM is skipped.

- If FIM is not applied, the sample is returned unchanged.
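
The special cases above amount to a small gate in front of the FIM transformation. The sketch below shows one way to express it, under stated assumptions: `maybe_apply_fim` and its parameters are hypothetical names, the order of the two checks is a guess, and the `permute` callback stands in for whatever PSM/SPM transformation is actually used.

```python
import random
from typing import Callable, List, Optional


def maybe_apply_fim(
    tokens: List[int],
    decoded_text: str,
    rate: float,
    no_prefix: Optional[str],
    rng: random.Random,
    permute: Callable[[List[int]], List[int]],
) -> List[int]:
    """Return the sample unchanged unless FIM is both allowed and selected."""
    # Special case 1: the decoded text starts with no_prefix, so skip FIM.
    if no_prefix is not None and decoded_text.startswith(no_prefix):
        return tokens
    # Special case 2: the random draw did not select this sample.
    if rng.random() >= rate:
        return tokens
    return permute(tokens)


rng = random.Random(0)
reverse = lambda t: t[::-1]  # stand-in for a real FIM permutation

print(maybe_apply_fim([1, 2, 3], "abc", 0.0, None, rng, reverse))             # [1, 2, 3]
print(maybe_apply_fim([1, 2, 3], "<skip>abc", 1.0, "<skip>", rng, reverse))  # [1, 2, 3]
print(maybe_apply_fim([1, 2, 3], "abc", 1.0, "<skip>", rng, reverse))        # [3, 2, 1]
```

Because `rng.random()` returns a value in `[0, 1)`, a `rate` of `1.0` always applies the transformation and a `rate` of `0.0` never does.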
