Commit e17dc5b

fix readme
Signed-off-by: dimapihtar <[email protected]>
1 parent 7a793e7 commit e17dc5b

1 file changed: +0 -7 lines changed

megatron/training/datasets/README.md

Lines changed: 0 additions & 7 deletions
@@ -11,17 +11,11 @@ It probabilistically converts samples into FIM format using configurable rates,
 **Attributes**
 
 - `rate`: Probability of converting a sample into a FIM example.
-
 - `spm_rate`: Probability of using the SPM FIM pattern (vs PSM).
-
 - `extra_tokens`: Dictionary containing the FIM special tokens: {"prefix", "middle", "suffix", "pad", "eod"}.
-
 - `split_sample`: Optional token around which samples are split before applying FIM.
-
 - `fragment_rate`: Probability of applying FIM to each fragment when split_sample is used.
-
 - `no_prefix`: If the decoded sequence starts with this prefix, FIM is skipped.
-
 `GPTFIMDataset` dataset class that loads token sequences from an `IndexedDataset` and applies FIM transformations before returning each sample.
 
 **PSM Format**
@@ -37,5 +31,4 @@ It probabilistically converts samples into FIM format using configurable rates,
 **Special cases:**
 
 - If the sequence starts with no_prefix, FIM is skipped.
-
 - If FIM is not applied, the sample is returned unchanged.
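
Reviewer note: the attributes touched by this hunk (`rate`, `spm_rate`, `extra_tokens`, `no_prefix`) can be illustrated with a minimal sketch. This is not the code in `GPTFIMDataset`. The helper name `apply_fim`, the integer token ids, and the token-level `no_prefix` check are all assumptions for illustration (the README describes the check on the decoded sequence). The PSM and SPM layouts follow the ordering commonly used in FIM literature, since the README's own format section is not part of this diff. `split_sample` and `fragment_rate` are left out.

```python
import random


def apply_fim(tokens, rng, rate, spm_rate, extra_tokens, no_prefix_tokens=None):
    """Hypothetical per-sample FIM transform (illustration only)."""
    # Special case: the sample starts with the no_prefix marker, so FIM is skipped.
    # Assumption: compared on token ids here, not on decoded text.
    if no_prefix_tokens and tokens[: len(no_prefix_tokens)] == no_prefix_tokens:
        return list(tokens)

    # With probability (1 - rate) the sample is returned unchanged.
    if rng.random() >= rate:
        return list(tokens)

    # Two random cut points split the sample into prefix / middle / suffix.
    lo, hi = sorted(rng.randint(0, len(tokens)) for _ in range(2))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]

    pre = extra_tokens["prefix"]
    mid = extra_tokens["middle"]
    suf = extra_tokens["suffix"]

    if rng.random() < spm_rate:
        # SPM (assumed layout): suffix comes first, then prefix and middle together.
        return [pre, suf] + suffix + [mid] + prefix + middle
    # PSM (assumed layout): prefix, then suffix, then the middle to be predicted.
    return [pre] + prefix + [suf] + suffix + [mid] + middle


if __name__ == "__main__":
    # Made-up sentinel ids that do not collide with the sample tokens.
    sentinels = {"prefix": 100, "middle": 101, "suffix": 102}
    sample = list(range(10))
    print(apply_fim(sample, random.Random(0), rate=1.0, spm_rate=0.5, extra_tokens=sentinels))
```

In both layouts the original sample can be recovered by concatenating prefix, middle, and suffix, so the transform only reorders tokens and adds three sentinels.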
