Fill-in-the-Middle (FIM) Dataset Support

**Is your feature request related to a problem? Please describe.**
I need to train code models (such as StarCoder and CodeGemma) with the Fill-in-the-Middle (FIM) objective in Megatron Bridge. While Megatron Bridge has model support for code models, it has no dataset infrastructure for training them with the FIM objective.

**Describe the solution you'd like**
Add FIM dataset support. This would include:

- FIM Dataset config in `src/megatron/bridge/training/config.py`
- FIM Dataset implementation in `src/megatron/bridge/data/`
- Integration with existing dataset provider registry and blending support
- Tokenizer integration for the FIM sentinel tokens (prefix, middle, suffix, pad)
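
To make the config item more concrete, here is a minimal sketch of the options such a config might carry. Everything in it is an assumption for illustration: the class name `FIMDatasetConfigSketch`, the field names, and the defaults are hypothetical and do not exist in Megatron Bridge today. The sentinel strings follow the StarCoder tokenizer convention.

```python
from dataclasses import dataclass


@dataclass
class FIMDatasetConfigSketch:
    """Hypothetical FIM options; not an existing Megatron Bridge class."""

    # Probability that a given sample is rearranged for FIM at all.
    fim_rate: float = 0.5
    # Among FIM samples, probability of SPM ordering instead of PSM.
    fim_spm_rate: float = 0.5
    # Sentinel tokens (StarCoder convention; other tokenizers may differ).
    prefix_token: str = "<fim_prefix>"
    middle_token: str = "<fim_middle>"
    suffix_token: str = "<fim_suffix>"
    pad_token: str = "<fim_pad>"

    def __post_init__(self) -> None:
        # Both rates are probabilities, so reject anything outside [0, 1].
        for name in ("fim_rate", "fim_spm_rate"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {value}")

    def sentinel_tokens(self) -> list[str]:
        """Tokens the tokenizer must be able to map to single IDs."""
        return [self.prefix_token, self.middle_token, self.suffix_token, self.pad_token]
```

In a real implementation these fields would probably extend whatever dataset config already lives in `src/megatron/bridge/training/config.py`, so that blending and the provider registry keep working unchanged.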

**Describe alternatives you've considered**
N/A

**Additional context**
Reference implementation:

- [StarCoder](https://github.com/bigcode-project/Megatron-LM/blob/6c4bf908df8fd86b4977f54bf5b8bd4b521003d1/megatron/data/gpt_dataset.py#L339-L438) shows a FIM implementation in a Megatron-LM fork
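
For readers unfamiliar with the objective, the core transformation can be sketched as below. This is a simplified illustration, not a copy of the linked code: the function name `fim_permute` and its parameters are hypothetical, and it splits at token boundaries for brevity, whereas an implementation may prefer to split the decoded text at character level and re-tokenize each piece.

```python
import random


def fim_permute(tokens, prefix_id, middle_id, suffix_id, rng,
                fim_rate=0.5, spm_rate=0.5):
    """Return `tokens` unchanged, or rearranged for FIM training (sketch)."""
    # With probability 1 - fim_rate, keep the ordinary left-to-right sample.
    if rng.random() >= fim_rate:
        return list(tokens)

    # Pick two cut points and split the sample into prefix / middle / suffix.
    lo, hi = sorted(rng.randint(0, len(tokens)) for _ in range(2))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]

    if rng.random() < spm_rate:
        # SPM ordering: the suffix comes first, then prefix and middle.
        return [prefix_id, suffix_id, *suffix, middle_id, *prefix, *middle]
    # PSM ordering: prefix, then suffix, then the middle to be predicted.
    return [prefix_id, *prefix, suffix_id, *suffix, middle_id, *middle]


# Usage: force a PSM-ordered FIM sample from five token IDs.
rng = random.Random(0)
sample = fim_permute([1, 2, 3, 4, 5], prefix_id=100, middle_id=101,
                     suffix_id=102, rng=rng, fim_rate=1.0, spm_rate=0.0)
```

The rearranged sample is three tokens longer than the input, so a dataset that emits fixed-length sequences would also need to truncate or pad after this step.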

Fill-in-the-Middle (FIM) Dataset Support #1389
