
Fill-in-the-Middle (FIM) Dataset Support #1389

@sbhavani


Is your feature request related to a problem? Please describe.
I need to train code models (such as StarCoder and CodeGemma) with the Fill-in-the-Middle (FIM) objective in Megatron Bridge. Megatron Bridge already has model support for code models, but there is no dataset infrastructure for training them with the FIM objective.
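
For context, here is a minimal sketch of what the FIM objective does to a single document: it is cut into prefix, middle, and suffix, then reordered around sentinel tokens so the model learns to generate the middle last. The sentinel strings and the `format_fim` helper are assumptions for illustration only; each tokenizer defines its own sentinels, and the SPM layout shown is just one common variant.

```python
# Sentinel strings are assumed for illustration; real tokenizers define their own.
FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"


def format_fim(document: str, start: int, end: int, mode: str = "psm") -> str:
    """Reorder `document` so the span [start, end) becomes the infill target.

    `format_fim` is a hypothetical helper, not part of Megatron Bridge.
    """
    prefix, middle, suffix = document[:start], document[start:end], document[end:]
    if mode == "psm":
        # Prefix-Suffix-Middle: the model sees prefix and suffix, then predicts middle.
        return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
    if mode == "spm":
        # Suffix-Prefix-Middle (one common variant): suffix first, then prefix + middle.
        return f"{FIM_PREFIX}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{prefix}{middle}"
    raise ValueError(f"unknown FIM mode: {mode!r}")


print(format_fim("abcdef", 2, 4))
# prints <fim_prefix>ab<fim_suffix>ef<fim_middle>cd
```

In training, the cut points would be sampled randomly per document and the transformation applied only to a configurable fraction of samples.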

Describe the solution you'd like
Add FIM dataset support. This would include:

  • FIM Dataset config in src/megatron/bridge/training/config.py
  • FIM Dataset implementation in src/megatron/bridge/data/
  • Integration with existing dataset provider registry and blending support
  • Tokenizer integration
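
To make the request more concrete, here is a rough sketch of how the config and dataset pieces could fit together. Everything in it is hypothetical: `FIMDatasetConfig`, `FIMDataset`, the field names, the default sentinel strings, and the plain `vocab` dict standing in for the tokenizer are illustrative assumptions, not existing Megatron Bridge APIs. It also skips blending and sequence-length handling (the output is three tokens longer than the input).

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class FIMDatasetConfig:
    """Hypothetical config; names and defaults are assumptions for illustration."""
    fim_rate: float = 0.5   # fraction of samples that get the FIM transformation
    spm_rate: float = 0.5   # among FIM samples, fraction using the SPM layout
    seed: int = 1234
    prefix_token: str = "<fim_prefix>"
    middle_token: str = "<fim_middle>"
    suffix_token: str = "<fim_suffix>"


class FIMDataset:
    """Wraps already-tokenized samples and applies FIM on access (sketch only)."""

    def __init__(self, samples: Sequence[Sequence[int]], vocab: Dict[str, int],
                 config: FIMDatasetConfig) -> None:
        self.samples = samples
        self.config = config
        # Tokenizer integration: resolve sentinel strings to token IDs once.
        self.pre = vocab[config.prefix_token]
        self.mid = vocab[config.middle_token]
        self.suf = vocab[config.suffix_token]

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int) -> List[int]:
        tokens = list(self.samples[idx])
        # Seed per index so the same sample is transformed the same way every time.
        rng = random.Random(self.config.seed + idx)
        if rng.random() >= self.config.fim_rate:
            return tokens  # left as an ordinary left-to-right sample
        # Pick two cut points and split into prefix / middle / suffix.
        lo, hi = sorted(rng.randint(0, len(tokens)) for _ in range(2))
        prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]
        if rng.random() < self.config.spm_rate:
            return [self.pre, self.suf] + suffix + [self.mid] + prefix + middle
        return [self.pre] + prefix + [self.suf] + suffix + [self.mid] + middle
```

A real implementation would presumably plug into the existing dataset provider registry and truncate or pad samples back to the configured sequence length; those details are left open here.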

Describe alternatives you've considered
N/A

Additional context
Reference implementations:
StarCoder's Megatron-LM fork includes a FIM implementation.

Labels: enhancement (New feature or request)
