Open
Labels
enhancement: New feature or request
Description
Is your feature request related to a problem? Please describe.
I need to train code models (such as StarCoder and CodeGemma) with the Fill-in-the-Middle (FIM) objective in Megatron Bridge. Megatron Bridge already supports these model architectures, but it has no dataset infrastructure for FIM training.
Describe the solution you'd like
Add FIM dataset support. This would include:
- FIM Dataset config in src/megatron/bridge/training/config.py
- FIM Dataset implementation in src/megatron/bridge/data/
- Integration with existing dataset provider registry and blending support
- Tokenizer integration
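
To make the request concrete, here is a minimal sketch of the per-sample transformation a FIM dataset would apply. It is illustrative only: `FIMConfig`, `apply_fim`, the field names, and the sentinel token IDs are all hypothetical and are not existing Megatron Bridge APIs. It assumes the tokenizer exposes three sentinel tokens (prefix, middle, suffix) and supports both the PSM (prefix-suffix-middle) and SPM (suffix-prefix-middle) orderings.

```python
import random
from dataclasses import dataclass


@dataclass
class FIMConfig:
    """Hypothetical config; names are illustrative, not Megatron Bridge API."""
    fim_rate: float = 0.5  # probability that a sample is FIM-transformed
    spm_rate: float = 0.5  # among transformed samples, fraction using SPM order
    prefix_id: int = 1     # assumed sentinel IDs; real values come from the tokenizer
    middle_id: int = 2
    suffix_id: int = 3


def apply_fim(tokens, cfg, rng):
    """Return a FIM-rearranged copy of `tokens`, or an unchanged copy."""
    if len(tokens) < 2 or rng.random() >= cfg.fim_rate:
        return list(tokens)

    # Pick two cut points and split the document into prefix / middle / suffix.
    lo, hi = sorted(rng.randint(0, len(tokens)) for _ in range(2))
    prefix, middle, suffix = tokens[:lo], tokens[lo:hi], tokens[hi:]

    if rng.random() < cfg.spm_rate:
        # SPM: the suffix is shown first, then prefix and middle are contiguous.
        return [cfg.prefix_id, cfg.suffix_id, *suffix,
                cfg.middle_id, *prefix, *middle]
    # PSM: prefix, then suffix, then the middle the model must generate.
    return [cfg.prefix_id, *prefix, cfg.suffix_id, *suffix,
            cfg.middle_id, *middle]


if __name__ == "__main__":
    rng = random.Random(0)
    sample = list(range(10, 20))
    print(apply_fim(sample, FIMConfig(fim_rate=1.0), rng))
```

Because the transformation adds three sentinel tokens, a real dataset implementation would also need to account for the sequence-length budget, for example by truncating after rearrangement.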
Describe alternatives you've considered
N/A
Additional context
Reference Implementations:
• StarCoder includes a FIM implementation in its Megatron-LM fork
kannankumar