Welcome to the WMT2024-LRILT repository! This project contains the system developed by the NLIP Lab at IIT Hyderabad for the WMT 2024 Shared Task on Low-Resource Indic Language Translation. Our work focuses on advancing translation for the English ↔ Assamese, Khasi, Mizo, and Manipuri language pairs by fine-tuning pre-trained models.
- Language-Specific Fine-Tuning: Fine-tuning pre-trained models such as IndicRASP and IndicRASP Seed for each low-resource language pair.
- Multilingual Support: Cross-lingual transfer learning using multilingual models with script-based language grouping.
- Layer-Freezing Techniques: Freezing selected layers to make transfer learning more efficient (see the sketch after this list).
- Alignment Augmentation: Improving translation quality using alignment-based pre-training objectives.
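As a rough illustration of the layer-freezing idea, here is a minimal PyTorch sketch, not the project's actual training code; it assumes a fairseq-style model that exposes `model.encoder.layers` as an `nn.ModuleList`:

```python
import torch.nn as nn

def freeze_encoder_layers(model: nn.Module, num_frozen: int) -> None:
    """Freeze the bottom `num_frozen` encoder layers of a Transformer model.

    Assumes a fairseq-style layout where `model.encoder.layers` is an
    nn.ModuleList of encoder layers (an assumption, not this repository's
    exact API).
    """
    for layer in model.encoder.layers[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False  # excluded from gradient updates
```

Freezing the lower layers keeps the pre-trained representations intact while only the remaining parameters adapt to the low-resource pair.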
Our approach leverages the following models:
- IndicRASP: Pre-trained on 22 scheduled Indic languages with an alignment-augmentation objective.
- IndicRASP Seed: IndicRASP fine-tuned on a small amount of high-quality data, yielding improved translation quality.
We experimented with bilingual and multilingual setups, grouping languages by script similarity, and explored layer-freezing techniques to optimize performance.
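For illustration, script-based grouping simply pools each language into the multilingual training group of its writing system. The following is a minimal sketch; the FLORES-style language codes and the `training_group` helper are assumptions, and the grouping mirrors the scripts listed in the tables below:

```python
# Script-based grouping of the four target languages (FLORES-style codes assumed).
SCRIPT_GROUPS = {
    "bengali": ["asm_Beng", "mni_Beng"],  # Assamese, Manipuri (Meitei in Bengali script)
    "latin": ["kha_Latn", "lus_Latn"],    # Khasi, Mizo
}

def training_group(lang_code: str) -> str:
    """Return the script group used to pool a language into a multilingual run."""
    for script, langs in SCRIPT_GROUPS.items():
        if lang_code in langs:
            return script
    raise KeyError(f"Unknown language code: {lang_code}")
```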
This repository contains checkpoints for translation models trained on low-resource Indic languages as part of the WMT24 Shared Task. Below are the links to the available models for the En → Indic and Indic → En directions.
En → Indic models:

| Language | Script | Checkpoint Name | Download Link |
|---|---|---|---|
| Assamese | Bengali | checkpoint_best.pt | Download |
| Khasi | Latin | checkpoint_best.pt | Download |
| Mizo | Latin | checkpoint_best.pt | Download |
| Manipuri | Bengali | checkpoint_best.pt | Download |
Indic → En models:

| Language | Script | Checkpoint Name | Download Link |
|---|---|---|---|
| Assamese | Bengali | checkpoint_best.pt | Download |
| Khasi | Latin | checkpoint_best.pt | Download |
| Mizo | Latin | checkpoint_best.pt | Download |
| Manipuri | Bengali | checkpoint_best.pt | Download |
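Since the checkpoints use fairseq's `checkpoint_best.pt` naming, a downloaded file can be inspected directly with PyTorch before plugging it into a fairseq pipeline. This is a minimal sketch; the file path is a placeholder and the key names are typical of fairseq checkpoints, not verified against these specific files:

```python
import torch

# Load a downloaded checkpoint on CPU and peek at its structure.
# fairseq checkpoints store config objects, so full unpickling is needed
# (weights_only=False) on recent PyTorch versions.
ckpt = torch.load("checkpoint_best.pt", map_location="cpu", weights_only=False)

# fairseq checkpoints usually keep the training config under "cfg" (or "args"
# in older versions) and the model weights under "model".
print(list(ckpt.keys()))
num_params = sum(p.numel() for p in ckpt["model"].values())
print(f"{num_params:,} parameters in the model state dict")
```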
Our system achieved the following results on the public test set:
| Language Pair | BLEU Score | chrF2 Score |
|---|---|---|
| English → Assamese | 20.1 | 50.6 |
| English → Khasi | 19.1 | 42.3 |
| English → Mizo | 30.0 | 54.9 |
| English → Manipuri | 35.6 | 66.3 |
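The BLEU and chrF2 numbers can be recomputed with sacrebleu given detokenized system outputs and references (a minimal sketch; the file names are placeholders):

```python
import sacrebleu

# One detokenized sentence per line (placeholder file names).
with open("hyp.as.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("ref.as.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs])  # sacrebleu's chrF defaults to beta=2 (chrF2)
print(f"BLEU: {bleu.score:.1f}  chrF2: {chrf.score:.1f}")
```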
We used the IndicNECorp1.0 dataset provided by the IndicMT shared task organizers. It includes parallel and monolingual data for Assamese, Khasi, Mizo, and Manipuri languages. For more details, refer to the official WMT24 shared task page.
- Pramit Sahoo (@pramitsahoo)
- Maharaj Brahma
- Maunendra Sankar Desarkar
This project is licensed under the MIT License - see the LICENSE file for details.
For more details on the system and our experiments, please refer to our paper.
Happy translating!