maharajbrahma/WMT2024-LRILT

WMT2024-LRILT

Overview

Welcome to the WMT2024-LRILT repository! This project contains the system developed by the NLIP Lab at IIT Hyderabad for the WMT 2024 Shared Task on Low-Resource Indic Language Translation. Our work focuses on advancing the translation of English ↔ Assamese, Khasi, Mizo, and Manipuri language pairs through fine-tuning pre-trained models.

Features

  • Language-Specific Fine-Tuning: Fine-tuning pre-trained models such as IndicRASP and IndicRASP Seed for low-resource languages.
  • Multilingual Support: Cross-lingual transfer learning using multilingual models with script-based language grouping.
  • Layer-Freezing Techniques: Leveraging frozen layers to enhance the efficiency of transfer learning.
  • Alignment Augmentation: Improving translation quality using alignment-based pre-training objectives.
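As a rough illustration of the layer-freezing idea, the sketch below decides which parameters stay trainable based on name prefixes. The parameter names and the chosen prefixes are hypothetical and are not taken from this repository; which layers were actually frozen in our experiments is described in the paper, not here.

```python
from typing import Dict, Iterable, Tuple


def plan_frozen_params(
    param_names: Iterable[str], frozen_prefixes: Tuple[str, ...]
) -> Dict[str, bool]:
    """Map each parameter name to True (trainable) or False (frozen)."""
    # str.startswith accepts a tuple, so one call checks all prefixes.
    return {name: not name.startswith(frozen_prefixes) for name in param_names}


# Hypothetical parameter names, for illustration only.
names = [
    "encoder.embed_tokens.weight",
    "encoder.layers.0.self_attn.k_proj.weight",
    "encoder.layers.1.self_attn.k_proj.weight",
    "decoder.layers.0.self_attn.k_proj.weight",
]

# Assumption: freeze the encoder embeddings and the first encoder layer.
# The trailing dot keeps "encoder.layers.1." from matching "encoder.layers.10.".
plan = plan_frozen_params(names, ("encoder.embed_tokens.", "encoder.layers.0."))

for name, trainable in plan.items():
    print(f"{name}: {'trainable' if trainable else 'frozen'}")
```

With a PyTorch model, a plan like this could be applied by iterating over `model.named_parameters()` and setting `requires_grad` on each parameter accordingly.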

System Architecture

Our approach leverages the following models:

  • IndicRASP: Pre-trained on 22 scheduled Indic languages, focused on alignment augmentation.
  • IndicRASP Seed: A fine-tuned version of IndicRASP on high-quality, small-scale data, demonstrating improved translation results.

We experimented with bilingual and multilingual setups, using language grouping based on script similarity, and explored layer-freezing techniques to optimize performance.
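The script-based grouping can be sketched as follows. The language-to-script mapping is the one listed in the Models tables of this README; the function name and the way groups are returned are assumptions for illustration, not the code used to build our multilingual training sets.

```python
from collections import defaultdict
from typing import Dict, List

# Script of each target language, as listed in the Models tables.
LANGUAGE_SCRIPTS = {
    "Assamese": "Bengali",
    "Khasi": "Latin",
    "Mizo": "Latin",
    "Manipuri": "Bengali",
}


def group_by_script(language_scripts: Dict[str, str]) -> Dict[str, List[str]]:
    """Group languages that share a script (hypothetical helper)."""
    groups = defaultdict(list)
    for language, script in language_scripts.items():
        groups[script].append(language)
    # Sort each group so the result does not depend on insertion order.
    return {script: sorted(langs) for script, langs in groups.items()}


groups = group_by_script(LANGUAGE_SCRIPTS)
for script, languages in groups.items():
    print(f"{script}: {', '.join(languages)}")
```

Each group would then correspond to one candidate multilingual setup, for example a single model covering Assamese and Manipuri.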

Models

This repository contains checkpoints for translation models trained on low-resource Indic languages as part of the WMT24 Shared Task. Below are the links to the available models for the En → Indic and Indic → En directions.

En → Indic

Language | Script | Checkpoint Name | Download Link
Assamese | Bengali | checkpoint_best.pt | Download
Khasi | Latin | checkpoint_best.pt | Download
Mizo | Latin | checkpoint_best.pt | Download
Manipuri | Bengali | checkpoint_best.pt | Download

Indic → En

Language | Script | Checkpoint Name | Download Link
Assamese | Bengali | checkpoint_best.pt | Download
Khasi | Latin | checkpoint_best.pt | Download
Mizo | Latin | checkpoint_best.pt | Download
Manipuri | Bengali | checkpoint_best.pt | Download

Results

Our system achieved the following results on the public test set:

Language Pair | BLEU | chrF2
English → Assamese | 20.1 | 50.6
English → Khasi | 19.1 | 42.3
English → Mizo | 30.0 | 54.9
English → Manipuri | 35.6 | 66.3
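For readers unfamiliar with chrF2, the sketch below shows the general idea: a character n-gram F-score with beta = 2, so recall is weighted more heavily than precision. This is a simplified illustration, not the scorer used to produce the numbers above; it will not reproduce the output of standard evaluation tools exactly, and the function names are our own.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_order: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average n-gram precision/recall, then F-beta (0-100)."""
    precisions, recalls = [], []
    for n in range(1, max_order + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Toy strings for illustration only; these are not from the test set.
print(simple_chrf("the cat sat", "the cat sat"))
print(simple_chrf("the cat sat", "the cat sits"))
```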

Datasets

We used the IndicNECorp1.0 dataset provided by the IndicMT shared task organizers. It includes parallel and monolingual data for Assamese, Khasi, Mizo, and Manipuri languages. For more details, refer to the official WMT24 shared task page.

Contributors

  • Pramit Sahoo (@pramitsahoo)
  • Maharaj Brahma
  • Maunendra Sankar Desarkar

License

This project is licensed under the MIT License - see the LICENSE file for details.


For more details on the system and our experiments, please refer to our paper.

Happy translating! 🌐✨
