Welcome to the WMT2024-LRILT repository! This project contains the system developed by the NLIP Lab at IIT Hyderabad for the WMT 2024 Shared Task on Low-Resource Indic Language Translation. Our work focuses on advancing translation for the English ↔ Assamese, Khasi, Mizo, and Manipuri language pairs by fine-tuning pre-trained models.
- Language-Specific Fine-Tuning: Fine-tuning pre-trained models such as IndicRASP and IndicRASP Seed for each low-resource language pair.
- Multilingual Support: Cross-lingual transfer learning using multilingual models with script-based language grouping.
- Layer-Freezing Techniques: Freezing selected layers to make transfer learning more efficient (see the sketch after this list).
- Alignment Augmentation: Improving translation quality using alignment-based pre-training objectives.
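As a rough illustration of the layer-freezing idea, here is a minimal PyTorch sketch, not the project's actual training code; it assumes a fairseq-style model that exposes `model.encoder.layers` as an `nn.ModuleList`:

```python
import torch.nn as nn

def freeze_encoder_layers(model: nn.Module, num_frozen: int) -> None:
    """Freeze the bottom `num_frozen` encoder layers of a Transformer model.

    Assumes a fairseq-style layout where `model.encoder.layers` is an
    nn.ModuleList of encoder layers (an assumption, not this repository's
    exact API).
    """
    for layer in model.encoder.layers[:num_frozen]:
        for param in layer.parameters():
            param.requires_grad = False  # excluded from gradient updates
```

Freezing the lower layers keeps the pre-trained representations intact while only the remaining parameters adapt to the low-resource pair.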
Our approach leverages the following models:
- IndicRASP: Pre-trained on 22 scheduled Indic languages with an alignment-augmentation objective.
- IndicRASP Seed: IndicRASP fine-tuned on a small amount of high-quality data, yielding improved translation quality.
We experimented with bilingual and multilingual setups, grouping languages by script similarity, and explored layer-freezing techniques to optimize performance.
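For illustration, script-based grouping simply pools each language into the multilingual training group of its writing system. The following is a minimal sketch; the FLORES-style language codes and the `training_group` helper are assumptions, and the grouping mirrors the scripts listed in the tables below:

```python
# Script-based grouping of the four target languages (FLORES-style codes assumed).
SCRIPT_GROUPS = {
    "bengali": ["asm_Beng", "mni_Beng"],  # Assamese, Manipuri (Meitei in Bengali script)
    "latin": ["kha_Latn", "lus_Latn"],    # Khasi, Mizo
}

def training_group(lang_code: str) -> str:
    """Return the script group used to pool a language into a multilingual run."""
    for script, langs in SCRIPT_GROUPS.items():
        if lang_code in langs:
            return script
    raise KeyError(f"Unknown language code: {lang_code}")
```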
This repository contains checkpoints for translation models trained on low-resource Indic languages as part of the WMT24 Shared Task. Below are the links to the available models for the En → Indic and Indic → En directions.
En → Indic models:

| Language | Script | Checkpoint Name | Download Link |
|---|---|---|---|
| Assamese | Bengali | checkpoint_best.pt | Download |
| Khasi | Latin | checkpoint_best.pt | Download |
| Mizo | Latin | checkpoint_best.pt | Download |
| Manipuri | Bengali | checkpoint_best.pt | Download |
Indic → En models:

| Language | Script | Checkpoint Name | Download Link |
|---|---|---|---|
| Assamese | Bengali | checkpoint_best.pt | Download |
| Khasi | Latin | checkpoint_best.pt | Download |
| Mizo | Latin | checkpoint_best.pt | Download |
| Manipuri | Bengali | checkpoint_best.pt | Download |
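Since the checkpoints use fairseq's `checkpoint_best.pt` naming, a downloaded file can be inspected directly with PyTorch before plugging it into a fairseq pipeline. This is a minimal sketch; the file path is a placeholder and the key names are typical of fairseq checkpoints, not verified against these specific files:

```python
import torch

# Load a downloaded checkpoint on CPU and peek at its structure.
# fairseq checkpoints store config objects, so full unpickling is needed
# (weights_only=False) on recent PyTorch versions.
ckpt = torch.load("checkpoint_best.pt", map_location="cpu", weights_only=False)

# fairseq checkpoints usually keep the training config under "cfg" (or "args"
# in older versions) and the model weights under "model".
print(list(ckpt.keys()))
num_params = sum(p.numel() for p in ckpt["model"].values())
print(f"{num_params:,} parameters in the model state dict")
```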
Our system achieved the following results on the public test set:
| Language Pair | BLEU Score | chrF2 Score |
|---|---|---|
| English → Assamese | 20.1 | 50.6 |
| English → Khasi | 19.1 | 42.3 |
| English → Mizo | 30.0 | 54.9 |
| English → Manipuri | 35.6 | 66.3 |
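The BLEU and chrF2 numbers can be recomputed with sacrebleu given detokenized system outputs and references (a minimal sketch; the file names are placeholders):

```python
import sacrebleu

# One detokenized sentence per line (placeholder file names).
with open("hyp.as.txt", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("ref.as.txt", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs])  # sacrebleu's chrF defaults to beta=2 (chrF2)
print(f"BLEU: {bleu.score:.1f}  chrF2: {chrf.score:.1f}")
```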
We used the IndicNECorp1.0 dataset provided by the IndicMT shared task organizers. It includes parallel and monolingual data for Assamese, Khasi, Mizo, and Manipuri languages. For more details, refer to the official WMT24 shared task page.
- Pramit Sahoo (@pramitsahoo)
- Maharaj Brahma
- Maunendra Sankar Desarkar
This project is licensed under the MIT License - see the LICENSE file for details.
For more details on the system and our experiments, please refer to our paper.
Happy translating!