Investigate mapping token embeddings from source to target #481

Description

@mshannon-sil (Collaborator)

A recently published paper introduced a strategy called "trans-tokenization", which "focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language." We should investigate whether this approach could improve the performance of adding trained tokens to NLLB.
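The core of the trans-tokenization idea described above can be sketched in a few lines: each new target-language token embedding is a normalized weighted average of the embeddings of its semantically similar source-language tokens. This is a minimal illustration, not the paper's implementation; the similarity IDs and weights (which in practice would come from, e.g., a bilingual lexicon or aligned embedding spaces) are hypothetical here.

```python
import numpy as np

def init_target_embedding(source_embeddings, similar_ids, weights):
    # Normalize the similarity weights so they sum to 1, then take the
    # weighted average of the selected source-token embedding rows.
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    return (source_embeddings[similar_ids] * w[:, None]).sum(axis=0)

# Toy demonstration with a random 1000 x 16 "source" embedding matrix.
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 16))
vec = init_target_embedding(src, [3, 7, 42], [0.5, 0.3, 0.2])
```

For NLLB this would mean initializing each newly added trained token's row in the embedding matrix this way, instead of randomly, before continued training.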

Sub-issues

0 of 2 Issues completed

Activity

moved this from 🆕 New to 📋 Backlog in SIL-NLP Research on Oct 16, 2024
moved this from 📋 Backlog to 🏗 In progress in SIL-NLP Research on Nov 18, 2024
moved this from 🏗 In progress to 📋 Backlog in SIL-NLP Research on Dec 18, 2024
removed their assignment on Jan 8, 2025
johnml1135 (Collaborator) commented on Feb 26, 2025

Here is another paper on this question: https://arxiv.org/abs/2502.10852
To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much larger models.
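The weight-reuse idea from that abstract can be sketched as follows. This is a minimal illustration of initializing a decoder from a trained encoder's weights, assuming PyTorch; it is not the paper's actual XLM-SWCM implementation, and the layer sizes are arbitrary. Self-attention and feed-forward weights are copied from each encoder layer into the corresponding decoder layer, while the decoder's cross-attention (which has no encoder counterpart) keeps its fresh initialization.

```python
import torch
import torch.nn as nn

# Hypothetical encoder/decoder stacks with matching dimensions.
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

decoder_layer = nn.TransformerDecoderLayer(d_model=64, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

# Copy each encoder layer's self-attention and feed-forward weights
# into the corresponding decoder layer; cross-attention stays random.
for enc, dec in zip(encoder.layers, decoder.layers):
    dec.self_attn.load_state_dict(enc.self_attn.state_dict())
    dec.linear1.load_state_dict(enc.linear1.state_dict())
    dec.linear2.load_state_dict(enc.linear2.state_dict())
```

The intent is that the decoder starts in (roughly) the same semantic space the encoder already learned, rather than from scratch, which is what makes generation feasible with very little target-language data.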

Metadata

Assignees: no one assigned
Labels: none
Type: none
Projects: Status: 📋 Backlog
Milestone: none
Relationships: none yet
Development: no branches or pull requests

Participants: @johnml1135, @TaperChipmunk32, @mshannon-sil

          Investigate mapping token embeddings from source to target · Issue #481 · sillsdev/silnlp