Open · 0 of 2 sub-issues completed
Labels: research (Research topics)
Description
A recently published paper introduced a strategy called "trans-tokenization", which "focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language." We should investigate whether this approach could improve the performance of adding trained tokens to NLLB.
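For concreteness, here is a minimal sketch of what that embedding initialization could look like, assuming we already have an alignment from each new target token to weighted, semantically similar source tokens. The `alignment` mapping, `init_target_embeddings`, and all values below are hypothetical illustrations, not the paper's implementation:

```python
# Sketch: initialize target-language token embeddings as weighted averages
# of semantically similar source-language token embeddings.
import numpy as np

rng = np.random.default_rng(0)

# Toy source embedding matrix: 5 source tokens, embedding dimension 4.
src_emb = rng.normal(size=(5, 4)).astype(np.float32)

# Hypothetical alignment: target token id -> [(source token id, weight), ...],
# e.g. derived from word alignments over a parallel corpus.
alignment = {
    0: [(1, 0.7), (3, 0.3)],  # target token 0 resembles source tokens 1 and 3
    1: [(0, 1.0)],            # target token 1 maps to a single source token
    2: [],                    # no alignment found; fall back to random init
}

def init_target_embeddings(src_emb, alignment, n_target, rng):
    dim = src_emb.shape[1]
    # Random fallback, scaled to match the source embedding distribution.
    tgt_emb = rng.normal(scale=float(src_emb.std()),
                         size=(n_target, dim)).astype(np.float32)
    for tgt_id, pairs in alignment.items():
        if not pairs:
            continue  # keep the random fallback for unaligned tokens
        ids, weights = zip(*pairs)
        w = np.asarray(weights, dtype=np.float32)
        w /= w.sum()  # normalize so the result is a convex combination
        tgt_emb[tgt_id] = (w[:, None] * src_emb[list(ids)]).sum(axis=0)
    return tgt_emb

tgt_emb = init_target_embeddings(src_emb, alignment, n_target=3, rng=rng)
print(tgt_emb.shape)  # (3, 4)
```

Applied to NLLB, the source matrix would be NLLB's existing embedding table and the new rows would be appended for the added trained tokens; this gives them a semantically informed starting point instead of a random one.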
Status: 📋 Backlog
Activity
johnml1135 commented on Feb 26, 2025
Here is another paper on this question: https://arxiv.org/abs/2502.10852
To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much larger models.
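For concreteness, here is a minimal PyTorch sketch of one way "reusing the weights between the encoder and the decoder" could be realized: copy every encoder parameter whose name and shape also exist in the decoder, leaving cross-attention freshly initialized. The name-and-shape matching rule is an assumption for illustration, not the paper's exact recipe:

```python
# Sketch: initialize a decoder layer from a pretrained encoder layer's weights.
import torch.nn as nn

d_model, n_heads = 64, 4
enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)

enc_state = enc_layer.state_dict()
dec_state = dec_layer.state_dict()

# Copy every encoder parameter whose name and shape also exist in the decoder
# (self-attention and feed-forward blocks). Cross-attention (multihead_attn)
# has no encoder counterpart and keeps its fresh initialization.
copied = []
for name, tensor in enc_state.items():
    if name in dec_state and dec_state[name].shape == tensor.shape:
        dec_state[name] = tensor.clone()
        copied.append(name)

dec_layer.load_state_dict(dec_state)
print(f"copied {len(copied)} tensors, e.g. {copied[:3]}")
```

The intuition from the quoted abstract is that the decoder then starts inside the semantic space the multilingual encoder already learned, rather than from scratch, which should matter most in the extremely low-resource settings the paper targets.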