Investigate mapping token embeddings from source to target #481

Description

@mshannon-sil (Collaborator)

A recently published paper introduced a strategy called "trans-tokenization", which "focuses on adapting a high-resource monolingual LLM to an unseen target language by initializing the token embeddings of the target language using a weighted average of semantically similar token embeddings from the source language." We should investigate whether this approach could improve the performance of adding trained tokens to NLLB.
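The core of the trans-tokenization idea described above can be sketched in a few lines: each new target-language token embedding is a normalized weighted average of the embeddings of its semantically similar source-language tokens. This is a minimal illustration, not the paper's implementation; the similarity IDs and weights (which in practice would come from, e.g., a bilingual lexicon or aligned embedding spaces) are hypothetical here.

```python
import numpy as np

def init_target_embedding(source_embeddings, similar_ids, weights):
    # Normalize the similarity weights so they sum to 1, then take the
    # weighted average of the selected source-token embedding rows.
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    return (source_embeddings[similar_ids] * w[:, None]).sum(axis=0)

# Toy demonstration with a random 1000 x 16 "source" embedding matrix.
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 16))
vec = init_target_embedding(src, [3, 7, 42], [0.5, 0.3, 0.2])
```

For NLLB this would mean initializing each newly added trained token's row in the embedding matrix this way, instead of randomly, before continued training.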

Sub-issues

0 of 2 Issues completed

Activity

moved this from 🆕 New to 📋 Backlog in SIL-NLP Research on Oct 16, 2024
moved this from 📋 Backlog to 🏗 In progress in SIL-NLP Research on Nov 18, 2024
moved this from 🏗 In progress to 📋 Backlog in SIL-NLP Research on Dec 18, 2024
removed their assignment on Jan 8, 2025
johnml1135 (Collaborator) commented on Feb 26, 2025

Here is another paper on this question: https://arxiv.org/abs/2502.10852
To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much larger models.
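The weight-reuse idea from that abstract can be sketched as follows. This is a minimal illustration of initializing a decoder from a trained encoder's weights, assuming PyTorch; it is not the paper's actual XLM-SWCM implementation, and the layer sizes are arbitrary. Self-attention and feed-forward weights are copied from each encoder layer into the corresponding decoder layer, while the decoder's cross-attention (which has no encoder counterpart) keeps its fresh initialization.

```python
import torch
import torch.nn as nn

# Hypothetical encoder/decoder stacks with matching dimensions.
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

decoder_layer = nn.TransformerDecoderLayer(d_model=64, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)

# Copy each encoder layer's self-attention and feed-forward weights
# into the corresponding decoder layer; cross-attention stays random.
for enc, dec in zip(encoder.layers, decoder.layers):
    dec.self_attn.load_state_dict(enc.self_attn.state_dict())
    dec.linear1.load_state_dict(enc.linear1.state_dict())
    dec.linear2.load_state_dict(enc.linear2.state_dict())
```

The intent is that the decoder starts in (roughly) the same semantic space the encoder already learned, rather than from scratch, which is what makes generation feasible with very little target-language data.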

Metadata

Assignees: no one assigned
Labels: none
Type: none
Projects: Status: 📋 Backlog
Milestone: none
Relationships: none yet
Development: no branches or pull requests

Participants: @johnml1135, @TaperChipmunk32, @mshannon-sil

          Investigate mapping token embeddings from source to target · Issue #481 · sillsdev/silnlp