The scripts in this repository demonstrate how to build a pipeline that distills the translation knowledge of LLaMa3.2 3B into a LLaMa3.2 1B model initialized from a blank config, using LoRA and KL divergence inside a subclassed Trainer whose TrainingArguments are extended with alpha (distillation strength) and temperature (smoothing coefficient). The project builds on earlier work by Lewis Tunstall, described in the book *Natural Language Processing with Transformers*.
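
The sketch below illustrates the kind of Trainer subclass described above: `TrainingArguments` extended with `alpha` and `temperature`, and a `compute_loss` that blends the student's cross-entropy loss with a temperature-smoothed KL divergence against the teacher's logits. The class names and the exact loss weighting here are assumptions made for illustration, and the LoRA wrapping of the student (via `peft`'s `get_peft_model`) is omitted for brevity; the repository's own scripts are the reference implementation.

```python
import torch
import torch.nn.functional as F
from transformers import Trainer, TrainingArguments


class DistillationTrainingArguments(TrainingArguments):
    """TrainingArguments extended with the two distillation hyperparameters."""

    def __init__(self, *args, alpha: float = 0.5, temperature: float = 2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.alpha = alpha              # weight of the KL (distillation) term
        self.temperature = temperature  # smoothing applied to both logit distributions


class DistillationTrainer(Trainer):
    """Trainer that blends the student's CE loss with a KL term against the teacher."""

    def __init__(self, *args, teacher_model=None, **kwargs):
        super().__init__(*args, **kwargs)
        # The teacher is frozen and assumed to already sit on the training device.
        self.teacher_model = teacher_model.eval()

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Student forward pass; `inputs` contains labels, so .loss is the usual CE loss.
        outputs_student = model(**inputs)
        loss_ce = outputs_student.loss

        # Teacher forward pass without gradients.
        with torch.no_grad():
            outputs_teacher = self.teacher_model(**inputs)

        # KL divergence between temperature-smoothed token distributions.
        T = self.args.temperature
        loss_kd = F.kl_div(
            F.log_softmax(outputs_student.logits / T, dim=-1),
            F.softmax(outputs_teacher.logits / T, dim=-1),
            reduction="batchmean",
        ) * (T ** 2)

        # alpha controls the distillation strength.
        loss = self.args.alpha * loss_kd + (1.0 - self.args.alpha) * loss_ce
        return (loss, outputs_student) if return_outputs else loss
```

With alpha close to 0 the student trains almost entirely on cross-entropy, while values close to 1 make it learn mostly from the teacher's smoothed distribution.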
Since the distilled model starts from a blank config, its behavior can be unpredictable. In particular, once the translation is complete the model tends to keep generating indefinitely. To remedy this, we add a special token, `<|end_of_translation|>`, to the LLaMa3.2 tokenizer. This establishes a boundary so that only the initial translated text is kept: the generated string can simply be split on this token, as sketched below.
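
A minimal sketch of how this might look, assuming the tokenizer and config are pulled from the Hub checkpoint `meta-llama/Llama-3.2-1B` (the exact identifier and sample text are illustrative, not taken from the scripts):

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

END_TOKEN = "<|end_of_translation|>"

# The tokenizer comes from the pretrained family; the student's weights do not.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer.add_special_tokens({"additional_special_tokens": [END_TOKEN]})

# Student initialized from a blank 1B config, then resized to cover the new token.
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-1B")
student = AutoModelForCausalLM.from_config(config)
student.resize_token_embeddings(len(tokenizer))

# After generation, keep only the text before the first boundary token.
generated = "The dog is sleeping.<|end_of_translation|> and then it keeps going..."
translation = generated.split(END_TOKEN)[0].strip()
```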
Using the provided scripts, it is possible to reach a COMET score of 82, which is in the high-quality range. BLEU is slightly lower, largely due to stylistic differences between the model's output and the reference translations.
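
To check scores like these, the `evaluate` package listed below can load both metrics. The snippet is only a sketch: the COMET metric additionally requires the `unbabel-comet` package, and the sentences are placeholders rather than data from this project.

```python
import evaluate

# COMET needs source sentences as well as predictions and references.
comet = evaluate.load("comet")
bleu = evaluate.load("sacrebleu")

sources = ["Der Hund schläft auf dem Sofa."]          # placeholder source
predictions = ["The dog is sleeping on the couch."]   # placeholder model output
references = ["The dog sleeps on the sofa."]          # placeholder reference

comet_result = comet.compute(sources=sources, predictions=predictions, references=references)
bleu_result = bleu.compute(predictions=predictions, references=[[r] for r in references])

# COMET's mean_score is on a 0-1 scale; multiply by 100 to compare with the figure above.
print(100 * comet_result["mean_score"], bleu_result["score"])
```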
Any form of quantization will result in severe performance degradation and is not advised.
Be sure to install the following dependencies:
- transformers
- peft
- datasets
- evaluate
- tqdm
- torch
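
They can all be installed from PyPI, for example:

```
pip install transformers peft datasets evaluate tqdm torch
```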