Decoder-Only Translation Knowledge Distillation with LLaMa 3.2

The scripts in this repository demonstrate how to build a pipeline that distills the translation knowledge of LLaMa 3.2 3B into a blank (randomly initialized) LLaMa 3.2 1B config, using LoRA and a KL-divergence loss inside a subclassed Trainer whose TrainingArguments are extended with alpha (distillation strength) and temperature (smoothing coefficient). The project builds on earlier work by Lewis Tunstall described in the book Natural Language Processing with Transformers.
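The repository's own Trainer subclass may differ in detail, but the pattern popularized in that book looks roughly like the sketch below. The class names, default values, and exact loss weighting here are illustrative assumptions, not a copy of the repository's code.

  import torch
  import torch.nn.functional as F
  from transformers import Trainer, TrainingArguments

  class DistillationTrainingArguments(TrainingArguments):
      def __init__(self, *args, alpha=0.5, temperature=2.0, **kwargs):
          super().__init__(*args, **kwargs)
          self.alpha = alpha              # weight of the KL term vs. the student's own loss (assumed default)
          self.temperature = temperature  # softens the distributions before computing KL (assumed default)

  class DistillationTrainer(Trainer):
      def __init__(self, *args, teacher_model=None, **kwargs):
          super().__init__(*args, **kwargs)
          self.teacher_model = teacher_model

      def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
          # Student forward pass: standard causal-LM cross-entropy on the labels
          outputs_student = model(**inputs)
          loss_ce = outputs_student.loss

          # Teacher forward pass without gradients
          with torch.no_grad():
              outputs_teacher = self.teacher_model(**inputs)

          # Temperature-scaled KL divergence between teacher and student token distributions
          T = self.args.temperature
          loss_kd = F.kl_div(
              F.log_softmax(outputs_student.logits / T, dim=-1),
              F.softmax(outputs_teacher.logits / T, dim=-1),
              reduction="batchmean",
          ) * (T ** 2)

          # Blend the two losses with alpha
          loss = self.args.alpha * loss_kd + (1.0 - self.args.alpha) * loss_ce
          return (loss, outputs_student) if return_outputs else loss

Because the 1B student and the 3B teacher share the LLaMa 3.2 vocabulary, their logits have the same last dimension and the KL term can be computed token by token.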

Since the distilled model starts from a blank config, its behavior can be unpredictable. In particular, once a translation is complete the model tends to keep generating indefinitely. To remedy this, we add a special token, <|end_of_translation|>, to the LLaMa 3.2 tokenizer. This establishes a boundary so that only the initial translated text is taken into account: the generated string can simply be split on this token and the first segment kept.
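As a rough illustration of the idea (the checkpoint name and helper function below are placeholders, not necessarily what the scripts use):

  from transformers import AutoTokenizer, AutoModelForCausalLM

  EOT = "<|end_of_translation|>"

  tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
  model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

  # Register the boundary token and resize the embedding matrix to the new vocabulary size
  tokenizer.add_special_tokens({"additional_special_tokens": [EOT]})
  model.resize_token_embeddings(len(tokenizer))

  def extract_translation(generated_text: str) -> str:
      # Keep only the text before the first boundary token
      return generated_text.split(EOT)[0].strip()

Presumably each training target ends with the boundary token so the student learns to emit it; at inference time the output is then truncated at its first occurrence.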

Using the provided scripts, it is possible to achieve a COMET score of 82, which is in the high-quality range. BLEU is slightly lower due to stylistic differences between the model's output and the reference translations.
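For reference, scoring with the evaluate library typically looks like the sketch below; the sentences are made-up placeholders, and loading the COMET metric additionally requires the unbabel-comet package.

  import evaluate

  sources = ["Der Hund schläft."]         # source-language inputs (placeholder)
  predictions = ["The dog is sleeping."]  # distilled model outputs (placeholder)
  references = ["The dog is sleeping."]   # human reference translations (placeholder)

  # COMET scores predictions against both the source and the reference
  comet = evaluate.load("comet")
  comet_result = comet.compute(sources=sources, predictions=predictions, references=references)

  # BLEU compares predictions against one or more references per example
  bleu = evaluate.load("bleu")
  bleu_result = bleu.compute(predictions=predictions, references=[[r] for r in references])

  print(comet_result["mean_score"], bleu_result["bleu"])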

Any form of quantization will result in severe performance degradation and is not advised.

Be sure to install the following dependencies:

  • transformers
  • peft
  • datasets
  • evaluate
  • tqdm
  • torch
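
All of these are available from PyPI, so a single command along these lines is usually enough (no specific versions are pinned here):

  pip install transformers peft datasets evaluate tqdm torch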

About

Distilling knowledge into a smaller Llama model and a smaller BERT model
