
headless-lm: Better and Faster LM pretraining

This repository is a fork of NathanGodey/headless-lm containing training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https://arxiv.org/abs/2309.08351). See also the original repository's README.

Setting up on Slurm

Run setup-venv.sh to set up a virtual environment on Snellius. Note that requirements.txt has been adapted for this fork.

sbatch scripts/setup-venv.sh

The following instructions are from the original README.

Preprocess data

Adapt the config file in configs/preprocess_owt2.json to your specific case, and then run:

python preprocess.py --config=configs/your_config_file.json
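
Before launching a training run, it can be useful to sanity-check the preprocessed output. Below is a minimal sketch, assuming preprocess.py saves a Hugging Face datasets object to disk (consistent with the .hf paths passed to the training scripts below) and using a hypothetical output path data/owt2_preprocessed.hf:

# Sanity-check the preprocessed dataset (hypothetical path; adjust to your config).
from datasets import load_from_disk

ds = load_from_disk("data/owt2_preprocessed.hf")
print(ds)  # splits, columns and row counts, so you can confirm tokenization ran as expected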

Training

Encoder

To train an encoder model:

  1. Write/edit model-related parameters in a config file similar to configs/mlm_headless.json
  2. Run the following command with your specific arguments:
python mlm_headless.py \
    --config configs/your_config_file.json \
    --num_nodes your-gpu-node-count \
    --global_bs your-accumulated-batch_size \
    --gpu_bs your-per-device-batch-size \
    --dataset your-preprocessed-output.hf \
    --hf_tokenizer your-tokenizer \
    --hf_path path-to-your-model-arch-on-HF \
    --model_max_seq_len models-max-pos-embeddings \
    --run_name run-name-for-logging-and-ckpts \
    --saved_ckpt_path where-to-save-ckpts

Other arguments include --accelerator (hf, xformers, or flash_attention) and --ckpt_every to set the checkpoint frequency.

  3. Pick your checkpoint and publish it to HuggingFace:
python hf_publisher.py \
    --hf_name your_hf_id/your_model \
    --model_ckpt your_model.ckpt \
    --mode mlm
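
Once published, the headless encoder can be loaded like any other Hugging Face model to extract contextual representations (the contrastive objective trains the backbone rather than a prediction head). A minimal sketch, assuming the placeholder repository id your_hf_id/your_model from the command above:

# Load the published encoder and extract contextual embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your_hf_id/your_model")
model = AutoModel.from_pretrained("your_hf_id/your_model")

inputs = tokenizer("Headless language models tie input and output embeddings.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_size)
print(hidden_states.shape)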

Decoder

To train a decoder model:

  1. Write/edit model-related parameters in a config file similar to configs/gpt_headless_70m.json
  2. Run the following command with your specific arguments:
python gpt_headless.py \
    --config configs/your_config_file.json \
    --num_nodes your-gpu-node-count \
    --global_bs your-accumulated-batch_size \
    --gpu_bs your-per-device-batch-size \
    --dataset your-preprocessed-output.hf \
    --hf_tokenizer your-tokenizer \
    --hf_path path-to-your-model-arch-on-HF \
    --model_max_seq_len models-max-pos-embeddings \
    --run_name run-name-for-logging-and-ckpts \
    --saved_ckpt_path where-to-save-ckpts

Other arguments include --accelerator (hf, xformers, or flash_attention) and --ckpt_every to set the checkpoint frequency.

  3. (Optional) Pick your checkpoint and publish it to HuggingFace. You'll need to use the add_head option so that the model can output tokens:
python hf_publisher.py \
    --hf_name your_hf_id/your_model \
    --model_ckpt your_model.ckpt \
    --mode add_head
  4. The resulting model will probably perform poorly at language generation. Why? Because it was not trained to do it! To turn your contrastive model into a good LM, you'll need to add a head and fine-tune it. Set up a config file in the style of configs/gpt_vanilla_ft.json and run:
python ft_gpt_headless.py \
    --ckpt_path your_headless_model.ckpt \
    --config configs/your_ft_config.json \
    --num_nodes your-gpu-nodes \
    --global_bs your-accumulated-bs \
    --gpu_bs your-device-bs \
    --dataset your-preprocessed-output.hf \
    --run_name run-name-for-logging-and-ckpts \
    --saved_ckpt_path where-to-save-finetuned-ckpts
  5. Pick your fine-tuned checkpoint and publish it to HuggingFace. You no longer need the add_head option, since you have just trained a head:
python hf_publisher.py \
    --hf_name your_hf_id/your_model \
    --model_ckpt your_model.ckpt \
    --mode lm
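
After this step, the published model behaves like a standard causal LM on the Hub. A minimal usage sketch, again using the placeholder repository id your_hf_id/your_model:

# Generate text with the fine-tuned decoder.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your_hf_id/your_model")
model = AutoModelForCausalLM.from_pretrained("your_hf_id/your_model")

inputs = tokenizer("Contrastive weight tying lets language models", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))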

Evaluation

You can now use any zero-shot or fine-tuning code to evaluate your models. We provide our GLUE fine-tuning script in glue_finetuning.py, and we used the LM Eval Harness for zero-shot evaluation.
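
For zero-shot evaluation, here is a minimal sketch using the LM Eval Harness Python API (assuming lm-eval v0.4.x; task names and the exact API may differ between versions), again with the placeholder repository id your_hf_id/your_model:

# Zero-shot evaluation of a published model with the LM Evaluation Harness (v0.4.x API assumed).
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=your_hf_id/your_model",
    tasks=["lambada_openai", "hellaswag"],
    batch_size=8,
)
print(results["results"])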

Citation

This repo contains the code that was used for the experiments of the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying".

@misc{godey2023headless,
      title={Headless Language Models: Learning without Predicting with Contrastive Weight Tying}, 
      author={Nathan Godey and Éric de la Clergerie and Benoît Sagot},
      year={2023},
      eprint={2309.08351},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
