
headless-lm: Better and Faster LM pretraining

This repository is a fork of NathanGodey/headless-lm containing training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying" (https://arxiv.org/abs/2309.08351). See also the original repository's README.

Setting up on Slurm

Run setup-venv.sh to set up a virtual environment on Snellius. Note that requirements.txt has been adapted for this fork.

sbatch scripts/setup-venv.sh

The following instructions are from the original README.

Preprocess data

Adapt the config file in configs/preprocess_owt2.json to your specific case, and then run:

python preprocess.py --config=configs/your_config_file.json
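
Before launching a training run, it can be useful to sanity-check the preprocessed output. Below is a minimal sketch, assuming preprocess.py saves a Hugging Face datasets object to disk (consistent with the .hf paths passed to the training scripts below) and using a hypothetical output path data/owt2_preprocessed.hf:

# Sanity-check the preprocessed dataset (hypothetical path; adjust to your config).
from datasets import load_from_disk

ds = load_from_disk("data/owt2_preprocessed.hf")
print(ds)  # splits, columns and row counts, so you can confirm tokenization ran as expected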

Training

Encoder

To train an encoder model:

  1. Write/edit model-related parameters in a config file similar to configs/mlm_headless.json
  2. Run the following command with your specific arguments:
python mlm_headless.py \
    --config configs/your_config_file.json \
    --num_nodes your-gpu-node-count \
    --global_bs your-accumulated-batch_size \
    --gpu_bs your-per-device-batch-size \
    --dataset your-preprocessed-output.hf \
    --hf_tokenizer your-tokenizer \
    --hf_path path-to-your-model-arch-on-HF \
    --model_max_seq_len models-max-pos-embeddings \
    --run_name run-name-for-logging-and-ckpts \
    --saved_ckpt_path where-to-save-ckpts

Other arguments include --accelerator (hf, xformers, or flash_attention) and --ckpt_every to set the checkpoint frequency.

  3. Pick your checkpoint and publish it to HuggingFace:
python hf_publisher.py \
    --hf_name your_hf_id/your_model \
    --model_ckpt your_model.ckpt \
    --mode mlm
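
Once published, the headless encoder can be loaded like any other Hugging Face model to extract contextual representations (the contrastive objective trains the backbone rather than a prediction head). A minimal sketch, assuming the placeholder repository id your_hf_id/your_model from the command above:

# Load the published encoder and extract contextual embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your_hf_id/your_model")
model = AutoModel.from_pretrained("your_hf_id/your_model")

inputs = tokenizer("Headless language models tie input and output embeddings.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_size)
print(hidden_states.shape)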

Decoder

To train a decoder model:

  1. Write/edit model-related parameters in a config file similar to configs/gpt_headless_70m.json
  2. Run the following command with your specific arguments:
python gpt_headless.py \
    --config configs/your_config_file.json \
    --num_nodes your-gpu-node-count \
    --global_bs your-accumulated-batch_size \
    --gpu_bs your-per-device-batch-size \
    --dataset your-preprocessed-output.hf \
    --hf_tokenizer your-tokenizer \
    --hf_path path-to-your-model-arch-on-HF \
    --model_max_seq_len models-max-pos-embeddings \
    --run_name run-name-for-logging-and-ckpts \
    --saved_ckpt_path where-to-save-ckpts

Other arguments include --accelerator (hf, xformers, or flash_attention) and --ckpt_every to set the checkpoint frequency.

  3. (Optional) Pick your checkpoint and publish it to HuggingFace. You'll need to use the add_head option so that the model can output tokens:
python hf_publisher.py \
    --hf_name your_hf_id/your_model \
    --model_ckpt your_model.ckpt \
    --mode add_head
  4. The resulting model will probably perform poorly at language generation. Why? Because it was not trained to do it! To turn your contrastive model into a good LM, you'll need to add a head and fine-tune it. Set up a config file in the style of configs/gpt_vanilla_ft.json and run:
python ft_gpt_headless.py \
    --ckpt_path your_headless_model.ckpt \
    --config configs/your_ft_config.json \
    --num_nodes your-gpu-nodes \
    --global_bs your-accumulated-bs \
    --gpu_bs your-device-bs \
    --dataset your-preprocessed-output.hf \
    --run_name run-name-for-logging-and-ckpts \
    --saved_ckpt_path where-to-save-finetuned-ckpts
  5. Pick your fine-tuned checkpoint and publish it to HuggingFace. You no longer need the add_head option, since you have just trained a head:
python hf_publisher.py \
    --hf_name your_hf_id/your_model \
    --model_ckpt your_model.ckpt \
    --mode lm
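
After this step, the published model behaves like a standard causal LM on the Hub. A minimal usage sketch, again using the placeholder repository id your_hf_id/your_model:

# Generate text with the fine-tuned decoder.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your_hf_id/your_model")
model = AutoModelForCausalLM.from_pretrained("your_hf_id/your_model")

inputs = tokenizer("Contrastive weight tying lets language models", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))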

Evaluation

You can now use any zero-shot or fine-tuning code to evaluate your models. We provide our GLUE fine-tuning script in glue_finetuning.py, and we used the LM Eval Harness for zero-shot evaluation.
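
For zero-shot evaluation, here is a minimal sketch using the LM Eval Harness Python API (assuming lm-eval v0.4.x; task names and the exact API may differ between versions), again with the placeholder repository id your_hf_id/your_model:

# Zero-shot evaluation of a published model with the LM Evaluation Harness (v0.4.x API assumed).
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=your_hf_id/your_model",
    tasks=["lambada_openai", "hellaswag"],
    batch_size=8,
)
print(results["results"])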

Citation

This repo contains the code that was used for the experiments of the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying".

@misc{godey2023headless,
      title={Headless Language Models: Learning without Predicting with Contrastive Weight Tying}, 
      author={Nathan Godey and Éric de la Clergerie and Benoît Sagot},
      year={2023},
      eprint={2309.08351},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
