This repository is a fork of `NathanGodey/headless-lm` containing the training and evaluation code for the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying". See also the original README.
Run `setup-venv.sh` to install a virtual environment on Snellius. Note that `requirements.txt` has been adapted for this fork.

```bash
sbatch scripts/setup-venv.sh
```

The following instructions are from the original README.
Adapt the config file in `configs/preprocess_owt2.json` to your specific case, and then run:

```bash
python preprocess.py --config=configs/your_config_file.json
```
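For instance, a minimal invocation that uses the provided OpenWebText2 config as-is (assuming you have already edited the paths inside it to match your setup):

```bash
# Preprocess OpenWebText2 with the example config shipped in configs/
python preprocess.py --config=configs/preprocess_owt2.json
```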
To train an encoder model:
- Write/edit model-related parameters in a config file similar to `configs/mlm_headless.json`
- Run the following command with your specific arguments:
```bash
python mlm_headless.py \
    --config configs/your_config_file.json \
    --num_nodes your-gpu-node-count \
    --global_bs your-accumulated-batch-size \
    --gpu_bs your-per-device-batch-size \
    --dataset your-preprocessed-output.hf \
    --hf_tokenizer your-tokenizer \
    --hf_path path-to-your-model-arch-on-HF \
    --model_max_seq_len models-max-pos-embeddings \
    --run_name run-name-for-logging-and-ckpts \
    --saved_ckpt_path where-to-save-ckpts
```

Other arguments include `--accelerator` (`hf`, `xformers`, or `flash_attention`) and `--ckpt_every` to set the checkpoint frequency, among others.
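As an illustration, here is a hypothetical single-node run training a RoBERTa-base-sized headless encoder; every concrete value below (dataset path, checkpoint directory, batch sizes, run name) is an assumption to adapt to your setup:

```bash
# Illustrative values only; adjust to your cluster, data and model
python mlm_headless.py \
    --config configs/mlm_headless.json \
    --num_nodes 1 \
    --global_bs 256 \
    --gpu_bs 32 \
    --dataset data/owt2_preprocessed.hf \
    --hf_tokenizer roberta-base \
    --hf_path roberta-base \
    --model_max_seq_len 512 \
    --run_name mlm-headless-base \
    --saved_ckpt_path checkpoints/mlm-headless-base
```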
- Pick your checkpoint and publish it to HuggingFace:
```bash
python hf_publisher.py \
    --hf_name your_hf_id/your_model \
    --model_ckpt your_model.ckpt \
    --mode mlm
```
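For example, assuming the run above saved Lightning-style checkpoints, a hypothetical publishing call could look like this (the repository id and checkpoint filename are placeholders):

```bash
# Push the encoder checkpoint to the Hugging Face Hub as an MLM-style model
python hf_publisher.py \
    --hf_name your_hf_id/headless-mlm-base \
    --model_ckpt checkpoints/mlm-headless-base/last.ckpt \
    --mode mlm
```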
To train a decoder model:
- Write/edit model-related parameters in a config file similar to `configs/gpt_headless_70m.json`
- Run the following command with your specific arguments:
```bash
python gpt_headless.py \
    --config configs/your_config_file.json \
    --num_nodes your-gpu-node-count \
    --global_bs your-accumulated-batch-size \
    --gpu_bs your-per-device-batch-size \
    --dataset your-preprocessed-output.hf \
    --hf_tokenizer your-tokenizer \
    --hf_path path-to-your-model-arch-on-HF \
    --model_max_seq_len models-max-pos-embeddings \
    --run_name run-name-for-logging-and-ckpts \
    --saved_ckpt_path where-to-save-ckpts
```

Other arguments include `--accelerator` (`hf`, `xformers`, or `flash_attention`) and `--ckpt_every` to set the checkpoint frequency, among others.
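Again, a purely illustrative sketch, this time for a small GPT-2-style headless decoder using the 70M config; all concrete values are assumptions:

```bash
# Illustrative values only; adjust to your cluster, data and model
python gpt_headless.py \
    --config configs/gpt_headless_70m.json \
    --num_nodes 1 \
    --global_bs 512 \
    --gpu_bs 64 \
    --dataset data/owt2_preprocessed.hf \
    --hf_tokenizer gpt2 \
    --hf_path gpt2 \
    --model_max_seq_len 1024 \
    --run_name gpt-headless-70m \
    --saved_ckpt_path checkpoints/gpt-headless-70m
```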
- (optional) Pick your checkpoint and publish it to HuggingFace. You'll need to use the `add_head` option to make it able to output tokens:
```bash
python hf_publisher.py \
    --hf_name your_hf_id/your_model \
    --model_ckpt your_model.ckpt \
    --mode add_head
```

- The resulting model will probably perform poorly at language generation. Why? Because it was not trained to do it! To turn your contrastive model into a good LM, you'll need to add a head and fine-tune it. Set up a config file in the style of `configs/gpt_vanilla_ft.json` and run:
```bash
python ft_gpt_headless.py \
    --ckpt_path your_headless_model.ckpt \
    --config configs/your_ft_config.json \
    --num_nodes your-gpu-nodes \
    --global_bs your-accumulated-bs \
    --gpu_bs your-device-bs \
    --dataset your-preprocessed-output.hf \
    --run_name run-name-for-logging-and-ckpts \
    --saved_ckpt_path where-to-save-finetuned-ckpts
```
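For instance, a hypothetical fine-tuning run starting from the decoder checkpoint produced in the sketch above (all paths, batch sizes and names are illustrative):

```bash
# Illustrative values only; reuse the preprocessed dataset from pretraining
python ft_gpt_headless.py \
    --ckpt_path checkpoints/gpt-headless-70m/last.ckpt \
    --config configs/gpt_vanilla_ft.json \
    --num_nodes 1 \
    --global_bs 512 \
    --gpu_bs 64 \
    --dataset data/owt2_preprocessed.hf \
    --run_name gpt-headless-70m-ft \
    --saved_ckpt_path checkpoints/gpt-headless-70m-ft
```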
- Pick your fine-tuned checkpoint and publish it to HuggingFace. You don't need to use the `add_head` option anymore, as you just trained one:
```bash
python hf_publisher.py \
    --hf_name your_hf_id/your_model \
    --model_ckpt your_model.ckpt \
    --mode lm
```

You can now use any zero-shot or fine-tuning code to evaluate your models. We provide our GLUE fine-tuning script in `glue_finetuning.py`, and we used the LM Eval Harness for zero-shot evaluation.
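For zero-shot evaluation, a hypothetical LM Eval Harness invocation could look like the following; the exact CLI and flags depend on the harness version you install, and the model id and task list are placeholders:

```bash
# Evaluate the published model with lm-evaluation-harness
# (recent versions expose the `lm_eval` entry point)
lm_eval \
    --model hf \
    --model_args pretrained=your_hf_id/your_model \
    --tasks lambada_openai,hellaswag \
    --batch_size 16
```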
This repo contains the code that was used for the experiments of the paper "Headless Language Models: Learning without Predicting with Contrastive Weight Tying":

```bibtex
@misc{godey2023headless,
      title={Headless Language Models: Learning without Predicting with Contrastive Weight Tying},
      author={Nathan Godey and Éric de la Clergerie and Benoît Sagot},
      year={2023},
      eprint={2309.08351},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```