Implementation of a 56M-parameter GPT-1 from scratch. Since the BookCorpus dataset (~1B tokens) is no longer publicly available, I use the WikiText-103 dataset (103M tokens) to pre-train GPT-1 instead.
- Model dimension (`d_model`) = 512
- Number of attention heads (`n_heads`) = 8
- Number of decoder layers (`num_decoder_layers`) = 8
- Maximum sequence length (`max_len`) = 128
- Feed-forward hidden size (`dim_feedforward`) = 2048
- Vocabulary size (`vocab_size`) = 30000 for the WikiText-103 dataset
- Batch size (`batch_size`) = 64
- Total parameter count ≈ 56M
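As a rough sanity check on the 56M figure, the configuration above can be tallied by hand. The snippet below assumes learned positional embeddings, biases on every linear layer, two LayerNorms per block, and an untied output projection; these are assumptions for the estimate, not details taken from the code.

```python
# Back-of-the-envelope parameter count for the configuration above.
# Assumptions (not taken from the repo): learned positional embeddings,
# biases on all linear layers, two LayerNorms per block, untied output head.
d_model, n_heads, n_layers = 512, 8, 8
d_ff, vocab_size, max_len = 2048, 30000, 128

tok_emb = vocab_size * d_model                       # token embedding table
pos_emb = max_len * d_model                          # learned positional embeddings
attn    = 4 * (d_model * d_model + d_model)          # Q, K, V, and output projections
ffn     = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
norms   = 2 * 2 * d_model                            # gain + bias for two LayerNorms
block   = attn + ffn + norms
lm_head = vocab_size * d_model                       # untied output projection

total = tok_emb + pos_emb + n_layers * block + lm_head
print(f"~{total / 1e6:.1f}M parameters")             # ~56.0M
```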
- Run `input_processing.py` to generate tokenized WikiText data and save it in `.pt` torch tensor format (sketch below)
- Run `main_pretrain.py` to pre-train GPT-1. Training settings can be changed in this file and model parameters in `GPT_Decoder.py` (sketch below)
- Run the last cell in `test.ipynb` to generate random text from the pre-trained GPT-1 (sketch below)
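The actual preprocessing lives in `input_processing.py`; as an illustration only, here is a minimal sketch of that kind of step, assuming the Hugging Face `datasets` and `tokenizers` packages and a 30k byte-level BPE vocabulary (the repo's tokenizer choice may differ):

```python
# Hypothetical sketch of the preprocessing step (not the repo's actual script):
# tokenize WikiText-103 with a 30k byte-level BPE vocabulary and save the
# token ids of each split as a single torch tensor in .pt format.
import torch
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

raw = load_dataset("wikitext", "wikitext-103-raw-v1")

# Train a 30k-token BPE vocabulary on the training split.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    (line for line in raw["train"]["text"] if line.strip()),
    vocab_size=30000,
)

# Concatenate all token ids into one long tensor per split and save it.
for split in ("train", "validation", "test"):
    ids = []
    for line in raw[split]["text"]:
        if line.strip():
            ids.extend(tokenizer.encode(line).ids)
    torch.save(torch.tensor(ids, dtype=torch.long), f"wikitext103_{split}.pt")
```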
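Similarly, the pre-training step boils down to next-token prediction with cross-entropy over random 128-token windows at batch size 64. The `TinyGPT` stand-in below is built from stock `torch.nn` modules so the snippet is self-contained; the real architecture lives in `GPT_Decoder.py`, and the learning rate here is just a typical value, not the repo's setting.

```python
# Hypothetical sketch of the pre-training loop (not the repo's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_heads, n_layers = 512, 8, 8
d_ff, vocab_size, max_len, batch_size = 2048, 30000, 128, 64
device = "cuda" if torch.cuda.is_available() else "cpu"

class TinyGPT(nn.Module):
    """Stand-in decoder-only Transformer (a causally masked encoder stack)."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ff, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        h = self.tok_emb(idx) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(idx.size(1)).to(idx.device)
        h = self.blocks(h, mask=mask)                # causal self-attention
        return self.lm_head(h)                       # (batch, seq, vocab)

data = torch.load("wikitext103_train.pt")            # 1-D tensor of token ids
model = TinyGPT().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-4)

def get_batch():
    # Random 128-token windows; targets are the inputs shifted by one token.
    starts = torch.randint(0, data.numel() - max_len - 1, (batch_size,))
    x = torch.stack([data[s:s + max_len] for s in starts]).to(device)
    y = torch.stack([data[s + 1:s + max_len + 1] for s in starts]).to(device)
    return x, y

model.train()
for step in range(10_000):
    x, y = get_batch()
    logits = model(x)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step}: loss {loss.item():.3f}")
```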
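Finally, the generation cell in `test.ipynb` amounts to an autoregressive sampling loop along these lines (temperature sampling and the reuse of the hypothetical `model`, `device`, and `tokenizer` names from the sketches above are assumptions; the notebook may differ):

```python
# Hypothetical sketch of sampling from a pre-trained model.
# `model` is assumed to map (batch, seq) token ids to (batch, seq, vocab) logits.
import torch

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=100, temperature=1.0, max_len=128):
    model.eval()
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        context = ids[:, -max_len:]                   # crop to the context window
        logits = model(context)[:, -1, :]             # logits for the next token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids

# Example usage (token id 0 as the starting token is arbitrary):
# out = generate(model, torch.zeros(1, 1, dtype=torch.long, device=device))
# print(tokenizer.decode(out[0].tolist()))
```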
Sample generation output:

> The earliest known mention of this date was that of 544 , when King Olaf II of Norway was discovered in the reign of King Olaf II of Norway . The earliest recorded mention of this date was from 544 , when King Olaf was assassinated . The date of the birth is unknown , but it is unclear whether Olaf was killed . Olaf 's birth date is unknown , but it is likely that Olaf was killed by the Vikings in 842 , but Olaf 's reign is uncertain . .
This can be improved by increasing the model size, but it is reasonable output for a 56M-parameter model.