[QUESTION] Checkpoint storage format #1092
Unanswered · syx11237744 asked this question in Q&A · 0 replies
Your question
Could you let me know which version I should revert to if I want the previous checkpoint storage format, which stored checkpoints as .pt files? Or is there another way to save them as .pt files? Thank you!

My launch script is below:
DATASET_PATH=/share/root/out_file/sum.jsonl
SAVE_PATH=/share/sunyuanxu/out_file/sum
VOCAB_FILE=gpt2_/vocab.json
MERGE_FILE=gpt2_/merges.txt
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=1
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CHECKPOINT_PATH=/share/root/checkpoint/cp
DATA_PATH=/share/root/out_file/sum_text_document
DISTRIBUTED_ARGS="
--nproc_per_node $GPUS_PER_NODE
--nnodes $NNODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
"
GPT_ARGS="
--num-layers 24
--hidden-size 1024
--num-attention-heads 16
--seq-length 1024
--max-position-embeddings 1024
--micro-batch-size 32
--global-batch-size 256
--lr 0.00015
--train-iters 1000
--lr-decay-iters 320000
--lr-decay-style cosine
--min-lr 1.0e-5
--weight-decay 1e-2
--lr-warmup-fraction .01
--clip-grad 1.0
--fp16
--attention-softmax-in-fp32
"
DATA_ARGS="
--data-path $DATA_PATH
--vocab-file $VOCAB_FILE
--merge-file $MERGE_FILE
--split 949,50,1
"
OUTPUT_ARGS="
--log-interval 100
--save-interval 10000
--eval-interval 1000
--eval-iters 10
"
torchrun $DISTRIBUTED_ARGS Megatron-LM/pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH
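
If reverting turns out to be unnecessary: newer Megatron-LM checkouts select the checkpoint format with a command-line argument, so a possible first step is the sketch below rather than a downgrade. It assumes the checkout's argument parser accepts --ckpt-format (with torch selecting the legacy .pt layout); older checkouts instead exposed --use-dist-ckpt and wrote the legacy format when that flag was omitted. Check the arguments file in your checkout (e.g. megatron/training/arguments.py) to confirm which name, if either, is available.

# Sketch only: request the old .pt checkpoint layout via a flag instead of
# downgrading. --ckpt-format torch is an assumption about this checkout;
# on older checkouts, simply not passing --use-dist-ckpt had the same effect.
torchrun $DISTRIBUTED_ARGS Megatron-LM/pretrain_gpt.py \
    $GPT_ARGS \
    $DATA_ARGS \
    $OUTPUT_ARGS \
    --distributed-backend nccl \
    --ckpt-format torch \
    --save $CHECKPOINT_PATH \
    --load $CHECKPOINT_PATH

# To confirm what a run actually wrote: the legacy format produces
# iter_*/mp_rank_00/model_optim_rng.pt, while the newer distributed format
# typically produces *.distcp shards plus metadata files.
ls -R "$CHECKPOINT_PATH"/iter_* | head -n 20

If the checkout predates both flags, the format choice lives in Megatron's checkpointing code rather than in this script, and checking out an older tag from before distributed checkpointing became the default is the remaining option.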