Hello. I am currently training GPT-2 on a single node with 4 GPUs, using the following command:
nsys profile -t cuda,nvtx,cudnn,cublas -c nvtx -o nsys_GPT2_trace_complete --force-overwrite true torchrun --nnodes=1 --nproc-per-node=4 train.py config/train_gpt2.py
However, after training finishes, no .nsys-rep file is generated, even though the nsys output prints a "generated:" line. Does anyone know why this might be happening?
I have added NVTX ranges in train.py (a sketch of how I do this is below). My goal is to profile the computation and communication operations during training, either on a single GPU or across all GPUs.
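For context, this is roughly how the ranges are added. It is a minimal sketch, not my actual train.py: the toy linear model stands in for GPT-2 and the range names are illustrative, but the range_push/range_pop placement matches what I do around the forward, backward, and optimizer steps.

import torch
import torch.nn as nn

# Toy stand-in for the real model; only the NVTX calls matter here.
model = nn.Linear(128, 128).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(10):
    x = torch.randn(32, 128, device="cuda")

    torch.cuda.nvtx.range_push(f"step_{step}")

    torch.cuda.nvtx.range_push("forward")
    loss = model(x).square().mean()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer_step")
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_pop()  # closes the per-step range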
Thanks in advance