Hello. I am currently training GPT-2 on a single node with 4 GPUs, using the following command:
nsys profile -t cuda,nvtx,cudnn,cublas -c nvtx -o nsys_GPT2_trace_complete --force-overwrite true torchrun --nnodes=1 --nproc-per-node=4 train.py config/train_gpt2.py
However, after training finishes, no .nsys-rep file is generated, even though the nsys output prints a "generated:" line. Does anyone know why this might be happening?
I have added NVTX ranges in train.py (a sketch of how I do this is below). My goal is to profile the computation and communication operations during training, either on a single GPU or across all GPUs.
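For context, this is roughly how the ranges are added. It is a minimal sketch, not my actual train.py: the toy linear model stands in for GPT-2 and the range names are illustrative, but the range_push/range_pop placement matches what I do around the forward, backward, and optimizer steps.

import torch
import torch.nn as nn

# Toy stand-in for the real model; only the NVTX calls matter here.
model = nn.Linear(128, 128).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(10):
    x = torch.randn(32, 128, device="cuda")

    torch.cuda.nvtx.range_push(f"step_{step}")

    torch.cuda.nvtx.range_push("forward")
    loss = model(x).square().mean()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("backward")
    loss.backward()
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_push("optimizer_step")
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    torch.cuda.nvtx.range_pop()

    torch.cuda.nvtx.range_pop()  # closes the per-step range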
Thanks in advance