Replies: 3 comments
Could you try running a simple PyTorch Distributed test program to help determine whether the problem is in your infrastructure and job startup, or in Megatron itself?

```python
import os
import torch
import torch.distributed as dist

print('MASTER_PORT = ', os.getenv('MASTER_PORT'))
print('About to initialize PyTorch Distributed...', flush=True)
dist.init_process_group(backend='nccl')
print('Entering barrier...', flush=True)
dist.barrier()
print('Exited barrier.', flush=True)
```
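For a quick smoke test on a single machine without GPUs, the same check can be sketched as a standalone script using the `gloo` backend; the environment-variable defaults below are assumptions for local use (under a launcher such as `torchrun`, these variables are set for you):

```python
import os
import torch
import torch.distributed as dist

# Default to a single local process so the script runs standalone;
# a launcher like torchrun would set these variables itself.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '29500')
os.environ.setdefault('RANK', '0')
os.environ.setdefault('WORLD_SIZE', '1')

print('MASTER_PORT = ', os.getenv('MASTER_PORT'))
print('About to initialize PyTorch Distributed...', flush=True)
dist.init_process_group(backend='gloo')  # gloo needs no GPUs

print('Entering barrier...', flush=True)
dist.barrier()

# A trivial collective confirms communication works end to end.
t = torch.ones(1)
dist.all_reduce(t)
print('all_reduce result:', t.item(), flush=True)

dist.destroy_process_group()
```

If this hangs or fails across nodes, the issue is likely the cluster networking or rendezvous configuration rather than Megatron.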
No idea.
Marking as stale. No activity in 60 days.
My question

I am trying to run Megatron multi-node in Docker.
The Docker container was created with the following command:

The pretrain.sh script was also set up like this:

However, when I run the shell script, the following error occurred: