Replies: 1 comment
Marking as stale. No activity in 60 days.
Hi,
I am using a single-node multi-GPU machine (6 × A100) and would like to use Megatron-LM to train a Llama 2 model on it.
In this environment I cannot use Docker, and the CUDA version is fixed at 11.6.
My question is whether it is possible to install Megatron-LM in this environment.
After installing the required packages (including apex), I did
and ran a shell script that essentially just calls pretrain_gpt.py.
It fails with an error message saying "no module named transformer_engine".
Then I found https://github.com/NVIDIA/TransformerEngine, but it looks like it requires CUDA >= 11.8.
I would appreciate any advice on how to sort this out.
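For reference, the "no module named transformer_engine" failure can be checked outside the training script. The following is only a minimal sketch using the Python standard library; the helper name `has_module` is hypothetical and is not part of Megatron-LM or Transformer Engine.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found.

    Hypothetical helper for illustration: find_spec locates the module
    without importing it, so a broken install does not raise here.
    """
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Module names taken from the post above; the result depends on the environment.
    for module_name in ("apex", "transformer_engine"):
        status = "found" if has_module(module_name) else "missing"
        print(f"{module_name}: {status}")
```

Running this in the same Python environment as the shell script should show whether transformer_engine is simply not installed, as opposed to being installed but failing at import time.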