tonykew commented Aug 26, 2025

Use numpy 1.x
Add libaio
x86_64 and ARM64 build working
Ran Inference tests on x86_64 and ARM64
Ran Training on 8 GPU DGX (x86_64) node:
  Completed epoch 0 training
  Restarted training from epoch 0 checkpoint AOK
Slurm examples for x86_64 and ARM64

NOTE: The following files have "raw" github URLs that will have to be fixed for production:

BUILD-ARM64.md
README.md

Tony

tonykew added 22 commits August 26, 2025 16:49
Use numpy 1.x
Add libaio
x86_64 and ARM64 build working
Ran Inference tests on x86_64 and ARM64
Ran Training on 8 GPU DGX (x86_64) node:
  Completed epoch 0 training
  Restarted training from epoch 0 checkpoint AOK
Slurm examples for x86_64 and ARM64

NOTE: The following files have "raw" github URLs that will have to be fixed
for production:

BUILD-ARM64.md
README.md

Tony
Prevent attempting to run X86 binaries on ARM64

Tony
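
The guard itself is not shown in the commit message above. A minimal Python sketch of that kind of check, assuming the host architecture is read from platform.machine(); the names ARCH_ALIASES, normalize_arch and check_binary_arch are hypothetical and are not taken from this PR:

```python
import platform
from typing import Optional

# Hypothetical mapping from machine names to the two build targets named in this PR.
ARCH_ALIASES = {
    "x86_64": "x86_64",
    "amd64": "x86_64",
    "aarch64": "ARM64",
    "arm64": "ARM64",
}


def normalize_arch(machine: str) -> str:
    """Map a platform.machine()-style string to 'x86_64' or 'ARM64'."""
    try:
        return ARCH_ALIASES[machine.lower()]
    except KeyError:
        raise ValueError(f"unsupported architecture: {machine!r}")


def check_binary_arch(binary_arch: str, host_machine: Optional[str] = None) -> None:
    """Raise RuntimeError if a binary built for binary_arch would run on a different host."""
    host = normalize_arch(host_machine or platform.machine())
    target = normalize_arch(binary_arch)
    if host != target:
        raise RuntimeError(f"refusing to run {target} binary on {host} host")
```

Calling check_binary_arch("x86_64") before launching an x86_64-only tool fails early on an ARM64 node instead of failing later with an exec format error.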
Avoids a "df" error when running OpenFold

Tony
A GPU has to be requested with "salloc", otherwise the NVIDIA pieces ("nvcc",
CUDA, etc.) don't get downloaded and installed.
Even though "--exclusive" is used, "nvidia-smi -L" sees no GPU

Tony
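
A pre-build check can catch the missing-GPU case described above. This is a sketch only, not part of the PR: count_gpus and visible_gpus are hypothetical helper names, and the parser assumes "nvidia-smi -L" prints one line per device beginning with "GPU ".

```python
import shutil
import subprocess


def count_gpus(listing: str) -> int:
    """Count devices in `nvidia-smi -L` output (one 'GPU <n>: ...' line per device)."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def visible_gpus() -> int:
    """Return how many GPUs nvidia-smi reports, or 0 if it is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return 0
    return count_gpus(result.stdout)
```

Running visible_gpus() inside the allocation before starting the build, and stopping if it returns 0, avoids a long build that silently skips the CUDA components.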
Build takes about 4 hours

Tony
Note: ARM64 build still broken

Tony
Occasionally there will be the following error on container startup:

  /usr/bin/rm: cannot remove '/usr/local/cuda/compat/lib': Read-only file system
  rm: cannot remove '/usr/local/cuda/compat/lib': Read-only file system

The NVIDIA containers add "--writable-tmpfs" to their containers, but I
can't find a way to do this in the .def file, so I added it to the container
startup.

Tony
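
One way to add the flag at startup can be sketched as follows. The assumptions are labeled: the runtime is taken to be Apptainer (the commit only mentions a .def file), "--nv" is included because the workload uses GPUs, and build_startup_command and the image name are hypothetical.

```python
from typing import List, Sequence


def build_startup_command(
    image: str, args: Sequence[str] = (), runtime: str = "apptainer"
) -> List[str]:
    """Build a container run command that always passes --writable-tmpfs.

    --writable-tmpfs gives the read-only image a temporary writable overlay,
    so startup scripts that remove or rewrite files do not hit
    'Read-only file system' errors.
    """
    cmd = [runtime, "run", "--nv", "--writable-tmpfs", image]
    cmd.extend(args)
    return cmd
```

For example, build_startup_command("openfold.sif") produces a list that can be passed to subprocess.run; "openfold.sif" is a placeholder image name, not one defined by this PR.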
Unfortunately there are slews of deprecation and future warnings that
cannot be easily suppressed - this needs code changes

Tony