Fix nvidia mode setup logic #438
Open


Related to #425
I have recently been testing udocker on Google Colab and found an issue caused by the copy logic below:
udocker/udocker/engine/nvidia.py
Lines 66 to 80 in 638bc42
When copying nvidia-related libraries and executables, if the source file `srcname` is a symbolic link, the program retrieves the target `linkto` of `srcname` and then directly creates a symbolic link from `dstname` to `linkto`.

On the Google Colab platform, Nvidia's library files are stored under `/usr/lib64-nvidia`. Listing this directory with `ls` shows that library names such as `libOpenCL.so.1.0` are symbolic links to files in the same directory.

Assuming `srcname='/usr/lib64-nvidia/libOpenCL.so.1.0'` and the corresponding `dstname='xxx/.udocker/containers/deb/ROOT//usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0'`, then:

- `linkto='libOpenCL.so.1.0.0'`
- `os.symlink` will create a symbolic link like this: `xxx/.udocker/containers/deb/ROOT//usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0 -> libOpenCL.so.1.0.0`
- `libOpenCL.so.1.0.0` is a bare filename, which resolves to `xxx/.udocker/containers/deb/ROOT/usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0` when accessed.
- `libOpenCL.so.1.0.0` is a regular file that will be copied by `shutil.copy2`, so within the container, accessing `libOpenCL.so.1.0` will correctly resolve to `libOpenCL.so.1.0.0`.

However, in Google Colab, the nvidia-related executables in the `/usr/bin` directory are also symbolic links, for example:

- `srcname='/usr/bin/nvidia-smi'`
- `dstname='xxx/.udocker/containers/deb/ROOT//usr/bin/nvidia-smi'`
- `linkto='/opt/bin/.nvidia/nvidia-smi'`

As a result, a symbolic link `/usr/bin/nvidia-smi -> /opt/bin/.nvidia/nvidia-smi` will be created in the container.

Since the program only creates the symbolic link and does not copy the actual file, `/opt/bin/.nvidia/nvidia-smi` does not exist in the container's file system (`xxx/.udocker/containers/deb/ROOT/opt/bin/.nvidia/nvidia-smi` does not exist). Therefore, in the shell, we cannot use `nvidia-smi` to view GPU information.

Other files may also be affected by this logic and may be missing in the container. That's why I tried to fix this issue. After the fix, the container can be successfully started following the steps in #425, and `nvidia-smi` can be executed. Additionally, the PyTorch test in `/content/test.py` was successful.

Note 1: `urun` is an alias for `su somebottle -l -c`

Note 2: Since the NVIDIA libraries were already copied correctly, PyTorch could access the GPU even before this fix.
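To make the two symlink cases above easier to reproduce outside Colab, here is a minimal, self-contained sketch. It is not the code in `nvidia.py` and not necessarily the change made in this PR: the helper names `copy_verbatim`, `copy_resolving` and `exists_in_container` are hypothetical, and the directory layout is a scratch imitation of the paths quoted above. `copy_verbatim` recreates a symlink with its original target, while `copy_resolving` illustrates one possible alternative that keeps same-directory links and otherwise copies the file the link points to.

```python
import os
import shutil
import tempfile


def copy_verbatim(srcname, dstname):
    """Recreate a symlink with its original target; copy regular files."""
    os.makedirs(os.path.dirname(dstname), exist_ok=True)
    if os.path.islink(srcname):
        os.symlink(os.readlink(srcname), dstname)
    else:
        shutil.copy2(srcname, dstname)


def copy_resolving(srcname, dstname):
    """Hypothetical alternative: keep same-directory links, else copy the real file."""
    os.makedirs(os.path.dirname(dstname), exist_ok=True)
    if os.path.islink(srcname):
        linkto = os.readlink(srcname)
        if os.path.basename(linkto) == linkto:  # bare filename, same directory
            os.symlink(linkto, dstname)
            return
    shutil.copy2(srcname, dstname)  # copy2 follows symlinks by default


def exists_in_container(root, path):
    """Check `path` as seen with `root` as '/', following one symlink level only."""
    full = os.path.join(root, path.lstrip("/"))
    if os.path.islink(full):
        target = os.readlink(full)
        if os.path.isabs(target):
            full = os.path.join(root, target.lstrip("/"))
        else:
            full = os.path.join(os.path.dirname(full), target)
    return os.path.exists(full)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        host = os.path.join(tmp, "host")

        # Executable reached through an absolute symlink (the nvidia-smi case).
        real_bin = os.path.join(host, "opt", "bin", ".nvidia", "nvidia-smi")
        os.makedirs(os.path.dirname(real_bin))
        with open(real_bin, "w") as handle:
            handle.write("#!/bin/sh\n")
        link_bin = os.path.join(host, "usr", "bin", "nvidia-smi")
        os.makedirs(os.path.dirname(link_bin))
        os.symlink(real_bin, link_bin)

        # Library reached through a relative symlink (the libOpenCL case).
        lib_dir = os.path.join(host, "usr", "lib64-nvidia")
        os.makedirs(lib_dir)
        real_lib = os.path.join(lib_dir, "libOpenCL.so.1.0.0")
        with open(real_lib, "w") as handle:
            handle.write("fake library\n")
        link_lib = os.path.join(lib_dir, "libOpenCL.so.1.0")
        os.symlink("libOpenCL.so.1.0.0", link_lib)

        root_a = os.path.join(tmp, "ROOT_verbatim")
        root_b = os.path.join(tmp, "ROOT_resolving")
        copy_verbatim(link_bin, os.path.join(root_a, "usr/bin/nvidia-smi"))
        copy_resolving(link_bin, os.path.join(root_b, "usr/bin/nvidia-smi"))
        lib_dst = os.path.join(root_a, "usr/lib/x86_64-linux-gnu")
        copy_verbatim(link_lib, os.path.join(lib_dst, "libOpenCL.so.1.0"))
        copy_verbatim(real_lib, os.path.join(lib_dst, "libOpenCL.so.1.0.0"))

        return (
            exists_in_container(root_a, "/usr/bin/nvidia-smi"),
            exists_in_container(root_b, "/usr/bin/nvidia-smi"),
            exists_in_container(root_a, "/usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0"),
        )


if __name__ == "__main__":
    print(demo())  # (False, True, True)
```

The first value shows the dangling absolute link, the second shows the executable surviving when its content is copied, and the third confirms that the relative library link already worked with the verbatim approach. The sketch needs a platform that supports symlinks.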
Looking forward to your reply! Thank you for your great work.