@jfroy
Collaborator

@jfroy jfroy commented Nov 17, 2025

This is a follow-up patch that builds on #1444. It covers more cases by using a slightly wider set of system paths.

@elezar @cdesiniotis

@copy-pr-bot

copy-pr-bot bot commented Nov 17, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

//
// docker run --rm -ti redhat/ubi9 /usr/lib/ld-linux-aarch64.so.1 --help | grep -A6 "Shared library search path"
// TODO: Add other architectures that have custom `add_system_dir` macros (e.g. riscv)
// TODO: Replace with executing the container's dynamic linker with `--list-diagnostics`?
Member

Just a note on this. Since containers are user-supplied, we need to be careful about executing something from the container. This (and the fact that not all containers include ldconfig) is the reason that we don't run ldconfig from the container.

Collaborator Author

Yeah, this is why I wrote it as a question. Two follow-ups:

  • The suggestion/question is about the dynamic linker, not ldconfig. I imagine almost all container images that expect to run our software are going to have it.
  • I probably lack the historical or technical background to understand why this hook is not a startContainer hook that could run the container image's ldconfig (if any) or ld.so. Naively, the runtime is about to execute the main container process, which is just as untrusted. Anyway, I don't mean to start a big technical discussion with this comment; I am just curious about the way things are and the security posture.

Member

The reason this is a createContainer hook and not a startContainer hook is that the hooks we run (i.e. nvidia-cdi-hook) are not available in the container. This is not to say that it is impossible to ensure that these are injected and available to the container, but it is not something that we have worked on.

Note that logic like "if any" is not really expressible in the current OCI Runtime hook spec, which is why we rely on more complex logic being baked into an executable. What we could consider doing is:

  1. Run createContainer hooks to:
    a. create /etc/ld.so.conf.d/ drop-in files for injected libraries and CUDA compat libraries.
    b. create a hook at a well-known path in the container that covers optionally running ldconfig / ldconfig.real in the container.
  2. Run a startContainer hook referencing the created hook.

Note that since we would then be running ldconfig in the container as a startContainer hook, we would be able to leverage the isolation that is already provided by low-level runtimes such as runc. It would also simplify the logic around running ldconfig, since we would not have to handle differences between the host and the container distributions.

One caveat here is that we would NOT be able to handle container images that do not include ldconfig, although in that case we may be able to fall back to a host executable mounted into the container.

Collaborator Author

@jfroy jfroy Nov 18, 2025

It's appealing to consider a startContainer approach, but because of containers without an ldconfig binary and the need to emit one or more drop-ins, it wouldn't save any code or reduce complexity.

@elezar
Member

elezar commented Nov 18, 2025

/ok-to-test ce85f8d

@elezar elezar force-pushed the ldcache-system-dirs branch from ce85f8d to d7f94e6 Compare November 18, 2025 08:45
@jfroy jfroy force-pushed the ldcache-system-dirs branch 2 times, most recently from ba16157 to 5de1fb0 Compare November 19, 2025 05:24
@jfroy jfroy changed the title Append container system paths to ld.so.conf Write system paths to lexicographically last ld.so.conf.d drop-in Nov 19, 2025
@jfroy
Collaborator Author

jfroy commented Nov 19, 2025

Updated patch to write a drop-in instead of appending to the top-level conf file.

@jfroy jfroy force-pushed the ldcache-system-dirs branch from 5de1fb0 to 8182a27 Compare November 19, 2025 05:27
This change ensures that the ldcache in a non-Debian container
includes libraries at /lib64 and /usr/lib64 when running on a
Debian host. This is required because the Debian host's system
search paths do not include these folders by default, resulting in
a non-Debian container missing system libraries from the ldcache.

Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Jean-Francois Roy <[email protected]>
@jfroy jfroy force-pushed the ldcache-system-dirs branch 2 times, most recently from 33ce626 to d238345 Compare November 19, 2025 14:54
In most cases, the hook will be executing a host ldconfig that may be
configured very differently from what the container image expects. The
common case is Debian vs non-Debian. But there are also hosts that
configure ldconfig to search in a glibc prefix (e.g. /usr/lib/glibc). To
avoid all these cases, write the container's expected system search
paths to a drop-in conf file that is likely to be last in lexicographic
order. Entries in the top-level ld.so.conf file may be processed after
this drop-in, but this hook does not modify the top-level file if it
exists.

Signed-off-by: Jean-Francois Roy <[email protected]>
@jfroy jfroy force-pushed the ldcache-system-dirs branch from d238345 to b709f1d Compare November 19, 2025 18:38
@jfroy
Collaborator Author

jfroy commented Nov 19, 2025

Superseded by #1444

@jfroy jfroy closed this Nov 19, 2025
jfroy added a commit to jfroy/nvidia-container-toolkit that referenced this pull request Nov 19, 2025
With the fixed paths, the hook can emit the system paths drop-in
unconditionally without breaking the e2e tests.

Signed-off-by: Jean-Francois Roy <[email protected]>