Skip to content

Conversation

@ArangoGutierrez
Copy link
Collaborator

@ArangoGutierrez ArangoGutierrez commented Jul 1, 2025

This patch adds extra functionality to the apply-exit hook, calling the nvidia-ctk cdi generate to refresh the CDI config on a MIG (re)configuration event

@ArangoGutierrez ArangoGutierrez self-assigned this Jul 1, 2025

This comment was marked as outdated.

@ArangoGutierrez ArangoGutierrez requested a review from klueska July 1, 2025 11:49
fi
}

function refresh-cdi-config() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no longer required.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@ArangoGutierrez ArangoGutierrez force-pushed the cdi_generate_hook branch 2 times, most recently from cf9c0bd to 83b4d05 Compare July 1, 2025 15:08
@ArangoGutierrez ArangoGutierrez requested a review from elezar July 1, 2025 15:09
@ArangoGutierrez ArangoGutierrez changed the title Add refresh-cdi hook Extend the functionality of apply-exit to refresh CDI config Jul 2, 2025
@ArangoGutierrez
Copy link
Collaborator Author

PR Edited
@elezar / @klueska PTAL

This comment was marked as outdated.

@elezar
Copy link
Member

elezar commented Jul 10, 2025

@ArangoGutierrez how are we specifying WHERE the output should be created? Although @klueska had concerns about triggering the nvidia-cdi-refresh.service doing so would mean that we would ensure consistency in terms of the ouptut path.

Also, on:

Given that NVIDIA/nvidia-container-toolkit#1107 is merged, if I remove the --output from this PR that should be enough to allow the user to specify custom configuration via the config file

I don't think that specifying an output path in the config file makes sense since this would complicate the default behvaiour which is to output the spec to STDOUT.

@ArangoGutierrez
Copy link
Collaborator Author

@ArangoGutierrez how are we specifying WHERE the output should be created? Although @klueska had concerns about triggering the nvidia-cdi-refresh.service doing so would mean that we would ensure consistency in terms of the ouptut path.

I would like to call the nvidia-cdi-refresh.service from the hook, maybe something to discuss with @klueska

@ArangoGutierrez
Copy link
Collaborator Author

@klueska / @elezar Updated based on today's discussion

This comment was marked as outdated.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR extends the apply-exit hook functionality to refresh CDI (Container Device Interface) configuration after MIG reconfiguration events. The implementation adds a new hook that triggers CDI configuration updates through a systemd service.

Key changes:

  • Adds a new cdi-refresh hook to the hooks configuration
  • Creates a new script that conditionally starts the nvidia-cdi-refresh systemd service
  • Integrates CDI refresh functionality into the MIG manager's hook system

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
deployments/systemd/hooks-default.yaml Adds new cdi-refresh hook configuration that executes the CDI refresh script
deployments/systemd/cdi-refresh.sh New shell script that conditionally starts nvidia-cdi-refresh.service if available

@ArangoGutierrez
Copy link
Collaborator Author

Refresher PING @klueska / @elezar

@ArangoGutierrez
Copy link
Collaborator Author

Now that we have Toolkit RC6 and are almost ready for proper release, and Systemd is stable, can we give this PR another try?
@klueska / @elezar

- workdir: "/etc/nvidia-mig-manager"
command: "/bin/bash"
args: ["-x", "-c", "source hooks.sh; apply-exit"]
cdi-refresh:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a valid hook name. I think this should probably just be an apply-exit hook.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is not included in the rpm or deb packages.

Copy link
Member

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not what we discussed. We should:

  1. Define cdi-refresh in hooks.sh
  2. Add a new hook to the apply-exit hooks hooks-default.yaml that calls the function in the same way as we do for the other hooks.

We ALSO discussed (not for this PR):

  1. Updating the names of the other functions to show intent and not the EVENTS
  2. Clarifying the API around envvars etc.

@ArangoGutierrez ArangoGutierrez force-pushed the cdi_generate_hook branch 2 times, most recently from 4cac112 to cec05e8 Compare November 10, 2025 10:57
@ArangoGutierrez
Copy link
Collaborator Author

This is not what we discussed. We should:

  1. Define cdi-refresh in hooks.sh
  2. Add a new hook to the apply-exit hooks hooks-default.yaml that calls the function in the same way as we do for the other hooks.

We ALSO discussed (not for this PR):

  1. Updating the names of the other functions to show intent and not the EVENTS
  2. Clarifying the API around envvars etc.

Done

elezar
elezar previously approved these changes Nov 10, 2025
@ArangoGutierrez
Copy link
Collaborator Author

I have updated this PR to be in line with changes proposed at #275

# Check if nvidia-cdi-refresh.service exists
if systemctl list-unit-files nvidia-cdi-refresh.service --quiet; then
echo "Found nvidia-cdi-refresh.service, calling systemctl..." >&2
if ! systemctl start nvidia-cdi-refresh.service; then
Copy link
Member

@elezar elezar Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Should this be restart instead of start?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, will edit

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@ArangoGutierrez
Copy link
Collaborator Author

@elezar I did a change after your approval PTAL

@ArangoGutierrez ArangoGutierrez dismissed elezar’s stale review November 21, 2025 15:29

@elezar I did a change after your approval PTAL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants