@karthikvetrivel
Member

@karthikvetrivel karthikvetrivel commented Oct 2, 2025

This PR is part of the following effort:

GPU Driver container should avoid re-installing drivers on spurious container restarts

Relevant PRs:

Decision Tree:

shouldSkipUninstall()
│
├─> [1] config.forceReinstall == true?
│   ├─> YES → ❌ PROCEED WITH UNINSTALL
│   │           return (false, "")
│   │           Log: "Force reinstall is enabled, proceeding with driver uninstall"
│   │
│   └─> NO → Continue to [2]
│
├─> [2] isDriverLoaded()?
│   │   Check: /sys/module/nvidia/refcnt exists
│   │
│   ├─> NO → ❌ PROCEED WITH INSTALLATION
│   │          return (false, "")
│   │          Log: "Driver not currently loaded, proceeding with installation"
│   │
│   └─> YES → Continue to [3]
│
└─> [3] hasDriverConfigChanged()?
    │   Read: /run/nvidia/driver-config.state
    │   Build current config from env vars + config files
    │   Compare: currentConfig != storedConfig?
    │
    ├─> YES (config changed) → ❌ PROCEED WITH UNINSTALL
    │                           return (false, reason)
    │                           Log: "Driver configuration has changed: <reason>"
    │                           
    │   Reasons:
    │   • "no previous driver configuration found" (file missing)
    │   • "unable to read previous driver configuration" (read error)
    │   • "driver configuration changed" (content differs)
    │
    └─> NO (config matches) → ✅ SKIP UNINSTALL
                               return (true, "desired version and configuration already present")
                               Log: "Installed driver version and configuration match desired state, skipping uninstall"


LEGEND:
═══════
❌ PROCEED = return (false, reason) → uninstallDriver() continues
✅ SKIP = return (true, reason) → uninstallDriver() returns nil early
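
For readers who prefer code to the tree, here is a minimal Go sketch of the same three checks in the same order. It is not the PR's implementation: the `checks` struct and its function fields are hypothetical stand-ins for the real config and helpers, introduced so the ordering can be exercised without touching the host.

```go
package main

import "fmt"

// checks bundles the inputs of the decision tree. The struct and its field
// names are hypothetical; the PR wires these up differently.
type checks struct {
	forceReinstall bool
	driverLoaded   func() bool           // stand-in for isDriverLoaded()
	configChanged  func() (bool, string) // stand-in for hasDriverConfigChanged()
}

// shouldSkipUninstall walks checks [1], [2], [3] in order and returns
// (skip, reason), matching the legend: false means the uninstall proceeds.
func shouldSkipUninstall(c checks) (bool, string) {
	// [1] A forced reinstall always proceeds.
	if c.forceReinstall {
		return false, ""
	}
	// [2] Nothing loaded: proceed with a normal installation.
	if !c.driverLoaded() {
		return false, ""
	}
	// [3] Loaded, but the desired configuration differs from the stored one.
	if changed, reason := c.configChanged(); changed {
		return false, reason
	}
	return true, "desired version and configuration already present"
}

func main() {
	// Scenario 2: modules loaded, configuration unchanged.
	scenario2 := checks{
		driverLoaded:  func() bool { return true },
		configChanged: func() (bool, string) { return false, "" },
	}
	skip, reason := shouldSkipUninstall(scenario2)
	fmt.Printf("skip=%v reason=%q\n", skip, reason)
}
```

Because the checks short-circuit, the later helpers are never called once an earlier check decides the outcome.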


SCENARIOS:
═════════

Scenario 1: Clean Restart (no modules loaded)
  [1] forceReinstall=false → Continue
  [2] isDriverLoaded()=false → ❌ PROCEED

Scenario 2: Non-Clean Restart (modules loaded, config unchanged)
  [1] forceReinstall=false → Continue
  [2] isDriverLoaded()=true → Continue
  [3] hasDriverConfigChanged()=false → ✅ SKIP

Scenario 3: Config Changed (version, params, kernel, etc.)
  [1] forceReinstall=false → Continue
  [2] isDriverLoaded()=true → Continue
  [3] hasDriverConfigChanged()=true → ❌ PROCEED
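
The two helper checks used in these scenarios could look roughly like the sketch below. It assumes the stored state is compared as a plain string; the PR does not show the serialization format here, so the example value passed in `main` is purely illustrative. The paths are taken as parameters (an assumption made for testability) and the constant name `nvidiaRefcntPath` is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

const (
	nvidiaRefcntPath      = "/sys/module/nvidia/refcnt" // hypothetical constant name
	driverConfigStateFile = "/run/nvidia/driver-config.state"
)

// isDriverLoaded reports whether the module's refcnt file exists in sysfs.
func isDriverLoaded(refcntPath string) bool {
	_, err := os.Stat(refcntPath)
	return err == nil
}

// hasDriverConfigChanged compares the stored state with the current one and
// returns one of the three reasons listed in the decision tree.
func hasDriverConfigChanged(stateFile, current string) (bool, string) {
	stored, err := os.ReadFile(stateFile)
	switch {
	case errors.Is(err, fs.ErrNotExist):
		return true, "no previous driver configuration found"
	case err != nil:
		return true, "unable to read previous driver configuration"
	case string(stored) != current:
		return true, "driver configuration changed"
	}
	return false, ""
}

func main() {
	fmt.Println("driver loaded:", isDriverLoaded(nvidiaRefcntPath))
	// The config string is a placeholder, not the real state format.
	changed, reason := hasDriverConfigChanged(driverConfigStateFile, "placeholder-config")
	fmt.Println("config changed:", changed, reason)
}
```

Treating a missing or unreadable state file as "changed" keeps the safe default: when in doubt, the uninstall proceeds.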

@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 68adf6a to 1991b8c Compare October 16, 2025 15:11
@karthikvetrivel karthikvetrivel marked this pull request as ready for review October 16, 2025 15:19
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 900f54b to 1991b8c Compare October 20, 2025 20:53
@karthikvetrivel
Member Author

@cdesiniotis I've moved shouldSkipUninstall so that the operands still release /run/nvidia/driver mounts.

… trigger driver reinstall

Signed-off-by: Karthik Vetrivel <[email protected]>
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 4aa4395 to 5fba425 Compare November 6, 2025 14:09
@karthikvetrivel karthikvetrivel marked this pull request as draft November 6, 2025 14:16
driverRoot = "/run/nvidia/driver"
driverPIDFile = "/run/nvidia/nvidia-driver.pid"
driverConfigStateFile = "/run/nvidia/driver-config.state"
operatorNamespace = "gpu-operator"
Contributor

Read this from the OPERATOR_NAMESPACE environment variable instead of hard-coding it.

Member Author

@karthikvetrivel karthikvetrivel Nov 7, 2025

This definition was already in the original version; I'm not sure why it shows up in the diff. If OPERATOR_NAMESPACE is set, that value is used. Otherwise, the Value field supplies the default, which is what this constant is used for.
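
The fallback behaviour described in this reply can be illustrated with the standard library alone. This is a sketch of the pattern, not the project's flag wiring: `namespaceFromEnv` is a hypothetical helper, and treating an empty OPERATOR_NAMESPACE as unset is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// operatorNamespace is the default used when the environment does not
// provide a namespace.
const operatorNamespace = "gpu-operator"

// namespaceFromEnv (hypothetical helper) prefers OPERATOR_NAMESPACE and
// falls back to the constant. The lookup function is injected so the
// behaviour can be checked without mutating the process environment.
func namespaceFromEnv(lookup func(string) (string, bool)) string {
	if v, ok := lookup("OPERATOR_NAMESPACE"); ok && v != "" {
		return v
	}
	return operatorNamespace
}

func main() {
	fmt.Println(namespaceFromEnv(os.LookupEnv))
}
```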

3 participants