Skip to content

Conversation

@jgehrcke
Copy link
Collaborator

@jgehrcke jgehrcke commented Oct 2, 2025

Resolves #581.

@copy-pr-bot
Copy link

copy-pr-bot bot commented Oct 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jgehrcke jgehrcke requested a review from klueska October 2, 2025 16:36
@jgehrcke jgehrcke self-assigned this Oct 2, 2025
@jgehrcke jgehrcke added the robustness issue/pr: edge cases & fault tolerance label Oct 2, 2025
@jgehrcke jgehrcke added this to the v25.8.0 milestone Oct 2, 2025
@jgehrcke jgehrcke moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Oct 2, 2025
Comment on lines 336 to 355

// TODO: review skipping this if the new set of IP addresses only
// strictly removes addresses compared to the old set (then we don't
// need to force the daemon to re-resolve & re-connect).
if updated && !fresh {
// Actively ask the IMEX daemon to re-read its config and to
// re-connect to its peers (involving DNS name re-resolution).
klog.Infof("updated DNS/IP mapping, old process: send SIGUSR1")
if err := processManager.Signal(syscall.SIGUSR1); err != nil {
// Only log (ignore this error for now: if the process went
// away unexpectedly, the process manager will handle that.
// If any other error resulted in bad signal delivery, we
// may get away with it).
klog.Errorf("failed to send SIGUSR1 to child process: %s", err)
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO: review skipping this if the new set of IP addresses only
// strictly removes addresses compared to the old set (then we don't
// need to force the daemon to re-resolve & re-connect).
if updated && !fresh {
// Actively ask the IMEX daemon to re-read its config and to
// re-connect to its peers (involving DNS name re-resolution).
klog.Infof("updated DNS/IP mapping, old process: send SIGUSR1")
if err := processManager.Signal(syscall.SIGUSR1); err != nil {
// Only log (ignore this error for now: if the process went
// away unexpectedly, the process manager will handle that.
// If any other error resulted in bad signal delivery, we
// may get away with it).
klog.Errorf("failed to send SIGUSR1 to child process: %s", err)
}
}
// TODO: review skipping this if the new set of IP addresses only
// strictly removes addresses compared to the old set (then we don't
// need to force the daemon to re-resolve & re-connect).
if !updated || fresh {
return nil
}
// Actively ask the IMEX daemon to re-read its config and to
// re-connect to its peers (involving DNS name re-resolution).
klog.Infof("updated DNS/IP mapping, old process: send SIGUSR1")
if err := processManager.Signal(syscall.SIGUSR1); err != nil {
// Only log (ignore this error for now: if the process went
// away unexpectedly, the process manager will handle that.
// If any other error resulted in bad signal delivery, we
// may get away with it).
klog.Errorf("failed to send SIGUSR1 to child process: %s", err)
}

@jgehrcke jgehrcke force-pushed the jp/imex-daemon-sigusr1 branch 2 times, most recently from 3cfc79a to f4011ef Compare October 6, 2025 13:02
@jgehrcke jgehrcke force-pushed the jp/imex-daemon-sigusr1 branch from b6dd25e to c6b8c80 Compare October 6, 2025 14:43
@jgehrcke
Copy link
Collaborator Author

jgehrcke commented Oct 6, 2025

Updated one more time to not do return nil but continue instead in the early exit.

I will also test this patch again and report back -- it's too easy to make mistakes during seemingly innocent changes, of course.

@jgehrcke
Copy link
Collaborator Author

jgehrcke commented Oct 6, 2025

Tested that last state, looking OK:

  • ✔️ three nvbandwidth-based fault injection tests
  • ✔️ make bats (see below)
tests.bats
 ✓ test VERSION_W_COMMIT, VERSION_GHCR_CHART, VERSION [65]
 ✓ confirm no kubelet plugin pods running [68]
 ✓ helm-install deployments/helm/nvidia-dra-driver-gpu/25.8.0-dev [5205]
 ✓ helm list: validate output [96]
 ✓ get crd computedomains.resource.nvidia.com [62]
 ✓ wait for kubelet plugin pods READY [264]
 ✓ wait for controller pod READY [162]
 ✓ validate CD controller container image spec [63]
 ✓ IMEX channel injection (single) [7104]
 ✓ IMEX channel injection (all) [9987]
 ✓ NodePrepareResources: catch unknown field in opaque cfg in ResourceClaim [3132]
 ✓ nickelpie (NCCL send/recv/broadcast, 2 pods, 2 nodes, small payload) [11552]
 ✓ nvbandwidth (2 nodes, 2 GPUs each) [19673]
 ✓ downgrade: current-dev -> last-stable [27903]
 ✓ upgrade: wipe-state, install-last-stable, upgrade-to-current-dev [29114]

15 tests, 0 failures in 115 seconds

Kevin reviewed and verbally approved -- merging now, to facilitate branch wrestling.

@jgehrcke jgehrcke merged commit 55c647a into NVIDIA:main Oct 6, 2025
14 checks passed
@klueska klueska moved this from In Progress to Closed in Planning Board: k8s-dra-driver-gpu Oct 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

robustness issue/pr: edge cases & fault tolerance

Projects

Development

Successfully merging this pull request may close these issues.

Actively trigger local IMEX daemon to re-resolve and re-connect to its peers (via SIGUSR1)

2 participants