Conversation

@jgehrcke
Collaborator

For #305. This is not yet tested. Will report back.

@copy-pr-bot

copy-pr-bot bot commented Mar 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jgehrcke
Collaborator Author

OK, I tested this on Luna. Confirming that the expected tolerations are set:

$ kubectl get pod -n nvidia-dra-driver-gpu   imex-channel-injection-c9s9r-5pzxk -o yaml | grep -C15 tolerations
<snip>
  nodeSelector:
    resource.nvidia.com/computeDomain: 25c467f6-b235-477d-a306-09c8e88311a7
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  resourceClaims:
  - name: compute-domain-daemon
    resourceClaimTemplateName: imex-channel-injection-daemon-claim-template-w6qnx
<snip>
  tolerations:
  - effect: NoSchedule
    operator: Exists
  - effect: NoExecute
    operator: Exists
  - effect: PreferNoSchedule
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/disk-pressure
    operator: Exists

The first three tolerations, which specify no key, are the new catch-alls. The other three are set by Kubernetes by default.
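To illustrate why the keyless entries act as catch-alls, here is a minimal sketch of toleration matching, modeled on the documented Kubernetes rules (an empty key with operator `Exists` matches every taint key; an empty effect matches every effect). The `Taint` and `Toleration` classes, the `tolerates` and `pod_tolerates` helpers, and the `example.com/dedicated` taint are hypothetical names for illustration only, not code from this PR or from the driver.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    effect: str = ""         # empty effect matches every taint effect
    key: str = ""            # empty key (with "Exists") matches every taint key
    operator: str = "Equal"  # Kubernetes treats an unset operator as "Equal"
    value: str = ""


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this single toleration matches the taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.key and toleration.key != taint.key:
        return False
    if toleration.operator == "Exists":
        return True
    # "Equal" additionally requires the values to match.
    return toleration.value == taint.value


def pod_tolerates(tolerations: list[Toleration], taint: Taint) -> bool:
    """A pod tolerates a taint if any one of its tolerations matches."""
    return any(tolerates(t, taint) for t in tolerations)


# The three keyless tolerations from the output above, one per effect.
CATCH_ALLS = [
    Toleration(effect=effect, operator="Exists")
    for effect in ("NoSchedule", "NoExecute", "PreferNoSchedule")
]

# A keyed default toleration from the same output, for contrast.
DISK_PRESSURE = Toleration(
    effect="NoSchedule",
    key="node.kubernetes.io/disk-pressure",
    operator="Exists",
)

# Hypothetical custom taint that an admin might put on a GPU node.
custom_taint = Taint(key="example.com/dedicated", effect="NoSchedule", value="gpu")

print(pod_tolerates(CATCH_ALLS, custom_taint))       # keyless entry matches
print(pod_tolerates([DISK_PRESSURE], custom_taint))  # keyed entry does not
```

Under these assumptions, the keyed defaults only cover the specific node conditions they name, while the three keyless entries cover any taint with one of the three effects.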


Copilot AI left a comment

Pull Request Overview

This PR adds wildcard tolerations for ComputeDomain pods to address issue #305.

  • Adds three wildcard tolerations with effects "NoSchedule", "NoExecute", and "PreferNoSchedule".
  • Updates the ComputeDomain daemon template to include these tolerations.

@jgehrcke jgehrcke merged commit 50703b0 into NVIDIA:main Mar 26, 2025
7 checks passed
@klueska klueska added this to the v25.3.0 milestone Aug 13, 2025
Projects

Status: Backlog

4 participants