Commit a7539de

Bump terminationGracePeriodSeconds to 120 for vGPU Manager daemonset
This ensures the vGPU Manager has enough time to finish its cleanup logic when the pod is terminated. This fixes an issue observed on an HGX node (with 8 GPUs) where the vGPU Manager was killed before it finished disabling VFs on all 8 GPUs.

Signed-off-by: Christopher Desiniotis <[email protected]>
1 parent b503185 commit a7539de

4 files changed: 6 additions, 0 deletions

assets/state-vgpu-manager/0500_daemonset.yaml

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ spec:
       labels:
         app: nvidia-vgpu-manager-daemonset
     spec:
+      terminationGracePeriodSeconds: 120
       nodeSelector:
         nvidia.com/gpu.deploy.vgpu-manager: "true"
       tolerations:

internal/state/testdata/golden/driver-vgpu-host-manager-openshift.yaml

Lines changed: 1 addition & 0 deletions
@@ -330,6 +330,7 @@ spec:
         nvidia.com/gpu.deploy.vgpu-manager: "true"
       priorityClassName: system-node-critical
       serviceAccountName: nvidia-vgpu-manager-openshift
+      terminationGracePeriodSeconds: 120
       tolerations:
       - effect: NoSchedule
         key: nvidia.com/gpu

internal/state/testdata/golden/driver-vgpu-host-manager.yaml

Lines changed: 1 addition & 0 deletions
@@ -238,6 +238,7 @@ spec:
         nvidia.com/gpu.deploy.vgpu-manager: "true"
       priorityClassName: system-node-critical
       serviceAccountName: nvidia-vgpu-manager-ubuntu22.04
+      terminationGracePeriodSeconds: 120
       tolerations:
       - effect: NoSchedule
         key: nvidia.com/gpu

manifests/state-driver/0500_daemonset.yaml

Lines changed: 3 additions & 0 deletions
@@ -59,6 +59,9 @@ spec:
         {{- .Driver.Spec.Labels | yaml | nindent 8 }}
         {{- end }}
     spec:
+      {{- if eq .Driver.Spec.DriverType "vgpu-host-manager" }}
+      terminationGracePeriodSeconds: 120
+      {{- end }}
       nodeSelector:
       {{- if eq .Driver.Spec.DriverType "vgpu-host-manager" }}
         nvidia.com/gpu.deploy.vgpu-manager: "true"
