Labels
bug (Something isn't working)
Description
Prerequisites
- I searched existing issues
- I can reproduce this issue
Bug Description
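The GPU health monitor crashes with a Python TypeError on shutdown. According to the traceback, process_exit_signal (defined inside cli()) is called with two positional arguments but accepts none. This is consistent with the function being registered as a signal handler, since Python calls signal handlers with the signal number and the current stack frame.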
Component
Health Monitor
Steps to Reproduce
- Start the GPU health monitor
- Tail its logs
- Delete the pod and observe the traceback in the logs as the process shuts down
Environment
- NVSentinel version: 0.3.0
- Kubernetes version: 1.33
- Deployment method: helm
Logs/Output
Traceback (most recent call last):
  File "/usr/local/bin/gpu_health_monitor", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1462, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1383, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1246, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 814, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/gpu_health_monitor/cli.py", line 145, in cli
    dcgm_watcher.start([], exit)
  File "/usr/local/lib/python3.10/dist-packages/gpu_health_monitor/dcgm_watcher/dcgm.py", line 308, in start
    exit.wait(self._poll_interval_seconds)
  File "/usr/lib/python3.10/threading.py", line 607, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
TypeError: cli.<locals>.process_exit_signal() takes 0 positional arguments but 2 were given
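A possible fix, sketched under assumptions: the traceback suggests process_exit_signal is installed via signal.signal() and is meant to set the exit event that the watcher loop waits on. The actual contents of cli.py are not shown in this report, so everything below except the handler name is hypothetical (exit_event, install_handlers, and poll_until_exit are illustrative stand-ins). The key point is that a Python signal handler must accept two arguments, the signal number and the current stack frame.

```python
import signal
import threading

# Hypothetical stand-in for the `exit` event passed to dcgm_watcher.start().
exit_event = threading.Event()


def process_exit_signal(signum, frame):
    """Signal handlers are called with (signal number, current stack frame)."""
    exit_event.set()


def install_handlers():
    # Assumption: the monitor should stop on SIGTERM (sent by Kubernetes when
    # a pod is deleted) and on SIGINT (Ctrl+C during local runs).
    # signal.signal() must be called from the main thread.
    signal.signal(signal.SIGTERM, process_exit_signal)
    signal.signal(signal.SIGINT, process_exit_signal)


def poll_until_exit(poll_interval_seconds=0.1):
    """Stand-in for the watcher loop; returns the number of completed polls."""
    polls = 0
    # Event.wait() returns True as soon as the event is set, ending the loop.
    while not exit_event.wait(poll_interval_seconds):
        polls += 1
    return polls
```

With a two-argument handler, the signal interrupts exit.wait(), the handler sets the event, and the loop can return cleanly instead of raising TypeError.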