[Bug]: GPU health monitor errors on shutdown #365

@lalitadithya

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Bug Description

On shutdown, the GPU health monitor crashes with a Python TypeError raised when its exit signal handler (process_exit_signal) is invoked.

Component

Health Monitor

Steps to Reproduce

  1. Start the GPU health monitor
  2. Tail its logs
  3. Kill the pod and observe the error in the logs

Environment

  • NVSentinel version: 0.3.0
  • Kubernetes version: 1.33
  • Deployment method: helm

Logs/Output

Traceback (most recent call last):
  File "/usr/local/bin/gpu_health_monitor", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1462, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1383, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1246, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 814, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/gpu_health_monitor/cli.py", line 145, in cli
    dcgm_watcher.start([], exit)
  File "/usr/local/lib/python3.10/dist-packages/gpu_health_monitor/dcgm_watcher/dcgm.py", line 308, in start
    exit.wait(self._poll_interval_seconds)
  File "/usr/lib/python3.10/threading.py", line 607, in wait
    signaled = self._cond.wait(timeout)
  File "/usr/lib/python3.10/threading.py", line 324, in wait
    gotit = waiter.acquire(True, timeout)
TypeError: cli.<locals>.process_exit_signal() takes 0 positional arguments but 2 were given
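
The TypeError on the last line is what Python raises when a callback registered with signal.signal is declared without parameters: the interpreter always calls handlers with two positional arguments, the signal number and the current stack frame. The contents of cli.py are not shown in this issue, so the following is only a sketch of a likely fix, not the project's actual code. The name process_exit_signal comes from the traceback; make_exit_handler and exit_event are hypothetical names used for illustration.

```python
import signal
import threading


def make_exit_handler(exit_event: threading.Event):
    """Build a signal handler that sets exit_event (hypothetical helper)."""

    # Python invokes signal handlers as handler(signum, frame), so the
    # callback must accept two positional arguments even if it ignores them.
    def process_exit_signal(signum, frame):
        exit_event.set()

    return process_exit_signal


if __name__ == "__main__":
    exit_event = threading.Event()
    handler = make_exit_handler(exit_event)

    # Register the handler for the signals a pod normally receives on shutdown.
    signal.signal(signal.SIGTERM, handler)
    signal.signal(signal.SIGINT, handler)

    # Simulate the pod being killed by delivering SIGTERM to this process.
    signal.raise_signal(signal.SIGTERM)

    # Mirrors the wait seen in the traceback; returns once the event is set.
    exit_event.wait(5)
    print("exit requested:", exit_event.is_set())
```

With a two-argument handler, the wait on the event returns normally after SIGTERM instead of propagating a TypeError out of threading.Event.wait.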

Metadata

Labels

bug: Something isn't working
