Prerequisites
Feature Summary
Health checks that verify required software components, including the Network Operator and GPU Operator, are running successfully, and that take corrective action when they are not.
Problem/Use Case
In a typical K8s deployment, a node is fully operational only when the Network Operator is running, which in turn exposes the NVSwitch components to the GPU Operator. If either component is not running successfully, the node is unavailable for additional workloads. Depending on the errors the operators report, restarting them may be the correct action.
Proposed Solution
Add the ability to monitor the behavior of these and other operators, since their health determines whether the GPU node is ready to take workloads.
Add the ability to restart the operators.
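The two items above could be sketched roughly as follows. This is only an illustration of the decision logic, not a proposed implementation: `OperatorStatus`, `node_ready`, `plan_action`, the `Action` values, and the `max_restarts` budget are all hypothetical names and assumptions, and the pod counts would in practice come from the Kubernetes API rather than being passed in by hand.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"          # operator is healthy, nothing to do
    RESTART = "restart"    # operator is unhealthy and restart budget remains
    ESCALATE = "escalate"  # operator is unhealthy and restart budget is spent


@dataclass
class OperatorStatus:
    """Hypothetical snapshot of one operator's pods on a node."""
    name: str               # e.g. "network-operator" (illustrative label)
    desired_pods: int       # pods the operator is expected to run
    ready_pods: int         # pods currently reporting Ready
    restarts_done: int = 0  # restarts this monitor has already attempted

    @property
    def healthy(self) -> bool:
        # Healthy only if something is expected and all of it is ready.
        return self.desired_pods > 0 and self.ready_pods >= self.desired_pods


def node_ready(statuses: list[OperatorStatus]) -> bool:
    """The node takes workloads only if every monitored operator is healthy.

    An empty list is treated as not ready, since nothing was observed.
    """
    return bool(statuses) and all(s.healthy for s in statuses)


def plan_action(status: OperatorStatus, max_restarts: int = 3) -> Action:
    """Pick an action for one operator; max_restarts is an assumed budget."""
    if status.healthy:
        return Action.NONE
    if status.restarts_done < max_restarts:
        return Action.RESTART
    return Action.ESCALATE


if __name__ == "__main__":
    observed = [
        OperatorStatus("network-operator", desired_pods=2, ready_pods=2),
        OperatorStatus("gpu-operator", desired_pods=3, ready_pods=1),
    ]
    print("node ready:", node_ready(observed))
    for op in observed:
        print(op.name, "->", plan_action(op).value)
```

Capping restarts and escalating afterwards is one possible design choice; it keeps the monitor from restarting an operator indefinitely when the underlying error is not one a restart can fix.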
Component
Fault Management