-
Notifications
You must be signed in to change notification settings - Fork 17
Update the code to use scontrol in place of slurm APIs to drain the #74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
3b048a6
to
478801c
Compare
redfish-exporter/listener.go
Outdated
@@ -219,15 +220,34 @@ func (s *Server) processRequest(AppConfig Config, conn net.Conn, req *http.Reque | |||
log.Printf("Origin Of Condition: %s", originOfCondition) | |||
for _, triggerEvent := range AppConfig.TriggerEvents { | |||
if severity == triggerEvent.Severity { | |||
log.Printf("Matched Trigger Event: %s with action %s", triggerEvent.Severity, triggerEvent.Action) | |||
if triggerEvent.Message != "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the action will only be triggered if the severity message matches as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user doesn't want to match on the 'message' field, it can be left empty. I will add it to the comments in .env
redfish-exporter/.env
Outdated
TRIGGER_EVENTS="[\ | ||
{\"Severity\":\"Fatal\",\"Action\":\"DrainNode\"},\ | ||
{\"Severity\":\"Critical\",\"Action\":\"DrainNode\"} | ||
{\"Severity\":\"Critical\",\"Message\":\"message 1|This is a critical test event\",\"Action\":\"DrainNode\", \"DrainReasonPrefix\":\"RebootNeeded\"},\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is an action necessary if it's just going to be a drain anyway?
Could you also add a comment for these fields as well to make it clear?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left it as is in case we need to use it for some other action in the future. Will update the comments
redfish-exporter/slurm/slurm.go
Outdated
@@ -160,6 +201,183 @@ func (c *Client) GetNodes() ([]string, error) { | |||
return nodes, nil | |||
} | |||
|
|||
func (c *Client) GetNodeReason(nodeName string) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this code needed anymore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left in the functions which use API for future use. I will update the name of this function
"log" | ||
"strings" | ||
|
||
"github.com/nod-ai/ADA/redfish-exporter/metrics" | ||
) | ||
|
||
const ( | ||
Drain = "DrainNode" | ||
Drain = "DrainNode" | ||
ExlcudeReasonSet = "DRAIN_EXCLUDE_REASON_SET" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the user provide this? Is this different than the exclude reason from .env?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for internal logging
node to make it slurm version independent
478801c
to
034e2ad
Compare
@@ -160,6 +201,183 @@ func (c *Client) GetNodes() ([]string, error) { | |||
return nodes, nil | |||
} | |||
|
|||
func (c *Client) GetNodeReasonWithAPI(nodeName string) (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed now that we are moving everything to scontrol
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had tested this API path also, left it in if we move to API later
d76f5ff
to
0009de2
Compare
0009de2
to
219e136
Compare
219e136
to
a506e62
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mithun-pensando Could you please update this README with the updated instructions for running the listener: https://github.com/amd/ADA/blob/main/redfish-exporter/README.md?
Thank you.
redfish-exporter/.env
Outdated
SLURM_CONTROL_NODE="10.235.34.47" | ||
SLURM_DRAIN_EXCLUDE_REASON_LIST="AMD|Pensando|RebootNeeded" | ||
SLURM_SCONTROL_PATH="/usr/bin/scontrol" | ||
TLS_TIMEOUT="15" | ||
|
||
TRIGGER_EVENTS="[\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this list can now come from env or through .json file?
redfish-exporter/config.go
Outdated
AppConfig.TriggerEvents = tInfoMap | ||
|
||
for kk, tt := range AppConfig.TriggerEvents { | ||
fmt.Println("Severity: ", kk) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these print statements for debugging purpose? If not, change it to log
?
a506e62
to
62167cb
Compare
62167cb
to
1ceb089
Compare
node to make it slurm version independent