-
Notifications
You must be signed in to change notification settings - Fork 121
Module graceful shutdown support #255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Module graceful shutdown support #255
Conversation
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
Do you mind pasting the steps and output for testing (commands) in the PR description |
scripts/gnoi-reboot-daemon
Outdated
# gnoi-reboot-daemon | ||
# | ||
# This daemon facilitates gNOI-based reboot operations for DPU subcomponents within the SONiC platform. | ||
# It listens for JSON-formatted reboot requests on a named pipe and executes the corresponding gNOI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you update that it is getting from module class?
scripts/gnoi-reboot-daemon
Outdated
try: | ||
msg = json.loads(line) | ||
dpu_ip = msg["dpu_ip"] | ||
port = msg.get("port", 50052) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
port information?
scripts/gnoi-reboot-daemon
Outdated
msg = json.loads(line) | ||
dpu_ip = msg["dpu_ip"] | ||
port = msg.get("port", 50052) | ||
method = msg.get("method", 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HALT method is 3
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run |
Azure Pipelines successfully started running 1 pipeline(s). |
if [[ "$subtype" == "SmartSwitch" && "$is_dpu" != "True" ]]; then | ||
exit 0 | ||
else | ||
echo "gnoi-reboot-daemon is intended for SmartSwitch platforms only." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be debug level log, we don't want to see this on other platforms unnecessarily.
|
||
SYSLOG_IDENTIFIER = "gnoi-reboot-daemon" | ||
|
||
FIFO_PATH = "/var/run/gnoi_reboot.pipe" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still used?
# Global logger class instance | ||
logger = syslogger.SysLogger(SYSLOG_IDENTIFIER) | ||
|
||
def execute_gnoi_command(command_args): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider making it really "execute_gnoi_command" by taking in more structural argument like "method", "message" etc and cooking up the command string inside this helper. This reduces boiler plate further in its callers.
"-module", "System", | ||
"-rpc", "RebootStatus" | ||
] | ||
returncode, stdout, stderr = execute_gnoi_command(status_cmd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious why this won't throw exception when the server go down and gnoi fail to connect to the serves? To make debugging easier we should probably clearly indicate and log where:
- DPU haven't went down yet, reboot in progress and gnoi rebootstatus get a response:
- DPU went down, gnoi rebootstatus cannot connect to remote but it is expected.
- DPU went back up, gnoi rebootstatus indicates reboot succeeded.
Provide support for SmartSwitch DPU module graceful shutdown.
Why I did it
Please refer to the HLD and related PRs:
HLD: # 1991 sonic-net/SONiC#1991
sonic-platform-common: #567 sonic-net/sonic-buildimage#567
How to verify it
Issue the "config chassis modules shutdown DPUx" command
Verify the DPU module is gracefully shut by checking the logs in /var/log/syslog on both NPU and DPU