Skip to content

Module graceful shutdown support #255

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

rameshraghupathy
Copy link
Contributor

@rameshraghupathy rameshraghupathy commented May 14, 2025

Provide support for SmartSwitch DPU module graceful shutdown.

Why I did it
Please refer to the HLD and related PRs:
HLD: # 1991 sonic-net/SONiC#1991
sonic-platform-common: #567 sonic-net/sonic-buildimage#567

How to verify it
Issue the "config chassis modules shutdown DPUx" command
Verify the DPU module is gracefully shut by checking the logs in /var/log/syslog on both NPU and DPU

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@hdwhdw
Copy link
Contributor

hdwhdw commented May 19, 2025

Do you mind pasting the steps and output for testing (commands) in the PR description

# gnoi-reboot-daemon
#
# This daemon facilitates gNOI-based reboot operations for DPU subcomponents within the SONiC platform.
# It listens for JSON-formatted reboot requests on a named pipe and executes the corresponding gNOI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update that it is getting from module class?

try:
msg = json.loads(line)
dpu_ip = msg["dpu_ip"]
port = msg.get("port", 50052)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

port information?

msg = json.loads(line)
dpu_ip = msg["dpu_ip"]
port = msg.get("port", 50052)
method = msg.get("method", 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HALT method is 3

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

if [[ "$subtype" == "SmartSwitch" && "$is_dpu" != "True" ]]; then
exit 0
else
echo "gnoi-reboot-daemon is intended for SmartSwitch platforms only."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be debug level log, we don't want to see this on other platforms unnecessarily.


SYSLOG_IDENTIFIER = "gnoi-reboot-daemon"

FIFO_PATH = "/var/run/gnoi_reboot.pipe"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still used?

# Global logger class instance
logger = syslogger.SysLogger(SYSLOG_IDENTIFIER)

def execute_gnoi_command(command_args):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making it really "execute_gnoi_command" by taking in more structural argument like "method", "message" etc and cooking up the command string inside this helper. This reduces boiler plate further in its callers.

"-module", "System",
"-rpc", "RebootStatus"
]
returncode, stdout, stderr = execute_gnoi_command(status_cmd)
Copy link
Contributor

@hdwhdw hdwhdw May 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious why this won't throw exception when the server go down and gnoi fail to connect to the serves? To make debugging easier we should probably clearly indicate and log where:

  1. DPU haven't went down yet, reboot in progress and gnoi rebootstatus get a response:
  2. DPU went down, gnoi rebootstatus cannot connect to remote but it is expected.
  3. DPU went back up, gnoi rebootstatus indicates reboot succeeded.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants