Conversation

@a114j0y
Contributor

a114j0y commented Jun 9, 2025

What I did
I implemented RetryCache, which

  • allows orch agents to communicate in push mode to schedule retries.

  • Document - RetryCache High Level Design (HLD PR#1822)

  • The document explains

    1. how to parse the failure reason (constraint) for a task
    2. how to detect constraint resolution and send notifications
    3. how to quickly archive and restore pending tasks
    4. how to keep the retry cache in sync with the current syncing cache

  A minimal sketch of the archive/restore idea is shown right after this list.
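A minimal sketch of the archive/restore idea (the container layout and method names are illustrative assumptions, not the PR's exact code; only the Constraint notion mirrors the type in the diff below):

#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: tasks that fail on a constraint are archived out
// of the hot syncing path, then restored in bulk once the constraint is
// resolved, so the consumer retries only what has become ready.
using Constraint = std::pair<int, std::string>;  // e.g. {RETRY_CST_NHG, "nhg_key"}
using Task = std::pair<std::string, std::vector<std::string>>;  // key + field tuples

class RetryCache
{
public:
    // Archive a failed task under the constraint that blocked it.
    void archive(const Constraint &cst, const Task &task)
    {
        m_archived[cst].push_back(task);
    }

    // On a resolution notification, hand back every task blocked on
    // this constraint and drop them from the archive.
    std::vector<Task> restore(const Constraint &cst)
    {
        std::vector<Task> ready;
        auto it = m_archived.find(cst);
        if (it != m_archived.end())
        {
            ready.swap(it->second);
            m_archived.erase(it);
        }
        return ready;
    }

private:
    std::map<Constraint, std::vector<Task>> m_archived;
};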

Why I did it

  • Orchagent currently retries in pull mode: it polls periodically to check whether the constraints have been removed.

    • At scale this slows down the workflow: when many pending tasks are still unready, they are all retried one by one, O(N) work in every event loop.
  • Our goal is to skip unnecessary retries: instead of polling everything, we let some consumers push the news.

  1. Push mode is better in some scenarios (a sketch follows this list).
    • For example, when route syncing fails because a next hop group is missing, the routes should wait until the next hop is created before retrying.
    • It is better for nhgorch to push a notification (next hop created) to routeorch than for routeorch to poll nhgorch periodically.
  2. Pull mode is still good in most scenarios.
    • For example, when a task is waiting for SAI to create an interface, SAI sends no notification,
    • so a periodic pull is necessary.
  3. The retry strategy should therefore be event-based, allowing some consumers to use push mode to announce constraint resolutions and kick off the corresponding retries.
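A sketch of the push flow, reusing the notifyRetry signature from this PR's diff (onNhgCreated and the makeConstraint helper are illustrative assumptions; this fragment assumes it sits inside nhgorch):

// Sketch only: nhgorch announces that a next hop group was created, so
// routeorch restores and retries just the routes archived under that
// constraint, instead of polling nhgorch in every event loop.
void NhgOrch::onNhgCreated(const std::string &nhgKey)
{
    // RETRY_CST_NHG is one of the constraint kinds named in this PR;
    // makeConstraint is a hypothetical helper that builds the pair.
    Constraint cst = makeConstraint(RETRY_CST_NHG, nhgKey);

    // Push the news: the ROUTE_TABLE consumer of routeorch moves the
    // tasks blocked on this NHG back into its syncing queue.
    notifyRetry(gRouteOrch, APP_ROUTE_TABLE_NAME, cst);
}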

How I verified it
I injected invalid routes with inject_invalid_routes.py, then measured the gap between two consecutive event-loop iterations. Without RetryCache, the gap grows linearly with the number of invalid routes pending in the syncing cache.

With RetryCache, those pending tasks are archived aside, and the consumer no longer retries them at the end of each event loop. The gap is therefore decoupled from the number of pending tasks, i.e. O(1). A minimal sketch of the measurement is shown below.
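A minimal sketch of the gap measurement (the loop body and logging are assumptions standing in for orchagent's select/drain cycle, not the actual test harness):

#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for one orchagent event-loop iteration
// (select on consumers + drain their tasks).
void runOneIteration()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}

int main()
{
    using clock = std::chrono::steady_clock;
    auto last = clock::now();
    for (int i = 0; i < 10; ++i)
    {
        runOneIteration();
        auto now = clock::now();
        auto gapMs = std::chrono::duration_cast<std::chrono::milliseconds>(now - last).count();
        // Without RetryCache this gap grows with the number of pending
        // tasks retried per loop; with RetryCache it stays flat.
        std::cout << "iteration " << i << " gap=" << gapMs << " ms\n";
        last = now;
    }
    return 0;
}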

[perf: screenshot of event-loop gap measurements]

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@prabhataravind
Contributor

Hi @a114j0y, is there a design document for the changes implemented in this PR?

@a114j0y
Contributor Author

a114j0y commented Nov 25, 2025

> @prabhataravind, can you review and sign off? Thanks.
> @a114j0y, looks to be an efficient strategy.

Hi @prabhataravind, could you help merge it so we can include it in the 202511 release?

@prabhataravind
Contributor

prabhataravind commented Nov 30, 2025

@a114j0y have you tested this with warm-boot? @vaibhavhd for viz.


for (auto &it : m_consumerMap)
{
    count += retryToSync(it.first, threshold - count);

Contributor

Please add a comment here.

    return false;
}

size_t Orch::retryToSync(const std::string &executorName, size_t threshold)

Contributor

Please add documentation for this function explaining what threshold is.
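One possible doc comment, inferred from the call site above (the semantics described here are an assumption, not the author's wording):

/**
 * Move archived retry tasks of the given executor back into its syncing
 * queue so they are re-attempted on the next drain.
 *
 * @param executorName  Name of the consumer/executor whose archived
 *                      tasks should be restored.
 * @param threshold     Upper bound on how many tasks to move back in
 *                      this pass, so a single executor cannot
 *                      monopolize one event loop.
 * @return              The number of tasks actually moved back.
 */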


    return count;
}

void Orch::notifyRetry(Orch *retryOrch, const std::string &executorName, const Constraint &cst)

Contributor

@a114j0y who is calling this function for other constraints like RETRY_CST_NHG, RETRY_CST_NHG_REF, RETRY_CST_ECMP etc?

Contributor Author

a114j0y commented Dec 1, 2025

That code will land in a separate NHG-related PR, which uses RetryCache for optimization. I can move these definitions to that PR.


@a114j0y
Contributor Author

a114j0y commented Dec 1, 2025

To pass the DCO check:

git rebase HEAD~11 --signoff
git push --force-with-lease origin master
