Conversation

@a114j0y
Contributor

a114j0y commented Jun 9, 2025

What I did
I implemented RetryCache, which

  • allows orch agents to communicate in push mode to schedule retries.

  • Document - RetryCache High Level Design (HLD PR#1822)

  • The document explains

    1. how to parse the failure reason (constraint) for a task
    2. how to detect constraint resolution and send notifications
    3. how to quickly archive and restore pending tasks
    4. how to keep the retry cache in sync with the current syncing cache

  A minimal sketch of the archive/restore idea is shown right after this list.
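A minimal sketch of the archive/restore idea (the container layout and method names are illustrative assumptions, not the PR's exact code; only the Constraint notion mirrors the type in the diff below):

#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: tasks that fail on a constraint are archived out
// of the hot syncing path, then restored in bulk once the constraint is
// resolved, so the consumer retries only what has become ready.
using Constraint = std::pair<int, std::string>;  // e.g. {RETRY_CST_NHG, "nhg_key"}
using Task = std::pair<std::string, std::vector<std::string>>;  // key + field tuples

class RetryCache
{
public:
    // Archive a failed task under the constraint that blocked it.
    void archive(const Constraint &cst, const Task &task)
    {
        m_archived[cst].push_back(task);
    }

    // On a resolution notification, hand back every task blocked on
    // this constraint and drop them from the archive.
    std::vector<Task> restore(const Constraint &cst)
    {
        std::vector<Task> ready;
        auto it = m_archived.find(cst);
        if (it != m_archived.end())
        {
            ready.swap(it->second);
            m_archived.erase(it);
        }
        return ready;
    }

private:
    std::map<Constraint, std::vector<Task>> m_archived;
};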

Why I did it

  • Orchagent currently retries in pull mode: it polls periodically to check whether the constraints have been removed.

    • At scale this slows down the workflow: when many pending tasks are still unready, they are all retried one by one, O(N) work in every event loop.
  • Our goal is to skip unnecessary retries: instead of polling everything, we let some consumers push the news.

  1. Push mode is better in some scenarios (a sketch follows this list).
    • For example, when route syncing fails because a next hop group is missing, the routes should wait until the next hop is created before retrying.
    • It is better for nhgorch to push a notification (next hop created) to routeorch than for routeorch to poll nhgorch periodically.
  2. Pull mode is still good in most scenarios.
    • For example, when a task is waiting for SAI to create an interface, SAI sends no notification,
    • so a periodic pull is necessary.
  3. The retry strategy should therefore be event-based, allowing some consumers to use push mode to announce constraint resolutions and kick off the corresponding retries.
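A sketch of the push flow, reusing the notifyRetry signature from this PR's diff (onNhgCreated and the makeConstraint helper are illustrative assumptions; this fragment assumes it sits inside nhgorch):

// Sketch only: nhgorch announces that a next hop group was created, so
// routeorch restores and retries just the routes archived under that
// constraint, instead of polling nhgorch in every event loop.
void NhgOrch::onNhgCreated(const std::string &nhgKey)
{
    // RETRY_CST_NHG is one of the constraint kinds named in this PR;
    // makeConstraint is a hypothetical helper that builds the pair.
    Constraint cst = makeConstraint(RETRY_CST_NHG, nhgKey);

    // Push the news: the ROUTE_TABLE consumer of routeorch moves the
    // tasks blocked on this NHG back into its syncing queue.
    notifyRetry(gRouteOrch, APP_ROUTE_TABLE_NAME, cst);
}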

How I verified it
I injected invalid routes with inject_invalid_routes.py, then measured the gap between two consecutive event-loop iterations. Without RetryCache, the gap grows linearly with the number of invalid routes pending in the syncing cache.

With RetryCache, those pending tasks are archived aside, and the consumer no longer retries them at the end of each event loop. The gap is therefore decoupled from the number of pending tasks, i.e. O(1). A minimal sketch of the measurement is shown below.
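A minimal sketch of the gap measurement (the loop body and logging are assumptions standing in for orchagent's select/drain cycle, not the actual test harness):

#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical stand-in for one orchagent event-loop iteration
// (select on consumers + drain their tasks).
void runOneIteration()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
}

int main()
{
    using clock = std::chrono::steady_clock;
    auto last = clock::now();
    for (int i = 0; i < 10; ++i)
    {
        runOneIteration();
        auto now = clock::now();
        auto gapMs = std::chrono::duration_cast<std::chrono::milliseconds>(now - last).count();
        // Without RetryCache this gap grows with the number of pending
        // tasks retried per loop; with RetryCache it stays flat.
        std::cout << "iteration " << i << " gap=" << gapMs << " ms\n";
        last = now;
    }
    return 0;
}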

[perf: screenshot of event-loop gap measurements]

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@prabhataravind
Contributor

Hi @a114j0y, is there a design document for the changes implemented in this PR?

@a114j0y
Contributor Author

a114j0y commented Nov 25, 2025

> @prabhataravind, can you review and sign off? Thanks.
> @a114j0y, looks to be an efficient strategy.

Hi @prabhataravind, could you help merge it so we can include it in the 202511 release?

@prabhataravind
Contributor

prabhataravind commented Nov 30, 2025

@a114j0y have you tested this with warm-boot? @vaibhavhd for viz.


for (auto &it : m_consumerMap)
{
    count += retryToSync(it.first, threshold - count);

Contributor

Please add a comment here.

    return false;
}

size_t Orch::retryToSync(const std::string &executorName, size_t threshold)

Contributor

Please add documentation for this function explaining what threshold is.
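One possible doc comment, inferred from the call site above (the semantics described here are an assumption, not the author's wording):

/**
 * Move archived retry tasks of the given executor back into its syncing
 * queue so they are re-attempted on the next drain.
 *
 * @param executorName  Name of the consumer/executor whose archived
 *                      tasks should be restored.
 * @param threshold     Upper bound on how many tasks to move back in
 *                      this pass, so a single executor cannot
 *                      monopolize one event loop.
 * @return              The number of tasks actually moved back.
 */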


    return count;
}

void Orch::notifyRetry(Orch *retryOrch, const std::string &executorName, const Constraint &cst)

Contributor

@a114j0y who is calling this function for other constraints like RETRY_CST_NHG, RETRY_CST_NHG_REF, RETRY_CST_ECMP etc?

Contributor Author

a114j0y commented Dec 1, 2025

That code will land in a separate NHG-related PR, which uses RetryCache for optimization. I can move these definitions to that PR.


@a114j0y
Contributor Author

a114j0y commented Dec 1, 2025

To pass the DCO check:

git rebase HEAD~11 --signoff
git push --force-with-lease origin master
