fix: recovery after agent kill and rejoin #325

raresgaia123 · 2025-12-11T10:56:30Z

Description

When an agent that deployed workers is destroyed, the current implementation skips that agent if it is not found during the first iteration of recovery and continues with the remaining agents. This PR refactors the recovery code to remove that limitation: when an agent fails and is restarted, it is now included in the recovery process. In addition, when an agent joins the controller, all existing job contexts are updated with the details of that agent, so every job context can use its resources as soon as it joins.
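For reviewers, the intended behavior can be illustrated with a minimal sketch. It is not the code in this PR: `Controller`, `JobContext`, `on_agent_join`, and `recover` are hypothetical names chosen for illustration, and the real data structures may differ. The sketch only assumes that failed workers are tracked together with the agent that deployed them.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class JobContext:
    """Hypothetical per-job view of the agents whose resources it may use."""
    job_id: str
    agents: dict[str, str] = field(default_factory=dict)  # agent_id -> address


@dataclass
class Controller:
    """Hypothetical controller; these names are assumptions, not the PR's API."""
    connected: dict[str, str] = field(default_factory=dict)  # agent_id -> address
    jobs: dict[str, JobContext] = field(default_factory=dict)

    def on_agent_join(self, agent_id: str, address: str) -> None:
        # Register the agent, then propagate its details to every existing
        # job context so that all jobs can use its resources.
        self.connected[agent_id] = address
        for ctx in self.jobs.values():
            ctx.agents[agent_id] = address

    def recover(
        self, failed_workers: dict[str, str]
    ) -> tuple[dict[str, str], dict[str, str]]:
        """Run one recovery pass over worker_id -> deploying agent_id.

        Returns (redeployed, pending). Workers whose agent is not connected
        are kept in `pending` for a later pass instead of being skipped.
        """
        redeployed: dict[str, str] = {}
        pending: dict[str, str] = {}
        for worker_id, agent_id in failed_workers.items():
            if agent_id in self.connected:
                redeployed[worker_id] = agent_id
            else:
                pending[worker_id] = agent_id
        return redeployed, pending


# Usage: agent-b is down during the first pass and rejoins before the second.
ctrl = Controller()
ctrl.jobs["job-1"] = JobContext("job-1")
ctrl.on_agent_join("agent-a", "10.0.0.1")

first_done, first_pending = ctrl.recover({"w1": "agent-a", "w2": "agent-b"})

ctrl.on_agent_join("agent-b", "10.0.0.2")  # the killed agent restarts and rejoins
second_done, second_pending = ctrl.recover(first_pending)
```

In this sketch the first pass redeploys only `w1` and leaves `w2` pending; after `agent-b` rejoins, the second pass picks up `w2`, and `job-1` sees both agents in its context.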

Type of Change

Checklist

I have read the contributing guidelines
Existing issues have been referenced (where applicable)
I have verified this change is not present in other open pull requests
Functionality is documented
All code style checks pass
New code contribution is covered by automated tests
All new and existing tests pass

raresgaia123 changed the title ~~refactor: recovery after agent kill and rejoin~~ fix: recovery after agent kill and rejoin Dec 11, 2025

raresgaia123 force-pushed the recover_agent_failure branch from b7aa7cf to a28aed2 December 11, 2025 17:11

raresgaia123 force-pushed the recover_agent_failure branch from a28aed2 to 06ea259 December 13, 2025 04:28
