@raresgaia123
Collaborator

Description

When an agent that deployed workers is destroyed, the current implementation skips that agent if it is not found during the first iteration of recovery and continues with the remaining agents. This PR refactors the recovery code to remove that limitation: when an agent fails and is restarted, it is now included in the recovery process. In addition, when an agent joins the controller, all existing job contexts are updated with that agent's details, so every job context can use the resources of any agent that joins.
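
The intended behavior can be illustrated with a minimal sketch. This is not the code from this PR: `Controller`, `JobContext`, `Agent`, and every method and field name below are hypothetical, and the real redeploy step is reduced to a comment. It only shows the two ideas from the description: a joining agent is propagated to all existing job contexts, and a failed worker whose agent is missing stays pending for a later recovery pass instead of being skipped for good.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Agent:
    agent_id: str  # hypothetical identifier for an agent


@dataclass
class JobContext:
    job_id: str
    # Agents this job may use, keyed by agent id (hypothetical layout).
    agents: dict[str, Agent] = field(default_factory=dict)
    # Failed workers awaiting recovery: worker id -> id of the deploying agent.
    failed_workers: dict[str, str] = field(default_factory=dict)


class Controller:
    def __init__(self) -> None:
        self.agents: dict[str, Agent] = {}
        self.jobs: dict[str, JobContext] = {}

    def add_job(self, job: JobContext) -> None:
        # A new job starts with every agent currently known to the controller.
        job.agents.update(self.agents)
        self.jobs[job.job_id] = job

    def on_agent_join(self, agent: Agent) -> None:
        # Register the agent and update all existing job contexts,
        # so a restarted agent becomes visible to jobs created before it rejoined.
        self.agents[agent.agent_id] = agent
        for job in self.jobs.values():
            job.agents[agent.agent_id] = agent

    def on_agent_lost(self, agent_id: str) -> None:
        # Forget a destroyed agent everywhere; its failed workers stay recorded.
        self.agents.pop(agent_id, None)
        for job in self.jobs.values():
            job.agents.pop(agent_id, None)

    def recover(self, job_id: str) -> list[str]:
        """Run one recovery pass and return the worker ids handled in it."""
        job = self.jobs[job_id]
        recovered: list[str] = []
        for worker_id, agent_id in list(job.failed_workers.items()):
            if agent_id not in job.agents:
                # Agent has not rejoined yet: keep the worker pending
                # so a later pass can pick it up.
                continue
            # The actual redeploy on the agent would happen here.
            del job.failed_workers[worker_id]
            recovered.append(worker_id)
        return recovered
```

In this sketch, calling `recover` again after `on_agent_join` picks up the workers that were left pending while their agent was away.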

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

@raresgaia123 changed the title from "refactor: recovery after agent kill and rejoin" to "fix: recovery after agent kill and rejoin" on Dec 11, 2025
