Contributor

@Steboss Steboss commented May 16, 2025

This PR is a simple test:

  • make sure each k8s job is stopped after 3 hours
  • this prevents hanging jobs from clogging the EKS queue
  • and avoids errors in the ghcr secret deletion

@Steboss Steboss marked this pull request as draft May 16, 2025 15:06
@Steboss Steboss marked this pull request as ready for review May 16, 2025 15:15
@Steboss Steboss requested a review from olupton May 19, 2025 09:28
Contributor Author

Steboss commented May 19, 2025

@olupton I'd like to have your point of view on how to work out a solution here.
The idea is that each job has a maximum execution time of 3 hours, so that its secret and K8s job are correctly deleted without stopping all the other jobs. Ideally we should lower this limit to 1 hour, but that may not be enough for some of the jobs we're running.
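
For illustration only, here is a minimal sketch of how a per-job runtime limit like the one described above could be expressed. It assumes the limit is set through the standard Kubernetes Job field `spec.activeDeadlineSeconds`; the thread does not show the actual change, and the helper `with_runtime_limit`, the job name, and the image are hypothetical placeholders.

```python
import copy
import json

MAX_RUNTIME_HOURS = 3  # the limit discussed in this PR


def with_runtime_limit(job_manifest: dict, hours: float = MAX_RUNTIME_HOURS) -> dict:
    """Return a copy of a Job manifest with spec.activeDeadlineSeconds set.

    Assumption: the limit is enforced by Kubernetes itself, which marks the
    Job as failed and terminates its pods once the deadline is exceeded.
    """
    limited = copy.deepcopy(job_manifest)  # leave the caller's manifest untouched
    limited.setdefault("spec", {})["activeDeadlineSeconds"] = int(hours * 3600)
    return limited


# Hypothetical minimal Job; the name and image are placeholders, not from this repo.
example_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-job"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "main", "image": "example/image:latest"}],
            }
        }
    },
}

if __name__ == "__main__":
    # Print the manifest with the 3-hour limit applied.
    print(json.dumps(with_runtime_limit(example_job), indent=2))
```

Lowering the limit to 1 hour, as suggested above, would then only mean passing `hours=1`.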

Collaborator

@olupton olupton left a comment

I think the runtime limit can be added to .github/eks-workflow-files/mpi-nccl-test.yml too.

This seems like an improvement in principle. Did you do any experiments to make sure the error case handling actually works (e.g. run a job with the limit set to a tiny value)?

3 hours is a lot, but we might have to tune the value down in a second pass.
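
One possible way to check the error case in a tiny-limit experiment is sketched below. It assumes the limit is implemented with the Job's `activeDeadlineSeconds` field, in which case Kubernetes reports a `Failed` condition with reason `DeadlineExceeded` in the Job status. The helper `hit_deadline` and the sample status are hypothetical; the input is the JSON that `kubectl get job <name> -o json` returns.

```python
def hit_deadline(job: dict) -> bool:
    """Return True if the Job failed because it exceeded its active deadline."""
    conditions = job.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Failed"
        and c.get("status") == "True"
        and c.get("reason") == "DeadlineExceeded"
        for c in conditions
    )


# Hypothetical status of a Job that was run with a very small limit.
timed_out_job = {
    "spec": {"activeDeadlineSeconds": 30},
    "status": {
        "conditions": [
            {"type": "Failed", "status": "True", "reason": "DeadlineExceeded"}
        ]
    },
}

# Hypothetical status of a Job that finished normally.
finished_job = {
    "status": {"conditions": [{"type": "Complete", "status": "True"}]}
}

if __name__ == "__main__":
    print(hit_deadline(timed_out_job))
    print(hit_deadline(finished_job))
```

A check like this only confirms that the Job itself was stopped; whether the secret cleanup then runs correctly still has to be verified in the workflow.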

Contributor Author

Steboss commented May 19, 2025

@olupton
I did actually let the job run to check that everything worked correctly. The AXLearn job that was hanging was correctly deleted, while all the other jobs ran without errors in the ghcr secret deletion.

@Steboss Steboss requested a review from olupton May 19, 2025 20:01
@Steboss Steboss merged commit d948e6c into main May 20, 2025
62 of 64 checks passed
@Steboss Steboss deleted the sbosisio/k8s_orphan branch May 20, 2025 09:54