-
Notifications
You must be signed in to change notification settings - Fork 116
[ci] Refactor CI files to use new CI pipeline generator #246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Kevin H. Luu <[email protected]>
Signed-off-by: Kevin H. Luu <[email protected]>
Signed-off-by: Kevin H. Luu <[email protected]>
Signed-off-by: Kevin H. Luu <[email protected]>
Signed-off-by: Kevin H. Luu <[email protected]>
Signed-off-by: Kevin H. Luu <[email protected]>
Signed-off-by: Kevin H. Luu <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| - label: "Diffusion Model Test" | ||
| timeout_in_minutes: 15 | ||
| commands: | ||
| - pytest -s -v tests/single_stage/test_diffusion_model.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GPU tests no longer run in built container
The GPU test steps are now plain command invocations without the docker or Kubernetes plugins that previously ran them inside public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT with HF cache mounts. With this change they execute directly on the host even though the build job still builds and pushes the container, so any GPU agent lacking the full Python environment or pre-seeded HuggingFace cache (common in these pipelines) will fail as soon as pytest starts because the required dependencies/models are missing.
Useful? React with 👍 / 👎.
Signed-off-by: Kevin H. Luu <[email protected]>
Signed-off-by: Kevin H. Luu <[email protected]>
| - pytest -s -v tests/multi_stages/ | ||
|
|
||
| - label: "Omni Model Test with H100" | ||
| timeout_in_minutes: 20 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We used to set the timeout to 15 minutes. @ywang96 Do you agree we set 20 minutes for testing on H100?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout is already 20 minutes on main branch https://github.com/vllm-project/vllm-omni/blob/main/.buildkite/pipeline.yml#L55
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timeout is already 20 minutes on main branch https://github.com/vllm-project/vllm-omni/blob/main/.buildkite/pipeline.yml#L55
Oops, my mistake! Thanks for catching that.
| gpu: h100 | ||
| num_gpus: 2 | ||
| commands: | ||
| - export VLLM_WORKER_MULTIPROC_METHOD=spawn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we also need to set the logging level here, align with the Omni Model Test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this empty file mandatory for the Buildkite test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya it's used as an indicator whether a branch has the new refactored changes or not, to route CI bootstrap step to use the correct pipeline generator. The new pipeline generator wouldn't work with the old yaml file, and vice versa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya it's used as an indicator whether a branch has the new refactored changes or not, to route CI bootstrap step to use the correct pipeline generator. The new pipeline generator wouldn't work with the old yaml file, and vice versa.
Thanks for the elaboration, very clear.
| no_plugin: true | ||
|
|
||
| - label: "Diffusion Model Test" | ||
| timeout_in_minutes: 15 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the timeout applied per label or per command?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's per job/label
hsliuustc0106
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
|
I plan to merge this once we migrate |
Change structure & format of CI files to use new vLLM project Buildkite pipeline generator https://github.com/vllm-project/ci-infra/tree/main/buildkite/pipeline_generator