
Conversation

khluu (Collaborator) commented Dec 9, 2025

Change the structure and format of the CI files to use the new vLLM project Buildkite pipeline generator: https://github.com/vllm-project/ci-infra/tree/main/buildkite/pipeline_generator

khluu added 6 commits December 8, 2025 15:20
Signed-off-by: Kevin H. Luu <[email protected]>
@khluu khluu requested a review from hsliuustc0106 as a code owner December 9, 2025 00:38
chatgpt-codex-connector (bot) left a comment
💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines +11 to +14:

```yaml
- label: "Diffusion Model Test"
  timeout_in_minutes: 15
  commands:
    - pytest -s -v tests/single_stage/test_diffusion_model.py
```


P1: GPU tests no longer run in the built container

The GPU test steps are now plain command invocations without the docker or Kubernetes plugins that previously ran them inside public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT with HF cache mounts. With this change they execute directly on the host even though the build job still builds and pushes the container. Any GPU agent lacking the full Python environment or a pre-seeded HuggingFace cache (common in these pipelines) will therefore fail as soon as pytest starts, because the required dependencies and models are missing.
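For illustration, restoring container execution might look like the sketch below, using Buildkite's docker plugin. The image reference comes from the comment above; the plugin version, volume mount, and GPU option are illustrative assumptions, not the pipeline's actual configuration:

```yaml
- label: "Diffusion Model Test"
  timeout_in_minutes: 15
  plugins:
    - docker#v5.12.0:   # plugin version is an assumption
        image: "public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT"
        gpus: all
        volumes:
          # mount the host's pre-seeded HuggingFace cache (path assumed)
          - "/root/.cache/huggingface:/root/.cache/huggingface"
  commands:
    - pytest -s -v tests/single_stage/test_diffusion_model.py
```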


khluu added 2 commits December 8, 2025 16:41
Signed-off-by: Kevin H. Luu <[email protected]>
@khluu khluu changed the title [DNM][ci] Use new CI pipeline generator [ci] Refactor CI files to use new CI pipeline generator Dec 9, 2025

```yaml
    - pytest -s -v tests/multi_stages/

- label: "Omni Model Test with H100"
  timeout_in_minutes: 20
```
congw729 (Contributor) commented Dec 9, 2025:

We used to set the timeout to 15 minutes. @ywang96 do you agree with setting 20 minutes for testing on H100?

khluu (Collaborator Author): The timeout is already 20 minutes on the main branch: https://github.com/vllm-project/vllm-omni/blob/main/.buildkite/pipeline.yml#L55

congw729 (Contributor): Oops, my mistake! Thanks for catching that.

```yaml
  gpu: h100
  num_gpus: 2
  commands:
    - export VLLM_WORKER_MULTIPROC_METHOD=spawn
```
Contributor:
Do we also need to set the logging level here, to align with the Omni Model Test?
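For illustration only, aligning the logging setup might mean one more export before the test command. VLLM_LOGGING_LEVEL is vLLM's standard logging variable, but whether the Omni Model Test actually sets it, and to what level, is an assumption here:

```yaml
  commands:
    - export VLLM_WORKER_MULTIPROC_METHOD=spawn
    - export VLLM_LOGGING_LEVEL=DEBUG   # hypothetical, mirroring the other test
```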

Contributor:
Is this empty file mandatory for the Buildkite test?

khluu (Collaborator Author) commented Dec 9, 2025:
Ya it's used as an indicator whether a branch has the new refactored changes or not, to route CI bootstrap step to use the correct pipeline generator. The new pipeline generator wouldn't work with the old yaml file, and vice versa.
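A minimal sketch of such a bootstrap routing step, assuming a hypothetical marker-file path and generator invocation (the real ones live in vllm-project/ci-infra):

```yaml
steps:
  - label: ":pipeline: bootstrap"
    command: |
      # .buildkite/generator_marker is a placeholder name, not the actual file
      if [ -f .buildkite/generator_marker ]; then
        # branch has the refactored CI files: use the new pipeline generator
        python -m pipeline_generator | buildkite-agent pipeline upload
      else
        # legacy branch: upload the old hand-written YAML directly
        buildkite-agent pipeline upload .buildkite/pipeline.yml
      fi
```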

congw729 (Contributor) commented Dec 10, 2025:
> Ya it's used as an indicator whether a branch has the new refactored changes or not, to route CI bootstrap step to use the correct pipeline generator. The new pipeline generator wouldn't work with the old yaml file, and vice versa.

Thanks for the elaboration, very clear.

```yaml
  no_plugin: true

- label: "Diffusion Model Test"
  timeout_in_minutes: 15
```
Collaborator:

Is the timeout applied per label or per command?

khluu (Collaborator Author):
It's per job/label.
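Concretely, that means all commands under one label share a single budget, while each label gets its own independent timeout. A sketch for illustration (the second command and second step are hypothetical, not from this PR):

```yaml
# Both commands together must finish within the 15-minute budget.
- label: "Diffusion Model Test"
  timeout_in_minutes: 15
  commands:
    - pytest -s -v tests/single_stage/test_diffusion_model.py
    - pytest -s -v tests/single_stage/test_diffusion_utils.py   # hypothetical

# A separate label gets its own separate 15 minutes.
- label: "Diffusion Utils Test"   # hypothetical
  timeout_in_minutes: 15
  commands:
    - pytest -s -v tests/single_stage/test_diffusion_utils.py   # hypothetical
```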

ZJY0516 (Collaborator) commented Dec 11, 2025

Do we have any plan to merge this? @congw729 @khluu @ywang96

congw729 (Contributor) commented:

> Do we have any plan to merge this? @congw729 @khluu @ywang96

It looks good to me.

hsliuustc0106 (Collaborator) left a comment:

lgtm

khluu (Collaborator Author) commented Dec 14, 2025

I plan to merge this once we migrate vllm-project/vllm over to the new CI pipeline generator, right after the vLLM v0.13.0 release (Dec 17), so that we have a consistent CI file structure across vllm, vllm-omni, and the other ecosystem projects within vllm-project.
