Improve linux parallelism in CI. #6730
base: main
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | `@@ -191,6 +191,7 @@ runs:` |
| | | `--env "GITHUB_REPOSITORY=$GITHUB_REPOSITORY" \` |
| | | `--env "HOST_WORKSPACE=${{github.workspace}}" \` |
| | | `--env "JOB_ID=$JOB_ID" \` |
| | | `+ --env "NVCC_APPEND_FLAGS=-t=100" \` |

Contributor
Q: Does it make sense to run a single compilation in parallel when the build already runs in parallel? The other cores/threads will already be occupied by other builds.

Member
Yes, running device compiles in parallel improves build times quite a bit. For example, say you compile two TUs for five archs, one that takes 1 min per arch and the other 20 min per arch. With the default The sccache client ensures no more than

Contributor, Author
This is sccache magic -- sccache internally throttles jobs to

Contributor
Cool, thanks for the explanation!

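The two-TU example in this thread can be worked through with a small scheduling sketch. This is only an illustration, not how nvcc or sccache schedule work: `makespan`, the greedy earliest-free-core policy, and `CORES = 8` are assumptions introduced here, while the five archs and the 1 min / 20 min per-arch costs come from the comment above.

```python
import heapq

def makespan(task_minutes, cores):
    """Greedy list scheduling: each task goes to the core that frees up first."""
    free_at = [0] * cores
    heapq.heapify(free_at)
    for minutes in task_minutes:
        start = heapq.heappop(free_at)          # earliest-free core
        heapq.heappush(free_at, start + minutes)
    return max(free_at)

ARCHS = 5
FAST_MIN, SLOW_MIN = 1, 20   # minutes per arch, from the example in the thread
CORES = 8                    # hypothetical core count for this sketch

# One thread per TU: each TU compiles its five archs back to back,
# so every TU is a single long task.
serial = makespan([SLOW_MIN * ARCHS, FAST_MIN * ARCHS], CORES)

# Device compiles in parallel: every (TU, arch) pair is its own task.
parallel = makespan([SLOW_MIN] * ARCHS + [FAST_MIN] * ARCHS, CORES)

print(f"per-TU serial archs:   {serial} min")    # 100 min
print(f"parallel device builds: {parallel} min")  # 20 min
```

Under these assumptions the slow TU dominates the build either way, but splitting it by arch lets otherwise idle cores share its work.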
| | | `--env "NVIDIA_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES:-}" \` |
| | | `--env "RUNNER_TEMP=$RUNNER_TEMP" \` |
| | | `--volume "${ARTIFACT_ARCHIVES}:${ARTIFACT_ARCHIVES}" \` |

nvcc will not use more threads than the number of architectures it is compiling for. Perhaps in addition to `-t 16` we could also use `--split-compile 16` and `--split-compile-extended 16`.

Good question.
This would only work if those commands spawn jobs that sccache knows about.
Judging from the job server docs: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#jobserver-jobserver
It sounds like `--split-compile` is spawning new processes, and `--split-compile-extended` is threading within processes. So it is almost certainly not safe to use `--split-compile-extended`.
@trxcllnt Have you looked into these for the distributed build PR?

sccache disables `--split-compile` and forces cicc/ptxas to use one core. It is fine to pass `-t=100` because sccache will intercept that flag and use it to determine the number of parallel device compiles to kick off (all scheduled on the jobserver of course).
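The interplay described above, a large `-t` value that is still capped by a jobserver, can be sketched roughly as follows. This is not sccache's code: `parse_threads`, `run_device_compiles`, the semaphore standing in for a jobserver, and the `sleep` standing in for a device compile are all hypothetical names and simplifications.

```python
import re
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def parse_threads(flags, default=1):
    """Extract N from `-t=N`, `-t N`, `--threads=N` or `--threads N`."""
    m = re.search(r"(?:^|\s)(?:-t|--threads)[=\s](\d+)(?=\s|$)", flags)
    return int(m.group(1)) if m else default

def run_device_compiles(archs, requested, permits):
    """Run one placeholder compile per arch; return (results, peak concurrency)."""
    jobserver = threading.Semaphore(permits)   # stand-in for a jobserver
    lock = threading.Lock()
    running = peak = 0

    def compile_one(arch):
        nonlocal running, peak
        with jobserver:                        # block until a permit is free
            with lock:
                running += 1
                peak = max(peak, running)
            time.sleep(0.01)                   # placeholder for the real compile
            with lock:
                running -= 1
        return arch

    # The requested thread count is an upper bound; there is never more
    # work than there are archs.
    workers = max(1, min(requested, len(archs)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(compile_one, archs))
    return results, peak

archs = ["sm_70", "sm_80", "sm_86", "sm_89", "sm_90"]
requested = parse_threads("-t=100")
results, peak = run_device_compiles(archs, requested, permits=2)
print(requested, peak)   # peak never exceeds the 2 permits
```

The point of the sketch is that the requested thread count only sets how much work may be queued at once; the permit pool decides how much actually runs concurrently.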

Yes, I have looked into supporting `--split-compile`, but decided against it for now.
I could make one job reserve multiple permits from the jobserver when building locally. That'd block progress on other tasks (like preprocessing other files, etc.), so we'd need to measure whether it improves or degrades throughput.
It's more difficult for distributed compilation, since build servers would need to know how many threads a job will use before it fetches it. AFAIK that's not baked into AMQP or Celery, and I can't think of an easy way to accomplish that.
If a build server pulled a Big Job that needed 4 cores when only 1 was available (the default behavior), it would need to wait for 3 others to finish before starting Big Job. It could've possibly finished other jobs in the meantime, so again, we'd have to measure whether it actually improved the throughput.
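The trade-off in this comment can be explored with a toy model. The sketch below assumes a FIFO build server whose head job blocks until enough cores are free; `simulate`, the job durations, and the perfect 4x speed-up for the multi-core "Big Job" are all made-up assumptions, not measurements of sccache.

```python
import heapq

def simulate(jobs, cores):
    """FIFO server: the head job waits until `need` cores are free.

    jobs: list of (cores_needed, minutes), each needing at most `cores`.
    Returns (makespan, idle_core_minutes).
    """
    now, free, busy = 0, cores, 0
    running = []                               # heap of (finish_time, cores_held)
    for need, minutes in jobs:
        while free < need:                     # head-of-line blocking
            finish, held = heapq.heappop(running)
            now = max(now, finish)
            free += held
        free -= need
        heapq.heappush(running, (now + minutes, need))
        busy += need * minutes
    end = max([now] + [finish for finish, _ in running])
    return end, cores * end - busy

CORES = 4
small = [(1, 2), (1, 2)]

# Case 1: three cores busy for 10 min when the Big Job is pulled.
multi_1  = simulate([(1, 10)] * 3 + [(4, 5)] + small, CORES)
single_1 = simulate([(1, 10)] * 3 + [(1, 20)] + small, CORES)

# Case 2: one core busy for 30 min when the Big Job is pulled.
multi_2  = simulate([(1, 30), (4, 5)] + small, CORES)
single_2 = simulate([(1, 30), (1, 20)] + small, CORES)

print(multi_1, single_1)   # (17, 14) (20, 26): the 4-core job wins here
print(multi_2, single_2)   # (37, 94) (30, 66): the 4-core job loses here
```

Even in this simplified model the winner flips with the workload, which is consistent with the comment's conclusion that the effect on throughput would have to be measured.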