Improve linux parallelism in CI. #6730

alliepiper · 2025-11-21T18:52:01Z

sccache internally throttles jobs, and we've already been doing this on Windows. This will improve build utilization.

copy-pr-bot · 2025-11-21T18:52:04Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

oleksandr-pavlyk · 2025-11-21T20:58:27Z

.github/actions/workflow-run-job-linux/action.yml

            --env "GITHUB_REPOSITORY=$GITHUB_REPOSITORY" \
            --env "HOST_WORKSPACE=${{github.workspace}}" \
            --env "JOB_ID=$JOB_ID" \
+            --env "NVCC_APPEND_FLAGS=-t=100" \


nvcc will not use more threads than the number of architectures it is compiling for. Perhaps in addition to -t 16 we could also use --split-compile 16 and --split-compile-extended 16.

--threads <number> (-t)
Specify the maximum number of threads to be created in parallel when compiling for multiple architectures. If <number> is 1 or if compiling for one architecture, this option is ignored. If <number> is 0, the number of threads will be the number of CPUs on the machine.

--split-compile <number> (-split-compile)
Specify the maximum amount of concurrent threads to be utilized when running compiler optimizations. If <number> is 1, this option is ignored. If <number> is 0, the number of threads will be the number of CPUs on the machine. This option will have minimal (if any) impact on performance of the compiled binary.

--split-compile-extended <number> (-split-compile-extended)
Specify the maximum amount of concurrent threads to be utilized when running compiler optimizations in LTO mode. If <number> is 1, this option is ignored. If <number> is 0, the number of threads will be the number of CPUs on the machine. This option is a more aggressive form of split compilation, and can potentially impact performance of the compiled binary. It is available in LTO mode only.
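As a rough illustration of the --threads rules quoted above, the sketch below models how many per-architecture compiles could overlap for a given value. It only restates the quoted documentation plus the observation that nvcc will not use more threads than architectures; the function name and the `n_cpus` parameter are made up for this example and are not part of nvcc or sccache.

```python
import os
from typing import Optional


def effective_device_threads(threads: int, n_archs: int,
                             n_cpus: Optional[int] = None) -> int:
    """Hypothetical model of the --threads rules quoted above.

    Returns how many per-arch device compiles may run at once.
    """
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if threads == 0:
        # "If <number> is 0, the number of threads will be the number of CPUs"
        threads = n_cpus
    if threads == 1 or n_archs == 1:
        # "If <number> is 1 or if compiling for one architecture,
        #  this option is ignored"
        return 1
    # No more threads are useful than there are architectures to compile.
    return min(threads, n_archs)


if __name__ == "__main__":
    # With -t=100 and five architectures, at most five compiles overlap.
    print(effective_device_threads(100, n_archs=5, n_cpus=8))
```

Under this model a large value such as -t=100 simply means "one thread per architecture", which is why the exact number matters little once it exceeds the arch count.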

Good question.

This would only work if those commands spawn jobs that sccache knows about.

Judging from the job server docs: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#jobserver-jobserver

When using -split-compile or --threads inside of a build controlled by GNU Make, require that job slots are acquired from Make’s jobserver for each of the threads used, helping prevent oversubscription. This option does not restrict -split-compile-extended (the number of threads created by it will not be controlled).

It sounds like --split-compile is spawning new processes, and --split-compile-extended is threading within processes. So almost certainly not safe to use --split-compile-extended.

@trxcllnt Have you looked into these for the distributed build PR?

sccache disables --split-compile and forces cicc/ptxas to use one core. It is fine to pass -t=100 because sccache will intercept that flag and use it to determine the number of parallel device compiles to kick off (all scheduled on the jobserver of course).

Have you looked into these for the distributed build PR?

Yes, I have looked into supporting --split-compile, but decided against it for now.

I could make one job reserve multiple permits from the jobserver when building locally. That'd block progress on other tasks (like preprocessing other files, etc.), so we'd need to measure whether it improves or degrades throughput.

It's more difficult for distributed compilation, since build servers would need to know how many threads a job will use before it fetches it. AFAIK that's not baked into AMQP or Celery, and I can't think of an easy way to accomplish that.

If a build server pulled a Big Job that needed 4 cores when only 1 was available (the default behavior), it would need to wait for 3 others to finish before starting Big Job. It could've possibly finished other jobs in the meantime, so again, we'd have to measure whether it actually improved the throughput.
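To make the cost of that wait concrete, here is a small back-of-the-envelope model. It assumes a job that needs every core on the server and counts the core-minutes that sit idle while the already-running jobs drain. The function name and the remaining-time numbers are invented for illustration; this is not how sccache, AMQP, or Celery schedule work.

```python
from typing import Sequence


def idle_core_minutes(remaining_minutes: Sequence[float],
                      total_cores: int) -> float:
    """Core-minutes wasted while a job needing all cores waits to start.

    remaining_minutes: time left on each job that is currently running
    (hypothetical inputs). Cores not listed are assumed to be free already.
    """
    if len(remaining_minutes) > total_cores:
        raise ValueError("more running jobs than cores")
    # The big job can only start once the slowest running job finishes.
    wait = max(remaining_minutes, default=0)
    # Capacity available during the wait, minus the capacity actually used.
    return wait * total_cores - sum(remaining_minutes)


if __name__ == "__main__":
    # 4-core server, 3 cores busy with 2, 5, and 10 minutes left:
    # the free core idles 10 min, the others idle 8, 5, and 0 min.
    print(idle_core_minutes([2, 5, 10], total_cores=4))
```

Whether reserving multiple permits is a net win depends on whether the multi-threaded job finishes sooner by more than this idle time, which is why it would have to be measured.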

alliepiper · 2025-11-21T21:40:52Z

This option has the unfortunate side effect that all of our worst TUs are now more likely to launch in parallel 😄

Some Thrust tests are triggering OOMs, gonna put this on draft while I port some infra from CUB to help split these up.

github-actions · 2025-11-22T00:49:54Z

😬 CI Workflow Results

🟥 Finished in 3h 04m: Pass: 97%/267 | Total: 3d 06h | Max: 3h 01m | Hits: 98%/361824

See results here.

davebayer · 2025-11-22T08:52:02Z

.github/actions/workflow-run-job-linux/action.yml

            --env "GITHUB_REPOSITORY=$GITHUB_REPOSITORY" \
            --env "HOST_WORKSPACE=${{github.workspace}}" \
            --env "JOB_ID=$JOB_ID" \
+            --env "NVCC_APPEND_FLAGS=-t=100" \


Q: does it make sense to run a single compilation in parallel, when the build already runs in parallel? Other cores/threads will be already occupied by other builds

Yes, running device compiles in parallel improves build times quite a bit.

For example, say you compile two TUs for five archs: one takes 1 min per arch and the other 20 min per arch. With the default -t=1, the device compiles are serialized, so the first object builds in 5 min and the second in 100 min. If you increase to -t=5, the first object builds in 1 min and the second in 20 min.
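The arithmetic in that example can be checked with a few lines. This is a simplified model that assumes every arch takes the same time and that there is no contention from other compilations; the function name is made up for this sketch.

```python
import math


def object_build_minutes(n_archs: int, minutes_per_arch: float,
                         threads: int) -> float:
    """Wall-clock minutes to build one object under the simplified model."""
    # Device compiles run in "waves" of at most `threads` archs at a time.
    parallel = max(1, min(threads, n_archs))
    waves = math.ceil(n_archs / parallel)
    return waves * minutes_per_arch


if __name__ == "__main__":
    for t in (1, 5):
        small = object_build_minutes(5, 1, t)
        large = object_build_minutes(5, 20, t)
        print(f"-t={t}: small TU {small} min, large TU {large} min")
```

With five archs this gives 5 and 100 minutes at -t=1, and 1 and 20 minutes at -t=5, matching the numbers above.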

The sccache client ensures that no more than $(nproc) compilations run concurrently, but even with thousands of TUs, compiling the device code for each arch in parallel is still significantly faster than serializing them.

This is sccache magic -- sccache internally throttles jobs to nproc, so we can take advantage of both nvcc -t and ninja -j simultaneously without manual load balancing.
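The general idea of throttling to a fixed number of slots can be sketched with a counting semaphore. This is only an illustration of permit-based throttling under assumed names (`run_throttled`, `max_parallel`); it is not sccache's actual implementation.

```python
import threading
import time


def run_throttled(n_tasks: int, max_parallel: int,
                  work_seconds: float = 0.01) -> int:
    """Run n_tasks stand-in 'compiles', never more than max_parallel at once.

    Returns the peak number of tasks observed running concurrently.
    """
    permits = threading.Semaphore(max_parallel)  # one permit per core
    lock = threading.Lock()
    running = 0
    peak = 0

    def task() -> None:
        nonlocal running, peak
        with permits:  # block until a slot is free
            with lock:
                running += 1
                peak = max(peak, running)
            time.sleep(work_seconds)  # stand-in for a compile
            with lock:
                running -= 1

    threads = [threading.Thread(target=task) for _ in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak


if __name__ == "__main__":
    # 40 requested jobs, but only 4 slots: the peak never exceeds 4.
    print(run_throttled(n_tasks=40, max_parallel=4))
```

However many jobs the build tool and nvcc request between them, the number actually running stays bounded by the permit count, which is the property the comment above relies on.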

github-project-automation bot added this to CCCL Nov 21, 2025

github-project-automation bot moved this to Todo in CCCL Nov 21, 2025

cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Nov 21, 2025

alliepiper requested a review from trxcllnt November 21, 2025 18:52

alliepiper marked this pull request as ready for review November 21, 2025 18:52

alliepiper requested a review from a team as a code owner November 21, 2025 18:52

trxcllnt approved these changes Nov 21, 2025

github-project-automation bot moved this from In Progress to In Review in CCCL Nov 21, 2025

bernhardmgruber approved these changes Nov 21, 2025

alliepiper added 2 commits November 21, 2025 20:31

Improve linux parallelism in CI.

31627e9

sccache internally throttles jobs and we've been doing this on windows.

Split up thrust transform test.

31ae96c

This test required > 5GB of RAM to compile and led to OOM issues on CI runners.

oleksandr-pavlyk reviewed Nov 21, 2025

WIP Need to port %PARAM% to Thrust.

e33bcb5

alliepiper force-pushed the nvcc-parallel branch from c0e0028 to e33bcb5 Compare November 21, 2025 21:38

alliepiper requested a review from a team as a code owner November 21, 2025 21:38

alliepiper requested a review from elstehle November 21, 2025 21:38

alliepiper marked this pull request as draft November 21, 2025 21:39

cccl-authenticator-app bot moved this from In Review to In Progress in CCCL Nov 21, 2025

davebayer reviewed Nov 22, 2025
