I had kept gpuci on CUDA 12.6 because there were known issues with OOMs, hoping that things would resolve themselves. After having to upgrade the driver to support CUDA 13, the issues are back. Basically, there are three failure modes:
```
base/random: Error During Test at /var/lib/buildkite-agent/builds/gpuci-13/julialang/cuda-dot-jl/test/base/random.jl:201
Got exception outside of a @test
Out of GPU memory
Effective GPU memory usage: 99.92% (4.746 GiB/4.750 GiB)
Memory pool usage: 256.591 MiB (288.000 MiB reserved)
```
Sometimes this manifests as a SIGTERM, which I think comes from earlyoom or something similar killing the process:

```
[242914] signal (15): Terminated
```
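As a first diagnostic, here is a minimal sketch of what could be run between testsets to check whether a previous process left memory behind, and to hand cached pool blocks back to the driver. `CUDA.memory_status` and `CUDA.reclaim` are existing CUDA.jl entry points; actually wiring this into the test harness is the hypothetical part:

```julia
using CUDA

# Print the same "Effective GPU memory usage" / "Memory pool usage" report
# as in the failure above, to see how much memory is actually free right now.
CUDA.memory_status()

# Release what we can: collect dead Julia objects (which also frees their
# CuArray buffers), then ask the pool to return cached blocks to the driver.
GC.gc(true)
CUDA.reclaim()

CUDA.memory_status()
```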
Now, for a hint as to where all this may come from:

```
Some tests did not pass: 16 passed, 0 failed, 3 errored, 0 broken.
core/codegen: Error During Test at /var/lib/buildkite-agent/builds/gpuci-11/julialang/cuda-dot-jl/test/core/codegen.jl:194
Test threw exception
Expression: CUDA.code_sass(devnull, valid_kernel, Tuple{}) == nothing
CUPTIError: CUPTI doesn't allow multiple callback subscribers. Only a single subscriber can be registered at a time. (code 39, CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED)
```
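For reference, the failing expression boils down to something like this (the kernel body below is a stand-in, not the testsuite's exact `valid_kernel` definition):

```julia
using CUDA, Test

# Stand-in for the testsuite's valid_kernel: any trivially compilable kernel
# that returns nothing exercises the same path.
valid_kernel() = return

# code_sass compiles the kernel and dumps its SASS; judging by the error above,
# that path registers a CUPTI callback subscriber, which fails if some other
# subscriber is still active on the device.
@test CUDA.code_sass(devnull, valid_kernel, Tuple{}) == nothing
```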
One possibility is that some process fails to clean up after itself, holding on to GPU memory (and sometimes to the CUPTI lock), causing subsequent tests to fail.
Another possibility is that multiple tests get scheduled on the same MIG slice.
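If the latter is the case, it should be visible from the build logs. A hypothetical diagnostic along these lines could be added to the test harness, assuming the agents are meant to be isolated via CUDA_VISIBLE_DEVICES (none of this exists today, the logging is just a sketch):

```julia
using CUDA

# Assumption: each Buildkite agent is pinned to its own MIG slice through
# CUDA_VISIBLE_DEVICES. Log what this process actually ends up using; two
# concurrent jobs reporting the same device UUID would confirm slice sharing.
visible = get(ENV, "CUDA_VISIBLE_DEVICES", "<unset>")
dev     = CUDA.device()
devname = CUDA.name(dev)
devuuid = CUDA.uuid(dev)
@info "GPU visibility" visible devname devuuid
```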