CI: Upgrade past CUDA 12.6 results in OOMs/CUPTI errors #2885

@maleadt

I had kept gpuci on CUDA 12.6 because of known OOM issues with newer versions, hoping that things would resolve themselves. After having to upgrade the driver to support CUDA 13, the issues are back. Basically, there are three failure modes:

base/random: Error During Test at /var/lib/buildkite-agent/builds/gpuci-13/julialang/cuda-dot-jl/test/base/random.jl:201
  Got exception outside of a @test
  Out of GPU memory
  Effective GPU memory usage: 99.92% (4.746 GiB/4.750 GiB)
  Memory pool usage: 256.591 MiB (288.000 MiB reserved)
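
To put the two figures in that report side by side, here is a small arithmetic sketch in Python. It only uses the numbers printed above; the variable names are made up for illustration. The point is that the failing process's pool accounts for a small fraction of what is in use on the device, so roughly 4.5 GiB is held somewhere else (other processes, or allocations outside the pool such as contexts and libraries; the log alone does not say which).

```python
# Figures copied from the error report above; names are illustrative only.
GIB_IN_MIB = 1024

device_used_mib = 4.746 * GIB_IN_MIB   # numerator of "Effective GPU memory usage"
device_total_mib = 4.750 * GIB_IN_MIB  # denominator of "Effective GPU memory usage"
pool_reserved_mib = 288.000            # "reserved" figure of "Memory pool usage"

# Memory in use on the device that this process's pool does not account for.
outside_pool_mib = device_used_mib - pool_reserved_mib

# Fraction of the used memory that the pool is responsible for.
pool_share = pool_reserved_mib / device_used_mib

print(f"outside pool: {outside_pool_mib:.0f} MiB "
      f"({outside_pool_mib / GIB_IN_MIB:.2f} GiB)")
print(f"pool share of used memory: {pool_share:.1%}")
```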

Sometimes this manifests as a SIGTERM, which I think comes from earlyoom or something similar killing the process:

[242914] signal (15): Terminated

Now, for a hint as to where all this may come from:

Some tests did not pass: 16 passed, 0 failed, 3 errored, 0 broken.
core/codegen: Error During Test at /var/lib/buildkite-agent/builds/gpuci-11/julialang/cuda-dot-jl/test/core/codegen.jl:194
  Test threw exception
  Expression: CUDA.code_sass(devnull, valid_kernel, Tuple{}) == nothing
  CUPTIError: CUPTI doesn't allow multiple callback subscribers. Only a single subscriber can be registered at a time. (code 39, CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED)

One possibility here is that some process fails to clean up after itself, keeping hold of GPU memory and sometimes of the CUPTI lock, which makes subsequent tests fail.
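
One way to check for such a leftover process would be to list the compute processes on the GPU before a job starts. The following is a minimal sketch in Python, not something gpuci currently does: it assumes `nvidia-smi` is on the PATH of the agent, and the helper names (`parse_compute_apps`, `leftover_processes`, `query_compute_apps`) are hypothetical. Under MIG, per-process memory may be reported as not available, so the parser tolerates non-numeric values.

```python
import subprocess

# nvidia-smi query for compute processes. The process name is requested last
# so that a name containing commas does not break the CSV split below.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory,process_name",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse nvidia-smi CSV rows into (pid, used_mib, name) tuples.

    used_mib is None when nvidia-smi reports a non-numeric value.
    """
    apps = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        pid, used, name = [field.strip() for field in line.split(",", 2)]
        used_mib = int(used) if used.isdigit() else None
        apps.append((int(pid), used_mib, name))
    return apps


def leftover_processes(apps, expected_pids):
    """Return the entries whose PID is not one the current job owns."""
    return [app for app in apps if app[0] not in expected_pids]


def query_compute_apps():
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_compute_apps(result.stdout)
```

Calling `leftover_processes(query_compute_apps(), expected_pids=set())` at the start of a job would list every process still attached to the GPU. Any entry there is a candidate for the process holding memory or the CUPTI subscription, although this sketch cannot tell which of the two it holds.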

Another possibility is that multiple tests get scheduled on the same MIG slice.
