CI: Upgrade past CUDA 12.6 results in OOMs/CUPTI errors #2885

I had kept `gpuci` on CUDA 12.6 because of known OOM issues with newer versions, hoping they would resolve themselves. After having to upgrade the driver to support CUDA 13, the issues are back. There are three failure modes. The first is an outright OOM:

```
base/random: Error During Test at /var/lib/buildkite-agent/builds/gpuci-13/julialang/cuda-dot-jl/test/base/random.jl:201
  Got exception outside of a @test
  Out of GPU memory
  Effective GPU memory usage: 99.92% (4.746 GiB/4.750 GiB)
  Memory pool usage: 256.591 MiB (288.000 MiB reserved)
```

Sometimes this instead manifests as a SIGTERM, which I think comes from `earlyoom` or a similar OOM killer terminating the process:

```
[242914] signal (15): Terminated
```

The third failure mode gives a hint as to where all this may come from:

```
Some tests did not pass: 16 passed, 0 failed, 3 errored, 0 broken.
core/codegen: Error During Test at /var/lib/buildkite-agent/builds/gpuci-11/julialang/cuda-dot-jl/test/core/codegen.jl:194
  Test threw exception
  Expression: CUDA.code_sass(devnull, valid_kernel, Tuple{}) == nothing
  CUPTIError: CUPTI doesn't allow multiple callback subscribers. Only a single subscriber can be registered at a time. (code 39, CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED)
```

One possibility here is that some process fails to clean up after itself, holding on to GPU memory and sometimes the CUPTI lock, which causes subsequent tests to fail.
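
To check the leftover-process theory, a CI job could list the compute processes on the GPU before the tests start. The following is a minimal sketch, not something the CI currently does: it assumes `nvidia-smi` is available on the runner, and the helper names (`parse_compute_apps`, `leftover_processes`, `query_compute_apps`) are made up for illustration. Whether `nvidia-smi` reports per-process usage correctly under MIG would need to be verified on the actual machine.

```python
import subprocess

# nvidia-smi query for compute processes; with "nounits" the memory is in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse the CSV output of the query above into a list of dicts."""
    apps = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        used = parts[-1]
        apps.append({
            "pid": int(parts[0]),
            # the process name sits between the first and last field
            "name": ",".join(parts[1:-1]),
            # memory may be reported as e.g. "[N/A]" instead of a number
            "used_mib": int(used) if used.isdigit() else None,
        })
    return apps


def leftover_processes(apps, expected_pids):
    """Return the processes that are not part of the current job."""
    return [app for app in apps if app["pid"] not in expected_pids]


def query_compute_apps():
    """Run nvidia-smi and return the parsed process list."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Calling `leftover_processes(query_compute_apps(), {os.getpid()})` at the start of a job (after `import os`) and logging the result would show whether an earlier run is still holding memory when the OOM or CUPTI error occurs.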

Another possibility is that multiple tests get scheduled on the same MIG slice.
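
One way to test the shared-slice theory would be to have each job take an advisory lock keyed on the device it was given, and to log when the lock is already held. This is only a sketch under assumptions: it presumes the jobs run on the same host and see the same lock directory (which may not hold for containerized agents), that the slice is identified by a string such as the MIG UUID in `CUDA_VISIBLE_DEVICES`, and the helper names are hypothetical.

```python
import fcntl
import hashlib
import os
import tempfile


def slice_lock_path(device_id, lock_dir=None):
    """Map a device identifier to a lock file path (hashed to be filename-safe)."""
    lock_dir = lock_dir or tempfile.gettempdir()
    digest = hashlib.sha256(device_id.encode()).hexdigest()[:16]
    return os.path.join(lock_dir, f"gpu-slice-{digest}.lock")


def try_claim_slice(device_id, lock_dir=None):
    """Try to take an exclusive, non-blocking lock for the given device.

    Returns the open file holding the lock (keep it open for the duration
    of the job), or None if another holder already has the lock.
    """
    lock_file = open(slice_lock_path(device_id, lock_dir), "a")
    try:
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    return lock_file
```

A job would call `try_claim_slice(os.environ.get("CUDA_VISIBLE_DEVICES", "unknown"))` at startup. A `None` result means another job on that host claimed the same device string, which would point at the scheduler rather than at a leaked process.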
