[BUG] multithreaded read_region() with CuPy asarray() throws CUDARuntimeError, segfaults, or core dumps #884

@jakebytes

Description

Describe the bug

Multiple threads each call cupy.asarray(CuImage.read_region(...)). This causes intermittent crashes ranging from a CUDARuntimeError to segfaults and core dumps. The behavior does not reproduce with a single thread, and I confirmed it happens with multiple different input images. The symptoms suggest a memory-safety bug or a race condition.

The code below reproduces the issue fairly reliably, but may need to be run multiple times before a crash is observed. The image referenced is available from TCGA.

Steps/Code to reproduce bug

import math
from concurrent.futures import ThreadPoolExecutor
from functools import partial

import cupy as cp
from cucim import CuImage

# BUG does not reproduce with N_THREADS=1
N_THREADS = 64
slide_path = "34344dfc-b64e-439f-bd15-279cf6c74401/TCGA-BP-5195-01Z-00-DX1.910fae7d-503e-4758-bb45-7c039ff9d179.svs"


def extract_patch(
    coord: tuple[int, int],
    slide: CuImage,
    level: int,
    size: tuple[int, int],
):
    try:
        img: CuImage = slide.read_region(
            location=coord,
            level=level,
            size=size,
        )
        # BUG occurs here
        return cp.asarray(img)
    except Exception:
        print(f"(x, y): {coord} level: {level} size: {size}")
        raise


with ThreadPoolExecutor(max_workers=N_THREADS) as thread_pool:
    # init CuImage object
    slide = CuImage(slide_path)
    width_px, height_px = slide.size("XY")
    # hardcoded for this image, obtained via openslide
    target_mpp = 1.0
    patch_size = 512
    level = 1
    level_downsample = 4.000057352603808
    base_mpp = 0.2498
    # downsample factor to reach the target MPP
    downsample_factor = target_mpp / base_mpp
    scale_factor = downsample_factor / level_downsample
    adjusted_width = math.ceil(patch_size * scale_factor)
    adjusted_height = math.ceil(patch_size * scale_factor)
    # stride in pixels at level 0
    stride = round(patch_size * downsample_factor)
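    # with the hardcoded values above: downsample_factor ~= 4.0032 and
    # scale_factor ~= 1.0008, so each patch is read as 513x513 level-1
    # pixels with a level-0 stride of 2050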
    rows = math.floor(height_px / stride)
    cols = math.floor(width_px / stride)
    # xy coordinates of every patch
    coords = [
        (col_idx * stride, row_idx * stride)
        for row_idx in range(rows)
        for col_idx in range(cols)
    ]
    # sanity check xy coordinates for given image size
    for x, y in coords:
        assert x >= 0 and x + stride < width_px
        assert y >= 0 and y + stride < height_px

    fn = partial(
        extract_patch,
        slide=slide,
        level=level,
        size=(adjusted_width, adjusted_height),
    )
    # do read_region() in parallel using multithreading
    patches = list(thread_pool.map(fn, coords))

Expected behavior

I expect multithreaded use of cupy.asarray() with read_region() not to throw sporadic errors or crash the process.
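
A temporary mitigation I am considering is to serialize just the CuImage-to-CuPy conversion across threads, since the traceback points at the host-to-device copy inside cp.asarray(). Below is a minimal sketch of that workaround; the lock placement is my assumption, and I have not verified that it eliminates the crash in all cases.

import threading

import cupy as cp

# global lock: serialize only the host-to-device copy done by cp.asarray()
# (assumption: the race is in that copy path, per the traceback below)
asarray_lock = threading.Lock()


def extract_patch_serialized(coord, slide, level, size):
    # read_region() itself still runs concurrently in every worker thread
    img = slide.read_region(location=coord, level=level, size=size)
    with asarray_lock:
        # only one thread at a time performs the device copy
        return cp.asarray(img)

This gives up the parallelism of the copy itself, so it is a diagnostic stopgap rather than a fix.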

Environment details (please complete the following information):

  • Environment location: DGX-H200 node
  • Method of cuCIM install: pip

Python package versions

cucim-cu12                               25.4.0
cupy-cuda12x                             13.4.1
nvidia-cublas-cu12                       12.4.5.8
nvidia-cuda-cupti-cu12                   12.4.127
nvidia-cuda-nvrtc-cu12                   12.4.127
nvidia-cuda-runtime-cu12                 12.4.127
nvidia-cudnn-cu12                        9.1.0.70
nvidia-cufft-cu12                        11.2.1.3
nvidia-curand-cu12                       10.3.5.147
nvidia-cusolver-cu12                     11.6.1.9
nvidia-cusparse-cu12                     12.3.1.170
nvidia-cusparselt-cu12                   0.6.2
nvidia-nccl-cu12                         2.21.5
nvidia-nvjitlink-cu12                    12.4.127
nvidia-nvtx-cu12                         12.4.127

Linux kernel: 5.15.0
CUDA Version: 12.2
Driver Version: 535.216.03

Additional context

Most common crash output (the failure surfaces in CuPy's host-to-device copy, MemoryPointer.copy_from_host_async):

File "/home/cucim_bug_repro.py", line 25, in extract_patch
return cp.asarray(img)
^^^^^^^^^^^^^^^
File "/home/venv/lib/python3.12/site-packages/cupy/_creation/from_data.py", line 88, in asarray
return _core.array(a, dtype, False, order, blocking=blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "cupy/_core/core.pyx", line 2455, in cupy._core.core.array
File "cupy/_core/core.pyx", line 2482, in cupy._core.core.array
File "cupy/_core/core.pyx", line 2647, in cupy._core.core._array_default
File "cupy/cuda/memory.pyx", line 488, in cupy.cuda.memory.MemoryPointer.copy_from_host_async
File "cupy_backends/cuda/api/runtime.pyx", line 607, in cupy_backends.cuda.api.runtime.memcpyAsync
File "cupy_backends/cuda/api/runtime.pyx", line 146, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidValue: invalid argument

Rarer crash output:

The futex facility returned an unexpected error code.
Aborted (core dumped)
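
Since the invalid-argument error comes out of cudaMemcpyAsync during copy_from_host_async, my guess is that the host buffer metadata CuPy consumes is being corrupted or freed concurrently. A quick way to inspect what cp.asarray() sees, assuming CuImage exposes __array_interface__ for host-resident regions (which is the path the traceback suggests):

img = slide.read_region(location=(0, 0), level=1, size=(513, 513))
# the host pointer, shape, and typestr that cupy.asarray() uses for the memcpy
print(img.__array_interface__)

If the pointer or shape differed across threads for the same region, that would point at shared mutable state inside CuImage.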

CC @gigony
