Handle `cuda.core.Stream` in driver operations #401

brandon-b-miller · 2025-08-18T12:21:42Z

Closes #151

copy-pr-bot · 2025-08-18T12:21:45Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

numba_cuda/numba/cuda/cudadrv/driver.py

brandon-b-miller · 2025-08-18T12:27:09Z

/ok to test

brandon-b-miller · 2025-08-20T14:42:06Z

/ok to test

brandon-b-miller · 2025-08-20T15:11:15Z

/ok to test

brandon-b-miller · 2025-08-20T15:58:31Z

/ok to test

numba_cuda/numba/cuda/cudadrv/driver.py

isVoid

I see this PR closes #151. That issue suggests that we can pass a cuda.core stream object via the kernel launch interface, but this PR is missing a test for that use case.

Co-authored-by: Keith Kraus <[email protected]>

brandon-b-miller · 2025-08-25T15:57:17Z

/ok to test

numba_cuda/numba/cuda/cudadrv/driver.py

leofang · 2025-08-26T19:40:01Z

numba_cuda/numba/cuda/cudadrv/driver.py

+    acceptable stream objects. Acceptable types are
+    int (0 for default stream), Stream, ExperimentalStream


Is the docstring outdated? int is currently not allowed

Only for the special value 0 I believe.
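To make the accepted inputs concrete, here is a minimal sketch of how a stream argument could be normalized to a raw handle. The helper name `stream_handle` and the `FakeStream` class are hypothetical and not taken from this PR. The sketch only assumes the `__cuda_stream__` protocol from #151, where the method returns a `(version, handle)` tuple.

```python
def stream_handle(stream):
    """Return the integer handle for an accepted stream argument.

    Hypothetical helper, not the numba-cuda implementation.
    """
    # The special integer 0 stands for the default stream.
    # bool is excluded because False == 0 in Python.
    if isinstance(stream, int) and not isinstance(stream, bool):
        if stream == 0:
            return 0
        raise TypeError("only the integer 0 is accepted as a stream")
    # Objects implementing the __cuda_stream__ protocol return
    # a (version, handle) tuple.
    protocol = getattr(stream, "__cuda_stream__", None)
    if protocol is not None:
        _version, handle = protocol()
        return int(handle)
    raise TypeError(f"unsupported stream type: {type(stream).__name__}")


class FakeStream:
    """Stand-in for a stream object; carries no real CUDA state."""

    def __init__(self, handle):
        self._handle = handle

    def __cuda_stream__(self):
        return (0, self._handle)
```

With this shape, any other integer is rejected outright, which matches the "only for the special value 0" reading of the docstring.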

Should we consider deprecating passing 0 as a Stream? The "default stream" is ambiguous in Python since the per-thread default stream (PTDS) is normally a host compile-time concept. We have an environment variable for controlling it in cuda.bindings / cuda.core, CUDA_PYTHON_CUDA_PER_THREAD_DEFAULT_STREAM, which I think should generally be used.

It would be great if we could introduce some form of deprecation warning for passing 0 as a Stream in user-facing APIs.
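A rough sketch of what such a warning could look like, using only the standard `warnings` module. The function name `check_stream_argument` and the message wording are hypothetical, and nothing here is part of this PR.

```python
import warnings


def check_stream_argument(stream):
    """Warn when the bare integer 0 is used as a stream.

    Hypothetical helper illustrating the proposal; returns the argument
    unchanged so it can wrap existing call sites.
    """
    is_plain_int = isinstance(stream, int) and not isinstance(stream, bool)
    if is_plain_int and stream == 0:
        warnings.warn(
            "Passing 0 as a stream is deprecated; pass an explicit "
            "stream object instead.",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller
        )
    return stream
```

Because `DeprecationWarning` is hidden by default outside of `__main__` and test runners, a user-facing API might prefer `FutureWarning`; that choice is left open here.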

From the user's perspective, we're deprecating the APIs fully in #546, so those should be gone entirely. But we should do a sweep and make sure we're being explicit with all our usages of streams internally.

Outside of the DeviceNDArray class, I think streams are also accepted when launching kernels and when using the Event APIs, so shouldn't we handle them properly there as well?

Launching is tested as part of this PR; events were added in 7df62ce.

leofang · 2025-08-26T19:44:21Z

numba_cuda/numba/cuda/cudadrv/driver.py

+    """
+    Memset on the device.
+    If stream is 0, the call is synchronous.
+    If stream is a Stream object, asynchronous mode is used.


There is a bug (or change of behavior) here and elsewhere. stream can be a Stream object from either numba-cuda or cuda.core that still holds 0 (the default stream) under the hood. However, the call now becomes asynchronous (with respect to the host) instead of synchronous. Just wanted to call it out in case it was not the intention.

This is a really good catch. As a follow-up, is the output here as expected, where dev is a cuda.core.experimental.Device for which set_current() has been called? Should it not be (0, 0)?

>>> dev.default_stream.__cuda_stream__()
(0, 1)

I ask hoping there's a reliable way of detecting this situation based on the passed object.
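One possible way to detect this from the passed object is to compare its handle against the driver's reserved default-stream values. In the CUDA driver API header, `CU_STREAM_LEGACY` is defined as `0x1` and `CU_STREAM_PER_THREAD` as `0x2`, which would be consistent with a handle of 1 rather than 0 in the output above. The helper below is a hypothetical sketch, not code from this PR, and `FakeStream` is only a stand-in for a real stream object.

```python
# Reserved stream handle values from the CUDA driver API header (cuda.h).
NULL_STREAM = 0
CU_STREAM_LEGACY = 1
CU_STREAM_PER_THREAD = 2

_DEFAULT_HANDLES = frozenset(
    {NULL_STREAM, CU_STREAM_LEGACY, CU_STREAM_PER_THREAD}
)


def is_default_stream(stream):
    """Return True if the argument refers to a default stream.

    Hypothetical helper: accepts a plain int handle or any object
    implementing the __cuda_stream__ protocol.
    """
    if isinstance(stream, int):
        handle = stream
    else:
        # The protocol returns a (version, handle) tuple.
        _version, handle = stream.__cuda_stream__()
    return int(handle) in _DEFAULT_HANDLES


class FakeStream:
    """Stand-in for a stream object; carries no real CUDA state."""

    def __init__(self, handle):
        self._handle = handle

    def __cuda_stream__(self):
        return (0, self._handle)
```

Whether the per-thread handle should count as "default" for the purpose of choosing synchronous behavior is a design decision this sketch does not settle.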

After a while searching around the codebase I concluded this was at least the original intention, though these are really only used for the deprecated device array API:

If a CUDA ``stream`` is given, then the transfer will be made asynchronously as part as the given stream. Otherwise, the transfer is synchronous: the function returns after the copy is finished.

So AFAICT this PR maintains the above behavior just with a new stream object. Ultimately though I'm not sure we should spend too much time thinking about it as these will be removed and users performing these types of memory transfers should use either cupy for a nice array API or cuda.bindings for full control of things like synchronization behavior.
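For illustration, the documented rule (0 means synchronous, a Stream object means asynchronous) can be sketched as a small dispatcher. Everything here is hypothetical: `memset_dispatch`, `sync_fill`, and `async_fill` are stand-ins for the real driver calls and are not taken from driver.py.

```python
def memset_dispatch(dst, value, nbytes, stream=0, *, sync_fill, async_fill):
    """Pick the synchronous or asynchronous path based on the stream argument.

    Hypothetical sketch; sync_fill and async_fill are caller-supplied
    stand-ins for the real driver calls.
    """
    # Only the bare integer 0 selects the synchronous path.
    if isinstance(stream, int) and stream == 0:
        sync_fill(dst, value, nbytes)
        return "sync"
    # Any stream object, including one that wraps the default stream,
    # takes the asynchronous path.
    async_fill(dst, value, nbytes, stream)
    return "async"


class FakeStream:
    """Stand-in for a stream object; carries no real CUDA state."""

    def __init__(self, handle):
        self._handle = handle

    def __cuda_stream__(self):
        return (0, self._handle)


calls = []
mode_int = memset_dispatch(
    "buf", 0, 16, 0,
    sync_fill=lambda *args: calls.append("sync"),
    async_fill=lambda *args: calls.append("async"),
)
mode_obj = memset_dispatch(
    "buf", 0, 16, FakeStream(1),
    sync_fill=lambda *args: calls.append("sync"),
    async_fill=lambda *args: calls.append("async"),
)
```

The second call shows the situation raised in the review: a stream object wrapping the default stream is still routed to the asynchronous path.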

numba_cuda/numba/cuda/cudadrv/driver.py

numba_cuda/numba/cuda/tests/cudadrv/test_cuda_driver.py

brandon-b-miller · 2025-10-14T21:47:50Z

/ok to test

brandon-b-miller · 2025-10-15T17:40:41Z

/ok to test

brandon-b-miller · 2025-10-15T20:25:57Z

/ok to test

brandon-b-miller · 2025-10-15T21:30:32Z

/ok to test

brandon-b-miller · 2025-10-24T12:47:27Z

/ok to test

numba_cuda/numba/cuda/cudadrv/driver.py

brandon-b-miller · 2025-10-27T17:05:36Z

/ok to test

brandon-b-miller · 2025-10-27T20:24:43Z

/ok to test

brandon-b-miller · 2025-10-27T21:04:07Z

/ok to test

Closes NVIDIA#151 --------- Co-authored-by: Keith Kraus <[email protected]>

numba_cuda/numba/cuda/cudadrv/driver.py

leofang · 2025-10-30T00:45:37Z

Another question: @brandon-b-miller IIRC we hit some test failures that blocked this PR from making progress. What was the fix that unblocked this?

brandon-b-miller · 2025-10-30T12:04:59Z

Another question: @brandon-b-miller IIRC we hit some test failures that blocked this PR from making progress. What was the fix that unblocked this?

It was the fact that elsewhere in the test suite we were blowing up all contexts out from under cuda-core, fixed by #507.

leofang · 2025-10-30T12:34:17Z

Ahh yes, the context reset issue, thanks for the reminder 🙏

- Add support for cache-hinted load and store operations (NVIDIA#587) - Add more thirdparty tests (NVIDIA#586) - Add sphinx-lint to pre-commit and fix errors (NVIDIA#597) - Add DWARF variant part support for polymorphic variables in CUDA debug info (NVIDIA#544) - chore: clean up dead workaround for unavailable `lru_cache` (NVIDIA#598) - chore(docs): format types docs (NVIDIA#596) - refactor: decouple `Context` from `Stream` and `Event` objects (NVIDIA#579) - Fix freezing in of constant arrays with negative strides (NVIDIA#589) - Update tests to accept variants of generated PTX (NVIDIA#585) - refactor: replace device functionality with `cuda.core` APIs (NVIDIA#581) - Move frontend tests to `cudapy` namespace (NVIDIA#558) - Generalize the concurrency group for main merges (NVIDIA#582) - ci: move pre-commit checks to pre commit action (NVIDIA#577) - chore(pixi): set up doc builds; remove most `build-conda` dependencies (NVIDIA#574) - ci: ensure that python version in ci matches matrix (NVIDIA#575) - Fix the `cuda.is_supported_version()` API (NVIDIA#571) - Fix checks on main (NVIDIA#576) - feat: add `math.nextafter` (NVIDIA#543) - ci: replace conda testing with pixi (NVIDIA#554) - [CI] Run PR workflow on merge to main (NVIDIA#572) - Propose Alternative Module Path for `ext_types` and Maintain `numba.cuda.types.bfloat16` Import API (NVIDIA#569) - test: enable fail-on-warn and clean up resulting failures (NVIDIA#529) - [Refactor][NFC] Vendor-in compiler_lock for future CUDA-specific changes (NVIDIA#565) - Fix registration with Numba, vendor MakeFunctionToJITFunction tests (NVIDIA#566) - [Refactor][NFC][Cleanups] Update imports to upstream numba to use the numba.cuda modules (NVIDIA#561) - test: refactor process-based tests to use concurrent futures in order to simplify tests (NVIDIA#550) - test: revert back to ipc futures that await each iteration (NVIDIA#564) - chore(deps): move to self-contained pixi.toml to avoid mixed-pypi-pixi environments (NVIDIA#551) - 
[Refactor][NFC] Vendor-in errors for future CUDA-specific changes (NVIDIA#534) - Remove dependencies on target_extension for CUDA target (NVIDIA#555) - Relax the pinning to `cuda-core` to allow it floating across minor releases (NVIDIA#559) - [WIP] Port numpy reduction tests to CUDA (NVIDIA#523) - ci: add timeout to avoid blocking the job queue (NVIDIA#556) - Handle `cuda.core.Stream` in driver operations (NVIDIA#401) - feat: add support for `math.exp2` (NVIDIA#541) - Vendor in types and datamodel for CUDA-specific changes (NVIDIA#533) - refactor: cleanup device constructor (NVIDIA#548) - bench: add cupy to array constructor kernel launch benchmarks (NVIDIA#547) - perf: cache dimension computations (NVIDIA#542) - perf: remove duplicated size computation (NVIDIA#537) - chore(perf): add torch to benchmark (NVIDIA#539) - test: speed up ipc tests by ~6.5x (NVIDIA#527) - perf: speed up kernel launch (NVIDIA#510) - perf: remove context threading in various pointer abstractions (NVIDIA#536) - perf: reduce the number of `__cuda_array_interface__` accesses (NVIDIA#538) - refactor: remove unnecessary custom map and set implementations (NVIDIA#530) - [Refactor][NFC] Vendor-in vectorize decorators for future CUDA-specific changes (NVIDIA#513) - test: add benchmarks for kernel launch for reproducibility (NVIDIA#528) - test(pixi): update pixi testing command to work with the new `testing` directory (NVIDIA#522) - refactor: fully remove `USE_NV_BINDING` (NVIDIA#525) - Draft: Vendor in the IR module (NVIDIA#439) - pyproject.toml: add search path for Pyrefly (NVIDIA#524) - Vendor in numba.core.typing for CUDA-specific changes (NVIDIA#473) - Use numba.config when available, otherwise use numba.cuda.config (NVIDIA#497) - [MNT] Drop NUMBA_CUDA_USE_NVIDIA_BINDING; always use cuda.core and cuda.bindings as fallback (NVIDIA#479) - Vendor in dispatcher, entrypoints, pretty_annotate for CUDA-specific changes (NVIDIA#502) - build: allow parallelization of nvcc testing builds (NVIDIA#521) - 
chore(dev-deps): add pixi (NVIDIA#505) - Vendor the imputils module for CUDA refactoring (NVIDIA#448) - Don't use `MemoryLeakMixin` for tests that don't use NRT (NVIDIA#519) - Switch back to stable cuDF release in thirdparty tests (NVIDIA#518) - Updating .gitignore with binaries in the `testing` folder (NVIDIA#516) - Remove some unnecessary uses of ContextResettingTestCase (NVIDIA#507) - Vendor in _helperlib cext for CUDA-specific changes (NVIDIA#512) - Vendor in typeconv for future CUDA-specific changes (NVIDIA#499) - [Refactor][NFC] Vendor-in numba.cpython modules for future CUDA-specific changes (NVIDIA#493) - [Refactor][NFC] Vendor-in numba.np modules for future CUDA-specific changes (NVIDIA#494) - Make the CUDA target the default for CUDA overload decorators (NVIDIA#511) - Remove C extension loading hacks (NVIDIA#506) - Ensure NUMBA can manipulate memory from CUDA graphs before the graph is launched (NVIDIA#437) - [Refactor][NFC] Vendor-in core Numba analysis utils for CUDA-specific changes (NVIDIA#433) - Fix Bf16 Test OB Error (NVIDIA#509) - Vendor in components from numba.core.runtime for CUDA-specific changes (NVIDIA#498) - [Refactor] Vendor in _dispatcher, _devicearray, mviewbuf C extension for CUDA-specific customization (NVIDIA#373) - [MNT] Managed UM memset fallback and skip CUDA IPC tests on WSL2 (NVIDIA#488) - Improve debug value range coverage (NVIDIA#461) - Add `compile_all` API (NVIDIA#484) - Vendor in core.registry for CUDA-specific changes (NVIDIA#485) - [Refactor][NFC] Vendor in numba.misc for CUDA-specific changes (NVIDIA#457) - Vendor in optional, boxing for CUDA-specific changes, fix dangling imports (NVIDIA#476) - [test] Remove dependency on cpu_target (NVIDIA#490) - Change dangling imports of numba.core.lowering to numba.cuda.lowering (NVIDIA#475) - [test] Use numpy's tolerance for float16 (NVIDIA#491) - [Refactor][NFC] Vendor-in numba.extending for future CUDA-specific changes (NVIDIA#466) - [Refactor][NFC] Vendor-in more cpython 
registries for future CUDA-specific changes (NVIDIA#478)


brandon-b-miller added 5 commits August 15, 2025 13:52

initial

a0f25af

tests

5322eef

refactor

251f4e9

small changes

505cd4d

__cuda_stream__

b861723

brandon-b-miller added 3 commits August 20, 2025 06:02

Merge branch 'main' into cuda-core-streams

b53f9ca

accomodate ctypes bindings

2181748

clean

46863d3

more pacifying ctypes bindings

2082063

fix

ec5841c

kkraus14 reviewed Aug 22, 2025

isVoid reviewed Aug 22, 2025

isVoid reviewed Aug 22, 2025

brandon-b-miller and others added 4 commits August 25, 2025 07:22

Merge branch 'main' into cuda-core-streams

2e45f6d

renaming

4fcf9d1

address reviews

220c2e3

Update numba_cuda/numba/cuda/cudadrv/driver.py

f3b07c0

Co-authored-by: Keith Kraus <[email protected]>

leofang reviewed Aug 26, 2025

isVoid reviewed Aug 29, 2025

rparolin mentioned this pull request Oct 1, 2025

[FEA] Make cuda.core.Stream recognized by numba-cuda by supporting the __cuda_stream__ protocol #151

Closed

brandon-b-miller added 2 commits October 7, 2025 07:27

merge/resolve

387ba84

address some reviews

20440ab

merge/resolve

9ab36e7

small fix

d1ad577

small fix

b7b56eb

Merge branch 'main' into cuda-core-streams

c3e10af

kkraus14 reviewed Oct 27, 2025

brandon-b-miller added 2 commits October 27, 2025 10:03

Merge branch 'main' into cuda-core-streams

9b301a8

USE_NV_BINDING

324a48a

kkraus14 approved these changes Oct 27, 2025

events

7df62ce

kkraus14 approved these changes Oct 27, 2025

skip event tests on sim

f859466

brandon-b-miller merged commit 39066c7 into NVIDIA:main Oct 27, 2025
70 checks passed

brandon-b-miller deleted the cuda-core-streams branch October 27, 2025 22:01

atmnp pushed a commit to atmnp/numba-cuda that referenced this pull request Oct 29, 2025

Handle cuda.core.Stream in driver operations (NVIDIA#401)

6a99c1b

Closes NVIDIA#151 --------- Co-authored-by: Keith Kraus <[email protected]>

leofang reviewed Oct 30, 2025

leofang mentioned this pull request Oct 30, 2025

Restore compatibility with older cuda-bindings #563

Closed

gmarkall mentioned this pull request Nov 20, 2025

Bump version to 0.21.0 #602

Merged
