perf: speed up kernel launch #510

cpcloud · 2025-10-07T16:21:58Z

This PR speeds up kernel launching by ~10-25%.

To figure out where some of the hotter paths were, I used prof.py, a script
at the top level of the repo.

The script makes a call to a cuda-jitted function that takes a single argument
and does nothing in the body of the function.

I don't plan to include this script if this PR is merged, but it's there so
people can pull the PR down and run the code if they'd like.

I also don't plan to include the changes to pixi that add the prof group.

Most of the changes here fall under the category of removing dynamism:

- Replacing __getattr__ calls/implementations by concretizing frequently accessed attributes.
- Avoiding isinstance checks of abc.ABCMeta-based objects.
- Accessing attributes by direct lookup instead of using getattr(x, string).
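
A minimal sketch of the last two patterns, using hypothetical Base and Impl classes that are not taken from numba-cuda. Both functions return the same value for an Impl instance; the second avoids the ABC instance check and the string-based lookup. Whether that is measurably faster depends on the interpreter version and workload, so it is worth profiling before relying on it.

```python
import abc


class Base(abc.ABC):
    """Hypothetical ABC-based type; isinstance() on it goes through ABCMeta."""


class Impl(Base):
    def __init__(self):
        self.value = 1


def dynamic_access(obj):
    # isinstance() against an ABC dispatches to ABCMeta.__instancecheck__,
    # and getattr() with a string resolves the name at call time.
    if isinstance(obj, Base):
        return getattr(obj, "value")
    return None


def direct_access(obj):
    # An exact type check plus a plain attribute access.
    # Note: unlike isinstance(), this does not match subclasses of Impl.
    if type(obj) is Impl:
        return obj.value
    return None
```

The exact type check is a behavior change as well as an optimization: it only makes sense where subclasses never reach that code path.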

copy-pr-bot · 2025-10-07T16:22:02Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

cpcloud · 2025-10-07T16:23:52Z

numba_cuda/numba/cuda/cudadrv/devicearray.py

-        else:
+        try:
+            # apparently faster in the non-exceptional case
            return self.gpu_data.device_ctypes_pointer


This is faster because, as of Python 3.11, a try block has effectively no cost when no exception is raised, and that case appears to be more common in the kernel launch path than the case where self.gpu_data is None.
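
A rough sketch of the two shapes being compared. FakeDeviceMemory and FakeArray are hypothetical stand-ins, and the except clause and the null-pointer fallback are assumptions, since the diff above does not show them.

```python
import ctypes


class FakeDeviceMemory:
    """Hypothetical stand-in for the object stored in gpu_data."""

    def __init__(self, address):
        self.device_ctypes_pointer = ctypes.c_void_p(address)


class FakeArray:
    """Hypothetical stand-in for a device array."""

    def __init__(self, gpu_data):
        self.gpu_data = gpu_data

    def pointer_with_check(self):
        # Pays for the None comparison on every call.
        if self.gpu_data is None:
            return ctypes.c_void_p(0)
        return self.gpu_data.device_ctypes_pointer

    def pointer_with_try(self):
        # Only the rarer gpu_data-is-None case pays for exception handling.
        # Assumption: None has no device_ctypes_pointer, so AttributeError
        # is the exception to catch.
        try:
            return self.gpu_data.device_ctypes_pointer
        except AttributeError:
            return ctypes.c_void_p(0)
```

The trade-off runs the other way when the exceptional case is common, because raising and catching an exception is comparatively expensive.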

cpcloud · 2025-10-07T16:32:11Z

numba_cuda/numba/cuda/core/config.py

 def __getattr__(name):
    """Module-level __getattr__ provides dynamic behavior for _EnvVar descriptors."""
-    # Fetch non-descriptor globals directly
-    if name in globals():


I removed this because this is the default behavior of module-level attribute access, so it's pointless to do it twice. __getattr__ is only called if name isn't found using the normal attribute lookup.

cpcloud · 2025-10-07T16:34:56Z

numba_cuda/numba/cuda/cudadrv/devices.py

-        return super(_DeviceList, self).__getattr__(attr)
+    @property
+    @functools.cache
+    def lst(self):


This is faster for every instance: attribute lookup no longer goes through a Python-level __getattr__, so everything except this one attribute is resolved in native code.
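
A sketch of the decorator stacking shown in the diff, on a hypothetical CachedList class with a hypothetical loader callable. It is not the actual _DeviceList implementation.

```python
import functools


class CachedList:
    """Hypothetical class exposing one lazily computed attribute."""

    def __init__(self, loader):
        self._loader = loader

    @property
    @functools.cache
    def lst(self):
        # Runs once per instance; functools.cache keys the result on self.
        return self._loader()


load_calls = []


def loader():
    load_calls.append(1)
    return ["device0", "device1"]


devices = CachedList(loader)
first = devices.lst
second = devices.lst  # served from the cache; loader is not called again
```

One design caveat: functools.cache keeps a strong reference to each instance it has seen, which is harmless for a long-lived singleton but can keep short-lived objects alive.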

cpcloud · 2025-10-07T16:35:19Z

numba_cuda/numba/cuda/cudadrv/devices.py

+        if devnum is not None:
+            return self[devnum]
+        return None


This change probably doesn't affect performance, so I am happy to remove it.

cpcloud · 2025-10-07T18:44:47Z

numba_cuda/numba/cuda/cudadrv/devices.py

                return ctx

    def _activate_context_for(self, devnum):
-        with self._lock:


This lock is reentrant, so the nested acquisition wasn't incorrect, but it was being taken inside a section that already holds it. I moved the lock acquisition to the one call site that invoked this method without holding the lock, which removes one acquire/release pair.
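
A minimal sketch of the before and after shapes, with a hypothetical counting wrapper around threading.RLock so the saved acquisition is visible. The function names are illustrative and do not come from devices.py.

```python
import threading


class CountingRLock:
    """Reentrant lock wrapper that counts acquisitions (illustration only)."""

    def __init__(self):
        self._lock = threading.RLock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def activate_nested(lock):
    # Before: the helper re-acquires a lock its caller already holds.
    def helper():
        with lock:
            return "ctx"

    with lock:
        return helper()


def activate_flat(lock):
    # After: the helper relies on the caller holding the lock.
    def helper():
        return "ctx"

    with lock:
        return helper()
```

The flat version only stays correct if every caller of the helper holds the lock, which is why the one unlocked call site has to take it explicitly.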

numba_cuda/numba/cuda/args.py

leofang · 2025-10-07T21:29:11Z

prof.py

btw another good benchmark is this #288 (comment). Varying the number of kernel arguments (or the dimension of a single input array) might help reveal additional hot paths. For every array argument, last time I checked (which was a while ago) numba-cuda would unpack it to (1 + ndim*2) arguments (ptr/shape/strides).
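
As a quick worked example of the (1 + ndim*2) figure quoted above, which the comment itself flags as possibly out of date, the helper below is purely arithmetic and hypothetical; it does not inspect numba-cuda.

```python
def flattened_arg_count(ndim):
    """Arguments per array under the 1 + ndim*2 figure quoted above."""
    pointer = 1
    shape = ndim
    strides = ndim
    return pointer + shape + strides


counts = {ndim: flattened_arg_count(ndim) for ndim in (1, 2, 3)}
```

Under that formula the per-array count grows linearly with dimensionality, which is why varying ndim in a benchmark could expose per-argument overhead.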

numba_cuda/numba/cuda/cudadrv/driver.py

numba_cuda/numba/cuda/cudadrv/devices.py

gmarkall · 2025-10-14T09:56:14Z

numba_cuda/numba/cuda/tests/cudadrv/test_context_stack.py

        device_ids = [device.id for device in cuda.list_devices()]
        for device_id in device_ids:
-            with cuda.gpus[device_id]:
+            with cuda.gpus[int(device_id)]:


I saw this change after writing https://github.com/NVIDIA/numba-cuda/pull/510/files#r2428469695 above. This line is the key part of this test, so changing it is adapting it to the fact that this PR changed the API.

cpcloud · 2025-10-15T17:23:46Z

I'm going to push up a small PR to add a couple benchmarks, so that it's easy to use pytest-benchmark to run them and I don't have to keep using pyinstrument to verify results.

cpcloud · 2025-10-15T19:04:40Z

Changed the title to reflect the variance in speedup, which is somewhere between 10% and 25%.

This PR establishes a benchmarks directory where pytest-benchmark-based benchmarks can live. This is to serve as a baseline for #510.

cpcloud · 2025-10-16T13:21:54Z

/ok to test

copy-pr-bot · 2025-10-16T15:04:53Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

cpcloud · 2025-10-16T15:06:31Z

This is ready for review.

cpcloud · 2025-10-16T15:34:05Z

This PR does not close #477.

There's still a significant difference in our code path that automatically converts torch tensors into CUDA arrays.

cpcloud · 2025-10-20T16:57:59Z

@gmarkall This is ready for review. I am still working on isolating the as_cuda_array differences, which I will tackle in a follow-up. I already have a couple wins there, but I'm still trying to track down some more of the performance issues there.

cpcloud · 2025-10-21T15:31:40Z

Ping @gmarkall, would you mind reviewing this? Also happy to give this over to another reviewer if you don't have time!

- Add support for cache-hinted load and store operations (NVIDIA#587) - Add more thirdparty tests (NVIDIA#586) - Add sphinx-lint to pre-commit and fix errors (NVIDIA#597) - Add DWARF variant part support for polymorphic variables in CUDA debug info (NVIDIA#544) - chore: clean up dead workaround for unavailable `lru_cache` (NVIDIA#598) - chore(docs): format types docs (NVIDIA#596) - refactor: decouple `Context` from `Stream` and `Event` objects (NVIDIA#579) - Fix freezing in of constant arrays with negative strides (NVIDIA#589) - Update tests to accept variants of generated PTX (NVIDIA#585) - refactor: replace device functionality with `cuda.core` APIs (NVIDIA#581) - Move frontend tests to `cudapy` namespace (NVIDIA#558) - Generalize the concurrency group for main merges (NVIDIA#582) - ci: move pre-commit checks to pre commit action (NVIDIA#577) - chore(pixi): set up doc builds; remove most `build-conda` dependencies (NVIDIA#574) - ci: ensure that python version in ci matches matrix (NVIDIA#575) - Fix the `cuda.is_supported_version()` API (NVIDIA#571) - Fix checks on main (NVIDIA#576) - feat: add `math.nextafter` (NVIDIA#543) - ci: replace conda testing with pixi (NVIDIA#554) - [CI] Run PR workflow on merge to main (NVIDIA#572) - Propose Alternative Module Path for `ext_types` and Maintain `numba.cuda.types.bfloat16` Import API (NVIDIA#569) - test: enable fail-on-warn and clean up resulting failures (NVIDIA#529) - [Refactor][NFC] Vendor-in compiler_lock for future CUDA-specific changes (NVIDIA#565) - Fix registration with Numba, vendor MakeFunctionToJITFunction tests (NVIDIA#566) - [Refactor][NFC][Cleanups] Update imports to upstream numba to use the numba.cuda modules (NVIDIA#561) - test: refactor process-based tests to use concurrent futures in order to simplify tests (NVIDIA#550) - test: revert back to ipc futures that await each iteration (NVIDIA#564) - chore(deps): move to self-contained pixi.toml to avoid mixed-pypi-pixi environments (NVIDIA#551) - 
[Refactor][NFC] Vendor-in errors for future CUDA-specific changes (NVIDIA#534) - Remove dependencies on target_extension for CUDA target (NVIDIA#555) - Relax the pinning to `cuda-core` to allow it floating across minor releases (NVIDIA#559) - [WIP] Port numpy reduction tests to CUDA (NVIDIA#523) - ci: add timeout to avoid blocking the job queue (NVIDIA#556) - Handle `cuda.core.Stream` in driver operations (NVIDIA#401) - feat: add support for `math.exp2` (NVIDIA#541) - Vendor in types and datamodel for CUDA-specific changes (NVIDIA#533) - refactor: cleanup device constructor (NVIDIA#548) - bench: add cupy to array constructor kernel launch benchmarks (NVIDIA#547) - perf: cache dimension computations (NVIDIA#542) - perf: remove duplicated size computation (NVIDIA#537) - chore(perf): add torch to benchmark (NVIDIA#539) - test: speed up ipc tests by ~6.5x (NVIDIA#527) - perf: speed up kernel launch (NVIDIA#510) - perf: remove context threading in various pointer abstractions (NVIDIA#536) - perf: reduce the number of `__cuda_array_interface__` accesses (NVIDIA#538) - refactor: remove unnecessary custom map and set implementations (NVIDIA#530) - [Refactor][NFC] Vendor-in vectorize decorators for future CUDA-specific changes (NVIDIA#513) - test: add benchmarks for kernel launch for reproducibility (NVIDIA#528) - test(pixi): update pixi testing command to work with the new `testing` directory (NVIDIA#522) - refactor: fully remove `USE_NV_BINDING` (NVIDIA#525) - Draft: Vendor in the IR module (NVIDIA#439) - pyproject.toml: add search path for Pyrefly (NVIDIA#524) - Vendor in numba.core.typing for CUDA-specific changes (NVIDIA#473) - Use numba.config when available, otherwise use numba.cuda.config (NVIDIA#497) - [MNT] Drop NUMBA_CUDA_USE_NVIDIA_BINDING; always use cuda.core and cuda.bindings as fallback (NVIDIA#479) - Vendor in dispatcher, entrypoints, pretty_annotate for CUDA-specific changes (NVIDIA#502) - build: allow parallelization of nvcc testing builds (NVIDIA#521) - 
chore(dev-deps): add pixi (NVIDIA#505) - Vendor the imputils module for CUDA refactoring (NVIDIA#448) - Don't use `MemoryLeakMixin` for tests that don't use NRT (NVIDIA#519) - Switch back to stable cuDF release in thirdparty tests (NVIDIA#518) - Updating .gitignore with binaries in the `testing` folder (NVIDIA#516) - Remove some unnecessary uses of ContextResettingTestCase (NVIDIA#507) - Vendor in _helperlib cext for CUDA-specific changes (NVIDIA#512) - Vendor in typeconv for future CUDA-specific changes (NVIDIA#499) - [Refactor][NFC] Vendor-in numba.cpython modules for future CUDA-specific changes (NVIDIA#493) - [Refactor][NFC] Vendor-in numba.np modules for future CUDA-specific changes (NVIDIA#494) - Make the CUDA target the default for CUDA overload decorators (NVIDIA#511) - Remove C extension loading hacks (NVIDIA#506) - Ensure NUMBA can manipulate memory from CUDA graphs before the graph is launched (NVIDIA#437) - [Refactor][NFC] Vendor-in core Numba analysis utils for CUDA-specific changes (NVIDIA#433) - Fix Bf16 Test OB Error (NVIDIA#509) - Vendor in components from numba.core.runtime for CUDA-specific changes (NVIDIA#498) - [Refactor] Vendor in _dispatcher, _devicearray, mviewbuf C extension for CUDA-specific customization (NVIDIA#373) - [MNT] Managed UM memset fallback and skip CUDA IPC tests on WSL2 (NVIDIA#488) - Improve debug value range coverage (NVIDIA#461) - Add `compile_all` API (NVIDIA#484) - Vendor in core.registry for CUDA-specific changes (NVIDIA#485) - [Refactor][NFC] Vendor in numba.misc for CUDA-specific changes (NVIDIA#457) - Vendor in optional, boxing for CUDA-specific changes, fix dangling imports (NVIDIA#476) - [test] Remove dependency on cpu_target (NVIDIA#490) - Change dangling imports of numba.core.lowering to numba.cuda.lowering (NVIDIA#475) - [test] Use numpy's tolerance for float16 (NVIDIA#491) - [Refactor][NFC] Vendor-in numba.extending for future CUDA-specific changes (NVIDIA#466) - [Refactor][NFC] Vendor-in more cpython 
registries for future CUDA-specific changes (NVIDIA#478)

cpcloud requested a review from gmarkall October 7, 2025 18:36

leofang reviewed Oct 7, 2025

gmarkall added the "2 - In Progress" label Oct 8, 2025

cpcloud force-pushed the speedup-kernel-launch branch 2 times, most recently from f8a3c3a to 364e4e2 Compare October 13, 2025 21:11

gmarkall reviewed Oct 14, 2025

gmarkall reviewed Oct 14, 2025

cpcloud force-pushed the speedup-kernel-launch branch 2 times, most recently from b9765c9 to b09e3be Compare October 15, 2025 13:31

cpcloud mentioned this pull request Oct 15, 2025

test: add benchmarks for kernel launch for reproducibility #528

Merged

cpcloud force-pushed the speedup-kernel-launch branch from b11dc74 to 9ef5864 Compare October 15, 2025 18:24

cpcloud changed the title from "perf: speed up kernel launch by ~25%" to "perf: speed up kernel launch" Oct 15, 2025

cpcloud force-pushed the speedup-kernel-launch branch 2 times, most recently from 1386e40 to 1cc38f3 Compare October 15, 2025 19:16

gmarkall pushed a commit that referenced this pull request Oct 16, 2025

test: add benchmarks for kernel launch for reproducibility (#528)

3d34c31

This PR establishes a benchmarks directory where pytest-benchmark-based benchmarks can live. This is to serve as a baseline for #510.

cpcloud added 5 commits October 16, 2025 08:43

perf: remove unnecessary metaclass

5b87524

perf: remove unnecessary wrap_arg call

77c311c

perf: construct shape and strides args with minimal Python calls

bd19d73

perf: concretize as many attribute accesses as possible

0d68a22

refactor: reduce the noise around lst in _DeviceList

a1b6a1e

cpcloud added 2 commits October 16, 2025 08:43

refactor: clean up string formatting

9e3a39b

perf: only use the driver lock when necessary; avoid reentering the lock

f1105f0

cpcloud force-pushed the speedup-kernel-launch branch from 1cc38f3 to f1105f0 Compare October 16, 2025 12:43

perf: shave a few more µs off of attribute access

54d34f9

cpcloud marked this pull request as ready for review October 16, 2025 15:04

cpcloud requested a review from gmarkall October 16, 2025 15:04

gmarkall added the "3 - Ready for Review" label and removed the "2 - In Progress" label Oct 16, 2025

gmarkall approved these changes Oct 21, 2025

gmarkall merged commit 46cda77 into NVIDIA:main Oct 21, 2025
70 checks passed

cpcloud deleted the speedup-kernel-launch branch October 22, 2025 01:21

cpcloud mentioned this pull request Oct 23, 2025

[BUG] Large overhead when launching kernel with torch tensors #98

Open

gmarkall mentioned this pull request Nov 20, 2025

Bump version to 0.21.0 #602

Merged
