Conversation

@tpn (Contributor) commented Jun 10, 2025

This allows downstream passes, such as rewriting, to access information about the kernel launch in which they have been enlisted to participate.

Posting this as a PR now to get feedback on the overall approach. Assuming this solution is acceptable, I'll follow up with tests and docs.
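Roughly, the idea is to store a frozen LaunchConfig in a ContextVar that is set for the duration of a kernel launch. A simplified sketch of that shape follows; field and helper names here are illustrative, not necessarily the exact API in this PR:

# Simplified sketch of a contextvar-based launch config (illustrative names).
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class LaunchConfig:
    griddim: tuple
    blockdim: tuple
    stream: object = None
    sharedmem: int = 0

_launch_config: ContextVar = ContextVar("cuda_launch_config")

@contextmanager
def launch_config_ctx(config: LaunchConfig):
    # Installed by the dispatcher around a launch so that downstream passes
    # (e.g. rewriting) can look up the active configuration.
    token = _launch_config.set(config)
    try:
        yield config
    finally:
        _launch_config.reset(token)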

@copy-pr-bot (bot) commented Jun 10, 2025

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.


@tpn (Contributor, Author) commented Jun 10, 2025

Will close #280.

@gmarkall (Contributor) commented:

/ok to test

@gmarkall (Contributor) commented:

Using the simple benchmark from numba/numba#3003 (comment):

from numba import cuda
import numpy as np


@cuda.jit('void()')
def some_kernel_1():
    return

@cuda.jit('void(float32[:])')
def some_kernel_2(arr1):
    return

@cuda.jit('void(float32[:],float32[:])')
def some_kernel_3(arr1,arr2):
    return

@cuda.jit('void(float32[:],float32[:],float32[:])')
def some_kernel_4(arr1,arr2,arr3):
    return

@cuda.jit('void(float32[:],float32[:],float32[:],float32[:])')
def some_kernel_5(arr1,arr2,arr3,arr4):
    return

arr = cuda.device_array(10000, dtype=np.float32)

%timeit some_kernel_1[1, 1]()
%timeit some_kernel_2[1, 1](arr)
%timeit some_kernel_3[1, 1](arr,arr)
%timeit some_kernel_4[1, 1](arr,arr,arr)
%timeit some_kernel_5[1, 1](arr,arr,arr,arr)

On main:

6.6 μs ± 11.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
10.6 μs ± 79.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
14.1 μs ± 65.8 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
17.4 μs ± 236 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
20.3 μs ± 38.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

On this branch:

8.6 μs ± 70.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
13 μs ± 41.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
17.2 μs ± 152 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
20.3 μs ± 35.9 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
24.4 μs ± 44.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

From this crude benchmark, it appears to add 2-4 μs per launch, or roughly 17-23% overhead. I don't yet know how much weight we should give this (or whether the benchmark is representative), but I think it's a consideration to keep in mind when evaluating this approach.
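As a rough point of reference for where the extra time might come from, the following snippet (stand-in types, not this PR's actual classes) isolates just the cost of constructing a frozen config object and setting/resetting a ContextVar once per launch:

# Stand-in micro-benchmark; FakeLaunchConfig is hypothetical, not the PR's class.
import timeit
from contextvars import ContextVar
from dataclasses import dataclass

@dataclass(frozen=True, slots=True)
class FakeLaunchConfig:
    griddim: tuple
    blockdim: tuple
    stream: object
    sharedmem: int

_cv: ContextVar = ContextVar("launch_config")

def one_launch_worth():
    # One launch's worth of bookkeeping: build a config, set it, reset it.
    token = _cv.set(FakeLaunchConfig((1, 1, 1), (1, 1, 1), None, 0))
    _cv.reset(token)

n = 1_000_000
print(f"{timeit.timeit(one_launch_worth, number=n) / n * 1e6:.3f} µs per call")

On its own this is typically a small fraction of a microsecond, which suggests most of the observed delta comes from elsewhere in the launch path rather than from the ContextVar machinery itself.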

    stream=stream,
    sharedmem=sharedmem,
):
    if self.specialized:
Specialized kernels cannot be recompiled, so a new launch configuration would not be able to affect the compilation of a new version; this check could therefore be kept outside the context manager.
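A toy illustration of that suggestion (not the real dispatcher code; the names below are hypothetical) might look like:

# Toy sketch: resolve the specialized case before entering the hypothetical
# launch-config context manager, since no recompilation can happen in it.
from contextlib import contextmanager
from contextvars import ContextVar

_current = ContextVar("launch_config")

@contextmanager
def launch_config_ctx(cfg):
    token = _current.set(cfg)
    try:
        yield
    finally:
        _current.reset(token)

def resolve_kernel(dispatcher, cfg, args):
    if dispatcher.specialized:
        # Specialized kernels cannot be recompiled, so nothing here can
        # observe the launch config; skip the context manager entirely.
        return next(iter(dispatcher.overloads.values()))
    with launch_config_ctx(cfg):
        # Compilation triggered here may read _current.get().
        return dispatcher.compile_for(args)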



@dataclass(frozen=True, slots=True)
class LaunchConfig:
There seems to be considerable overlap between this class and dispatcher._LaunchConfiguration (just an observation at this point - I don't know whether it makes sense to combine them).

@kkraus14 (Contributor) commented:

From this crude benchmark, it appears to add 2-4 μs per launch, or roughly 17-23% overhead.

Kernel launch latency is something we need to care about going forward, and it is becoming more and more of a bottleneck in important workloads. Adding 2-4 μs per launch for this is probably unacceptable. That said, our launch latency is already much higher than we'd like in general, and we probably need to rework the entire launch path at some point in the not-too-distant future.

gmarkall added the "2 - In Progress (Currently a work in progress)" label on Jun 12, 2025
@gmarkall (Contributor) commented:

we probably need to rework the entire launch path at some point in the not too distant future.

I think that's another thing that concerns me - if we implement something like this, it constrains how we can rework the launch path in the future if we have to go on supporting it.

tpn force-pushed the 280-launch-config-contextvar branch from 019d60d to bba597d on July 14, 2025 at 02:36
tpn added 4 commits on September 29, 2025 at 15:01:

- This allows downstream passes, such as rewriting, to access information about the kernel launch in which they have been enlisted to participate.
- This routine raises an error if no launch config is set, which is inevitably going to be the preferred way of obtaining the current launch config (see the sketch below).
- This is required by cuda.coop in order to pass two-phase primitive instances as kernel parameters without having to call the @cuda.jit decorator with extensions=[...] up-front.
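A minimal sketch of that "raise if unset" accessor, assuming the ContextVar-based design sketched earlier (names are illustrative, not the PR's actual API):

from contextvars import ContextVar

_launch_config: ContextVar = ContextVar("cuda_launch_config")

def current_launch_config():
    # Return the active launch configuration, or raise if called outside a launch.
    try:
        return _launch_config.get()
    except LookupError:
        raise RuntimeError(
            "No kernel launch configuration is active; this must be called "
            "during a kernel launch."
        ) from None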
