[FEA] Support for simpler ways to inline PTX

Presently, PTX can be inlined, but doing so requires writing NVVM IR with llvmlite inside a Numba extension: either typing plus a lowering function that generates an `ir.InlineAsm` instruction, or an overload plus an intrinsic with the necessary code generation.

It would be nicer to be able to write inline PTX more simply. There are two possible ways this could be done:

- Through Pythonic intrinsics for PTX instructions, if these were to exist.
- Through a simpler API that can be used directly in a kernel.

For the latter option, the usage could look like:

```python
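from numba import cuda

# `inline_ptx` is the proposed API; it does not exist yet.
@cuda.jit
def f(r, x):
    arg = x[0]
    # Arguments: PTX snippet, constraint string, tuple of input values
    result = inline_ptx("tanh.approx.f32 $0, $1;", "=f,f", (arg,))
    r[0] = result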
```

This mimics the CUDA C++ API for inline PTX, where the assembly snippet, constraints, and arguments all need to be provided.
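
Because the snippet, constraints, and arguments have to agree with each other, such an API would probably want to check them before generating any code. A minimal sketch of that check is below. `check_inline_ptx` is a hypothetical helper made up for illustration (it is not part of Numba or llvmlite), and it assumes only output constraints (prefixed with `=`) and input constraints, with no clobbers and no `$$` escapes:

```python
import re


def check_inline_ptx(asm, constraints, args):
    """Hypothetical validation for a proposed inline_ptx call.

    Returns (number of outputs, number of inputs) or raises ValueError.
    """
    parts = [c.strip() for c in constraints.split(",")]
    if not all(parts):
        raise ValueError("empty constraint in %r" % constraints)

    # Outputs are marked with a leading "="; everything else is an input.
    outputs = [c for c in parts if c.startswith("=")]
    inputs = [c for c in parts if not c.startswith("=")]

    # Operands are numbered in order, so outputs must come first.
    if parts != outputs + inputs:
        raise ValueError("output constraints must precede input constraints")

    # Each input constraint needs exactly one argument.
    if len(inputs) != len(args):
        raise ValueError(
            "expected %d argument(s), got %d" % (len(inputs), len(args))
        )

    # Every $N placeholder must refer to an existing operand.
    used = {int(n) for n in re.findall(r"\$(\d+)", asm)}
    bad = sorted(n for n in used if n >= len(parts))
    if bad:
        raise ValueError("placeholder(s) out of range: %s" % bad)

    return len(outputs), len(inputs)
```

For the example above, `check_inline_ptx("tanh.approx.f32 $0, $1;", "=f,f", (arg,))` would accept the call as one output and one input, whereas passing an empty argument tuple would be rejected.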

cc @leofang @oleksandr-pavlyk @benhg
