Log GPU timing in cudf-polars traces#21970

Draft
TomAugspurger wants to merge 8 commits into rapidsai:main from TomAugspurger:tom/gpu-timing

Conversation

@TomAugspurger
Contributor

@TomAugspurger TomAugspurger commented Mar 31, 2026

Description

The tracing in cudf-polars is currently based around host times. We record a start and stop based on Python's monotonic_ns around a call to IR.do_evaluate. But inside that IR.do_evaluate we make some non-blocking, asynchronous pylibcudf calls whose runtime on the GPU extends past the end of stop.

To measure the actual runtime of the GPU operations associated with an IR node, we need to measure when that sequence of GPU operations actually finishes. There are several ways to do this, but I've opted for CUDA Events and cudaLaunchHostFunc.

Using cudaLaunchHostFunc to call a Python function can be fraught (deadlocks, often related to the GIL, are apparently a risk), so we deliberately keep the work done inside this function (and so on the thread calling it) simple: just putting an integer completion token on a Queue.

The actual logging is done by a background thread.
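The split described above — a trivial callback that only enqueues a token, with all real work moved to a background thread — can be sketched in plain Python. This is a hypothetical illustration of the pattern, not the PR's code: the CUDA host callback is simulated by calling `on_gpu_complete` directly, since the `cudaLaunchHostFunc` side is not shown here.

```python
import queue
import threading
import time

completion_queue: "queue.Queue[int | None]" = queue.Queue()
gpu_traces: dict[int, int] = {}


def on_gpu_complete(token: int) -> None:
    # In the real design this body would run inside the cudaLaunchHostFunc
    # callback, so it must stay trivial: just enqueue the completion token.
    completion_queue.put(token)


def logger() -> None:
    # Background thread: timestamping and logging happen here, safely
    # outside the thread that CUDA invokes the host function on.
    while True:
        token = completion_queue.get()
        if token is None:  # shutdown sentinel
            break
        gpu_traces[token] = time.monotonic_ns()


t = threading.Thread(target=logger, daemon=True)
t.start()

on_gpu_complete(1)  # simulate the GPU finishing the work for IR node 1
completion_queue.put(None)
t.join()
```

The key property is that the callback does no locking, allocation, or logging of its own, which keeps the CUDA-invoked thread from blocking on anything Python-side.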

When enabled, we'll emit two traces per IR.do_evaluate:

  1. a trace with scope=evaluate_ir_node containing host information
  2. a trace with scope=evaluate_ir_node_gpu containing gpu information

We include a trace_event_id (a UUID generated on the fly) so that consumers can correlate GPU traces with host traces.
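A minimal sketch of what two correlated trace records could look like, assuming only the fields named above (`scope`, `trace_event_id`); any other field names here are illustrative, not the PR's actual schema:

```python
import json
import time
import uuid

# Generated on the fly, shared by both records for one IR.do_evaluate call.
trace_event_id = str(uuid.uuid4())

host_trace = {
    "scope": "evaluate_ir_node",       # host-side timing information
    "trace_event_id": trace_event_id,
    "start_ns": time.monotonic_ns(),   # illustrative field name
}
gpu_trace = {
    "scope": "evaluate_ir_node_gpu",   # GPU-side timing information
    "trace_event_id": trace_event_id,
}

# Consumers join the two records on trace_event_id.
print(json.dumps([host_trace, gpu_trace]))
```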

@copy-pr-bot

copy-pr-bot Bot commented Mar 31, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions Bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Mar 31, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Mar 31, 2026
@TomAugspurger TomAugspurger changed the title from "Log GPU Events" to "Log GPU traces in cudf-polars" Mar 31, 2026
@TomAugspurger
Contributor Author

TomAugspurger commented Mar 31, 2026

Using cudaLaunchHostFunc to call a python function can be fraught

Proof: this is currently hanging on some queries. I don't know why yet, but it's related to recording the completion event on result.stream. An earlier version used kwargs["context"].get_cuda_stream(). That didn't hang, but it also didn't correctly measure the time the GPU operations took (since the stream from get_cuda_stream() might not be downstream of the operations in IR.do_evaluate).

@TomAugspurger
Contributor Author

Using cudaLaunchHostFunc to call a python function can be fraught

Proof 2: even after "fixing" that by recording the completion event on another stream with:

    timing_stream = get_joined_cuda_stream(
        exec_ctx.get_cuda_stream,
        upstreams=[exec_ctx.get_cuda_stream(), result.stream],
    )

I'm getting intermittent hangs. Maybe this suggests the approach of using cuda.bindings plus a "simple" host callback function is more fragile than I thought. Or maybe I'm missing something about how we're supposed to record the events.
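The semantics a "joined" stream is after (the helper name get_joined_cuda_stream comes from the snippet above) can be illustrated with host-side events: the joined stream must not run until every upstream has signalled completion, analogous to recording a CUDA event on each upstream stream and issuing cudaStreamWaitEvent on the joined one. This is a pure-Python analogy of that ordering constraint, not CUDA code:

```python
import threading

def joined_wait(upstreams: list[threading.Event]) -> None:
    # Analogue of making the joined stream wait on an event recorded
    # on each upstream stream (cudaStreamWaitEvent semantics).
    for ev in upstreams:
        ev.wait()

a, b = threading.Event(), threading.Event()
order: list[str] = []

def worker() -> None:
    joined_wait([a, b])  # "joined stream" work runs only after both upstreams
    order.append("joined")

t = threading.Thread(target=worker)
t.start()
order.append("a")
a.set()
order.append("b")
b.set()
t.join()
```

The intermittent hangs reported above would correspond, in this analogy, to an event that is waited on but never recorded.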

Comment thread python/cudf_polars/cudf_polars/dsl/ir.py Outdated
Comment thread python/cudf_polars/cudf_polars/utils/cuda_stream.py
Comment thread python/cudf_polars/cudf_polars/utils/config.py Outdated
Comment on lines +205 to +206
trace_event_id: str | None = None
query_id_str: str | None = None
Contributor Author


I think query_id should be included through context variables?


from __future__ import annotations

import ctypes
Contributor Author


ctypes probably isn't the way to do this long term, but might be OK as a POC.

@TomAugspurger TomAugspurger added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 1, 2026
@TomAugspurger
Contributor Author

/ok to test 1eb332d

@TomAugspurger TomAugspurger changed the title from "Log GPU traces in cudf-polars" to "Log GPU timing in cudf-polars traces" Apr 9, 2026

Labels

- cudf-polars: Issues specific to cudf-polars
- improvement: Improvement / enhancement to an existing function
- non-breaking: Non-breaking change
- Python: Affects Python cuDF API.

Projects

Status: In Progress


2 participants