Log GPU timing in cudf-polars traces #21970

TomAugspurger wants to merge 8 commits into rapidsai:main
Conversation
Proof: this is currently hanging on some queries. I don't know why yet, but it's related to recording the completion event on …

Proof 2: even after "fixing" that (by recording the completion event on another stream with …) I'm getting intermittent hangs. Maybe this suggests the approach of using …
    trace_event_id: str | None = None
    query_id_str: str | None = None
I think query_id should be included through context variables?
    from __future__ import annotations

    import ctypes
ctypes probably isn't the way to do this long term, but might be OK as a POC.
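As a host-side sketch of what the ctypes plumbing can look like (the actual `cudaLaunchHostFunc` call is omitted so this runs without CUDA; `HOST_FN`, `on_complete`, and the queue handling are illustrative, not the PR's actual code):

```python
import ctypes
import queue

# Completion tokens land here; a background thread would drain this queue
# and do the actual logging, keeping the CUDA-invoked callback trivial.
completions: "queue.Queue[int]" = queue.Queue()

# cudaLaunchHostFunc expects a `void (*fn)(void *userData)` callback.
HOST_FN = ctypes.CFUNCTYPE(None, ctypes.c_void_p)

@HOST_FN
def on_complete(user_data):
    # Keep this minimal: just record the token. No logging, no locks.
    completions.put(int(user_data or 0))

# In the real code this callback would be passed to cudaLaunchHostFunc on a
# stream; here we invoke it directly to show the data flow.
on_complete(ctypes.c_void_p(42))
print(completions.get())  # 42
```

A longer-term option would be a small Cython or pylibcudf-level binding instead of ctypes, but the callback shape stays the same.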
/ok to test 1eb332d
Description
The tracing in cudf-polars is currently based around *host* times. We record a `start` and `stop` based on Python's `monotonic_ns` around a call to `IR.do_evaluate`. But inside of that `IR.do_evaluate` we call some non-blocking, asynchronous pylibcudf calls whose runtime extends past the end of `stop`, running on the GPU. To measure the *actual* runtime of GPU operations associated with some IR node, we need to measure when that sequence of GPU operations actually finishes. There are several ways to do this, but I've opted for CUDA Events and `cudaLaunchHostFunc`.

Using `cudaLaunchHostFunc` to call a *Python* function can be fraught (deadlocks, often related to the GIL, are apparently a risk), so we deliberately keep the work done inside this function (and so, on the thread calling it) simple: just putting an integer completion token on a Queue. The actual logging is done by a background thread.
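The completion-token flow described above can be sketched host-side (the CUDA event and `cudaLaunchHostFunc` pieces are replaced by a plain function call, and all names are illustrative, not the PR's code):

```python
import queue
import threading

completions: "queue.Queue[int | None]" = queue.Queue()
logged: list[str] = []

def logger() -> None:
    # Background thread: all real logging work happens here, off the
    # thread that CUDA uses to run the host function.
    while (token := completions.get()) is not None:
        logged.append(f"gpu work for event {token} finished")

t = threading.Thread(target=logger, daemon=True)
t.start()

def host_fn(token: int) -> None:
    # Stand-in for the cudaLaunchHostFunc callback: do the bare minimum,
    # since blocking or re-entering Python heavily here risks deadlocks.
    completions.put(token)

host_fn(1)
host_fn(2)
completions.put(None)  # sentinel: shut the logger down
t.join()
print(logged)
```

Because the queue is the only shared state, the callback never takes locks or does I/O, which is the property the PR relies on to keep `cudaLaunchHostFunc` safe.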
When enabled, we'll emit two traces per `IR.do_evaluate`:

- `scope=evaluate_ir_node` containing host information
- `scope=evaluate_ir_node_gpu` containing GPU information

We include a `trace_event_id` (a UUID generated on the fly) so that consumers can correlate GPU traces with host traces.
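Correlating the two trace records on `trace_event_id` might look like this on the consumer side (the record fields other than `scope` and `trace_event_id` are made up for illustration):

```python
import uuid
from collections import defaultdict

event_id = str(uuid.uuid4())

# One host trace and one GPU trace emitted for the same IR node.
traces = [
    {"scope": "evaluate_ir_node", "trace_event_id": event_id, "host_ns": 1200},
    {"scope": "evaluate_ir_node_gpu", "trace_event_id": event_id, "gpu_ns": 5400},
]

# A consumer can join host and GPU information on the shared id.
by_event: dict[str, dict] = defaultdict(dict)
for rec in traces:
    by_event[rec["trace_event_id"]][rec["scope"]] = rec

merged = by_event[event_id]
print(sorted(merged))  # ['evaluate_ir_node', 'evaluate_ir_node_gpu']
```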