Trace node execution in cudf-polars #19895
Conversation
This PR introduces a new *low-overhead* tracing tool for cudf-polars. When enabled, we'll capture a record for each `IR.do_evaluate` node executed while running the polars query.
wence-
left a comment
Can you show some benchmark data for the overhead this adds? In both off and on modes?
First, there's the extra time at import (measured with …).

The overall … So tl;dr: I'm not worried about the import time cost. Which leaves the runtime overhead. Let's measure the cost of executing a `DataFrameScan` node:

```python
import os
import sys
import time

import cudf_polars
import cudf_polars.containers
import polars as pl
from cudf_polars.dsl.translate import Translator

q = pl.DataFrame({"a": [1, 2, 3]}).lazy()
ir = Translator(q._ldf.visit(), pl.GPUEngine()).translate_ir()
df = q.collect()
schema = {
    "a": cudf_polars.containers.DataType(pl.Int64)
}
t = type(ir)

t0 = time.monotonic()
t.do_evaluate(schema, df, None)
t1 = time.monotonic()
print("CUDF_POLARS_LOG_TRACES", os.environ.get("CUDF_POLARS_LOG_TRACES", "0"), file=sys.stderr)
print(f"Initial: {t1 - t0:0.2f} seconds", file=sys.stderr)

t0 = time.monotonic()
for i in range(100):
    t.do_evaluate(schema, df, None)
t1 = time.monotonic()
print(f"Subsequent: {t1 - t0:0.2f} seconds", file=sys.stderr)
```

On main, the first call takes between 0.4 and 0.6 seconds, and the subsequent … On this branch without tracing enabled, the first call takes roughly 0.4-0.6 seconds, and the … With tracing, the first call takes between 0.4 and 0.6 seconds (very roughly the same). The …

I think that `DataFrameScan` with a small input should be the worst-case scenario. In general, the amount of overhead is fixed, so the faster the node, the larger the overhead in percentage terms. The only part of the tracing that scales is with the number of input dataframes and the number of columns per dataframe (but not with the number of rows). Overall, I think things look acceptable.
I've run the full pdsh benchmarks for both single-GPU (SF-1K) and multi-GPU (SF-3K) from this branch with 3 iterations each. The …

```python
import pandas as pd
import json
import pyarrow as pa

runs = [json.loads(x) for x in open("single.jsonl").readlines()]

dfs = []
for run in runs:
    for query_id, records in run["records"].items():
        for iteration_id, iteration in enumerate(records):
            traces = iteration["traces"]
            df = pa.Table.from_pylist(traces).to_pandas(types_mapper=pd.ArrowDtype)
            dfs.append(df)

df = pd.concat(dfs)
df.to_parquet("traces.parquet")
```

And the file drops to about 2 MB. I think keeping them in the …
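For downstream analysis, a trace table like this could then be summarized with ordinary pandas operations. This is just an illustrative sketch: the `name` and `duration` column names are assumptions for the example, not necessarily the fields the traces actually contain.

```python
import pandas as pd

df = pd.read_parquet("traces.parquet")

# Total and mean time per IR node type, using hypothetical
# "name" (node type) and "duration" (seconds) columns.
summary = (
    df.groupby("name")["duration"]
    .agg(["count", "sum", "mean"])
    .sort_values("sum", ascending=False)
)
print(summary)
```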
Matt711
left a comment
Thanks @TomAugspurger, I think this is a nice low-overhead improvement. We aren't including a network latency breakdown in this PR, but do you think that's something we could eventually achieve?

At the very least there is a `--nic-metrics` flag in nsys, so we could turn that on (if we haven't already) when profiling to get an idea of network activity.
One clarifying question: do you mean network latency for things like reading parquet files over the network? Or network latency for moving data between nodes in some kind of multi-GPU setup?

The tracing here is fundamentally at the "execute this IR node" level, so it could possibly capture things like the network activity that happened while executing the IR node. That said, any kind of metric based off reading a device counter would be hard to trace back to a specific IR node, since IIUC other stuff could be happening on that device to increment that counter.

Then there's the communication that needs to happen to ensure that the data necessary to execute some node is in place, in a multi-GPU setup. Today, that's essentially outside of cudf-polars' hands. With our dask-based execution, communication happens implicitly thanks to the task framework: some IR node will depend on the output of one or more earlier IR nodes, and Dask will take care of getting the data there for you. That would show up in this tracing only as a gap in time between the conclusion of one IR node and the start of the subsequent nodes. This is a major weakness of doing the tracing at the IR-node-execution level.
/merge

Merged ebe3d8b into rapidsai:branch-25.12
Fixes a regression in #19895, which changed the logic for when we catch this exception and fall back to the default device memory size.

Authors:
- Tom Augspurger (https://github.com/TomAugspurger)

Approvers:
- Matthew Murray (https://github.com/Matt711)

URL: #20179
The so-called "low-overhead" tracing added in #19895 *can* have some measurable overhead in some cases (see below). This PR adds additional configuration options to control which metrics are collected when tracing is enabled. The default is not to collect any traces, which is zero overhead. With `CUDF_POLARS_LOG_TRACES=1`, *all* tracing is enabled, which includes information on memory from RMM and NVML, and information on the input / output dataframes from cudf-polars. Users can disable certain metrics by setting another environment variable. For example, this would disable logging of memory (from RMM and NVML):

```
CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 python ...
```

and this would disable the memory and dataframe-related metrics:

```
CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 CUDF_POLARS_LOG_TRACES_DATAFRAMES=0 python ...
```

This boxplot shows the runtime of our PDSH benchmarks at SF-3K with the distributed scheduler, using 8 workers with an H100 each and 5 iterations per run. There are 3 runs shown:

1. "on": tracing was enabled with `CUDF_POLARS_LOG_TRACES=1`
2. "off": tracing was not enabled
3. "time-only": tracing was enabled, but memory and dataframe metrics were disabled, with `CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 CUDF_POLARS_LOG_TRACES_DATAFRAMES=0`

<img width="1600" height="800" alt="tracing-overhead" src="https://github.com/user-attachments/assets/ccb4b454-233f-45d5-8a4e-36cb586e1ba0" />

The interesting parts are the large gaps between the "on" box and the two "off" / "time-only" boxes, which I've highlighted. These indicate that the tracing overhead is relatively large with all the metrics turned on. But the limited tracing that only measures durations doesn't have that same overhead, since the "off" and "time-only" boxes overlap.

---

A note on the implementation: I wasn't sure whether to make things opt-in or opt-out. Right now we have a mix (opt in to everything with `CUDF_POLARS_LOG_TRACES=1`, and opt out of specific metrics with `CUDF_POLARS_LOG_TRACES_MEMORY=0`). We could easily make it opt-in to specific metrics (e.g. `CUDF_POLARS_LOG_TRACES_MEMORY=1` would enable just memory, `CUDF_POLARS_LOG_TRACES_DATAFRAMES=1` would enable just dataframe tracing). Neither option seemed obviously better to me.

Authors:
- Tom Augspurger (https://github.com/TomAugspurger)

Approvers:
- Matthew Murray (https://github.com/Matt711)

URL: #20223
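A sketch of how opt-in/opt-out environment variables like these could be interpreted; the helper and constant names below are hypothetical, only the environment variable names come from the description above.

```python
import os


def _env_flag(name: str, default: bool) -> bool:
    """Read a boolean flag from the environment: "0" disables, anything else enables."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value != "0"


# Master switch: no tracing at all unless CUDF_POLARS_LOG_TRACES=1.
TRACING = _env_flag("CUDF_POLARS_LOG_TRACES", default=False)

# Individual metrics default to "on" when tracing is enabled, and can be
# opted out of with e.g. CUDF_POLARS_LOG_TRACES_MEMORY=0.
TRACE_MEMORY = TRACING and _env_flag("CUDF_POLARS_LOG_TRACES_MEMORY", default=True)
TRACE_DATAFRAMES = TRACING and _env_flag("CUDF_POLARS_LOG_TRACES_DATAFRAMES", default=True)
```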
Description
This PR introduces a new low-overhead tracing tool for cudf-polars. When enabled, we'll capture a record for each `IR.do_evaluate` node executed while running the polars query. The record will contain …

The overhead of this when not tracing should just be a conditional check against a constant bool, and an extra function call. The overhead when tracing is enabled should be minimal (mostly getting the dataframe sizes, RMM statistics, and emitting the log record).
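To illustrate the kind of guard described above (a single check against a module-level constant when tracing is off), here is a minimal sketch; the names `_TRACING_ENABLED`, `_log_trace`, and `traced` are hypothetical and not the PR's actual implementation.

```python
import os
import time
from functools import wraps

# Hypothetical module-level constant: read once at import time, so the
# per-call cost with tracing off is one boolean check plus a call frame.
_TRACING_ENABLED: bool = os.environ.get("CUDF_POLARS_LOG_TRACES", "0") == "1"


def _log_trace(node_name: str, duration: float) -> None:
    # Placeholder for emitting the structured log record.
    print({"node": node_name, "duration": duration})


def traced(func):
    """Wrap an evaluation function, recording a trace only when enabled."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        if not _TRACING_ENABLED:
            # Off path: no timing, no logging.
            return func(*args, **kwargs)
        start = time.monotonic()
        result = func(*args, **kwargs)
        _log_trace(func.__qualname__, time.monotonic() - start)
        return result

    return wrapper
```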
The implementation is based on structlog. This is helpful for us because we can inject additional context (like the PDSH query ID being run in our benchmarks) into the log records.
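As a rough illustration of how structlog's context injection can work (the logger name and field names here are assumptions for the example, not the PR's actual fields):

```python
import structlog

log = structlog.get_logger("cudf_polars.tracing")  # logger name is an assumption

# Bind benchmark-level context once; every log record emitted afterwards on this
# thread/task carries the bound key-value pairs (structlog's merge_contextvars
# processor, part of the default configuration in recent versions, handles this).
structlog.contextvars.bind_contextvars(pdsh_query=7, iteration=0)

# A per-node record then only needs the node-specific fields.
log.info("do_evaluate", node="DataFrameScan", duration_s=0.012)

# Unbind when the query finishes so the context doesn't leak into later queries.
structlog.contextvars.unbind_contextvars("pdsh_query", "iteration")
```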
As an example of what this can be used for, I've updated our PDSH benchmarks to include a new `--collect-traces` flag. When set, this will include the `traces` on each `record` exported to our `pdsh_results.jsonl` file. The `records` for that run (one per query / iteration pair) will now contain a `list[traces]` field, with the traces from each node executed during each query.

Checklist