
Commit c0d228e

Configuration for which metrics are enabled during tracing (#20223)
The so-called "low-overhead" tracing added in #19895 *can* have measurable overhead in some cases (see below). This PR adds configuration options to control which metrics are collected when tracing is enabled.

The default is to collect no traces, which has zero overhead. With `CUDF_POLARS_LOG_TRACES=1`, *all* tracing is enabled, including memory information from RMM and NVML and information on the input / output dataframes from cudf-polars. Users can disable specific metrics by setting additional environment variables. For example, this disables logging of memory (from RMM and NVML):

```
CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 python ...
```

and this disables the memory and dataframe-related metrics:

```
CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 CUDF_POLARS_LOG_TRACES_DATAFRAMES=0 python ...
```

This boxplot shows the runtime of our PDSH benchmarks at SF-3K with the distributed scheduler, using 8 workers with an H100 each and 5 iterations per run. There are 3 runs shown:

1. "on": tracing was enabled with `CUDF_POLARS_LOG_TRACES=1`
2. "off": tracing was not enabled
3. "time-only": tracing was enabled, but memory and dataframe metrics were disabled, with `CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 CUDF_POLARS_LOG_TRACES_DATAFRAMES=0`

<img width="1600" height="800" alt="tracing-overhead" src="https://github.com/user-attachments/assets/ccb4b454-233f-45d5-8a4e-36cb586e1ba0" />

The interesting parts are the large gaps between the "on" box and the two "off" / "time-only" boxes, which I've highlighted. These indicate that the tracing overhead is relatively large with all the metrics turned on. The limited tracing that only measures durations doesn't have that same overhead: the "off" and "time-only" boxes overlap.

---

A note on the implementation: I wasn't sure whether to make things opt-in or opt-out. Right now we have a mix (opt in to everything with `CUDF_POLARS_LOG_TRACES=1`, and opt out of specific metrics with `CUDF_POLARS_LOG_TRACES_MEMORY=0`). We could easily make it opt-in to specific metrics (e.g. `CUDF_POLARS_LOG_TRACES_MEMORY=1` would enable just memory tracing, `CUDF_POLARS_LOG_TRACES_DATAFRAMES=1` would enable just dataframe tracing). Neither option seemed obviously better to me.

Authors:
  - Tom Augspurger (https://github.com/TomAugspurger)

Approvers:
  - Matthew Murray (https://github.com/Matt711)

URL: #20223
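For reference, a minimal sketch (not part of this PR) of driving the same "time-only" configuration from Python instead of the shell. It assumes the variables are set before cudf-polars first reads them (the diff below reads them at module import time), so exporting them in the shell before launching the process remains the safer option; the query is the one from the docs example.

```python
# Hedged sketch: "time-only" tracing configured via os.environ instead of the shell.
import os

os.environ["CUDF_POLARS_LOG_TRACES"] = "1"             # enable tracing
os.environ["CUDF_POLARS_LOG_TRACES_MEMORY"] = "0"      # opt out of RMM/NVML metrics
os.environ["CUDF_POLARS_LOG_TRACES_DATAFRAMES"] = "0"  # opt out of dataframe metrics

import polars as pl

df = pl.DataFrame({"a": ["a", "a", "b"], "b": [1, 2, 3]}).lazy()
print(df.group_by("a").agg(pl.col("b").min().alias("min")).collect(engine="gpu"))
```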
1 parent deda9cb commit c0d228e

2 files changed (+75, -32 lines)

docs/cudf/source/cudf_polars/usage.md

Lines changed: 37 additions & 8 deletions
@@ -89,18 +89,47 @@ shape: (3, 3)
 cudf-polars can optionally trace execution of each node in the query plan.
 To enable tracing, set the environment variable ``CUDF_POLARS_LOG_TRACES`` to a
 true value ("1", "true", "y", "yes") before starting your process. This will
-capture and log information about each node before and after it executes, including information on
+capture and log information about each node before and after it executes, and includes
+information on timing, memory usage, and the input and output dataframes. The log message
+includes the following fields:
+
+| Field Name | Type | Description |
+| ---------- | ----- | ----------- |
+| type | string | The name of the IR node |
+| start | int | A nanosecond-precision counter indicating when this node started executing |
+| stop | int | A nanosecond-precision counter indicating when this node finished executing |
+| overhead | int | The overhead, in nanoseconds, added by tracing |
+| `count_frames_{phase}` | int | The number of dataframes for the input / output `phase`. This metric can be disabled by setting `CUDF_POLARS_LOG_TRACES_DATAFRAMES=0`. |
+| `frames_{phase}` | `list[dict]` | A list of dictionaries with "shape" and "size" fields, one per dataframe, for the input / output `phase`. This metric can be disabled by setting `CUDF_POLARS_LOG_TRACES_DATAFRAMES=0`. |
+| `total_bytes_{phase}` | int | The sum of the sizes (in bytes) of the dataframes for the input / output `phase`. This metric can be disabled by setting `CUDF_POLARS_LOG_TRACES_MEMORY=0`. |
+| `rmm_current_bytes_{phase}` | int | The current number of bytes allocated by the RMM memory resource used by cudf-polars for the input / output `phase`. This metric can be disabled by setting `CUDF_POLARS_LOG_TRACES_MEMORY=0`. |
+| `rmm_current_count_{phase}` | int | The current number of allocations made by the RMM memory resource used by cudf-polars for the input / output `phase`. This metric can be disabled by setting `CUDF_POLARS_LOG_TRACES_MEMORY=0`. |
+| `rmm_peak_bytes_{phase}` | int | The peak number of bytes allocated by the RMM memory resource used by cudf-polars for the input / output `phase`. This metric can be disabled by setting `CUDF_POLARS_LOG_TRACES_MEMORY=0`. |
+| `rmm_peak_count_{phase}` | int | The peak number of allocations made by the RMM memory resource used by cudf-polars for the input / output `phase`. This metric can be disabled by setting `CUDF_POLARS_LOG_TRACES_MEMORY=0`. |
+| `rmm_total_bytes_{phase}` | int | The total number of bytes allocated by the RMM memory resource used by cudf-polars for the input / output `phase`. This metric can be disabled by setting `CUDF_POLARS_LOG_TRACES_MEMORY=0`. |
+| `rmm_total_count_{phase}` | int | The total number of allocations made by the RMM memory resource used by cudf-polars for the input / output `phase`. This metric can be disabled by setting `CUDF_POLARS_LOG_TRACES_MEMORY=0`. |
+| `nvml_current_bytes_{phase}` | int | The device memory usage of this process, as reported by NVML, for the input / output `phase`. This metric can be disabled by setting `CUDF_POLARS_LOG_TRACES_MEMORY=0`. |
+
+Setting `CUDF_POLARS_LOG_TRACES=1` enables all the metrics. Depending on the query, the overhead
+from collecting the memory or dataframe metrics can be measurable. You can disable some metrics
+through additional environment variables. For example, to disable the memory-related metrics, set:

-- The type of the node being executed (e.g. `Scan`, `Select`, `Join`, `Groupby`, etc.)
-- The shape and size (in memory) of each input and output DataFrame
-- The GPU memory usage, as reported by [nvml], before and after executing the node
-- A start and stop clock, which can be used to measure the duration of the node's execution
+```
+CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0
+```
+
+And to disable the memory and dataframe metrics, which essentially leaves just
+the duration metrics, set
+```
+CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 CUDF_POLARS_LOG_TRACES_DATAFRAMES=0
+```
+
+Note that tracing still needs to be enabled with `CUDF_POLARS_LOG_TRACES=1`.

 The implementation uses [structlog] to build log records. You can configure the
 output using structlog's [configuration][structlog-configure] and enrich the
 records with [context variables][structlog-context].

-
 ```
 >>> df = pl.DataFrame({"a": ["a", "a", "b"], "b": [1, 2, 3]}).lazy()
 >>> df.group_by("a").agg(pl.col("b").min().alias("min"), pl.col("b").max().alias("max")).collect(engine="gpu")
@@ -120,5 +149,5 @@ shape: (2, 3)
 [nvml]: https://developer.nvidia.com/management-library-nvml
 [rmm-stats]: https://docs.rapids.ai/api/rmm/stable/guide/#memory-statistics-and-profiling
 [structlog]: https://www.structlog.org/
-[structlog-configure](https://www.structlog.org/en/stable/configuration.html)
-[structlog-context](https://www.structlog.org/en/stable/contextvars.html)
+[structlog-configure]: https://www.structlog.org/en/stable/configuration.html
+[structlog-context]: https://www.structlog.org/en/stable/contextvars.html
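Since the traces are emitted through structlog, the output sink can be configured with structlog's standard configuration API. A minimal sketch, not part of this commit, that writes every trace record as one JSON line to a hypothetical `traces.jsonl` file for later analysis:

```python
from pathlib import Path

import structlog

# Route structlog events (including the cudf-polars trace records) to a
# JSON-lines file; each line is one event dict with the fields listed above.
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),  # add a wall-clock timestamp
        structlog.processors.JSONRenderer(),          # render the event dict as JSON
    ],
    logger_factory=structlog.WriteLoggerFactory(
        file=Path("traces.jsonl").open("at")  # hypothetical output path
    ),
)
```

The field names (`type`, `start`, `stop`, `total_bytes_*`, and so on) are then queryable with ordinary JSON tooling.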

python/cudf_polars/cudf_polars/dsl/tracing.py

Lines changed: 38 additions & 24 deletions
@@ -30,6 +30,12 @@
 LOG_TRACES = _HAS_STRUCTLOG and _bool_converter(
     os.environ.get("CUDF_POLARS_LOG_TRACES", "0")
 )
+LOG_MEMORY = LOG_TRACES and _bool_converter(
+    os.environ.get("CUDF_POLARS_LOG_TRACES_MEMORY", "1")
+)
+LOG_DATAFRAMES = LOG_TRACES and _bool_converter(
+    os.environ.get("CUDF_POLARS_LOG_TRACES_DATAFRAMES", "1")
+)

 CUDF_POLARS_NVTX_DOMAIN = "cudf_polars"

@@ -84,40 +90,48 @@ def make_snapshot(

     d: dict[str, Any] = {
         "type": ir_name,
-        f"count_frames_{phase}": len(frames),
-        f"frames_{phase}": [
-            {
-                "shape": frame.table.shape(),
-                "size": sum(col.device_buffer_size() for col in frame.table.columns()),
-            }
-            for frame in frames
-        ],
     }
-    d[f"total_bytes_{phase}"] = sum(x["size"] for x in d[f"frames_{phase}"])

-    stats = rmm.statistics.get_statistics()
-    if stats:
+    if LOG_DATAFRAMES:
         d.update(
             {
-                f"rmm_current_bytes_{phase}": stats.current_bytes,
-                f"rmm_current_count_{phase}": stats.current_count,
-                f"rmm_peak_bytes_{phase}": stats.peak_bytes,
-                f"rmm_peak_count_{phase}": stats.peak_count,
-                f"rmm_total_bytes_{phase}": stats.total_bytes,
-                f"rmm_total_count_{phase}": stats.total_count,
+                f"count_frames_{phase}": len(frames),
+                f"frames_{phase}": [
+                    {
+                        "shape": frame.table.shape(),
+                        "size": sum(
+                            col.device_buffer_size() for col in frame.table.columns()
+                        ),
+                    }
+                    for frame in frames
+                ],
             }
         )
+        d[f"total_bytes_{phase}"] = sum(x["size"] for x in d[f"frames_{phase}"])
+
+    if LOG_MEMORY:
+        stats = rmm.statistics.get_statistics()
+        if stats:
+            d.update(
+                {
+                    f"rmm_current_bytes_{phase}": stats.current_bytes,
+                    f"rmm_current_count_{phase}": stats.current_count,
+                    f"rmm_peak_bytes_{phase}": stats.peak_bytes,
+                    f"rmm_peak_count_{phase}": stats.peak_count,
+                    f"rmm_total_bytes_{phase}": stats.total_bytes,
+                    f"rmm_total_count_{phase}": stats.total_count,
+                }
+            )

+        if device_handle is not None:
+            processes = pynvml.nvmlDeviceGetComputeRunningProcesses(device_handle)
+            for proc in processes:
+                if proc.pid == pid:
+                    d[f"nvml_current_bytes_{phase}"] = proc.usedGpuMemory
+                    break
     if extra:
         d.update(extra)

-    if device_handle is not None:
-        processes = pynvml.nvmlDeviceGetComputeRunningProcesses(device_handle)
-        for proc in processes:
-            if proc.pid == pid:
-                d[f"nvml_current_bytes_{phase}"] = proc.usedGpuMemory
-                break
-
     return d

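`_bool_converter` is used above but defined outside this diff. A hypothetical sketch of such a helper, based only on the true values the docs list ("1", "true", "y", "yes"):

```python
def _bool_converter(value: str) -> bool:
    """Hypothetical sketch; the real cudf-polars helper is not shown in this diff.

    Treats the documented true values ("1", "true", "y", "yes") case-insensitively
    as True and anything else as False.
    """
    return value.strip().lower() in {"1", "true", "y", "yes"}
```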