Commit c0d228e
Configuration for which metrics are enabled during tracing (#20223)
The so-called "low-overhead" tracing added in #19895 *can* have some measurable overhead in some cases (see below).
This PR adds configuration options to control which metrics are collected when tracing is enabled. By default no traces are collected, so there is zero overhead. With `CUDF_POLARS_LOG_TRACES=1`, *all* tracing is enabled, which includes memory information from RMM and NVML and information on the input / output dataframes from cudf-polars. Users can disable specific metrics by setting additional environment variables. For example, this would disable logging of memory (from RMM and NVML):
```
CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 python ...
```
and this would disable the memory and dataframe-related metrics:
```
CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 CUDF_POLARS_LOG_TRACES_DATAFRAMES=0 python ...
```
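To make the mixed opt-in / opt-out behaviour concrete, here is a minimal sketch of how the three environment variables described above could be resolved into a set of enabled metrics. The function names `tracing_config` and `_flag`, and the set of accepted truthy strings, are assumptions made for illustration; they are not taken from the cudf-polars source.

```python
import os
from collections.abc import Mapping
from typing import Optional


def _flag(environ: Mapping[str, str], name: str, default: str) -> bool:
    """Interpret an environment variable as a boolean flag.

    The accepted truthy spellings are an assumption of this sketch.
    """
    return environ.get(name, default).strip().lower() in {"1", "true", "yes", "on"}


def tracing_config(environ: Optional[Mapping[str, str]] = None) -> dict[str, bool]:
    """Hypothetical helper: resolve which trace metrics are enabled.

    Tracing as a whole is opt-in (off unless CUDF_POLARS_LOG_TRACES is set),
    while the individual metrics are opt-out (on unless explicitly set to 0).
    """
    env = os.environ if environ is None else environ
    enabled = _flag(env, "CUDF_POLARS_LOG_TRACES", "0")
    return {
        "traces": enabled,
        "memory": enabled and _flag(env, "CUDF_POLARS_LOG_TRACES_MEMORY", "1"),
        "dataframes": enabled and _flag(env, "CUDF_POLARS_LOG_TRACES_DATAFRAMES", "1"),
    }
```

Under these assumptions, `tracing_config({"CUDF_POLARS_LOG_TRACES": "1", "CUDF_POLARS_LOG_TRACES_MEMORY": "0"})` leaves durations and dataframe metrics on and turns memory metrics off, matching the first command above.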
This boxplot shows the runtime of our PDSH benchmarks at SF-3K with the distributed scheduler, using 8 workers with an H100 each and 5 iterations per run. Three runs are shown:
1. "on": tracing was enabled with `CUDF_POLARS_LOG_TRACES=1`
2. "off": tracing was not enabled
3. "time-only": tracing was enabled, but memory and dataframe metrics were disabled, with `CUDF_POLARS_LOG_TRACES=1 CUDF_POLARS_LOG_TRACES_MEMORY=0 CUDF_POLARS_LOG_TRACES_DATAFRAMES=0`
<img width="1600" height="800" alt="tracing-overhead" src="https://github.com/user-attachments/assets/ccb4b454-233f-45d5-8a4e-36cb586e1ba0" />
The interesting parts are the large gaps between the "on" box and the "off" / "time-only" boxes, which I've highlighted. These indicate that the tracing overhead is relatively large with all the metrics turned on. The limited tracing that only measures durations doesn't have that same overhead: the "off" and "time-only" boxes overlap.
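One way to picture why duration-only tracing is cheap: reading a monotonic clock twice costs very little, whereas each memory or dataframe metric needs an extra query around every traced call. The sketch below illustrates that gating pattern. `traced_call` and `memory_probe` are hypothetical names standing in for whatever the real tracing code does with RMM and NVML; this is not the cudf-polars implementation.

```python
import time
from typing import Any, Callable, Optional


def traced_call(
    func: Callable[[], Any],
    memory_probe: Optional[Callable[[], int]] = None,
) -> tuple[Any, dict[str, Any]]:
    """Run ``func`` and return ``(result, trace_record)``.

    The duration is always recorded. The (potentially expensive) memory probe
    is only invoked when one is supplied, i.e. when memory metrics are enabled.
    """
    record: dict[str, Any] = {}
    if memory_probe is not None:
        record["memory_before"] = memory_probe()
    start = time.monotonic_ns()
    result = func()
    record["duration_ns"] = time.monotonic_ns() - start
    if memory_probe is not None:
        record["memory_after"] = memory_probe()
    return result, record
```

With `memory_probe=None` the record holds only `duration_ns`, which corresponds to the "time-only" configuration in the benchmark.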
---
A note on the implementation: I wasn't sure whether to make things opt-in or opt-out. Right now we have a mix (opt in to everything with `CUDF_POLARS_LOG_TRACES=1`, and opt out of specific metrics with `CUDF_POLARS_LOG_TRACES_MEMORY=0`). We could easily make it opt-in to specific metrics (e.g. `CUDF_POLARS_LOG_TRACES_MEMORY=1` would enable just memory, `CUDF_POLARS_LOG_TRACES_DATAFRAMES=1` would enable just dataframe tracing). Neither option seemed obviously better to me.
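For comparison, the purely opt-in alternative mentioned above could look roughly like the following sketch. The function name `opt_in_config` is hypothetical, and the choice that enabling any single metric also turns on basic tracing is an assumption of this sketch, not something the PR specifies.

```python
from collections.abc import Mapping


def opt_in_config(environ: Mapping[str, str]) -> dict[str, bool]:
    """Hypothetical opt-in scheme: each metric is off unless explicitly set to 1."""
    memory = environ.get("CUDF_POLARS_LOG_TRACES_MEMORY", "0") == "1"
    dataframes = environ.get("CUDF_POLARS_LOG_TRACES_DATAFRAMES", "0") == "1"
    # Assumption: requesting any metric implies that tracing itself is on.
    traces = environ.get("CUDF_POLARS_LOG_TRACES", "0") == "1" or memory or dataframes
    return {"traces": traces, "memory": memory, "dataframes": dataframes}
```

In this variant `CUDF_POLARS_LOG_TRACES=1` on its own would yield duration-only traces, whereas the scheme adopted in this PR treats it as "enable everything".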
Authors:
- Tom Augspurger (https://github.com/TomAugspurger)
Approvers:
- Matthew Murray (https://github.com/Matt711)
URL: #20223
1 parent deda9cb commit c0d228e
File tree
2 files changed, +75 -32 lines changed
- docs/cudf/source/cudf_polars
- python/cudf_polars/cudf_polars/dsl