Avoid repeated EmitTo::First in partial hash aggregate output#23250
Conversation
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing codex/emit-first-partial-investigation (eedbae7) to 01bf68c (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing codex/emit-first-partial-investigation (eedbae7) to 01bf68c (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing codex/emit-first-partial-investigation (eedbae7) to 01bf68c (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
@2010YOUY01 @Rachelint The result looks good. Can you help take a look when you have time? |
Which issue does this PR close?
Rationale for this change
The migrated partial hash aggregate output path still used
EmitTo::First(batch_size)when draining grouped aggregate state in batches.For terminal output this is unnecessary and can be expensive:
EmitTo::Firstis not just slicing the first N rows, it also shifts remaining group indexes and maintainsGroupValueslookup state. For high-cardinality partial aggregate output, this can cause repeated work during output draining.The final hash aggregate path already avoids this by materializing output once with
EmitTo::Alland then slicing the resultingRecordBatch. This PR applies the same approach to partial hash aggregate output.What changes are included in this PR?
EmitTo::First(batch_size)for hash aggregate terminal output.EmitTo::AllRecordBatchintobatch_sizechunks across output pollsGroupsAccumulatorthat fails if partial terminal output callsEmitTo::First(_).Are these changes tested?
Yes.
Local targeted tests:
Additional local verification run during development:
The new regression test was also applied to the pre-fix baseline and failed with the expected internal error when the partial output path used
EmitTo::First.Local benchmark evidence was collected against the implementation commit before the final test/naming polish commit.
ClickBench full 43-query run, 5 iterations, 24 cores skip partial aggregation probe ratio
0.8:Largest patched/current wins included:
q33: 32961.02ms -> 1642.08ms
q34: 32739.34ms -> 1635.07ms
q18: 25673.25ms -> 1767.25ms
q16: 5949.82ms -> 810.17ms
q17: 5906.51ms -> 807.10ms
TPC-DS SF10 full99, 10 rounds:
Failures: 0
Aggregate geomean current/main: 0.982817
Aggregate current speedup: 1.748%
Are there any user-facing changes?
No. This is an internal physical execution change for hash aggregate output draining. There are no public API or documented behavior changes.