Describe the bug
While comparing query latencies between DataFusion and Trino on a prod workload, I noticed a significant gap for a particular HashJoin pattern. The query is a straightforward inner join with:
- A small build side (~32K rows, ~415 distinct string keys)
- A large probe side ~2.3M rows -- all (or most of them) carrying the same key, so a single partition does all the join work
- High fanout: each matched probe row joins to ~78 build rows → ~176M output pairs (avg_fanout ~7800%)
- String join keys (~26 chars)
- Partitioned mode (if build side has no row statistics, the planner cannot prove it is small and repartitions both sides by key)
Observed latency:
DataFusion: join_time ≈ 6s -- query time ≈ 7.5s
Trino under the same query conditions (hash partitioned join, same level of skew): ≈ 3.4s -- query time ≈ 4.5s
To Reproduce
Left a benchmark repro here #23257 (hj.rs new Q23)
- profile of Q23 main -- SF100
- profile of Q23 SF100 on the fix attempt (avoiding the per-pair key recheck on collision-free build sides and O(matched_pairs) Arrow allocations on collision-free build sides)
Looks like the hot path is in equal_rows_arr. This operation is expensive because it calls take() to extract the build and probe rows into new arrays, then eq_dyn_null() to compare them, then builds a boolean filter -- allocating for each step. Additionally, equal_rows_arr does a full key comparison for every matched pair to check for hash collisions, which is wasteful when the build side has no real collisions and many probe rows carry the same key -- all comparisons return true, so you are paying for checks that will never reject a pair.
Note that for Q23 in particular, the issue is more prominent because aside from the large fanout (which implies calling equal_rows_arr more times -- once per output batch of 8192 pairs), most probe rows hash to the same key -- so one partition ends up doing nearly all the join work.
Expected behavior
No response
Additional context
No response
Describe the bug
While comparing query latencies between DataFusion and Trino on a prod workload, I noticed a significant gap for a particular HashJoin pattern. The query is a straightforward inner join with:
Observed latency:
DataFusion:
join_time≈ 6s -- query time ≈ 7.5sTrino under the same query conditions (hash partitioned join, same level of skew): ≈ 3.4s -- query time ≈ 4.5s
To Reproduce
Left a benchmark repro here #23257 (hj.rs new Q23)
Looks like the hot path is in
equal_rows_arr. This operation is expensive because it callstake()to extract the build and probe rows into new arrays, theneq_dyn_null()to compare them, then builds a boolean filter -- allocating for each step. Additionally,equal_rows_arrdoes a full key comparison for every matched pair to check for hash collisions, which is wasteful when the build side has no real collisions and many probe rows carry the same key -- all comparisons return true, so you are paying for checks that will never reject a pair.Note that for Q23 in particular, the issue is more prominent because aside from the large fanout (which implies calling
equal_rows_arrmore times -- once per output batch of 8192 pairs), most probe rows hash to the same key -- so one partition ends up doing nearly all the join work.Expected behavior
No response
Additional context
No response