High HashJoin probe latency under high fanout and key skew (Partitioned mode)

### Describe the bug

While comparing query latencies between DataFusion and Trino on a prod workload, I noticed a significant gap for a particular HashJoin pattern. The query is a straightforward inner join with:

- A small build side (~32K rows, ~415 distinct string keys)
- A large probe side ~2.3M rows -- all (or most of them) carrying the same key, so a single partition does all the join work
- High fanout: each matched probe row joins to ~78 build rows → ~176M output pairs (avg_fanout ~7800%)
- String join keys (~26 chars)
- Partitioned mode (if build side has no row statistics, the planner cannot prove it is small and repartitions both sides by key)

**Observed latency:**
DataFusion: `join_time` ≈ 6s -- query time ≈ 7.5s
Trino under the same query conditions (hash partitioned join, same level of skew): ≈ 3.4s -- query time ≈ 4.5s



### To Reproduce

Left a benchmark repro here https://github.com/apache/datafusion/pull/23257 (hj.rs new Q23)

- [profile of Q23 main -- SF100 ](https://share.firefox.dev/4atAYU9)
- [profile](https://share.firefox.dev/3SzQFTF) of Q23 SF100 on the [fix](https://github.com/apache/datafusion/pull/23209) attempt  (avoiding the per-pair key recheck on collision-free build sides and O(matched_pairs) Arrow allocations on collision-free build sides)

<img width="2826" height="408" alt="Image" src="https://github.com/user-attachments/assets/4526aee3-279c-46ea-8e32-321b1d42e061" />

<img width="2794" height="284" alt="Image" src="https://github.com/user-attachments/assets/28712f4d-9435-499d-b56a-3e2daed3b919" />

Looks like the hot path is in `equal_rows_arr`. This operation is expensive because it calls `take()` to extract the build and probe rows into new arrays, then `eq_dyn_null()` to compare them, then builds a boolean filter -- allocating for each step. Additionally, `equal_rows_arr` does a full key comparison for every matched pair to check for hash collisions, which is wasteful when the build side has no real collisions and many probe rows carry the same key -- all comparisons return true, so you are paying for checks that will never reject a pair.

Note that for Q23 in particular, the issue is more prominent because aside from the large fanout (which implies calling `equal_rows_arr` more times -- once per output batch of 8192 pairs), most probe rows hash to the same key -- so one partition ends up doing nearly all the join work.

### Expected behavior

_No response_

### Additional context

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

High HashJoin probe latency under high fanout and key skew (Partitioned mode) #23237

Describe the bug

To Reproduce

Expected behavior

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

High HashJoin probe latency under high fanout and key skew (Partitioned mode) #23237

Description

Describe the bug

To Reproduce

Expected behavior

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions