Skip to content

High HashJoin probe latency under high fanout and key skew (Partitioned mode) #23237

Description

@LiaCastaneda

Describe the bug

While comparing query latencies between DataFusion and Trino on a prod workload, I noticed a significant gap for a particular HashJoin pattern. The query is a straightforward inner join with:

  • A small build side (~32K rows, ~415 distinct string keys)
  • A large probe side ~2.3M rows -- all (or most of them) carrying the same key, so a single partition does all the join work
  • High fanout: each matched probe row joins to ~78 build rows → ~176M output pairs (avg_fanout ~7800%)
  • String join keys (~26 chars)
  • Partitioned mode (if build side has no row statistics, the planner cannot prove it is small and repartitions both sides by key)

Observed latency:
DataFusion: join_time ≈ 6s -- query time ≈ 7.5s
Trino under the same query conditions (hash partitioned join, same level of skew): ≈ 3.4s -- query time ≈ 4.5s

To Reproduce

Left a benchmark repro here #23257 (hj.rs new Q23)

  • profile of Q23 main -- SF100
  • profile of Q23 SF100 on the fix attempt (avoiding the per-pair key recheck on collision-free build sides and O(matched_pairs) Arrow allocations on collision-free build sides)
Image Image

Looks like the hot path is in equal_rows_arr. This operation is expensive because it calls take() to extract the build and probe rows into new arrays, then eq_dyn_null() to compare them, then builds a boolean filter -- allocating for each step. Additionally, equal_rows_arr does a full key comparison for every matched pair to check for hash collisions, which is wasteful when the build side has no real collisions and many probe rows carry the same key -- all comparisons return true, so you are paying for checks that will never reject a pair.

Note that for Q23 in particular, the issue is more prominent because aside from the large fanout (which implies calling equal_rows_arr more times -- once per output batch of 8192 pairs), most probe rows hash to the same key -- so one partition ends up doing nearly all the join work.

Expected behavior

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No fields configured for Bug.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions