Background
GroupValuesColumn (the column-wise multi-column GROUP BY storage) provides type-specific specializations under multi_group_by/ so a wide GROUP BY can use the column-native + short-circuit fast path instead of falling back to the byte-encoded GroupValuesRows path. Today the type allow-list is partial: any column outside the supported set drags the entire grouping onto the slow path, even when every other column would have qualified for the fast one.
This EPIC tracks completing the GroupValuesColumn type coverage so that the row-encoded fallback is needed only as an explicit opt-in, not as a forced fallback for missing specializations.
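The all-or-nothing behavior described above can be illustrated with a minimal sketch. It does not use the real DataFusion or Arrow types: ToyType, GroupPath, choose_path, and the contents of this toy supported_type are hypothetical stand-ins chosen only to show why a single unsupported column moves the whole grouping to the row-encoded path.

```rust
// Toy stand-in for Arrow's DataType; the variants are illustrative only.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int32,
    Utf8,
    Float16,
}

// Which storage strategy a GROUP BY ends up on (hypothetical names).
#[derive(Debug, PartialEq)]
enum GroupPath {
    ColumnWise, // GroupValuesColumn-style fast path
    RowEncoded, // GroupValuesRows-style byte-encoded fallback
}

// Hypothetical allow-list: in this toy, only Int32 and Utf8 have a specialization.
fn supported_type(t: &ToyType) -> bool {
    matches!(t, ToyType::Int32 | ToyType::Utf8)
}

// The fast path requires every group column to be supported, so one
// unsupported column is enough to push the whole GROUP BY onto the fallback.
fn choose_path(group_by: &[ToyType]) -> GroupPath {
    if group_by.iter().all(supported_type) {
        GroupPath::ColumnWise
    } else {
        GroupPath::RowEncoded
    }
}
```

In this toy model, choose_path(&[ToyType::Int32, ToyType::Utf8]) selects GroupPath::ColumnWise, while adding ToyType::Float16 to the same list selects GroupPath::RowEncoded even though the other two columns qualify.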
Already supported (today on main)
Int8..Int64, UInt8..UInt64, Float32, Float64, Decimal128, Utf8, LargeUtf8, Utf8View, Binary, LargeBinary, BinaryView, Boolean, Date32, Date64, Time32(Second/Millisecond), Time64(Microsecond/Nanosecond)†, Timestamp(*).
† Time64 alignment between supported_type and the dispatcher is being fixed as part of PR #22706.
Tracking
Nested types
FixedSizeList<primitive>, List<T>, LargeList<T>, Struct<...> with recursive children. In flight via PR #22706, "feat(physical-plan): add GroupColumn support for FixedSizeList / List / LargeList / Struct in multi-column GROUP BY" (split into 5 stacked PRs per maintainer request).
Remaining primitives
Each item below blocks on the make_group_column factory and recursive supported_type landing in PR 1 of the #22706 sequence. After that, each is an independently mergeable PR.
FixedSizeBinary. Fixed-width bytes per row. Closest in shape to PrimitiveGroupValueBuilder but with a runtime-known fixed byte width. Likely the smallest new builder.
Float16. Arrow already has the primitive type; need explicit NaN handling in is_eq (match the Float32 / Float64 behavior in PrimitiveGroupValueBuilder).
Duration(TimeUnit). Same shape as Timestamp (four TimeUnit arms in the dispatcher), four DurationXxxType slot-ins.
Interval(IntervalUnit). Three variants (YearMonth = 4 bytes, DayTime = 8 bytes, MonthDayNano = 16 bytes), three separate dispatcher arms and three native widths.
Decimal256. arrow::array::types::Decimal256Type has Native = arrow_buffer::i256, a 32-byte struct rather than a Copy-cheap native scalar. Either relax the T: Copy requirement in PrimitiveGroupValueBuilder or add a sibling builder specialized to wide native types.
Dictionary<K, V>. Most involved. Need to decide:
  Utf8/Binary. Costlier in memory because each unique decoded value is materialized at intern time.
  K -> V mapping is asserted across batches). Cheaper but only safe if the dictionary is shared / known-stable, which is not guaranteed by Arrow at the schema level.
Related strategic direction (not blocked by this EPIC)
#22701 proposes a generic FallbackGroupColumn so any Arrow type can go through GroupValuesColumn with a type-erased Arrow comparator. If that lands, the items in this EPIC become opt-in fast-path specializations on top of the generic fallback rather than prerequisites for the column-wise path. The two directions are complementary.
Cross-cutting requirements
Every new builder added under this EPIC should follow the testing structure established by PR #22706:
equal_to for identical / different / null edges
take_n boundary cases (n=0, n=len, with-null prefix)
vectorized_* matches per-row reference
size() grows with appends
build on empty builder
supported_type ↔ make_group_column consistency fuzz
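As a rough illustration of how the checklist above maps onto concrete assertions, here is a self-contained sketch against a toy builder. ToyI32Builder and checklist_smoke_test are hypothetical and do not mirror the actual GroupColumn trait signatures or the tests in PR #22706. The sketch assumes that two nulls compare equal for grouping purposes, and it leaves out the supported_type ↔ make_group_column consistency fuzz because that needs the real factory.

```rust
// Hypothetical toy builder; NOT the DataFusion GroupColumn trait.
#[derive(Default)]
struct ToyI32Builder {
    values: Vec<Option<i32>>,
}

impl ToyI32Builder {
    fn append(&mut self, v: Option<i32>) {
        self.values.push(v);
    }

    // Group-by equality (assumption): two nulls compare equal, so they share a group.
    fn equal_to(&self, lhs_row: usize, input: &[Option<i32>], rhs_row: usize) -> bool {
        self.values[lhs_row] == input[rhs_row]
    }

    // Batched comparison; expected to agree with calling equal_to row by row.
    fn vectorized_equal_to(
        &self,
        lhs_rows: &[usize],
        input: &[Option<i32>],
        rhs_rows: &[usize],
    ) -> Vec<bool> {
        lhs_rows
            .iter()
            .zip(rhs_rows)
            .map(|(&l, &r)| self.equal_to(l, input, r))
            .collect()
    }

    // Remove and return the first n stored rows, keeping the remainder.
    fn take_n(&mut self, n: usize) -> Vec<Option<i32>> {
        self.values.drain(..n).collect()
    }

    // Approximate heap footprint in bytes, based on allocated capacity.
    fn size(&self) -> usize {
        self.values.capacity() * std::mem::size_of::<Option<i32>>()
    }

    fn build(self) -> Vec<Option<i32>> {
        self.values
    }
}

// Walks the checklist items once; a real suite would split these into separate tests.
fn checklist_smoke_test() {
    // build on empty builder
    assert!(ToyI32Builder::default().build().is_empty());

    let mut b = ToyI32Builder::default();
    let empty_size = b.size();
    for v in [None, Some(1), Some(2)] {
        b.append(v);
    }
    // size() grows with appends
    assert!(b.size() > empty_size);

    // equal_to for identical / different / null edges
    let input = [None, Some(1), Some(3)];
    assert!(b.equal_to(0, &input, 0)); // null vs null
    assert!(b.equal_to(1, &input, 1)); // identical values
    assert!(!b.equal_to(2, &input, 2)); // different values
    assert!(!b.equal_to(0, &input, 1)); // null vs non-null

    // vectorized_* matches per-row reference
    let rows = [0usize, 1, 2];
    let per_row: Vec<bool> = rows.iter().map(|&r| b.equal_to(r, &input, r)).collect();
    assert_eq!(b.vectorized_equal_to(&rows, &input, &rows), per_row);

    // take_n boundary cases: n=0, with-null prefix, n=len
    assert!(b.take_n(0).is_empty());
    assert_eq!(b.take_n(1), vec![None]);
    assert_eq!(b.take_n(2), vec![Some(1), Some(2)]);
    assert!(b.build().is_empty());
}
```

Checking the vectorized path against a per-row reference computed from the same builder keeps the sketch independent of any expected-value table, which is the same idea a fuzz-style test would scale up.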