fix: array_compact handle edge case with NULLs#23192
Conversation
| let mut offsets = Vec::<O>::with_capacity(list_array.len() + 1); | ||
| offsets.push(O::zero()); | ||
| let capacity = original_data.len() - values_null_count; | ||
| let mut offsets = OffsetBufferBuilder::<O>::new(list_array.len()); |
There was a problem hiding this comment.
I think typically we have found Vec to be faster than OffsetBufferBuilder -- is there a reason to switch?
There was a problem hiding this comment.
Thanks @alamb for catching this, placed it back. I believe Claude took a pattern from other array functions, that currently use OffsetBufferBuilder. I'll check if those existing usages can also be replaced with Vec in separate PR
There was a problem hiding this comment.
It turns out the rust team has optimized Vec quite a lot and it plays many tricks 👍
| let values = list_array.values(); | ||
| // Use logical nulls so element types without a validity buffer | ||
| // (e.g. NullArray) are still treated as null. | ||
| let values_null_count = values.logical_null_count(); |
There was a problem hiding this comment.
I think this might call logical_nulls under the coverws -- it might make sense to just call logical_nulls()` here and then use that directly rather than having to call logical_nulls again below
let values_nulls = values
.logical_nulls()
.expect("non-zero logical_null_count implies logical_nulls is Some");## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Followup apache#23192 (comment) ## Rationale for this change `arrow::buffer::OffsetBufferBuilder` is a thin wrapper around `Vec<O>` plus a `last_offset: usize` running counter; every `push_length(n)` does a `checked_add` on `usize` and a `usize_as(O)` conversion. For per-row loops with a known upfront row count, a direct `Vec<O>` that stores the running offset via `offsets[row] + O::usize_as(len)` can save measurable work in tight per-row loops — provided the offset push is a meaningful fraction of per-row cost. I swapped the pattern in all eight `OffsetBufferBuilder` call sites in the repo (`array_normalize`, `array_filter`, `remove`, `replace`, `array_add`, `utils::general_array_zip_with`, `array_scale`, `encoding::delegated_decode`), benchmarked the three sites that have criterion benches, and found the win is **not** uniform. ## What changes are included in this PR? Replace `OffsetBufferBuilder<O>` with `Vec<O>` (preinitialized with `O::zero()` and finalized with `OffsetBuffer::new(v.into())`) **only** in `datafusion/functions-nested/src/remove.rs`, where benches show clean wins with no regressions. The other seven sites are left on `OffsetBufferBuilder` — benches showed flat-to-regressing results, see below. ## Are these changes tested? Existing unit tests, doctests, and sqllogictests (`array_remove*`) pass unchanged. No new tests — refactor is functionally equivalent. ## Are there any user-facing changes? No. ## Benchmark results The biggest win is `array_remove` ### `array_remove` | Bench | size 10 | size 100 | size 500 | |---|---:|---:|---:| | `int64` | −0.2% | −1.0% | **−50.0%** | | `n_int64` | −0.7% | +0.05% | **−23.1%** | | `all_int64` | +0.2% | −1.8% | **−15.1%** | | `strings` | +3.8% | +0.6% | **−4.6%** | | `boolean` | −0.01% | +1.1% | +0.3% | | `fixed_size_binary` | −0.06% | **−20.0%** | **−2.6%** | | `int64_nested` | flat | flat | flat | For others its more like noise
…Ls (#23196) ## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes #. ## Rationale for this change Backport `array_compact` handle edge case with NULLs. The issue causing correctness issues and has no breaking API, therefore should fall into backport criteria <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
Rationale for this change
array_compact(make_array(NULL, NULL, NULL))returned[NULL, NULL, NULL]instead of an empty array.Root cause:
make_array(NULL, NULL, NULL)has typeList(Null), whose inner values are an ArrowNullArray.NullArray::nulls()returnsNone(it has no validity buffer), so the defaultArray::is_null()returnsfalsefor every index — even though every element is logically null. The compaction loop saw "no nulls" and copied all elements through unchanged.What changes are included in this PR?
compact_list, resolve the values' null mask once viavalues.logical_nulls()and use that buffer for both the fast-path check and the per-element null test. This correctly treatsNullArray(and anyother type without a physical validity buffer) as all-null.
select array_compact(make_array(NULL, NULL, NULL))→[].Are these changes tested?
Yes — new test added in
datafusion/sqllogictest/test_files/array/array_distinct.sltalongside the existingarray_compactcoverage. Existingarray_compacttests continue to pass.Are there any user-facing changes?
Yes —
array_compacton a list of untyped NULLs now returns[](matching the typed-NULL behavior and user expectation) instead of preserving the null elements. No API changes.