
fix: array_compact handle edge case with NULLs #23192

Merged
comphead merged 3 commits into apache:main from comphead:array_compact
Jun 25, 2026

@comphead


Rationale for this change

array_compact(make_array(NULL, NULL, NULL)) returned [NULL, NULL, NULL] instead of an empty array.

Root cause: make_array(NULL, NULL, NULL) has type List(Null), whose inner values are an Arrow NullArray. NullArray::nulls() returns None (it has no validity buffer), so the default Array::is_null() returns false for every index, even though every element is logically null. The compaction loop saw "no nulls" and copied all elements through unchanged.
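
The mismatch between a physical validity mask and logical nullness can be illustrated with a small standalone model. This is only a sketch: `ToyArray` and its methods are hypothetical names invented for illustration and are not the arrow-rs API. They mimic the behavior described above, where an all-null array stores no validity mask at all.

```rust
/// Toy stand-in for an Arrow array. Hypothetical type, NOT the arrow-rs API.
enum ToyArray {
    /// Typed values with an optional validity mask (true = valid).
    Int64 { values: Vec<i64>, validity: Option<Vec<bool>> },
    /// Every element is null; no validity mask is stored at all.
    Null { len: usize },
}

impl ToyArray {
    /// Number of elements in the array.
    fn len(&self) -> usize {
        match self {
            ToyArray::Int64 { values, .. } => values.len(),
            ToyArray::Null { len } => *len,
        }
    }

    /// Physical validity mask, if one is stored.
    fn physical_validity(&self) -> Option<&[bool]> {
        match self {
            ToyArray::Int64 { validity, .. } => validity.as_deref(),
            ToyArray::Null { .. } => None,
        }
    }

    /// Null test based only on the physical mask: no mask means "not null".
    fn is_null_physical(&self, i: usize) -> bool {
        self.physical_validity().map_or(false, |v| !v[i])
    }

    /// Logical null mask (true = null), or None when nothing is null.
    fn logical_nulls(&self) -> Option<Vec<bool>> {
        match self {
            ToyArray::Int64 { validity, .. } => validity
                .as_ref()
                .map(|v| v.iter().map(|valid| !valid).collect()),
            ToyArray::Null { len } => {
                if *len > 0 {
                    Some(vec![true; *len])
                } else {
                    None
                }
            }
        }
    }
}
```

In this model, `ToyArray::Null { len: 3 }` reports no element as null through `is_null_physical`, while `logical_nulls` marks all three as null, which is the gap the fix closes.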

What changes are included in this PR?

  • In compact_list, resolve the values' null mask once via values.logical_nulls() and use that buffer for both the fast-path check and the per-element null test. This correctly treats NullArray (and any other type without a physical validity buffer) as all-null.
  • Added a sqllogictest covering the untyped-NULL case: select array_compact(make_array(NULL, NULL, NULL)), which is expected to return [].
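
The shape of the fix can be sketched without Arrow types. The function below is a simplified illustration, not the actual compact_list code: `compact_rows` is a hypothetical name, the values are plain `i64`, and the null mask is a `bool` slice (true = null) that the caller resolves once and passes in. It assumes `offsets` starts at 0 and has one more entry than there are rows.

```rust
/// Drops null elements from each row of a list column (illustrative only).
///
/// Row `r` spans `values[offsets[r]..offsets[r + 1]]`. `nulls` is the logical
/// null mask (true = null) resolved once by the caller; `None` means no nulls.
fn compact_rows(
    offsets: &[usize],
    values: &[i64],
    nulls: Option<&[bool]>,
) -> (Vec<usize>, Vec<i64>) {
    // Fast path: nothing is null, so the input is returned unchanged.
    let nulls = match nulls {
        Some(mask) => mask,
        None => return (offsets.to_vec(), values.to_vec()),
    };

    // The same mask drives both the capacity estimate and the per-element test.
    let null_count = nulls.iter().filter(|&&is_null| is_null).count();
    let mut out_values: Vec<i64> = Vec::with_capacity(values.len() - null_count);
    let mut out_offsets: Vec<usize> = Vec::with_capacity(offsets.len());
    out_offsets.push(0);

    for row in offsets.windows(2) {
        for i in row[0]..row[1] {
            if !nulls[i] {
                out_values.push(values[i]);
            }
        }
        // Each output row ends where the kept values currently end.
        out_offsets.push(out_values.len());
    }
    (out_offsets, out_values)
}
```

With an all-null mask, every row collapses to an empty span, which corresponds to the `[]` result the new test expects.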

Are these changes tested?

Yes — new test added in datafusion/sqllogictest/test_files/array/array_distinct.slt alongside the existing array_compact coverage. Existing array_compact tests continue to pass.

Are there any user-facing changes?

Yes — array_compact on a list of untyped NULLs now returns [] (matching the typed-NULL behavior and user expectation) instead of preserving the null elements. No API changes.

let mut offsets = Vec::<O>::with_capacity(list_array.len() + 1);
offsets.push(O::zero());
let capacity = original_data.len() - values_null_count;
let mut offsets = OffsetBufferBuilder::<O>::new(list_array.len());


I think typically we have found Vec to be faster than OffsetBufferBuilder -- is there a reason to switch?

@comphead comphead Jun 25, 2026


Thanks @alamb for catching this; I've put it back. I believe Claude took the pattern from other array functions that currently use OffsetBufferBuilder. I'll check whether those existing usages can also be replaced with Vec in a separate PR.


It turns out the Rust team has optimized Vec quite a lot, and it plays many tricks 👍

let values = list_array.values();
// Use logical nulls so element types without a validity buffer
// (e.g. NullArray) are still treated as null.
let values_null_count = values.logical_null_count();


I think this might call logical_nulls() under the covers -- it might make sense to just call logical_nulls() here and then use the result directly rather than having to call logical_nulls() again below

    let values_nulls = values
        .logical_nulls()
        .expect("non-zero logical_null_count implies logical_nulls is Some");

@comphead comphead added this pull request to the merge queue Jun 25, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 25, 2026
comphead added a commit to comphead/arrow-datafusion that referenced this pull request Jun 25, 2026
@comphead comphead added this pull request to the merge queue Jun 25, 2026
Merged via the queue into apache:main with commit 476a76d Jun 25, 2026
35 checks passed
@comphead comphead deleted the array_compact branch June 25, 2026 21:31
zzcclp pushed a commit to zzcclp/arrow-datafusion that referenced this pull request Jun 26, 2026
## Which issue does this PR close?


- Follow-up to apache#23192 (comment)

## Rationale for this change

`arrow::buffer::OffsetBufferBuilder` is a thin wrapper around `Vec<O>`
plus a `last_offset: usize` running counter; every `push_length(n)` does
a `checked_add` on `usize` and a `usize_as(O)` conversion. For per-row
loops with a known upfront row count, a direct `Vec<O>` that computes
each new offset as `offsets[row] + O::usize_as(len)` can save measurable
work, provided the offset push is a meaningful fraction of per-row cost.
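
The two patterns being compared can be sketched side by side with standard-library types only. `ToyOffsetBuilder` and `offsets_with_vec` are hypothetical names written for illustration, with `i32` standing in for the offset type `O`; this is a model of the description above, not the arrow-rs implementation, and it says nothing about which variant is faster in a given loop.

```rust
/// Toy model of a length-based offset builder. NOT arrow's `OffsetBufferBuilder`.
struct ToyOffsetBuilder {
    offsets: Vec<i32>,
    last_offset: usize,
}

impl ToyOffsetBuilder {
    /// Reserves room for `rows + 1` offsets and writes the leading zero.
    fn new(rows: usize) -> Self {
        let mut offsets = Vec::with_capacity(rows + 1);
        offsets.push(0);
        ToyOffsetBuilder { offsets, last_offset: 0 }
    }

    /// Each push does a checked add on the running `usize` counter and then
    /// converts the counter to the offset type (a plain cast in this toy).
    fn push_length(&mut self, len: usize) {
        self.last_offset = self
            .last_offset
            .checked_add(len)
            .expect("offset overflow");
        self.offsets.push(self.last_offset as i32);
    }

    fn finish(self) -> Vec<i32> {
        self.offsets
    }
}

/// The plain-Vec variant: the next offset is the previous offset plus the
/// row length, with no separate running counter.
fn offsets_with_vec(lengths: &[usize]) -> Vec<i32> {
    let mut offsets = Vec::with_capacity(lengths.len() + 1);
    offsets.push(0i32);
    for (row, &len) in lengths.iter().enumerate() {
        let next = offsets[row] + len as i32;
        offsets.push(next);
    }
    offsets
}
```

Both variants produce the same offsets for the same row lengths; the difference is only in the per-row bookkeeping.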
  
I swapped the pattern in all eight `OffsetBufferBuilder` call sites in
the repo (`array_normalize`, `array_filter`, `remove`, `replace`,
`array_add`, `utils::general_array_zip_with`, `array_scale`,
`encoding::delegated_decode`), benchmarked the three sites that have
criterion benches, and found the win is **not** uniform.

## What changes are included in this PR?

Replace `OffsetBufferBuilder<O>` with `Vec<O>` (preinitialized with
`O::zero()` and finalized with `OffsetBuffer::new(v.into())`) **only**
in `datafusion/functions-nested/src/remove.rs`, where benches show
clean wins with no regressions.

The other seven sites are left on `OffsetBufferBuilder` — benches showed
flat-to-regressing results, see below.

## Are these changes tested?

Existing unit tests, doctests, and sqllogictests (`array_remove*`) pass
unchanged. No new tests — refactor is functionally equivalent.

## Are there any user-facing changes?

No.

## Benchmark results

The biggest win is `array_remove`.

### `array_remove`

  | Bench | size 10 | size 100 | size 500 |
  |---|---:|---:|---:|
  | `int64` | −0.2% | −1.0% | **−50.0%** |
  | `n_int64` | −0.7% | +0.05% | **−23.1%** |
  | `all_int64` | +0.2% | −1.8% | **−15.1%** |
  | `strings` | +3.8% | +0.6% | **−4.6%** |
  | `boolean` | −0.01% | +1.1% | +0.3% |
  | `fixed_size_binary` | −0.06% | **−20.0%** | **−2.6%** |
  | `int64_nested` | flat | flat | flat |


For the others, the differences look more like noise.
comphead added a commit that referenced this pull request Jun 29, 2026
…Ls (#23196)

## Which issue does this PR close?


- Closes #.

## Rationale for this change

Backport the `array_compact` fix for the edge case with NULLs. The bug
causes incorrect results and the fix has no breaking API changes, so it
should meet the backport criteria.


## What changes are included in this PR?


## Are these changes tested?


## Are there any user-facing changes?


Labels

functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)
