[SPARK-56219][PS][FOLLOW-UP] Keep legacy groupby idxmax and idxmin skipna=False behavior for pandas 2 by ueshin · Pull Request #55121 · apache/spark

ueshin · 2026-03-31T22:15:50Z

What changes were proposed in this pull request?

This is a follow-up of #55021.

This PR updates pandas-on-Spark GroupBy.idxmax and GroupBy.idxmin for skipna=False to keep the legacy behavior for all pandas 2 versions.

With this change:

pandas < 3.0.0 keeps the legacy idxmax and idxmin result for skipna=False
pandas >= 3.0.0 keeps the existing error behavior for NA-containing input
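The version split above can be sketched as a small gate on the pandas major version. This is only an illustration of the rule as described here, not the code in the patch: `parse_major` and `idx_skipna_false_mode` are hypothetical names, and the string return values are placeholders for the two code paths.

```python
def parse_major(version: str) -> int:
    """Return the leading major component of a version string such as '2.3.1'."""
    return int(version.split(".")[0])


def idx_skipna_false_mode(pandas_version: str) -> str:
    """Pick the path for GroupBy.idxmax/idxmin(skipna=False) on NA-containing input.

    Hypothetical helper: the name and the return values are illustrative only.
    """
    if parse_major(pandas_version) < 3:
        # Every pandas 2 release shares one legacy result.
        return "legacy"
    # pandas 3 keeps the existing error behavior.
    return "raise"
```

Gating on the major version alone, rather than on 2.2 versus 2.3, is what keeps the result independent of the installed pandas 2 minor release.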

This PR also updates the related test in python/pyspark/pandas/tests/groupby/test_index.py to validate the pandas 2 behavior directly instead of relying on pandas 2.2 and 2.3 having the same result.

Why are the changes needed?

The previous fix split pandas 2.2 and pandas 2.3 behavior for GroupBy.idxmax(skipna=False) and GroupBy.idxmin(skipna=False) on NA-containing input.

For example:

import pandas as pd

pdf = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, None, 3, 4], "c": [4, 3, 2, 1]})
pdf.groupby(["a"]).idxmax(skipna=False).sort_index()

In pandas 2.2, this returns:

In pandas 2.3, this returns:

     b  c
a
1  NaN  0
2  3.0  2

In pandas 3, this raises ValueError.

Instead of matching the pandas 2.2 / 2.3 difference, this PR keeps the legacy pandas 2 behavior across all pandas 2 environments and continues to follow the pandas 3 behavior in pandas 3 environments.
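One way to read the pandas 2.3 table above is that a group containing an NA yields NaN, while any other group yields the index label of its maximum. The pure-Python sketch below reproduces that table under this assumption. It is a mental model only, not the pandas or pandas-on-Spark implementation, and `idxmax_skipna_false` is a hypothetical name.

```python
import math


def idxmax_skipna_false(index, groups, values):
    """Per group, return the index label of the maximum, or NaN if the group holds an NA.

    Pure-Python illustration; None stands in for a missing value.
    """
    by_group = {}
    for label, key, value in zip(index, groups, values):
        by_group.setdefault(key, []).append((label, value))

    result = {}
    for key in sorted(by_group):
        rows = by_group[key]
        if any(value is None for _, value in rows):
            # Assumed rule: a single NA makes the whole group's result NaN.
            result[key] = math.nan
        else:
            # max() returns the first row holding the largest value.
            result[key] = max(rows, key=lambda row: row[1])[0]
    return result


index = [0, 1, 2, 3]
a = [1, 1, 2, 2]
b_result = idxmax_skipna_false(index, a, [1, None, 3, 4])  # column "b"
c_result = idxmax_skipna_false(index, a, [4, 3, 2, 1])     # column "c"
```

Here `b_result` maps group 1 to NaN and group 2 to label 3, and `c_result` maps group 1 to label 0 and group 2 to label 2, matching the table.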

Does this PR introduce any user-facing change?

Yes.

In pandas-on-Spark with pandas 2.x, GroupBy.idxmax(skipna=False) and GroupBy.idxmin(skipna=False) on NA-containing groups now consistently return the legacy result instead of varying with the installed pandas 2 version.

For pandas 3, behavior is unchanged from the current implementation.

How was this patch tested?

Ran the related pandas-on-Spark regression test in three environments:

pandas 2.2: GroupbyIndexTests.test_idxmax_idxmin_skipna_false_with_na
pandas 2.3: GroupbyIndexTests.test_idxmax_idxmin_skipna_false_with_na
pandas 3.0: GroupbyIndexTests.test_idxmax_idxmin_skipna_false_with_na

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex (GPT-5)

ueshin · 2026-03-31T22:15:59Z

cc @gaogaotiantian @HyukjinKwon @zhengruifeng

Fix groupby idxmax and idxmin skipna=False for pandas 2.2

824c72c

Fix.

7e2dfa6

ueshin changed the title from "[SPARK-56219][PS][FOLLOW-UP] Fix groupby idxmax and idxmin skipna=False for pandas 2.2" to "[SPARK-56219][PS][FOLLOW-UP] Keep legacy groupby idxmax and idxmin skipna=False behavior for pandas 2" on Mar 31, 2026

zhengruifeng approved these changes Apr 1, 2026

ueshin wants to merge 2 commits into apache:master from ueshin:issues/SPARK-56219/pd2.2
