[SPARK-56219][PS][FOLLOW-UP] Keep legacy groupby idxmax and idxmin skipna=False behavior for pandas 2 #55121

Open
ueshin wants to merge 2 commits into apache:master from ueshin:issues/SPARK-56219/pd2.2

Member

ueshin commented Mar 31, 2026

What changes were proposed in this pull request?

This is a follow-up of #55021.

This PR updates pandas-on-Spark GroupBy.idxmax and GroupBy.idxmin for skipna=False to keep the legacy behavior for all pandas 2 versions.

With this change:

  • pandas < 3.0.0 keeps the legacy idxmax and idxmin result for skipna=False
  • pandas >= 3.0.0 keeps the existing error behavior for NA-containing input
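The version split in the two bullets above can be sketched as a small gate on the pandas major version. This is only an illustration: `skipna_false_mode` and its return labels are hypothetical names invented here, not the code in this PR, and the string parsing assumes a plain `major.minor.patch` version string.

```python
def skipna_false_mode(pandas_version: str) -> str:
    """Hypothetical helper: pick the skipna=False behavior for a pandas version.

    Returns "legacy" for pandas < 3.0.0 and "raise" for pandas >= 3.0.0,
    mirroring the split described in this PR. The labels are illustrative.
    """
    # Only the major component matters for the split described above.
    major = int(pandas_version.split(".")[0])
    return "legacy" if major < 3 else "raise"


if __name__ == "__main__":
    for version in ("2.2.3", "2.3.0", "3.0.0"):
        print(version, "->", skipna_false_mode(version))
```

Gating on the major version alone is what keeps pandas 2.2 and 2.3 on the same path in this sketch.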

This PR also updates the related test in python/pyspark/pandas/tests/groupby/test_index.py to validate the pandas 2 behavior directly instead of relying on pandas 2.2 and 2.3 having the same result.

Why are the changes needed?

The previous fix split pandas 2.2 and pandas 2.3 behavior for GroupBy.idxmax(skipna=False) and GroupBy.idxmin(skipna=False) on NA-containing input.

For example:

import pandas as pd

pdf = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, None, 3, 4], "c": [4, 3, 2, 1]})
pdf.groupby(["a"]).idxmax(skipna=False).sort_index()

In pandas 2.2, this returns:

   b  c
a
1  0  0
2  3  2

In pandas 2.3, this returns:

     b  c
a
1  NaN  0
2  3.0  2

In pandas 3, this raises ValueError.
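To check which of the three outcomes above a given environment produces, a small probe like the following may help. It is a sketch that runs the same example on plain pandas (not pandas-on-Spark); `observe_idxmax_skipna_false` is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd


def observe_idxmax_skipna_false():
    """Run the example from this description on the installed pandas.

    Returns the resulting DataFrame, or the string "ValueError" when the
    installed pandas raises instead of returning a result.
    """
    pdf = pd.DataFrame({"a": [1, 1, 2, 2], "b": [1, None, 3, 4], "c": [4, 3, 2, 1]})
    try:
        return pdf.groupby(["a"]).idxmax(skipna=False).sort_index()
    except ValueError:
        return "ValueError"


if __name__ == "__main__":
    # Print the version alongside the outcome so runs can be compared.
    print("pandas", pd.__version__)
    print(observe_idxmax_skipna_false())
```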

Instead of matching the pandas 2.2 / 2.3 difference, this PR keeps the legacy pandas 2 behavior across all pandas 2 environments and continues to follow the pandas 3 behavior in pandas 3 environments.

Does this PR introduce any user-facing change?

Yes.

In pandas-on-Spark with pandas 2.x, GroupBy.idxmax(skipna=False) and GroupBy.idxmin(skipna=False) on NA-containing groups now consistently return the legacy result instead of varying with the installed pandas 2 version.

For pandas 3, behavior is unchanged from the current implementation.

How was this patch tested?

Ran the related pandas-on-Spark regression test in three environments:

  • pandas 2.2: GroupbyIndexTests.test_idxmax_idxmin_skipna_false_with_na
  • pandas 2.3: GroupbyIndexTests.test_idxmax_idxmin_skipna_false_with_na
  • pandas 3.0: GroupbyIndexTests.test_idxmax_idxmin_skipna_false_with_na

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Codex (GPT-5)

Member Author

ueshin commented Mar 31, 2026

ueshin changed the title from "[SPARK-56219][PS][FOLLOW-UP] Fix groupby idxmax and idxmin skipna=False for pandas 2.2" to "[SPARK-56219][PS][FOLLOW-UP] Keep legacy groupby idxmax and idxmin skipna=False behavior for pandas 2" on Mar 31, 2026