
[SPARK-56310][PYTHON] Handle pandas 3 dtype in DataFrame.toPandas #55118

Closed
ueshin wants to merge 1 commit into apache:master from ueshin:issues/SPARK-56310/dtypes

Member

ueshin commented Mar 31, 2026

What changes were proposed in this pull request?

This PR updates the dtype correction in PySpark's DataFrame.toPandas() for pandas 3.x.

In python/pyspark/sql/pandas/types.py, StringType is mapped to pd.StringDtype(na_value=np.nan) when running with pandas 3.x instead of leaving the column as object. The TimestampType conversion path is also adjusted so that after timezone normalization the series is cast back to the expected pandas dtype only for pandas 3.x.
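The version-gated string mapping described above can be sketched roughly as follows. This is only an illustration, not the code in python/pyspark/sql/pandas/types.py: the helper names `pandas_major_version` and `string_column_dtype` are hypothetical, and the version check is a simplified assumption.

```python
import numpy as np
import pandas as pd


def pandas_major_version() -> int:
    # Hypothetical helper: read the major version from pd.__version__.
    return int(pd.__version__.split(".")[0])


def string_column_dtype():
    """Pick the pandas dtype for a Spark StringType column (illustrative only)."""
    if pandas_major_version() >= 3:
        # NaN-backed string extension dtype, as named in the PR description.
        return pd.StringDtype(na_value=np.nan)
    # Earlier pandas versions leave strings in a plain object column.
    return np.dtype("object")


# A collected column with a missing value, before dtype correction.
values = pd.Series(["a", None, "c"], dtype=object)
converted = values.astype(string_column_dtype())
print(converted.dtype)
```

With this shape, code running on older pandas keeps the object column unchanged, while pandas 3.x gets the string extension dtype.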

The related assertions in python/pyspark/sql/tests/test_collection.py are updated to check pandas-version-specific dtypes for string, datetime, and timedelta columns, and the Arrow on/off loops now use subTest(...) for clearer failures.
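As a rough illustration of why subTest(...) gives clearer failures in an Arrow on/off loop, the sketch below uses a hypothetical stand-in function, `collect_dtypes`, in place of a real Spark session. It is not taken from test_collection.py.

```python
import unittest


def collect_dtypes(arrow_enabled: bool) -> dict:
    # Hypothetical stand-in for calling DataFrame.toPandas() with Arrow
    # enabled or disabled and reading back the resulting dtypes.
    return {"name": "str", "arrow_enabled": arrow_enabled}


class DtypeLoopTest(unittest.TestCase):
    def test_dtypes(self):
        for arrow_enabled in [True, False]:
            # Each iteration is reported separately, so a failure names
            # the Arrow setting that triggered it.
            with self.subTest(arrow_enabled=arrow_enabled):
                dtypes = collect_dtypes(arrow_enabled)
                self.assertEqual(dtypes["name"], "str")


def run_suite() -> unittest.TestResult:
    # Run the test case programmatically and return the result object.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DtypeLoopTest)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Without subTest, the first failing iteration would abort the loop and hide whether the other setting also fails.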

Since the pandas 3 string dtype changes also affect downstream restoration behavior, python/pyspark/pandas/data_type_ops/string_ops.py now restores missing string values as None before casting back to a non-string dtype. The Spark Connect coverage in python/pyspark/sql/tests/connect/test_connect_dataframe_property.py is also updated to reflect the pandas 3 string dtype expectation.
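The idea of restoring missing string values as None can be sketched as below. This is a minimal illustration under assumptions, not the logic in string_ops.py: the function name `restore_missing_as_none` is hypothetical, and the real code may use a different mechanism.

```python
import numpy as np
import pandas as pd


def restore_missing_as_none(series: pd.Series) -> pd.Series:
    """Return an object-dtype copy where every missing value is None."""
    # pd.isna covers both NaN and pd.NA, so either string representation works.
    restored = [None if pd.isna(value) else value for value in series]
    return pd.Series(restored, index=series.index, name=series.name, dtype=object)


strings = pd.Series(["1", np.nan, "3"], dtype=object)
restored = restore_missing_as_none(strings)
print(restored.tolist())
```

Normalizing to None first means a later cast to a non-string dtype sees a single, predictable missing-value marker.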

Why are the changes needed?

pandas 3 changes dtype behavior for strings and datetime-related values compared to earlier pandas versions. The existing toPandas() logic and related tests still assume object-dtype string columns and the older datetime/timedelta dtypes in places where pandas 3 now returns string extension dtypes and microsecond-resolution timestamp/timedelta dtypes.
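One way to see the resolution difference locally is to inspect the time unit of a datetime or timedelta column. The sketch below is only an inspection aid, and the helper name `time_unit` is hypothetical. The printed unit depends on the installed pandas version, so no output is shown.

```python
import datetime

import numpy as np
import pandas as pd


def time_unit(series: pd.Series) -> str:
    """Return the unit (for example "ns" or "us") of a tz-naive time column."""
    # np.datetime_data accepts datetime64 and timedelta64 dtypes.
    unit, _count = np.datetime_data(series.dtype)
    return unit


timestamps = pd.Series([datetime.datetime(2026, 3, 31, 12, 0, 0)])
durations = pd.Series([datetime.timedelta(seconds=1)])

print(pd.__version__, time_unit(timestamps), time_unit(durations))
```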

Without these changes, DataFrame.toPandas() does not preserve pandas 3 string dtype behavior correctly, and some tests and pandas-on-Spark restoration paths still assume the pre-pandas-3 representation.

Does this PR introduce any user-facing change?

Yes.

With pandas 3.x, DataFrame.toPandas() can now return Spark string columns as pandas StringDtype(na_value=np.nan) instead of object, and timestamp/timedelta columns follow the pandas 3 dtype expectations more consistently after conversion.

This is a user-facing behavior change compared to released versions, which still return the older dtypes.

How was this patch tested?

Updated the related tests.

Was this patch authored or co-authored using generative AI tooling?

No.

Member Author

ueshin commented Mar 31, 2026

Member Author

ueshin commented Apr 1, 2026

Thanks! merging to master.

ueshin closed this in fec2804 on Apr 1, 2026
