[SPARK-56310][PYTHON] Handle pandas 3 dtype in DataFrame.toPandas by ueshin · Pull Request #55118 · apache/spark

ueshin · 2026-03-31T19:02:51Z

What changes were proposed in this pull request?

This PR updates the dtype correction in PySpark's DataFrame.toPandas() for pandas 3.x.

In python/pyspark/sql/pandas/types.py, StringType is now mapped to pd.StringDtype(na_value=np.nan) when running with pandas 3.x, instead of leaving the column as object. The TimestampType conversion path is also adjusted: when running with pandas 3.x (and only then), the series is cast back to the expected pandas dtype after timezone normalization.
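For illustration, the version-gated string mapping described above could look roughly like the sketch below. The helper name `string_dtype_for_installed_pandas` is hypothetical and is not taken from the patch; the only details from this PR are the `pd.StringDtype(na_value=np.nan)` target for pandas 3.x and the `object` fallback for earlier versions.

```python
import numpy as np
import pandas as pd


def string_dtype_for_installed_pandas():
    """Hypothetical helper (not from the patch): choose the pandas dtype
    for a Spark StringType column based on the installed pandas version."""
    major = int(pd.__version__.split(".")[0])
    if major >= 3:
        # pandas 3.x: NaN-backed string extension dtype, as described in this PR.
        return pd.StringDtype(na_value=np.nan)
    # Earlier pandas versions: keep string columns as object.
    return np.dtype(object)


# Usage: cast a column containing a missing value to the chosen dtype.
series = pd.Series(["a", None]).astype(string_dtype_for_installed_pandas())
print(series.dtype, series.isna().tolist())
```

The dtype is constructed lazily inside the version check so that the sketch does not call `pd.StringDtype(na_value=...)` on a pandas version that may not accept that argument.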

The related assertions in python/pyspark/sql/tests/test_collection.py are updated to check pandas-version-specific dtypes for string, datetime, and timedelta columns, and the Arrow on/off loops now use subTest(...) for clearer failures.
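As a rough sketch of the test pattern described above (not the actual contents of test_collection.py), version-specific expectations can be combined with `subTest(...)` so that a failure names the Arrow setting that produced it. Both `expected_dtypes` and `observed_dtypes` are hypothetical; `observed_dtypes` is a stand-in that returns fixed values instead of calling Spark, and the string entries are descriptive labels rather than exact dtype reprs.

```python
import io
import unittest


def expected_dtypes(pandas_version: str) -> dict:
    """Hypothetical helper: dtype expectations keyed by pandas major version."""
    major = int(pandas_version.split(".")[0])
    if major >= 3:
        # pandas 3: string extension dtype and microsecond resolution.
        return {
            "string": "StringDtype(na_value=np.nan)",  # label, not a repr
            "timestamp": "datetime64[us]",
            "timedelta": "timedelta64[us]",
        }
    # Before pandas 3: object strings and nanosecond resolution.
    return {
        "string": "object",
        "timestamp": "datetime64[ns]",
        "timedelta": "timedelta64[ns]",
    }


def observed_dtypes(arrow_enabled: bool) -> dict:
    """Stand-in for the dtypes of df.toPandas(); no Spark is involved.
    It pretends pandas 3 is installed and that both paths agree."""
    return {
        "string": "StringDtype(na_value=np.nan)",
        "timestamp": "datetime64[us]",
        "timedelta": "timedelta64[us]",
    }


class ToPandasDtypeSketch(unittest.TestCase):
    pandas_version = "3.0.0"  # assumed for this sketch

    def test_dtypes(self):
        expected = expected_dtypes(self.pandas_version)
        for arrow_enabled in [True, False]:
            # Each iteration is reported separately on failure.
            with self.subTest(arrow_enabled=arrow_enabled):
                self.assertEqual(observed_dtypes(arrow_enabled), expected)


def run_sketch() -> unittest.TestResult:
    """Run the sketch test case quietly and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ToPandasDtypeSketch)
    return unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
```

Without `subTest`, the first failing iteration would abort the loop and the report would not say which Arrow setting was active.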

Since the pandas 3 string dtype changes also affect downstream restoration behavior, python/pyspark/pandas/data_type_ops/string_ops.py now restores missing string values as None before casting back to a non-string dtype. The Spark Connect coverage in python/pyspark/sql/tests/connect/test_connect_dataframe_property.py is also updated to reflect the pandas 3 string dtype expectation.
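The restoration step can be pictured with the minimal sketch below. It is an assumption about the general idea, not the code in string_ops.py: the function name `restore_missing_as_none` is hypothetical, and the list comprehension is chosen for clarity rather than speed.

```python
import pandas as pd


def restore_missing_as_none(series: pd.Series) -> pd.Series:
    """Hypothetical sketch: replace missing markers (NaN or pd.NA) with None
    and return an object-dtype series, ready for a later non-string cast."""
    values = [None if pd.isna(value) else value for value in series]
    return pd.Series(values, index=series.index, dtype=object, name=series.name)


# Usage: a string-dtype column with a missing value becomes object with None.
strings = pd.Series(["a", None, "c"], dtype="string")
restored = restore_missing_as_none(strings)
print(restored.dtype, restored.tolist())
```

Passing `dtype=object` explicitly keeps pandas from re-inferring a string dtype for the rebuilt series.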

Why are the changes needed?

pandas 3 changes dtype behavior for strings and datetime-related values compared to earlier pandas versions. The existing toPandas() logic and related tests still assume object-dtype string columns and the older datetime/timedelta dtypes in places where pandas 3 now returns string extension dtypes and microsecond-resolution timestamp/timedelta dtypes.

Without these changes, DataFrame.toPandas() does not preserve pandas 3 string dtype behavior correctly, and some tests and pandas-on-Spark restoration paths still assume the pre-pandas-3 representation.

Does this PR introduce any user-facing change?

Yes.

With pandas 3.x, DataFrame.toPandas() can now return Spark string columns as pandas StringDtype(na_value=np.nan) instead of object, and timestamp/timedelta columns follow the pandas 3 dtype expectations more consistently after conversion.

This is a user-facing behavior change relative to released versions, which still return the older dtypes.

How was this patch tested?

Updated the related tests.

Was this patch authored or co-authored using generative AI tooling?

No.

ueshin · 2026-03-31T19:03:03Z

cc @gaogaotiantian @HyukjinKwon @zhengruifeng

ueshin · 2026-04-01T18:12:56Z

Thanks! merging to master.

Handle pandas 3 string dtype in DataFrame.toPandas

b17a539

zhengruifeng approved these changes Apr 1, 2026


ueshin closed this in fec2804 Apr 1, 2026

ueshin wants to merge 1 commit into apache:master from ueshin:issues/SPARK-56310/dtypes
