[SPARK-56310][PYTHON] Handle pandas 3 dtype in DataFrame.toPandas#55118
Closed
ueshin wants to merge 1 commit into apache:master from
zhengruifeng approved these changes Apr 1, 2026
Thanks! Merging to master.
What changes were proposed in this pull request?
This PR updates PySpark DataFrame.toPandas() dtype correction for pandas 3.x.

In python/pyspark/sql/pandas/types.py, StringType is mapped to pd.StringDtype(na_value=np.nan) when running with pandas 3.x instead of leaving the column as object. The TimestampType conversion path is also adjusted so that, after timezone normalization, the series is cast back to the expected pandas dtype only for pandas 3.x.

The related assertions in python/pyspark/sql/tests/test_collection.py are updated to check pandas-version-specific dtypes for string, datetime, and timedelta columns, and the Arrow on/off loops now use subTest(...) for clearer failures.

Since the pandas 3 string dtype changes also affect downstream restoration behavior, python/pyspark/pandas/data_type_ops/string_ops.py now restores missing string values as None before casting back to a non-string dtype. The Spark Connect coverage in python/pyspark/sql/tests/connect/test_connect_dataframe_property.py is also updated to reflect the pandas 3 string dtype expectation.

Why are the changes needed?
pandas 3 changes dtype behavior for strings and datetime-related values compared to earlier pandas versions. The existing toPandas() logic and related tests still assume object string columns and older datetime/timedelta dtype expectations in places where pandas 3 now returns string extension dtypes and microsecond-resolution timestamp/timedelta dtypes.

Without these changes, DataFrame.toPandas() does not preserve pandas 3 string dtype behavior correctly, and some tests and pandas-on-Spark restoration paths still assume the pre-pandas-3 representation.

Does this PR introduce any user-facing change?
Yes.

With pandas 3.x, DataFrame.toPandas() can now return Spark string columns as pandas StringDtype(na_value=np.nan) instead of object, and timestamp/timedelta columns follow the pandas 3 dtype expectations more consistently after conversion.

This is a user-facing behavior change compared to released versions that still return the older dtype behavior.
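For illustration only, the version-dependent string mapping described above could be sketched roughly as follows. This is not the code in python/pyspark/sql/pandas/types.py; the helper names (pandas_major, expected_string_dtype_name, string_dtype_for_installed_pandas) are hypothetical and exist only for this sketch.

```python
def pandas_major(version: str) -> int:
    """Return the major component of a pandas version string such as '3.0.0'."""
    return int(version.split(".")[0])


def expected_string_dtype_name(version: str) -> str:
    """Describe the dtype a Spark string column is expected to have after toPandas().

    Hypothetical helper: it mirrors the rule stated in this PR
    (pandas 3.x -> StringDtype(na_value=np.nan), earlier -> object).
    """
    if pandas_major(version) >= 3:
        return "StringDtype(na_value=np.nan)"
    return "object"


def string_dtype_for_installed_pandas():
    """Build the actual dtype object for whichever pandas is installed.

    Imports are local so the helpers above stay usable without pandas.
    """
    import numpy as np
    import pandas as pd

    if pandas_major(pd.__version__) >= 3:
        # NaN-backed string extension dtype, as named in the PR description.
        return pd.StringDtype(na_value=np.nan)
    return np.dtype("object")
```

Code that previously checked `df.toPandas()["col"].dtype == object` would, under this rule, need a branch like the one above when it has to support both pandas 2.x and pandas 3.x.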
How was this patch tested?
Updated the related tests.
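As a rough sketch of the subTest(...) pattern mentioned for the Arrow on/off loops, the structure might look like the following. Everything here is a stand-in: fake_collect_dtypes is a hypothetical placeholder for a real toPandas() call, and the dtype names in EXPECTED_BY_MAJOR are illustrative assumptions rather than the exact assertions in test_collection.py.

```python
import unittest

# Illustrative expectations per pandas major version (assumed values).
EXPECTED_BY_MAJOR = {
    2: {"s": "object", "ts": "datetime64[ns]", "td": "timedelta64[ns]"},
    3: {"s": "str", "ts": "datetime64[us]", "td": "timedelta64[us]"},
}


def fake_collect_dtypes(arrow_enabled: bool, major: int) -> dict:
    """Hypothetical stand-in for collecting dtypes from df.toPandas().

    A real test would toggle Arrow on the Spark session; here the flag is
    accepted but ignored so the sketch runs without Spark.
    """
    return dict(EXPECTED_BY_MAJOR[major])


class ArrowToggleSketch(unittest.TestCase):
    def test_dtypes_with_arrow_on_and_off(self):
        for major in (2, 3):
            for arrow_enabled in (True, False):
                # Each combination is reported separately on failure.
                with self.subTest(major=major, arrow_enabled=arrow_enabled):
                    actual = fake_collect_dtypes(arrow_enabled, major)
                    self.assertEqual(actual, EXPECTED_BY_MAJOR[major])


def run_sketch() -> unittest.TestResult:
    """Run the sketch test case and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ArrowToggleSketch)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The benefit of subTest(...) is that a failure names the failing parameters (for example arrow_enabled=False) and the remaining combinations still run, instead of the loop stopping at the first failed assertion.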
Was this patch authored or co-authored using generative AI tooling?
No.