[SPARK-56257][PYTHON][CONNECT] Support DataFrame input for spark.read.json/csv/xml #55057

Closed
Yicong-Huang wants to merge 2 commits into apache:master from Yicong-Huang:SPARK-56257

@Yicong-Huang
Contributor

What changes were proposed in this pull request?

Allow spark.read.json(), spark.read.csv(), and spark.read.xml() to accept a DataFrame with a single string column as input. On Spark Connect, only JSON and CSV are supported in this PR; XML will be added in a follow-up PR after the Parse proto is extended.

Why are the changes needed?

Parsing in-memory text data into a structured DataFrame currently requires sc.parallelize(), which is unavailable on Spark Connect. Accepting a DataFrame of strings removes that dependency, and for JSON it is the inverse of DataFrame.toJSON().

Does this PR introduce any user-facing change?

Yes. spark.read.json(), csv(), and xml() now accept a single-string-column DataFrame as input.
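
A minimal sketch of how the change would look from user code, assuming the behavior described in this PR. The helper names (`to_json_lines`, `parse_with_dataframe_input`, `parse_with_rdd`) and the column name `value` are hypothetical and chosen for illustration only. Passing a DataFrame to `spark.read.json()` is the proposed behavior, not something confirmed for any released Spark version.

```python
import json


def to_json_lines(records):
    """Serialize dicts to one JSON document per string.

    This is the shape of data a single-string-column DataFrame would hold.
    """
    return [json.dumps(record) for record in records]


def parse_with_dataframe_input(spark, json_lines):
    """Proposed path: hand a single-string-column DataFrame to the reader.

    Assumption: spark.read.json() accepts a DataFrame, as proposed in this PR.
    `spark` is an existing SparkSession (classic or Connect).
    """
    # One row per JSON document; "value" is an illustrative column name.
    text_df = spark.createDataFrame([(line,) for line in json_lines], ["value"])
    return spark.read.json(text_df)


def parse_with_rdd(spark, json_lines):
    """Existing classic-only path: needs a SparkContext, so it cannot run on Connect."""
    rdd = spark.sparkContext.parallelize(json_lines)
    return spark.read.json(rdd)
```

With a live session, `parse_with_dataframe_input(spark, to_json_lines([{"name": "Alice", "age": 30}]))` would be expected to return a DataFrame with `name` and `age` columns if the proposed behavior is available. `parse_with_rdd` shows the workaround the description refers to, which depends on `sc.parallelize()`.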

How was this patch tested?

7 new tests: 4 classic (JSON, JSON+schema, CSV, XML) and 3 Connect (JSON, CSV, XML-unsupported).

Was this patch authored or co-authored using generative AI tooling?

No

@Yicong-Huang Yicong-Huang marked this pull request as draft March 27, 2026 10:15
@Yicong-Huang Yicong-Huang marked this pull request as ready for review March 27, 2026 10:42
@Yicong-Huang
Contributor Author

cc @HyukjinKwon

@Yicong-Huang Yicong-Huang marked this pull request as draft March 28, 2026 06:36
@Yicong-Huang
Contributor Author

Yicong-Huang commented Mar 30, 2026

Closing this POC PR. The work is split into individual PRs, one per JIRA sub-task under SPARK-55227.
