[SPARK-56286][PYTHON] Add DataFrame.dataQuality API for column profiling #55095
sougata99 wants to merge 3 commits into apache:master
Conversation
allisonwang-db left a comment:
Thanks for the contribution! I have concerns about whether `dataQuality()` belongs as a built-in DataFrame API in PySpark. It overlaps significantly with the existing `describe()` and `summary()` APIs, which already provide count, mean, stddev, min, and max. I'd suggest discussing the design on the dev mailing list before proceeding. cc @HyukjinKwon
Thanks for the feedback @allisonwang-db. That makes sense. My intention was to provide a more data-quality-focused profiling API that includes metrics such as null counts, null ratios, distinct counts, mode, median, and a dataset-level summary row, which are not directly available from `describe()` or `summary()`. That said, I agree this is a public API design question and should be discussed more broadly first. I'm happy to start a thread on the dev mailing list to gather feedback on whether this should be a new DataFrame API, an extension of an existing API, or something else.
What changes were proposed in this pull request?

This PR adds a new PySpark `DataFrame.dataQuality()` API for exploratory dataset profiling.

The new method returns a DataFrame with one row per input column and one synthetic `__dataset__` row for overall dataset-level completeness metrics. The output includes `row_count`, `column_count`, `total_cells`, `non_null_count`, `null_count`, `null_ratio`, `distinct_count`, `min`, `max`, and `mode`. For numeric columns, it also includes `mean`, `stddev`, and `median`.

This PR also adds PySpark unit test coverage for the new API, including null handling, `NaN` handling for floating-point columns, numeric profiling, categorical mode, and overall dataset metrics.

Why are the changes needed?
PySpark currently provides `describe()` and `summary()`, but there is no built-in API focused on practical data quality profiling.

A common early step in exploratory analysis is understanding the completeness and basic quality characteristics of a dataset, such as null distribution, distinct values, central tendency, and per-column summary metrics. Today, users typically have to compose several custom aggregations to gather this information. This change makes that workflow easier and more discoverable through a single DataFrame API.
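To illustrate the workaround described above, here is a hedged sketch of the per-column metrics users currently assemble by hand. It is written in plain Python over a list of dict rows (not the PR's implementation, and no Spark session required); in PySpark this would correspond to several `F.count` / `F.countDistinct`-style aggregations:

```python
# Illustrative sketch only: models the per-column quality metrics named in
# this PR over plain Python rows (list of dicts), not the actual PR code.
from collections import Counter
from statistics import mean, median, stdev

def profile_column(rows, col):
    """Compute basic data-quality metrics for one column of `rows`."""
    values = [r.get(col) for r in rows]
    non_null = [v for v in values if v is not None]
    row_count = len(values)
    metrics = {
        "column": col,
        "non_null_count": len(non_null),
        "null_count": row_count - len(non_null),
        "null_ratio": (row_count - len(non_null)) / row_count if row_count else 0.0,
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
        "mode": Counter(non_null).most_common(1)[0][0] if non_null else None,
    }
    # Numeric-only metrics, as described in the PR summary.
    if len(non_null) > 1 and all(isinstance(v, (int, float)) for v in non_null):
        metrics["mean"] = mean(non_null)
        metrics["stddev"] = stdev(non_null)  # sample stddev, like Spark's stddev
        metrics["median"] = median(non_null)
    return metrics

rows = [{"age": 30}, {"age": 40}, {"age": None}, {"age": 40}]
print(profile_column(rows, "age"))
```

Composing and maintaining this by hand for every column is the friction the proposed API aims to remove.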
Does this PR introduce any user-facing change?
Yes.
This PR introduces a new PySpark API: `DataFrame.dataQuality()`.
Example:
This allows users to retrieve column-level and dataset-level quality metrics directly from a DataFrame without composing multiple manual aggregations.
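To make the dataset-level part of the output concrete, here is a hedged sketch of how the synthetic `__dataset__` summary row described above could be computed, again modeled in plain Python over dict rows (the field names follow the PR summary; the function itself is illustrative, not the PR's code):

```python
# Illustrative sketch only: the dataset-level "__dataset__" completeness row
# described in this PR, computed over plain Python rows (list of dicts).
def dataset_summary(rows, columns):
    """Overall completeness metrics across all cells of the dataset."""
    row_count = len(rows)
    column_count = len(columns)
    total_cells = row_count * column_count
    non_null = sum(1 for r in rows for c in columns if r.get(c) is not None)
    return {
        "column": "__dataset__",
        "row_count": row_count,
        "column_count": column_count,
        "total_cells": total_cells,
        "non_null_count": non_null,
        "null_count": total_cells - non_null,
        "null_ratio": (total_cells - non_null) / total_cells if total_cells else 0.0,
    }

rows = [{"name": "a", "age": 30}, {"name": None, "age": 40}]
print(dataset_summary(rows, ["name", "age"]))
```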
How was this patch tested?
Added a new PySpark test in `python/pyspark/sql/tests/test_dataframe.py`. The test covers:

- null handling
- `NaN` handling for floating-point columns
- numeric profiling
- categorical mode
- overall dataset metrics

I also verified that the modified Python files compile successfully with:
Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenAI Codex (GPT-5)