diff --git a/docs/features/profile_values.md b/docs/features/profile_values.md
index b83e83e89..affdb8c42 100644
--- a/docs/features/profile_values.md
+++ b/docs/features/profile_values.md
@@ -1,7 +1,43 @@
 # Accessing profile files
 
-## Json output structure
+ydata-profiling allows you to access and export the computed profile data
+programmatically, beyond just the HTML report.
+
+## JSON output structure
+
+You can export the full profile as a JSON file:
+```python
+import pandas as pd
+from ydata_profiling import ProfileReport
+
+df = pd.read_csv("your_data.csv")
+profile = ProfileReport(df, title="My Report")
+profile.to_file("report.json")
+```
+
+The JSON output contains all computed statistics organized by variable name,
+including type, missing values, descriptive statistics, and correlations.
 
 ## Univariate variables statistics through description_set
 
+You can access per-variable statistics directly in Python via `description_set`:
+```python
+description = profile.get_description()
+# Access stats for a specific variable
+print(description.variables["your_column_name"])
+```
+
+This returns a dictionary of computed metrics for each variable — type,
+missing count, distinct count, mean, std, quantiles, and more.
+
 ## Correlation matrices through description_set
+
+Correlation matrices computed during profiling are also accessible:
+```python
+description = profile.get_description()
+# Pearson correlation matrix
+print(description.correlations["pearson"])
+```
+
+Available correlation keys depend on your configuration but typically include
+`pearson`, `spearman`, `kendall`, and `cramers`.
\ No newline at end of file
diff --git a/docs/getting-started/concepts.md b/docs/getting-started/concepts.md
index aa38fcfba..aee79b48d 100644
--- a/docs/getting-started/concepts.md
+++ b/docs/getting-started/concepts.md
@@ -62,7 +62,21 @@ This section provides a comprehensive overview of individual variables within a
 as it automatically calculated detailed statistics, visualizations, and insights for each variable in the dataset.
 It offers information such as data type, missing values, unique values, basic descriptive statistics , histogram plots, and distribution plots.
 This allows data analysts and scientists to quickly understand the characteristics of each variable, identify potential data quality issues, and gain initial insights into the data's distribution and variability.
-For more details about the different metrics and visualizations check the Univariate section details page.
+
+**Univariate analysis** examines each variable individually. For every column in your dataset, ydata-profiling automatically computes:
+
+- **Descriptive statistics** — count, mean, median, standard deviation, min/max
+- **Missing values** — count and percentage of null entries
+- **Unique values** — number and percentage of distinct values
+- **Distribution plots** — histogram and density curve
+- **Data type** — inferred type (Numerical, Categorical, Date, etc.)
+
+**Multivariate analysis** examines relationships between variables. ydata-profiling computes:
+
+- **Correlations** — Pearson, Spearman, Kendall, and Cramér's V matrices
+- **Interactions** — pairwise scatter plots between numerical variables
+- **Missing data patterns** — which variables tend to be missing together
+- **Duplicate rows** — detection of identical records across the dataset
 
 ## Multivariate profiling
 
diff --git a/src/ydata_profiling/profile_report.py b/src/ydata_profiling/profile_report.py
index a7d6d9134..7c16b3403 100644
--- a/src/ydata_profiling/profile_report.py
+++ b/src/ydata_profiling/profile_report.py
@@ -8,7 +8,7 @@
 
 with warnings.catch_warnings():
     warnings.simplefilter("ignore")
-    import pkg_resources
+    from importlib.metadata import version
 
 if not is_pyspark_installed():
     from typing import TypeVar
@@ -359,7 +359,7 @@ def to_file(self, output_file: Union[str, Path], silent: bool = True) -> None:
         """
         with warnings.catch_warnings():
             warnings.simplefilter("ignore")
-            pillow_version = pkg_resources.get_distribution("Pillow").version
+            pillow_version = version("Pillow")
             version_tuple = tuple(map(int, pillow_version.split(".")))
             if version_tuple < (9, 5, 0):
                 warnings.warn(
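
Review note on the `profile_report.py` hunk: the unchanged line `tuple(map(int, pillow_version.split(".")))` raises `ValueError` when any component is not purely numeric (for example a version string such as `10.0.0.dev0` or `9.5.0rc1`), and `importlib.metadata.version` raises `PackageNotFoundError` when the distribution is not installed. The sketch below shows one possible way to make the check tolerant of both cases. The helper names `parse_release` and `installed_release` are hypothetical and are not part of this PR or of ydata-profiling.

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def parse_release(version_string: str) -> Tuple[int, ...]:
    """Return the leading dotted-integer segment of a version string.

    Hypothetical helper: suffixes such as ".dev0" or "rc1" are ignored,
    so "10.0.0.dev0" yields (10, 0, 0).
    """
    match = re.match(r"\d+(?:\.\d+)*", version_string)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group(0).split("."))


def installed_release(dist_name: str) -> Optional[Tuple[int, ...]]:
    """Return the release tuple of an installed distribution, or None if absent."""
    try:
        return parse_release(version(dist_name))
    except PackageNotFoundError:
        return None
```

With helpers like these, the comparison in `to_file` could first check for `None` and then compare against `(9, 5, 0)` as before. This simplification ignores pre-release ordering and version epochs; if the `packaging` library is acceptable as a dependency, its version parsing would handle those cases.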