[BUG] Empty Effective DoF in summary() for overparametrized models #535

@hritikkumarpradhan

🔍 Issue Description

Fix empty Effective Degrees of Freedom (EDoF) in summary() when splines exceed samples (n_splines > n_samples).

📌 Issue Type

  • Bug

📝 Description

In pyGAM, when a model is fit with more splines/coefficients than the number of training samples (e.g., s(0, n_splines=n+1) where n=len(X)), the term-by-term Effective Degrees of Freedom (EDoF) is omitted from the summary() output. The table instead silently displays a blank string "" for the EDoF of all terms.

What is happening?

Within the summary method of pygam.py (around line 1753), a conditional checks whether the length of self.statistics_["edof_per_coef"] equals the length of the model coefficients (self.coef_). In an overparametrized setting (n_samples < n_coefs), edof_per_coef holds fewer values than there are coefficients, so the check fails. The fallback branch, marked with an inline # TODO bug comment, then assigns an empty string (edof = "") in place of the computed EDoF in the term summary.

What should happen instead?

The term summary should display a computed or approximated Effective Degrees of Freedom for each term even when n_samples < n_coefs. If no approximation is mathematically feasible, it should display NaN or a documented placeholder rather than a silent empty string, and emit a warning that the per-term values are unidentifiable because there are too few samples.
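The expected behavior can be sketched without pyGAM. This is an illustration only: `term_edof_or_nan` is a hypothetical helper, not a pyGAM function, and the plain lists stand in for `self.statistics_["edof_per_coef"]` and the coefficient indices of one term. The warning text is likewise an assumption.

```python
import math
import warnings


def term_edof_or_nan(edof_per_coef, n_coefs, idx):
    """Return the term EDoF, or NaN plus a warning when it cannot be computed.

    Hypothetical helper for illustration; not part of pyGAM.
    """
    if len(edof_per_coef) == n_coefs:
        # Lengths agree: sum the per-coefficient EDoF over this term's indices.
        return round(sum(edof_per_coef[i] for i in idx), 1)
    # Lengths differ: report NaN and tell the user why.
    warnings.warn(
        "Per-term EDoF is unavailable: got "
        f"{len(edof_per_coef)} EDoF values for {n_coefs} coefficients "
        "(likely n_samples < n_coefs). Reporting NaN.",
        UserWarning,
        stacklevel=2,
    )
    return math.nan


# Overparametrized case: four coefficients, but only three EDoF values.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    edof = term_edof_or_nan([0.9, 0.8, 0.7], n_coefs=4, idx=[0, 1, 2, 3])
```

Unlike an empty string, NaN stays numeric, so downstream code that formats or aggregates the summary column does not need a special case for a string value.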

Why is this needed?

The summary() table is a primary tool for evaluating model flexibility and complexity in GAMs. Without EDoF, it becomes significantly harder for data scientists to inspect overparametrized or small-data models and responsibly tune their smoothing penalties (lam). An empty string disrupts the readability of the summary and leaves users without feedback on how their subterms consume degrees of freedom.

🎯 Proposed Solution (Optional but Encouraged)

In pygam/pygam.py, properly compute the trace of the influence (hat) matrix for the individual terms or adapt the EDoF estimation logic to handle overparametrized settings via pseudo-inverses or regularized rank computations over the covariance matrix. If EDoF computation is wholly unavailable, replace the hidden empty string fallback with NaN and emit an actionable warning:

```python
if len(self.statistics_["edof_per_coef"]) == len(self.coef_):
    idx = self.terms.get_coef_indices(i)
    edof = np.round(self.statistics_["edof_per_coef"][idx].sum(), 1)
else:
    edof = float('nan') # Display NaN rather than a blank space
```
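The hat-matrix approach mentioned in the proposal can be sketched with NumPy. This is a minimal sketch under stated assumptions, not pyGAM's internal computation: it assumes a Gaussian model with identity link, uses a random matrix `B` as a stand-in for the spline basis, and picks a second-difference penalty `P` and a value of `lam` purely for illustration. With F = (BᵀB + λP)⁺ BᵀB, the diagonal of F gives one EDoF value per coefficient and sums to the trace of the hat matrix, even when n_samples < n_coefs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overparametrized setting: fewer samples than coefficients.
n_samples, n_coefs = 5, 12
B = rng.normal(size=(n_samples, n_coefs))  # stand-in for a spline basis matrix
lam = 0.6                                  # smoothing strength (assumed value)

# Second-difference penalty on neighbouring coefficients (assumed penalty).
D = np.diff(np.eye(n_coefs), n=2, axis=0)
P = D.T @ D

# F = (B'B + lam * P)^+ B'B. The pseudo-inverse keeps this defined even if
# the penalized normal matrix turns out to be rank deficient.
BtB = B.T @ B
A_pinv = np.linalg.pinv(BtB + lam * P)
F = A_pinv @ BtB

edof_per_coef = np.diag(F)       # one value per coefficient
total_edof = edof_per_coef.sum()

# Cross-check: the hat matrix H = B (B'B + lam * P)^+ B' has the same trace.
H = B @ A_pinv @ B.T
```

Because `edof_per_coef` here always has length `n_coefs`, summing it over a term's coefficient indices would give a per-term value regardless of the sample count. The total cannot exceed `n_samples`, which reflects that the data can support at most that many degrees of freedom.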

Relevant modules/files:

  • pygam/pygam.py (specifically summary())
  • pygam/tests/test_GAM_methods.py (the test test_more_splines_than_samples contains an explicit "# TODO here is our bug:" comment marking this case)

📎 Additional Context
This bug matches two inline comments in the dswah/pyGAM repository:

  • pygam/pygam.py: # TODO bug: if the number of samples is less than the number of coefficients we cant get the edof per term
  • pygam/tests/test_GAM_methods.py: # TODO here is our bug: we cannot display the term-by-term effective DoF

🙋 Claiming This Issue

To avoid duplicated work:

  • I'm willing to solve this issue by myself
