Description
🔍 Issue Description
Fix empty Effective Degrees of Freedom (EDoF) in summary() when splines exceed samples (n_splines > n_samples).
📌 Issue Type
- Bug
📝 Description
In pyGAM, when a model is fit with more splines/coefficients than the number of training samples (e.g., s(0, n_splines=n+1) where n=len(X)), the term-by-term Effective Degrees of Freedom (EDoF) is omitted from the summary() output. The table instead silently displays a blank string "" for the EDoF of all terms.
What is happening?
Within the summary method of pygam.py (around line 1753), a conditional check explicitly verifies if the length of self.statistics_["edof_per_coef"] is equal to the length of the model coefficients (self.coef_). Because the number of computed values in edof_per_coef becomes smaller than the number of coefficients in an overparametrized setting (n_samples < n_coefs), the check fails. As an internal fallback documented with an inline # TODO bug, it assigns an empty string (edof = "") instead of the computed EDoF in the term summary.
What should happen instead?
The term summary should display a computed or approximated Effective Degrees of Freedom for each term even when n_samples < n_coefs. If no approximation is mathematically feasible, it should explicitly display NaN or a documented placeholder rather than a silent empty string, together with a warning that the per-term EDoF is unidentifiable because there are too few samples.
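As one possible direction, here is a sketch of a standard per-coefficient EDoF definition for penalized least squares that keeps one value per coefficient regardless of sample size. It is not taken from pyGAM's current implementation. The symbols $X$ (the $n \times p$ model matrix), $P$ (a positive semi-definite penalty matrix), $\lambda$ (the smoothing parameter, `lam`) and $F$ are assumptions introduced for illustration, and the sketch assumes an identity link with unit weights and that $X^\top X + \lambda P$ is invertible:

$$
H = X \left(X^\top X + \lambda P\right)^{-1} X^\top,
\qquad
F = \left(X^\top X + \lambda P\right)^{-1} X^\top X
$$

$$
\mathrm{edof}_j = F_{jj},
\qquad
\sum_{j=1}^{p} F_{jj} = \operatorname{tr}(F) = \operatorname{tr}(H) \le \min(n, p)
$$

Because $F$ is $p \times p$, its diagonal always has $p$ entries, so a per-term EDoF can be obtained by summing $F_{jj}$ over that term's coefficient indices even when $n < p$. The total can never exceed $n$, which is the expected behaviour in the overparametrized case.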
Why is this needed?
The summary() table is a primary tool for evaluating model flexibility and complexity in GAMs. Without EDoF, it becomes significantly harder for data scientists to inspect overparametrized or small-data models and responsibly tune their smoothing penalties (lam). An empty string disrupts the readability of the summary and leaves users without feedback on how their subterms consume degrees of freedom.
🎯 Proposed Solution (Optional but Encouraged)
In pygam/pygam.py, properly compute the trace of the influence (hat) matrix for the individual terms or adapt the EDoF estimation logic to handle overparametrized settings via pseudo-inverses or regularized rank computations over the covariance matrix. If EDoF computation is wholly unavailable, replace the hidden empty string fallback with NaN and emit an actionable warning:
```python
if len(self.statistics_["edof_per_coef"]) == len(self.coef_):
    idx = self.terms.get_coef_indices(i)
    edof = np.round(self.statistics_["edof_per_coef"][idx].sum(), 1)
else:
    # Display NaN rather than a blank string
    edof = float("nan")
```
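The snippet above falls back to NaN but does not yet emit the warning mentioned in the proposal. A minimal, dependency-free sketch of how the two could be combined follows. The helper name `term_edof` and its parameters are hypothetical and do not exist in pyGAM; plain lists stand in for the NumPy arrays used in `summary()`.

```python
import math
import warnings


def term_edof(edof_per_coef, n_coefs, idx):
    """Hypothetical helper: summed EDoF for one term, or NaN if unavailable.

    edof_per_coef: sequence of per-coefficient EDoF values
    n_coefs: total number of model coefficients
    idx: indices of the coefficients that belong to the term
    """
    if len(edof_per_coef) != n_coefs:
        # Mirrors the failing length check in summary(): tell the user
        # instead of silently printing an empty string.
        warnings.warn(
            "Per-term EDoF is unavailable: got {} EDoF values for {} "
            "coefficients (likely n_samples < n_coefs). "
            "Reporting NaN.".format(len(edof_per_coef), n_coefs),
            UserWarning,
            stacklevel=2,
        )
        return float("nan")
    return round(sum(edof_per_coef[i] for i in idx), 1)


# Normal case: one EDoF value per coefficient.
normal = term_edof([1.0, 0.5, 0.25], n_coefs=3, idx=[0, 1])
print(normal)

# Overparametrized case: fewer EDoF values than coefficients.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = term_edof([1.0, 0.5], n_coefs=3, idx=[0, 1])
print(math.isnan(result), len(caught))
```

Returning NaN keeps the column numeric for anyone parsing the summary programmatically, while the warning makes the missing value visible at fit-inspection time.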
Relevant modules/files:
- pygam/pygam.py (specifically summary())
- pygam/tests/test_GAM_methods.py (the test test_more_splines_than_samples contains an explicit "# TODO here is our bug:" comment marking this case)
📎 Additional Context
This bug corresponds to two inline comments in the dswah/pyGAM repository:
- pygam/pygam.py: # TODO bug: if the number of samples is less than the number of coefficients we cant get the edof per term.
- pygam/tests/test_GAM_methods.py: # TODO here is our bug: we cannot display the term-by-term effective DoF.
🙋 Claiming This Issue
To avoid duplicated work:
- I'm willing to solve this issue by myself