Taking weighting seriously #487

gragusa · 2022-07-15T16:07:11Z

This PR addresses several problems with the current GLM implementation.

Current status
In master, GLM/LM only accepts weights through the keyword wts. These weights are implicitly frequency weights.

With this PR
FrequencyWeights, AnalyticWeights, and ProbabilityWeights are possible. The API is the following

## Frequency Weights
lm(@formula(y~x), df; wts=fweights(df.wts))
## Analytic Weights
lm(@formula(y~x), df; wts=aweights(df.wts))
## Probability Weights
lm(@formula(y~x), df; wts=pweights(df.wts))

The old behavior -- passing a plain vector, wts=df.wts -- is deprecated; for the moment, such a vector is coerced to FrequencyWeights.
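To make the meaning of frequency weights concrete, here is a minimal sketch that uses only LinearAlgebra rather than GLM itself. It assumes the textbook interpretation that a frequency weight of k stands for k identical rows; the helper `wls` and the data are made up for illustration.

```julia
using LinearAlgebra

# Weighted least squares via the normal equations: solve (X'WX) β = X'Wy.
# `wls` is a hypothetical helper, not part of GLM.
wls(X, y, w) = (X' * Diagonal(w) * X) \ (X' * Diagonal(w) * y)

X = [ones(4) [1.0, 2.0, 3.0, 4.0]]
y = [1.1, 1.9, 3.2, 3.9]
w = [1, 3, 2, 1]                      # integer frequency weights (made-up data)

# Expand the data set: row i is repeated w[i] times.
idx = reduce(vcat, [fill(i, w[i]) for i in eachindex(w)])

β_weighted = wls(X, y, w)             # fit on 4 rows with weights
β_expanded = X[idx, :] \ y[idx]       # ordinary least squares on 7 rows
```

Under that interpretation both fits give the same coefficients, which is why a weights-aware `nobs` would be expected to report `sum(w)` rather than the number of rows.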

To allow dispatching on the weights, CholPred takes a parameter T<:AbstractWeights. The unweighted LM/GLM has UnitWeights as the parameter for the type.

This PR also implements residuals(r::RegressionModel; weighted::Bool=false) and modelmatrix(r::RegressionModel; weighted::Bool=false). The new signature for these two methods is pending in StatsAPI.
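A small sketch of what a `weighted=true` variant could return, assuming (this is an assumption, not confirmed by the text above) that it scales each row by the square root of its weight. With that scaling, ordinary least squares on the scaled data reproduces the weighted fit.

```julia
using LinearAlgebra

X = [ones(5) [0.5, 1.0, 1.5, 2.0, 2.5]]
y = [1.0, 1.4, 2.1, 2.4, 3.2]
w = [2.0, 1.0, 0.5, 1.0, 2.0]         # made-up weights

# Weighted least squares through the normal equations.
β = (X' * Diagonal(w) * X) \ (X' * Diagonal(w) * y)

# Assumed meaning of `weighted=true`: scale row i by sqrt(w[i]).
Xw = sqrt.(w) .* X                    # "weighted" model matrix
rw = sqrt.(w) .* (y - X * β)          # "weighted" residuals

# OLS on the scaled data gives the same coefficients.
β_ols = Xw \ (sqrt.(w) .* y)
```

With this convention the weighted residuals are orthogonal to the weighted model matrix, and their sum of squares equals the weighted residual sum of squares.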

I had to make many changes to get everything working. Tests are passing, but some new features need new tests. Before implementing them, I wanted to make sure the approach taken here is acceptable.

I have also implemented momentmatrix, which returns the estimating function of the estimator. I came to the conclusion that it does not make sense to have a keyword argument weighted. Thus I will amend JuliaStats/StatsAPI.jl#16 to remove that keyword from the signature.
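For a weighted linear model, the estimating function has one row per observation, w[i] * r[i] * x[i], and its columns sum to zero at the solution. The sketch below illustrates this with plain LinearAlgebra; whether the PR's momentmatrix uses exactly this scaling is an assumption.

```julia
using LinearAlgebra

X = [ones(5) [0.5, 1.0, 1.5, 2.0, 2.5]]
y = [1.0, 1.4, 2.1, 2.4, 3.2]
w = [2.0, 1.0, 0.5, 1.0, 2.0]         # made-up weights

β = (X' * Diagonal(w) * X) \ (X' * Diagonal(w) * y)
r = y - X * β

# Row i holds the contribution of observation i to the score equations.
M = (w .* r) .* X                     # n × p matrix

# The first-order conditions say that each column sums to zero.
column_sums = vec(sum(M; dims = 1))
```

Because the weights already enter each row, a separate weighted/unweighted switch has no obvious meaning here, which is consistent with dropping the keyword.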

Update

I think I covered all the suggestions/comments with this exception as I have to think about it. Maybe this can be addressed later. The new standard errors (the one for ProbabilityWeights) also work in the rank deficient case (and so does cooksdistance).

Tests are passing and I think they cover everything that I have implemented. Also, added a section in the documentation about using Weights and updated jldoc with the new signature of CholeskyPivoted.

To do:

Deal with weighted standard errors with rank deficient designs
Document the new API
Improve testing

Closes #186, #259.


codecov-commenter · 2022-07-16T08:43:43Z

Codecov Report

Attention: Patch coverage is 95.44419% with 20 lines in your changes missing coverage. Please review.

Project coverage is 95.20%. Comparing base (8f58b34) to head (aff48d6).

Files with missing lines	Patch %	Lines
src/glmfit.jl	93.00%	10 Missing ⚠️
src/lm.jl	93.13%	7 Missing ⚠️
src/glmtools.jl	92.30%	2 Missing ⚠️
src/negbinfit.jl	92.30%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
+ Coverage   94.82%   95.20%   +0.37%     
==========================================
  Files           8        8              
  Lines        1044     1251     +207     
==========================================
+ Hits          990     1191     +201     
- Misses         54       60       +6


lrnv · 2022-07-20T07:45:33Z

Hey,

Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows but does not discard the corresponding values of df.weights, and then complains that there are too many weights?

I think the interfacing should allow for a DataFrame input of weights, that would take care of such things (like it does for the other variables).
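A minimal sketch of the behaviour being requested, written with plain vectors rather than a DataFrame: rows with a missing value in any variable are dropped together with their weights, so the lengths always agree. The variable names and data are made up, and nothing here is GLM's actual code.

```julia
y = [1.0, missing, 3.0, 4.0]
x = [0.5, 1.0, missing, 2.0]
w = [1.0, 2.0, 1.0, 3.0]              # one weight per original row

# Keep a row only if the response, the predictor and the weight are all present.
keep = .!(ismissing.(y) .| ismissing.(x) .| ismissing.(w))

# Narrow the element types now that no missing values remain.
y_c = Float64.(y[keep])
x_c = Float64.(x[keep])
w_c = Float64.(w[keep])
```

Treating the weights as one more column when building the complete-case mask is what keeps them aligned with the rows that survive.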

gragusa · 2022-07-20T17:14:41Z

Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows but does not discard the corresponding values of df.weights, and then complains that there are too many weights?

not really. But it would be easy to make this a feature. But before digging further on this I would like to know whether there is consensus on the approach of this PR.

alecloudenback · 2022-08-14T19:14:57Z

FYI this appears to fix #420; a PR was started in #432 and the author closed for lack of time on their part to investigate CI failures.

Here's the test case pulled from #432, which passes with the changes in #487.

@testset "collinearity and weights" begin
    rng = StableRNG(1234321)
    x1 = randn(100)
    x1_2 = 3 * x1
    x2 = 10 * randn(100)
    x2_2 = -2.4 * x2
    y = 1 .+ randn() * x1 + randn() * x2 + 2 * randn(100)
    df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5],50))
    f = @formula(y ~ x1 + x2 + x3 + x4)
    lm_model = lm(f, df, wts = df.weights)#, dropcollinear = true)
    X = [ones(length(y)) x1_2 x2_2]
    W = Diagonal(df.weights)
    coef_naive = (X'W*X)\X'W*y
    @test lm_model.model.pp.chol isa CholeskyPivoted
    @test rank(lm_model.model.pp.chol) == 3
    @test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end

Can this test set be added?

Is there any other feedback for @gragusa ? It would be great to get this merged if good to go.
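The idea behind the test can be sketched without GLM: with perfectly collinear columns and weights, the weighted design has reduced rank, and the fitted values agree with those from a design that keeps only the independent columns. The sketch uses a pseudoinverse with an explicit tolerance, which is a stand-in chosen for illustration and not the pivoted Cholesky that the test inspects; the data are made up.

```julia
using LinearAlgebra

n  = 6
x1 = [0.2, -1.0, 0.7, 1.5, -0.3, 0.9]
x2 = 3 .* x1                          # exactly collinear with x1
x3 = [1.0, 4.0, -2.0, 0.5, 3.0, -1.5]
x4 = -2.4 .* x3                       # exactly collinear with x3
y  = [1.0, 0.2, 1.9, 2.4, 0.3, 2.2]
w  = repeat([1.0, 0.5], 3)

X_full = [ones(n) x1 x2 x3 x4]
X_red  = [ones(n) x1 x3]

# Scale rows by sqrt(w) so that least squares on the scaled data is WLS.
sw = sqrt.(w)

# Minimum-norm weighted solution on the rank-deficient design.
β_full = pinv(sw .* X_full; rtol = 1e-10) * (sw .* y)
# Ordinary weighted solution on the full-rank reduced design.
β_red  = (sw .* X_red) \ (sw .* y)

design_rank = rank(sw .* X_full; rtol = 1e-10)
```

The coefficients themselves differ between the two parameterisations; only the fitted values and the rank are comparable.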

nalimilan · 2022-08-28T18:27:50Z

Sorry for the long delay, I hadn't realized you were waiting for feedback. Looks great overall, please feel free to finish it! I'll try to find the time to make more specific comments.

nalimilan

I've read the code. Lots of comments, but all of these are minor. The main one is mostly stylistic: in most cases it seems that using if wts isa UnitWeights inside a single method (like the current structure) gives simpler code than defining several methods. Otherwise the PR looks really clean!
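The stylistic point can be illustrated with a toy example. The types below are hypothetical stand-ins defined only for this sketch (they are not the StatsBase weight types), and the function computes a weighted residual sum of squares in both styles.

```julia
# Hypothetical stand-in types, defined only for this sketch.
abstract type SketchWeights end
struct SketchUnitWeights <: SketchWeights
    n::Int
end
struct SketchFrequencyWeights <: SketchWeights
    values::Vector{Float64}
end

# Style 1: a single method that branches on the weight type.
function wrss_branch(r::Vector{Float64}, wts::SketchWeights)
    if wts isa SketchUnitWeights
        return sum(abs2, r)           # unweighted shortcut
    else
        return sum(wts.values .* abs2.(r))
    end
end

# Style 2: one method per weight type.
wrss_dispatch(r::Vector{Float64}, ::SketchUnitWeights) = sum(abs2, r)
wrss_dispatch(r::Vector{Float64}, wts::SketchFrequencyWeights) =
    sum(wts.values .* abs2.(r))

r  = [1.0, -2.0, 0.5]
uw = SketchUnitWeights(3)
fw = SketchFrequencyWeights([2.0, 1.0, 4.0])
```

Both styles return the same values; the branch keeps shared setup code in one place, while separate methods duplicate it whenever only a small part of the body differs.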

What are your thoughts regarding testing? There are a lot of combinations to test and it's not easy to see how to integrate that into the current organization of tests. One way would be to add code for each kind of weights to each @testset that checks a given model family (or a particular case, like collinear variables). There's also the issue of testing the QR factorization, which isn't used by default.

src/GLM.jl

src/glmfit.jl

src/lm.jl

test/runtests.jl

bkamins · 2022-08-31T08:49:28Z

A very nice PR. In the tests, can we have a test set that compares the results of aweights, fweights, and pweights on the same data (coefficients, predictions, covariance matrix of the estimates, p-values, etc.)?
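A rough sketch of what such a comparison would be expected to show, using textbook conventions rather than GLM's code: the coefficients coincide for all three kinds of weights, while the covariance matrices differ. The formulas below are assumptions for illustration: residual degrees of freedom sum(w) - p for frequency weights, n - p for analytic weights, and an uncorrected sandwich estimator for probability weights. The PR may apply different small-sample corrections.

```julia
using LinearAlgebra

X = [ones(6) [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
y = [1.2, 1.9, 3.1, 3.8, 5.3, 5.9]
w = [1.0, 2.0, 1.0, 3.0, 2.0, 1.0]    # made-up weights
n, p = size(X)

A = X' * Diagonal(w) * X
β = A \ (X' * Diagonal(w) * y)        # identical for all three weight types
r = y - X * β
wrss = sum(w .* r .^ 2)

# Frequency weights: the weights count observations.
V_f = (wrss / (sum(w) - p)) * inv(A)
# Analytic weights: the number of rows counts observations.
V_a = (wrss / (n - p)) * inv(A)
# Probability weights: sandwich estimator without small-sample correction.
M   = (w .* r) .* X
V_p = inv(A) * (M' * M) * inv(A)

se(V) = sqrt.(diag(V))                # standard errors from a covariance matrix
```

With these conventions the frequency and analytic standard errors differ only by the factor sqrt((n - p) / (sum(w) - p)), which gives a cheap consistency check for a test set.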


gragusa · 2025-04-29T15:15:49Z

@nalimilan @ajinkya-k all tests pass. (The two failures are due to HTTPS failures.) What is still needed?

nalimilan · 2025-04-29T21:16:23Z

Thanks. A few uncovered lines really still need testing. I also think some of our comments haven't been addressed yet (I can check if you don't find them).

nalimilan · 2025-04-24T11:48:15Z

src/lm.jl

             dropcollinear::Bool=true, method::Symbol=:qr)
    # For backward compatibility accept wts as AbstractArray and coerce them to FrequencyWeights
    _wts = convert_weights(wts)
-    if !(wts isa AbstractWeights && isempty(_wts))


Did you revert this because it doesn't work?

nalimilan · 2025-04-29T21:09:13Z

src/glmfit.jl

+                throw(ArgumentError("The `nullloglikelihood` for analytic weighted models with `Bernoulli` and `Binomial` families is not supported."))
            end
+            @inbounds for i in eachindex(y, mu, wts)
+                ll += loglik_apweights_obs(d, y[i], mu[i], wts[i], δ, sum(wts), N)


This should be tested.

nalimilan · 2025-04-29T21:09:59Z

src/glmfit.jl

+    # For backward compatibility accept wts as AbstractArray and coerce them to FrequencyWeights
+    _wts = convert_weights(wts)
+    if !(wts isa AbstractWeights) && isempty(_wts)
+        Base.depwarn("Using `wts` of zero length for unweighted regression is deprecated in favor of " *


Can you test this too?

nalimilan · 2025-04-29T21:11:03Z

src/lm.jl

+function loglikelihood(r::LmResp{T,<:AnalyticWeights}) where {T}
+    N = length(r.y)
+    n = sum(log, weights(r))
+    return (n - N * (log(2π * deviance(r) / N) + 1)) / 2


Needs testing too.
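One way to test this is to check the closed form against a direct sum of normal log-densities. The sketch below assumes the usual analytic-weights model, in which observation i has variance σ²/w[i] and σ² is estimated by deviance/N, with the deviance taken to be the weighted residual sum of squares. The helper `normal_logpdf` and the data are made up.

```julia
# Log-density of a normal distribution with mean μ and variance σ2.
normal_logpdf(y, μ, σ2) = -(log(2π * σ2) + (y - μ)^2 / σ2) / 2

y = [1.0, 1.4, 2.1, 2.4, 3.2]
μ = [0.9, 1.5, 2.0, 2.6, 3.1]         # fitted means (made-up values)
w = [2.0, 1.0, 0.5, 1.0, 2.0]         # analytic weights

N   = length(y)
dev = sum(w .* (y .- μ) .^ 2)         # weighted residual sum of squares

# Closed form, mirroring the expression in the diff above.
closed = (sum(log, w) - N * (log(2π * dev / N) + 1)) / 2

# Direct evaluation: observation i has variance σ2 / w[i].
σ2 = dev / N
direct = sum(normal_logpdf(y[i], μ[i], σ2 / w[i]) for i in 1:N)
```

The two values agree for any choice of μ, because the quadratic term sums to exactly N once σ² is set to dev / N.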

nalimilan · 2025-04-29T21:11:51Z

src/lm.jl

+                     :fit)
+        fweights(wts)
+    else
+        throw(ArgumentError("`wts` should be an `AbstractVector` coercible to `AbstractWeights`"))


Also worth testing.
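The three branches of such a coercion (already a weights object, a plain numeric vector, anything else) can be sketched with stand-in types. Everything below is hypothetical and defined only for illustration: `SketchFreqWeights` and `coerce_weights_sketch` are not GLM or StatsBase names.

```julia
# Hypothetical stand-in for a frequency-weights wrapper.
struct SketchFreqWeights
    values::Vector{Float64}
end

# Already the right type: pass through unchanged.
coerce_weights_sketch(w::SketchFreqWeights) = w

# Plain numeric vector: warn about the deprecated form, then wrap it.
function coerce_weights_sketch(w::AbstractVector{<:Real})
    Base.depwarn("Passing a plain vector as weights is deprecated; " *
                 "wrap it in a weights type explicitly.",
                 :coerce_weights_sketch)
    return SketchFreqWeights(collect(Float64, w))
end

# Anything else cannot be interpreted as weights.
coerce_weights_sketch(w) =
    throw(ArgumentError("weights should be an `AbstractVector` coercible to weights"))

wrapped = coerce_weights_sketch([1, 2, 3])
```

A test for the error branch only needs an argument that is neither a weights object nor a numeric vector, for example a string.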

nalimilan · 2025-04-29T21:13:05Z

src/glmtools.jl

+    return wt * logpdf(Gamma(inv(ϕ / sumwt), μ * ϕ / sumwt), y)
+end
+function loglik_apweights_obs(::Geometric, y, μ, wt, ϕ, sumwt, n)
+    return wt * logpdf(Geometric(1 / (μ + 1)), y)


Also test this.

ajinkya-k

I may have missed some conversation, but it should be possible to use multiple dispatch instead of using loglik_apweights_obs, right?

ajinkya-k · 2025-04-29T21:29:38Z

src/glmtools.jl

+## sumwt is sum(wt)
+## n is the number of observations
+
+function loglik_apweights_obs(::Gamma, y, μ, wt, ϕ, sumwt, n)


Suggested change

- function loglik_apweights_obs(::Gamma, y, μ, wt, ϕ, sumwt, n)
+ function loglik_obs(::Gamma, y, μ, wt::AnalyticWeights, ϕ, sumwt, n)

also the same in other places below
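A sketch of how dispatch on the kind of weights could look. Note that the call site quoted earlier passes wts[i], a single number, so this sketch dispatches on the weights container and passes the per-observation weight separately. All types and function names are hypothetical stand-ins, and a normal family is used to keep the example free of package dependencies.

```julia
# Log-density of a normal distribution with mean μ and variance σ2.
normal_logpdf(y, μ, σ2) = -(log(2π * σ2) + (y - μ)^2 / σ2) / 2

# Hypothetical stand-in weight types.
abstract type SketchWeights end
struct SketchFrequency <: SketchWeights
    values::Vector{Float64}
end
struct SketchAnalytic <: SketchWeights
    values::Vector{Float64}
end

# Frequency weights: the observation counts wt times.
obs_loglik(::SketchFrequency, y, μ, wt, ϕ) = wt * normal_logpdf(y, μ, ϕ)
# Analytic weights: the observation has variance ϕ / wt.
obs_loglik(::SketchAnalytic, y, μ, wt, ϕ) = normal_logpdf(y, μ, ϕ / wt)

# The container selects the method; the scalar weight is passed alongside.
total_loglik(wts::SketchWeights, y, μ, ϕ) =
    sum(obs_loglik(wts, y[i], μ[i], wts.values[i], ϕ) for i in eachindex(y))

y = [1.0, 1.4, 2.1]
μ = [0.9, 1.5, 2.0]
ll_f = total_loglik(SketchFrequency([2.0, 1.0, 3.0]), y, μ, 0.5)
ll_a = total_loglik(SketchAnalytic([2.0, 1.0, 3.0]), y, μ, 0.5)
```

With unit weights the two methods coincide, which is a convenient sanity check for either design.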

gragusa · 2025-04-30T06:34:02Z

Thanks. A few uncovered lines really still need testing. I also think some of our comments haven't been addressed yet (I can check if you don't find them).

I don't know - there are so many comments that I cannot find anything among them.

nalimilan · 2025-04-30T12:49:47Z

Yeah, GitHub makes it painful to find them, especially as you have to click many times to expand hidden comments. But unresolved comments are still there. Here's a list:
#487 (comment)
#487 (comment)
#487 (comment)
#487 (comment)
#487 (comment)
#487 (comment)
#487 (comment)
#487 (comment)
#487 (comment)
#487 (comment)
#487 (comment)
#487 (comment)

nalimilan · 2025-04-29T21:31:24Z

src/linpred.jl

 end

-function delbeta!(p::DensePredChol{T,<:Cholesky}, r::Vector{T},
+function delbeta!(p::DensePredChol{T,<:Cholesky,<:AbstractWeights}, r::Vector{T},


Adding <:AbstractWeights doesn't seem necessary here nor below?


nalimilan · 2025-04-30T16:56:22Z

Something I forgot: at #350 we wanted to rename the wts argument to weights. Can be done in a separate PR though if you prefer.

ajinkya-k · 2025-04-30T22:04:14Z

Something I forgot: at #350 we wanted to rename the wts argument to weights. Can be done in a separate PR though if you prefer.

I think this should be in a different PR given that this one is huge already 😅

gragusa added 20 commits June 10, 2022 20:53

WIP

1754cbd

WIP

1d778a5

WIP

12121a3

Taking weights seriously

4363ba4

WIP

ca702dc

Taking weights seriously

e2b2d12

Merge branch 'master' of https://github.com/JuliaStats/GLM.jl into Ju…

bc8709a

…liaStats-master

Add depwarn for passing wts with Vector

84cd990

Cosmettic changes

cbc329f

WIP

23d67f5

Fix loglik for weighted models

f4d90a9

Fix remaining issues

6b7d95c

Final commit

c236b82

Merge branch 'master'

d4bd0c2

Fix merge

8bdfb55

Fix nulldeviance

3eb2ca4

Bypass crossmodelmatrix drom StatsAPI

63c8358

Delete momentmatrix.jl

e93a919

Delete scratch.jl

7bb0959

Delete settings.json

ded17a8

ararslan requested review from andreasnoack and nalimilan August 15, 2022 19:54

nalimilan mentioned this pull request Aug 28, 2022

Fixed linear model with perfectly collinear rhs variables and weights #432

Closed

nalimilan reviewed Aug 31, 2022


gragusa and others added 19 commits April 23, 2025 08:45

Update src/linpred.jl [no ci]

ce233b4

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Update src/lm.jl [no ci]

a3ff516

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Update src/lm.jl [no ci]

81299f1

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Update src/negbinfit.jl [no ci]

fa23a79

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Update src/glmfit.jl [no ci]

28a950c

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Update src/glmfit.jl [no ci]

38ff9a2

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Update src/glmfit.jl [no ci]

064ad35

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Update src/glmfit.jl [no ci]

7635b10

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Update src/glmfit.jl [no ci]

53c2f8c

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Update src/glmfit.jl [no ci]

125a44f

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Update src/glmfit.jl [no ci]

5a5ae3d

Co-authored-by: Milan Bouchet-Valat <[email protected]>

Fix comment in GlmResp

b839090

Fix how wts are checked for empty size

9ecf9b5

Fix formatting [noci]

efbcbf6

Remove loglil_apweights_obs for Bernoulli and Binomial

94d1044

Fix formatting

ef6c86b

Fix formatting

786dd99

Test linkinv with CauchitLink

a3d68bc

Fix CauchitLink linkinv

aff48d6

nalimilan reviewed Apr 29, 2025


ajinkya-k suggested changes Apr 29, 2025


nalimilan reviewed Apr 30, 2025


Update src/lm.jl [no ci]

39300ab

Co-authored-by: Milan Bouchet-Valat <[email protected]>

nalimilan mentioned this pull request Apr 30, 2025

nobs() should be number of obs; wobs() should be current nobs #259

Open
