Conversation

@sueoglu
Collaborator

@sueoglu sueoglu commented Oct 24, 2025

fixes #950

  • improved qc_metrics function with new metrics:

  • in _compute_obs_metrics :
    unique_values_abs
    unique_values_ratio
    entropy_of_missingness

  • in _compute_var_metrics :
    unique_values_abs
    unique_values_ratio
    entropy_of_missingness
    coefficient_of_variation
    is_constant
    constant_variable_ratio
    range_ratio

  • updated tests accordingly with new metrics
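For readers unfamiliar with these metrics, here is a minimal sketch of how a few of the per-variable ones could be computed. It is not the code from this PR. It assumes a 2D float array with NaN marking missing values, takes entropy_of_missingness to be the Shannon entropy of the missing/observed indicator, divides unique_values_ratio by the number of observed values, and uses the population standard deviation for coefficient_of_variation. The function names are hypothetical, and the PR's actual definitions may differ. range_ratio and constant_variable_ratio are left out because their definitions are not stated here.

```python
from __future__ import annotations

import numpy as np


def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return float(-(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p)))


def sketch_var_metrics(X: np.ndarray) -> dict[str, np.ndarray]:
    """Per-column metrics for a 2D float array where NaN means missing.

    Hypothetical helper for illustration; definitions are assumptions.
    """
    missing = np.isnan(X)
    n_vars = X.shape[1]
    metrics = {
        "unique_values_abs": np.zeros(n_vars, dtype=int),
        "unique_values_ratio": np.full(n_vars, np.nan),
        "entropy_of_missingness": np.zeros(n_vars),
        "coefficient_of_variation": np.full(n_vars, np.nan),
        "is_constant": np.zeros(n_vars, dtype=bool),
    }
    for j in range(n_vars):
        observed = X[~missing[:, j], j]
        # Entropy of the missing/observed indicator for this column.
        metrics["entropy_of_missingness"][j] = binary_entropy(missing[:, j].mean())
        if observed.size == 0:
            continue  # nothing observed: leave the remaining defaults
        n_unique = np.unique(observed).size
        metrics["unique_values_abs"][j] = n_unique
        metrics["unique_values_ratio"][j] = n_unique / observed.size
        metrics["is_constant"][j] = n_unique == 1
        mean = observed.mean()
        if mean != 0:
            # Population standard deviation (ddof=0) divided by the mean.
            metrics["coefficient_of_variation"][j] = observed.std() / mean
    return metrics
```

Under these assumptions, a column that is missing in half of the observations gets an entropy of 1 bit, and a fully observed or fully missing column gets 0.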

TODO

  • porting to 3D

@Zethson Zethson marked this pull request as draft October 24, 2025 13:30
@eroell
Collaborator

eroell commented Nov 21, 2025

in _compute_obs_metrics :
unique_values_abs
unique_values_ratio

These only make sense for categorical data; for float-valued features they are not very meaningful.
For ehrapy to know which features are categorical, infer_feature_types must be called first, and I think it would be nice to require that for as few functions as possible.
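To illustrate the point about floats with a small sketch: the data is synthetic and unique_ratio is a hypothetical helper, not part of ehrapy. A continuous measurement has almost as many unique values as observations, so its ratio sits close to 1 whatever the data looks like, while a categorical feature gives a small, interpretable ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic columns: a continuous measurement and a two-level categorical feature.
lab_value = rng.normal(loc=5.0, scale=1.0, size=n)
sex = rng.choice(["f", "m"], size=n)


def unique_ratio(values: np.ndarray) -> float:
    """Number of distinct values divided by the number of values."""
    return np.unique(values).size / values.size


print("continuous: ", unique_ratio(lab_value))
print("categorical:", unique_ratio(sex))
```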

entropy_of_missingness

Cool

in _compute_var_metrics :
unique_values_abs
unique_values_ratio

Comment on categorical vs numeric above applies

entropy_of_missingness

Cool from above applies

coefficient_of_variation
is_constant
constant_variable_ratio
range_ratio
skewness
kurtosis

All of these require knowing which variables are categorical and which are numerical, so the comment above applies.

What do you think about having qc_metrics compute, by default, only the metrics that need no feature type information, and adding an argument (e.g. a list of metric names) for the rest? Anyone who wants the fancier metrics you suggest here, which require the numeric/categorical distinction, would then need to run infer_feature_types first.
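A rough sketch of what this opt-in design could look like, purely for discussion. resolve_metrics, BASIC_METRICS and TYPE_DEPENDENT_METRICS are hypothetical names, and a plain dict (or None) stands in for whatever infer_feature_types actually stores. The grouping of metrics follows the comments above.

```python
from __future__ import annotations

from collections.abc import Collection

# Metrics that need no feature type information (computed by default).
BASIC_METRICS = ("entropy_of_missingness",)

# Metrics that only make sense once numeric/categorical types are known.
TYPE_DEPENDENT_METRICS = (
    "unique_values_abs",
    "unique_values_ratio",
    "coefficient_of_variation",
    "is_constant",
    "constant_variable_ratio",
    "range_ratio",
    "skewness",
    "kurtosis",
)


def resolve_metrics(
    requested: Collection[str] | None = None,
    feature_types: dict[str, str] | None = None,
) -> list[str]:
    """Return the metrics to compute, failing early if feature types are needed but absent."""
    if requested is None:
        return list(BASIC_METRICS)
    known = set(BASIC_METRICS) | set(TYPE_DEPENDENT_METRICS)
    unknown = [m for m in requested if m not in known]
    if unknown:
        raise ValueError(f"Unknown metrics: {unknown}")
    needs_types = [m for m in requested if m in TYPE_DEPENDENT_METRICS]
    if needs_types and feature_types is None:
        raise ValueError(
            f"{needs_types} require feature types; run infer_feature_types first."
        )
    return list(requested)
```

The trade-off is one extra parameter in exchange for an explicit error message when the type-dependent metrics are requested too early.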

@eroell
Collaborator

eroell commented Nov 23, 2025

While resolving the merge conflicts, can you also move the ARRAY_TYPES variable to compat.py, please? :)

@eroell eroell mentioned this pull request Nov 23, 2025
4 tasks
Collaborator

@eroell eroell left a comment

Some intermediate comments - not sure whether you have already asked Andreas to review or are still refining things :)

qc_vars: Collection[str] = (),
*,
layer: str | None = None,
advanced: bool = False,
Collaborator

Could this be split into two arguments, observation_level and variable_level, each taking a list of strings and defaulting to what is currently computed?

It would seem to me a bit more readable than "advanced"

Member

If possible, I'd try to keep the number of parameters as low as possible. Can we come up with a design where these parameters do not exist? For example, a scenario where some metrics are simply skipped unless the feature types have been computed, without crashing.
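One way to picture such a parameter-free design is sketched below with plain Python containers. qc_var_metrics_sketch, the "missing_values_abs" key and the "numeric"/"categorical" labels are placeholders, not the ehrapy API. Type-dependent metrics are added only for columns whose feature type is known, and nothing fails when the types are absent.

```python
from __future__ import annotations

import math
import statistics


def qc_var_metrics_sketch(
    columns: dict[str, list],
    feature_types: dict[str, str] | None = None,
) -> dict[str, dict]:
    """Compute basic metrics always and type-dependent ones only when types are known.

    columns maps a variable name to its values (None = missing).
    feature_types maps a variable name to "numeric" or "categorical",
    or is None if feature types were never inferred.
    """
    results = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        # Always available: needs no feature type information.
        row = {"missing_values_abs": len(values) - len(observed)}
        ftype = (feature_types or {}).get(name)
        if ftype == "categorical":
            row["unique_values_abs"] = len(set(observed))
        elif ftype == "numeric" and observed:
            mean = statistics.fmean(observed)
            row["coefficient_of_variation"] = (
                statistics.pstdev(observed) / mean if mean else math.nan
            )
        # Unknown type: silently skip the type-dependent metrics.
        results[name] = row
    return results
```

The cost of this approach is that the set of output columns depends on hidden state, so it would need to be documented clearly.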

Collaborator

I wonder whether the function would then compute too many things at once: it adds complexity if it computes 8 metrics in some cases and 12 in others without any change in the passed arguments. What do you think? I don't have a strong opinion here.

The way I'd see to make your design work would be to document well what is computed if no feature types are found, and what is computed additionally if they are found.

Member

Yeah kinda like that. If possible, we should purge all and any parameters. Users rarely read API docs.

@sueoglu sueoglu requested a review from agerardy November 26, 2025 17:52
Collaborator

@agerardy agerardy left a comment

Didn't find any big issues; this looks pretty good to me already. The tests seem to cover everything, as far as I understand it.

@Zethson
Member

Zethson commented Nov 28, 2025

Eventually, I'd like to do a final review because there are a few things that I think need to be changed.

Could you please update the PR description?

Development

Successfully merging this pull request may close these issues.

Longitudinal qc_metrics

5 participants