#944 longitudinal normalization #958
Conversation
for more information, see https://pre-commit.ci
… maxabs_norm and robust_scale_norm
…ehrapy into 944-longitudinal-normalization
for more information, see https://pre-commit.ci
…oved old 3D tests that only raised ValueErrors
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
Thank you! Already looks pretty good.
- Many of my comments are repetitive so I stopped repeating them after some time 😄
- Many of your tests have tons of useless comments. Let the code speak for itself and clean up any LLM leftovers, please.
- Please also follow the comments that I make in Öyku's PRs. One of them is to improve the PR description and add some usage examples.
Just a first quick pass. I'll let @eroell have a go and then I might have a look again.
Thanks!
… properly handle NaN values
for more information, see https://pre-commit.ci
…ehrapy into 944-longitudinal-normalization
… of .R and to use decorator for 3D arrays
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dropped a first intermediate review already, to be considered together with @sueoglu's :)
…e for more complicated functions that expect certain outcomes. removed unnecessary docstrings
… though. maxabs_norm and power_norm now advise the user about not using dask arrays and correctly raise a NotImplementedError if still used. log_norm now also uses the new decorator
…FAULT_TEM_LAYER_NAME for examples
…nt about necessary raising of NotImplementedError, moved basic tests down to precise tests, removed docstrings
if group_key is None:
    var_values = scale_func(var_values)
# group wise normalization doesnt work with dask arrays
Why does group-wise normalization not work with dask arrays?
I see no part in the group computations that cannot be computed lazily.
Besides unlocking a feature here, the whole testing logic would become much simpler once this is enabled, too.
Can you please check or show here with a small example what is not working, after fixing the currently broken group-wise normalization?
dask doesn't support in-place assignment with fancy indexing, so in line 76, for example, `X[np.ix_(group_mask, var_indices)] = scale_func(X[np.ix_(group_mask, var_indices)])` doesn't work; same for line 80. The most reasonable approach I've found is to convert to numpy arrays, normalize, and convert back. The dask-native approaches look rather complex. Is that a reasonable approach, or should we preserve the dask arrays at all costs?
Any time you convert to numpy you are materializing (and loading) the whole array at once. Therefore, this must be avoided.
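For context, a minimal sketch of the limitation and of one lazy alternative. The shapes, chunks, and the scaling shown here are illustrative only, not the actual scale_func or group logic of this PR:

```python
import dask.array as da
import numpy as np

X = da.random.random((100, 5), chunks=(50, 5))
group_mask = np.zeros(100, dtype=bool)
group_mask[:40] = True

# In-place assignment with fancy (np.ix_) indexing is not supported by dask:
# X[np.ix_(group_mask, [0])] = 0.0  # raises an error

# One lazy alternative: recompute the column and blend it back with da.where,
# so the full array is never materialized in memory.
col = X[:, 0]
scaled = (col - col[group_mask].mean()) / col[group_mask].std()
X = da.concatenate([da.where(group_mask, scaled, col)[:, None], X[:, 1:]], axis=1)
```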
eroell
left a comment
The test for 3D is very complex and tests things that are not 3D specific - kick out anything that is not really 3D specific to make it more similar in size to the simple_impute test.
I had to dig quite a while to check some fundamental behaviors. And the group_key argument for 3D seems not to work at all:
This looks fine:

edata = ed.dt.ehrdata_blobs(layer="tem_data")
print(f"{edata.layers['tem_data'].mean():.2f}")
print(f"{edata.layers['tem_data'].std():.2f}")
ep.pp.scale_norm(edata, layer="tem_data")
print(f"{edata.layers['tem_data'].mean():.2f}")
print(f"{edata.layers['tem_data'].std():.2f}")

Output:
0.61
5.88
! Feature was detected as categorical features stored numerically.Please verify and adjust if necessary using `ed.replace_feature_types`.
! Feature types were inferred and stored in edata.var[feature_type]. Please verify using `ehrdata.feature_type_overview` and adjust if necessary using `ehrdata.replace_feature_types`.
-0.00
1.00
With groupby, the overall mean and std might not be exactly 0 or 1 as above. But currently, the input is not modified at all:
edata = ed.dt.ehrdata_blobs(layer="tem_data")
print(f"{edata.layers['tem_data'].mean():.2f}")
print(f"{edata.layers['tem_data'].std():.2f}")
ep.pp.scale_norm(edata, layer="tem_data", group_key="cluster")
print(f"{edata.layers['tem_data'].mean():.2f}")
print(f"{edata.layers['tem_data'].std():.2f}")

Output:
0.61
5.88
! Feature was detected as categorical features stored numerically.Please verify and adjust if necessary using `ed.replace_feature_types`.
! Feature types were inferred and stored in edata.var[feature_type]. Please verify using `ehrdata.feature_type_overview` and adjust if necessary using `ehrdata.replace_feature_types`.
0.61
5.88
Simplified tests focusing on the most important parts are really required here
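A minimal sketch of what such a focused group-wise 3D test could look like; the ehrdata_blobs loader, scale_norm call, and "cluster" column are taken from the snippets above, not guaranteed fixtures:

```python
import numpy as np
import ehrdata as ed
import ehrapy as ep


def test_scale_norm_3d_group():
    edata = ed.dt.ehrdata_blobs(layer="tem_data")
    before = np.asarray(edata.layers["tem_data"]).copy()
    ep.pp.scale_norm(edata, layer="tem_data", group_key="cluster")
    after = np.asarray(edata.layers["tem_data"])
    assert after.shape == before.shape
    # group-wise normalization must actually modify the data
    assert not np.allclose(after, before, equal_nan=True)
```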
    edata_select.layers[DEFAULT_TEM_LAYER_NAME][:, 1, :], layer_before_select[:, 1, :], equal_nan=True
)

if edata_select.layers[DEFAULT_TEM_LAYER_NAME].shape[1] > 2:
The exact same condition is checked already before entering this branch on line 739
    edata.layers[DEFAULT_TEM_LAYER_NAME].dtype, np.floating
)

assert edata.obs.shape == orig_obs_shape
obs and var should not be affected by 3D operations, so this does not need to be tested here. Dropping it improves the focus on the 3D-relevant checks.
assert edata.obs.shape == orig_obs_shape
assert edata.var.shape[0] == orig_var_shape[0]

edata.layers["test_isolated_layer"] = layer_original.copy() * 2 + 5
again, this is not very specific to 3D normalization - that the layer argument is respected does not need to be tested here again
| assert "normalization" in edata.uns | ||
| assert len(edata.uns["normalization"]) > 0 | ||
|
|
||
| edata_invalid = edata_blobs_timeseries_small.copy() |
this whole test of edata_invalid, simply for a non-existent var, can be removed - the variable finding is handled the same for 2D and 3D in our code, and this test does not need to check it again - it won't improve the test coverage
    edata_select.layers[DEFAULT_TEM_LAYER_NAME][:, 2, :], layer_before_select[:, 2, :], equal_nan=True
)

edata_copy = edata_blobs_timeseries_small.copy()
there are so many copies, I have trouble keeping track :)
If you consider simple_impute tests, a few lines that are quite quickly understood are preferred over concatenations of many copies of an ehrdata object chaining tests together
you're right, this function got way out of hand; I'll rewrite it based on simple_impute
…d minor things in examples
…ementedError for dask arrays in group wise functions. added test_norm_group_3D that also actually verifies that the data has been changed by normalization
group_a_flat = group_a_data.flatten()[~np.isnan(group_a_data.flatten())]
group_b_flat = group_b_data.flatten()[~np.isnan(group_b_data.flatten())]

if len(group_a_flat) > 0 and len(group_b_flat) > 0:
Nit: You could use the new cool match case syntax here.
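For illustration, the suggestion could look roughly like this (Python 3.10+; the assertion body is a placeholder, not the actual test logic):

```python
import numpy as np

group_a_flat = np.array([1.0, 2.0, 3.0])
group_b_flat = np.array([10.0, 20.0, 30.0])

match (len(group_a_flat), len(group_b_flat)):
    case (0, _) | (_, 0):
        pass  # at least one group is empty after NaN removal: nothing to compare
    case _:
        assert not np.isclose(group_a_flat.mean(), group_b_flat.mean())
```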
I have a different question: why even check for this? Isn't the group size fixed, and even with a few NaNs it will never be 0 for a group?
eroell
left a comment
This has improved now: The function calls seem to do their job, and from what I see internally, dask never computes the full data.
I'll try to stop being picky :) But there are a few things I spotted that should be improved before we can merge this.
| "Normalization did not modify the data - check that feature types are set correctly" | ||
| ) | ||
|
|
||
| layer_data = edata.layers[DEFAULT_TEM_LAYER_NAME] |
Why do a second compute here? This is just the layer_after variable again.
PR Checklist
docs is updated
Description of changes
#944
This PR implements normalization support for 3D EHRData objects. The implementation enables all existing normalization functions to work with longitudinal data with shape (n_obs, n_var, n_timestamps) but maintains backward compatibility with 2D data.
Technical details
Treats .R as a named layer with 3D structure. Uses helper functions (_get_target_layer, _set_target_layer, and normalize_3d_data, _normalize_2d_data) to avoid code duplication. Each variable is processed independently by flattening the time dimension (n_obs x n_timestamps), applying the sklearn normalization function, then reshaping to 3D. Added tests for the new functions, including group functionality and NaN cases.
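Roughly, the per-variable step looks like this (illustrative sketch only; the function name is made up, and the actual helpers additionally handle NaNs, layer selection, and dask arrays):

```python
import numpy as np
from sklearn.preprocessing import scale


def _normalize_3d_sketch(X: np.ndarray) -> np.ndarray:
    """Illustrative only: X has shape (n_obs, n_var, n_timestamps)."""
    n_obs, n_var, n_t = X.shape
    out = X.astype(float, copy=True)
    for v in range(n_var):
        flat = X[:, v, :].reshape(-1, 1)                # flatten time: (n_obs * n_t, 1)
        out[:, v, :] = scale(flat).reshape(n_obs, n_t)  # normalize, then reshape back to 3D
    return out
```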
Examples:
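A possible usage sketch, based on the calls shown in the review comments above (the blobs dataset and the "cluster" group column are illustrative, not guaranteed defaults):

```python
import ehrdata as ed
import ehrapy as ep

edata = ed.dt.ehrdata_blobs(layer="tem_data")  # 3D layer of shape (n_obs, n_var, n_timestamps)
ep.pp.scale_norm(edata, layer="tem_data")      # normalize the 3D layer in place

edata_grouped = ed.dt.ehrdata_blobs(layer="tem_data")
ep.pp.scale_norm(edata_grouped, layer="tem_data", group_key="cluster")  # group-wise normalization
```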