
Conversation

@agerardy agerardy (Collaborator) commented Oct 7, 2025

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes
#944
This PR implements normalization support for 3D EHRData objects. The implementation enables all existing normalization functions to work with longitudinal data of shape (n_obs, n_var, n_timestamps) while maintaining backward compatibility with 2D data.

Technical details
Treats .R as a named layer with 3D structure. Uses helper functions (_get_target_layer, _set_target_layer, _normalize_3d_data, and _normalize_2d_data) to avoid code duplication.
Each variable is processed independently by flattening the time dimension into (n_obs x n_timestamps) values, applying the sklearn normalization function, and reshaping back to 3D.
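A minimal sketch of that per-variable idea, assuming plain numpy input (the helper name and exact signature here are illustrative, not the PR's actual code):

```python
import numpy as np
from sklearn.preprocessing import scale

def _normalize_3d_sketch(X: np.ndarray, norm_func=scale) -> np.ndarray:
    """Apply a 2D sklearn normalizer per variable of an (n_obs, n_var, n_timestamps) array."""
    n_obs, n_var, n_t = X.shape
    out = np.empty_like(X, dtype=float)
    for v in range(n_var):
        # flatten obs x time into a single column, normalize, reshape back to 3D
        flat = X[:, v, :].reshape(n_obs * n_t, 1)
        out[:, v, :] = norm_func(flat).reshape(n_obs, n_t)
    return out
```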

Added tests for the new functions, including group functionality and NaN cases.

Examples:

```python
import ehrdata as ed
import ehrapy as ep

edata = ed.dt.ehrdata_blobs(
    n_observations=100,
    base_timepoints=24,
    cluster_std=0.5,
    n_centers=3,
    seasonality=True,
    time_shifts=True,
    variable_length=False,
)

# standard scaling
ep.pp.scale_norm(edata)

# log transformation (offset negative values first so the log is defined)
ep.pp.offset_negative_values(edata)
ep.pp.log_norm(edata)
```

@agerardy agerardy linked an issue Oct 7, 2025 that may be closed by this pull request
@Zethson Zethson mentioned this pull request Oct 16, 2025
@agerardy agerardy marked this pull request as ready for review October 20, 2025 10:16
@agerardy agerardy requested a review from Zethson October 20, 2025 10:17
@Zethson Zethson (Member) left a comment

Thank you! Already looks pretty good.

  1. Many of my comments are repetitive so I stopped repeating them after some time 😄
  2. Many of your tests have tons of useless comments. Let the code speak for itself and clean up any LLM leftovers, please.
  3. Please also follow the comments that I make in Öyku's PRs. One of them is to improve the PR description and add some usage examples.

Just a first quick pass. I'll let @eroell have a go and then I might have a look again.

Thanks!

@agerardy agerardy requested a review from sueoglu November 26, 2025 17:09
@eroell eroell (Collaborator) left a comment

Dropped a first intermediate review already, to be considered together with @sueoglu's :)

…e for more complicated functions that expect certain outcomes. removed unnecessary docstrings
… though. maxabs_norm and power_norm now advise the user about not using dask arrays and correctly raise a NotImplementedError if still used. log_norm now also uses the new decorator
…nt about necessary raising of NotImplementedError, moved basic tests down to precise tests, removed docstrings
@agerardy agerardy requested a review from eroell December 3, 2025 13:01

```python
if group_key is None:
    var_values = scale_func(var_values)
# group wise normalization doesnt work with dask arrays
```
@eroell eroell (Collaborator) commented Dec 3, 2025

Why does group-wise normalization not work with dask arrays? I see no part in the group computations that cannot be computed lazily. Next to unlocking a feature here, the whole testing logic would become much easier once this is enabled, too.

Can you please check, or show here with a small example, what is not working after fixing the currently broken group-wise normalization?

Collaborator Author

dask doesn't support in-place assignment with fancy indexing, so in line 76, for example, X[np.ix_(group_mask, var_indices)] = scale_func(X[np.ix_(group_mask, var_indices)]) doesn't work; same for line 80. The most reasonable approach I've found is to convert to numpy arrays, normalize, and convert back. The dask-native approaches look rather complex. Is that a reasonable approach, or should we preserve the dask arrays at all costs?
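A minimal repro of the failing pattern described above (the shapes here are arbitrary):

```python
import dask.array as da
import numpy as np

X = da.ones((4, 3), chunks=2)
group_mask = np.array([True, False, True, False])
var_indices = np.array([0, 2])

# numpy assigns in place here; dask's __setitem__ does not support this
# kind of outer (fancy) indexed assignment and raises instead:
X[np.ix_(group_mask, var_indices)] = 0.0
```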

Member

Any time you convert to numpy you are materializing (and loading) the whole array at once. Therefore, this must be avoided.
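One dask-native possibility that sidesteps __setitem__ altogether, as a sketch (this is not what the PR implements): compute the group statistics lazily and recombine with da.where, so nothing is materialized until the caller computes.

```python
import dask.array as da
import numpy as np

X = da.random.random((100, 5), chunks=(50, 5))
group_mask = np.zeros(100, dtype=bool)
group_mask[:50] = True

rows = X[group_mask]                                 # still lazy
scaled = (X - rows.mean(axis=0)) / rows.std(axis=0)  # group statistics, still lazy
X = da.where(group_mask[:, None], scaled, X)         # recombine without assignment
```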

@eroell eroell (Collaborator) left a comment

The test for 3D is very complex and tests things that are not 3D-specific; kick out anything that is not 3D-specific to make it more similar in size to the simple_impute test.

I had to dig quite a while to check some fundamental behaviors. And the group_key argument for 3D seems not to work at all:

This looks fine:

```python
edata = ed.dt.ehrdata_blobs(layer="tem_data")
print(f"{edata.layers['tem_data'].mean():.2f}")
print(f"{edata.layers['tem_data'].std():.2f}")
ep.pp.scale_norm(edata, layer="tem_data")
print(f"{edata.layers['tem_data'].mean():.2f}")
print(f"{edata.layers['tem_data'].std():.2f}")
```

```
0.61
5.88
! Feature  was detected as categorical features stored numerically.Please verify and adjust if necessary using `ed.replace_feature_types`.
! Feature types were inferred and stored in edata.var[feature_type]. Please verify using `ehrdata.feature_type_overview` and adjust if necessary using `ehrdata.replace_feature_types`.
-0.00
1.00
```

With groupby, the overall mean and std might not be exactly 0 or 1 as above. But currently, the input is not modified at all:

```python
edata = ed.dt.ehrdata_blobs(layer="tem_data")
print(f"{edata.layers['tem_data'].mean():.2f}")
print(f"{edata.layers['tem_data'].std():.2f}")
ep.pp.scale_norm(edata, layer="tem_data", group_key="cluster")
print(f"{edata.layers['tem_data'].mean():.2f}")
print(f"{edata.layers['tem_data'].std():.2f}")
```

```
0.61
5.88
! Feature  was detected as categorical features stored numerically.Please verify and adjust if necessary using `ed.replace_feature_types`.
! Feature types were inferred and stored in edata.var[feature_type]. Please verify using `ehrdata.feature_type_overview` and adjust if necessary using `ehrdata.replace_feature_types`.
0.61
5.88
```

Simplified tests focusing on the most important parts are really needed here.

```python
edata_select.layers[DEFAULT_TEM_LAYER_NAME][:, 1, :], layer_before_select[:, 1, :], equal_nan=True
)

if edata_select.layers[DEFAULT_TEM_LAYER_NAME].shape[1] > 2:
```
Collaborator

The exact same condition is already checked before entering this branch, on line 739.

```python
edata.layers[DEFAULT_TEM_LAYER_NAME].dtype, np.floating
)

assert edata.obs.shape == orig_obs_shape
```
Collaborator

obs and var should not be affected by 3D operations, so this does not need to be tested here. Removing it improves the focus on the 3D-relevant checks.

```python
assert edata.obs.shape == orig_obs_shape
assert edata.var.shape[0] == orig_var_shape[0]

edata.layers["test_isolated_layer"] = layer_original.copy() * 2 + 5
```
Collaborator

Again, this is not specific to 3D normalization; that the layer argument is respected does not need to be tested here again.

assert "normalization" in edata.uns
assert len(edata.uns["normalization"]) > 0

edata_invalid = edata_blobs_timeseries_small.copy()
Collaborator

This whole test of edata_invalid for simply a non-existent var can be removed; variable lookup is handled the same for 2D and 3D in our code, so this test does not need to check it again and won't improve the test coverage.

```python
edata_select.layers[DEFAULT_TEM_LAYER_NAME][:, 2, :], layer_before_select[:, 2, :], equal_nan=True
)

edata_copy = edata_blobs_timeseries_small.copy()
```
Collaborator

There are so many copies that I have trouble keeping track :)

If you consider the simple_impute tests, a few lines that are quickly understood are preferable to concatenations of many copies of an ehrdata object chaining tests together.

Collaborator Author

You're right, this function got way out of hand; I'll rewrite it based on simple_impute.

…ementedError for dask arrays in group wise functions. added test_norm_group_3D that also actually verifies that the data has been changed by normalization
```python
group_a_flat = group_a_data.flatten()[~np.isnan(group_a_data.flatten())]
group_b_flat = group_b_data.flatten()[~np.isnan(group_b_data.flatten())]

if len(group_a_flat) > 0 and len(group_b_flat) > 0:
```
Member

Nit: You could use the new cool match case syntax here.
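A sketch of what that could look like (purely illustrative):

```python
match (len(group_a_flat), len(group_b_flat)):
    case (0, _) | (_, 0):
        pass  # nothing to compare if either group is empty
    case _:
        ...   # compare the normalized groups
```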

Collaborator

I have a different question: why even check for this? Isn't the group size fixed, and even with a few NaNs it will never be 0 for a group?

@eroell eroell (Collaborator) left a comment

This has improved now: The function calls seem to do their job, and from what I see internally, dask never computes the full data.

I'll try to stop being picky :) But there are a few things I spotted that should be improved before we can merge this.

"Normalization did not modify the data - check that feature types are set correctly"
)

layer_data = edata.layers[DEFAULT_TEM_LAYER_NAME]
Collaborator

Why do a second compute here? This is just the layer_after variable again.


Successfully merging this pull request may close these issues: Longitudinal normalization.