
Conversation

@agerardy agerardy (Collaborator) commented Oct 7, 2025

PR Checklist

  • This comment contains a description of changes (with reason)
  • Referenced issue is linked
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Description of changes
#944
This PR implements normalization support for 3D EHRData objects. The implementation enables all existing normalization functions to work with longitudinal data of shape (n_obs, n_var, n_timestamps) while maintaining backward compatibility with 2D data.

Technical details
Treats .R as a named layer with 3D structure. Uses helper functions (_get_target_layer, _set_target_layer, _normalize_3d_data, and _normalize_2d_data) to avoid code duplication.
Each variable is processed independently by flattening the time dimension into (n_obs x n_timestamps) values, applying the sklearn normalization function, and reshaping back to 3D.
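A minimal sketch of that per-variable idea, assuming plain numpy input (the helper name and exact signature here are illustrative, not the PR's actual code):

```python
import numpy as np
from sklearn.preprocessing import scale

def _normalize_3d_sketch(X: np.ndarray, norm_func=scale) -> np.ndarray:
    """Apply a 2D sklearn normalizer per variable of an (n_obs, n_var, n_timestamps) array."""
    n_obs, n_var, n_t = X.shape
    out = np.empty_like(X, dtype=float)
    for v in range(n_var):
        # flatten obs x time into a single column, normalize, reshape back to 3D
        flat = X[:, v, :].reshape(n_obs * n_t, 1)
        out[:, v, :] = norm_func(flat).reshape(n_obs, n_t)
    return out
```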

Added tests for the new functions, including group functionality and NaN cases.

Examples:

```python
import ehrdata as ed
import ehrapy as ep

edata = ed.dt.ehrdata_blobs(
    n_observations=100,
    base_timepoints=24,
    cluster_std=0.5,
    n_centers=3,
    seasonality=True,
    time_shifts=True,
    variable_length=False,
)

# standard scaling
ep.pp.scale_norm(edata)

# log transformation (offset negative values first so the log is defined)
ep.pp.offset_negative_values(edata)
ep.pp.log_norm(edata)
```

@agerardy agerardy linked an issue Oct 7, 2025 that may be closed by this pull request
@Zethson Zethson mentioned this pull request Oct 16, 2025
@agerardy agerardy marked this pull request as ready for review October 20, 2025 10:16
@agerardy agerardy requested a review from Zethson October 20, 2025 10:17
@Zethson Zethson (Member) left a comment

Thank you! Already looks pretty good.

  1. Many of my comments are repetitive so I stopped repeating them after some time 😄
  2. Many of your tests have tons of useless comments. Let the code speak for itself and clean up any LLM leftovers, please.
  3. Please also follow the comments that I make in Öyku's PRs. One of them is to improve the PR description and add some usage examples.

Just a first quick pass. I'll let @eroell have a go and then I might have a look again.

Thanks!

@agerardy agerardy requested a review from sueoglu November 26, 2025 17:09
@eroell eroell (Collaborator) left a comment

Dropped a first intermediate review already, to be considered together with @sueoglu's :)

…e for more complicated functions that expect certain outcomes. removed unnecessary docstrings
… though. maxabs_norm and power_norm now advise the user about not using dask arrays and correctly raise a NotImplementedError if still used. log_norm now also uses the new decorator
…nt about necessary raising of NotImplementedError, moved basic tests down to precise tests, removed docstrings
@agerardy agerardy requested a review from eroell December 3, 2025 13:01

```python
if group_key is None:
    var_values = scale_func(var_values)
# group wise normalization doesnt work with dask arrays
```
@eroell eroell (Collaborator) commented Dec 3, 2025

Why does group-wise normalization not work with dask arrays? I see no part in the group computations that cannot be computed lazily. Next to unlocking a feature here, the whole testing logic would become much easier once this is enabled, too.

Can you please check, or show here with a small example, what is not working after fixing the currently broken group-wise normalization?

Collaborator Author

dask doesn't support in-place assignment with fancy indexing, so in line 76, for example, X[np.ix_(group_mask, var_indices)] = scale_func(X[np.ix_(group_mask, var_indices)]) doesn't work; same for line 80. The most reasonable approach I've found is to convert to numpy arrays, normalize, and convert back. The dask-native approaches look rather complex. Is that a reasonable approach, or should we preserve the dask arrays at all costs?
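A minimal repro of the failing pattern described above (the shapes here are arbitrary):

```python
import dask.array as da
import numpy as np

X = da.ones((4, 3), chunks=2)
group_mask = np.array([True, False, True, False])
var_indices = np.array([0, 2])

# numpy assigns in place here; dask's __setitem__ does not support this
# kind of outer (fancy) indexed assignment and raises instead:
X[np.ix_(group_mask, var_indices)] = 0.0
```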

Member

Any time you convert to numpy you are materializing (and loading) the whole array at once. Therefore, this must be avoided.
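One dask-native possibility that sidesteps __setitem__ altogether, as a sketch (this is not what the PR implements): compute the group statistics lazily and recombine with da.where, so nothing is materialized until the caller computes.

```python
import dask.array as da
import numpy as np

X = da.random.random((100, 5), chunks=(50, 5))
group_mask = np.zeros(100, dtype=bool)
group_mask[:50] = True

rows = X[group_mask]                                 # still lazy
scaled = (X - rows.mean(axis=0)) / rows.std(axis=0)  # group statistics, still lazy
X = da.where(group_mask[:, None], scaled, X)         # recombine without assignment
```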

@eroell eroell (Collaborator) left a comment

The test for 3D is very complex and tests things that are not 3D-specific; kick out anything that is not 3D-specific to make it more similar in size to the simple_impute test.

I had to dig quite a while to check some fundamental behaviors. And the group_key argument for 3D seems not to work at all:

This looks fine:

```python
edata = ed.dt.ehrdata_blobs(layer="tem_data")
print(f"{edata.layers['tem_data'].mean():.2f}")
print(f"{edata.layers['tem_data'].std():.2f}")
ep.pp.scale_norm(edata, layer="tem_data")
print(f"{edata.layers['tem_data'].mean():.2f}")
print(f"{edata.layers['tem_data'].std():.2f}")
```

```
0.61
5.88
! Feature  was detected as categorical features stored numerically.Please verify and adjust if necessary using `ed.replace_feature_types`.
! Feature types were inferred and stored in edata.var[feature_type]. Please verify using `ehrdata.feature_type_overview` and adjust if necessary using `ehrdata.replace_feature_types`.
-0.00
1.00
```

With groupby, the overall mean and std might not be exactly 0 or 1 as above. But currently, the input is not modified at all:

```python
edata = ed.dt.ehrdata_blobs(layer="tem_data")
print(f"{edata.layers['tem_data'].mean():.2f}")
print(f"{edata.layers['tem_data'].std():.2f}")
ep.pp.scale_norm(edata, layer="tem_data", group_key="cluster")
print(f"{edata.layers['tem_data'].mean():.2f}")
print(f"{edata.layers['tem_data'].std():.2f}")
```

```
0.61
5.88
! Feature  was detected as categorical features stored numerically.Please verify and adjust if necessary using `ed.replace_feature_types`.
! Feature types were inferred and stored in edata.var[feature_type]. Please verify using `ehrdata.feature_type_overview` and adjust if necessary using `ehrdata.replace_feature_types`.
0.61
5.88
```

Simplified tests focusing on the most important parts are really needed here.

```python
edata_select.layers[DEFAULT_TEM_LAYER_NAME][:, 1, :], layer_before_select[:, 1, :], equal_nan=True
)

if edata_select.layers[DEFAULT_TEM_LAYER_NAME].shape[1] > 2:
```
Collaborator

The exact same condition is already checked before entering this branch, on line 739.

```python
edata.layers[DEFAULT_TEM_LAYER_NAME].dtype, np.floating
)

assert edata.obs.shape == orig_obs_shape
```
Collaborator

obs and var should not be affected by 3D operations, so this does not need to be tested here. Removing it improves the focus on the 3D-relevant checks.

```python
assert edata.obs.shape == orig_obs_shape
assert edata.var.shape[0] == orig_var_shape[0]

edata.layers["test_isolated_layer"] = layer_original.copy() * 2 + 5
```
Collaborator

Again, this is not specific to 3D normalization; that the layer argument is respected does not need to be tested here again.

assert "normalization" in edata.uns
assert len(edata.uns["normalization"]) > 0

edata_invalid = edata_blobs_timeseries_small.copy()
Collaborator

This whole test of edata_invalid for simply a non-existent var can be removed; variable lookup is handled the same for 2D and 3D in our code, so this test does not need to check it again and won't improve the test coverage.

```python
edata_select.layers[DEFAULT_TEM_LAYER_NAME][:, 2, :], layer_before_select[:, 2, :], equal_nan=True
)

edata_copy = edata_blobs_timeseries_small.copy()
```
Collaborator

There are so many copies that I have trouble keeping track :)

If you consider the simple_impute tests, a few lines that are quickly understood are preferable to concatenations of many copies of an ehrdata object chaining tests together.

Collaborator Author

You're right, this function got way out of hand; I'll rewrite it based on simple_impute.

…ementedError for dask arrays in group wise functions. added test_norm_group_3D that also actually verifies that the data has been changed by normalization
```python
group_a_flat = group_a_data.flatten()[~np.isnan(group_a_data.flatten())]
group_b_flat = group_b_data.flatten()[~np.isnan(group_b_data.flatten())]

if len(group_a_flat) > 0 and len(group_b_flat) > 0:
```
Member

Nit: You could use the new cool match case syntax here.
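A sketch of what that could look like (purely illustrative):

```python
match (len(group_a_flat), len(group_b_flat)):
    case (0, _) | (_, 0):
        pass  # nothing to compare if either group is empty
    case _:
        ...   # compare the normalized groups
```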

Collaborator

I have a different question: why even check for this? Isn't the group size fixed, and even with a few NaNs it will never be 0 for a group?

@eroell eroell (Collaborator) left a comment

This has improved now: The function calls seem to do their job, and from what I see internally, dask never computes the full data.

I'll try to stop being picky :) But there are a few things I spotted that should be improved before we can merge this.

"Normalization did not modify the data - check that feature types are set correctly"
)

layer_data = edata.layers[DEFAULT_TEM_LAYER_NAME]
Collaborator

Why do a second compute here? This is just the layer_after variable again.


Successfully merging this pull request may close these issues: Longitudinal normalization.