Fix IndexError in HBOS with n_bins='auto' when test data exceeds training range #644
Summary
This PR fixes an `IndexError` that occurs in the HBOS algorithm when using `n_bins='auto'` and test data contains values outside the training data range.

Fixes: #643
Problem
When using HBOS with automatic bin selection (`n_bins='auto'`), the model crashes with an `IndexError` during prediction if test data contains values that exceed the training data range for any feature.

Error Traceback
Minimal Reproduction
Root Cause
The `_calculate_outlier_scores_auto` function was recalculating the optimal number of bins on the test data (using `get_optimal_n_bins(X[:, i])`), while using histograms and bin edges computed from the training data.

When a test value exceeds the training range:

1. `np.digitize` returns an index equal to `len(bin_edges[i])` (i.e., `n_bins_train + 1`)
2. The check `bin_inds[j] == optimal_n_bins + 1` fails because `optimal_n_bins` (from test data) ≠ `n_bins_train`
3. Execution falls through to `outlier_scores[j, i] = out_score_i[bin_inds[j] - 1]`
4. That reads `out_score_i[n_bins_train]`, which is out of bounds

Solution
Changed line 233 in `_calculate_outlier_scores_auto` to use the training histogram size instead of a bin count recomputed from the test data. This ensures consistency between the bin edges (from training) and the bin count used for boundary checks.
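The mechanism described above can be illustrated with NumPy alone. The sketch below is a simplified stand-in, not PyOD's actual code: `score_feature` and its `n_bins_check` parameter are hypothetical names, the tolerance handling is omitted, and out-of-range values are assumed to take the score of the nearest edge bin. It only shows why the boundary check has to use the training bin count.

```python
import numpy as np


def score_feature(x_test, bin_edges, hist, n_bins_check, alpha=0.1):
    """Score one feature against a training histogram.

    Hypothetical, simplified helper. `n_bins_check` is the bin count used
    in the upper-boundary check, passed in so both variants can be compared.
    """
    out_score = -np.log2(hist + alpha)  # one score per training bin
    # Values above the last edge get index len(bin_edges) == len(hist) + 1.
    bin_inds = np.digitize(x_test, bin_edges, right=True)
    scores = np.empty(len(x_test))
    for j, b in enumerate(bin_inds):
        if b == 0:
            scores[j] = out_score[0]       # below the training range
        elif b == n_bins_check + 1:
            scores[j] = out_score[-1]      # above the training range
        else:
            scores[j] = out_score[b - 1]   # IndexError if the check above missed
    return scores


x_train = np.linspace(0.0, 1.0, 101)
hist, bin_edges = np.histogram(x_train, bins=5, density=True)
x_test = np.array([0.5, 2.0])  # 2.0 lies outside the training range

# Mismatched bin count (as if re-derived from test data): the check misses.
try:
    score_feature(x_test, bin_edges, hist, n_bins_check=3)
    buggy_raised = False
except IndexError:
    buggy_raised = True

# Bin count taken from the training histogram: the check catches the value.
fixed_scores = score_feature(x_test, bin_edges, hist, n_bins_check=len(hist))
```

With the mismatched count, the out-of-range value reaches `out_score[b - 1]` with `b - 1 == len(hist)` and raises. With `len(hist)` the same value is caught by the boundary branch.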
Changes
Modified Files
- `pyod/models/hbos.py`: Fixed `_calculate_outlier_scores_auto` function (1 line)
- `pyod/test/test_hbos_auto_bins_fix.py`: Added comprehensive test suite (new file)

Diff
Testing
Added comprehensive test suite (`test_hbos_auto_bins_fix.py`) that verifies:

Test Results
Impact
Benefits
- Removes the redundant `get_optimal_n_bins` call at prediction time
- Aligns the auto path with the static-bin path (`_calculate_outlier_scores`)

Backward Compatibility
Checklist
Additional Context
This fix matters for production use, where test data routinely contains values outside the training distribution. That is common in anomaly detection, since anomalies often have extreme values. The static bin version (`n_bins=<int>`) handles this correctly, but the auto version was crashing.