@MohammadMdv

Summary

This PR fixes an IndexError that occurs in the HBOS algorithm when using n_bins='auto' and test data contains values outside the training data range.

Fixes: #643

Problem

When using HBOS with automatic bin selection (n_bins='auto'), the model crashes with an IndexError during prediction if test data contains values that exceed the training data range for any feature.

Error Traceback

IndexError: index 147 is out of bounds for axis 0 with size 147
  File "pyod/models/hbos.py", line 274, in _calculate_outlier_scores_auto
    outlier_scores[j, i] = out_score_i[bin_inds[j] - 1]

Minimal Reproduction

from pyod.models.hbos import HBOS
import numpy as np

# Training data: range [0, 10]
X_train = np.random.randn(100, 5) * 2 + 5
X_train = np.clip(X_train, 0, 10)

model = HBOS(n_bins='auto')
model.fit(X_train)

# Test data with value exceeding training range
X_test = np.array([[5, 5, 15, 5, 5]])  # Feature 2 = 15 > 10

predictions = model.predict(X_test)  # ❌ IndexError!

Root Cause

The _calculate_outlier_scores_auto function was recalculating the optimal number of bins on the test data (using get_optimal_n_bins(X[:, i])), while using histograms and bin edges computed from the training data.
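To illustrate why recomputing the bin count at prediction time is fragile, here is a minimal sketch. It uses NumPy's Sturges rule as a stand-in for get_optimal_n_bins (an assumption: the pyod helper uses its own estimator), and the array sizes are made up. Any data-driven rule shares the property shown: the count depends on the array it is given, so a test batch rarely reproduces the training count.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.normal(5.0, 2.0, size=1000)  # hypothetical training feature
x_test = rng.normal(5.0, 2.0, size=20)     # hypothetical, much smaller test batch

# Stand-in for a data-driven bin selector (NOT pyod's get_optimal_n_bins).
n_bins_train = len(np.histogram_bin_edges(x_train, bins="sturges")) - 1
n_bins_test = len(np.histogram_bin_edges(x_test, bins="sturges")) - 1

# The two counts differ because the rule depends on the sample size.
mismatch = n_bins_train != n_bins_test
print(n_bins_train, n_bins_test, mismatch)
```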

When a test value exceeds the training range:

  1. np.digitize returns an index equal to len(bin_edges[i]) (i.e., n_bins_train + 1)
  2. The boundary check bin_inds[j] == optimal_n_bins + 1 fails because optimal_n_bins (from test data) ≠ n_bins_train
  3. Code falls through to: outlier_scores[j, i] = out_score_i[bin_inds[j] - 1]
  4. This attempts to access out_score_i[n_bins_train] which is out of bounds
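The steps above can be reproduced with NumPy alone. This is a minimal sketch, not the pyod code: the training values, the bin count of 4, the alpha of 0.1, and the right=True argument are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical training feature; 4 bins produce 5 bin edges.
train = np.array([0.0, 2.5, 5.0, 7.5, 10.0])
hist, bin_edges = np.histogram(train, bins=4, density=True)
out_score = np.log2(hist + 0.1)  # one score per bin, so length 4

test = np.array([5.0, 15.0])     # 15.0 exceeds the training maximum
bin_inds = np.digitize(test, bin_edges, right=True)

n_bins_train = hist.shape[0]
overflow_ind = int(bin_inds[1])  # index assigned to the out-of-range value

# If the boundary check compares against the wrong bin count, the code
# falls through to this lookup, which reads one past the end of out_score.
try:
    out_score[overflow_ind - 1]
    raised = False
except IndexError:
    raised = True

print(overflow_ind, len(bin_edges), n_bins_train + 1, raised)
```

The out-of-range value receives an index equal to the number of bin edges, so subtracting 1 still points past the last histogram bin.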

Solution

Changed line 233 in _calculate_outlier_scores_auto to use the training histogram size:

# Before:
optimal_n_bins = get_optimal_n_bins(X[:, i])  # ❌ Recalculates on test data

# After:
optimal_n_bins = hist[i].shape[0]  # ✅ Uses training histogram size

This ensures consistency between the bin edges (from training) and the bin count used for boundary checks.
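As a rough sketch of the corrected logic for a single feature: the function below is hypothetical and simplified, not the actual _calculate_outlier_scores_auto. In particular, clamping out-of-range values to the nearest edge bin is an assumption made here for brevity; the out-of-range policy of the real function is not shown in this PR. The point is only that the bin count comes from the training histogram.

```python
import numpy as np

def score_feature(x_test, bin_edges, hist, alpha=0.1):
    """Hypothetical single-feature scorer (not the pyod implementation)."""
    out_score = np.log2(hist + alpha)
    n_bins = hist.shape[0]  # bin count fixed at fit time, as in the fix
    bin_inds = np.digitize(x_test, bin_edges, right=True)
    scores = np.empty(len(x_test))
    for j, ind in enumerate(bin_inds):
        if ind == 0:
            # At or below the first edge: reuse the first bin (assumption).
            scores[j] = out_score[0]
        elif ind == n_bins + 1:
            # Above the last edge: reuse the last bin (assumption).
            scores[j] = out_score[n_bins - 1]
        else:
            scores[j] = out_score[ind - 1]
    return scores

rng = np.random.default_rng(0)
train = rng.uniform(0.0, 10.0, size=200)   # made-up training feature
hist, bin_edges = np.histogram(train, bins=8)
scores = score_feature(np.array([-3.0, 5.0, 15.0]), bin_edges, hist)
print(scores)
```

Because n_bins is read from hist, the check ind == n_bins + 1 always matches the index that np.digitize assigns to values above the last training edge.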

Changes

Modified Files

  • pyod/models/hbos.py: Fixed _calculate_outlier_scores_auto function (1 line)
  • pyod/test/test_hbos_auto_bins_fix.py: Added comprehensive test suite (new file)

Diff

@@ -230,7 +230,8 @@ def _calculate_outlier_scores_auto(X, bin_edges, hist, alpha,
         # Add a regularizer for preventing overflow
         out_score_i = np.log2(hist[i] + alpha)
 
-        optimal_n_bins = get_optimal_n_bins(X[:, i])
+        # Use the number of bins determined during fit (training)
+        optimal_n_bins = hist[i].shape[0]
 
         for j in range(n_samples):

Testing

Added comprehensive test suite (test_hbos_auto_bins_fix.py) that verifies:

  • ✅ Test data with values outside training range
  • ✅ All test values above training range
  • ✅ All test values below training range
  • ✅ Mixed in-range and out-of-range values
  • ✅ Consistency with static bins behavior

Test Results

✓ ALL TESTS PASSED!
The fix correctly handles out-of-range test values.

Impact

Benefits

  • ✅ Fixes crash when test data exceeds training range
  • ✅ Maintains correct outlier detection behavior
  • ✅ Slight performance improvement (removes redundant get_optimal_n_bins call)
  • ✅ Aligns behavior with static bin version (_calculate_outlier_scores)

Backward Compatibility

  • ✅ No API changes
  • ✅ No breaking changes to existing functionality
  • ✅ Only fixes buggy edge case
  • ✅ Test data within training range behaves identically

Additional Context

This is a critical fix for production use cases where test/production data naturally contains values outside the training distribution - a common scenario in anomaly detection where anomalies often have extreme values. The static bin version (n_bins=<int>) handles this correctly, but the auto version was crashing.

…ning range

- Fixed bug where test values outside training range caused IndexError
- Changed _calculate_outlier_scores_auto to use training histogram size
  instead of recalculating optimal_n_bins on test data
- Added comprehensive test suite to verify the fix
- Fixes issue yzhao062#643
@yzhao062
Owner

This is great -- can you resubmit to the development branch? thank you

@coveralls

coveralls commented Oct 14, 2025

Pull Request Test Coverage Report for Build 18498171972

Details

  • 85 of 120 (70.83%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 95.093%

Changes Missing Coverage               Covered Lines   Changed/Added Lines   %
pyod/test/test_hbos_auto_bins_fix.py   85              120                   70.83%

Totals
Change from base Build 15575914138:    -0.3%
Covered Lines:                         10446
Relevant Lines:                        10985

💛 - Coveralls

@MohammadMdv
Author

This is great -- can you resubmit to the development branch? thank you

Of course
