pandas IntervalIndex: elegant but slow - alternatives?

In the dot-calling procedure there is a step where we extract significantly enriched pixels based on their observed values and a set of thresholds.
These thresholds are defined for groups of pixels based on some score (the locally adjusted expected).
Turns out, assigning these thresholds to pixels (based on that score) isn't trivial ...

`IntervalIndex` from pandas is a perfect tool for the job, yet it is much slower than the `pd.cut`-based alternative.
Here is a minimal self-contained example.
Let's set up some data:
```python
from numpy.random import random
import numpy as np
import pandas as pd

npixels = 20_000
maxval = 99

# fake pixels with fake counts and scores
scored_pixels_df = pd.DataFrame({
    "count" : (maxval * random(npixels)).astype(int), # not used
    "score" :  maxval * random(npixels)
})

# fake score-intervals aka "lambda-bins"
nbins = 6
bin_edges = np.r_[-np.inf, np.linspace(0, maxval, nbins), np.inf]

# fake thresholds for "lambda-bins"
thresholds = pd.Series(
    data=np.linspace(0, maxval, nbins+1),
    index=pd.IntervalIndex.from_breaks(bin_edges)
)
```

After this setup, on to the operation in question: how do you assign thresholds to `scored_pixels_df` according to the lambda-bin that each `scored_pixels_df["score"]` belongs to?

1. Elegant, but slow: using the power of `pd.IntervalIndex`, where `thresholds.loc` associates each score with the appropriate interval.
```python
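# timed with %%timeit in a separate cell
# (names assigned under %%timeit do not persist, so run this untimed as well)
thresh_slow = thresholds.loc[scored_pixels_df["score"]]
# ~30ms on my machine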
```
2. Less elegant, but fast: using `pd.cut` (searchsorted inside?), where we first extract the index of the lambda-bin that each `scored_pixels_df["score"]` belongs to, and then use `thresholds.iloc` to extract the appropriate thresholds:
```python
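# timed with %%timeit in a separate cell
# (names assigned under %%timeit do not persist, so run this untimed as well)
lbins_idxs = pd.cut(scored_pixels_df["score"], bin_edges, labels=False)
thresh_fast = thresholds.iloc[lbins_idxs]
# ~1ms on my machine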
```
Make sure the results are identical:
```python
assert (thresh_slow == thresh_fast).all()
```
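
For comparison, here is a sketch of a third variant that skips pandas for the lookup and calls `np.searchsorted` on the bin edges directly. It assumes right-closed intervals `(a, b]` (the default of both `pd.IntervalIndex.from_breaks` and `pd.cut`) and no NaN scores. Whether `pd.cut` actually does this internally is my guess, and I have not timed this variant. The names `scores`, `threshold_values` and `thresh_np` are made up for this sketch:
```python
import numpy as np
import pandas as pd

# same fake setup as above, but with a seeded generator and plain arrays
rng = np.random.default_rng(0)
npixels = 20_000
maxval = 99
nbins = 6

scores = maxval * rng.random(npixels)
bin_edges = np.r_[-np.inf, np.linspace(0, maxval, nbins), np.inf]
threshold_values = np.linspace(0, maxval, nbins + 1)

# side="left" returns i such that bin_edges[i-1] < x <= bin_edges[i],
# so i - 1 is the position of the right-closed interval containing x
lbin_idxs = np.searchsorted(bin_edges, scores, side="left") - 1
thresh_np = threshold_values[lbin_idxs]

# cross-check the bin assignment against pd.cut
cut_idxs = pd.cut(scores, bin_edges, labels=False)
assert (lbin_idxs == cut_idxs).all()
```
This returns a plain NumPy array of thresholds, so the interval labels that `thresholds.loc` and `thresholds.iloc` keep in the index are lost.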

Does anyone have thoughts on how to make (1) as fast as (2)? Why is (1) so slow? Are there faster or more elegant solutions?
