pandas IntervalIndex elegant but slow - alternatives ? #343

@sergpolly

In the dot-calling procedure there is a step where we extract significantly enriched pixels based on their observed values and a set of thresholds.
These thresholds are defined for groups of pixels based on some score (the locally adjusted expected).
It turns out that assigning these thresholds to pixels (based on that score) isn't trivial ...

IntervalIndex from pandas is a perfect tool for the job, yet it is much slower than the pd.cut-based alternative.
Here is a minimal self-contained example.
Let's set up some data:

from numpy.random import random
import numpy as np
import pandas as pd

npixels = 20_000
maxval = 99

# fake pixels with fake counts and scores
scored_pixels_df = pd.DataFrame({
    "count" : (maxval * random(npixels)).astype(int), # not used
    "score" :  maxval * random(npixels)
})

# fake score-intervals aka "lambda-bins"
nbins = 6
bin_edges = np.r_[-np.inf, np.linspace(0, maxval, nbins), np.inf]

# fake thresholds for "lambda-bins"
thresholds = pd.Series(
    data=np.linspace(0, maxval, nbins+1),
    index=pd.IntervalIndex.from_breaks(bin_edges)
)

After this setup, here is the operation in question: how do you assign thresholds to scored_pixels_df according to the lambda-bin that each scored_pixels_df["score"] belongs to?

  1. Elegant but slow: using the power of pd.IntervalIndex, where thresholds.loc associates each score with the appropriate interval:
%%timeit
thresh_slow = thresholds.loc[scored_pixels_df["score"]]
# ~30ms on my machine
  2. Less elegant but fast: using pd.cut (np.searchsorted inside?), where we first extract the index of the lambda-bin that each scored_pixels_df["score"] belongs to, and then use thresholds.iloc to extract the appropriate thresholds:
%%timeit
lbins_idxs = pd.cut(scored_pixels_df["score"], bin_edges, labels=False)
thresh_fast = thresholds.iloc[lbins_idxs]
# ~1ms on my machine

...
Make sure the results are identical:

assert (thresh_slow == thresh_fast).all()
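
If pd.cut is indeed doing a binary search under the hood, the same bin indices can be obtained by calling np.searchsorted on the edges directly. The following is only a sketch under a few assumptions: the intervals are right-closed (the default for both pd.cut and IntervalIndex.from_breaks), all scores are finite and non-NaN, and the seeded generator `rng` and the names `idxs` and `thresh_np` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# same fake setup as above, but seeded so the example is reproducible
rng = np.random.default_rng(0)
npixels, maxval, nbins = 20_000, 99, 6
scores = pd.Series(maxval * rng.random(npixels), name="score")

bin_edges = np.r_[-np.inf, np.linspace(0, maxval, nbins), np.inf]
thresholds = pd.Series(
    data=np.linspace(0, maxval, nbins + 1),
    index=pd.IntervalIndex.from_breaks(bin_edges),
)

# For right-closed bins (edge[i], edge[i+1]], side="left" returns i+1,
# so subtracting 1 gives the position of the bin containing each score.
idxs = np.searchsorted(bin_edges, scores.to_numpy(), side="left") - 1
thresh_np = thresholds.iloc[idxs]

# cross-check against the pd.cut-based bin indices
ref_idxs = pd.cut(scores, bin_edges, labels=False)
assert (idxs == ref_idxs.to_numpy()).all()
```

Unlike pd.cut, this skips validation: a NaN or -inf score would produce an out-of-range or negative position, so it only makes sense when the scores are known to be clean.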

Does anyone have thoughts on how to make (1) work as fast as (2)? Why is (1) so slow? Are there faster or more elegant solutions?
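
One candidate that keeps the IntervalIndex in the picture is to ask the index for integer positions with get_indexer and then use positional lookup. This is a sketch on a tiny hand-picked input, not a benchmark; whether it closes the gap to pd.cut would need to be timed on the real data. The example scores and the names `idxs` and `thresh` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

maxval, nbins = 99, 6
bin_edges = np.r_[-np.inf, np.linspace(0, maxval, nbins), np.inf]
thresholds = pd.Series(
    data=np.linspace(0, maxval, nbins + 1),
    index=pd.IntervalIndex.from_breaks(bin_edges),
)

# a handful of scores that sit clearly inside different lambda-bins
scores = pd.Series([0.0, 5.5, 25.0, 42.0, 98.9], name="score")

# get_indexer returns the integer position of the interval containing
# each value, or -1 when no interval matches
idxs = thresholds.index.get_indexer(scores)
assert (idxs >= 0).all(), "every score should fall into some lambda-bin"

# positional lookup, same as in the pd.cut-based variant
thresh = thresholds.iloc[idxs]
```

The -1 sentinel matters here: passed to iloc it would silently select the last bin, hence the explicit check before the lookup.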
