In the dot-calling procedure there is a step where we extract significantly enriched pixels based on their observed values and a set of thresholds.
These thresholds are defined for groups of pixels based on some score (locally adjusted expected).
Turns out, assigning these thresholds to pixels (based on that score) isn't trivial ...
IntervalIndex from pandas is a perfect tool for the job, yet it is much slower than the pd.cut-based alternative.
Here is a minimal self-contained example.
... let's set up some data:
from numpy.random import random
import numpy as np
import pandas as pd
npixels = 20_000
maxval = 99
# fake pixels with fake counts and scores
scored_pixels_df = pd.DataFrame({
    "count": (maxval * random(npixels)).astype(int),  # not used
    "score": maxval * random(npixels),
})
# fake score-intervals aka "lambda-bins"
nbins = 6
bin_edges = np.r_[-np.inf, np.linspace(0, maxval, nbins), np.inf]
# fake thresholds for "lambda-bins"
thresholds = pd.Series(
    data=np.linspace(0, maxval, nbins + 1),
    index=pd.IntervalIndex.from_breaks(bin_edges),
)
After this setup, on to the operation in question: how do you assign thresholds to scored_pixels_df according to the lambda-bin that each scored_pixels_df["score"] belongs to? Two options:
- elegant, but slow: using the power of pd.IntervalIndex, where thresholds.loc will associate each score with the appropriate interval!
%%timeit
thresh_slow = thresholds.loc[scored_pixels_df["score"]]
# ~30ms on my machine
- less elegant, but fast: using pd.cut (searchsorted inside?), where we first extract the indices of the lambda-bins that each scored_pixels_df["score"] belongs to, and then use thresholds.iloc to extract the appropriate thresholds:
%%timeit
lbins_idxs = pd.cut(scored_pixels_df["score"], bin_edges, labels=False)
thresh_fast = thresholds.iloc[lbins_idxs]
# ~1ms on my machine
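For reference, the same bin lookup can be sketched directly with np.searchsorted. This is only a guess at what a searchsorted-based lookup could look like, not a claim about how pd.cut is implemented. It assumes right-closed intervals, which is the default for both pd.IntervalIndex.from_breaks and pd.cut. The fixed seed and the names lbins_idxs_np, thresh_np and lbins_idxs_cut are made up for this sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for this sketch
npixels = 20_000
maxval = 99
nbins = 6

# same kind of fake scores, lambda-bin edges and thresholds as in the setup
scores = maxval * rng.random(npixels)
bin_edges = np.r_[-np.inf, np.linspace(0, maxval, nbins), np.inf]
thresholds = pd.Series(
    data=np.linspace(0, maxval, nbins + 1),
    index=pd.IntervalIndex.from_breaks(bin_edges),
)

# for right-closed intervals (a, b], side="left" finds the first edge >= score;
# the interval that ends at that edge sits one position earlier
lbins_idxs_np = np.searchsorted(bin_edges, scores, side="left") - 1
thresh_np = thresholds.to_numpy()[lbins_idxs_np]

# cross-check the bin positions against the pd.cut-based lookup
lbins_idxs_cut = pd.cut(scores, bin_edges, labels=False)
```

Note that thresh_np is a plain NumPy array here, so the interval labels that thresholds.iloc would carry along are dropped.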
...
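make sure the results are identical (variables assigned inside a %%timeit cell are not kept, so re-run both assignments without %%timeit first):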
assert (thresh_slow == thresh_fast).all()
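Outside IPython, where %%timeit is not available, the comparison can be reproduced with the standard timeit module. This is a minimal sketch under the same fake setup; the fixed seed, the helper names assign_with_loc and assign_with_cut, and the choice of 10 repetitions are assumptions made for illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for this sketch
npixels = 20_000
maxval = 99
nbins = 6

scored_pixels_df = pd.DataFrame({"score": maxval * rng.random(npixels)})
bin_edges = np.r_[-np.inf, np.linspace(0, maxval, nbins), np.inf]
thresholds = pd.Series(
    data=np.linspace(0, maxval, nbins + 1),
    index=pd.IntervalIndex.from_breaks(bin_edges),
)


def assign_with_loc():
    # label-based lookup through the IntervalIndex
    return thresholds.loc[scored_pixels_df["score"]]


def assign_with_cut():
    # positional lookup: bin positions from pd.cut, then iloc
    lbins_idxs = pd.cut(scored_pixels_df["score"], bin_edges, labels=False)
    return thresholds.iloc[lbins_idxs]


thresh_slow = assign_with_loc()
thresh_fast = assign_with_cut()
# compare the raw values, which sidesteps any index-alignment concerns
same_values = np.array_equal(thresh_slow.to_numpy(), thresh_fast.to_numpy())

nruns = 10
t_loc = timeit.timeit(assign_with_loc, number=nruns) / nruns
t_cut = timeit.timeit(assign_with_cut, number=nruns) / nruns
print(f"loc: {t_loc * 1e3:.2f} ms, cut: {t_cut * 1e3:.2f} ms, identical: {same_values}")
```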
Does anyone have thoughts on how to make (1), the IntervalIndex lookup, work as fast as (2), the pd.cut version? Why is (1) so slow? Are there faster or more elegant solutions?