I open this issue to investigate a performance problem that might be related to #17230.
I adapted the reproducer of #17230 to display more info and make it work on a medium-size random dataset.
```python
from sklearn import cluster
from time import time
from pprint import pprint
from threadpoolctl import threadpool_info
import numpy as np

# Show which BLAS / OpenMP runtimes are loaded and their thread counts.
pprint(threadpool_info())

rng = np.random.RandomState(0)
data = rng.randn(5000, 50)

t0_global = time()
for k in range(1, 15):
    t0 = time()
    # print(f"Running k-means with k={k}: ", end="", flush=True)
    cluster.KMeans(
        n_clusters=k,
        random_state=42,
        n_init=10,
        max_iter=2000,
        algorithm='lloyd',
        init='k-means++').fit(data)
    # print(f"{time() - t0:.3f} s")
print(f"Total duration: {time() - t0_global:.3f} s")
```
I ran this on Linux with scikit-learn master (therefore including the #16499 fix), with 2 different builds of scipy (OpenBLAS from PyPI and MKL from Anaconda) and various values for OMP_NUM_THREADS (unset, OMP_NUM_THREADS=1, OMP_NUM_THREADS=2, OMP_NUM_THREADS=4), on a laptop with 2 physical CPU cores (4 logical CPUs).
In both cases, I use the same scikit-learn binaries (built with GCC in editable mode). I just change the env.
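For anyone who wants to repeat the comparison, here is a minimal sketch of one way to launch the same command under each OMP_NUM_THREADS setting. The helper name `run_with_omp_threads` and the `ECHO` stand-in command are hypothetical and were not part of my runs; in practice the child command would be the reproducer script above, saved to a file.

```python
import os
import subprocess
import sys


def run_with_omp_threads(python_args, omp_num_threads=None):
    """Run `python <python_args>` in a child process.

    OMP_NUM_THREADS is set to the given value, or removed from the
    child's environment when omp_num_threads is None.
    """
    env = dict(os.environ)
    if omp_num_threads is None:
        env.pop("OMP_NUM_THREADS", None)
    else:
        env["OMP_NUM_THREADS"] = str(omp_num_threads)
    result = subprocess.run(
        [sys.executable, *python_args],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical stand-in for the reproducer: it only echoes the setting
# that the child process sees.
ECHO = ["-c", "import os; print(os.environ.get('OMP_NUM_THREADS', 'unset'))"]

for n_threads in (None, 1, 2, 4):
    print(run_with_omp_threads(ECHO, n_threads).strip())
```

Using a fresh child process per setting matters because the thread pools typically read OMP_NUM_THREADS when the libraries are loaded, so changing it inside an already running interpreter may have no effect.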
The summary is:
- with MKL there is no problem: large or unset values of OMP_NUM_THREADS are faster than OMP_NUM_THREADS=1;
- with OpenBLAS, leaving OMP_NUM_THREADS unset or setting it to a large value is significantly slower than a forced sequential run with OMP_NUM_THREADS=1.
I will include my runs in the first comment.
/cc @jeremiedbb