KMeans(init='k-means++') performance issue with OpenBLAS

I am opening this issue to investigate a performance problem that might be related to #17230.

I adapted the reproducer of #17230 to display more info and make it work on a medium-size random dataset.

```python
from sklearn import cluster
from time import time
from pprint import pprint
from threadpoolctl import threadpool_info
import numpy as np


pprint(threadpool_info())
rng = np.random.RandomState(0)
data = rng.randn(5000, 50)
t0_global = time()
for k in range(1, 15):
    t0 = time()
    # print(f"Running k-means with k={k}: ", end="", flush=True)
    cluster.KMeans(
        n_clusters=k,
        random_state=42,
        n_init=10,
        max_iter=2000,
        algorithm='lloyd',
        init='k-means++').fit(data)
    # print(f"{time() - t0:.3f} s")

print(f"Total duration: {time() - t0_global:.3f} s")
```

I ran this on Linux with scikit-learn master (which therefore includes the #16499 fix), with 2 different builds of scipy (one with OpenBLAS from PyPI, one with MKL from Anaconda) and various values of `OMP_NUM_THREADS` (unset, `OMP_NUM_THREADS=1`, `OMP_NUM_THREADS=2`, `OMP_NUM_THREADS=4`), on a laptop with 2 physical CPU cores (4 logical CPUs).

In both cases, I use the same scikit-learn binaries (built with GCC in editable mode). I just change the env.
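For anyone who wants to repeat the comparison, here is a minimal sketch of how the different `OMP_NUM_THREADS` settings could be driven from one script. It assumes the reproducer above is saved under a file name of your choice (`bench_kmeans.py` below is only a placeholder); `make_env` and `run_reproducer` are hypothetical helper names, not part of scikit-learn or threadpoolctl.

```python
import os
import subprocess
import sys


def make_env(omp_num_threads=None, base_env=None):
    """Return a copy of the environment with OMP_NUM_THREADS set or removed.

    omp_num_threads=None means "leave the variable unset".
    """
    env = dict(os.environ if base_env is None else base_env)
    # Always start from a clean state so an inherited value cannot leak in.
    env.pop("OMP_NUM_THREADS", None)
    if omp_num_threads is not None:
        env["OMP_NUM_THREADS"] = str(omp_num_threads)
    return env


def run_reproducer(script_path, omp_num_threads=None):
    """Run the reproducer in a fresh interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, script_path],
        env=make_env(omp_num_threads),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

A loop such as `for n in (None, 1, 2, 4): print(n, run_reproducer("bench_kmeans.py", n))` would then cover the four settings listed above. A fresh process is used for each setting because the thread pools read the variable when the libraries are loaded.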

The summary is:

- with MKL there is no problem: large or unset values of `OMP_NUM_THREADS` are faster than `OMP_NUM_THREADS=1`;
- with OpenBLAS, leaving `OMP_NUM_THREADS` unset or setting it to a large value is significantly slower than a forced sequential run with `OMP_NUM_THREADS=1`.

I will include my runs in the first comment. 

/cc @jeremiedbb 

KMeans(init='k-means++') performance issue with OpenBLAS #17334
