Skip to content

Add only_performance_cores in cpu_count as a boolean option#630

Draft
Copilot wants to merge 34 commits into
masterfrom
copilot/implement-system-command-inspection
Draft

Add only_performance_cores in cpu_count as a boolean option#630
Copilot wants to merge 34 commits into
masterfrom
copilot/implement-system-command-inspection

Conversation

Copy link
Copy Markdown

Copilot AI commented May 15, 2026

Fixes #629.

Detect performance cores to avoid oversubscription problem when loky.cpu_count is used to automatically adjust the number of worker threads or processes for CPU intensive parallel tasks.

The new only_performance_cores=True boolean flag is set to true by default but only active when only_physical_cores=True (not the default).

I (@ogrisel) chose this design because people who pass only_physical_cores=True are generally interested in counting the number of cores to adjust the number of worker threads/processes for CPU intensive parallel work, in which case, including logical SMT threads and efficiency cores in the worker count often leads to detrimental scheduling overhead that hurts the scalability.

EDIT: moved this PR back to draft because the cascading behavior between the two boolean options is surprising.

Copilot AI and others added 7 commits May 15, 2026 16:48
@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented May 15, 2026

@copilot fix the CI failures.

Copy link
Copy Markdown
Author

Copilot AI commented May 15, 2026

@copilot fix the CI failures.

Fixed in commit 2b40df0. I investigated the failing workflow run and updated the Linux sysfs mock path handling in tests/test_loky_module.py so the tests pass consistently on Windows (backslash paths) and Linux. Targeted tests now pass locally.

Copilot AI requested a review from ogrisel May 15, 2026 17:27
@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented May 16, 2026

@copilot let's simplify the logic and make only_performance_cores a boolean option (True or False and True by default):

  • when only_physical_cores=False, only_performance_cores is ignored.
  • when only_physical_cores=True, performance core filtering is applied when only_performance_cores is True.

Update tests, doc, changelog and implementation accordingly.

Copilot AI and others added 2 commits May 16, 2026 07:36
Copy link
Copy Markdown
Author

Copilot AI commented May 16, 2026

@copilot let's simplify the logic and make only_performance_cores a boolean option (True or False and True by default):

  • when only_physical_cores=False, only_performance_cores is ignor...

Implemented in de8686d (plus e380923 clarification): only_performance_cores is now a boolean with default True, ignored when only_physical_cores=False, and used only when only_physical_cores=True. I updated implementation, tests, function docs, and changelog accordingly.

Copilot AI changed the title Add only_performance_cores toggle to cpu_count while preserving "auto" hybrid behavior Simplify only_performance_cores in cpu_count to a boolean option May 16, 2026
@ogrisel ogrisel changed the title Simplify only_performance_cores in cpu_count to a boolean option Add only_performance_cores in cpu_count as a boolean option May 18, 2026
@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented May 18, 2026

I have tested this PR under Linux with a hybrid CPU (Intel(R) Core(TM) Ultra X7 358H with 4 performance cores and 16 total) and under macOS with an Apple M4 CPU (4 performance cores and 10 total), and it seems to work as expected.

I will now give it a try with the Intel(R) Core(TM) Ultra X7 358H CPU under windows.

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented May 18, 2026

@copilot on the Intel(R) Core(TM) Ultra X7 358H CPU under Linux, loky.cpu_count(only_physical_cores=True) reports 4 core as expected, but reports 12 under Windows (same machine with same CPU, double boot setting, no virtualization)

On both systems, loky.cpu_count(only_physical_cores=False) and loky.cpu_count(only_physical_cores=True, only_performance_cores=False) report 16 as expected.

Please fix the Windows specific inspection logic to get a behavior consistent with the Linux outcome (and update the tests accordingly).

Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot here is another pass of feedback to address.

Comment thread loky/backend/context.py Outdated
Comment thread tests/test_loky_module.py Outdated
Comment thread tests/test_loky_module.py Outdated
Comment thread CHANGES.md Outdated
Comment thread tests/test_loky_module.py Outdated
Comment thread tests/test_loky_module.py Outdated
Comment thread loky/backend/context.py Outdated
Comment thread loky/backend/context.py Outdated
Comment thread loky/backend/context.py Outdated
Comment thread loky/backend/context.py Outdated
Copilot AI and others added 2 commits May 19, 2026 14:00
… update tests and changelog

Agent-Logs-Url: https://github.com/joblib/loky/sessions/95179a54-fbe7-4faa-9593-933c579b0f62

Co-authored-by: ogrisel <89061+ogrisel@users.noreply.github.com>
Copy link
Copy Markdown
Author

Copilot AI commented May 19, 2026

@copilot here is another pass of feedback to address.

All feedback addressed in 345ad3a (with follow-up comment in 3826461): removed wrapper functions and _detect_platform_cores, used Counter, removed dead code, simplified test setup, made glob mocks specific, factorized win32 helpers, and added the invalid-data parametrized test.

Copilot AI requested a review from ogrisel May 19, 2026 14:04
Copy link
Copy Markdown
Contributor

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot some more minor feedback.

Comment thread tests/test_loky_module.py Outdated
Comment thread tests/test_loky_module.py Outdated
return bytes(data)


def _make_fake_ctypes_module(payload):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's rename this to _make_fake_win32_ctypes_module.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0b9ea16.

Comment thread tests/test_loky_module.py
- Rename _make_win32_record -> _make_win32_processor_relationship_record
- Rename _make_fake_ctypes_module -> _make_fake_win32_ctypes_module
- Parametrize test_count_performance_cores_win32_selects_highest_efficiency_class
  with 4 topology cases: 4P+12E, 2P+4M+8E (tri-hybrid), 8P+8E, 1P+3E

Agent-Logs-Url: https://github.com/joblib/loky/sessions/d56bbc35-758a-4474-87ff-0ba9e6c15055

Co-authored-by: ogrisel <89061+ogrisel@users.noreply.github.com>
@ogrisel ogrisel marked this pull request as ready for review May 19, 2026 15:38
@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented May 19, 2026

I think I am satisfied with the current state of this PR.

@tomMoral or @jeremiedbb I'd love any feedback on this. I tested it on laptops with hybrid CPUs and all three platforms, and it behaves expected, for instance on Windows or Linux with an Intel CPU with 4 performance cores, 8 efficiency cores and 4 low-power cores I get:

>>> import loky
>>> loky.cpu_count()
16
>>> loky.cpu_count(only_physical_cores=True)
4
>>> loky.cpu_count(only_physical_cores=True, only_performance_cores=False)
16

and on a macOS macbook air laptop with M4 processor:

>>> import loky
>>> loky.cpu_count()
10
>>> loky.cpu_count(only_physical_cores=True)
4
>>> loky.cpu_count(only_physical_cores=True, only_performance_cores=False)
10

@jeremiedbb
Copy link
Copy Markdown
Contributor

>>> loky.cpu_count(only_physical_cores=True, only_performance_cores=False)
16

I would have expected 4, no ?

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented May 20, 2026

I would have expected 4, no ?

On this machine there is no SMT/hyperthreading on any core. So all are physical cores.

But in retrospect, there are machines with a mix of SMT and hybrid cores (e.g. 2 physical performance cores with SMT, therefore 4 local cores, and 4 efficiency cores without SMT). So maybe we should decouple the two detection mechanisms to give full control to the user.

@jeremiedbb
Copy link
Copy Markdown
Contributor

On this machine there is no SMT/hyperthreading on any core. So all are physical cores.

then why

>>> loky.cpu_count(only_physical_cores=True)
4

?

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented May 20, 2026

Because in this PR I enabled only_performance_cores=True as default and made the only_performance_cores option only enforced when only_physical_cores=True.

I agree this is weird. I will update the PR to decouple the two filtering mechanisms.

@ogrisel ogrisel marked this pull request as draft May 20, 2026 12:25
@tomMoral
Copy link
Copy Markdown
Contributor

Hello!
Thanks for the PR.
From what I understand, on my machine, I have 4 performance core with SMT and the other without and I have to say I have a hard time figuring if the results makes sense. Any easy way to introspect what results to expect (I have to say I don't really understand the differences).

Default: 12
loky.cpu_count(only_physical_cores=True, only_performance_cores=True): 4
loky.cpu_count(only_physical_cores=True, only_performance_cores=False): 10
loky.cpu_count(only_physical_cores=False, only_performance_cores=True): 12
loky.cpu_count(only_physical_cores=False, only_performance_cores=False): 12

One thing that I find surprising: as we don't have a way to constrain the jobs to run on performance cores, is this that useful to restrict the number of jobs to match them?

@ogrisel
Copy link
Copy Markdown
Contributor

ogrisel commented May 21, 2026

One thing that I find surprising: as we don't have a way to constrain the jobs to run on performance cores, is this that useful to restrict the number of jobs to match them?

We don't have an easy way to constraint to run on physical core either (on SMT CPUs). But still, its often detrimental to spawn too many threads/process on machines with SMT or non-performance cores: the overhead of thread/process scheduling can lead to wasted energy or even performance degradation compared to using fewer threads/processes.

However, it's hard to find a general rule. It can depend on the workfload, the parallel runtime and the hardware (core cache sizes, shared across cores or not, core frequency, and other lower level CPU architecture details).

@jeremiedbb
Copy link
Copy Markdown
Contributor

In addition, even if we can't constraint to run on specific cores, if you limit the number of cores to run on, the os scheduler will automatically chose what it considers best (in some sense), and usually it means prioritize physical / performance cores.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add an option to detect performance cores in a cross-platform way

4 participants