Add only_performance_cores in cpu_count as a boolean option#630
Agent-Logs-Url: https://github.com/joblib/loky/sessions/d8361f0d-d8d1-4594-8a3c-ed4513f92652 Co-authored-by: ogrisel <89061+ogrisel@users.noreply.github.com>
Agent-Logs-Url: https://github.com/joblib/loky/sessions/25ce1bba-2c10-4dac-acb0-227be98499d5 Co-authored-by: ogrisel <89061+ogrisel@users.noreply.github.com>
@copilot fix the CI failures.
Agent-Logs-Url: https://github.com/joblib/loky/sessions/1f532cb6-e8a9-4fc3-81aa-62b93362297f Co-authored-by: ogrisel <89061+ogrisel@users.noreply.github.com>
Fixed in commit
@copilot let's simplify the logic and make
Update tests, docs, changelog and implementation accordingly.
Agent-Logs-Url: https://github.com/joblib/loky/sessions/8f61b182-0130-4ffd-bd5c-c364376aab8f Co-authored-by: ogrisel <89061+ogrisel@users.noreply.github.com>
Implemented in |
Title changed: only_performance_cores toggle to cpu_count while preserving "auto" hybrid behavior -> only_performance_cores in cpu_count to a boolean option
Title changed: only_performance_cores in cpu_count to a boolean option -> only_performance_cores in cpu_count as a boolean option
I have tested this PR under Linux with a hybrid CPU (Intel(R) Core(TM) Ultra X7 358H with 4 performance cores and 16 cores in total) and under macOS with an Apple M4 CPU (4 performance cores and 10 cores in total), and it seems to work as expected. I will now try it with the Intel(R) Core(TM) Ultra X7 358H CPU under Windows.
@copilot on the Intel(R) Core(TM) Ultra X7 358H CPU under Linux, On both systems, Please fix the Windows specific inspection logic to get a behavior consistent with the Linux outcome (and update the tests accordingly). |
… update tests and changelog Agent-Logs-Url: https://github.com/joblib/loky/sessions/95179a54-fbe7-4faa-9593-933c579b0f62 Co-authored-by: ogrisel <89061+ogrisel@users.noreply.github.com>
All feedback addressed in |
    return bytes(data)


def _make_fake_ctypes_module(payload):
Let's rename this to _make_fake_win32_ctypes_module.
- Rename _make_win32_record -> _make_win32_processor_relationship_record
- Rename _make_fake_ctypes_module -> _make_fake_win32_ctypes_module
- Parametrize test_count_performance_cores_win32_selects_highest_efficiency_class with 4 topology cases: 4P+12E, 2P+4M+8E (tri-hybrid), 8P+8E, 1P+3E
Agent-Logs-Url: https://github.com/joblib/loky/sessions/d56bbc35-758a-4474-87ff-0ba9e6c15055 Co-authored-by: ogrisel <89061+ogrisel@users.noreply.github.com>
Agent-Logs-Url: https://github.com/joblib/loky/sessions/3f76eeca-a4a3-4a71-a9df-d316c42876df Co-authored-by: ogrisel <89061+ogrisel@users.noreply.github.com>
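The parametrized test named in the commit above selects the cores that report the highest efficiency class. The idea can be sketched in pure Python as follows. This is an illustration only, not the code from this PR: the helper name `count_highest_efficiency_class` and the integer class values assigned to each topology are assumptions, chosen so that a higher value means a more performant core.

```python
from collections import Counter


def count_highest_efficiency_class(efficiency_classes):
    """Count the cores whose efficiency class equals the maximum observed.

    `efficiency_classes` holds one integer per physical core. The helper and
    the values used below are hypothetical, for illustration only.
    """
    if not efficiency_classes:
        return 0
    per_class = Counter(efficiency_classes)
    return per_class[max(per_class)]


# The four topology cases listed in the commit message, with assumed
# class values (P > M > E).
TOPOLOGIES = {
    "4P+12E": [1] * 4 + [0] * 12,
    "2P+4M+8E": [2] * 2 + [1] * 4 + [0] * 8,
    "8P+8E": [1] * 8 + [0] * 8,
    "1P+3E": [1] * 1 + [0] * 3,
}

results = {
    name: count_highest_efficiency_class(classes)
    for name, classes in TOPOLOGIES.items()
}
print(results)
```

With this rule, a CPU on which every core reports the same class counts all of its cores as performance cores.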
I think I am satisfied with the current state of this PR. @tomMoral or @jeremiedbb I'd love any feedback on this.
I tested it on laptops with hybrid CPUs on all three platforms, and it behaves as expected. For instance, on Windows or Linux with an Intel CPU with 4 performance cores, 8 efficiency cores and 4 low-power cores, I get:
>>> import loky
>>> loky.cpu_count()
16
>>> loky.cpu_count(only_physical_cores=True)
4
>>> loky.cpu_count(only_physical_cores=True, only_performance_cores=False)
16
and on a macOS MacBook Air laptop with an M4 processor:
>>> import loky
>>> loky.cpu_count()
10
>>> loky.cpu_count(only_physical_cores=True)
4
>>> loky.cpu_count(only_physical_cores=True, only_performance_cores=False)
10
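The counts reported above are consistent with the cascading rule described in this PR, where only_performance_cores only takes effect when only_physical_cores=True. Below is a minimal pure-Python model of that rule. The `Core` record and the `count_cores` helper are hypothetical and are not loky's implementation; the two topologies are the ones quoted in the comment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    """Hypothetical description of one physical core."""

    is_performance: bool
    n_threads: int = 1  # hardware threads exposed by this core (2 with SMT)


def count_cores(cores, only_physical_cores=False, only_performance_cores=True):
    """Model of the cascading rule discussed in this PR (not loky's code)."""
    if not only_physical_cores:
        # Default: count every logical CPU; the performance filter is ignored.
        return sum(core.n_threads for core in cores)
    if only_performance_cores:
        return sum(1 for core in cores if core.is_performance)
    return len(cores)


def three_counts(cores):
    """Return the results of the three calls shown in the comment."""
    return (
        count_cores(cores),
        count_cores(cores, only_physical_cores=True),
        count_cores(cores, only_physical_cores=True, only_performance_cores=False),
    )


# Intel laptop: 4 performance cores + 8 efficiency + 4 low-power cores, no SMT.
intel = [Core(True)] * 4 + [Core(False)] * 12
# Apple M4: 4 performance cores + 6 efficiency cores.
m4 = [Core(True)] * 4 + [Core(False)] * 6

intel_counts = three_counts(intel)
m4_counts = three_counts(m4)
print(intel_counts, m4_counts)
```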
>>> loky.cpu_count(only_physical_cores=True, only_performance_cores=False)
16
I would have expected 4, no?
On this machine there is no SMT/hyperthreading on any core, so all of them are physical cores. But in retrospect, there are machines that mix SMT and hybrid cores (e.g. 2 physical performance cores with SMT, therefore 4 logical cores, and 4 efficiency cores without SMT). So maybe we should decouple the two detection mechanisms to give full control to the user.
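To make the decoupling idea concrete, here is a small sketch that applies the two filters independently to the mixed topology from the example above (2 performance cores with SMT plus 4 efficiency cores without SMT). The `count_decoupled` helper and its semantics are an assumed design for illustration, not what this PR or loky implements.

```python
from itertools import product

# Hypothetical topology from the comment: each tuple is
# (is_performance_core, hardware_threads_on_that_core).
MIXED = [(True, 2)] * 2 + [(False, 1)] * 4


def count_decoupled(cores, only_physical_cores, only_performance_cores):
    """Apply the two filters independently (assumed design, not loky's code)."""
    # First filter: optionally keep only the performance cores.
    selected = [c for c in cores if c[0] or not only_performance_cores]
    # Second filter: count physical cores, or the logical CPUs they expose.
    if only_physical_cores:
        return len(selected)
    return sum(n_threads for _, n_threads in selected)


table = {
    (physical, performance): count_decoupled(MIXED, physical, performance)
    for physical, performance in product([False, True], repeat=2)
}
for (physical, performance), count in table.items():
    print(
        f"only_physical_cores={physical}, "
        f"only_performance_cores={performance}: {count}"
    )
```

Under these assumptions each of the four flag combinations yields a distinct count (8, 4, 6 and 2), which is the "full control" that decoupling would give.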
then why ?
Because in this PR I enabled I agree this is weird. I will update the PR to decouple the two filtering mechanisms.
Hello! Default: 12 One thing that I find surprising: since we have no way to constrain jobs to run on performance cores, is it really that useful to restrict the number of jobs to match their count?
We don't have an easy way to constrain jobs to run on physical cores either (on SMT CPUs). Still, it is often detrimental to spawn too many threads/processes on machines with SMT or non-performance cores: the overhead of thread/process scheduling can lead to wasted energy or even performance degradation compared to using fewer threads/processes. However, it's hard to find a general rule. It can depend on the workload, the parallel runtime and the hardware (core cache sizes, whether caches are shared across cores, core frequency, and other lower-level CPU architecture details).
In addition, even if we can't constrain jobs to run on specific cores, if you limit the number of cores to run on, the OS scheduler will automatically choose what it considers best (in some sense), which usually means prioritizing physical / performance cores.
Fixes #629.
Detect performance cores to avoid oversubscription problems when loky.cpu_count is used to automatically adjust the number of worker threads or processes for CPU-intensive parallel tasks.

The new only_performance_cores=True boolean flag is set to true by default but is only active when only_physical_cores=True (not the default).

I (@ogrisel) chose this design because people who pass only_physical_cores=True are generally interested in counting the number of cores to adjust the number of worker threads/processes for CPU-intensive parallel work, in which case including logical SMT threads and efficiency cores in the worker count often leads to detrimental scheduling overhead that hurts scalability.

EDIT: moved this PR back to draft because the cascading behavior between the two boolean options is surprising.