lifecycle: Add exception handling for CPU affinity#113
lifecycle: Add exception handling for CPU affinity#113SangKyeong-Jeong wants to merge 4 commits intoeclipse-score:mainfrom
Conversation
License Check Results🚀 The license check job ran with the Bazel command: bazel run --lockfile_mode=error //:license-checkStatus: Click to expand output |
|
The created documentation from the pull request is available at: docu-html |
|
@SangKyeong-Jeong whats the status here can we review it |
FScholPer
left a comment
There was a problem hiding this comment.
Can you please check also the OS Abstraction Layer
|
@FScholPer We will double-check the OSAL implementation files (specifically under src/internal/osal/linux and qnx) to ensure the CPU core limit is handled consistently across different platforms. We'll update the PR if any adjustments are needed. |
|
This issue is unrelated to the OSAL. When the CPU core count is 32, the return value of the ConfigurationManager::kDefaultProcessorAffinityMask() function is 0, causing an error when setting the CPU affinity to 0. Shifting a 32-bit integer by 32 bits results in a shift-count-overflow error, which leads to undefined behavior. Consequently, the return value of ConfigurationManager::kDefaultProcessorAffinityMask() becomes 0, causing this issue. |
|
@FScholPer Please review this. Thanks. |
Shifting a 32-bit number by 32 bits causes undefined behavior, so modify it to perform the shift operation on a 64-bit number and then cast it.
|
We encounter the Formatting checks fail. It looks like the temporal download connection problem, so I want to re-run the batch, but maybe don't have a permission to do. |
| const uint32_t ConfigurationManager::kDefaultProcessExecutionError = 1U; | ||
| uint32_t ConfigurationManager::kDefaultProcessorAffinityMask() { | ||
| return (1U << osal::getNumCores()) - 1U; | ||
| return static_cast<uint32_t>((1ULL << osal::getNumCores()) - 1ULL); |
There was a problem hiding this comment.
When there are 64 cores, is there still the issue or ? ;) In general, this affinity in LM supports only up to 32 cores, since later on cpu_mask_ also does not support more. So either we do an assert is there is more than 32 cores, or we do wanirng logs that default affinity is set only to first 32 cores ;)
There was a problem hiding this comment.
This PR addresses the Undefined Behavior on 32-core systems. As chungsky mentioned, getNumCores() already caps the count at 32, so >32 core systems are safe with this default.
There was a problem hiding this comment.
I know what i am saying that we shall probably add at least a warning (since you are already fixing issues around it) that it was capped somehwere, since defult behaviour for ie 32 and 33 cores is completely different. @NicolasFussberger what do you think ?
There was a problem hiding this comment.
I agree, logging a warning could be helpful.
However, that might not be trivial. Probably you do not want to log the warning again and again every time osal::getNumCores() is called but only once during startup. So I think it would require refactoring getNumCores() method or adding another method for validation purposes.
The behaviour in case of > 32 cores seems reasonable to me. LaunchManager will just use the first 32 cores.
We could also add support for 64 cores if that is required, but that is probably a separate task.
There was a problem hiding this comment.
After a short investigation it looks like we are hitting an undefined behavior here...
For reference pls have look at C Standard (C18, ISO/IEC 9899:2018) section 6.5.7 Bitwise shift operators and C++ Standard (C++20, ISO/IEC 14882:2020) section 7.6.2.2 Bitwise shift operators [expr.shift].
If the value of the right operand is negative, or greater than or equal to the number of bits in the promoted left operand, the behavior is undefined.
IMHO it looks to me that hardware is not really doing the logical thing here. Please consider following example:
#include <cstdint>
#include <iostream>
void shiftLeftTest(std::uint32_t num) {
std::uint32_t first_result = 1U << num;
std::uint32_t second_result = 1U << 32;
std::cout << "Shifting left 1U by " << num << std::endl;
std::cout << "first_result --> " << first_result << std::endl;
std::cout << "second_result --> " << second_result << std::endl;
}
int main() {
shiftLeftTest(32);
}
This code will print following text:
Shifting left 1U by 32
first_result --> 1
second_result --> 0
So essentially compiler will calculate something different than hardware.
For this reason maybe we should go for the following code that especially address this case. This may be handy if we support 64 cores in the future.
uint32_t ConfigurationManager::kDefaultProcessorAffinityMask() {
constexpr uint32_t BITS_PER_BYTE = 8;
uint32_t bitMask = 0U;
uint32_t cores = osal::getNumCores();
uint32_t maskSize = sizeof(bitMask) * BITS_PER_BYTE;
return ( cores >= maskSizeInBits ? -1U : (1U << cores) - 1U );
}
Any opinion?
There was a problem hiding this comment.
Yup, true. For me fine. @SimonKozik / @NicolasFussberger can we create an gtihub issue to support 64 cores and/or provide a warning indication to the user if there is still more core ?
There was a problem hiding this comment.
To resolve the Undefined Behavior issue that occurs with 32 cores, is there any problem with the method I applied in the patch, which performs the shift operation using a 64-bit integer (1ULL) and then casts it?
Since the return type of the kDefaultProcessorAffinityMask function is uint32_t, returning -1U seems awkward because it applies a negative sign to an unsigned type.
There was a problem hiding this comment.
It is true that 32 cores is a bit hard coded now and changing the maximum to 64, will require some changes in few files. For this reason it will be hard to argue, that we should make this function very future proof.
May main concern is the fact, that the change proposed in this PR is just masking the problem. Not really solving it. So if we in the future increase the mask size to 64-bit, your code will step into the same problem. Am I right here?
I will argue that we should document the root problem we are facing here. This probably will be in the form of a comment.
Apart of that we should have fix.
Fix can be in the form proposed inside my comment, or the fix can be as proposed in PR. But if we are going to go with 1ULL then we should document that we are avoiding the undefined behavior by increasing the size of the type on which we are performing calculation.
I just want to avoid stepping into the same problem when we eventually provide support for 64 cores.
PS.
A big plus of using 1ULL is the fact that we can avoid branch in the calculation.
There was a problem hiding this comment.
I agree that if the mask size is increased to 64 bits, we may encounter the same problem again. Currently, the maximum value is limited to 32 in the osal::getNumCores() function, which is why I made this change.
It also seems important to document this so that the issue does not occur again.
There was a problem hiding this comment.
I added a ticket here #122 to support 64 cores, linking to the proposal #113 (comment) for bitshifting
|
Done rebasing |
|
@pawelrutkaq can you please also check again |
When there are 32 CPU cores, an issue arises where the process cannot be executed, and when there are more than 32 cores, a problem occurs where fewer cores than expected are used. To resolve this issue, exception handling is added to set the maximum number of cores using the uint32_t type when there are more than 32 cores.