-
Notifications
You must be signed in to change notification settings - Fork 16
lifecycle: Add exception handling for CPU affinity #113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
SangKyeong-Jeong
wants to merge
4
commits into
eclipse-score:main
Choose a base branch
from
SangKyeong-Jeong:lifecycle/bugfix/affinity-exception-handling
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
84a8717
lifecycle: Add exception handling for CPU affinity
SangKyeong-Jeong 21d3d28
lifecycle: Use 64-bit operations
SangKyeong-Jeong af714a8
Merge branch 'eclipse-score:main' into lifecycle/bugfix/affinity-exce…
SangKyeong-Jeong 7c700e5
Merge branch 'main' into lifecycle/bugfix/affinity-exception-handling
FScholPer File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When there are 64 cores, is there still the issue or ? ;) In general, this affinity in LM supports only up to 32 cores, since later on
cpu_mask_also does not support more. So either we do an assert is there is more than 32 cores, or we do wanirng logs that default affinity is set only to first 32 cores ;)Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR addresses the Undefined Behavior on 32-core systems. As chungsky mentioned, getNumCores() already caps the count at 32, so >32 core systems are safe with this default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know what i am saying that we shall probably add at least a warning (since you are already fixing issues around it) that it was capped somehwere, since defult behaviour for ie 32 and 33 cores is completely different. @NicolasFussberger what do you think ?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, logging a warning could be helpful.
However, that might not be trivial. Probably you do not want to log the warning again and again every time
osal::getNumCores()is called but only once during startup. So I think it would require refactoringgetNumCores()method or adding another method for validation purposes.The behaviour in case of > 32 cores seems reasonable to me. LaunchManager will just use the first 32 cores.
We could also add support for 64 cores if that is required, but that is probably a separate task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After a short investigation it looks like we are hitting an undefined behavior here...
For reference pls have look at C Standard (C18, ISO/IEC 9899:2018) section 6.5.7 Bitwise shift operators and C++ Standard (C++20, ISO/IEC 14882:2020) section 7.6.2.2 Bitwise shift operators [expr.shift].
If the value of the right operand is negative, or greater than or equal to the number of bits in the promoted left operand, the behavior is undefined.IMHO it looks to me that hardware is not really doing the logical thing here. Please consider following example:
This code will print following text:
So essentially compiler will calculate something different than hardware.
For this reason maybe we should go for the following code that especially address this case. This may be handy if we support 64 cores in the future.
Any opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, true. For me fine. @SimonKozik / @NicolasFussberger can we create an gtihub issue to support 64 cores and/or provide a warning indication to the user if there is still more core ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To resolve the Undefined Behavior issue that occurs with 32 cores, is there any problem with the method I applied in the patch, which performs the shift operation using a 64-bit integer (1ULL) and then casts it?
Since the return type of the kDefaultProcessorAffinityMask function is uint32_t, returning -1U seems awkward because it applies a negative sign to an unsigned type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is true that 32 cores is a bit hard coded now and changing the maximum to 64, will require some changes in few files. For this reason it will be hard to argue, that we should make this function very future proof.
May main concern is the fact, that the change proposed in this PR is just masking the problem. Not really solving it. So if we in the future increase the mask size to 64-bit, your code will step into the same problem. Am I right here?
I will argue that we should document the root problem we are facing here. This probably will be in the form of a comment.
Apart of that we should have fix.
Fix can be in the form proposed inside my comment, or the fix can be as proposed in PR. But if we are going to go with
1ULLthen we should document that we are avoiding the undefined behavior by increasing the size of the type on which we are performing calculation.I just want to avoid stepping into the same problem when we eventually provide support for 64 cores.
PS.
A big plus of using
1ULLis the fact that we can avoid branch in the calculation.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that if the mask size is increased to 64 bits, we may encounter the same problem again. Currently, the maximum value is limited to 32 in the osal::getNumCores() function, which is why I made this change.
It also seems important to document this so that the issue does not occur again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a ticket here #122 to support 64 cores, linking to the proposal #113 (comment) for bitshifting