SWDEV-569101 - Resolve deadlock caused by graph packet batching when profiling is enabled #2084
Pull request overview
This PR increases the signal list size in the HwQueueTracker to accommodate larger batch sizes for HIP graph operations. The change ensures the signal pool can handle at least DEBUG_HIP_GRAPH_BATCH_SIZE (256) signals instead of being limited to ROC_SIGNAL_POOL_SIZE (64).
- Signal list size now uses std::max(ROC_SIGNAL_POOL_SIZE, DEBUG_HIP_GRAPH_BATCH_SIZE) to ensure sufficient capacity
- Lookahead calculation updated from a fixed offset of 2 to DEBUG_HIP_GRAPH_BATCH_SIZE for consistency with the new sizing
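The sizing rule in the overview can be illustrated with a minimal sketch. The constants below are stand-ins that use the values quoted in the overview (64 and 256), and the function name SignalListSize is hypothetical. It is not taken from the HwQueueTracker source, where these values are runtime flags.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Stand-ins for the runtime flags; the values are the ones quoted in the overview.
constexpr std::size_t kRocSignalPoolSize = 64;        // ROC_SIGNAL_POOL_SIZE
constexpr std::size_t kDebugHipGraphBatchSize = 256;  // DEBUG_HIP_GRAPH_BATCH_SIZE

// Hypothetical helper: pick a signal list size that can hold a whole batch.
inline std::size_t SignalListSize(std::size_t pool_size, std::size_t batch_size) {
  const std::size_t size = std::max(pool_size, batch_size);
  // By construction the list can hold at least one full batch of signals.
  assert(size >= batch_size);
  return size;
}
```

With the values above, SignalListSize(kRocSignalPoolSize, kDebugHipGraphBatchSize) yields 256, so a full batch of graph packets no longer exceeds the number of available signal slots.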
saleelk
left a comment
LGTM
d1e2d4e to b44711d
/AzurePipelines run rocm-ci-caller
Azure Pipelines successfully started running 1 pipeline(s).
b44711d to fab10f7
DEBUG_HIP_GRAPH_BATCH_SIZE (#2084) [rocm-systems] ROCm/rocm-systems#2084 (commit 65b769e)
Motivation
This PR fixes a deadlock in ActiveSignal that occurs when both profiling and graph packet batching are enabled. The deadlock was introduced after #1354.
Technical Details
Added a timestamp check to the signal insertion logic. Previously, a new profiling signal was created only when the existing signal in the slot was still active (signal > 0). A second condition, signal_list_[temp_id]->ts_ == ts, now also triggers creation of a new signal when the slot is already associated with the current timestamp.
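The decision described above can be sketched as a small predicate. This is an illustration under stated assumptions, not the code from the PR: the Timestamp and ProfilingSignal types, the value field, and the NeedsNewSignal helper are hypothetical, and the sketch assumes the two conditions are combined with a logical OR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the runtime's timestamp object.
struct Timestamp {};

// Hypothetical stand-in for an entry in signal_list_.
struct ProfilingSignal {
  std::int64_t value = 0;          // > 0 means the signal is still active
  const Timestamp* ts_ = nullptr;  // timestamp currently attached to this slot
};

// Returns true when the slot must not be reused and a new signal is needed.
// Assumption: the original check and the added check are OR-ed together.
inline bool NeedsNewSignal(const ProfilingSignal* slot, const Timestamp* ts) {
  assert(slot != nullptr);
  const bool still_active = slot->value > 0;    // original condition
  const bool same_timestamp = slot->ts_ == ts;  // condition added by this PR
  return still_active || same_timestamp;
}
```

Under these assumptions, an idle slot (value of 0) that already carries the current timestamp is no longer treated as free, which is the case the description says the second condition handles.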
Test Plan
Tested with the job from https://ontrack-internal.amd.com/browse/SWDEV-569101 and verified that the change resolves the hang with various combinations of ROC_SIGNAL_POOL_SIZE and DEBUG_HIP_GRAPH_BATCH_SIZE.
Test Result
The hang reported in https://ontrack-internal.amd.com/browse/SWDEV-569101 is now resolved.
Submission Checklist