Skip to content

Conversation

@iassiour
Copy link
Contributor

@iassiour iassiour commented Nov 29, 2025

Motivation

This PR fixes a deadlock in ActiveSignal when profiling and graph packet batching is enabled, that is introduced after #1354.

Technical Details

Added a timestamp check to the signal insertion logic. Previously, a new profiling signal was only created when the existing signal was still active (signal > 0). Added a second condition signal_list_[temp_id]->ts_ == ts to handle the case where the slot is already associated with the current timestamp.

Test Plan

Tested with the job from https://ontrack-internal.amd.com/browse/SWDEV-569101 and verified that it resolves the hang against with various combinations of ROC_SIGNAL_POOL_SIZE and DEBUG_HIP_GRAPH_BATCH_SIZE.

Test Result

The hang reported in https://ontrack-internal.amd.com/browse/SWDEV-569101 is now resolved.

Submission Checklist

Copilot AI review requested due to automatic review settings November 29, 2025 00:03
@iassiour iassiour requested a review from a team as a code owner November 29, 2025 00:03
Copilot finished reviewing on behalf of iassiour November 29, 2025 00:05
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR increases the signal list size in the HwQueueTracker to accommodate larger batch sizes for HIP graph operations. The change ensures the signal pool can handle at least DEBUG_HIP_GRAPH_BATCH_SIZE (256) signals instead of being limited to ROC_SIGNAL_POOL_SIZE (64).

  • Signal list size now uses std::max(ROC_SIGNAL_POOL_SIZE, DEBUG_HIP_GRAPH_BATCH_SIZE) to ensure sufficient capacity
  • Lookahead calculation updated from a fixed offset of 2 to DEBUG_HIP_GRAPH_BATCH_SIZE for consistency with the new sizing

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

Copy link
Contributor

@saleelk saleelk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@iassiour iassiour force-pushed the users/iassiour/SWDEV-569101 branch from d1e2d4e to b44711d Compare November 29, 2025 00:21
@iassiour iassiour changed the title SWDEV-569101 - increase signal list size to at least DEBUG_HIP_GRAPH_BATCH_SIZE SWDEV-569101 - Resolve deadlock caused by graph packet batching when profiling is enabled Nov 29, 2025
@iassiour
Copy link
Contributor Author

/AzurePipelines run rocm-ci-caller

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@iassiour iassiour force-pushed the users/iassiour/SWDEV-569101 branch from b44711d to fab10f7 Compare December 2, 2025 01:50
@stellaraccident stellaraccident merged commit 65b769e into develop Dec 2, 2025
20 of 21 checks passed
@stellaraccident stellaraccident deleted the users/iassiour/SWDEV-569101 branch December 2, 2025 02:52
systems-assistant bot pushed a commit to ROCm/clr that referenced this pull request Dec 2, 2025
 DEBUG_HIP_GRAPH_BATCH_SIZE (#2084)
[rocm-systems] ROCm/rocm-systems#2084 (commit 65b769e)
rakesroy pushed a commit that referenced this pull request Dec 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants