Commit 6dd51f1

Merge pull request #1217 from crazy-JiangDongHua/bugfix_undo_plan
Fix a bug in the plan enqueue logic where plans could silently fail to launch for some communicators. The bug is triggered when all of the following are true: 1. Multiple communicators per ncclGroup. 2. Communicators within a group have different plan counts. 3. Intra-process launch barrier disabled.
2 parents 48bb7fe + 9ef920a commit 6dd51f1
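
The failure mode described above can be reproduced with a small standalone simulation. This is only a sketch under stated assumptions: `FakeComm` and `runRounds` are hypothetical names that do not exist in NCCL, the integer `unlaunched` stands in for the `unlaunchedPlansHead` list, and only the non-barrier path is modeled. With `accumulate == false` the flag is overwritten by each communicator, as before this commit; with `accumulate == true` it is reset once per round and OR-ed across the clique, as after it.

```cpp
#include <vector>

// Hypothetical stand-in for ncclComm: tracks only the remaining plan count.
struct FakeComm {
  int unlaunched;  // plays the role of `unlaunchedPlansHead != nullptr`
};

// Simulates the non-barrier launch rounds for one clique and returns how
// many plans were never launched.
//   accumulate == false: flag is assigned per communicator (old behavior)
//   accumulate == true:  flag is reset per round and OR-ed (patched behavior)
int runRounds(std::vector<FakeComm>& clique, bool accumulate) {
  bool moreRounds = false;
  while (true) {  // Iterate rounds of launches for the clique.
    if (accumulate) moreRounds = false;  // patched: `bool moreRounds = false;`
    for (FakeComm& comm : clique) {      // Iterate clique members.
      bool hasPlans = comm.unlaunched > 0;
      if (accumulate) {
        moreRounds |= hasPlans;  // any member with work keeps the rounds going
      } else {
        moreRounds = hasPlans;   // last member alone decides
      }
      if (moreRounds && hasPlans) {
        comm.unlaunched--;       // "launch" one plan this round
      }
    }
    if (!moreRounds) break;
  }
  int left = 0;
  for (const FakeComm& comm : clique) left += comm.unlaunched;
  return left;
}
```

For a clique with plan counts 3 and 1, the old logic stops after the second round because the last communicator has nothing left, leaving one plan of the first communicator unlaunched. The accumulated flag keeps iterating until every communicator is drained.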

1 file changed, +2 -2 lines changed

src/group.cc

Lines changed: 2 additions & 2 deletions
@@ -142,15 +142,15 @@ static ncclResult_t doLaunches(struct ncclComm* head) {
     }

     while (true) { // Iterate rounds of launches for clique.
-      bool moreRounds;
+      bool moreRounds = false;
       comm = cliqueHead;
       do { // Iterate clique members.
         struct ncclComm* next = comm->groupNext;
         if (useBarrier) {
           // Barrier reduction result tells us if this was the final round.
           moreRounds = 0 != ncclCommIntraBarrierOut(comm);
         } else {
-          moreRounds = comm->unlaunchedPlansHead != nullptr;
+          moreRounds |= comm->unlaunchedPlansHead != nullptr;
         }
         if (moreRounds) {
           // Pop next unlaunched kernel
