Description
Question
I understand that the sliding window reuses a groupApiPool slot once the event's refCount hits zero. However, starting an ncclProfileKernelLaunch event increments the parent's refCount, and the only decrement is in updateEvent, which is never reached for kernel launch events. Could this block the window?
path: /nccl/ext-profiler/example/plugin.cc
refCount is incremented here:
exampleProfilerStartEvent() {
  else if (eDescr->type == ncclProfileKernelLaunch) {
    .....
    __atomic_fetch_add(&parent->refCount, 1, __ATOMIC_RELAXED);
    .....
  }
}
refCount is decremented here:
updateEvent() {
  if (type == ncclProfileGroupApi) {
    struct groupApi* event = (struct groupApi*) handle;
    if (__atomic_sub_fetch(&event->refCount, 1, __ATOMIC_RELAXED) == 0) { // the only place refCount is decremented
      event->stopTs = gettime() - startTime;
      __atomic_fetch_add(&event->ctx->groupApiPoolBase, 1, __ATOMIC_RELAXED);
    }
  }
  else if (type == ncclProfileKernelLaunch) {
    updateEvent(event->parent); // recurses with the parent, whose type is ncclProfileGroupApi
  }
}
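However, updateEvent is only invoked from exampleProfilerStopEvent, which returns early when type == ncclProfileKernelLaunch. As a result, the increment made when the kernel launch event starts never appears to be matched by a decrement on the parent: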
exampleProfilerStopEvent() {
  else if (type == ncclProfileKernelLaunch) {
    struct kernelLaunch* event = (struct kernelLaunch*) eHandle;
    event->stopTs = gettime() - startTime;
    return ncclSuccess; // returns before updateEvent is reached
  }
  updateEvent(eHandle);
  return ncclSuccess;
}
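To make the question concrete, here is a minimal model of the flow as I read it. It is only a sketch, not the plugin's code: the type names (Context, GroupApi, KernelLaunch), the helper names (startKernelLaunch, stopEvent, simulate), and the assumption that a group API event starts with refCount == 1 are all hypothetical simplifications. The flag earlyReturnForKernelLaunch switches between the early return quoted above and a variant that falls through to updateEvent.

```cpp
#include <assert.h>
#include <stdint.h>

// Hypothetical, simplified stand-ins for the plugin's types (not the real NCCL definitions).
enum EventType { kGroupApi = 0, kKernelLaunch = 1 };

struct Context { int groupApiPoolBase; };  // advances when a pool slot can be reused

struct GroupApi {
  uint8_t type;
  int refCount;
  struct Context* ctx;
};

struct KernelLaunch {
  uint8_t type;
  struct GroupApi* parent;
};

// Same shape as the updateEvent excerpt: decrement the group API refCount,
// and advance the pool base when it reaches zero.
static void updateEvent(void* handle) {
  uint8_t type = *(uint8_t*)handle;
  if (type == kGroupApi) {
    struct GroupApi* event = (struct GroupApi*)handle;
    if (__atomic_sub_fetch(&event->refCount, 1, __ATOMIC_RELAXED) == 0) {
      __atomic_fetch_add(&event->ctx->groupApiPoolBase, 1, __ATOMIC_RELAXED);
    }
  } else if (type == kKernelLaunch) {
    struct KernelLaunch* event = (struct KernelLaunch*)handle;
    updateEvent(event->parent);
  }
}

// Starting a kernel launch event takes a reference on its parent.
static void startKernelLaunch(struct KernelLaunch* k, struct GroupApi* parent) {
  k->type = kKernelLaunch;
  k->parent = parent;
  __atomic_fetch_add(&parent->refCount, 1, __ATOMIC_RELAXED);
}

static void stopEvent(void* handle, int earlyReturnForKernelLaunch) {
  uint8_t type = *(uint8_t*)handle;
  if (type == kKernelLaunch && earlyReturnForKernelLaunch) {
    return;  // mirrors the quoted early return: updateEvent is skipped
  }
  updateEvent(handle);
}

// Runs one group API event with one kernel launch child and returns the
// final groupApiPoolBase (1 means the slot was released, 0 means it was not).
int simulate(int earlyReturnForKernelLaunch) {
  struct Context ctx = {0};
  struct GroupApi group;
  group.type = kGroupApi;
  group.refCount = 1;  // assumption: the event holds one reference to itself
  group.ctx = &ctx;

  struct KernelLaunch kernel;
  startKernelLaunch(&kernel, &group);            // refCount: 1 -> 2
  stopEvent(&kernel, earlyReturnForKernelLaunch);
  stopEvent(&group, earlyReturnForKernelLaunch);
  return ctx.groupApiPoolBase;
}
```

In this model, simulate(1) leaves groupApiPoolBase at 0 because the parent's refCount stops at 1, while simulate(0) advances it to 1. If the real plugin decrements the parent somewhere I have missed, the model does not apply and I would appreciate a pointer to that code path.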