[Question]: Missing refCount Decrement for ncclProfileKernelLaunch? #1910

@xxxxx-ctrl

Question

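I understand that the sliding window reuses a groupApiPool slot once that event's refCount reaches zero. However, starting an ncclProfileKernelLaunch event increments the parent's refCount, and the only matching decrement is in updateEvent, which is never called for kernel launch events. Could this leftover reference block the window from advancing?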

path: /nccl/ext-profiler/example/plugin.cc

Where refCount is incremented:

exampleProfilerStartEvent(){
  else if (eDescr->type == ncclProfileKernelLaunch) {
    .....
    __atomic_fetch_add(&parent->refCount, 1, __ATOMIC_RELAXED);
    .....
  }
}

Where refCount is decremented:

updateEvent(){
  if (type == ncclProfileGroupApi) {
    struct groupApi* event = (struct groupApi*) handle;
    // Decrement refCount; the last reference stamps stopTs and bumps groupApiPoolBase.
    if (__atomic_sub_fetch(&event->refCount, 1, __ATOMIC_RELAXED) == 0) {
      event->stopTs = gettime() - startTime;
      __atomic_fetch_add(&event->ctx->groupApiPoolBase, 1, __ATOMIC_RELAXED);
    }
  }
  else if (type == ncclProfileKernelLaunch) {
    // Recurse into the parent, which then takes the ncclProfileGroupApi branch above.
    updateEvent(event->parent);
  }
}

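However, updateEvent is only invoked from exampleProfilerStopEvent, which returns early when type == ncclProfileKernelLaunch, before it reaches that call. As a result, the ncclProfileKernelLaunch branch of updateEvent never runs: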

exampleProfilerStopEvent(){
  else if (type == ncclProfileKernelLaunch) {
    struct kernelLaunch* event = (struct kernelLaunch*) eHandle;
    event->stopTs = gettime() - startTime;
    return ncclSuccess;
  }
  updateEvent(eHandle);
  return ncclSuccess;
}
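
To make the concern concrete, here is a minimal standalone model of the reference counting described above. It is only a sketch, not the plugin's code: the types `Context`, `GroupApiEvent`, `KernelLaunchEvent` and the helpers `releaseGroupApi` and `simulate` are hypothetical stand-ins, and it assumes the group API event starts with a refCount of 1 for itself. It uses the same GCC/Clang `__atomic` builtins as the excerpts.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for the plugin's structures.
struct Context {
  int groupApiPoolBase = 0;  // advanced when a group API slot becomes reusable
};

struct GroupApiEvent {
  int refCount = 0;
  Context* ctx = nullptr;
};

struct KernelLaunchEvent {
  GroupApiEvent* parent = nullptr;
};

// Assumption: the group API event holds one reference for itself.
void startGroupApi(GroupApiEvent* g, Context* ctx) {
  g->ctx = ctx;
  g->refCount = 1;
}

// Mirrors the increment quoted from exampleProfilerStartEvent.
void startKernelLaunch(KernelLaunchEvent* k, GroupApiEvent* parent) {
  k->parent = parent;
  __atomic_fetch_add(&parent->refCount, 1, __ATOMIC_RELAXED);
}

// Mirrors the ncclProfileGroupApi branch of updateEvent.
void releaseGroupApi(GroupApiEvent* g) {
  if (__atomic_sub_fetch(&g->refCount, 1, __ATOMIC_RELAXED) == 0) {
    __atomic_fetch_add(&g->ctx->groupApiPoolBase, 1, __ATOMIC_RELAXED);
  }
}

// propagateToParent == false models the early return quoted above;
// true models a stop path that also releases the parent's reference.
void stopKernelLaunch(KernelLaunchEvent* k, bool propagateToParent) {
  if (!propagateToParent) return;
  releaseGroupApi(k->parent);
}

void stopGroupApi(GroupApiEvent* g) { releaseGroupApi(g); }

// Runs one group API event with one kernel launch child and returns
// how far the pool base advanced (0 means the slot was never released).
int simulate(bool propagateToParent) {
  Context ctx;
  GroupApiEvent group;
  KernelLaunchEvent kernel;
  startGroupApi(&group, &ctx);
  startKernelLaunch(&kernel, &group);
  stopKernelLaunch(&kernel, propagateToParent);
  stopGroupApi(&group);
  return ctx.groupApiPoolBase;
}
```

In this model, `simulate(false)` leaves the parent's refCount at 1, so groupApiPoolBase stays at 0, while `simulate(true)` lets the count reach zero and the base advances to 1. Whether the real plugin behaves this way depends on code not shown in the excerpts, for example a decrement elsewhere.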
