Add ncclCommDump API #2068
Conversation
```cpp
    return;
  }

  traceOpPtr->timestamps[counter] = std::chrono::high_resolution_clock::now();
```
To be pedantic with timing, you may want to sandwich timing calls between atomic signal fences or employ the DoNotOptimize builtin. The compiler does not guarantee that timing calls aren't reordered relative to other blocks of code.
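A minimal sketch of the fence approach (assuming C++11 `<atomic>`; `fencedNow` is a hypothetical helper, not part of this PR). `std::atomic_signal_fence` emits no machine instruction but prevents the compiler from reordering across it:

```cpp
#include <atomic>
#include <chrono>

// Read the clock with compiler-reordering barriers on both sides, so the
// timestamp read cannot be hoisted or sunk relative to the code being timed.
inline std::chrono::high_resolution_clock::time_point fencedNow() {
  std::atomic_signal_fence(std::memory_order_seq_cst);
  auto t = std::chrono::high_resolution_clock::now();
  std::atomic_signal_fence(std::memory_order_seq_cst);
  return t;
}
```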
```cpp
__attribute__ ((visibility("default")))
ncclResult_t ncclCommDump(
    const ncclComm_t comm,
    std::unordered_map<std::string, std::string>& map) {
```
Make `const std::unordered_map<std::string, std::string>& map`
I may have missed it, but why not pass in `ncclComm_t` as a const reference rather than passing it by value?
> Make `const std::unordered_map<std::string, std::string>& map`
In NCCLX and RCCLX, this map is not const because it is where we store some structured trace data that callers like PyTorch can use.
See the NCCLX API: https://github.com/meta-pytorch/torchcomms/blob/fe4e8116f2107b5aed0e38db10e072471ea95126/comms/ncclx/v2_27/meta/commDump.cc#L219
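For context, a hypothetical caller-side sketch of that contract (assuming the `ncclCommDump` declaration from this PR's `nccl.h`; `dumpCommOnHang` and the logging are illustrative, not from the linked API):

```cpp
#include <cstdio>
#include <string>
#include <unordered_map>
#include "nccl.h"

// The map is an out-parameter: ncclCommDump() fills it with serialized
// trace records that the caller (e.g. PyTorch's watchdog) can log or upload.
void dumpCommOnHang(ncclComm_t comm) {
  std::unordered_map<std::string, std::string> dump;
  if (ncclCommDump(comm, dump) == ncclSuccess) {
    for (const auto& [key, value] : dump) {
      std::fprintf(stderr, "%s: %s\n", key.c_str(), value.c_str());
    }
  }
}
```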
> I may have missed it, but why not pass in `ncclComm_t` as a const reference rather than passing it by value?
Good point, I was just following the NCCLX implementation, but I don't see why we copy the communicator here. @dmwu @YulunW any idea why the communicator is passed by value in ncclCommDump()?
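One note that may resolve this (based on the public `nccl.h`; worth double-checking against RCCL's header): `ncclComm_t` is an opaque pointer typedef, so passing it by value copies only a pointer, not the communicator state:

```cpp
/* From the public nccl.h: the communicator handle is a pointer to an
 * opaque struct, so passing a ncclComm_t by value copies one pointer. */
typedef struct ncclComm* ncclComm_t;
```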
I've updated the PR to fix issues with thread safety.
Context
PyTorch Distributed's ProcessGroupNCCL has a watchdog thread that detects collective hangs or errors. For Meta's NCCLX and RCCLX, this triggers a custom NCCL API we have defined called `ncclCommDump()`, which prints data from our ProxyTrace. This is useful because we can't rely on the ProxyTrace print in `commFree()`, since PyTorch Distributed does not always clean up all the communicators during NCCL errors. We want to open-source this functionality so that PTD jobs on open-source RCCL get the same behavior.
Implementation Details
- Add a `ncclCommDump()` API that dumps the ProxyTrace contents, similar to our current behavior in `commFree()`
- Add `ncclCommDump()` to `nccl.h` (see the sketch after this list)
- Refactor the ProxyTrace code in `transport/net.cc`:
  - `proxyTraceInit()` -> `ProxyTrace::ProxyTrace()`
  - Make `updateProxyOpCounter()`, `setProxyOpTimestamp()`, `addNewProxyOp()`, and `addNewProxyTraceOpImpl()` private. These will not be called by other RCCL code directly and should not be public (`addNewProxyOp()`)
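A hedged sketch of how the public declaration might look in `nccl.h`, mirroring the definition shown in the diff above (the `#ifdef __cplusplus` guard is an assumption, motivated by the STL types in the signature):

```cpp
#ifdef __cplusplus
#include <string>
#include <unordered_map>

// Dump structured ProxyTrace data for `comm` into `map` (key -> serialized
// value). C++-only entry point, since the signature uses STL containers.
ncclResult_t ncclCommDump(
    const ncclComm_t comm,
    std::unordered_map<std::string, std::string>& map);
#endif
```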
Testing
Testing these changes requires a multi-node setup. We did this internally with a 2-node PyTorch Distributed job, as well as a 2-node MPI test. We have a somewhat convoluted Python script for driving this, but the final run arguments look like this:
This is what the logs look like at the end when the communicator is dumped.
We also pass the unit tests.
References
ProcessGroupNCCL dump call: https://github.com/pytorch/pytorch/blob/fcc78410a8e51107a7f4a15431e57da137741aee/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L400-L412