[RFE]: Reduce unnecessary CUDA memory allocation

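I noticed that the code at https://github.com/NVIDIA/nccl/blob/master/src/init.cc#L1308 is always executed, but `shared_net_buffer` is only used when `comm->nNodes > 1`. Does this waste CUDA memory for single-node communicators? If so, could the allocation be skipped when `comm->nNodes == 1`?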
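
To make the request concrete, here is a rough sketch of the kind of guard I have in mind. It is not NCCL code: `Comm`, `sharedNetBuffer`, `initSharedNetBuffer`, and `freeSharedNetBuffer` are hypothetical names made up for illustration, and `std::malloc` stands in for the real device allocation. Only the `nNodes > 1` condition comes from the observation above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for the communicator. Only nNodes mirrors the
// field mentioned in this issue; the other members are illustrative.
struct Comm {
  int nNodes = 1;
  void* sharedNetBuffer = nullptr;
  std::size_t sharedNetBufferSize = 0;
};

// Hypothetical helper: allocate the shared buffer only when the
// communicator spans more than one node. std::malloc is a placeholder
// for the real CUDA allocation.
bool initSharedNetBuffer(Comm* comm, std::size_t size) {
  assert(comm != nullptr);
  if (comm->nNodes <= 1) {
    return true;  // single node: the buffer is never used, so skip it
  }
  comm->sharedNetBuffer = std::malloc(size);
  if (comm->sharedNetBuffer == nullptr) {
    return false;  // allocation failed
  }
  comm->sharedNetBufferSize = size;
  return true;
}

// Hypothetical cleanup; safe to call even if nothing was allocated,
// because std::free(nullptr) is a no-op.
void freeSharedNetBuffer(Comm* comm) {
  assert(comm != nullptr);
  std::free(comm->sharedNetBuffer);
  comm->sharedNetBuffer = nullptr;
  comm->sharedNetBufferSize = 0;
}
```

With a guard like this, a single-node communicator would never hold the buffer, while multi-node communicators would behave as before. Any code that frees or dereferences the buffer would need to tolerate it being null.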