-
Notifications
You must be signed in to change notification settings - Fork 79
Description
Hello, my code is running in the k8s environment. I started pytorch in two pods and tried to use torchrpc , but I encountered an error in the torch.distributed.rpc.init_rpc function. Hope to get some advice or inspiration.
Code
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='Test torch rpc')
parser.add_argument('--rank', type=int)
parser.add_argument('--init_method', type=str)
rank = args.rank
init_method = args.init_method
rpc_backend_options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8, init_method=init_method)
if rank == 1:
rpc_backend_options.set_device_map("B", {0: 0})
name = "A"
elif rank == 0:
rpc_backend_options.set_device_map("A", {0: 0})
name = "B"
logging.debug(torch.__version__)
logging.debug("name:{}, rank:{}, init_method:{}".format(name, rank, init_method))
rpc.init_rpc(name, rank=rank, world_size=world_size, rpc_backend_options=rpc_backend_options)
Error
I add TP_VERBOSE_LOGGING so it print tensorpipe debug info.
It can be seen that the fatal error is IBV_WC_RETRY_EXC_ERR: Transport Retry Counter Exceeded. This means that the remote side didn't send any Ack or Nack. If this happens after sending the first message, usually it means that the remote QP isn't available anymore. But in fact the other process does not report an error, but waits until the init_rpc times out.
Here is the rank 0 log, name is B:
[09-21 07:24:27] DEBUG 1.12.1 func.py:67
[09-21 07:24:27] DEBUG name:B, rank:0, init_method:tcp://k8snccltest-serial-0.di-k8snccltest.xlab.svc.k8s.cluster:50516 func.py:67
V0921 07:24:29.916767 89 tensorpipe/core/context_impl.cc:53] Context 89:c0 created
V0921 07:24:29.916805 89 tensorpipe/core/context_impl.cc:55] Context 89:c0 aliased as B
V0921 07:24:29.930452 89 tensorpipe/core/context_impl.cc:81] Context B is registering transport ibv
V0921 07:24:29.930539 89 tensorpipe/core/context_impl.cc:81] Context B is registering transport uv
V0921 07:24:29.932656 89 tensorpipe/core/context_impl.cc:81] Context B is registering transport shm
V0921 07:24:29.932819 89 tensorpipe/core/context_impl.cc:104] Context B is registering channel cuda_basic
V0921 07:24:29.932829 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cuda_basic
V0921 07:24:29.932836 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cuda_basic.cpu
V0921 07:24:29.935908 89 tensorpipe/core/context_impl.cc:104] Context B is registering channel cuda_xth
V0921 07:24:29.935921 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cuda_xth
V0921 07:24:29.935981 89 tensorpipe/channel/cma/context_impl.cc:225] Couldn't detect the active Linux Security Modules
V0921 07:24:29.936031 89 tensorpipe/channel/cma/context_impl.cc:253] YAMA ptrace scope set to classic ptrace permissions
V0921 07:24:29.936044 89 tensorpipe/channel/cma/context_impl.cc:296] The domain descriptor for CMA is cma:6783c3f0-b9c5-408b-80b7-9fd2287768d8_1000d9_f0001d4c_1000d9_effffffd_0_0_a80425fb
V0921 07:24:29.936110 89 tensorpipe/core/context_impl.cc:104] Context B is registering channel cma
V0921 07:24:29.936117 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cma
V0921 07:24:29.964395 89 tensorpipe/channel/cuda_ipc/context_impl.cc:254] The UUIDs of all the GPUs found by the CUDA IPC channel are 53dd857c-914a-4f99-7bcc-0af7e3edc566
V0921 07:24:29.964409 89 tensorpipe/channel/cuda_ipc/context_impl.cc:256] The peer-to-peer support found by the CUDA IPC channel is {{1}}
V0921 07:24:29.964435 89 tensorpipe/core/context_impl.cc:104] Context B is registering channel cuda_ipc
V0921 07:24:29.964445 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_cuda_ipc
V0921 07:24:29.964457 89 tensorpipe/core/context_impl.cc:104] Context B is registering channel basic
V0921 07:24:29.964463 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_basic
V0921 07:24:29.995572 89 tensorpipe/core/context_impl.cc:104] Context B is registering channel mpt_uv
V0921 07:24:29.995585 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to B.ch_mpt_uv
V0921 07:24:29.995705 89 tensorpipe/channel/cuda_gdr/context_impl.cc:479] CUDA GDR channel is not viable because the nv_peer_mem kernel module isn't active
V0921 07:24:29.995721 89 tensorpipe/core/context_impl.cc:116] Context B is opening listener B[l0]
V0921 07:24:29.996051 89 tensorpipe/core/listener_impl.cc:269] Listener B[l0] is accepting connection on transport shm
V0921 07:24:29.996061 89 tensorpipe/core/listener_impl.cc:269] Listener B[l0] is accepting connection on transport ibv
V0921 07:24:29.996065 89 tensorpipe/core/listener_impl.cc:269] Listener B[l0] is accepting connection on transport uv
V0921 07:24:30.330784 89 tensorpipe/core/listener_impl.cc:102] Listener B[l0] received an accept request (#0)
V0921 07:24:30.331219 89 tensorpipe/core/context_impl.cc:128] Context B is opening pipe B.p0
V0921 07:24:30.331241 89 tensorpipe/core/context_impl.cc:132] Pipe B.p0 aliased as B_to_A
V0921 07:24:30.331300 89 tensorpipe/core/pipe_impl.cc:286] Pipe B_to_A is writing nop object (spontaneous connection)
V0921 07:24:30.331342 89 tensorpipe/core/pipe_impl.cc:311] Pipe B_to_A is writing nop object (brochure)
V0921 07:24:30.331349 89 tensorpipe/core/pipe_impl.cc:319] Pipe B_to_A is reading nop object (brochure answer)
V0921 07:24:30.331715 89 tensorpipe/core/pipe_impl.cc:290] Pipe B_to_A done writing nop object (spontaneous connection)
V0921 07:24:30.331771 89 tensorpipe/core/pipe_impl.cc:314] Pipe B_to_A done writing nop object (brochure)
V0921 07:24:30.332378 89 tensorpipe/core/pipe_impl.cc:322] Pipe B_to_A done reading nop object (brochure answer)
V0921 07:24:30.332393 89 tensorpipe/core/pipe_impl.cc:1113] Pipe B_to_A is opening connection (descriptor, as replacement)
V0921 07:24:30.332434 89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332448 89 tensorpipe/core/pipe_impl.cc:1131] Pipe B_to_A is opening connection (descriptor_reply)
V0921 07:24:30.332456 89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332527 89 tensorpipe/core/pipe_impl.cc:1186] Pipe B_to_A is opening connection 0/2 (for channel cuda_basic)
V0921 07:24:30.332538 89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332542 89 tensorpipe/core/pipe_impl.cc:1186] Pipe B_to_A is opening connection 1/2 (for channel cuda_basic)
V0921 07:24:30.332550 89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332565 89 tensorpipe/channel/context_impl_boilerplate.h:125] Channel context B.ch_cuda_basic.cpu is opening channel B.ch_cuda_basic.cpu.c0
V0921 07:24:30.332585 89 tensorpipe/channel/context_impl_boilerplate.h:125] Channel context B.ch_cuda_basic is opening channel B.ch_cuda_basic.c0
V0921 07:24:30.332608 89 tensorpipe/channel/channel_impl_boilerplate.h:250] Channel B.ch_cuda_basic.c0 was renamed to B_to_A.ch_cuda_basic
V0921 07:24:30.332622 89 tensorpipe/channel/channel_impl_boilerplate.h:250] Channel B.ch_cuda_basic.cpu.c0 was renamed to B_to_A.ch_cuda_basic.cpu
V0921 07:24:30.332631 89 tensorpipe/core/pipe_impl.cc:1186] Pipe B_to_A is opening connection 0/1 (for channel mpt_uv)
V0921 07:24:30.332638 89 tensorpipe/core/pipe_impl.cc:1217] Pipe B_to_A is writing nop object (requested connection)
V0921 07:24:30.332649 89 tensorpipe/channel/context_impl_boilerplate.h:125] Channel context B.ch_mpt_uv is opening channel B.ch_mpt_uv.c0
V0921 07:24:30.332680 89 tensorpipe/channel/channel_impl_boilerplate.h:250] Channel B.ch_mpt_uv.c0 was renamed to B_to_A.ch_mpt_uv
V0921 07:24:30.343321 89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:30.343622 89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:30.343782 89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:30.343939 89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:30.345877 89 tensorpipe/core/pipe_impl.cc:1221] Pipe B_to_A done writing nop object (requested connection)
V0921 07:24:31.525355 89 tensorpipe/core/pipe_impl.cc:550] Pipe B_to_A received a write request (#0, contaning 4 payloads and 0 tensors)
V0921 07:24:31.525398 89 tensorpipe/core/pipe_impl.cc:925] Pipe B_to_A is writing nop object (message descriptor #0)
V0921 07:24:31.525412 89 tensorpipe/core/pipe_impl.cc:942] Pipe B_to_A is writing payloads of message #0
V0921 07:24:31.525415 89 tensorpipe/core/pipe_impl.cc:948] Pipe B_to_A is writing payload #0.0
V0921 07:24:31.525421 89 tensorpipe/core/pipe_impl.cc:948] Pipe B_to_A is writing payload #0.1
V0921 07:24:31.525425 89 tensorpipe/core/pipe_impl.cc:948] Pipe B_to_A is writing payload #0.2
V0921 07:24:31.525428 89 tensorpipe/core/pipe_impl.cc:948] Pipe B_to_A is writing payload #0.3
V0921 07:24:31.525434 89 tensorpipe/core/pipe_impl.cc:880] Pipe B_to_A is sending tensors of message #0
V0921 07:24:31.525495 89 tensorpipe/core/pipe_impl.cc:931] Pipe B_to_A done writing nop object (message descriptor #0)
V0921 07:24:31.525575 89 tensorpipe/core/pipe_impl.cc:954] Pipe B_to_A done writing payload #0.0
V0921 07:24:31.525598 89 tensorpipe/core/pipe_impl.cc:954] Pipe B_to_A done writing payload #0.1
V0921 07:24:31.525622 89 tensorpipe/core/pipe_impl.cc:954] Pipe B_to_A done writing payload #0.2
V0921 07:24:31.525634 89 tensorpipe/core/pipe_impl.cc:954] Pipe B_to_A done writing payload #0.3
V0921 07:24:31.525640 89 tensorpipe/core/pipe_impl.cc:557] Pipe B_to_A is calling a write callback (#0)
V0921 07:24:31.525671 89 tensorpipe/core/pipe_impl.cc:560] Pipe B_to_A done calling a write callback (#0)
V0921 07:24:31.525692 89 tensorpipe/core/pipe_impl.cc:376] Pipe B_to_A received a readDescriptor request (#0)
V0921 07:24:31.525707 89 tensorpipe/core/pipe_impl.cc:845] Pipe B_to_A is reading nop object (message descriptor #0)
V0921 07:24:34.575529 89 tensorpipe/channel/channel_impl_boilerplate.h:283] Channel B_to_A.ch_mpt_uv is handling error transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
V0921 07:24:34.576221 89 tensorpipe/core/pipe_impl.cc:634] Pipe B_to_A is handling error transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
V0921 07:24:34.576272 89 tensorpipe/channel/channel_impl_boilerplate.h:264] Channel B_to_A.ch_mpt_uv is closing
V0921 07:24:34.576319 89 tensorpipe/channel/channel_impl_boilerplate.h:264] Channel B_to_A.ch_cuda_basic is closing
V0921 07:24:34.576328 89 tensorpipe/channel/channel_impl_boilerplate.h:283] Channel B_to_A.ch_cuda_basic is handling error channel closed (this error originated at tensorpipe/channel/channel_impl_boilerplate.h:265)
V0921 07:24:34.576354 89 tensorpipe/channel/channel_impl_boilerplate.h:264] Channel B_to_A.ch_cuda_basic.cpu is closing
V0921 07:24:34.576362 89 tensorpipe/channel/channel_impl_boilerplate.h:283] Channel B_to_A.ch_cuda_basic.cpu is handling error channel closed (this error originated at tensorpipe/channel/channel_impl_boilerplate.h:265)
V0921 07:24:34.576381 89 tensorpipe/core/pipe_impl.cc:849] Pipe B_to_A done reading nop object (message descriptor #0)
V0921 07:24:34.576391 89 tensorpipe/core/pipe_impl.cc:382] Pipe B_to_A is calling a readDescriptor callback (#0)
[W tensorpipe_agent.cpp:940] RPC agent for B encountered error when reading incoming response from A: transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
V0921 07:24:34.576442 89 tensorpipe/core/pipe_impl.cc:385] Pipe B_to_A done calling a readDescriptor callback (#0)
Traceback (most recent call last):
File "test_dist.py", line 470, in <module>
setup(int(args.rank), 2, False, str(args.init_method))
File "test_dist.py", line 425, in setup
rpc.init_rpc(name, rank=rank, world_size=world_size, rpc_backend_options=rpc_backend_options)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 196, in init_rpc
_init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/__init__.py", line 237, in _init_rpc_backend
rpc_backend_options=rpc_backend_options,
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/backend_registry.py", line 101, in init_backend
return backend.value.init_backend_handler(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/backend_registry.py", line 359, in _tensorpipe_init_backend_handler
api._all_gather(None, timeout=rpc_backend_options.rpc_timeout)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 83, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 226, in _all_gather
timeout=rpc_timeout,
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 83, in wrapper
return func(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 800, in rpc_sync
return fut.wait()
RuntimeError: transport retry counter exceeded (this error originated at tensorpipe/transport/ibv/connection_impl.cc:478)
V0921 07:24:34.744256 89 tensorpipe/core/context_impl.cc:220] Context B is closing
V0921 07:24:34.744301 89 tensorpipe/core/context_impl.cc:238] Context B is handling error context closed (this error originated at tensorpipe/core/context_impl.cc:221)
Here is the rank 1 log, name is A:
tcp://k8snccltest-serial-0.di-k8snccltest.xlab.svc.k8s.cluster:50516
[09-21 09:01:44] DEBUG not set NCCL_IB_GID_INDEX func.py:67
[09-21 09:01:44] DEBUG 1.12.1 func.py:67
[09-21 09:01:44] DEBUG name:A, rank:1, init_method:tcp://k8snccltest-serial-0.di-k8snccltest.xlab.svc.k8s.cluster:50516 func.py:67
V0921 09:01:44.481227 89 tensorpipe/core/context_impl.cc:53] Context 89:c0 created
V0921 09:01:44.481273 89 tensorpipe/core/context_impl.cc:55] Context 89:c0 aliased as A
V0921 09:01:44.494768 89 tensorpipe/core/context_impl.cc:81] Context A is registering transport ibv
V0921 09:01:44.494844 89 tensorpipe/core/context_impl.cc:81] Context A is registering transport uv
V0921 09:01:44.497126 89 tensorpipe/core/context_impl.cc:81] Context A is registering transport shm
V0921 09:01:44.497258 89 tensorpipe/core/context_impl.cc:104] Context A is registering channel cuda_basic
V0921 09:01:44.497267 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cuda_basic
V0921 09:01:44.497273 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cuda_basic.cpu
V0921 09:01:44.502026 89 tensorpipe/core/context_impl.cc:104] Context A is registering channel cuda_xth
V0921 09:01:44.502036 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cuda_xth
V0921 09:01:44.502157 89 tensorpipe/channel/cma/context_impl.cc:225] Couldn't detect the active Linux Security Modules
V0921 09:01:44.502223 89 tensorpipe/channel/cma/context_impl.cc:253] YAMA ptrace scope set to classic ptrace permissions
V0921 09:01:44.502234 89 tensorpipe/channel/cma/context_impl.cc:296] The domain descriptor for CMA is cma:3d1e04ad-0931-4975-b822-467e2b3b1cee_100085_f000197b_100085_effffffd_0_0_a80425fb
V0921 09:01:44.502283 89 tensorpipe/core/context_impl.cc:104] Context A is registering channel cma
V0921 09:01:44.502290 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cma
V0921 09:01:44.538451 89 tensorpipe/channel/cuda_ipc/context_impl.cc:254] The UUIDs of all the GPUs found by the CUDA IPC channel are 19071872-e206-4660-2142-9651e3f148a8
V0921 09:01:44.538464 89 tensorpipe/channel/cuda_ipc/context_impl.cc:256] The peer-to-peer support found by the CUDA IPC channel is {{1}}
V0921 09:01:44.538490 89 tensorpipe/core/context_impl.cc:104] Context A is registering channel cuda_ipc
V0921 09:01:44.538497 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_cuda_ipc
V0921 09:01:44.538509 89 tensorpipe/core/context_impl.cc:104] Context A is registering channel basic
V0921 09:01:44.538515 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_basic
V0921 09:01:44.580634 89 tensorpipe/core/context_impl.cc:104] Context A is registering channel mpt_uv
V0921 09:01:44.580645 89 tensorpipe/channel/context_impl_boilerplate.h:191] Channel context N/A was renamed to A.ch_mpt_uv
V0921 09:01:44.580795 89 tensorpipe/channel/cuda_gdr/context_impl.cc:479] CUDA GDR channel is not viable because the nv_peer_mem kernel module isn't active
V0921 09:01:44.580809 89 tensorpipe/core/context_impl.cc:116] Context A is opening listener A[l0]
V0921 09:01:44.581122 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport shm
V0921 09:01:44.581132 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:44.581136 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport uv
V0921 09:01:44.754984 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport uv
V0921 09:01:44.755039 89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:44.755054 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport uv
V0921 09:01:44.755066 89 tensorpipe/core/listener_impl.cc:102] Listener A[l0] received an accept request (#0)
V0921 09:01:44.755142 89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:44.755171 89 tensorpipe/core/listener_impl.cc:293] Listener A[l0] got spontaneous connection
V0921 09:01:44.755259 89 tensorpipe/core/listener_impl.cc:295] Listener A[l0] is opening pipe A[l0].p0
V0921 09:01:44.755266 89 tensorpipe/core/listener_impl.cc:299] Pipe A[l0].p0 aliased as A[l0]_from_B
V0921 09:01:44.755295 89 tensorpipe/core/pipe_impl.cc:332] Pipe A[l0]_from_B is reading nop object (brochure)
V0921 09:01:44.755305 89 tensorpipe/core/listener_impl.cc:108] Listener A[l0] is calling an accept callback (#0)
V0921 09:01:44.755323 89 tensorpipe/core/listener_impl.cc:111] Listener A[l0] done calling an accept callback (#0)
V0921 09:01:44.755331 89 tensorpipe/core/listener_impl.cc:102] Listener A[l0] received an accept request (#1)
V0921 09:01:44.755341 89 tensorpipe/core/pipe_impl.cc:376] Pipe A[l0]_from_B received a readDescriptor request (#0)
V0921 09:01:44.755433 89 tensorpipe/core/pipe_impl.cc:335] Pipe A[l0]_from_B done reading nop object (brochure)
V0921 09:01:44.755483 89 tensorpipe/core/pipe_impl.cc:1046] Pipe A[l0]_from_B is requesting connection (as replacement)
V0921 09:01:44.755557 89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#0)
V0921 09:01:44.755566 89 tensorpipe/core/pipe_impl.cc:1046] Pipe A[l0]_from_B is requesting connection (as replacement)
V0921 09:01:44.755571 89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#1)
V0921 09:01:44.755706 89 tensorpipe/core/pipe_impl.cc:1073] Pipe A[l0]_from_B is requesting connection 0/2 (for channel cuda_basic)
V0921 09:01:44.755761 89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#2)
V0921 09:01:44.755765 89 tensorpipe/core/pipe_impl.cc:1073] Pipe A[l0]_from_B is requesting connection 1/2 (for channel cuda_basic)
V0921 09:01:44.755768 89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#3)
V0921 09:01:44.755774 89 tensorpipe/core/pipe_impl.cc:1073] Pipe A[l0]_from_B is requesting connection 0/1 (for channel mpt_uv)
V0921 09:01:44.755777 89 tensorpipe/core/listener_impl.cc:154] Listener A[l0] received a connection request registration (#4)
V0921 09:01:44.755781 89 tensorpipe/core/pipe_impl.cc:1028] Pipe A[l0]_from_B is writing nop object (brochure answer)
V0921 09:01:44.755965 89 tensorpipe/core/pipe_impl.cc:1031] Pipe A[l0]_from_B done writing nop object (brochure answer)
V0921 09:01:45.056188 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.056226 89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.056237 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:45.256277 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.256319 89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.256326 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:45.259347 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.259359 89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.259363 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:45.261365 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.261375 89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.261379 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:45.263591 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:45.263600 89 tensorpipe/core/listener_impl.cc:241] Listener A[l0] is reading nop object (spontaneous or requested connection)
V0921 09:01:45.263604 89 tensorpipe/core/listener_impl.cc:269] Listener A[l0] is accepting connection on transport ibv
V0921 09:01:49.328515 89 tensorpipe/core/listener_impl.cc:207] Listener A[l0] is handling error eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0921 09:01:49.328582 89 tensorpipe/core/listener_impl.cc:108] Listener A[l0] is calling an accept callback (#1)
[W tensorpipe_agent.cpp:530] RPC agent for A encountered error when accepting incoming pipe: eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0921 09:01:49.328604 89 tensorpipe/core/listener_impl.cc:111] Listener A[l0] done calling an accept callback (#1)
V0921 09:01:49.328610 89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#4)
V0921 09:01:49.328617 89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#4)
V0921 09:01:49.328621 89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#3)
V0921 09:01:49.328624 89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#3)
V0921 09:01:49.328628 89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#2)
V0921 09:01:49.328631 89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#2)
V0921 09:01:49.328634 89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#0)
V0921 09:01:49.328639 89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#0)
V0921 09:01:49.328642 89 tensorpipe/core/listener_impl.cc:162] Listener A[l0] is calling a connection request registration callback (#1)
V0921 09:01:49.328658 89 tensorpipe/core/listener_impl.cc:166] Listener A[l0] done calling a connection request registration callback (#1)
V0921 09:01:49.328689 89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:49.328698 89 tensorpipe/core/pipe_impl.cc:634] Pipe A[l0]_from_B is handling error eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0921 09:01:49.328707 89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#1)
V0921 09:01:49.328712 89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#0)
V0921 09:01:49.328716 89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#4)
V0921 09:01:49.328720 89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#2)
V0921 09:01:49.328726 89 tensorpipe/core/listener_impl.cc:183] Listener A[l0] received a connection request de-registration (#3)
V0921 09:01:49.328733 89 tensorpipe/core/pipe_impl.cc:382] Pipe A[l0]_from_B is calling a readDescriptor callback (#0)
[W tensorpipe_agent.cpp:726] RPC agent for A encountered error when reading incoming request from B: eof (this error originated at tensorpipe/transport/ibv/connection_impl.cc:302)
V0921 09:01:49.328756 89 tensorpipe/core/pipe_impl.cc:385] Pipe A[l0]_from_B done calling a readDescriptor callback (#0)
V0921 09:01:49.328765 89 tensorpipe/core/pipe_impl.cc:1081] Pipe A[l0]_from_B done requesting connection 0/1 (for channel mpt_uv)
V0921 09:01:49.328770 89 tensorpipe/core/pipe_impl.cc:1081] Pipe A[l0]_from_B done requesting connection 1/2 (for channel cuda_basic)
V0921 09:01:49.328776 89 tensorpipe/core/pipe_impl.cc:1081] Pipe A[l0]_from_B done requesting connection 0/2 (for channel cuda_basic)
V0921 09:01:49.328780 89 tensorpipe/core/pipe_impl.cc:1052] Pipe A[l0]_from_B done requesting connection (as replacement)
V0921 09:01:49.328787 89 tensorpipe/core/pipe_impl.cc:1052] Pipe A[l0]_from_B done requesting connection (as replacement)
V0921 09:01:49.328790 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport uv
V0921 09:01:49.328797 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport shm
V0921 09:01:49.328802 89 tensorpipe/core/pipe_impl.cc:355] Pipe A[l0]_from_B is closing
V0921 09:01:49.329226 89 tensorpipe/core/listener_impl.cc:275] Listener A[l0] done accepting connection on transport ibv
V0921 09:01:49.329238 89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:49.329491 89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:49.329795 89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
V0921 09:01:49.330076 89 tensorpipe/core/listener_impl.cc:248] Listener A[l0] done reading nop object (spontaneous or requested connection)
I found issue#452 very similar to my problem , but my network environment is infiniband, not ROCE. And checked the LID and GID of the running environment, as follows:
//ibstat
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.28.2006
Hardware version: 0
Node GUID: 0x043f720300fc827c
System image GUID: 0x043f720300fc827c
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 141
LMC: 0
SM lid: 1
Capability mask: 0x2651e848
Port GUID: 0x043f720300fc827c
Link layer: InfiniBand
CA 'mlx5_1'
CA type: MT4123
Number of ports: 1
Firmware version: 20.28.2006
Hardware version: 0
Node GUID: 0x043f720300fc8288
System image GUID: 0x043f720300fc8288
Port 1:
State: Down
Physical state: Disabled
Rate: 10
Base lid: 65535
LMC: 0
SM lid: 0
Capability mask: 0x2651e848
Port GUID: 0x043f720300fc8288
Link layer: InfiniBand
//show_gids
DEV PORT INDEX GID IPv4 VER DEV
--- ---- ----- --- ------------ --- ---
mlx5_0 1 0 fe80:0000:0000:0000:043f:7203:00fc:827c v1
mlx5_1 1 0 fe80:0000:0000:0000:043f:7203:00fc:8288 v1
Because LID (Local ID) is a layer 2 attribute in the infiniband protocol stack, this attribute does not need to be set in the RoCE network. When this port is queried, this field is 0. And I read the src code of tensorpipe, if LID is not 0, GID index has no effect on qp_attr . So I think my error may be not caused by kGlobalIdentifierIndex.
tensorpipe/common/ibv.cc:117
// Global routing is only set up as far as needed to support RoCE.
attrMask |= IbvLib::QP_AV;
if (destinationInfo.localIdentifier != 0) {
attr.ah_attr.is_global = 0;
attr.ah_attr.dlid = destinationInfo.localIdentifier;
} else {
attr.ah_attr.is_global = 1;
attr.ah_attr.grh.dgid = destinationInfo.globalIdentifier;
attr.ah_attr.grh.sgid_index = selfAddr.globalIdentifierIndex;
attr.ah_attr.grh.hop_limit = 1;
}
In addition, I used pytorch's NCCL init_process_group API in the same environment and did not reproduce the above error.
And if I start two processes in the same Pod and run the same torch rpc code, everything works fine.
Container Env Versions
pytorch 1.12.1 py3.8_cuda11.3_cudnn8.3.2_0 pytorch
pytorch-mutex 1.0 cuda pytorch
torchaudio 0.12.1 py38_cu113 pytorch
torchelastic 0.2.1 pypi_0 pypi
torchvision 0.13.1 py38_cu113 pytorch
MLNX_OFED_LINUX-4.6-1.0.1.1
ubuntu 18.04