-
Python version (
-
@ZiyueXu77 can you help check this and see if the timeout is due to the model size?
-
I tested and can confirm the issue. @YuanTingHsieh, I think this is related to the quantization issue we are investigating regarding the external process launcher. I can consistently observe the following before the system errors out: I replaced the external launcher with the in-process executor and the problem is gone on my side. @oded-byte, could you also try a client config like the one below to see if it works?
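The config attached to this reply is not reproduced above. As a rough sketch of what an in-process Client API executor entry in config_fed_client.json can look like (the class path, the argument names, and the script path src/train_script.py are assumptions based on typical NVFlare Client API job templates, not taken from this thread):

```json
{
  "tasks": ["train"],
  "executor": {
    "path": "nvflare.app_opt.pt.in_process_client_api_executor.PTInProcessClientAPIExecutor",
    "args": {
      "task_script_path": "src/train_script.py",
      "task_script_args": ""
    }
  }
}
```

With an in-process executor the training script runs inside the client process, so the external launcher and pipe components are no longer involved; verify the exact class path and arguments against the NVFlare version you are running.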
-
Hi,
-
Hi @oded-byte, we found the root cause: it is these two timeouts, peer_read_timeout (line 36) and heartbeat_timeout (line 40) in https://github.com/NVIDIA/NVFlare/blob/main/nvflare/app_opt/pt/client_api_launcher_executor.py#L36-L40. Setting both to 300 makes the issue go away on my side. You can test and adjust the values on your machine, since each machine's speed is different; as you noticed, a faster machine can work with the defaults. Thanks for noticing and raising this! We will update our APIs accordingly and figure out a good way to have these timeouts set properly.
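For reference, a minimal sketch of the corresponding executor entry in config_fed_client.json with both timeouts raised to 300 seconds; apart from peer_read_timeout and heartbeat_timeout, the class path and the other arguments (launcher_id, pipe_id) are assumptions based on typical Client API launcher job templates and may differ in your job:

```json
{
  "tasks": ["train"],
  "executor": {
    "path": "nvflare.app_opt.pt.client_api_launcher_executor.PTClientAPILauncherExecutor",
    "args": {
      "launcher_id": "launcher",
      "pipe_id": "pipe",
      "peer_read_timeout": 300,
      "heartbeat_timeout": 300
    }
  }
}
```

The launcher and pipe components referenced by launcher_id and pipe_id stay as they are in the existing config; only the two timeout values change relative to the defaults.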
-
@oded-byte we increased the default timeouts for the main branch in #3671, something like: