Commit 9f18581

deep copy state dict for checkpoint
Summary: Deep copy the state dict before sending a checkpoint. If the replica moves on to the next step, the state dict can change before the checkpoint has been sent.
1 parent fef4abc commit 9f18581

1 file changed: +2 -1 lines changed

torchft/manager.py

Lines changed: 2 additions & 1 deletion
@@ -26,6 +26,7 @@
 """
 
 import concurrent.futures
+import copy
 import logging
 import os
 import socket
@@ -646,7 +647,7 @@ def _async_quorum(
 self._checkpoint_transport.send_checkpoint(
     dst_ranks=quorum.recover_dst_replica_ranks,
     step=max_step,
-    state_dict=self._manager_state_dict(),
+    state_dict=copy.deepcopy(self._manager_state_dict()),
     timeout=self._timeout,
 )
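
The race described in the summary can be reproduced in isolation. The sketch below is illustrative only and is not torchft code: `DeferredTransport`, its `flush` method, and `run` are hypothetical names, and a plain dict of Python lists stands in for the real state dict. It assumes a transport that keeps a reference to the state dict and serializes it later, which is the situation where mutation between the call and the actual send becomes visible.

```python
import copy


class DeferredTransport:
    """Hypothetical stand-in for a transport that sends some time after the call."""

    def __init__(self):
        self._pending = None
        self.sent = None

    def send_checkpoint(self, step, state_dict):
        # Only a reference is stored here; nothing is serialized yet.
        self._pending = (step, state_dict)

    def flush(self):
        # The "real" send happens now, using whatever the reference points to.
        self.sent = self._pending


def run(snapshot):
    state = {"step": 10, "weights": [1.0, 2.0]}
    transport = DeferredTransport()

    # With snapshot=True the transport gets an independent deep copy.
    payload = copy.deepcopy(state) if snapshot else state
    transport.send_checkpoint(step=state["step"], state_dict=payload)

    # The replica moves on to the next step before the send completes.
    state["step"] += 1
    state["weights"][0] += 0.5

    transport.flush()
    return transport.sent[1]


shared = run(snapshot=False)
copied = run(snapshot=True)
print(shared)  # {'step': 11, 'weights': [1.5, 2.0]}
print(copied)  # {'step': 10, 'weights': [1.0, 2.0]}
```

Without the copy, the receiver sees step 11 data labelled as a step 10 checkpoint. A shallow copy would not be enough in this sketch, because the nested `weights` list would still be shared. The trade-off is the extra memory and time needed to duplicate the state on every send.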
