[Q&A] Best way to perform checkpointing in a production system #3731
Python version (
Replies: 2 comments 2 replies
Hi @virginiafdez, let me try to answer.
With the simulator, you need to define a new working directory for each run to avoid overwriting the previous output. However, you don't need to extract anything from job storage, because the simulator keeps the raw training artifacts after the job completes (unlike a production environment). A more general "resume job" feature could be an interesting enhancement in the future @chesterxgchen @yanchengnv.
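To illustrate the "new working directory per run" point, here is a minimal sketch in plain Python. The helper name `new_workspace` and the `run-<timestamp>-` naming scheme are assumptions made for illustration and are not part of NVFlare; the returned path is simply what you would hand to the simulator as its workspace (for example via the `-w` option of `nvflare simulator`).

```python
import tempfile
from datetime import datetime
from pathlib import Path


def new_workspace(root: Path, prefix: str = "run") -> Path:
    """Create and return a fresh, uniquely named directory under `root`.

    Hypothetical helper: it only guarantees that every call yields a new
    directory, so an earlier run's output is never overwritten.
    """
    root.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    # mkdtemp appends a random suffix, so two runs started within the
    # same second still end up in distinct directories.
    return Path(tempfile.mkdtemp(prefix=f"{prefix}-{stamp}-", dir=str(root)))
```

Each call returns a different directory, so the artifacts of earlier simulator runs stay in place and can be inspected or reused later.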
Hi @virginiafdez, let me try to answer.
Yes, the persistor can be configured to read checkpoints from disk. NVFlare doesn't manage existing pretrained checkpoints, so you could use a static mount on the server and load them from there. You can also define different apps for the server and the clients, so that only the server app contains the checkpoint if you want to upload it as part of the job. See a meta.json with a deployment map for that.
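As a rough sketch of the deployment map mentioned above, the snippet below writes a `meta.json` that sends one app to the server and another to the clients. The app folder names `app_server` and `app_client`, the site names, and the helper `write_meta` are assumptions for illustration; adjust them to your own job layout.

```python
import json
from pathlib import Path
from typing import List


def write_meta(job_dir: Path, client_sites: List[str]) -> dict:
    """Write a meta.json whose deploy_map separates server and client apps.

    Hypothetical helper; folder and site names are placeholders.
    """
    meta = {
        "name": job_dir.name,
        "deploy_map": {
            # Only the server receives this app, so a checkpoint shipped
            # inside it is never copied to the client sites.
            "app_server": ["server"],
            "app_client": list(client_sites),
        },
        "min_clients": len(client_sites),
    }
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "meta.json").write_text(json.dumps(meta, indent=2))
    return meta
```

With this layout, the pretrained checkpoint would live somewhere inside the `app_server` folder, and the persistor configuration in that app would point at it.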
We don't have an auto-resume option at the moment. In a production environment, the approach would be to extract the job result from job storage and use its global model as the initialization for the next job. This wouldn't restore all the states…
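The manual hand-over described above could look roughly like this, assuming the previous job's result has already been extracted to local disk. The helper `seed_next_job`, the `models` subfolder, and the `app_server` folder name are all hypothetical; where the persistor actually looks for the file depends on how it is configured in the next job.

```python
import shutil
from pathlib import Path


def seed_next_job(prev_model: Path, next_job_dir: Path,
                  server_app: str = "app_server") -> Path:
    """Copy a previous global model into the next job's server app folder.

    Hypothetical helper: only the model file is carried over, nothing else
    from the previous job's state.
    """
    if not prev_model.is_file():
        raise FileNotFoundError(f"global model not found: {prev_model}")
    dest_dir = next_job_dir / server_app / "models"  # assumed location
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / prev_model.name
    shutil.copy2(prev_model, dest)  # keeps the file's timestamps
    return dest
```

The persistor in the next job would then be pointed at the copied file so that training starts from the previous global model.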