[Q&A] Best way to perform checkpointing in a production system #3731

virginiafdez · 2025-09-30T10:19:31Z

Python version (`python3 -V`)

3.12

NVFlare version (`python3 -m pip list | grep "nvflare"`)

2.5.2

NVFlare branch (if running examples, please use the branch that corresponds to the NVFlare version, `git branch`)

No response

Operating system

Ubuntu 22.04

Have you successfully run any of the following examples?

hello-numpy-sag with simulator
hello-pt with simulator
hello-numpy-sag with POC
hello-pt with POC

Please describe your question

We would like to add the following features to our FL production system:

1. Optionally provide a checkpoint at the beginning of training.
2. Load from a checkpoint if training is stopped and then resumed.

Our system supports running jobs normally, as well as testing apps locally with the nvflare simulator.

I have found some relevant discussions, such as #1973, where point 1 has been addressed. My question is less about how to load the weights within the app, which I assume is done by passing the path to the weights to the persistor, and more about best practice for where these weights should be stored in the server container. I assume it is better not to put them inside the custom app, since the clients would then also receive the weights unnecessarily. Is there a folder dedicated to this type of file in a standard nvflare setup?

2.1 In a production setup, is it possible to resume a training run that was stopped before finishing, i.e. by reusing an app folder that has already been created? Is there a best practice for doing so? This seems relevant: https://nvflare.readthedocs.io/en/2.2/apidocs/nvflare.apis.fl_snapshot.html but I couldn't find any tutorial or example that uses it, so I was unsure.
2.2 Would this even be possible in the simulator? As far as I'm aware, the simulate_job folders are reset when you restart a training run, so I'm not sure whether this can be bypassed.

Thank you for any bit of knowledge on the matter!

Answered by holgerroth

Oct 2, 2025


chesterxgchen · 2025-10-01T00:09:49Z

Maintainer

@ZiyueXu77 @holgerroth @YuanTingHsieh

0 replies

holgerroth · 2025-10-02T22:52:08Z

Maintainer

Hi @virginiafdez, let me try to answer.

1. Yes, the persistor can be configured to read checkpoints from disk. NVFlare doesn't manage existing pretrained checkpoints. You could have a static mount on the server from which they can be loaded. You can also define different apps for the server and the clients, so that only the server app contains the checkpoint if you want to upload it as part of the job. See a meta.json with a deployment map for that.
2. We don't have an auto-resume option at the moment. In a production environment, the approach would be to extract the job result from the job storage and use the global model as the initialization for the next job. This wouldn't restore all the states of the server and clients, but it would allow you to resume training from the last successful global model.
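As a rough illustration of the deployment map mentioned above, here is a minimal sketch that generates a meta.json in which the server and the clients receive different app folders. The job name, the site names, and the folder names `app_server` and `app_client` are assumptions chosen for the example, not something prescribed in this thread.

```python
import json


def build_meta(job_name, client_sites, server_app="app_server", client_app="app_client"):
    """Build a job meta dict that deploys separate apps to the server and the clients.

    The app folder names are placeholders; the server app is the only one
    that would need to ship (or point to) the initial checkpoint.
    """
    return {
        "name": job_name,
        "resource_spec": {},
        "min_clients": len(client_sites),
        "deploy_map": {
            server_app: ["server"],          # only the server gets this app
            client_app: list(client_sites),  # clients get an app without the weights
        },
    }


# Hypothetical job and site names, for illustration only.
meta = build_meta("resume_from_ckpt", ["site-1", "site-2"])
meta_json = json.dumps(meta, indent=2)
print(meta_json)
```

The printed JSON would be saved as meta.json at the top of the job folder, next to the two app folders.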

With the simulator, you need to define a new working directory to prevent overwriting the old output. However, you don't need to extract from job storage, as the simulator saves the raw training artifacts after the job is completed (unlike in a production environment).
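For the simulator case, the two steps described above (reuse the last global model, and point the next run at a fresh working directory) could be scripted roughly as follows. This is a sketch under assumptions: the checkpoint filename `FL_global_model.pt`, the `server/simulate_job/app_server` layout, and the `run_NNN` naming scheme are placeholders and may differ in your setup.

```python
import shutil
import tempfile
from pathlib import Path


def latest_checkpoint(workspace, filename="FL_global_model.pt"):
    """Return the most recently modified file called `filename` under `workspace`, or None."""
    matches = sorted(Path(workspace).rglob(filename), key=lambda p: p.stat().st_mtime)
    return matches[-1] if matches else None


def stage_initial_model(prev_workspace, init_dir, filename="FL_global_model.pt"):
    """Copy the last global model of a previous run to a stable location outside the workspace."""
    ckpt = latest_checkpoint(prev_workspace, filename)
    if ckpt is None:
        return None
    init_dir = Path(init_dir)
    init_dir.mkdir(parents=True, exist_ok=True)
    target = init_dir / filename
    shutil.copy2(ckpt, target)
    return target


def next_workspace(runs_root):
    """Create and return a new, unused run_NNN directory so old outputs are not overwritten."""
    runs_root = Path(runs_root)
    runs_root.mkdir(parents=True, exist_ok=True)
    n = 0
    while (runs_root / f"run_{n:03d}").exists():
        n += 1
    workspace = runs_root / f"run_{n:03d}"
    workspace.mkdir()
    return workspace


# Demo on a throwaway directory that mimics a finished simulator run.
with tempfile.TemporaryDirectory() as tmp:
    prev = Path(tmp) / "runs" / "run_000"
    ckpt_dir = prev / "server" / "simulate_job" / "app_server"  # assumed layout
    ckpt_dir.mkdir(parents=True)
    (ckpt_dir / "FL_global_model.pt").write_bytes(b"fake weights")

    init_path = stage_initial_model(prev, Path(tmp) / "init")
    staged_bytes = init_path.read_bytes()
    new_workspace_name = next_workspace(Path(tmp) / "runs").name
```

The staged file path would then be given to the persistor as the initial checkpoint, and the new directory passed to the simulator as its workspace.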

A more general "resume job" feature could be an interesting enhancement in the future @chesterxgchen @yanchengnv.

2 replies

virginiafdez Oct 3, 2025
Author

Thanks for the answer! I guess it'll be fine for a simple setup.
We will follow this approach for now then!
We were actually looking into FedOpt, for which the state dict of the server's optimizer / scheduler would be cool to keep somewhere as well, not sure if that's already doable.

holgerroth Oct 3, 2025
Maintainer

Yes, you can save the optimizer state dict as part of the global checkpoint. This might need some customization of the persistor.
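One possible shape for such a combined checkpoint is sketched below. The key names (`model`, `optimizer`, `scheduler`, `round`) and both helper functions are hypothetical and not part of NVFlare; a customized persistor would still have to be written to save and load this layout. Plain dicts stand in for real state dicts so the example runs without PyTorch.

```python
# Hypothetical key layout for a combined server-side checkpoint.
MODEL_KEY = "model"
OPTIMIZER_KEY = "optimizer"
SCHEDULER_KEY = "scheduler"
ROUND_KEY = "round"


def bundle_checkpoint(model_state, optimizer_state=None, scheduler_state=None, current_round=0):
    """Combine the global weights with the server optimizer/scheduler state in one dict."""
    ckpt = {MODEL_KEY: model_state, ROUND_KEY: current_round}
    if optimizer_state is not None:
        ckpt[OPTIMIZER_KEY] = optimizer_state
    if scheduler_state is not None:
        ckpt[SCHEDULER_KEY] = scheduler_state
    return ckpt


def unbundle_checkpoint(ckpt):
    """Split a checkpoint into its parts; a bare state dict is treated as weights only."""
    if MODEL_KEY not in ckpt:
        return {MODEL_KEY: ckpt, OPTIMIZER_KEY: None, SCHEDULER_KEY: None, ROUND_KEY: 0}
    return {
        MODEL_KEY: ckpt[MODEL_KEY],
        OPTIMIZER_KEY: ckpt.get(OPTIMIZER_KEY),
        SCHEDULER_KEY: ckpt.get(SCHEDULER_KEY),
        ROUND_KEY: ckpt.get(ROUND_KEY, 0),
    }


# Stand-in values; with PyTorch these would come from model.state_dict(),
# optimizer.state_dict() and scheduler.state_dict().
ckpt = bundle_checkpoint(
    model_state={"fc.weight": [0.1, 0.2]},
    optimizer_state={"state": {}, "param_groups": [{"lr": 0.01}]},
    current_round=7,
)
restored = unbundle_checkpoint(ckpt)
weights_only = unbundle_checkpoint({"fc.weight": [0.1, 0.2]})
```

With PyTorch, the bundled dict could be written with `torch.save` and read back with `torch.load`. Restoring the optimizer state alongside the weights is what lets a FedOpt-style server continue with its momentum and learning-rate schedule instead of starting them from scratch.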
