The EMA weights should be sharded across the cards, so that each card holds only its own slice instead of a full copy. Some references:

- https://github.com/PKU-YuanGroup/Open-Sora-Plan/pull/492
- https://github.com/hpcaitech/Open-Sora/blob/476b6dc79720e5d9ddfb3cd589680b2308871926/scripts/train.py#L229
- https://github.com/microsoft/DeepSpeedExamples/blob/dafeb2b3be3a085214faa2f59a8979c051424938/applications/DeepSpeed-Chat/training/utils/utils.py#L128
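
To make the request concrete, here is a rough sketch of what "sharded EMA" could look like. It is not taken from any of the linked code: `shard_bounds`, `ShardedEMA`, and `gather_full_ema` are hypothetical names, plain Python lists stand in for tensors, and the ranks are simulated in one process. In a real job each slice would presumably be a tensor on the local card and the gather would be a collective such as all-gather.

```python
def shard_bounds(numel, world_size, rank):
    """Return the [start, end) slice of a flat parameter vector owned by `rank`."""
    per_rank = (numel + world_size - 1) // world_size  # ceil division
    start = min(rank * per_rank, numel)
    end = min(start + per_rank, numel)
    return start, end


class ShardedEMA:
    """Hypothetical helper: keeps the EMA only for this rank's slice."""

    def __init__(self, flat_params, world_size, rank, decay=0.999):
        self.decay = decay
        self.start, self.end = shard_bounds(len(flat_params), world_size, rank)
        # Only the local slice is copied, so memory is ~1/world_size of a full EMA.
        self.shard = list(flat_params[self.start:self.end])

    def update(self, flat_params):
        """ema = decay * ema + (1 - decay) * param, applied to the local slice only."""
        d = self.decay
        for i, p in enumerate(flat_params[self.start:self.end]):
            self.shard[i] = d * self.shard[i] + (1.0 - d) * p


def gather_full_ema(emas):
    """Stand-in for an all-gather: concatenate every rank's slice in order."""
    full = []
    for ema in sorted(emas, key=lambda e: e.start):
        full.extend(ema.shard)
    return full
```

With this layout, `update` touches only the local slice at every step, and the full EMA is assembled only when it is actually needed, for example when saving a checkpoint.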