Add support for saving HF format tensors with DCP #1351


Draft

ankitageorge wants to merge 1 commit into main from dcp-hf

ankitageorge commented Jun 27, 2025

If checkpoint.enable_hf_safetensors_format is set, save the checkpoint through the DCP HF components, which write .safetensors files instead of the regular DCP format.

facebook-github-bot added the CLA Signed label


          add hf support

c0c4448

ankitageorge force-pushed the dcp-hf branch from dfc1f38 to c0c4448 Compare

June 27, 2025 18:07

ankitageorge changed the title ~~Dcp hf~~ Add support for saving HF format tensors with DCP

Contributor

fegin commented Jun 27, 2025

@Saiteja64 This will conflict with your PR.

fegin reviewed


Contributor

fegin left a comment

Overall the logic LGTM. Please address the comments and make sure this PR doesn't conflict with the PR from @Saiteja64. Please also add a test result: save an HF checkpoint, load it back, and check the accuracy.
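The requested save/load round-trip check could take roughly the following shape. This is a minimal sketch, not the PR's test: `roundtrip_matches`, `fake_save`, and `fake_load` are hypothetical names, and the in-memory store stands in for the real DCP save and load calls so the snippet runs without a distributed setup.

```python
import math
from typing import Callable

# Hypothetical stand-in for a real state dict: names mapped to flat lists of floats.
StateDict = dict[str, list[float]]


def roundtrip_matches(
    state_dict: StateDict,
    save_fn: Callable[[StateDict, str], None],
    load_fn: Callable[[str], StateDict],
    checkpoint_id: str,
    rel_tol: float = 0.0,
) -> bool:
    """Save a state dict, load it back, and compare every value."""
    save_fn(state_dict, checkpoint_id)
    loaded = load_fn(checkpoint_id)
    if loaded.keys() != state_dict.keys():
        return False
    for key, expected in state_dict.items():
        actual = loaded[key]
        if len(actual) != len(expected):
            return False
        # With rel_tol=0.0 this only accepts exactly equal values.
        if not all(
            math.isclose(a, e, rel_tol=rel_tol, abs_tol=0.0)
            for a, e in zip(actual, expected)
        ):
            return False
    return True


# In-memory stand-ins for the real save and load paths (assumption).
_store: dict[str, StateDict] = {}


def fake_save(state_dict: StateDict, checkpoint_id: str) -> None:
    _store[checkpoint_id] = {k: list(v) for k, v in state_dict.items()}


def fake_load(checkpoint_id: str) -> StateDict:
    return _store[checkpoint_id]
```

In a real test, `save_fn` and `load_fn` would wrap the `dcp_save` and `dcp_load` helpers from this PR, and the comparison would run on tensors rather than lists.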

torchtitan/components/checkpoint.py

@@ -12,14 +12,19 @@
               import shutil
               import threading
               import time
-              from typing import Any
+              from concurrent.futures import Future
+              from typing import Any, Optional

Contributor

fegin Jun 27, 2025

We use Python 3.10 type annotations, so you don't need Optional.

torchtitan/components/checkpoint.py

+                  checkpoint_id: str,
+                  is_async: bool,
+                  hf_safetensors_format: bool,
+                  pg: Optional[dist.ProcessGroup] = None,

Contributor

fegin Jun 27, 2025

Suggested change

-    pg: Optional[dist.ProcessGroup] = None,
+    pg: dist.ProcessGroup | None = None,

torchtitan/components/checkpoint.py

+              def dcp_save(
+                  state_dict: dict[str, Any],
+                  checkpoint_id: str,
+                  is_async: bool,

Contributor

fegin Jun 27, 2025

@Saiteja64 do we also need another argument for ZOC?

torchtitan/components/checkpoint.py

+                  is_async: bool,
+                  hf_safetensors_format: bool,
+                  pg: Optional[dist.ProcessGroup] = None,
+              ) -> Optional[Future]:

Contributor

fegin Jun 27, 2025

Suggested change

-    ) -> Optional[Future]:
+    ) -> Future | None:

torchtitan/components/checkpoint.py

+                  hf_safetensors_format: bool,
+                  pg: Optional[dist.ProcessGroup] = None,
+              ) -> Optional[Future]:
+                  """Save the checkpoint with dcp.

Contributor

fegin Jun 27, 2025

Add one empty line

torchtitan/components/checkpoint.py

+                      checkpoint_id (str): The checkpoint id to save.
+                      is_async (bool): Whether the checkpoint is async.
+                      hf_safetensors_format (bool): Whether to use the HuggingFace safetensors format.
+                      pg (Optional[dist.ProcessGroup]): The process group to use.

Contributor

fegin Jun 27, 2025

Add the return value as well.

torchtitan/components/checkpoint.py

Comment on lines +116 to +130

+                  if hf_safetensors_format:
+                      storage_writer = HuggingFaceStorageWriter(path=checkpoint_id, save_sharded=True)
+                      if is_async:
+                          return dcp.async_save(
+                              state_dict, storage_writer=storage_writer, process_group=pg
+                          )
+                      else:
+                          return dcp.save(state_dict, storage_writer=storage_writer)
+                  else:
+                      if is_async:
+                          return dcp.async_save(
+                              state_dict, checkpoint_id=checkpoint_id, process_group=pg
+                          )
+                      else:
+                          return dcp.save(state_dict, checkpoint_id=checkpoint_id)

Contributor

fegin Jun 27, 2025

We should simplify the function as follows:

Suggested change

-    if hf_safetensors_format:
-        storage_writer = HuggingFaceStorageWriter(path=checkpoint_id, save_sharded=True)
-        if is_async:
-            return dcp.async_save(
-                state_dict, storage_writer=storage_writer, process_group=pg
-            )
-        else:
-            return dcp.save(state_dict, storage_writer=storage_writer)
-    else:
-        if is_async:
-            return dcp.async_save(
-                state_dict, checkpoint_id=checkpoint_id, process_group=pg
-            )
-        else:
-            return dcp.save(state_dict, checkpoint_id=checkpoint_id)
+    storage_writer = HuggingFaceStorageWriter(path=checkpoint_id, save_sharded=True) if hf_safetensors_format else None
+    checkpoint_id = checkpoint_id if not hf_safetensors_format else None
+    if is_async:
+        return dcp.async_save(
+            state_dict, storage_writer=storage_writer, checkpoint_id=checkpoint_id, process_group=pg
+        )
+    else:
+        return dcp.save(state_dict, storage_writer=storage_writer, checkpoint_id=checkpoint_id)

torchtitan/components/checkpoint.py

+              def dcp_load(
+                  state_dict: dict[str, Any], checkpoint_id: str, hf_safetensors_format: bool
+              ) -> None:
+                  """Load the checkpoint with dcp.

Contributor

fegin Jun 27, 2025

Add one empty line below

torchtitan/config_manager.py

+                  enable_hf_safetensors_format: bool = False
+                  """
+                  Enable the use of safetensors format for checkpointing. This will save checkpoints
+                  in safetensors format instead of the default DCP format. The default value is False.

Contributor

fegin Jun 27, 2025

Can we also mention the possible performance penalty? It's not cost free, right?
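One way the option's docstring could acknowledge the cost is sketched below. `CheckpointConfig` is a hypothetical class name, and the added sentence is placeholder wording: this thread does not measure the penalty, so the actual cost would need to come from the author.

```python
from dataclasses import dataclass


@dataclass
class CheckpointConfig:
    """Hypothetical container for the checkpoint options discussed here."""

    enable_hf_safetensors_format: bool = False
    """
    Enable the use of safetensors format for checkpointing. This will save checkpoints
    in safetensors format instead of the default DCP format. Saving in this format may
    carry a performance penalty compared with the default DCP format (placeholder
    wording, not a measured result). The default value is False.
    """
```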


Reviewers

fegin left review comments

tianyu-l (code owner): review will be requested when the pull request is marked ready for review

wwwjn (code owner): review will be requested when the pull request is marked ready for review

wconstab (code owner): review will be requested when the pull request is marked ready for review

At least 1 approving review is required to merge this pull request.
