Open
Labels
enhancement (New feature or request)
Description
We want to leverage torchft's fast Lighthouse quorum implementation to provide faster dynamic rendezvous for torchelastic.
Torchelastic has an entry-points-based mechanism for registering new rendezvous backends: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/rendezvous/registry.py#L63-L64
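To make the entry-points idea concrete, here is a minimal sketch of how a registry can be populated from discovered entry points. It uses only the standard library. The group name `example.rendezvous_handlers`, the backend name `torchft_lighthouse`, and the `register_discovered_backends` helper are all hypothetical; the actual group name and registry object are whatever the linked `registry.py` uses.

```python
from importlib.metadata import EntryPoint
from typing import Callable, Dict, Iterable

# A handler factory is any callable that builds a rendezvous handler.
HandlerFactory = Callable[..., object]


def register_discovered_backends(
    entry_points: Iterable[EntryPoint],
    registry: Dict[str, HandlerFactory],
) -> None:
    """Load each entry point and register it under the entry point's name."""
    for ep in entry_points:
        if ep.name in registry:
            raise ValueError(f"backend {ep.name!r} is already registered")
        # ep.load() imports "module:attr" and returns the referenced object.
        registry[ep.name] = ep.load()


# Stand-in entry point so the sketch runs without installing a package.
# "collections:OrderedDict" is only a placeholder for a real handler factory.
registry: Dict[str, HandlerFactory] = {}
fake_ep = EntryPoint(
    name="torchft_lighthouse",            # hypothetical backend name
    value="collections:OrderedDict",      # placeholder target
    group="example.rendezvous_handlers",  # hypothetical group name
)
register_discovered_backends([fake_ep], registry)
```

In an installed package, the entry points would instead come from `importlib.metadata.entry_points(group=...)` (Python 3.10+), with the package declaring the entry point in its packaging metadata.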
Key features we want to support:
- flexible lighthouse config: support an external lighthouse, and automatically start a lighthouse from the given address, similar to how c10d starts its TCPStore
- scale up / scale down operations
- hot spares for fast restarts
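One possible shape for these features, as a rough sketch only: a config object that decides whether to start a lighthouse locally or connect to an external one, plus a helper that splits joined nodes into an active set and hot spares. Every name here (`LighthouseConfig`, `parse_endpoint`, `should_start_lighthouse`, `split_active_and_spares`) and the default port are assumptions for illustration, not torchft or torchelastic APIs. The endpoint parser handles plain `host:port` strings only, not bracketed IPv6 addresses.

```python
import socket
from dataclasses import dataclass
from typing import Collection, List, Optional, Tuple


@dataclass(frozen=True)
class LighthouseConfig:
    """Hypothetical rendezvous settings; not an existing torchft class."""
    host: str
    port: int
    min_nodes: int
    max_nodes: int
    # None means "decide from the address", True/False forces the choice.
    start_lighthouse: Optional[bool] = None


def parse_endpoint(endpoint: str, default_port: int = 29510) -> Tuple[str, int]:
    """Split "host:port"; fall back to an assumed default port."""
    host, sep, port = endpoint.rpartition(":")
    if not sep:
        return endpoint, default_port
    return host, int(port)


def should_start_lighthouse(
    cfg: LighthouseConfig,
    local_names: Optional[Collection[str]] = None,
) -> bool:
    """Start a lighthouse locally only if the endpoint points at this host."""
    if cfg.start_lighthouse is not None:
        return cfg.start_lighthouse
    if local_names is None:
        local_names = {"localhost", "127.0.0.1", socket.gethostname()}
    return cfg.host in local_names


def split_active_and_spares(
    joined: List[str], cfg: LighthouseConfig
) -> Tuple[List[str], List[str]]:
    """Return (active, spares) for the nodes that have joined so far.

    Below min_nodes nobody is active yet. Up to max_nodes are active, and
    any extra nodes wait as hot spares that can replace a failed node.
    """
    if len(joined) < cfg.min_nodes:
        return [], list(joined)
    return joined[: cfg.max_nodes], joined[cfg.max_nodes :]
```

With `min_nodes=2` and `max_nodes=3`, four joined nodes would yield three active nodes and one spare; a scale-down to two nodes keeps the job running, and a drop to one node leaves everyone waiting for a new quorum.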
References:
- https://packaging.python.org/en/latest/specifications/entry-points/
- https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/rendezvous/registry.py#L63-L64
- https://pytorch.org/docs/stable/elastic/rendezvous.html
- c10d rendezvous https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py#L214