torchelastic Rendezvous Backend #172

@d4l3k

We want to leverage the fast quorum implementation in torchft's Lighthouse to do faster dynamic rendezvous for torchelastic.

Torchelastic has an entry-point-based mechanism for registering new rendezvous backends: https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/rendezvous/registry.py#L63-L64
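For illustration, here is a rough sketch of the shape such a registration could take. It is stdlib-only so it runs without torch: `RendezvousParams`, `LighthouseRendezvousHandler`, `create_handler`, the toy registry, the backend name `torchft_lighthouse`, and the module path `my_pkg.rendezvous` are all hypothetical. The entry-point group name `torchrun.handlers` is my reading of the linked registry code and should be checked against it.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class RendezvousParams:
    """Stand-in for torchelastic's RendezvousParameters (illustrative subset)."""

    backend: str
    endpoint: str
    run_id: str
    min_nodes: int
    max_nodes: int
    config: Dict[str, Any] = field(default_factory=dict)


class LighthouseRendezvousHandler:
    """Hypothetical handler; a real one would subclass RendezvousHandler."""

    def __init__(self, params: RendezvousParams) -> None:
        self.params = params

    def get_backend(self) -> str:
        return self.params.backend


def create_handler(params: RendezvousParams) -> LighthouseRendezvousHandler:
    """Creator callable: what a packaging entry point would resolve to."""
    return LighthouseRendezvousHandler(params)


# Toy registry standing in for torchelastic's name -> creator mapping.
_REGISTRY: Dict[str, Callable[[RendezvousParams], Any]] = {}


def register(name: str, creator: Callable[[RendezvousParams], Any]) -> None:
    _REGISTRY[name] = creator


def get_handler(params: RendezvousParams) -> Any:
    try:
        creator = _REGISTRY[params.backend]
    except KeyError:
        raise ValueError(f"unknown rendezvous backend: {params.backend!r}") from None
    return creator(params)


# Packaging side (assumed group name, hypothetical module path):
#   [project.entry-points."torchrun.handlers"]
#   torchft_lighthouse = "my_pkg.rendezvous:create_handler"
register("torchft_lighthouse", create_handler)

params = RendezvousParams(
    backend="torchft_lighthouse",
    endpoint="lighthouse:29510",
    run_id="job0",
    min_nodes=1,
    max_nodes=4,
)
handler = get_handler(params)
```

With something like this in place, the backend would presumably be selected by name (e.g. via torchrun's `--rdzv-backend` flag) without any change to torchelastic itself.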

Key features we want to support:

  • flexible lighthouse config: support an external lighthouse, or automatically start one based on the address, similar to c10d's TCPStore
  • scale up / scale down operations
  • hot spares for fast restarts
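As a sketch of the first feature, the snippet below shows one way the endpoint could decide between using an external lighthouse and starting one locally, loosely modelled on how the c10d backend hosts its TCPStore on the node whose hostname matches the endpoint. Everything here is hypothetical: `LighthouseConfig`, `parse_endpoint`, `resolve_lighthouse`, the `external` flag, and the default port are illustrative names and values, not torchft or torchelastic APIs. IPv6 literals are not handled.

```python
import socket
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

DEFAULT_PORT = 29510  # assumed default, for illustration only


@dataclass(frozen=True)
class LighthouseConfig:
    host: str
    port: int
    start_local: bool  # True: this node should launch the lighthouse itself


def parse_endpoint(endpoint: str, default_port: int = DEFAULT_PORT) -> Tuple[str, int]:
    """Split "host[:port]" into (host, port); IPv6 literals are out of scope."""
    host, sep, port = endpoint.rpartition(":")
    if not sep:
        return endpoint, default_port
    return host, int(port)


def resolve_lighthouse(
    endpoint: str,
    local_names: Optional[Iterable[str]] = None,
    external: bool = False,
) -> LighthouseConfig:
    """Start a lighthouse locally only if the endpoint names this machine
    and the user has not said that an external lighthouse already exists."""
    if local_names is None:
        local_names = {"localhost", "127.0.0.1", socket.gethostname()}
    host, port = parse_endpoint(endpoint)
    start_local = (not external) and host in set(local_names)
    return LighthouseConfig(host=host, port=port, start_local=start_local)


# The node named "node0" hosts the lighthouse; every other node connects to it.
on_node0 = resolve_lighthouse("node0:29510", local_names={"node0"})
on_node1 = resolve_lighthouse("node0:29510", local_names={"node1"})
```

Passing `external=True` would force every node, including the one the address points at, to treat the lighthouse as already running.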

References:

Labels: enhancement
