
Have the nodes ping out their dataloader state before the all-reduce. #98

@Jackmin801

Description

  • If a node leaves the run by crashing, we cannot exactly recover its dataloader state.
  • This forces us to manually skip shards to avoid duplicate samples.
  • Ideally, nodes could resume automatically from a remotely stored dataloader state.
  • The dataloader state is small, so sending it should add little overhead.
  • We could interleave the upload with the all-reduce; completing the all-reduce would then validate that dataloader state as the latest.
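
A minimal sketch of how the last point might work, assuming a two-phase "publish, then commit" scheme. Everything here is hypothetical and for illustration only: `RemoteStateStore` is an in-memory stand-in for whatever remote store would be used, `sync_step` is an invented helper, and `all_reduce` is any callable that raises when the collective fails. It is not the project's actual API, and it runs the upload and the all-reduce sequentially rather than overlapping them.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class RemoteStateStore:
    """Hypothetical stand-in for a remote key-value store (in-memory here)."""

    pending: Dict[str, str] = field(default_factory=dict)
    committed: Dict[str, str] = field(default_factory=dict)

    def publish(self, node_id: str, step: int, state: dict) -> None:
        # Uploaded before the all-reduce; not yet trusted as "latest".
        self.pending[node_id] = json.dumps({"step": step, "state": state})

    def commit(self, node_id: str) -> None:
        # Promote the pending state once the all-reduce has completed.
        if node_id in self.pending:
            self.committed[node_id] = self.pending.pop(node_id)

    def latest(self, node_id: str) -> Optional[dict]:
        # State a restarted node would resume from, or None if none exists.
        raw = self.committed.get(node_id)
        return json.loads(raw) if raw is not None else None


def sync_step(
    store: RemoteStateStore,
    node_id: str,
    step: int,
    dataloader_state: dict,
    all_reduce: Callable[[], None],
) -> None:
    """Publish the dataloader state, run the all-reduce, then commit."""
    store.publish(node_id, step, dataloader_state)
    all_reduce()  # Assumed to raise if the collective does not complete.
    store.commit(node_id)


if __name__ == "__main__":
    store = RemoteStateStore()
    sync_step(store, "node-0", 1, {"shard": 3, "offset": 128}, lambda: None)
    print(store.latest("node-0"))
```

With this split, a node that crashes mid-step leaves only a pending entry behind, so a restart would resume from the last state whose all-reduce actually finished instead of requiring shards to be skipped by hand.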
