
V1.5.0 Release Plan #115

@yukirora


Endgame

  • Code freeze: Dec. 26
  • Bug Bash date:
  • Testing start date:
  • Release date:

Main Features

Platform Features & Stability

    • LTP Prometheus data auth
    • Fix containers losing access to NVIDIA GPUs
    • Local storage SSH key generation consistency enhancement
    • Job exporter support for GPUs on arm64 architecture

Inference

    • Inference usage statistics
    • Inference user data auth design and implementation

Security

    • Add support for Azure email communication service

TODO

Automatic Endpoint Deployment/Upgrade CI/CD
Distributed Prometheus design
Job Reliability & Monitoring
Automatic Failure Detector
Interactive graceful job exit/retry
Hardware Failure Detector
Software Failure Detector
Cluster and Job utilization metrics
LTP-Megatron Support for Project Level Metrics
Project Level Metrics Monitoring
