Endgame
- Code freeze: Dec. 26
- Bug Bash date:
- Testing start date:
- Release date:
Main Features
Platform Features & Stability
- LTP rest-server data auth (Ruigao/list disappeared vc jobs #112)
- LTP Prometheus data auth
- Containers losing access to NVIDIA GPUs
- Local storage SSH key generation consistency enhancement
- Job exporter support for GPUs on arm64 architecture
Inference
- Inference usage statistics
- Inference user data auth design and implementation
Security
- Add support for the Azure email communication service
TODO
Automatic Endpoint Deployment/Upgrade CI/CD
Distributed Prometheus design
Job Reliability & Monitoring
Automatic Failure Detector
Interactive graceful job exit/retry
Hardware Failure Detector
Software Failure Detector
Cluster and Job utilization metrics
LTP-Megatron Support for Project Level Metrics
Project Level Metrics Monitoring