eACGM: An eBPF-based Automated Comprehensive Governance and Monitoring framework for AI/ML systems.
⭐ [News] Our work has been accepted by IEEE/ACM IWQoS 2025 (CCF-B)!
eACGM provides zero-intrusion, low-overhead, full-stack observability across both the hardware (GPU, NCCL) and software (CUDA, Python, PyTorch) layers of modern AI/ML workloads.
- Event tracing for CUDA Runtime based on eBPF
- Event tracing for NCCL GPU communication library based on eBPF
- Function call tracing for Python virtual machine based on eBPF
- Operator tracing for PyTorch based on eBPF
- Process-level GPU information monitoring based on libnvml
- Global GPU information monitoring based on libnvml
- Automatic eBPF program generation
- Comprehensive analysis of all traced events and operators
- Flexible integration for multi-level tracing (CUDA, NCCL, PyTorch, Python, GPU)
- Visualization-ready data output for monitoring platforms
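As a rough illustration of how eBPF-based tracing of a user-space library and automatic program generation can fit together, the sketch below builds BPF C source from a list of function names and attaches it with BCC uprobes. It is not eACGM's actual generator: `generate_bpf_program`, `attach_probes`, the handler template, and the library path in the closing comment are all assumptions made for this example.

```python
import re

# Template for one uprobe entry handler (hypothetical layout, not eACGM's own).
# BCC compiles this C text when the BPF object is created.
HANDLER_TEMPLATE = """
int trace_{name}(struct pt_regs *ctx) {{
    u64 pid_tgid = bpf_get_current_pid_tgid();
    u64 ts = bpf_ktime_get_ns();
    bpf_trace_printk("{name} pid=%d ts=%llu\\n", (u32)(pid_tgid >> 32), ts);
    return 0;
}}
"""


def generate_bpf_program(symbols):
    """Return BPF C source containing one entry handler per symbol."""
    parts = ["#include <uapi/linux/ptrace.h>"]
    for sym in symbols:
        # Symbols are pasted into C function names, so reject anything else.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", sym):
            raise ValueError(f"not a valid C identifier: {sym!r}")
        parts.append(HANDLER_TEMPLATE.format(name=sym))
    return "\n".join(parts)


def attach_probes(library_path, symbols):
    """Compile the generated program and attach one uprobe per symbol.

    Requires root privileges and an installed BCC toolchain.
    """
    from bcc import BPF  # imported lazily so generation works without BCC

    bpf = BPF(text=generate_bpf_program(symbols))
    for sym in symbols:
        bpf.attach_uprobe(name=library_path, sym=sym, fn_name=f"trace_{sym}")
    return bpf


# Example (needs root, BCC, and a CUDA runtime; the path is a placeholder):
#   bpf = attach_probes("/usr/local/cuda/lib64/libcudart.so",
#                       ["cudaMalloc", "cudaLaunchKernel"])
#   bpf.trace_print()
```

Because the probes attach to symbols in a shared library, the traced application needs no source changes. The same pattern could in principle target other libraries by changing the path and symbol list.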
To visualize monitoring data, deploy Grafana and MySQL using Docker. Access the Grafana dashboard at http://127.0.0.1:3000.
cd grafana/
sh ./launch.sh

Start the monitoring service with:

./service.sh

Stop the monitoring service with:

./stop.sh

The demo folder provides example programs to showcase the capabilities of eACGM:
- pytorch_example.py: Multi-node, multi-GPU PyTorch training demo
- sampler_cuda.py: Trace CUDA Runtime events using eBPF
- sampler_nccl.py: Trace NCCL GPU communication events using eBPF
- sampler_torch.py: Trace PyTorch operator events using eBPF
- sampler_python.py: Trace Python VM function calls using eBPF
- sampler_gpu.py: Monitor global GPU information using libnvml
- sampler_nccl.py: Monitor process-level GPU information using libnvml
- sampler_eacg.py: Combined monitoring of all supported sources
- webui.py: Automatically visualize captured data in Grafana
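For readers unfamiliar with libnvml, the following minimal sketch shows what a single round of global GPU sampling might look like through the `pynvml` bindings. It is not taken from `sampler_gpu.py`; the `GpuSample` record, `sample_gpus`, and `format_sample` are hypothetical names introduced here, and the choice of fields is an assumption.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    """One utilization/memory reading for a single GPU (hypothetical record)."""
    index: int
    util_percent: int
    mem_used_mib: float
    mem_total_mib: float

    def mem_fraction(self):
        # Fraction of device memory in use; 0.0 if the total is unknown.
        return self.mem_used_mib / self.mem_total_mib if self.mem_total_mib else 0.0


def sample_gpus():
    """Query every visible GPU once through NVML."""
    import pynvml  # lazy import: needs the NVML bindings and an NVIDIA driver

    pynvml.nvmlInit()
    try:
        samples = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)  # values in bytes
            samples.append(
                GpuSample(i, util.gpu, mem.used / 2**20, mem.total / 2**20)
            )
        return samples
    finally:
        pynvml.nvmlShutdown()


def format_sample(s):
    """Render a sample as one log line."""
    return (
        f"gpu{s.index} util={s.util_percent}% "
        f"mem={s.mem_used_mib:.0f}/{s.mem_total_mib:.0f} MiB"
    )
```

On a machine with an NVIDIA driver, `for s in sample_gpus(): print(format_sample(s))` would print one line per device. Process-level monitoring could follow the same shape using NVML's per-process queries.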
If you find this project helpful, please consider citing our IWQoS 2025 paper:
@misc{xu2025eacgmnoninstrumentedperformancetracing,
title={eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems},
author={Ruilin Xu and Zongxuan Xie and Pengfei Chen},
year={2025},
eprint={2506.02007},
archivePrefix={arXiv},
primaryClass={cs.DC},
url={https://arxiv.org/abs/2506.02007},
}
