[IWQoS 2025] eACGM: An eBPF-based Automated Comprehensive Governance and Monitoring framework for AI/ML systems.

shady1543/eACGM

eACGM

eACGM: An eBPF-based Automated Comprehensive Governance and Monitoring framework for AI/ML systems.


[News] Our work has been accepted by IEEE/ACM IWQoS 2025 (CCF-B)!

[arXiv]


eACGM provides non-intrusive, low-overhead, full-stack observability across both the hardware (GPU, NCCL) and software (CUDA, Python, PyTorch) layers of modern AI/ML workloads. No changes to the monitored application's code are required.

Architecture

Features

  • Event tracing for CUDA Runtime based on eBPF
  • Event tracing for NCCL GPU communication library based on eBPF
  • Function call tracing for Python virtual machine based on eBPF
  • Operator tracing for PyTorch based on eBPF
  • Process-level GPU information monitoring based on libnvml
  • Global GPU information monitoring based on libnvml
  • Automatic eBPF program generation
  • Comprehensive analysis of all traced events and operators
  • Flexible integration for multi-level tracing (CUDA, NCCL, PyTorch, Python, GPU)
  • Visualization-ready data output for monitoring platforms
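
To make the "automatic eBPF program generation" idea more concrete, here is a minimal sketch of how probe code could be generated from a list of function names. It is not eACGM's actual generator: the template, the `event_t` layout, the `generate_bpf_program` helper, and the symbol list are all hypothetical, and the generated C text assumes BCC-style helpers (`BPF_PERF_OUTPUT`, `bpf_get_current_pid_tgid`, `bpf_ktime_get_ns`).

```python
from string import Template

# Hypothetical shared header: one event record plus a perf output channel.
HEADER = """\
#include <uapi/linux/ptrace.h>

struct event_t {
    u32 pid;
    u32 func_id;
    u64 ts_ns;
};
BPF_PERF_OUTPUT(events);
"""

# Hypothetical template for a single uprobe entry handler (BCC-style C).
PROBE_TEMPLATE = Template("""\
int trace_${name}_entry(struct pt_regs *ctx) {
    struct event_t ev = {};
    ev.pid = bpf_get_current_pid_tgid() >> 32;
    ev.ts_ns = bpf_ktime_get_ns();
    ev.func_id = ${func_id};
    events.perf_submit(ctx, &ev, sizeof(ev));
    return 0;
}
""")


def generate_bpf_program(symbols):
    """Return (C source text, {func_id: symbol}) for the given symbol names."""
    parts = [HEADER]
    id_to_symbol = {}
    for func_id, name in enumerate(symbols):
        parts.append(PROBE_TEMPLATE.substitute(name=name, func_id=func_id))
        id_to_symbol[func_id] = name
    return "\n".join(parts), id_to_symbol


# Example symbol list (an assumption): a few CUDA Runtime entry points.
CUDA_SYMBOLS = ["cudaMalloc", "cudaMemcpy", "cudaLaunchKernel"]
source, id_to_symbol = generate_bpf_program(CUDA_SYMBOLS)
```

On a machine with BCC installed and root privileges, the generated text could in principle be loaded with `BPF(text=source)` and each handler attached with `attach_uprobe(name=..., sym=..., fn_name=...)`, where `name` points at the CUDA Runtime shared library in use. Generating handlers from a symbol list keeps the per-function code identical, so supporting another library mostly means supplying another list.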

Visualization

To visualize monitoring data, deploy Grafana and MySQL with Docker using the launch script in the grafana/ directory. Once the containers are running, the Grafana dashboard is available at http://127.0.0.1:3000.

cd grafana/
sh ./launch.sh

Start the monitoring service with:

./service.sh

Stop the monitoring service with:

./stop.sh

Case Demonstration

The demo folder provides example programs to showcase the capabilities of eACGM:

  • pytorch_example.py: Multi-node, multi-GPU PyTorch training demo
  • sampler_cuda.py: Trace CUDA Runtime events using eBPF
  • sampler_nccl.py: Trace NCCL GPU communication events using eBPF
  • sampler_torch.py: Trace PyTorch operator events using eBPF
  • sampler_python.py: Trace Python VM function calls using eBPF
  • sampler_gpu.py: Monitor global GPU information using libnvml
  • sampler_nccl.py: Monitor process-level GPU information using libnvml
  • sampler_eacg.py: Combined monitoring of all supported sources
  • webui.py: Automatically visualize captured data in Grafana
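
For readers unfamiliar with libnvml, the following sketch shows roughly what global and process-level GPU sampling can look like. It is not the code of the demo samplers above: it assumes the `pynvml` Python bindings for NVML, and the `sample_gpus` helper and its output field names are hypothetical.

```python
def sample_gpus(nvml):
    """Collect one snapshot of global and per-process GPU state.

    `nvml` is any object exposing the NVML binding functions used below,
    for example the `pynvml` module (an assumption of this sketch).
    """
    nvml.nvmlInit()
    try:
        snapshot = []
        for index in range(nvml.nvmlDeviceGetCount()):
            handle = nvml.nvmlDeviceGetHandleByIndex(index)
            util = nvml.nvmlDeviceGetUtilizationRates(handle)
            mem = nvml.nvmlDeviceGetMemoryInfo(handle)
            procs = nvml.nvmlDeviceGetComputeRunningProcesses(handle)
            snapshot.append({
                "gpu": index,
                "util_percent": util.gpu,          # global GPU utilization
                "mem_used_bytes": mem.used,        # global memory in use
                "mem_total_bytes": mem.total,
                "processes": [                     # process-level view
                    {"pid": p.pid, "mem_used_bytes": p.usedGpuMemory}
                    for p in procs
                ],
            })
        return snapshot
    finally:
        # Always release NVML, even if a query above raised.
        nvml.nvmlShutdown()


def main():
    # Requires an NVIDIA driver and the NVML Python bindings to be installed.
    import pynvml
    for gpu in sample_gpus(pynvml):
        print(gpu)
```

Calling `main()` on a GPU machine prints one record per device. Passing the binding module in as a parameter keeps the sampling logic testable on machines without a GPU.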

Citation

If you find this project helpful, please consider citing our IWQoS 2025 paper:

@misc{xu2025eacgmnoninstrumentedperformancetracing,
      title={eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems}, 
      author={Ruilin Xu and Zongxuan Xie and Pengfei Chen},
      year={2025},
      eprint={2506.02007},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2506.02007}, 
}
