@ramonfigueiredo ramonfigueiredo commented Oct 9, 2025

Link the Issue(s) this Pull Request is related to.

Summarize your change.

PHASE 1:

A) Protobuf

Protobuf schema extensions:

  • Add GpuDevice message with vendor, model, memory, PCI bus, driver version, and CUDA/Metal version fields to host.proto
  • Add GpuUsage message for per-device utilization tracking (util %, memory used)
  • Extend Host and NestedHost messages with gpu_devices repeated field
  • Extend RenderHost with gpu_devices for detailed GPU inventory reporting
  • Extend RunningFrameInfo with gpu_usage for per-frame GPU metrics
  • Add GPU constraint fields to Layer: gpu_vendor, gpu_models_allowed, min_gpu_memory_bytes for scheduler filtering
  • Add gpu_usage to Frame and UpdatedFrame messages for accounting
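
As a rough illustration of how the Layer constraint fields might be used for scheduler filtering, here is a minimal Python sketch. The constraint names (`gpu_vendor`, `gpu_models_allowed`, `min_gpu_memory_bytes`) come from the list above; the `GpuDevice` field names, the "empty means any" convention, and both helper functions are assumptions made for the example, not the actual Cuebot logic.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GpuDevice:
    # Hypothetical mirror of the GpuDevice proto; field names are assumptions.
    vendor: str
    model: str
    memory_bytes: int


@dataclass
class LayerGpuConstraints:
    # Names taken from the Layer fields listed above.
    gpu_vendor: str = ""  # assumed: empty means "any vendor"
    gpu_models_allowed: List[str] = field(default_factory=list)  # assumed: empty means "any model"
    min_gpu_memory_bytes: int = 0


def device_matches(device: GpuDevice, want: LayerGpuConstraints) -> bool:
    """Return True if a single device satisfies every constraint that is set."""
    if want.gpu_vendor and device.vendor.lower() != want.gpu_vendor.lower():
        return False
    if want.gpu_models_allowed and device.model not in want.gpu_models_allowed:
        return False
    return device.memory_bytes >= want.min_gpu_memory_bytes


def eligible_devices(devices, want):
    """Filter a host's gpu_devices down to those a layer could run on."""
    return [d for d in devices if device_matches(d, want)]
```

A scheduler could then skip any host for which `eligible_devices(host_devices, layer_constraints)` returns fewer devices than the layer requests.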

B) Python RQD

[rqd/rust_rqd/proto] Add robust GPU support with cross-platform discovery and per-device tracking

RQD GPU discovery implementation:

  • Implement GpuDiscovery abstract base class for pluggable GPU backends
  • Implement NvidiaGpuDiscovery with NVML (pynvml) support and nvidia-smi fallback for detailed NVIDIA GPU metadata collection
  • Implement AppleMetalGpuDiscovery for macOS Apple Silicon GPU detection via system_profiler JSON parsing
  • Update Machine class with platform-specific GPU discovery initialization (Linux - NVIDIA, Darwin - Apple Metal, Windows - NVIDIA)
  • Populate gpu_devices in RenderHost for all platforms (Linux, macOS, Windows)
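
The pluggable-backend idea with an nvidia-smi fallback can be sketched roughly as follows. This is not the code in this PR: the method name `detect_devices` is borrowed from the Rust trait described later, and the class name `NvidiaSmiDiscovery`, the dictionary keys, and the set of queried fields are assumptions. The `--query-gpu` and `--format=csv,noheader,nounits` options are standard nvidia-smi flags.

```python
import abc
import subprocess

# Standard nvidia-smi query; which fields to collect is an assumption.
NVIDIA_SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total,driver_version,pci.bus_id",
    "--format=csv,noheader,nounits",
]


def parse_nvidia_smi_csv(text):
    """Parse nvidia-smi CSV rows into plain dicts (memory.total is in MiB)."""
    devices = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 5:
            continue  # skip malformed rows rather than failing discovery
        index, name, mem_mib, driver, bus = parts
        try:
            memory_bytes = int(mem_mib) * 1024 * 1024
        except ValueError:
            memory_bytes = 0  # value not reported as a number
        devices.append({
            "id": index,
            "vendor": "NVIDIA",
            "model": name,
            "memory_bytes": memory_bytes,
            "driver_version": driver,
            "pci_bus": bus,
        })
    return devices


class GpuDiscovery(abc.ABC):
    """Backend interface; concrete classes wrap one vendor or tool."""

    @abc.abstractmethod
    def detect_devices(self):
        """Return a list of device dicts; an empty list means no GPUs found."""


class NvidiaSmiDiscovery(GpuDiscovery):
    def detect_devices(self):
        try:
            out = subprocess.run(
                NVIDIA_SMI_QUERY, capture_output=True, text=True,
                check=True, timeout=10,
            )
        except (OSError, subprocess.SubprocessError):
            return []  # nvidia-smi missing, failing, or timing out
        return parse_nvidia_smi_csv(out.stdout)
```

Keeping the parser separate from the subprocess call makes it possible to unit test the fallback on machines without an NVIDIA driver.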

GPU isolation and monitoring:

  • Set CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES environment variables in rqcore.py for proper GPU isolation in launched frames
  • Collect per-device GPU utilization in __updateGpuAndLlu() using new getGpuUtilization() method
  • Add gpuUsage list to RunningFrame class for tracking per-frame GPU metrics
  • Extend runningFrameInfo() to include gpu_usage in RunningFrameInfo proto
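
A minimal sketch of the isolation step, assuming the frame has already been assigned a list of GPU indices. The helper name `build_frame_env` and its parameters are hypothetical; only the two environment variable names come from the description above.

```python
import os


def build_frame_env(assigned_gpu_ids, base_env=None):
    """Return a copy of the environment restricted to the assigned GPUs."""
    env = dict(os.environ if base_env is None else base_env)
    visible = ",".join(str(i) for i in assigned_gpu_ids)
    # CUDA_VISIBLE_DEVICES limits which devices the CUDA runtime enumerates;
    # NVIDIA_VISIBLE_DEVICES is read by the NVIDIA container runtime.
    env["CUDA_VISIBLE_DEVICES"] = visible
    env["NVIDIA_VISIBLE_DEVICES"] = visible
    return env
```

The returned dictionary could be passed as `env=` to `subprocess.Popen` when the frame is launched, so the parent RQD process environment is left untouched.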

Dependencies:

  • Add pynvml >= 11.5.0 to rqd/pyproject.toml for NVML GPU querying

All changes maintain backward compatibility via optional/repeated proto fields.

Legacy num_gpus and gpu_memory fields preserved for existing clients.

C) Rust RQD

[rust_rqd] Add GPU discovery infrastructure to Rust RQD

Implement cross-platform GPU discovery and monitoring infrastructure for Rust RQD, mirroring the Python RQD architecture to enable robust GPU support across both implementations.

  1. New module: system/gpu.rs
  • Add GpuDiscovery trait defining abstract GPU discovery interface with detect_devices() and get_utilization() methods
  • Implement NvidiaGpuDiscovery with NVML support (via optional nvml-wrapper crate) and nvidia-smi fallback for detailed NVIDIA GPU metadata collection
  • Implement AppleMetalGpuDiscovery for macOS Apple Silicon GPU detection via system_profiler JSON parsing
  • Add create_gpu_discovery() factory function for platform-specific backend selection (Linux - NVIDIA, macOS - Apple Metal, Windows - NVIDIA)
  2. system/manager.rs:
  • Import GpuDevice and GpuUsage from opencue_proto::host
  • Extend MachineGpuStats with gpu_devices: Vec<GpuDevice> for detailed GPU inventory alongside legacy count/memory fields
  • Extend ProcessStats with gpu_usage: Vec<GpuUsage> for per-device utilization tracking in running frames
  • Update ProcessStats::default() and ProcessStats::update() to handle new gpu_usage field
  3. system/mod.rs:
  • Expose gpu module with pub mod gpu
  4. Cargo.toml:
  • Add optional nvml feature flag for NVML support
  • Add nvml-wrapper = { version = "0.10", optional = true } dependency

Architecture:

  • Trait-based abstraction matches Python class hierarchy for consistency
  • Optional NVML dependency via Cargo features allows compilation without NVIDIA-specific dependencies
  • Cross-platform design supports Linux (NVIDIA), macOS (Apple Metal), and Windows (NVIDIA) from the start
  • Backward compatible: retains legacy GPU fields in MachineGpuStats
  • Reuses opencue_proto::host::{GpuDevice, GpuUsage} proto messages directly

Build with NVML: cargo build --release --features nvml
Build without NVML: cargo build --release (fallback to nvidia-smi)

Remaining integration work tracked in RUST_GPU_IMPLEMENTATION_SUMMARY.md:

  • Integrate GPU discovery into MachineMonitor
  • Populate gpu_devices in RenderHost reports
  • Add CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES environment variables
  • Collect per-frame GPU utilization during stats collection

@ramonfigueiredo ramonfigueiredo changed the title from "[rqd/proto] Add robust GPU support with cross-platform discovery and per-device tracking" to "feat(GPU): Production-grade, cross-platform GPU support - vendor-aware scheduling, per-device telemetry, isolation, and docs" on Oct 9, 2025
@ramonfigueiredo ramonfigueiredo changed the title from "feat(GPU): Production-grade, cross-platform GPU support - vendor-aware scheduling, per-device telemetry, isolation, and docs" to "[Proto/RQD/Cuebot/RESTGateway/CueAdmin/CueGUI] Feat(GPU): Production-grade, cross-platform GPU support - vendor-aware scheduling, per-device telemetry, isolation, and docs" on Oct 9, 2025
…per-device tracking

Implement Phase 1 of comprehensive GPU support enhancement:
- Issue: [GPU/Proto/RQD/Cuebot/RESTGateway/CueAdmin/CueGUI] OpenCue GPU Support - Comprehensive Audit and Implementation Plan - AcademySoftwareFoundation#2035

Protobuf schema extensions:
- Add `GpuDevice` message with vendor, model, memory, PCI bus, driver version, and CUDA/Metal version fields to `host.proto`
- Add `GpuUsage` message for per-device utilization tracking (util %, memory used)
- Extend `Host` and `NestedHost` messages with `gpu_devices` repeated field
- Extend `RenderHost` with `gpu_devices` for detailed GPU inventory reporting
- Extend `RunningFrameInfo` with `gpu_usage` for per-frame GPU metrics
- Add GPU constraint fields to Layer: `gpu_vendor`, `gpu_models_allowed`, `min_gpu_memory_bytes` for scheduler filtering
- Add `gpu_usage` to `Frame` and `UpdatedFrame` messages for accounting

RQD GPU discovery implementation:
- Implement `GpuDiscovery` abstract base class for pluggable GPU backends
- Implement `NvidiaGpuDiscovery` with `NVML` (`pynvml`) support and `nvidia-smi` fallback for detailed NVIDIA GPU metadata collection
- Implement `AppleMetalGpuDiscovery` for macOS Apple Silicon GPU detection via `system_profiler` JSON parsing
- Update Machine class with platform-specific GPU discovery initialization (Linux - NVIDIA, Darwin - Apple Metal, Windows - NVIDIA)
- Populate `gpu_devices` in `RenderHost` for all platforms (`Linux`, `macOS`, `Windows`)

GPU isolation and monitoring:
- Set `CUDA_VISIBLE_DEVICES` and `NVIDIA_VISIBLE_DEVICES` environment variables in `rqcore.py` for proper GPU isolation in launched frames
- Collect per-device GPU utilization in `__updateGpuAndLlu()` using new `getGpuUtilization()` method
- Add `gpuUsage` list to `RunningFrame` class for tracking per-frame GPU metrics
- Extend `runningFrameInfo()` to include `gpu_usage` in `RunningFrameInfo` proto

Update VERSION.in

Dependencies:
- Add `pynvml >= 11.5.0` to `rqd/pyproject.toml` for NVML GPU querying

All changes maintain backward compatibility via optional/repeated proto fields.

Legacy `num_gpus` and `gpu_memory` fields preserved for existing clients.
@ramonfigueiredo ramonfigueiredo force-pushed the feature/robust-gpu-support branch from d698c39 to ae28d59 on October 9, 2025 01:41
@ramonfigueiredo ramonfigueiredo changed the title from "[Proto/RQD/Cuebot/RESTGateway/CueAdmin/CueGUI] Feat(GPU): Production-grade, cross-platform GPU support - vendor-aware scheduling, per-device telemetry, isolation, and docs" to "[proto/rqd/cuebot/rest_gateway/cueadmin/cuegui] Feat(GPU): Production-grade, cross-platform GPU support - vendor-aware scheduling, per-device telemetry, isolation, and docs" on Oct 9, 2025
@ramonfigueiredo ramonfigueiredo changed the title from "[proto/rqd/cuebot/rest_gateway/cueadmin/cuegui] Feat(GPU): Production-grade, cross-platform GPU support - vendor-aware scheduling, per-device telemetry, isolation, and docs" to "[proto/rqd/rust_rqd/cuebot/rest_gateway/cueadmin/cuegui] Feat(GPU): Production-grade, cross-platform GPU support - vendor-aware scheduling, per-device telemetry, isolation, and docs" on Oct 9, 2025
Implement cross-platform GPU discovery and monitoring infrastructure for Rust RQD, mirroring the Python RQD architecture to enable robust GPU support across both implementations.

1) New module: `system/gpu.rs`
- Add `GpuDiscovery` trait defining abstract GPU discovery interface with `detect_devices()` and `get_utilization()` methods
- Implement `NvidiaGpuDiscovery` with NVML support (via optional `nvml-wrapper` crate) and `nvidia-smi` fallback for detailed NVIDIA GPU metadata collection
- Implement `AppleMetalGpuDiscovery` for macOS Apple Silicon GPU detection via `system_profiler` JSON parsing
- Add `create_gpu_discovery()` factory function for platform-specific backend selection (Linux - NVIDIA, macOS - Apple Metal, Windows - NVIDIA)

2) `system/manager.rs`:
- Import `GpuDevice` and `GpuUsage` from `opencue_proto::host`
- Extend `MachineGpuStats` with `gpu_devices: Vec<GpuDevice>` for detailed GPU inventory alongside legacy count/memory fields
- Extend `ProcessStats` with `gpu_usage: Vec<GpuUsage>` for per-device utilization tracking in running frames
- Update `ProcessStats::default()` and `ProcessStats::update()` to handle new `gpu_usage` field

3) `system/mod.rs`:
- Expose `gpu` module with `pub mod gpu`

4) `Cargo.toml`:
- Add optional `nvml` feature flag for NVML support
- Add `nvml-wrapper = { version = "0.10", optional = true }` dependency

Architecture:
- Trait-based abstraction matches Python class hierarchy for consistency
- Optional NVML dependency via Cargo features allows compilation without NVIDIA-specific dependencies
- Cross-platform design supports Linux (NVIDIA), macOS (Apple Metal), and Windows (NVIDIA) from the start
- Backward compatible: retains legacy GPU fields in `MachineGpuStats`
- Reuses `opencue_proto::host::{GpuDevice, GpuUsage}` proto messages directly

Build with NVML: `cargo build --release --features nvml`
Build without NVML: `cargo build --release` (fallback to nvidia-smi)

Remaining integration work tracked in RUST_GPU_IMPLEMENTATION_SUMMARY.md:
- Integrate GPU discovery into `MachineMonitor`
- Populate `gpu_devices` in `RenderHost` reports
- Add `CUDA_VISIBLE_DEVICES`/`NVIDIA_VISIBLE_DEVICES` environment variables
- Collect per-frame GPU utilization during stats collection
@ramonfigueiredo ramonfigueiredo force-pushed the feature/robust-gpu-support branch from be760b3 to 6d3e8cc on October 9, 2025 02:12
@ramonfigueiredo ramonfigueiredo self-assigned this Oct 9, 2025
@DiegoTavares

Great work so far. My only comment is that pynvml dependency should be optional.
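
One common way to make such a dependency optional is to guard the import and let callers fall back when it is missing. The sketch below assumes that approach; the helper name `nvml_device_count` is hypothetical, while `nvmlInit`, `nvmlDeviceGetCount`, `nvmlShutdown`, and `NVMLError` are existing pynvml names. On the packaging side, the requirement could be moved under `[project.optional-dependencies]` in `rqd/pyproject.toml`, though that is a suggestion rather than something this PR does.

```python
try:
    import pynvml  # optional dependency; only needed for the NVML backend
except ImportError:
    pynvml = None


def nvml_device_count():
    """Return the NVML device count, or None when NVML cannot be used."""
    if pynvml is None:
        return None  # package not installed: caller should fall back to nvidia-smi
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # package present but no usable driver or library
    try:
        return pynvml.nvmlDeviceGetCount()
    except pynvml.NVMLError:
        return None
    finally:
        pynvml.nvmlShutdown()
```

With this pattern, RQD would still start on hosts where pynvml is not installed, and the NVML path would only be taken when both the package and the driver are available.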

DiegoTavares commented Oct 14, 2025

Besides the protobuf errors:


File recursively imports itself: job.proto -> host.proto -> job.proto  
host.proto:10:1: Import "job.proto" was not found or had errors.  
host.proto:736:5: "job.Frame" is not defined.  
host.proto:754:5: "job.Job" is not defined.  
host.proto:763:5: "job.Layer" is not defined.  
host.proto:825:5: "job.JobSeq" is not defined.  
host.proto:836:5: "job.Group" is not defined.  
job.proto:12:1: Import "host.proto" was not found or had errors.  
job.proto:526:14: "host.GpuUsage" is not defined.  
job.proto:573:14: "host.GpuUsage" is not defined.  
filter.proto:8:1: Import "job.proto" was not found or had errors.  
filter.proto:301:5: "job.Group" is not defined.  
filter.proto:309:5: "job.JobSeq" is not defined.
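
The first error is a circular import: `job.proto` imports `host.proto` for `host.GpuUsage`, while `host.proto` already imports `job.proto`. The small Python sketch below models the import edges implied by that output and shows the cycle. The `after` graph is one hypothetical way to break it, by moving the GPU messages into a separate file; the file name `gpu.proto` is an assumption, not something proposed in this thread.

```python
def find_import_cycle(imports):
    """Return one import cycle as a list of files, or None if the graph is acyclic."""
    visiting, done = [], set()

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Found a back edge: slice out the cycle and close it.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in imports.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in imports:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Import edges as implied by the protoc output above.
before = {
    "job.proto": ["host.proto"],
    "host.proto": ["job.proto"],
    "filter.proto": ["job.proto"],
}

# Hypothetical fix: GPU messages live in their own file (name is an assumption).
after = {
    "job.proto": ["gpu.proto"],
    "host.proto": ["job.proto", "gpu.proto"],
    "filter.proto": ["job.proto"],
    "gpu.proto": [],
}
```

Under these assumptions, `find_import_cycle(before)` reports the `job.proto -> host.proto -> job.proto` loop and `find_import_cycle(after)` finds none. The remaining "is not defined" errors follow from the failed imports.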

Co-authored-by: Diego Tavares <[email protected]>
Signed-off-by: Ramon Figueiredo <[email protected]>