[proto/rqd/rust_rqd/cuebot/rest_gateway/cueadmin/cuegui] Feat(GPU): Production-grade, cross-platform GPU support - vendor-aware scheduling, per-device telemetry, isolation, and docs #2036

ramonfigueiredo · 2025-10-09T01:39:20Z

Link the Issue(s) this Pull Request is related to.

[Proto/RQD/RustRQD/Cuebot/RESTGateway/CueAdmin/CueGUI] OpenCue GPU Modernization - Audit, Design and Production Rollout #2035

Summarize your change.

PHASE 1:

A) Protobuf

Protobuf schema extensions:

Add GpuDevice message with vendor, model, memory, PCI bus, driver version, and CUDA/Metal version fields to host.proto
Add GpuUsage message for per-device utilization tracking (util %, memory used)
Extend Host and NestedHost messages with gpu_devices repeated field
Extend RenderHost with gpu_devices for detailed GPU inventory reporting
Extend RunningFrameInfo with gpu_usage for per-frame GPU metrics
Add GPU constraint fields to Layer: gpu_vendor, gpu_models_allowed, min_gpu_memory_bytes for scheduler filtering
Add gpu_usage to Frame and UpdatedFrame messages for accounting
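
To illustrate how the Layer constraint fields could drive scheduler filtering, here is a minimal Python sketch. The dataclasses are illustrative stand-ins for the proto messages, not generated code; the `memory_bytes` field name, the "empty means unconstrained" convention, and the helper names `device_satisfies` and `eligible_devices` are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GpuDevice:
    """Illustrative stand-in for the GpuDevice proto message (subset of fields)."""
    vendor: str
    model: str
    memory_bytes: int  # hypothetical field name for the device memory


@dataclass
class LayerGpuConstraints:
    """Illustrative stand-in for the GPU constraint fields on Layer."""
    gpu_vendor: str = ""  # assumption: empty means any vendor
    gpu_models_allowed: List[str] = field(default_factory=list)  # assumption: empty means any model
    min_gpu_memory_bytes: int = 0


def device_satisfies(device: GpuDevice, constraints: LayerGpuConstraints) -> bool:
    """Return True when a single device meets every constraint that is set."""
    if constraints.gpu_vendor and device.vendor.lower() != constraints.gpu_vendor.lower():
        return False
    if constraints.gpu_models_allowed and device.model not in constraints.gpu_models_allowed:
        return False
    return device.memory_bytes >= constraints.min_gpu_memory_bytes


def eligible_devices(devices: List[GpuDevice], constraints: LayerGpuConstraints) -> List[GpuDevice]:
    """Filter a host's GPU inventory down to the devices a layer may use."""
    return [d for d in devices if device_satisfies(d, constraints)]
```

A layer that sets only `gpu_vendor` and `min_gpu_memory_bytes` would match any model from that vendor with enough memory, while a layer with no constraints matches every device.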

B) Python RQD

[rqd/rust_rqd/proto] Add robust GPU support with cross-platform discovery and per-device tracking

RQD GPU discovery implementation:

Implement GpuDiscovery abstract base class for pluggable GPU backends
Implement NvidiaGpuDiscovery with NVML (pynvml) support and nvidia-smi fallback for detailed NVIDIA GPU metadata collection
Implement AppleMetalGpuDiscovery for macOS Apple Silicon GPU detection via system_profiler JSON parsing
Update Machine class with platform-specific GPU discovery initialization (Linux - NVIDIA, Darwin - Apple Metal, Windows - NVIDIA)
Populate gpu_devices in RenderHost for all platforms (Linux, macOS, Windows)
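
The pluggable-backend idea above can be sketched roughly as follows. This is not the code in the PR: the method name `detect_devices`, the classes `NvidiaSmiDiscovery` and `NoGpuDiscovery`, the factory `select_backend`, and the dict keys are hypothetical, and only the nvidia-smi fallback path is shown (the NVML and Apple Metal backends are omitted for brevity).

```python
import abc
import platform
import subprocess


class GpuDiscovery(abc.ABC):
    """Abstract interface that every GPU backend implements."""

    @abc.abstractmethod
    def detect_devices(self):
        """Return a list of dicts, one per detected GPU."""


def parse_nvidia_smi_csv(text):
    """Parse 'index, name, memory.total (MiB), driver_version' CSV rows."""
    devices = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 4:
            continue  # skip rows that do not have the expected four columns
        index, name, memory_mib, driver = parts
        try:
            memory_bytes = int(memory_mib) * 1024 * 1024
        except ValueError:
            continue  # skip rows where memory is not a number, e.g. "[N/A]"
        devices.append({
            "id": index,
            "vendor": "NVIDIA",
            "model": name,
            "memory_bytes": memory_bytes,
            "driver_version": driver,
        })
    return devices


class NvidiaSmiDiscovery(GpuDiscovery):
    """Backend that shells out to nvidia-smi."""

    QUERY = [
        "nvidia-smi",
        "--query-gpu=index,name,memory.total,driver_version",
        "--format=csv,noheader,nounits",
    ]

    def detect_devices(self):
        try:
            result = subprocess.run(
                self.QUERY, capture_output=True, text=True, timeout=10, check=True
            )
        except (OSError, subprocess.SubprocessError):
            return []  # nvidia-smi missing, failed, or timed out
        return parse_nvidia_smi_csv(result.stdout)


class NoGpuDiscovery(GpuDiscovery):
    """Placeholder backend that reports no GPUs."""

    def detect_devices(self):
        return []


def select_backend(system=None):
    """Pick a backend by platform name; the Apple Metal backend is omitted in this sketch."""
    system = system or platform.system()
    if system in ("Linux", "Windows"):
        return NvidiaSmiDiscovery()
    return NoGpuDiscovery()
```

Keeping the CSV parsing in a standalone function makes the fallback path testable on machines without an NVIDIA driver.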

GPU isolation and monitoring:

Set CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES environment variables in rqcore.py for proper GPU isolation in launched frames
Collect per-device GPU utilization in __updateGpuAndLlu() using new getGpuUtilization() method
Add gpuUsage list to RunningFrame class for tracking per-frame GPU metrics
Extend runningFrameInfo() to include gpu_usage in RunningFrameInfo proto
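
As a rough illustration of the isolation step, the sketch below builds a per-frame environment that exposes only the GPUs assigned to that frame. The function name `build_frame_env` and the `assigned_gpu_ids` parameter are hypothetical; the actual logic in `rqcore.py` may differ.

```python
def build_frame_env(base_env, assigned_gpu_ids):
    """Return a copy of base_env that exposes only the assigned GPU indices."""
    env = dict(base_env)  # copy so the caller's environment is not mutated
    visible = ",".join(str(gpu_id) for gpu_id in assigned_gpu_ids)
    env["CUDA_VISIBLE_DEVICES"] = visible    # read by the CUDA runtime
    env["NVIDIA_VISIBLE_DEVICES"] = visible  # read by the NVIDIA container runtime
    return env
```

The resulting dict could then be passed as the `env` argument of `subprocess.Popen` when launching the frame command.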

Dependencies:

Add pynvml >= 11.5.0 to rqd/pyproject.toml for NVML GPU querying
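
For reference, NVML querying through pynvml can be guarded so that RQD still starts on hosts where the package or the NVIDIA driver is absent. This is a sketch under assumptions, not the PR's code: the function name `get_gpu_utilization` and the dict keys are illustrative, and an empty list stands in for whatever fallback the caller uses.

```python
try:
    import pynvml  # treated as optional in this sketch
except ImportError:
    pynvml = None


def get_gpu_utilization():
    """Return per-device utilization via NVML, or [] when NVML is unavailable."""
    if pynvml is None:
        return []  # package not installed
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # driver or NVML library not available
    try:
        usage = []
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            usage.append({
                "device_id": index,
                "utilization_pct": util.gpu,     # GPU utilization, percent
                "memory_used_bytes": mem.used,   # device memory in use, bytes
            })
        return usage
    except pynvml.NVMLError:
        return []
    finally:
        pynvml.nvmlShutdown()
```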

All changes maintain backward compatibility via optional/repeated proto fields.

Legacy num_gpus and gpu_memory fields preserved for existing clients.

C) Rust RQD

[rust_rqd] Add GPU discovery infrastructure to Rust RQD

Implement cross-platform GPU discovery and monitoring infrastructure for Rust RQD, mirroring the Python RQD architecture to enable robust GPU support across both implementations.

New module: system/gpu.rs

Add GpuDiscovery trait defining abstract GPU discovery interface with detect_devices() and get_utilization() methods
Implement NvidiaGpuDiscovery with NVML support (via optional nvml-wrapper crate) and nvidia-smi fallback for detailed NVIDIA GPU metadata collection
Implement AppleMetalGpuDiscovery for macOS Apple Silicon GPU detection via system_profiler JSON parsing
Add create_gpu_discovery() factory function for platform-specific backend selection (Linux - NVIDIA, macOS - Apple Metal, Windows - NVIDIA)

system/manager.rs:

Import GpuDevice and GpuUsage from opencue_proto::host
Extend MachineGpuStats with gpu_devices: Vec<GpuDevice> for detailed GPU inventory alongside legacy count/memory fields
Extend ProcessStats with gpu_usage: Vec<GpuUsage> for per-device utilization tracking in running frames
Update ProcessStats::default() and ProcessStats::update() to handle new gpu_usage field

system/mod.rs:

Expose gpu module with pub mod gpu

Cargo.toml:

Add optional nvml feature flag for NVML support
Add nvml-wrapper = { version = "0.10", optional = true } dependency

Architecture:

Trait-based abstraction matches Python class hierarchy for consistency
Optional NVML dependency via Cargo features allows compilation without NVIDIA-specific dependencies
Cross-platform design supports Linux (NVIDIA), macOS (Apple Metal), and Windows (NVIDIA) from the start
Backward compatible: retains legacy GPU fields in MachineGpuStats
Reuses opencue_proto::host::{GpuDevice, GpuUsage} proto messages directly

Build with NVML: cargo build --release --features nvml
Build without NVML: cargo build --release (fallback to nvidia-smi)

Remaining integration work tracked in RUST_GPU_IMPLEMENTATION_SUMMARY.md:

Integrate GPU discovery into MachineMonitor
Populate gpu_devices in RenderHost reports
Add CUDA_VISIBLE_DEVICES/NVIDIA_VISIBLE_DEVICES environment variables
Collect per-frame GPU utilization during stats collection

rust/crates/rqd/src/system/linux.rs

DiegoTavares · 2025-10-14T17:46:24Z

Great work so far. My only comment is that the pynvml dependency should be optional.

DiegoTavares · 2025-10-14T17:47:56Z

Besides the protobuf errors:


File recursively imports itself: job.proto -> host.proto -> job.proto  
host.proto:10:1: Import "job.proto" was not found or had errors.  
host.proto:736:5: "job.Frame" is not defined.  
host.proto:754:5: "job.Job" is not defined.  
host.proto:763:5: "job.Layer" is not defined.  
host.proto:825:5: "job.JobSeq" is not defined.  
host.proto:836:5: "job.Group" is not defined.  
job.proto:12:1: Import "host.proto" was not found or had errors.  
job.proto:526:14: "host.GpuUsage" is not defined.  
job.proto:573:14: "host.GpuUsage" is not defined.  
filter.proto:8:1: Import "job.proto" was not found or had errors.  
filter.proto:301:5: "job.Group" is not defined.  
filter.proto:309:5: "job.JobSeq" is not defined.
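
The first error line is the root cause: `job.proto` now imports `host.proto` for `host.GpuUsage`, while `host.proto` already imports `job.proto`. The small Python sketch below reproduces that import graph and finds the cycle. The second graph shows one possible way to break it, by moving the GPU messages into a separate file (here called `gpu.proto`, a hypothetical name); this is an assumption for illustration, not necessarily how the PR resolves it.

```python
def find_cycle(imports):
    """Return one import cycle as a list of files, or None if there is none."""
    visiting = []  # current depth-first path
    done = set()   # files already proven cycle-free

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Close the loop: slice the path from the first occurrence of node.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in imports.get(node, []):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for start in imports:
        cycle = visit(start)
        if cycle:
            return cycle
    return None


# Import graph implied by the protoc errors above.
broken = {
    "job.proto": ["host.proto"],
    "host.proto": ["job.proto"],
    "filter.proto": ["job.proto"],
}

# Hypothetical layout with the GPU messages in their own file.
split = {
    "gpu.proto": [],
    "job.proto": ["gpu.proto"],
    "host.proto": ["job.proto", "gpu.proto"],
    "filter.proto": ["job.proto"],
}
```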

ramonfigueiredo changed the title ~~[rqd/proto] Add robust GPU support with cross-platform discovery and per-device tracking~~ feat(GPU): Production-grade, cross-platform GPU support - vendor-aware scheduling, per-device telemetry, isolation, and docs Oct 9, 2025

ramonfigueiredo force-pushed the feature/robust-gpu-support branch from d698c39 to ae28d59 on October 9, 2025 01:41

ramonfigueiredo force-pushed the feature/robust-gpu-support branch from be760b3 to 6d3e8cc on October 9, 2025 02:12

ramonfigueiredo self-assigned this Oct 9, 2025

DiegoTavares reviewed Oct 14, 2025

rust/crates/rqd/src/system/linux.rs (outdated, resolved)

Update rust/crates/rqd/src/system/linux.rs

d46cf01

Co-authored-by: Diego Tavares <[email protected]> Signed-off-by: Ramon Figueiredo <[email protected]>
