Conversation

@ArangoGutierrez (Collaborator) commented Oct 16, 2025

Overview

⚠️ Being split into multiple PRs ⚠️

Problem Statement

Previous limitations:

  • C-based mock library was difficult to maintain and extend
  • Hardcoded 8-GPU configuration with no flexibility
  • Required manual code changes for different GPU topologies
  • Duplicate implementation logic vs. upstream go-nvml mocks

What this PR solves:

  • ✅ Leverages go-nvml mocks
  • ✅ Provides zero-config quick start
  • ✅ Enables declarative GPU topology via industry-standard CDI specs
  • ✅ Type-safe Go implementation with CGo bridge
  • ✅ Runtime behavior customization without code changes

🏗️ Architecture

High-Level Design

┌────────────────────────────────────────────────────────────┐
│               User / Helm Deployment                        │
├────────────────────────────────────────────────────────────┤
│  Default Mode          │         CDI Mode                   │
│  (zero-config)         │  (declarative topology)            │
│                        │                                    │
│  helm install          │  kubectl create configmap my-spec  │
│    gpu-mock            │    --from-file=spec.yaml=...       │
│                        │  helm install --set cdi.enabled    │
└───────────┬────────────┴────────────┬───────────────────────┘
            │                         │
            └──────────┬──────────────┘
                       │
            ┌──────────▼──────────┐
            │   entrypoint.sh     │  ← Mode Detection
            │  (orchestration)    │
            └──────────┬──────────┘
                       │
        ┌──────────────┴──────────────┐
        │                             │
  ┌─────▼──────┐            ┌────────▼────────┐
  │  Default   │            │   CDI Parser    │
  │   Mode     │            │  (gpu-mockctl)  │
  │            │            │                 │
  │ MOCK_GPU_  │            │ Parse CDI spec  │
  │  ARCH=     │            │ Create /dev/*   │
  │  dgxa100   │            │ Create /proc/*  │
  │ NUM=8      │            │ Set env vars    │
  └─────┬──────┘            └────────┬────────┘
        │                            │
        └──────────┬─────────────────┘
                   │
          ┌────────▼─────────┐
          │  gpu-mockctl     │
          │     driver       │
          │                  │
          │ Deploy library   │
          │ Create devices   │
          └────────┬─────────┘
                   │
          ┌────────▼─────────┐
          │ libnvidia-ml.so  │  ← Go-based CGo bridge
          │  (Go + CGo)      │
          └────────┬─────────┘
                   │
          ┌────────▼─────────┐
          │   Mock Engine    │  ← go-nvml dgxa100.Server
          │  (Go runtime)    │
          │                  │
          │ • 8 A100 GPUs    │
          │ • Handle table   │
          │ • Config via env │
          └──────────────────┘

Component Layers

1. Go-Based NVML Library (Commit 1)

Files: pkg/gpu/mocknvml/{bridge/, engine/}

  • Bridge Layer (bridge/): CGo exports exposing 49 NVML C functions

    • bridge.go: Init, shutdown, system info, error handling
    • device.go: Device enumeration, properties, UUID
    • memory.go: Memory queries (device, BAR1)
    • pci.go: PCI bus information
    • process.go: Process enumeration
    • events.go: Event monitoring stubs
    • mig.go: MIG stubs (not supported)
  • Engine Layer (engine/): Mock server using go-nvml

    • Singleton pattern for lifecycle management
    • Handle table: Maps C pointers ↔ Go objects
    • Configuration via environment variables
    • Reference counting for init/shutdown

Key Benefits:

  • Reuses go-nvml/pkg/nvml/mock/dgxa100 (8 A100 GPUs)
  • Type-safe Go with memory safety
  • Runtime behavior customization via function pointers
  • Easier to maintain than C implementation
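
To make the bridge/engine split concrete, here is a minimal sketch of the pattern (illustrative only, not the actual code in pkg/gpu/mocknvml; the simplified typedefs and helper names are assumptions): an exported CGo function delegates to Go-side state, reference counting balances Init/Shutdown calls, and a handle table hands opaque integer handles to C callers so no Go pointer crosses the CGo boundary.

package main

/*
// Simplified stand-ins for the real NVML types; the actual bridge builds
// against the definitions from nvml.h.
typedef int   nvmlReturn_t;
typedef void* nvmlDevice_t;
*/
import "C"

import (
    "sync"
    "unsafe"
)

var (
    mu       sync.Mutex
    refCount int                 // balances nvmlInit/nvmlShutdown pairs
    handles  = map[uintptr]int{} // opaque handle -> mock device index
    nextID   uintptr = 1
)

//export nvmlInit_v2
func nvmlInit_v2() C.nvmlReturn_t {
    mu.Lock()
    defer mu.Unlock()
    refCount++ // the real engine would lazily start the go-nvml mock server here
    return 0   // NVML_SUCCESS
}

//export nvmlShutdown
func nvmlShutdown() C.nvmlReturn_t {
    mu.Lock()
    defer mu.Unlock()
    if refCount > 0 {
        refCount--
    }
    return 0
}

//export nvmlDeviceGetHandleByIndex_v2
func nvmlDeviceGetHandleByIndex_v2(index C.uint, device *C.nvmlDevice_t) C.nvmlReturn_t {
    mu.Lock()
    defer mu.Unlock()
    h := nextID
    nextID++
    handles[h] = int(index)
    *device = C.nvmlDevice_t(unsafe.Pointer(h)) // opaque value, never a Go pointer
    return 0
}

func main() {} // required for -buildmode=c-shared

Built with something like go build -buildmode=c-shared -o libnvidia-ml.so.1 ., this yields a shared object that C callers can dlopen just like the real library.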

2. CDI Parser (Commit 2)

Files: cmd/gpu-mockctl/commands/cdi.go

Parses CDI specifications and generates mock infrastructure configuration:

gpu-mockctl cdi --spec cdi-spec.yaml --output mock-config.json

Features:

  • Supports CDI v0.5.0 (YAML/JSON)
  • Auto-detects GPU architecture from nvidia.com/gpu.model annotation
  • Generates device nodes (nvidia0-N, nvidiactl, nvidia-uvm*)
  • Creates /proc/driver/nvidia/gpus/* entries
  • Outputs JSON configuration for infrastructure setup
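
For orientation, a minimal spec of the kind this command consumes might look roughly like the following (an illustrative sketch based on the description above, not a copy of the example files in this PR; in particular, the model annotation key is the one referenced above):

cdiVersion: "0.5.0"
kind: nvidia.com/gpu
annotations:
  nvidia.com/gpu.model: "NVIDIA A100-SXM4-40GB"   # used for architecture auto-detection
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
          type: c
  - name: "1"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia1
          type: c
containerEdits:
  deviceNodes:
    - path: /dev/nvidiactl
      type: c
    - path: /dev/nvidia-uvm
      type: c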

3. Entrypoint Orchestration (Commit 3)

Files: deployments/devel/gpu-mock/container/entrypoint.sh

Intelligent mode detection and setup:

# Detects CDI spec presence
if [ -f /config/cdi-spec.yaml ]; then
    # CDI mode: Parse spec, create infrastructure
    gpu-mockctl cdi --spec /config/cdi-spec.yaml
    create_device_nodes
    create_proc_entries
else
    # Default mode: Use built-in dgxa100
    export MOCK_GPU_ARCH=dgxa100
    export MOCK_NVML_NUM_DEVICES=8
fi

# Execute gpu-mockctl driver
exec gpu-mockctl driver --driver-root ...

Logs clearly indicate mode:

[INFO] Operating mode: default
[INFO] Default mode: Using built-in dgxa100 configuration
[INFO] Environment: MOCK_GPU_ARCH=dgxa100, MOCK_NVML_NUM_DEVICES=8

4. Helm Chart Enhancements (Commit 4)

Files: deployments/devel/gpu-mock/helm/gpu-mock/{values.yaml, templates/}

Adds CDI configuration options while maintaining backward compatibility:

# values.yaml
cdi:
  enabled: false                  # Enable CDI mode
  configMapName: ""               # External ConfigMap with CDI spec
  inlineSpec: ""                  # Inline CDI spec in Helm values
  architectureOverride: ""        # Manual arch override

mockDriver:
  architecture: dgxa100           # Default mode architecture
  gpuCount: 8                     # Default mode GPU count

DaemonSet changes:

  • Mounts CDI ConfigMap when enabled
  • Sets environment variables (MOCK_GPU_ARCH, MOCK_NVML_NUM_DEVICES, CDI_SPEC_PATH)
  • Uses entrypoint.sh instead of direct gpu-mockctl call
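
As a sketch of how that wiring could look in the template (illustrative only, not the chart's actual daemonset.yaml), the container section would export the environment variables from values.yaml and conditionally mount the CDI ConfigMap:

# templates/daemonset.yaml (illustrative excerpt)
      containers:
        - name: gpu-mock
          command: ["/entrypoint.sh"]
          env:
            - name: MOCK_GPU_ARCH
              value: {{ .Values.mockDriver.architecture | quote }}
            - name: MOCK_NVML_NUM_DEVICES
              value: {{ .Values.mockDriver.gpuCount | quote }}
            {{- if .Values.cdi.enabled }}
            - name: CDI_SPEC_PATH
              value: /config/cdi-spec.yaml
            {{- end }}
          {{- if .Values.cdi.enabled }}
          volumeMounts:
            - name: cdi-spec
              mountPath: /config
          {{- end }}
      {{- if .Values.cdi.enabled }}
      volumes:
        - name: cdi-spec
          configMap:
            name: {{ .Values.cdi.configMapName }}
      {{- end }}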

5. CDI Spec Examples (Commit 5)

Files: deployments/devel/gpu-mock/examples/cdi-spec-*.yaml

  • cdi-spec-a100-2gpu.yaml: Minimal 2-GPU setup for testing
  • cdi-spec-a100-8gpu.yaml: Full DGX A100 simulation (8 GPUs)

Both follow the CDI v0.5.0 specification, with proper device nodes, annotations, and metadata.


Usage

Quick Start (Default Mode - Zero Config)

# Deploy with Helm - 8 A100 GPUs ready instantly!
helm install gpu-mock ./deployments/devel/gpu-mock/helm/gpu-mock

# Verify GPUs discovered
kubectl get nodes -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}'
# Output: 8

What happens:

  1. Entrypoint detects no CDI spec → default mode
  2. Sets MOCK_GPU_ARCH=dgxa100, MOCK_NVML_NUM_DEVICES=8
  3. Deploys Go-based library
  4. NVIDIA Device Plugin discovers 8 GPUs

CDI Mode (Custom Topology)

# Step 1: Create ConfigMap from CDI spec file
kubectl create configmap my-gpu-spec \
  --from-file=spec.yaml=examples/cdi-spec-a100-2gpu.yaml

# Step 2: Deploy with CDI enabled
helm install gpu-mock ./deployments/devel/gpu-mock/helm/gpu-mock \
  --set cdi.enabled=true \
  --set cdi.configMapName=my-gpu-spec

# Result: 2 GPUs discovered (from CDI spec)
kubectl get nodes -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}'
# Output: 2

What happens:

  1. Entrypoint detects CDI spec at /config/cdi-spec.yaml → CDI mode
  2. Calls gpu-mockctl cdi to parse spec
  3. Creates device nodes and /proc entries dynamically
  4. Sets environment based on CDI content
  5. NVIDIA Device Plugin discovers custom GPU count

Testing

Validated Scenarios

Test | Result | Evidence
Build | ✅ Pass | Clean compilation, no errors
CDI Parser | ✅ Pass | 2-GPU and 8-GPU specs parsed correctly
Default Mode E2E | ✅ Pass | 8 GPUs discovered in Kubernetes
Device Plugin Integration | ✅ Pass | Production NVIDIA tooling works
Entrypoint Logs | ✅ Pass | Clear mode indication and config

Test Commands

# Build and test
cd pkg/gpu/mocknvml
make build-go
make test-go

# E2E test
kind create cluster --name gpu-mock
make -C deployments/devel/gpu-mock build-image load-image
helm install gpu-mock ./deployments/devel/gpu-mock/helm/gpu-mock
kubectl get nodes -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}'

Commit Breakdown

Commit | Purpose | Files
1. GPU mock infrastructure | Complete baseline | All infrastructure code
2. CDI parser | Spec parsing command | cmd/gpu-mockctl/commands/cdi.go
3. Entrypoint | Mode detection | entrypoint.sh, Dockerfile
4. CDI examples | Reference specs | examples/cdi-spec-*.yaml

Each commit is self-contained and tells a clear story of incremental enhancement.


Reviewer Guide

Recommended review order:

  1. Commit 1 - Understand the complete infrastructure

    • Focus on pkg/gpu/mocknvml/{bridge,engine}/ for NVML library
    • Review cmd/gpu-mockctl/ for CLI tools
    • Check deployments/ for Kubernetes integration
    • See how CGo bridges C ↔ Go
    • Understand handle table pattern
  2. Commit 2 - CDI parser logic

    • See how CDI specs are parsed
    • Understand architecture detection
  3. Commit 3 - Integration layer

    • Mode detection in entrypoint
    • Container image updates
  4. Commit 4 - Examples (optional, for testing)

Key files to review:

  • pkg/gpu/mocknvml/bridge/bridge.go - CGo layer
  • pkg/gpu/mocknvml/engine/engine.go - Mock server
  • cmd/gpu-mockctl/commands/driver.go - Driver deployment
  • deployments/.../entrypoint.sh - Mode detection
  • cmd/gpu-mockctl/commands/cdi.go - CDI parser

@ArangoGutierrez self-assigned this Oct 16, 2025

Copilot AI left a comment


Pull Request Overview

This PR introduces a mock NVIDIA GPU environment to support testing without physical GPUs, along with CI wiring. It adds a mock NVML C library, a Go-based “gpu-mockctl” tool to generate a mock driver filesystem, Kubernetes deployment artifacts (Helm chart and static manifests), documentation, and GitHub Actions to test the mock NVML.

  • Adds a production-like mock NVML (C) plus comprehensive tests and build system
  • Introduces gpu-mockctl CLI and Kubernetes deployment (Helm + manifests) to set up mock driver, toolkit, and tests
  • Updates CI workflows and dependencies to build/test these components

Reviewed Changes

Copilot reviewed 73 out of 170 changed files in this pull request and generated 15 comments.

Summary per file:

File | Description
go.mod | Sets module Go version and adds dependencies (cli, go-nvml, x/sys)
pkg/gpu/mocknvml/src/*.c | Implements core NVML APIs and stubs; provides system/device/memory/MIG logic
pkg/gpu/mocknvml/data/devices.h | Mock A100 device inventory and attributes
pkg/gpu/mocknvml/Makefile | Build, test, and info targets for the mock NVML library
pkg/gpu/mocktopo/* | Adds a dgxa100 mock topology provider with an NVML wrapper
pkg/gpu/mockfs/* and pkg/gpu/mockdriver/* | Utilities to create mock /dev nodes, driver tree, and copy NVML libs
cmd/gpu-mockctl/* | Adds a CLI with logging, config, and tests
deployments/devel/gpu-mock/**/* | Dockerfile, Helm chart, static manifests, and test pods
docs/**/* | User guide, API reference, architecture docs
.github/workflows/* | Adds a workflow to compile and test mock NVML and updates CI


Comment on lines +37 to +39
RUN CGO_ENABLED=1 \
go build -a -ldflags '-linkmode external -extldflags "-static"' \
-o /out/gpu-mockctl ./cmd/gpu-mockctl
Copilot AI Oct 16, 2025

Building a cgo-enabled binary with fully static linking typically fails on Debian-based images without static libc (libc6-dev-static) and related static archives. Either remove '-extldflags "-static"' and build dynamically, or install static libc (or use a musl-based image), or set CGO_ENABLED=0 if the binary doesn’t require cgo.

}

// Get driver branch info
nvmlReturn_t DECLDIR nvmlSystemGetDriverBranch(nvmlSystemDriverBranchInfo_t *branchInfo, unsigned int length) {
Copilot AI Oct 16, 2025

nvmlSystemGetDriverBranch writes into the struct pointer directly by casting to char*, which does not respect the nvmlSystemDriverBranchInfo_t layout. It should copy into the struct field (e.g., branchInfo->branch) and bound checks should use the field’s size (or NVML buffer macro), not the raw struct pointer.

Comment on lines +192 to +198

if (length < strlen(branch) + 1) {
return NVML_ERROR_INSUFFICIENT_SIZE;
}

strncpy((char*)branchInfo, branch, length);
((char*)branchInfo)[length - 1] = '\0';
Copilot AI Oct 16, 2025

nvmlSystemGetDriverBranch writes into the struct pointer directly by casting to char*, which does not respect the nvmlSystemDriverBranchInfo_t layout. It should copy into the struct field (e.g., branchInfo->branch) and bound checks should use the field’s size (or NVML buffer macro), not the raw struct pointer.

Suggested change
if (length < strlen(branch) + 1) {
return NVML_ERROR_INSUFFICIENT_SIZE;
}
strncpy((char*)branchInfo, branch, length);
((char*)branchInfo)[length - 1] = '\0';
size_t branch_field_size = sizeof(branchInfo->branch);
// Use the minimum of the provided length and the branch field size for safety
size_t copy_size = (length < branch_field_size) ? length : branch_field_size;
if (copy_size < strlen(branch) + 1) {
return NVML_ERROR_INSUFFICIENT_SIZE;
}
strncpy(branchInfo->branch, branch, copy_size);
branchInfo->branch[copy_size - 1] = '\0';

Comment on lines +187 to +198
if (branchInfo == NULL || length == 0) {
return NVML_ERROR_INVALID_ARGUMENT;
}

const char *branch = "r550_00";

if (length < strlen(branch) + 1) {
return NVML_ERROR_INSUFFICIENT_SIZE;
}

strncpy((char*)branchInfo, branch, length);
((char*)branchInfo)[length - 1] = '\0';
Copilot AI Oct 16, 2025

nvmlSystemGetDriverBranch writes into the struct pointer directly by casting to char*, which does not respect the nvmlSystemDriverBranchInfo_t layout. It should copy into the struct field (e.g., branchInfo->branch) and bound checks should use the field’s size (or NVML buffer macro), not the raw struct pointer.

Suggested change
if (branchInfo == NULL || length == 0) {
return NVML_ERROR_INVALID_ARGUMENT;
}
const char *branch = "r550_00";
if (length < strlen(branch) + 1) {
return NVML_ERROR_INSUFFICIENT_SIZE;
}
strncpy((char*)branchInfo, branch, length);
((char*)branchInfo)[length - 1] = '\0';
if (branchInfo == NULL) {
return NVML_ERROR_INVALID_ARGUMENT;
}
const char *branch = "r550_00";
size_t branch_field_size = sizeof(branchInfo->branch);
if (branch_field_size < strlen(branch) + 1) {
return NVML_ERROR_INSUFFICIENT_SIZE;
}
strncpy(branchInfo->branch, branch, branch_field_size);
branchInfo->branch[branch_field_size - 1] = '\0';

Comment on lines +174 to +186
const mock_device_info_t *dev = &mock_devices[index];

pci->domain = dev->pci_domain;
pci->bus = dev->pci_bus;
pci->device = dev->pci_device;
pci->pciDeviceId = dev->pci_device_id;
pci->pciSubSystemId = dev->pci_subsystem_id;

strncpy(pci->busId, dev->pci_bus_id_legacy, sizeof(pci->busId));
pci->busId[sizeof(pci->busId) - 1] = '\0';

strncpy(pci->busIdLegacy, dev->pci_bus_id_legacy, sizeof(pci->busIdLegacy));
pci->busIdLegacy[sizeof(pci->busIdLegacy) - 1] = '\0';
Copilot AI Oct 16, 2025

pci->busId is being populated with the legacy bus ID. Use dev->pci_bus_id (extended format) for pci->busId, and keep dev->pci_bus_id_legacy for pci->busIdLegacy to align with NVML’s expected fields.

Comment on lines +24 to +27
func TestLogger(t *testing.T) {
// Capture log output
var buf bytes.Buffer
log.SetOutput(&buf)
Copilot AI Oct 16, 2025

The test redirects the stdlib 'log' output, but the logger writes to its own configured io.Writer (default os.Stderr). As a result, the buffer never captures Info/Warning/Error output. Construct the logger with NewWithConfig and set Output to &buf to correctly capture logs in the test.

defer log.SetOutput(os.Stderr)

// Test non-verbose logger
l := New("test", false)
Copilot AI Oct 16, 2025

The test redirects the stdlib 'log' output, but the logger writes to its own configured io.Writer (default os.Stderr). As a result, the buffer never captures Info/Warning/Error output. Construct the logger with NewWithConfig and set Output to &buf to correctly capture logs in the test.

Comment on lines +37 to +39
if len(cmd.Commands) != len(expectedCommands) {
t.Errorf("Expected %d commands, got %d", len(expectedCommands), len(cmd.Commands))
}
Copilot AI Oct 16, 2025

This asserts an exact command count of 2, but the root command also includes 'driver' and 'all', causing the test to fail despite being correct. Remove the strict length equality check or expand 'expectedCommands' to include all registered subcommands.

Suggested change
if len(cmd.Commands) != len(expectedCommands) {
t.Errorf("Expected %d commands, got %d", len(expectedCommands), len(cmd.Commands))
}

Comment on lines +43 to +48
driverVer := "550.54.15"
files := []string{
"libnvidia-ml.so." + driverVer,
"libnvidia-ml.so.1",
"libnvidia-ml.so",
}
Copilot AI Oct 16, 2025

[nitpick] The driver version string '550.54.15' is duplicated across the codebase (e.g., Makefiles, tree.go, docs). Consider centralizing this into a single constant or configuration to avoid drift and ease upgrades.

Comment on lines +39 to +40
// Match dgxa100 mock driver version
driverVer := "550.54.15"
Copilot AI Oct 16, 2025

[nitpick] Hardcoding the driver version here duplicates the same string used elsewhere. Prefer a shared constant or config to keep all references in sync.

@ArangoGutierrez changed the title from "Update GitHub Actions" to "Add GPU Mock Infrastructure" on Oct 16, 2025

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 73 out of 170 changed files in this pull request and generated 9 comments.



Comment on lines +58 to +64
static const mock_device_info_t mock_devices[8] = {
{
.uuid = "GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c",
.name = "NVIDIA A100-SXM4-40GB",
.pci_bus_id = "00000000:00:00.0",
.pci_bus_id_legacy = "0000:00:00.0",
.serial = "1563221000001",
Copilot AI Oct 16, 2025

nvmlDeviceGetCudaComputeCapability reads mock_devices[idx].cuda_compute_capability_major/minor, but device 0 and device 7 do not initialize these fields here. That will return 0,0 (and break tests expecting 8.0). Initialize cuda_compute_capability_major=8 and cuda_compute_capability_minor=0 for all 8 devices.

Comment on lines +82 to +86
.power_limit = 400000, // 400W
.clock_graphics = 1410, // 1410 MHz
.clock_sm = 1410, // 1410 MHz
.clock_memory = 1593, // 1593 MHz
},
Copilot AI Oct 16, 2025

nvmlDeviceGetCudaComputeCapability reads mock_devices[idx].cuda_compute_capability_major/minor, but device 0 and device 7 do not initialize these fields here. That will return 0,0 (and break tests expecting 8.0). Initialize cuda_compute_capability_major=8 and cuda_compute_capability_minor=0 for all 8 devices.

Comment on lines +290 to +294
.power_limit = 400000,
.clock_graphics = 1410,
.clock_sm = 1410,
.clock_memory = 1593,
}
Copilot AI Oct 16, 2025

nvmlDeviceGetCudaComputeCapability reads mock_devices[idx].cuda_compute_capability_major/minor, but device 0 and device 7 do not initialize these fields here. That will return 0,0 (and break tests expecting 8.0). Initialize cuda_compute_capability_major=8 and cuda_compute_capability_minor=0 for all 8 devices.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <nvml.h>
Copilot AI Oct 16, 2025

This test includes nvml.h via system include paths, but the Makefile compiles it without adding the repository's include directory. On hosts without system NVML headers, compilation will fail. Either change the include to "../include/nvml.h" (like the comprehensive test) or add an include path (e.g., -I./include) in the test build rule.

Suggested change
#include <nvml.h>
#include "../include/nvml.h"


// Enumerate devices
for (unsigned int i = 0; i < deviceCount && i < 8; i++) {
nvmlDevice_t device;
Copilot AI Oct 16, 2025

This test includes nvml.h via system include paths, but the Makefile compiles it without adding the repository's include directory. On hosts without system NVML headers, compilation will fail. Either change the include to "../include/nvml.h" (like the comprehensive test) or add an include path (e.g., -I./include) in the test build rule.

Comment on lines 76 to 90
$(CC) -o $(BUILD_DIR)/test/test_nvml test/test_nvml.c -L$(LIB_DIR) -lnvidia-ml -Wl,-rpath,$(LIB_DIR)
@echo "Running basic test..."
$(BUILD_DIR)/test/test_nvml

test-comprehensive: all
@echo "Building comprehensive test suite..."
@mkdir -p $(BUILD_DIR)/test
$(CC) -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
@echo "Running comprehensive tests..."
$(BUILD_DIR)/test/test_nvml_comprehensive

test-valgrind: all
@echo "Running tests with valgrind..."
@mkdir -p $(BUILD_DIR)/test
$(CC) -g -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
Copilot AI Oct 16, 2025

Compilation of test/test_nvml.c does not add the mock NVML header path, causing include <nvml.h> to fail on systems without system headers. Add -I./include (or reuse $(CFLAGS)) to the compile invocation, e.g., '$(CC) $(CFLAGS) -I./include -o ...'.

Suggested change
$(CC) -o $(BUILD_DIR)/test/test_nvml test/test_nvml.c -L$(LIB_DIR) -lnvidia-ml -Wl,-rpath,$(LIB_DIR)
@echo "Running basic test..."
$(BUILD_DIR)/test/test_nvml
test-comprehensive: all
@echo "Building comprehensive test suite..."
@mkdir -p $(BUILD_DIR)/test
$(CC) -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
@echo "Running comprehensive tests..."
$(BUILD_DIR)/test/test_nvml_comprehensive
test-valgrind: all
@echo "Running tests with valgrind..."
@mkdir -p $(BUILD_DIR)/test
$(CC) -g -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
$(CC) $(CFLAGS) -o $(BUILD_DIR)/test/test_nvml test/test_nvml.c -L$(LIB_DIR) -lnvidia-ml -Wl,-rpath,$(LIB_DIR)
@echo "Running basic test..."
$(BUILD_DIR)/test/test_nvml
test-comprehensive: all
@echo "Building comprehensive test suite..."
@mkdir -p $(BUILD_DIR)/test
$(CC) $(CFLAGS) -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
@echo "Running comprehensive tests..."
$(BUILD_DIR)/test/test_nvml_comprehensive
test-valgrind: all
@echo "Running tests with valgrind..."
@mkdir -p $(BUILD_DIR)/test
$(CC) $(CFLAGS) -g -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)

Comment on lines +89 to +90
"unsupported MACHINE_TYPE %q (only 'dgxa100'); set "+
"ALLOW_UNSUPPORTED=true to use fallback",
Copilot AI Oct 16, 2025

The error suggests setting ALLOW_UNSUPPORTED=true to use a fallback, but this function does not implement any fallback behavior. Either implement the fallback (e.g., calling NewFallback with a default GPU count) when ALLOW_UNSUPPORTED is set, or remove the guidance from the error message to avoid misleading users.

Suggested change
"unsupported MACHINE_TYPE %q (only 'dgxa100'); set "+
"ALLOW_UNSUPPORTED=true to use fallback",
"unsupported MACHINE_TYPE %q (only 'dgxa100' is supported)",

Comment on lines +23 to +30
{{- with .Values.containerToolkit.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- if .Values.containerToolkit.requireMockDriver }}
nodeSelector:
nvidia.com/gpu.present: "true"
{{- end }}
Copilot AI Oct 16, 2025

nodeSelector is rendered twice; the latter block will overwrite the former or produce duplicate keys in YAML. Merge selectors instead (e.g., always render a single nodeSelector map and conditionally add nvidia.com/gpu.present when required).

Suggested change
{{- with .Values.containerToolkit.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- if .Values.containerToolkit.requireMockDriver }}
nodeSelector:
nvidia.com/gpu.present: "true"
{{- end }}
{{- /*
Merge .Values.containerToolkit.nodeSelector (if any) with
nvidia.com/gpu.present: "true" if .Values.containerToolkit.requireMockDriver is true.
This ensures only one nodeSelector block is rendered.
*/ -}}
{{- $baseSelector := .Values.containerToolkit.nodeSelector | default (dict) -}}
{{- if .Values.containerToolkit.requireMockDriver -}}
{{- $mergedSelector := merge $baseSelector (dict "nvidia.com/gpu.present" "true") -}}
nodeSelector:
{{- toYaml $mergedSelector | nindent 8 }}
{{- else if $baseSelector -}}
nodeSelector:
{{- toYaml $baseSelector | nindent 8 }}
{{- end }}


# Label node for GPU support
- name: label-node
image: bitnami/kubectl:latest
Copilot AI Oct 16, 2025

Using the 'latest' tag makes deployments non-reproducible and can break unexpectedly. Pin to a specific, tested kubectl image tag (e.g., bitnami/kubectl:1.30.2) and keep it updated intentionally.

Suggested change
image: bitnami/kubectl:latest
image: bitnami/kubectl:1.30.2

@elezar (Member) left a comment


This is quite difficult to review without a high-level overview of how the pieces fit together (PR Description?).

Also, in the case of the "mock" libnvidia-ml.so library, I don't think we should be reimplementing a mock in C. We should take our existing mocks implemented in Go and build these as a C library exposing the "correct" API. This library should also have SOME mechanism to be changed to handle different infrastructures and behaviours. This need not be the initial implementation, but some thought should be given as to how we would inject failures etc. into the library.

On the filesystem changes: does it make sense to take a CDI spec as input and use this to populate the filesystem in some way? This should in theory mean that if we use the mock filesystem (and mocked libnvidia-ml.so) as the input to CDI spec generation (as we would do in the device plugin), we get the input CDI spec back.

On the proc path: if we limit ourselves to running in a container, then we may not need to worry about the "driver root" and where we create "/proc"? In this case we could manipulate the procfs in the container to look the way we need it to and ALSO create the required device nodes (this is not something that I've validated, though).


Copilot AI left a comment


Pull Request Overview

Copilot reviewed 93 out of 186 changed files in this pull request and generated 2 comments.



Comment on lines +114 to +115
.cuda_compute_capability_major = 8,
.cuda_compute_capability_minor = 0
Copilot AI Nov 11, 2025

The CUDA compute capability fields are missing for device 0 (lines 54-86) while being present for other devices (e.g., lines 114-115). This inconsistency could lead to unexpected behavior if these fields are accessed for device 0.

.clock_sm = 1410,
.clock_memory = 1593,
.cuda_compute_capability_major = 8,
.cuda_compute_capability_minor = 0
Copilot AI Nov 11, 2025

The CUDA compute capability fields are missing for the last device (device 7, lines 268-295). This inconsistency with other devices should be corrected to ensure uniform device properties.

Member

Should we not remove the C source files?

Member

Why are the built artifacts in the repo? Should there be a Makefile to build them instead?

Collaborator Author

I need to add them to the .gitignore; this is just left over from my testing.

Member

Is there a specific reason why this is a submodule?

This commit introduces a complete Go-based mock NVML library that
replaces the previous C implementation. Key improvements:

Architecture:
- Bridge layer (pkg/gpu/mocknvml/bridge/): CGo exports exposing 49
  NVML C functions with proper type conversions and error handling
- Engine layer (pkg/gpu/mocknvml/engine/): Mock server using
  go-nvml's dgxa100.Server with singleton pattern and handle table
- Test suite: C test programs verifying library functionality

Implementation highlights:
- Leverages upstream go-nvml mock implementations (dgxa100)
- Type-safe Go code with memory-safe handle management
- Runtime configuration via environment variables
- Supports NVIDIA Device Plugin and DRA Driver requirements

Build system:
- Unified Makefile with both C and Go targets
- CGo-based shared library generation (libnvidia-ml.so)
- Proper library versioning and symlinking
- Cross-platform support (Linux, macOS via Docker)

Testing:
- Integration with existing gpu-mockctl infrastructure
- Kubernetes deployment via Helm charts
- E2E validation with NVIDIA Device Plugin

This provides a maintainable foundation for mock GPU testing in
Kubernetes environments without requiring physical hardware.

Implements the 'gpu-mockctl cdi' command to parse Container Device
Interface (CDI) specifications and extract mock GPU infrastructure
configuration.

Features:
- Parses CDI v0.5.0 specifications (YAML/JSON)
- Auto-detects GPU architecture from nvidia.com/gpu.model annotation
- Extracts device nodes (nvidia0-N, nvidiactl, nvidia-uvm*)
- Generates /proc/driver/nvidia/gpus/* entries
- Outputs simplified JSON configuration for infrastructure setup

Command usage:
  gpu-mockctl cdi --spec cdi-spec.yaml --output mock-config.json

This enables declarative GPU topology definition via industry-standard
CDI format, providing flexibility for testing different hardware
configurations without code changes.

Adds an intelligent entrypoint that detects and configures the mock
GPU infrastructure based on the operating mode.

Operating modes:
1. Default mode: Uses built-in dgxa100 configuration (8 A100 GPUs)
   - Zero configuration required
   - Activated when no CDI spec is present
   - Sets MOCK_GPU_ARCH=dgxa100, MOCK_NVML_NUM_DEVICES=8

2. CDI mode: Uses declarative CDI specification
   - Activated when CDI spec file is detected
   - Calls gpu-mockctl cdi to parse specification
   - Creates device nodes dynamically
   - Sets up /proc filesystem entries
   - Configures environment from CDI content

The entrypoint provides clear logging of the selected mode and
configuration, then executes the gpu-mockctl driver command.

Container updates:
- Modified Dockerfile to include jq for JSON parsing
- Set entrypoint.sh as container ENTRYPOINT
- Added infrastructure setup logic before driver execution

Provides reference CDI specifications for testing different GPU
configurations.

Files:
- cdi-spec-a100-2gpu.yaml: Minimal 2-GPU A100 configuration
  * Useful for quick testing and validation
  * Lower resource footprint
  * Demonstrates minimal CDI structure

- cdi-spec-a100-8gpu.yaml: Full DGX A100 configuration
  * Simulates complete DGX A100 system (8 GPUs)
  * Matches default dgxa100 behavior
  * Comprehensive device node setup

Both specifications follow CDI v0.5.0 standard and include:
- Proper device node definitions (nvidia0-N, nvidiactl, nvidia-uvm*)
- GPU model annotations for architecture detection
- Container edit specifications
- Required metadata

Usage:
kubectl create configmap my-spec \
  --from-file=spec.yaml=examples/cdi-spec-a100-2gpu.yaml
helm install gpu-mock ./helm/gpu-mock \
  --set cdi.enabled=true \
  --set cdi.configMapName=my-spec

@elezar (Member) commented Nov 12, 2025

As a general note here, it feels as if we're trying to do too many things in a single PR.

Am I correct in stating the core problem of this project as "providing a representation of a GPU-enabled system to allow us to test our own components without actually having access to GPUs"? What are the use cases that we would like to cover here?

For this PR specifically, there seem to be two core work items here (not in order of priority):

  1. Implementing a "mock" libnvidia-ml.so that we can add to (for example) a kind worker node and expect the k8s-device-plugin to "work" for some subset of examples.
  2. Generating a filesystem that is equivalent to one where the driver is installed so that we can, for example, test CDI spec generation.

Other non-core questions include:

  • How to make these consumable by systems that we are looking to test.

I would argue that each of these is large enough to be its own PR.

@ArangoGutierrez (Collaborator, Author)

> As a general note here, it feels as if we're trying to do too many things in a single PR.
>
> Am I correct in stating the core problem of this project as "providing a representation of a GPU-enabled system to allow us to test our own components without actually having access to GPUs"? What are the use cases that we would like to cover here?
>
> For this PR specifically, there seem to be two core work items here (not in order of priority):
>
>   1. Implementing a "mock" libnvidia-ml.so that we can add to (for example) a kind worker node and expect the k8s-device-plugin to "work" for some subset of examples.
>   2. Generating a filesystem that is equivalent to one where the driver is installed so that we can, for example, test CDI spec generation.
>
> Other non-core questions include:
>
>   • How to make these consumable by systems that we are looking to test.
>
> I would argue that each of these is large enough to be its own PR.

This is good feedback; I'll break this down into multiple PRs.
