Conversation

@ArangoGutierrez (Collaborator) commented Oct 16, 2025

Overview

⚠️ Being split into multiple PRs ⚠️

Problem Statement

Previous limitations:

  • C-based mock library was difficult to maintain and extend
  • Hardcoded 8-GPU configuration with no flexibility
  • Required manual code changes for different GPU topologies
  • Duplicate implementation logic vs. upstream go-nvml mocks

What this PR solves:

  • ✅ Leverages go-nvml mocks
  • ✅ Provides zero-config quick start
  • ✅ Enables declarative GPU topology via industry-standard CDI specs
  • ✅ Type-safe Go implementation with CGo bridge
  • ✅ Runtime behavior customization without code changes

🏗️ Architecture

High-Level Design

┌────────────────────────────────────────────────────────────┐
│               User / Helm Deployment                        │
├────────────────────────────────────────────────────────────┤
│  Default Mode          │         CDI Mode                   │
│  (zero-config)         │  (declarative topology)            │
│                        │                                    │
│  helm install          │  kubectl create configmap my-spec  │
│    gpu-mock            │    --from-file=spec.yaml=...       │
│                        │  helm install --set cdi.enabled    │
└───────────┬────────────┴────────────┬───────────────────────┘
            │                         │
            └──────────┬──────────────┘
                       │
            ┌──────────▼──────────┐
            │   entrypoint.sh     │  ← Mode Detection
            │  (orchestration)    │
            └──────────┬──────────┘
                       │
        ┌──────────────┴──────────────┐
        │                             │
  ┌─────▼──────┐            ┌────────▼────────┐
  │  Default   │            │   CDI Parser    │
  │   Mode     │            │  (gpu-mockctl)  │
  │            │            │                 │
  │ MOCK_GPU_  │            │ Parse CDI spec  │
  │  ARCH=     │            │ Create /dev/*   │
  │  dgxa100   │            │ Create /proc/*  │
  │ NUM=8      │            │ Set env vars    │
  └─────┬──────┘            └────────┬────────┘
        │                            │
        └──────────┬─────────────────┘
                   │
          ┌────────▼─────────┐
          │  gpu-mockctl     │
          │     driver       │
          │                  │
          │ Deploy library   │
          │ Create devices   │
          └────────┬─────────┘
                   │
          ┌────────▼─────────┐
          │ libnvidia-ml.so  │  ← Go-based CGo bridge
          │  (Go + CGo)      │
          └────────┬─────────┘
                   │
          ┌────────▼─────────┐
          │   Mock Engine    │  ← go-nvml dgxa100.Server
          │  (Go runtime)    │
          │                  │
          │ • 8 A100 GPUs    │
          │ • Handle table   │
          │ • Config via env │
          └──────────────────┘

Component Layers

1. Go-Based NVML Library (Commit 1)

Files: pkg/gpu/mocknvml/{bridge/, engine/}

  • Bridge Layer (bridge/): CGo exports exposing 49 NVML C functions

    • bridge.go: Init, shutdown, system info, error handling
    • device.go: Device enumeration, properties, UUID
    • memory.go: Memory queries (device, BAR1)
    • pci.go: PCI bus information
    • process.go: Process enumeration
    • events.go: Event monitoring stubs
    • mig.go: MIG stubs (not supported)
  • Engine Layer (engine/): Mock server using go-nvml

    • Singleton pattern for lifecycle management
    • Handle table: Maps C pointers ↔ Go objects
    • Configuration via environment variables
    • Reference counting for init/shutdown

Key Benefits:

  • Reuses go-nvml/pkg/nvml/mock/dgxa100 (8 A100 GPUs)
  • Type-safe Go with memory safety
  • Runtime behavior customization via function pointers
  • Easier to maintain than C implementation
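
To make the bridge/engine split concrete, here is a minimal sketch of the pattern (illustrative only, not the actual code in pkg/gpu/mocknvml; the simplified typedefs and helper names are assumptions): an exported CGo function delegates to Go-side state, reference counting balances Init/Shutdown calls, and a handle table hands opaque integer handles to C callers so no Go pointer crosses the CGo boundary.

package main

/*
// Simplified stand-ins for the real NVML types; the actual bridge builds
// against the definitions from nvml.h.
typedef int   nvmlReturn_t;
typedef void* nvmlDevice_t;
*/
import "C"

import (
    "sync"
    "unsafe"
)

var (
    mu       sync.Mutex
    refCount int                 // balances nvmlInit/nvmlShutdown pairs
    handles  = map[uintptr]int{} // opaque handle -> mock device index
    nextID   uintptr = 1
)

//export nvmlInit_v2
func nvmlInit_v2() C.nvmlReturn_t {
    mu.Lock()
    defer mu.Unlock()
    refCount++ // the real engine would lazily start the go-nvml mock server here
    return 0   // NVML_SUCCESS
}

//export nvmlShutdown
func nvmlShutdown() C.nvmlReturn_t {
    mu.Lock()
    defer mu.Unlock()
    if refCount > 0 {
        refCount--
    }
    return 0
}

//export nvmlDeviceGetHandleByIndex_v2
func nvmlDeviceGetHandleByIndex_v2(index C.uint, device *C.nvmlDevice_t) C.nvmlReturn_t {
    mu.Lock()
    defer mu.Unlock()
    h := nextID
    nextID++
    handles[h] = int(index)
    *device = C.nvmlDevice_t(unsafe.Pointer(h)) // opaque value, never a Go pointer
    return 0
}

func main() {} // required for -buildmode=c-shared

Built with something like go build -buildmode=c-shared -o libnvidia-ml.so.1 ., this yields a shared object that C callers can dlopen just like the real library.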

2. CDI Parser (Commit 2)

Files: cmd/gpu-mockctl/commands/cdi.go

Parses CDI specifications and generates mock infrastructure configuration:

gpu-mockctl cdi --spec cdi-spec.yaml --output mock-config.json

Features:

  • Supports CDI v0.5.0 (YAML/JSON)
  • Auto-detects GPU architecture from nvidia.com/gpu.model annotation
  • Generates device nodes (nvidia0-N, nvidiactl, nvidia-uvm*)
  • Creates /proc/driver/nvidia/gpus/* entries
  • Outputs JSON configuration for infrastructure setup
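
For orientation, a minimal spec of the kind this command consumes might look roughly like the following (an illustrative sketch based on the description above, not a copy of the example files in this PR; in particular, the model annotation key is the one referenced above):

cdiVersion: "0.5.0"
kind: nvidia.com/gpu
annotations:
  nvidia.com/gpu.model: "NVIDIA A100-SXM4-40GB"   # used for architecture auto-detection
devices:
  - name: "0"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia0
          type: c
  - name: "1"
    containerEdits:
      deviceNodes:
        - path: /dev/nvidia1
          type: c
containerEdits:
  deviceNodes:
    - path: /dev/nvidiactl
      type: c
    - path: /dev/nvidia-uvm
      type: c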

3. Entrypoint Orchestration (Commit 3)

Files: deployments/devel/gpu-mock/container/entrypoint.sh

Intelligent mode detection and setup:

# Detects CDI spec presence
if [ -f /config/cdi-spec.yaml ]; then
    # CDI mode: Parse spec, create infrastructure
    gpu-mockctl cdi --spec /config/cdi-spec.yaml
    create_device_nodes
    create_proc_entries
else
    # Default mode: Use built-in dgxa100
    export MOCK_GPU_ARCH=dgxa100
    export MOCK_NVML_NUM_DEVICES=8
fi

# Execute gpu-mockctl driver
exec gpu-mockctl driver --driver-root ...

Logs clearly indicate mode:

[INFO] Operating mode: default
[INFO] Default mode: Using built-in dgxa100 configuration
[INFO] Environment: MOCK_GPU_ARCH=dgxa100, MOCK_NVML_NUM_DEVICES=8

4. Helm Chart Enhancements (Commit 4)

Files: deployments/devel/gpu-mock/helm/gpu-mock/{values.yaml, templates/}

Adds CDI configuration options while maintaining backward compatibility:

# values.yaml
cdi:
  enabled: false                  # Enable CDI mode
  configMapName: ""               # External ConfigMap with CDI spec
  inlineSpec: ""                  # Inline CDI spec in Helm values
  architectureOverride: ""        # Manual arch override

mockDriver:
  architecture: dgxa100           # Default mode architecture
  gpuCount: 8                     # Default mode GPU count

DaemonSet changes:

  • Mounts CDI ConfigMap when enabled
  • Sets environment variables (MOCK_GPU_ARCH, MOCK_NVML_NUM_DEVICES, CDI_SPEC_PATH)
  • Uses entrypoint.sh instead of direct gpu-mockctl call
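
As a sketch of how that wiring could look in the template (illustrative only, not the chart's actual daemonset.yaml), the container section would export the environment variables from values.yaml and conditionally mount the CDI ConfigMap:

# templates/daemonset.yaml (illustrative excerpt)
      containers:
        - name: gpu-mock
          command: ["/entrypoint.sh"]
          env:
            - name: MOCK_GPU_ARCH
              value: {{ .Values.mockDriver.architecture | quote }}
            - name: MOCK_NVML_NUM_DEVICES
              value: {{ .Values.mockDriver.gpuCount | quote }}
            {{- if .Values.cdi.enabled }}
            - name: CDI_SPEC_PATH
              value: /config/cdi-spec.yaml
            {{- end }}
          {{- if .Values.cdi.enabled }}
          volumeMounts:
            - name: cdi-spec
              mountPath: /config
          {{- end }}
      {{- if .Values.cdi.enabled }}
      volumes:
        - name: cdi-spec
          configMap:
            name: {{ .Values.cdi.configMapName }}
      {{- end }}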

5. CDI Spec Examples (Commit 5)

Files: deployments/devel/gpu-mock/examples/cdi-spec-*.yaml

  • cdi-spec-a100-2gpu.yaml: Minimal 2-GPU setup for testing
  • cdi-spec-a100-8gpu.yaml: Full DGX A100 simulation (8 GPUs)

Both follow the CDI v0.5.0 specification, with proper device nodes, annotations, and metadata.


Usage

Quick Start (Default Mode - Zero Config)

# Deploy with Helm - 8 A100 GPUs ready instantly!
helm install gpu-mock ./deployments/devel/gpu-mock/helm/gpu-mock

# Verify GPUs discovered
kubectl get nodes -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}'
# Output: 8

What happens:

  1. Entrypoint detects no CDI spec → default mode
  2. Sets MOCK_GPU_ARCH=dgxa100, MOCK_NVML_NUM_DEVICES=8
  3. Deploys Go-based library
  4. NVIDIA Device Plugin discovers 8 GPUs

CDI Mode (Custom Topology)

# Step 1: Create ConfigMap from CDI spec file
kubectl create configmap my-gpu-spec \
  --from-file=spec.yaml=examples/cdi-spec-a100-2gpu.yaml

# Step 2: Deploy with CDI enabled
helm install gpu-mock ./deployments/devel/gpu-mock/helm/gpu-mock \
  --set cdi.enabled=true \
  --set cdi.configMapName=my-gpu-spec

# Result: 2 GPUs discovered (from CDI spec)
kubectl get nodes -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}'
# Output: 2

What happens:

  1. Entrypoint detects CDI spec at /config/cdi-spec.yaml → CDI mode
  2. Calls gpu-mockctl cdi to parse spec
  3. Creates device nodes and /proc entries dynamically
  4. Sets environment based on CDI content
  5. NVIDIA Device Plugin discovers custom GPU count

Testing

Validated Scenarios

Test | Result | Evidence
Build | ✅ Pass | Clean compilation, no errors
CDI Parser | ✅ Pass | 2-GPU and 8-GPU specs parsed correctly
Default Mode E2E | ✅ Pass | 8 GPUs discovered in Kubernetes
Device Plugin Integration | ✅ Pass | Production NVIDIA tooling works
Entrypoint Logs | ✅ Pass | Clear mode indication and config

Test Commands

# Build and test
cd pkg/gpu/mocknvml
make build-go
make test-go

# E2E test
kind create cluster --name gpu-mock
make -C deployments/devel/gpu-mock build-image load-image
helm install gpu-mock ./deployments/devel/gpu-mock/helm/gpu-mock
kubectl get nodes -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}'

Commit Breakdown

Commit | Purpose | Files
1. GPU mock infrastructure | Complete baseline | All infrastructure code
2. CDI parser | Spec parsing command | cmd/gpu-mockctl/commands/cdi.go
3. Entrypoint | Mode detection | entrypoint.sh, Dockerfile
4. CDI examples | Reference specs | examples/cdi-spec-*.yaml

Each commit is self-contained and tells a clear story of incremental enhancement.


Reviewer Guide

Recommended review order:

  1. Commit 1 - Understand the complete infrastructure

    • Focus on pkg/gpu/mocknvml/{bridge,engine}/ for NVML library
    • Review cmd/gpu-mockctl/ for CLI tools
    • Check deployments/ for Kubernetes integration
    • See how CGo bridges C ↔ Go
    • Understand handle table pattern
  2. Commit 2 - CDI parser logic

    • See how CDI specs are parsed
    • Understand architecture detection
  3. Commit 3 - Integration layer

    • Mode detection in entrypoint
    • Container image updates
  4. Commit 4 - Examples (optional, for testing)

Key files to review:

  • pkg/gpu/mocknvml/bridge/bridge.go - CGo layer
  • pkg/gpu/mocknvml/engine/engine.go - Mock server
  • cmd/gpu-mockctl/commands/driver.go - Driver deployment
  • deployments/.../entrypoint.sh - Mode detection
  • cmd/gpu-mockctl/commands/cdi.go - CDI parser

@ArangoGutierrez self-assigned this Oct 16, 2025

Copilot AI left a comment


Pull Request Overview

This PR introduces a mock NVIDIA GPU environment to support testing without physical GPUs, along with CI wiring. It adds a mock NVML C library, a Go-based “gpu-mockctl” tool to generate a mock driver filesystem, Kubernetes deployment artifacts (Helm chart and static manifests), documentation, and GitHub Actions to test the mock NVML.

  • Adds a production-like mock NVML (C) plus comprehensive tests and build system
  • Introduces gpu-mockctl CLI and Kubernetes deployment (Helm + manifests) to set up mock driver, toolkit, and tests
  • Updates CI workflows and dependencies to build/test these components

Reviewed Changes

Copilot reviewed 73 out of 170 changed files in this pull request and generated 15 comments.

Summary per file:

File | Description
go.mod | Sets module Go version and adds dependencies (cli, go-nvml, x/sys)
pkg/gpu/mocknvml/src/*.c | Implements core NVML APIs and stubs; provides system/device/memory/MIG logic
pkg/gpu/mocknvml/data/devices.h | Mock A100 device inventory and attributes
pkg/gpu/mocknvml/Makefile | Build, test, and info targets for the mock NVML library
pkg/gpu/mocktopo/* | Adds a dgxa100 mock topology provider with an NVML wrapper
pkg/gpu/mockfs/* and pkg/gpu/mockdriver/* | Utilities to create mock /dev nodes, driver tree, and copy NVML libs
cmd/gpu-mockctl/* | Adds a CLI with logging, config, and tests
deployments/devel/gpu-mock/**/* | Dockerfile, Helm chart, static manifests, and test pods
docs/**/* | User guide, API reference, architecture docs
.github/workflows/* | Adds a workflow to compile and test mock NVML and updates CI


Comment on lines +37 to +39
RUN CGO_ENABLED=1 \
go build -a -ldflags '-linkmode external -extldflags "-static"' \
-o /out/gpu-mockctl ./cmd/gpu-mockctl
Copilot AI Oct 16, 2025

Building a cgo-enabled binary with fully static linking typically fails on Debian-based images without static libc (libc6-dev-static) and related static archives. Either remove '-extldflags "-static"' and build dynamically, or install static libc (or use a musl-based image), or set CGO_ENABLED=0 if the binary doesn’t require cgo.

}

// Get driver branch info
nvmlReturn_t DECLDIR nvmlSystemGetDriverBranch(nvmlSystemDriverBranchInfo_t *branchInfo, unsigned int length) {
Copilot AI Oct 16, 2025

nvmlSystemGetDriverBranch writes into the struct pointer directly by casting to char*, which does not respect the nvmlSystemDriverBranchInfo_t layout. It should copy into the struct field (e.g., branchInfo->branch) and bound checks should use the field’s size (or NVML buffer macro), not the raw struct pointer.

Comment on lines +192 to +198

if (length < strlen(branch) + 1) {
return NVML_ERROR_INSUFFICIENT_SIZE;
}

strncpy((char*)branchInfo, branch, length);
((char*)branchInfo)[length - 1] = '\0';
Copilot AI Oct 16, 2025

nvmlSystemGetDriverBranch writes into the struct pointer directly by casting to char*, which does not respect the nvmlSystemDriverBranchInfo_t layout. It should copy into the struct field (e.g., branchInfo->branch) and bound checks should use the field’s size (or NVML buffer macro), not the raw struct pointer.

Suggested change
if (length < strlen(branch) + 1) {
return NVML_ERROR_INSUFFICIENT_SIZE;
}
strncpy((char*)branchInfo, branch, length);
((char*)branchInfo)[length - 1] = '\0';
size_t branch_field_size = sizeof(branchInfo->branch);
// Use the minimum of the provided length and the branch field size for safety
size_t copy_size = (length < branch_field_size) ? length : branch_field_size;
if (copy_size < strlen(branch) + 1) {
return NVML_ERROR_INSUFFICIENT_SIZE;
}
strncpy(branchInfo->branch, branch, copy_size);
branchInfo->branch[copy_size - 1] = '\0';

Comment on lines +187 to +198
if (branchInfo == NULL || length == 0) {
return NVML_ERROR_INVALID_ARGUMENT;
}

const char *branch = "r550_00";

if (length < strlen(branch) + 1) {
return NVML_ERROR_INSUFFICIENT_SIZE;
}

strncpy((char*)branchInfo, branch, length);
((char*)branchInfo)[length - 1] = '\0';
Copilot AI Oct 16, 2025

nvmlSystemGetDriverBranch writes into the struct pointer directly by casting to char*, which does not respect the nvmlSystemDriverBranchInfo_t layout. It should copy into the struct field (e.g., branchInfo->branch) and bound checks should use the field’s size (or NVML buffer macro), not the raw struct pointer.

Suggested change
if (branchInfo == NULL || length == 0) {
return NVML_ERROR_INVALID_ARGUMENT;
}
const char *branch = "r550_00";
if (length < strlen(branch) + 1) {
return NVML_ERROR_INSUFFICIENT_SIZE;
}
strncpy((char*)branchInfo, branch, length);
((char*)branchInfo)[length - 1] = '\0';
if (branchInfo == NULL) {
return NVML_ERROR_INVALID_ARGUMENT;
}
const char *branch = "r550_00";
size_t branch_field_size = sizeof(branchInfo->branch);
if (branch_field_size < strlen(branch) + 1) {
return NVML_ERROR_INSUFFICIENT_SIZE;
}
strncpy(branchInfo->branch, branch, branch_field_size);
branchInfo->branch[branch_field_size - 1] = '\0';

Comment on lines +174 to +186
const mock_device_info_t *dev = &mock_devices[index];

pci->domain = dev->pci_domain;
pci->bus = dev->pci_bus;
pci->device = dev->pci_device;
pci->pciDeviceId = dev->pci_device_id;
pci->pciSubSystemId = dev->pci_subsystem_id;

strncpy(pci->busId, dev->pci_bus_id_legacy, sizeof(pci->busId));
pci->busId[sizeof(pci->busId) - 1] = '\0';

strncpy(pci->busIdLegacy, dev->pci_bus_id_legacy, sizeof(pci->busIdLegacy));
pci->busIdLegacy[sizeof(pci->busIdLegacy) - 1] = '\0';
Copilot AI Oct 16, 2025

pci->busId is being populated with the legacy bus ID. Use dev->pci_bus_id (extended format) for pci->busId, and keep dev->pci_bus_id_legacy for pci->busIdLegacy to align with NVML’s expected fields.

Comment on lines +24 to +27
func TestLogger(t *testing.T) {
// Capture log output
var buf bytes.Buffer
log.SetOutput(&buf)
Copilot AI Oct 16, 2025

The test redirects the stdlib 'log' output, but the logger writes to its own configured io.Writer (default os.Stderr). As a result, the buffer never captures Info/Warning/Error output. Construct the logger with NewWithConfig and set Output to &buf to correctly capture logs in the test.

defer log.SetOutput(os.Stderr)

// Test non-verbose logger
l := New("test", false)
Copilot AI Oct 16, 2025

The test redirects the stdlib 'log' output, but the logger writes to its own configured io.Writer (default os.Stderr). As a result, the buffer never captures Info/Warning/Error output. Construct the logger with NewWithConfig and set Output to &buf to correctly capture logs in the test.

Comment on lines +37 to +39
if len(cmd.Commands) != len(expectedCommands) {
t.Errorf("Expected %d commands, got %d", len(expectedCommands), len(cmd.Commands))
}
Copilot AI Oct 16, 2025

This asserts an exact command count of 2, but the root command also includes 'driver' and 'all', causing the test to fail despite being correct. Remove the strict length equality check or expand 'expectedCommands' to include all registered subcommands.

Suggested change
if len(cmd.Commands) != len(expectedCommands) {
t.Errorf("Expected %d commands, got %d", len(expectedCommands), len(cmd.Commands))
}

Comment on lines +43 to +48
driverVer := "550.54.15"
files := []string{
"libnvidia-ml.so." + driverVer,
"libnvidia-ml.so.1",
"libnvidia-ml.so",
}
Copilot AI Oct 16, 2025

[nitpick] The driver version string '550.54.15' is duplicated across the codebase (e.g., Makefiles, tree.go, docs). Consider centralizing this into a single constant or configuration to avoid drift and ease upgrades.

Comment on lines +39 to +40
// Match dgxa100 mock driver version
driverVer := "550.54.15"
Copilot AI Oct 16, 2025

[nitpick] Hardcoding the driver version here duplicates the same string used elsewhere. Prefer a shared constant or config to keep all references in sync.

@ArangoGutierrez changed the title from "Update GitHub Actions" to "Add GPU Mock Infrastructure" on Oct 16, 2025

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 73 out of 170 changed files in this pull request and generated 9 comments.



Comment on lines +58 to +64
static const mock_device_info_t mock_devices[8] = {
{
.uuid = "GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c",
.name = "NVIDIA A100-SXM4-40GB",
.pci_bus_id = "00000000:00:00.0",
.pci_bus_id_legacy = "0000:00:00.0",
.serial = "1563221000001",
Copilot AI Oct 16, 2025

nvmlDeviceGetCudaComputeCapability reads mock_devices[idx].cuda_compute_capability_major/minor, but device 0 and device 7 do not initialize these fields here. That will return 0,0 (and break tests expecting 8.0). Initialize cuda_compute_capability_major=8 and cuda_compute_capability_minor=0 for all 8 devices.

Comment on lines +82 to +86
.power_limit = 400000, // 400W
.clock_graphics = 1410, // 1410 MHz
.clock_sm = 1410, // 1410 MHz
.clock_memory = 1593, // 1593 MHz
},
Copilot AI Oct 16, 2025

nvmlDeviceGetCudaComputeCapability reads mock_devices[idx].cuda_compute_capability_major/minor, but device 0 and device 7 do not initialize these fields here. That will return 0,0 (and break tests expecting 8.0). Initialize cuda_compute_capability_major=8 and cuda_compute_capability_minor=0 for all 8 devices.

Comment on lines +290 to +294
.power_limit = 400000,
.clock_graphics = 1410,
.clock_sm = 1410,
.clock_memory = 1593,
}
Copilot AI Oct 16, 2025

nvmlDeviceGetCudaComputeCapability reads mock_devices[idx].cuda_compute_capability_major/minor, but device 0 and device 7 do not initialize these fields here. That will return 0,0 (and break tests expecting 8.0). Initialize cuda_compute_capability_major=8 and cuda_compute_capability_minor=0 for all 8 devices.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <nvml.h>
Copilot AI Oct 16, 2025

This test includes nvml.h via system include paths, but the Makefile compiles it without adding the repository's include directory. On hosts without system NVML headers, compilation will fail. Either change the include to "../include/nvml.h" (like the comprehensive test) or add an include path (e.g., -I./include) in the test build rule.

Suggested change
#include <nvml.h>
#include "../include/nvml.h"


// Enumerate devices
for (unsigned int i = 0; i < deviceCount && i < 8; i++) {
nvmlDevice_t device;
Copilot AI Oct 16, 2025

This test includes nvml.h via system include paths, but the Makefile compiles it without adding the repository's include directory. On hosts without system NVML headers, compilation will fail. Either change the include to "../include/nvml.h" (like the comprehensive test) or add an include path (e.g., -I./include) in the test build rule.

Comment on lines 76 to 90
$(CC) -o $(BUILD_DIR)/test/test_nvml test/test_nvml.c -L$(LIB_DIR) -lnvidia-ml -Wl,-rpath,$(LIB_DIR)
@echo "Running basic test..."
$(BUILD_DIR)/test/test_nvml

test-comprehensive: all
@echo "Building comprehensive test suite..."
@mkdir -p $(BUILD_DIR)/test
$(CC) -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
@echo "Running comprehensive tests..."
$(BUILD_DIR)/test/test_nvml_comprehensive

test-valgrind: all
@echo "Running tests with valgrind..."
@mkdir -p $(BUILD_DIR)/test
$(CC) -g -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
Copilot AI Oct 16, 2025

Compilation of test/test_nvml.c does not add the mock NVML header path, causing include <nvml.h> to fail on systems without system headers. Add -I./include (or reuse $(CFLAGS)) to the compile invocation, e.g., '$(CC) $(CFLAGS) -I./include -o ...'.

Suggested change
$(CC) -o $(BUILD_DIR)/test/test_nvml test/test_nvml.c -L$(LIB_DIR) -lnvidia-ml -Wl,-rpath,$(LIB_DIR)
@echo "Running basic test..."
$(BUILD_DIR)/test/test_nvml
test-comprehensive: all
@echo "Building comprehensive test suite..."
@mkdir -p $(BUILD_DIR)/test
$(CC) -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
@echo "Running comprehensive tests..."
$(BUILD_DIR)/test/test_nvml_comprehensive
test-valgrind: all
@echo "Running tests with valgrind..."
@mkdir -p $(BUILD_DIR)/test
$(CC) -g -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
$(CC) $(CFLAGS) -o $(BUILD_DIR)/test/test_nvml test/test_nvml.c -L$(LIB_DIR) -lnvidia-ml -Wl,-rpath,$(LIB_DIR)
@echo "Running basic test..."
$(BUILD_DIR)/test/test_nvml
test-comprehensive: all
@echo "Building comprehensive test suite..."
@mkdir -p $(BUILD_DIR)/test
$(CC) $(CFLAGS) -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
@echo "Running comprehensive tests..."
$(BUILD_DIR)/test/test_nvml_comprehensive
test-valgrind: all
@echo "Running tests with valgrind..."
@mkdir -p $(BUILD_DIR)/test
$(CC) $(CFLAGS) -g -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)

Comment on lines +89 to +90
"unsupported MACHINE_TYPE %q (only 'dgxa100'); set "+
"ALLOW_UNSUPPORTED=true to use fallback",
Copilot AI Oct 16, 2025

The error suggests setting ALLOW_UNSUPPORTED=true to use a fallback, but this function does not implement any fallback behavior. Either implement the fallback (e.g., calling NewFallback with a default GPU count) when ALLOW_UNSUPPORTED is set, or remove the guidance from the error message to avoid misleading users.

Suggested change
"unsupported MACHINE_TYPE %q (only 'dgxa100'); set "+
"ALLOW_UNSUPPORTED=true to use fallback",
"unsupported MACHINE_TYPE %q (only 'dgxa100' is supported)",

Comment on lines +23 to +30
{{- with .Values.containerToolkit.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- if .Values.containerToolkit.requireMockDriver }}
nodeSelector:
nvidia.com/gpu.present: "true"
{{- end }}
Copilot AI Oct 16, 2025

nodeSelector is rendered twice; the latter block will overwrite the former or produce duplicate keys in YAML. Merge selectors instead (e.g., always render a single nodeSelector map and conditionally add nvidia.com/gpu.present when required).

Suggested change
{{- with .Values.containerToolkit.nodeSelector }}
nodeSelector:
{{- toYaml . | nindent 8 }}
{{- end }}
{{- if .Values.containerToolkit.requireMockDriver }}
nodeSelector:
nvidia.com/gpu.present: "true"
{{- end }}
{{- /*
Merge .Values.containerToolkit.nodeSelector (if any) with
nvidia.com/gpu.present: "true" if .Values.containerToolkit.requireMockDriver is true.
This ensures only one nodeSelector block is rendered.
*/ -}}
{{- $baseSelector := .Values.containerToolkit.nodeSelector | default (dict) -}}
{{- if .Values.containerToolkit.requireMockDriver -}}
{{- $mergedSelector := merge $baseSelector (dict "nvidia.com/gpu.present" "true") -}}
nodeSelector:
{{- toYaml $mergedSelector | nindent 8 }}
{{- else if $baseSelector -}}
nodeSelector:
{{- toYaml $baseSelector | nindent 8 }}
{{- end }}


# Label node for GPU support
- name: label-node
image: bitnami/kubectl:latest
Copilot AI Oct 16, 2025

Using the 'latest' tag makes deployments non-reproducible and can break unexpectedly. Pin to a specific, tested kubectl image tag (e.g., bitnami/kubectl:1.30.2) and keep it updated intentionally.

Suggested change
image: bitnami/kubectl:latest
image: bitnami/kubectl:1.30.2

@elezar (Member) left a comment


This is quite difficult to review without a high-level overview of how the pieces fit together (PR Description?).

Also, in the case of the "mock" libnvidia-ml.so library, I don't think we should be reimplementing a mock in C. We should take our existing mocks implemented in Go and build these as a C library exposing the "correct" API. This library should also have SOME mechanism to be changed to handle different infrastructures and behaviours. This need not be the initial implementation, but some thought should be given as to how we would inject failures etc. into the library.

On the filesystem changes: does it make sense to take a CDI spec as input and use this to populate the filesystem in some way? This should in theory mean that if we use the mock filesystem (and mocked libnvidia-ml.so) as the input to CDI spec generation (as we would do in the device plugin), we get the input CDI spec back.

On the proc path: if we limit ourselves to running in a container, then we may not need to worry about the "driver root" and where we create "/proc"? In this case we could manipulate the procfs in the container to look the way we need it to and ALSO create the required device nodes (this is not something that I've validated, though).


Copilot AI left a comment


Pull Request Overview

Copilot reviewed 93 out of 186 changed files in this pull request and generated 2 comments.



Comment on lines +114 to +115
.cuda_compute_capability_major = 8,
.cuda_compute_capability_minor = 0
Copilot AI Nov 11, 2025

The CUDA compute capability fields are missing for device 0 (lines 54-86) while being present for other devices (e.g., lines 114-115). This inconsistency could lead to unexpected behavior if these fields are accessed for device 0.

.clock_sm = 1410,
.clock_memory = 1593,
.cuda_compute_capability_major = 8,
.cuda_compute_capability_minor = 0
Copilot AI Nov 11, 2025

The CUDA compute capability fields are missing for the last device (device 7, lines 268-295). This inconsistency with other devices should be corrected to ensure uniform device properties.

Member

Should we not remove the C source files?

Member

Why are the built artifacts in the repo? Should there be a Makefile to build them instead?

Collaborator Author

I need to add them to the .gitignore; this is just left over from my testing.

Member

Is there a specific reason why this is a submodule?

This commit introduces a complete Go-based mock NVML library that
replaces the previous C implementation. Key improvements:

Architecture:
- Bridge layer (pkg/gpu/mocknvml/bridge/): CGo exports exposing 49
  NVML C functions with proper type conversions and error handling
- Engine layer (pkg/gpu/mocknvml/engine/): Mock server using
  go-nvml's dgxa100.Server with singleton pattern and handle table
- Test suite: C test programs verifying library functionality

Implementation highlights:
- Leverages upstream go-nvml mock implementations (dgxa100)
- Type-safe Go code with memory-safe handle management
- Runtime configuration via environment variables
- Supports NVIDIA Device Plugin and DRA Driver requirements

Build system:
- Unified Makefile with both C and Go targets
- CGo-based shared library generation (libnvidia-ml.so)
- Proper library versioning and symlinking
- Cross-platform support (Linux, macOS via Docker)

Testing:
- Integration with existing gpu-mockctl infrastructure
- Kubernetes deployment via Helm charts
- E2E validation with NVIDIA Device Plugin

This provides a maintainable foundation for mock GPU testing in
Kubernetes environments without requiring physical hardware.

Implements the 'gpu-mockctl cdi' command to parse Container Device
Interface (CDI) specifications and extract mock GPU infrastructure
configuration.

Features:
- Parses CDI v0.5.0 specifications (YAML/JSON)
- Auto-detects GPU architecture from nvidia.com/gpu.model annotation
- Extracts device nodes (nvidia0-N, nvidiactl, nvidia-uvm*)
- Generates /proc/driver/nvidia/gpus/* entries
- Outputs simplified JSON configuration for infrastructure setup

Command usage:
  gpu-mockctl cdi --spec cdi-spec.yaml --output mock-config.json

This enables declarative GPU topology definition via industry-standard
CDI format, providing flexibility for testing different hardware
configurations without code changes.

Adds an intelligent entrypoint that detects and configures the mock
GPU infrastructure based on the operating mode.

Operating modes:
1. Default mode: Uses built-in dgxa100 configuration (8 A100 GPUs)
   - Zero configuration required
   - Activated when no CDI spec is present
   - Sets MOCK_GPU_ARCH=dgxa100, MOCK_NVML_NUM_DEVICES=8

2. CDI mode: Uses declarative CDI specification
   - Activated when CDI spec file is detected
   - Calls gpu-mockctl cdi to parse specification
   - Creates device nodes dynamically
   - Sets up /proc filesystem entries
   - Configures environment from CDI content

The entrypoint provides clear logging of the selected mode and
configuration, then executes the gpu-mockctl driver command.

Container updates:
- Modified Dockerfile to include jq for JSON parsing
- Set entrypoint.sh as container ENTRYPOINT
- Added infrastructure setup logic before driver execution

Provides reference CDI specifications for testing different GPU
configurations.

Files:
- cdi-spec-a100-2gpu.yaml: Minimal 2-GPU A100 configuration
  * Useful for quick testing and validation
  * Lower resource footprint
  * Demonstrates minimal CDI structure

- cdi-spec-a100-8gpu.yaml: Full DGX A100 configuration
  * Simulates complete DGX A100 system (8 GPUs)
  * Matches default dgxa100 behavior
  * Comprehensive device node setup

Both specifications follow CDI v0.5.0 standard and include:
- Proper device node definitions (nvidia0-N, nvidiactl, nvidia-uvm*)
- GPU model annotations for architecture detection
- Container edit specifications
- Required metadata

Usage:
kubectl create configmap my-spec \
  --from-file=spec.yaml=examples/cdi-spec-a100-2gpu.yaml
helm install gpu-mock ./helm/gpu-mock \
  --set cdi.enabled=true \
  --set cdi.configMapName=my-spec

@elezar (Member) commented Nov 12, 2025

As a general note here, it feels as if we're trying to do too many things in a single PR.

Am I correct in stating the core problem of this project as "providing a representation of a GPU-enabled system to allow us to test our own components without actually having access to GPUs"? What are the use cases that we would like to cover here?

For this PR specifically, there seem to be two core work items here (not in order of priority):

  1. Implementing a "mock" libnvidia-ml.so that we can add to (for example) a kind worker node and expect the k8s-device-plugin to "work" for some subset of examples.
  2. Generating a filesystem that is equivalent to one where the driver is installed so that we can, for example, test CDI spec generation.

Other non-core questions include:

  • How to make these consumable by systems that we are looking to test.

I would argue that each of these is large enough to be its own PR.

@ArangoGutierrez (Collaborator, Author)

> As a general note here, it feels as if we're trying to do too many things in a single PR.
>
> Am I correct in stating the core problem of this project as "providing a representation of a GPU-enabled system to allow us to test our own components without actually having access to GPUs"? What are the use cases that we would like to cover here?
>
> For this PR specifically, there seem to be two core work items here (not in order of priority):
>
>   1. Implementing a "mock" libnvidia-ml.so that we can add to (for example) a kind worker node and expect the k8s-device-plugin to "work" for some subset of examples.
>   2. Generating a filesystem that is equivalent to one where the driver is installed so that we can, for example, test CDI spec generation.
>
> Other non-core questions include:
>
>   • How to make these consumable by systems that we are looking to test.
>
> I would argue that each of these is large enough to be its own PR.

This is good feedback; I'll break this down into multiple PRs.
