Add GPU Mock Infrastructure #180
Conversation
- Implement pkg/gpu/mockfs for NVIDIA driver filesystem mocking
- Implement pkg/gpu/mocktopo with dgxa100 support via go-nvml
- Add cmd/gpu-mockctl CLI tool for mock generation
- Add Kubernetes Job for verification on kind
- Support extensibility for future H100/H200/B200 flavors
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Pull Request Overview
This PR introduces a comprehensive GPU mock infrastructure for testing GPU-related Kubernetes features without physical hardware. The implementation includes mock NVIDIA driver filesystems, CDI (Container Device Interface) generation, and a full testing environment deployable on local kind clusters.
Key changes:
- New gpu-mockctl CLI tool for generating mock GPU driver environments
- Mock topology provider using go-nvml dgxa100 simulation for 8x A100 GPUs
- Production-grade CDI specification generation via nvidia-container-toolkit integration
- Complete Kubernetes deployment manifests for local testing on kind clusters
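To make the flow described above concrete, here is a minimal sketch of wiring the go-nvml dgxa100 mock into the nvidia-container-toolkit's nvcdi library to produce a CDI spec without real hardware. The output path and driver root are illustrative placeholders, the option names reflect my understanding of the nvcdi package, and the PR additionally wraps the mock with MIG stubs for nvcdi compatibility, so treat this as an assumption-laden outline rather than the PR's actual code.

```go
package main

import (
	"log"
	"os"
	"path/filepath"

	"github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100"
	"github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi"
)

func main() {
	// The dgxa100 mock simulates an 8x A100 DGX system entirely in memory.
	nvmllib := dgxa100.New()

	// Hand the mock NVML implementation to nvcdi so device discovery runs
	// against simulated topology instead of real hardware.
	cdilib, err := nvcdi.New(
		nvcdi.WithNvmlLib(nvmllib),
		nvcdi.WithMode("nvml"),
		nvcdi.WithDriverRoot("/run/nvidia/mock-driver"), // placeholder mock driver root
	)
	if err != nil {
		log.Fatalf("failed to construct nvcdi library: %v", err)
	}

	spec, err := cdilib.GetSpec()
	if err != nil {
		log.Fatalf("failed to generate CDI spec: %v", err)
	}

	// Placeholder output location; Save validates and serializes the spec.
	outputDir := "/var/run/cdi"
	if err := os.MkdirAll(outputDir, 0o755); err != nil {
		log.Fatalf("failed to create CDI output directory: %v", err)
	}
	if err := spec.Save(filepath.Join(outputDir, "nvidia-mock.yaml")); err != nil {
		log.Fatalf("failed to save CDI spec: %v", err)
	}
}
```

In the PR itself this flow is split across pkg/gpu/mocktopo (topology), pkg/gpu/cdi (spec generation), and cmd/gpu-mockctl (CLI wiring).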
Reviewed Changes
Copilot reviewed 22 out of 364 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/gpu/mocktopo/provider.go | Core topology provider with dgxa100 machine type registry |
| pkg/gpu/mocktopo/nvmlwrapper.go | NVML wrapper adding MIG support stubs for CDI compatibility |
| pkg/gpu/mocktopo/localmock.go | Fallback topology generator for unsupported machine types |
| pkg/gpu/mockfs/util.go | Utility functions for PCI bus ID normalization |
| pkg/gpu/mockfs/procfs.go | Mock filesystem layout generation for proc/dev structures |
| pkg/gpu/mockfs/devnodes.go | Character device node creation utilities |
| pkg/gpu/mockdriver/write.go | File specification writer for mock driver trees |
| pkg/gpu/mockdriver/tree.go | Mock driver file specifications with versioned libraries |
| pkg/gpu/cdi/validate.go | CDI specification validation utilities |
| pkg/gpu/cdi/specgen.go | CDI specification generation using nvidia-container-toolkit |
| pkg/gpu/cdi/paths.go | CDI path constants and defaults |
| hack/kind-up-gpu-mock.sh | Build automation script for GPU mock deployment |
| go.mod | Dependency additions for NVML, CDI, and container toolkit integration |
| deployments/devel/gpu-mock/README.md | Comprehensive documentation for the GPU mock infrastructure |
| deployments/devel/gpu-mock/Makefile | Build and deployment automation |
| deployments/devel/gpu-mock/Dockerfile | Container image build configuration |
| deployments/devel/gpu-mock/40-job-cdi-smoke.yaml | CDI smoke test job manifest |
| deployments/devel/gpu-mock/30-daemonset-cdi-mock.yaml | DaemonSet for node-level CDI mock setup |
| deployments/devel/gpu-mock/20-job-gpu-mock-verify.yaml | Mock filesystem verification job |
| deployments/devel/gpu-mock/10-configmap-verify-script.yaml | Verification script ConfigMap |
| deployments/devel/gpu-mock/00-namespace.yaml | Namespace resource for GPU mock deployment |
| cmd/gpu-mockctl/main.go | Main CLI application with filesystem and CDI generation modes |
- Add mock driver tree generation (pkg/gpu/mockdriver)
- Integrate nvidia-container-toolkit nvcdi for CDI spec generation
- Extend go-nvml dgxa100 mock with MIG stubs for nvcdi compatibility
- Add DaemonSet to deploy CDI mock on all nodes
- Add comprehensive smoke tests for CDI verification
- Update gpu-mockctl CLI with fs/cdi/all modes
- Use __NVCT_TESTING_DEVICES_ARE_FILES from nvidia-container-toolkit tests
- Generate empty files with versioned naming
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
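As a rough illustration of the "empty files with versioned naming" idea above, a mock driver tree writer might look something like the sketch below. The helper name and directory layout are hypothetical, not the PR's actual pkg/gpu/mockdriver API; the 550.54.15 version string matches the dgxa100 mock driver version referenced later in the review.

```go
package mockdriver

import (
	"os"
	"path/filepath"
)

// writeVersionedStub creates an empty, versioned library file (for example
// libnvidia-ml.so.550.54.15) under the mock driver root. Hypothetical helper;
// the real file list lives in pkg/gpu/mockdriver/tree.go.
func writeVersionedStub(driverRoot, libDir, name, version string) (string, error) {
	dir := filepath.Join(driverRoot, libDir)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	path := filepath.Join(dir, name+".so."+version)
	// An empty file is enough: the mock only needs the path to exist for
	// CDI generation and mount wiring, not a working binary.
	return path, os.WriteFile(path, nil, 0o644)
}
```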
Force-pushed from 975f57a to ca4e484 (Compare)
Pull Request Overview
Copilot reviewed 22 out of 364 changed files in this pull request and generated no new comments.
Once NVIDIA/go-nvml#163 is merged, I'll add the necessary changes to allow selection of any of the supported go-nvml mock systems.
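A registry-style extension point for that selection could look roughly like the following; the map keys, function name, and the assumption that the dgxa100 mock satisfies nvml.Interface directly are all hypothetical and would depend on what NVIDIA/go-nvml#163 ends up exposing.

```go
package mocktopo

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
	"github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100"
)

// mockBuilders maps a machine-type name to a constructor for its go-nvml
// mock. Only dgxa100 exists today; H100/H200/B200 entries would be added as
// the corresponding mocks become available upstream. (Illustrative sketch.)
var mockBuilders = map[string]func() nvml.Interface{
	"dgxa100": func() nvml.Interface { return dgxa100.New() },
}

// newMockNVML returns the mock NVML implementation for a machine type, or an
// error if no mock is registered for it.
func newMockNVML(machine string) (nvml.Interface, error) {
	build, ok := mockBuilders[machine]
	if !ok {
		return nil, fmt.Errorf("unsupported machine type: %q", machine)
	}
	return build(), nil
}
```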
Pull Request Overview
Copilot reviewed 22 out of 364 changed files in this pull request and generated no new comments.
cmd/gpu-mockctl/main.go (Outdated)

```go
}

func run(cfg *config) error {
	// Get topology (A100-only for now)
```
Once NVIDIA/go-nvml#163 is merged, I'll expand this section.
From the repository root:

```bash
make -C deployments/devel/gpu-mock all
```
Tested on a macOS system
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
elezar left a comment:
Some initial comments ... more next week.
```go
}

// newNVMLWrapper creates an NVML interface wrapper that adds missing
// methods required by nvidia-container-toolkit's nvcdi library.
```
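For context, the embed-and-override pattern behind such a wrapper is small; the sketch below is illustrative only and stubs a single MIG query, whereas the actual set of stubbed methods is defined in pkg/gpu/mocktopo/nvmlwrapper.go.

```go
package mocktopo

import "github.com/NVIDIA/go-nvml/pkg/nvml"

// migDisabledDevice wraps a mock device and answers the MIG queries that
// nvcdi makes during device discovery. (Hypothetical sketch, not the PR's
// actual wrapper.)
type migDisabledDevice struct {
	nvml.Device
}

// GetMigMode reports MIG as disabled for both the current and pending modes,
// matching the non-MIG dgxa100 topology.
func (d migDisabledDevice) GetMigMode() (int, int, nvml.Return) {
	return nvml.DEVICE_MIG_DISABLE, nvml.DEVICE_MIG_DISABLE, nvml.SUCCESS
}
```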
Why are we making these changes here instead of implementing them "upstream"?
```go
// Get topology
topo, err := mocktopo.New(cfg.Machine)
if err != nil {
	if os.Getenv("ALLOW_UNSUPPORTED") == "true" {
```
I want to understand why this is useful. Does having this option not complicate the implementation? Why not just default to the one machine type that we have and figure out how to expose new ones as they become available?
```go
// Validate before writing
log.Debugf("Validating CDI specification")
if err := cdi.Validate(specYAML); err != nil {
```
The spec is validated on save. We should not be validating this on our own; this should either be done in the nvcdi package or upstream.
```go
	return fmt.Errorf("failed to create CDI directory: %w", err)
}

if err := os.WriteFile(cfg.CDIOutput, specYAML, 0o644); err != nil {
```
I don't believe that this is how we output the CDI spec in the tooling that we release. See https://github.com/NVIDIA/nvidia-container-toolkit/blob/5bb032d60486da9b441a208f225f911efbad35f2/cmd/nvidia-ctk/cdi/generate/generate.go#L332
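Taken together with the validation comment above, this points toward a Save-based flow along the following lines. It is a sketch only: the function name writeSpec is hypothetical, cfg-style parameters are simplified to plain arguments, and the nvcdi types are used as I understand them from the linked toolkit code.

```go
package cdi

import (
	"fmt"

	"github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi"
)

// writeSpec lets the nvcdi spec validate and serialize itself on Save instead
// of marshalling YAML and calling os.WriteFile by hand. (Hypothetical sketch.)
func writeSpec(cdilib nvcdi.Interface, output string) error {
	spec, err := cdilib.GetSpec()
	if err != nil {
		return fmt.Errorf("failed to generate CDI spec: %w", err)
	}
	// Save validates the spec and writes it using the toolkit's own
	// formatting, so no separate Validate / WriteFile step is needed.
	if err := spec.Save(output); err != nil {
		return fmt.Errorf("failed to write CDI spec: %w", err)
	}
	return nil
}
```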
```go
// Also create under driverRoot/dev for completeness
log.Debugf("Creating device nodes under %s/dev", cfg.DriverRoot)
driverDevNodes := mockdriver.DeviceNodes(cfg.DriverRoot, gpuCount, cfg.WithDRI)
```
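For reference, creating one of these character device nodes is essentially a mknod call; the helper below is a hypothetical sketch of what the pkg/gpu/mockfs/devnodes.go utilities need to do (195 is the NVIDIA character device major, and mknod requires elevated privileges).

```go
package mockfs

import (
	"fmt"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// createGPUNode creates <root>/dev/nvidia<index> as a character device with
// the NVIDIA major number and the GPU index as the minor number.
// (Hypothetical sketch, Linux-only.)
func createGPUNode(root string, index int) error {
	path := filepath.Join(root, "dev", fmt.Sprintf("nvidia%d", index))
	mode := uint32(unix.S_IFCHR | 0o666)
	dev := int(unix.Mkdev(195, uint32(index)))
	if err := unix.Mknod(path, mode, dev); err != nil {
		return fmt.Errorf("mknod %s: %w", path, err)
	}
	return nil
}
```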
Why do we create these under two locations?
```go
if o.Vendor != "" {
	opts = append(opts, nvcdi.WithVendor(o.Vendor))
}
if o.Class != "" {
	opts = append(opts, nvcdi.WithClass(o.Class))
}
```
nvcdi already handles the empty case.
```go
// Get the raw spec and marshal to YAML
rawSpec := spec.Raw()
return yaml.Marshal(rawSpec)
```
Why not return a typed variable (i.e. just spec)?
```go
const (
	// DefaultCDIRoot is the standard location for CDI specifications.
	DefaultCDIRoot = "/etc/cdi"
```
Is there a specific reason to use /etc/cdi? I would argue that we should use /var/run/cdi.
```go
	DefaultCDIRoot = "/etc/cdi"

	// DefaultSpecPath is the default output path for the NVIDIA CDI spec.
	DefaultSpecPath = "/etc/cdi/nvidia.yaml"
```
Should we give this a different path (and also a different vendor and/or class)?
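One way to act on both of these suggestions, using the nvcdi WithVendor/WithClass options already shown in the diff above: give the mock its own vendor and keep the spec under the runtime CDI directory, so mock devices can never be mistaken for real nvidia.com/gpu devices. The constant names and values below are hypothetical.

```go
package cdi

import "github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi"

// Hypothetical alternative defaults for the mock environment.
const (
	// MockSpecPath keeps generated specs in the runtime CDI directory.
	MockSpecPath = "/var/run/cdi/mock-nvidia.yaml"
	// MockVendor and MockClass make mock devices addressable as
	// mock.nvidia.com/gpu=<index> rather than nvidia.com/gpu=<index>.
	MockVendor = "mock.nvidia.com"
	MockClass  = "gpu"
)

// mockOptions returns the nvcdi options for the hypothetical mock identity.
func mockOptions() []nvcdi.Option {
	return []nvcdi.Option{
		nvcdi.WithVendor(MockVendor),
		nvcdi.WithClass(MockClass),
	}
}
```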
```go
// Match dgxa100 mock driver version
driverVer := "550.54.15"

files := []FileSpec{
```
Question: How about STARTING with a CDI spec from an actual system and GENERATING this list instead?
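A sketch of that suggestion, assuming the captured spec is read with the CDI specs-go types: take a CDI spec generated on a real system once, then derive the mock file list from its mount entries instead of hand-maintaining it. The function name is hypothetical.

```go
package mockdriver

import (
	"os"

	"sigs.k8s.io/yaml"
	specs "tags.cncf.io/container-device-interface/specs-go"
)

// filesFromSpec lists the host library and binary paths referenced by an
// existing CDI spec, which could seed the mock FileSpec list.
// (Illustrative sketch.)
func filesFromSpec(path string) ([]string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var spec specs.Spec
	if err := yaml.Unmarshal(data, &spec); err != nil {
		return nil, err
	}
	var files []string
	for _, m := range spec.ContainerEdits.Mounts {
		files = append(files, m.HostPath)
	}
	return files, nil
}
```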
#182 is a refactored version of this PR.
This pull request introduces a new GPU mock driver and CDI (Container Device Interface) test/development environment, including a new CLI tool (gpu-mockctl) and a full set of Kubernetes deployment manifests for local testing. The changes provide a way to generate a mock NVIDIA GPU driver filesystem, create CDI specs, and verify them in a Kubernetes cluster (e.g., kind). Together, they provide a robust, reproducible environment for simulating NVIDIA GPU hardware and CDI integration in development and CI workflows, without requiring access to real GPUs.