
Conversation

@ArangoGutierrez (Collaborator)

This pull request introduces a new GPU mock driver and CDI (Container Device Interface) test/development environment, including a new CLI tool (gpu-mockctl) and a full set of Kubernetes deployment manifests for local testing. The changes provide a way to generate a mock NVIDIA GPU driver filesystem, create CDI specs, and verify/test them in a Kubernetes cluster (e.g., kind), giving a reproducible environment for developing and CI-testing GPU-related features without access to real GPU hardware.

    - Implement pkg/gpu/mockfs for NVIDIA driver filesystem mocking (see the sketch after this list)
    - Implement pkg/gpu/mocktopo with dgxa100 support via go-nvml
    - Add cmd/gpu-mockctl CLI tool for mock generation
    - Add Kubernetes Job for verification on kind
    - Support extensibility for future H100/H200/B200 flavors
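For a rough sense of what the filesystem-mocking side involves, here is a hypothetical sketch; the helper name and exact proc layout below are assumptions for illustration, not the actual pkg/gpu/mockfs code.

```go
package mockfs

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeGPUInfo creates a minimal, driver-style
// proc/driver/nvidia/gpus/<busID>/information entry under root for one
// mock GPU. Real systems expose more fields; this is only a sketch.
func writeGPUInfo(root, busID, model string) error {
	dir := filepath.Join(root, "proc", "driver", "nvidia", "gpus", busID)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return err
	}
	info := fmt.Sprintf("Model:\t\t %s\nBus Location:\t %s\n", model, busID)
	return os.WriteFile(filepath.Join(dir, "information"), []byte(info), 0o644)
}
```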

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

Copilot AI left a comment


Pull Request Overview

This PR introduces a comprehensive GPU mock infrastructure for testing GPU-related Kubernetes features without physical hardware. The implementation includes mock NVIDIA driver filesystems, CDI (Container Device Interface) generation, and a full testing environment deployable on local kind clusters.

Key changes:

  • New gpu-mockctl CLI tool for generating mock GPU driver environments
  • Mock topology provider using go-nvml dgxa100 simulation for 8x A100 GPUs
  • Production-grade CDI specification generation via nvidia-container-toolkit integration
  • Complete Kubernetes deployment manifests for local testing on kind clusters

Reviewed Changes

Copilot reviewed 22 out of 364 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| pkg/gpu/mocktopo/provider.go | Core topology provider with dgxa100 machine type registry |
| pkg/gpu/mocktopo/nvmlwrapper.go | NVML wrapper adding MIG support stubs for CDI compatibility |
| pkg/gpu/mocktopo/localmock.go | Fallback topology generator for unsupported machine types |
| pkg/gpu/mockfs/util.go | Utility functions for PCI bus ID normalization |
| pkg/gpu/mockfs/procfs.go | Mock filesystem layout generation for proc/dev structures |
| pkg/gpu/mockfs/devnodes.go | Character device node creation utilities |
| pkg/gpu/mockdriver/write.go | File specification writer for mock driver trees |
| pkg/gpu/mockdriver/tree.go | Mock driver file specifications with versioned libraries |
| pkg/gpu/cdi/validate.go | CDI specification validation utilities |
| pkg/gpu/cdi/specgen.go | CDI specification generation using nvidia-container-toolkit |
| pkg/gpu/cdi/paths.go | CDI path constants and defaults |
| hack/kind-up-gpu-mock.sh | Build automation script for GPU mock deployment |
| go.mod | Dependency additions for NVML, CDI, and container toolkit integration |
| deployments/devel/gpu-mock/README.md | Comprehensive documentation for the GPU mock infrastructure |
| deployments/devel/gpu-mock/Makefile | Build and deployment automation |
| deployments/devel/gpu-mock/Dockerfile | Container image build configuration |
| deployments/devel/gpu-mock/40-job-cdi-smoke.yaml | CDI smoke test job manifest |
| deployments/devel/gpu-mock/30-daemonset-cdi-mock.yaml | DaemonSet for node-level CDI mock setup |
| deployments/devel/gpu-mock/20-job-gpu-mock-verify.yaml | Mock filesystem verification job |
| deployments/devel/gpu-mock/10-configmap-verify-script.yaml | Verification script ConfigMap |
| deployments/devel/gpu-mock/00-namespace.yaml | Namespace resource for GPU mock deployment |
| cmd/gpu-mockctl/main.go | Main CLI application with filesystem and CDI generation modes |


  - Add mock driver tree generation (pkg/gpu/mockdriver)
  - Integrate nvidia-container-toolkit nvcdi for CDI spec generation
  - Extend go-nvml dgxa100 mock with MIG stubs for nvcdi compatibility
  - Add DaemonSet to deploy CDI mock on all nodes
  - Add comprehensive smoke tests for CDI verification
  - Update gpu-mockctl CLI with fs/cdi/all modes
  - Use __NVCT_TESTING_DEVICES_ARE_FILES from nvidia-container-toolkit tests
  - Generate empty files with versioned naming (see the sketch below)
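The last two bullets describe how the mock driver tree is populated. As a hypothetical sketch only (the FileSpec fields and the Write helper here are assumptions, not the PR's pkg/gpu/mockdriver code), the writer boils down to laying out empty, versioned files under the driver root:

```go
package mockdriver

import (
	"os"
	"path/filepath"
)

// FileSpec describes a single file in the mock driver tree, e.g. a
// versioned library name such as libnvidia-ml.so.550.54.15.
type FileSpec struct {
	Path string      // path relative to the driver root
	Mode os.FileMode // file permissions
}

// Write materializes the specs under driverRoot as empty files.
func Write(driverRoot string, files []FileSpec) error {
	for _, f := range files {
		abs := filepath.Join(driverRoot, f.Path)
		if err := os.MkdirAll(filepath.Dir(abs), 0o755); err != nil {
			return err
		}
		if err := os.WriteFile(abs, nil, f.Mode); err != nil {
			return err
		}
	}
	return nil
}
```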

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 22 out of 364 changed files in this pull request and generated no new comments.



@ArangoGutierrez (Collaborator, Author)

Once NVIDIA/go-nvml#163 is merged, I'll add the changes needed to allow selecting any of the supported go-nvml mock systems.


Copilot AI left a comment


Pull Request Overview

Copilot reviewed 22 out of 364 changed files in this pull request and generated no new comments.



```go
}

func run(cfg *config) error {
	// Get topology (A100-only for now)
```
@ArangoGutierrez (Collaborator, Author):


Once NVIDIA/go-nvml#163 is merged, I'll expand this section.
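For illustration only, a machine-type registry along these lines could make room for additional go-nvml mock systems once they land; the names and structure below are assumptions, not the actual pkg/gpu/mocktopo/provider.go.

```go
package mocktopo

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
	"github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100"
)

// Topology exposes the mock NVML server for a given machine type.
type Topology struct {
	Machine string
	NVML    nvml.Interface
}

// machines maps machine-type names to mock NVML constructors. dgxa100 is
// the only flavor wired up today; H100/H200/B200 could be registered here
// as go-nvml grows the corresponding mocks.
var machines = map[string]func() nvml.Interface{
	"dgxa100": func() nvml.Interface { return dgxa100.New() },
}

// New returns the topology for the requested machine type, or an error if
// no mock is registered for it.
func New(machine string) (*Topology, error) {
	ctor, ok := machines[machine]
	if !ok {
		return nil, fmt.Errorf("unsupported machine type %q", machine)
	}
	return &Topology{Machine: machine, NVML: ctor()}, nil
}
```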

From the repository root:

```bash
make -C deployments/devel/gpu-mock all
```
@ArangoGutierrez (Collaborator, Author):


Tested on a macOS system

Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
@elezar (Member) left a comment:


Some initial comments ... more next week.

```go
}

// newNVMLWrapper creates an NVML interface wrapper that adds missing
// methods required by nvidia-container-toolkit's nvcdi library.
```
@elezar (Member):


Why are we making these changes here instead of implementing them "upstream"?
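For context, a minimal sketch of the kind of wrapper under discussion, assuming the missing pieces are MIG queries that nvcdi expects; the type name and the method stubbed here are assumptions, not the PR's actual nvmlwrapper.go.

```go
package mocktopo

import "github.com/NVIDIA/go-nvml/pkg/nvml"

// deviceWithMIGStubs embeds a dgxa100 mock device and fills in MIG-related
// calls that the mock does not implement, reporting MIG as disabled so
// nvcdi treats each mock GPU as a full (non-MIG) device.
type deviceWithMIGStubs struct {
	nvml.Device // the underlying mock device
}

// GetMigMode reports MIG as disabled for both current and pending modes.
func (d deviceWithMIGStubs) GetMigMode() (int, int, nvml.Return) {
	return nvml.DEVICE_MIG_DISABLE, nvml.DEVICE_MIG_DISABLE, nvml.SUCCESS
}
```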

```go
	// Get topology
	topo, err := mocktopo.New(cfg.Machine)
	if err != nil {
		if os.Getenv("ALLOW_UNSUPPORTED") == "true" {
```
@elezar (Member):


I want to understand why this is useful. Does having this option not complicate the implementation? Why not just default to the one machine type that we have, and figure out how to expose new ones as they become available?


```go
	// Validate before writing
	log.Debugf("Validating CDI specification")
	if err := cdi.Validate(specYAML); err != nil {
```
@elezar (Member):


The SPEC is validated on save. We should not be validating this on our own -- this should either be done in the nvcdi package or upstream.

```go
		return fmt.Errorf("failed to create CDI directory: %w", err)
	}

	if err := os.WriteFile(cfg.CDIOutput, specYAML, 0o644); err != nil {
```
@elezar (Member):


I don't believe that this is how we output the CDI spec in the tooling that we release. See https://github.com/NVIDIA/nvidia-container-toolkit/blob/5bb032d60486da9b441a208f225f911efbad35f2/cmd/nvidia-ctk/cdi/generate/generate.go#L332
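A minimal sketch of the save path the reviewer points to, assuming the spec.Interface returned by nvcdi exposes Save (which validates the spec and writes the file itself), so the manual marshal/validate/WriteFile sequence above becomes unnecessary. The writeSpec helper name is made up for illustration.

```go
package cdi

import (
	"fmt"

	"github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi/spec"
)

// writeSpec saves a generated CDI spec via the spec object itself rather
// than marshalling and writing the bytes by hand.
func writeSpec(s spec.Interface, path string) error {
	if err := s.Save(path); err != nil {
		return fmt.Errorf("failed to save CDI specification to %s: %w", path, err)
	}
	return nil
}
```

This also ties in with the comment further down about returning the typed spec instead of marshalled YAML: the caller can then call Save directly.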


```go
	// Also create under driverRoot/dev for completeness
	log.Debugf("Creating device nodes under %s/dev", cfg.DriverRoot)
	driverDevNodes := mockdriver.DeviceNodes(cfg.DriverRoot, gpuCount, cfg.WithDRI)
```
@elezar (Member):


Why do we create these under two locations?

Comment on lines +81 to +86
```go
	if o.Vendor != "" {
		opts = append(opts, nvcdi.WithVendor(o.Vendor))
	}
	if o.Class != "" {
		opts = append(opts, nvcdi.WithClass(o.Class))
	}
```
@elezar (Member):


nvcdi already handles the empty case.


```go
	// Get the raw spec and marshal to YAML
	rawSpec := spec.Raw()
	return yaml.Marshal(rawSpec)
```
@elezar (Member):


Why not return a typed variable? (i.e just spec)


```go
const (
	// DefaultCDIRoot is the standard location for CDI specifications.
	DefaultCDIRoot = "/etc/cdi"
```
@elezar (Member):


Is there a specific reason to use /etc/cdi? I would argue that we should use /var/run/cdi.

```go
	DefaultCDIRoot = "/etc/cdi"

	// DefaultSpecPath is the default output path for the NVIDIA CDI spec.
	DefaultSpecPath = "/etc/cdi/nvidia.yaml"
```
@elezar (Member):


Should we give this a different path? (and also a different vendor and / or class)?
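One possible shape of that change; the paths, vendor, and class below are placeholders chosen for illustration, not values from the PR.

```go
package cdi

const (
	// DefaultCDIRoot points at the runtime CDI directory typically used
	// for dynamically generated specs, rather than /etc/cdi.
	DefaultCDIRoot = "/var/run/cdi"

	// A distinct path, vendor, and class keep the mock spec clearly
	// separate from one generated on a real system.
	DefaultSpecPath = "/var/run/cdi/nvidia-mock-gpu.yaml"
	DefaultVendor   = "mock.nvidia.com"
	DefaultClass    = "gpu"
)
```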

```go
	// Match dgxa100 mock driver version
	driverVer := "550.54.15"

	files := []FileSpec{
```
@elezar (Member):


Question: How about STARTING with a CDI spec from an actual system and GENERATING this list instead?
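A rough sketch of that suggestion, assuming the CDI specs-go types and a FileSpec shaped like the one sketched earlier (the helper name and FileSpec fields are assumptions): read a spec captured on a real system and derive the mock file list from its mounts instead of maintaining it by hand.

```go
package mockdriver

import (
	"os"

	"sigs.k8s.io/yaml"
	cdispecs "tags.cncf.io/container-device-interface/specs-go"
)

// FileSpec is re-declared here (as in the earlier sketch) so this example
// is self-contained.
type FileSpec struct {
	Path string
	Mode os.FileMode
}

// fileSpecsFromCDISpec reads a CDI spec captured on a real system and turns
// its spec-level mount host paths into empty-file specs for the mock tree.
func fileSpecsFromCDISpec(path string) ([]FileSpec, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var spec cdispecs.Spec
	if err := yaml.Unmarshal(data, &spec); err != nil {
		return nil, err
	}
	var files []FileSpec
	for _, m := range spec.ContainerEdits.Mounts {
		files = append(files, FileSpec{Path: m.HostPath, Mode: 0o644})
	}
	return files, nil
}
```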

@ArangoGutierrez merged commit 384f79d into NVIDIA:main on Oct 15, 2025 (8 checks passed).
@ArangoGutierrez (Collaborator, Author)

#182 is a refactored version of this PR

