Add GPU Mock Infrastructure #180
Conversation
- Implement pkg/gpu/mockfs for NVIDIA driver filesystem mocking
- Implement pkg/gpu/mocktopo with dgxa100 support via go-nvml
- Add cmd/gpu-mockctl CLI tool for mock generation
- Add Kubernetes Job for verification on kind
- Support extensibility for future H100/H200/B200 flavors
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
Pull Request Overview
This PR introduces a comprehensive GPU mock infrastructure for testing GPU-related Kubernetes features without physical hardware. The implementation includes mock NVIDIA driver filesystems, CDI (Container Device Interface) generation, and a full testing environment deployable on local kind clusters.
Key changes:
- New gpu-mockctl CLI tool for generating mock GPU driver environments
- Mock topology provider using go-nvml dgxa100 simulation for 8x A100 GPUs
- Production-grade CDI specification generation via nvidia-container-toolkit integration
- Complete Kubernetes deployment manifests for local testing on kind clusters
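To make the flow described above concrete, here is a minimal sketch of wiring the go-nvml dgxa100 mock into the nvidia-container-toolkit's nvcdi library to produce a CDI spec without real hardware. The output path and driver root are illustrative placeholders, the option names reflect my understanding of the nvcdi package, and the PR additionally wraps the mock with MIG stubs for nvcdi compatibility, so treat this as an assumption-laden outline rather than the PR's actual code.

```go
package main

import (
	"log"
	"os"
	"path/filepath"

	"github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100"
	"github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi"
)

func main() {
	// The dgxa100 mock simulates an 8x A100 DGX system entirely in memory.
	nvmllib := dgxa100.New()

	// Hand the mock NVML implementation to nvcdi so device discovery runs
	// against simulated topology instead of real hardware.
	cdilib, err := nvcdi.New(
		nvcdi.WithNvmlLib(nvmllib),
		nvcdi.WithMode("nvml"),
		nvcdi.WithDriverRoot("/run/nvidia/mock-driver"), // placeholder mock driver root
	)
	if err != nil {
		log.Fatalf("failed to construct nvcdi library: %v", err)
	}

	spec, err := cdilib.GetSpec()
	if err != nil {
		log.Fatalf("failed to generate CDI spec: %v", err)
	}

	// Placeholder output location; Save validates and serializes the spec.
	outputDir := "/var/run/cdi"
	if err := os.MkdirAll(outputDir, 0o755); err != nil {
		log.Fatalf("failed to create CDI output directory: %v", err)
	}
	if err := spec.Save(filepath.Join(outputDir, "nvidia-mock.yaml")); err != nil {
		log.Fatalf("failed to save CDI spec: %v", err)
	}
}
```

In the PR itself this flow is split across pkg/gpu/mocktopo (topology), pkg/gpu/cdi (spec generation), and cmd/gpu-mockctl (CLI wiring).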
Reviewed Changes
Copilot reviewed 22 out of 364 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/gpu/mocktopo/provider.go | Core topology provider with dgxa100 machine type registry |
| pkg/gpu/mocktopo/nvmlwrapper.go | NVML wrapper adding MIG support stubs for CDI compatibility |
| pkg/gpu/mocktopo/localmock.go | Fallback topology generator for unsupported machine types |
| pkg/gpu/mockfs/util.go | Utility functions for PCI bus ID normalization |
| pkg/gpu/mockfs/procfs.go | Mock filesystem layout generation for proc/dev structures |
| pkg/gpu/mockfs/devnodes.go | Character device node creation utilities |
| pkg/gpu/mockdriver/write.go | File specification writer for mock driver trees |
| pkg/gpu/mockdriver/tree.go | Mock driver file specifications with versioned libraries |
| pkg/gpu/cdi/validate.go | CDI specification validation utilities |
| pkg/gpu/cdi/specgen.go | CDI specification generation using nvidia-container-toolkit |
| pkg/gpu/cdi/paths.go | CDI path constants and defaults |
| hack/kind-up-gpu-mock.sh | Build automation script for GPU mock deployment |
| go.mod | Dependency additions for NVML, CDI, and container toolkit integration |
| deployments/devel/gpu-mock/README.md | Comprehensive documentation for the GPU mock infrastructure |
| deployments/devel/gpu-mock/Makefile | Build and deployment automation |
| deployments/devel/gpu-mock/Dockerfile | Container image build configuration |
| deployments/devel/gpu-mock/40-job-cdi-smoke.yaml | CDI smoke test job manifest |
| deployments/devel/gpu-mock/30-daemonset-cdi-mock.yaml | DaemonSet for node-level CDI mock setup |
| deployments/devel/gpu-mock/20-job-gpu-mock-verify.yaml | Mock filesystem verification job |
| deployments/devel/gpu-mock/10-configmap-verify-script.yaml | Verification script ConfigMap |
| deployments/devel/gpu-mock/00-namespace.yaml | Namespace resource for GPU mock deployment |
| cmd/gpu-mockctl/main.go | Main CLI application with filesystem and CDI generation modes |
- Add mock driver tree generation (pkg/gpu/mockdriver)
- Integrate nvidia-container-toolkit nvcdi for CDI spec generation
- Extend go-nvml dgxa100 mock with MIG stubs for nvcdi compatibility
- Add DaemonSet to deploy CDI mock on all nodes
- Add comprehensive smoke tests for CDI verification
- Update gpu-mockctl CLI with fs/cdi/all modes
- Use __NVCT_TESTING_DEVICES_ARE_FILES from nvidia-container-toolkit tests
- Generate empty files with versioned naming
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
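As a rough illustration of the "empty files with versioned naming" idea above, a mock driver tree writer might look something like the sketch below. The helper name and directory layout are hypothetical, not the PR's actual pkg/gpu/mockdriver API; the 550.54.15 version string matches the dgxa100 mock driver version referenced later in the review.

```go
package mockdriver

import (
	"os"
	"path/filepath"
)

// writeVersionedStub creates an empty, versioned library file (for example
// libnvidia-ml.so.550.54.15) under the mock driver root. Hypothetical helper;
// the real file list lives in pkg/gpu/mockdriver/tree.go.
func writeVersionedStub(driverRoot, libDir, name, version string) (string, error) {
	dir := filepath.Join(driverRoot, libDir)
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return "", err
	}
	path := filepath.Join(dir, name+".so."+version)
	// An empty file is enough: the mock only needs the path to exist for
	// CDI generation and mount wiring, not a working binary.
	return path, os.WriteFile(path, nil, 0o644)
}
```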
Force-pushed from 975f57a to ca4e484 (Compare)
Pull Request Overview
Copilot reviewed 22 out of 364 changed files in this pull request and generated no new comments.
Once NVIDIA/go-nvml#163 is merged, I'll add the necessary changes to allow selection of any of the supported go-nvml mock systems.
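A registry-style extension point for that selection could look roughly like the following; the map keys, function name, and the assumption that the dgxa100 mock satisfies nvml.Interface directly are all hypothetical and would depend on what NVIDIA/go-nvml#163 ends up exposing.

```go
package mocktopo

import (
	"fmt"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
	"github.com/NVIDIA/go-nvml/pkg/nvml/mock/dgxa100"
)

// mockBuilders maps a machine-type name to a constructor for its go-nvml
// mock. Only dgxa100 exists today; H100/H200/B200 entries would be added as
// the corresponding mocks become available upstream. (Illustrative sketch.)
var mockBuilders = map[string]func() nvml.Interface{
	"dgxa100": func() nvml.Interface { return dgxa100.New() },
}

// newMockNVML returns the mock NVML implementation for a machine type, or an
// error if no mock is registered for it.
func newMockNVML(machine string) (nvml.Interface, error) {
	build, ok := mockBuilders[machine]
	if !ok {
		return nil, fmt.Errorf("unsupported machine type: %q", machine)
	}
	return build(), nil
}
```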
Pull Request Overview
Copilot reviewed 22 out of 364 changed files in this pull request and generated no new comments.
cmd/gpu-mockctl/main.go (Outdated)

```go
}

func run(cfg *config) error {
	// Get topology (A100-only for now)
```
Once NVIDIA/go-nvml#163 is merged, I'll expand this section.
From the repository root:

```bash
make -C deployments/devel/gpu-mock all
```
Tested on a macOS system
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
elezar left a comment:
Some initial comments ... more next week.
```go
}

// newNVMLWrapper creates an NVML interface wrapper that adds missing
// methods required by nvidia-container-toolkit's nvcdi library.
```
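For context, the embed-and-override pattern behind such a wrapper is small; the sketch below is illustrative only and stubs a single MIG query, whereas the actual set of stubbed methods is defined in pkg/gpu/mocktopo/nvmlwrapper.go.

```go
package mocktopo

import "github.com/NVIDIA/go-nvml/pkg/nvml"

// migDisabledDevice wraps a mock device and answers the MIG queries that
// nvcdi makes during device discovery. (Hypothetical sketch, not the PR's
// actual wrapper.)
type migDisabledDevice struct {
	nvml.Device
}

// GetMigMode reports MIG as disabled for both the current and pending modes,
// matching the non-MIG dgxa100 topology.
func (d migDisabledDevice) GetMigMode() (int, int, nvml.Return) {
	return nvml.DEVICE_MIG_DISABLE, nvml.DEVICE_MIG_DISABLE, nvml.SUCCESS
}
```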
Why are we making these changes here instead of implementing them "upstream"?
```go
// Get topology
topo, err := mocktopo.New(cfg.Machine)
if err != nil {
	if os.Getenv("ALLOW_UNSUPPORTED") == "true" {
```
I want to understand why this is useful. Does having this option not complicate the implementation? Why not just default to the one machine type that we have and figure out how to expose new ones as they become available?
```go
// Validate before writing
log.Debugf("Validating CDI specification")
if err := cdi.Validate(specYAML); err != nil {
```
The spec is validated on save. We should not be validating this on our own; this should either be done in the nvcdi package or upstream.
```go
	return fmt.Errorf("failed to create CDI directory: %w", err)
}

if err := os.WriteFile(cfg.CDIOutput, specYAML, 0o644); err != nil {
```
I don't believe that this is how we output the CDI spec in the tooling that we release. See https://github.com/NVIDIA/nvidia-container-toolkit/blob/5bb032d60486da9b441a208f225f911efbad35f2/cmd/nvidia-ctk/cdi/generate/generate.go#L332
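Taken together with the validation comment above, this points toward a Save-based flow along the following lines. It is a sketch only: the function name writeSpec is hypothetical, cfg-style parameters are simplified to plain arguments, and the nvcdi types are used as I understand them from the linked toolkit code.

```go
package cdi

import (
	"fmt"

	"github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi"
)

// writeSpec lets the nvcdi spec validate and serialize itself on Save instead
// of marshalling YAML and calling os.WriteFile by hand. (Hypothetical sketch.)
func writeSpec(cdilib nvcdi.Interface, output string) error {
	spec, err := cdilib.GetSpec()
	if err != nil {
		return fmt.Errorf("failed to generate CDI spec: %w", err)
	}
	// Save validates the spec and writes it using the toolkit's own
	// formatting, so no separate Validate / WriteFile step is needed.
	if err := spec.Save(output); err != nil {
		return fmt.Errorf("failed to write CDI spec: %w", err)
	}
	return nil
}
```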
```go
// Also create under driverRoot/dev for completeness
log.Debugf("Creating device nodes under %s/dev", cfg.DriverRoot)
driverDevNodes := mockdriver.DeviceNodes(cfg.DriverRoot, gpuCount, cfg.WithDRI)
```
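For reference, creating one of these character device nodes is essentially a mknod call; the helper below is a hypothetical sketch of what the pkg/gpu/mockfs/devnodes.go utilities need to do (195 is the NVIDIA character device major, and mknod requires elevated privileges).

```go
package mockfs

import (
	"fmt"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// createGPUNode creates <root>/dev/nvidia<index> as a character device with
// the NVIDIA major number and the GPU index as the minor number.
// (Hypothetical sketch, Linux-only.)
func createGPUNode(root string, index int) error {
	path := filepath.Join(root, "dev", fmt.Sprintf("nvidia%d", index))
	mode := uint32(unix.S_IFCHR | 0o666)
	dev := int(unix.Mkdev(195, uint32(index)))
	if err := unix.Mknod(path, mode, dev); err != nil {
		return fmt.Errorf("mknod %s: %w", path, err)
	}
	return nil
}
```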
Why do we create these under two locations?
```go
if o.Vendor != "" {
	opts = append(opts, nvcdi.WithVendor(o.Vendor))
}
if o.Class != "" {
	opts = append(opts, nvcdi.WithClass(o.Class))
}
```
nvcdi already handles the empty case.
```go
// Get the raw spec and marshal to YAML
rawSpec := spec.Raw()
return yaml.Marshal(rawSpec)
```
Why not return a typed variable (i.e. just spec)?
```go
const (
	// DefaultCDIRoot is the standard location for CDI specifications.
	DefaultCDIRoot = "/etc/cdi"
```
Is there a specific reason to use /etc/cdi? I would argue that we should use /var/run/cdi.
```go
	DefaultCDIRoot = "/etc/cdi"

	// DefaultSpecPath is the default output path for the NVIDIA CDI spec.
	DefaultSpecPath = "/etc/cdi/nvidia.yaml"
```
Should we give this a different path (and also a different vendor and/or class)?
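One way to act on both of these suggestions, using the nvcdi WithVendor/WithClass options already shown in the diff above: give the mock its own vendor and keep the spec under the runtime CDI directory, so mock devices can never be mistaken for real nvidia.com/gpu devices. The constant names and values below are hypothetical.

```go
package cdi

import "github.com/NVIDIA/nvidia-container-toolkit/pkg/nvcdi"

// Hypothetical alternative defaults for the mock environment.
const (
	// MockSpecPath keeps generated specs in the runtime CDI directory.
	MockSpecPath = "/var/run/cdi/mock-nvidia.yaml"
	// MockVendor and MockClass make mock devices addressable as
	// mock.nvidia.com/gpu=<index> rather than nvidia.com/gpu=<index>.
	MockVendor = "mock.nvidia.com"
	MockClass  = "gpu"
)

// mockOptions returns the nvcdi options for the hypothetical mock identity.
func mockOptions() []nvcdi.Option {
	return []nvcdi.Option{
		nvcdi.WithVendor(MockVendor),
		nvcdi.WithClass(MockClass),
	}
}
```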
```go
// Match dgxa100 mock driver version
driverVer := "550.54.15"

files := []FileSpec{
```
Question: How about STARTING with a CDI spec from an actual system and GENERATING this list instead?
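A sketch of that suggestion, assuming the captured spec is read with the CDI specs-go types: take a CDI spec generated on a real system once, then derive the mock file list from its mount entries instead of hand-maintaining it. The function name is hypothetical.

```go
package mockdriver

import (
	"os"

	"sigs.k8s.io/yaml"
	specs "tags.cncf.io/container-device-interface/specs-go"
)

// filesFromSpec lists the host library and binary paths referenced by an
// existing CDI spec, which could seed the mock FileSpec list.
// (Illustrative sketch.)
func filesFromSpec(path string) ([]string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var spec specs.Spec
	if err := yaml.Unmarshal(data, &spec); err != nil {
		return nil, err
	}
	var files []string
	for _, m := range spec.ContainerEdits.Mounts {
		files = append(files, m.HostPath)
	}
	return files, nil
}
```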
#182 is a refactored version of this PR.
This pull request introduces a new GPU mock driver and CDI (Container Device Interface) test/development environment, including a new CLI tool (gpu-mockctl) and a full set of Kubernetes deployment manifests for local testing. The changes provide a way to generate a mock NVIDIA GPU driver filesystem, create CDI specs, and verify them in a Kubernetes cluster (e.g., kind). Together, they provide a robust, reproducible environment for simulating NVIDIA GPU hardware and CDI integration in development and CI workflows, without requiring access to real GPUs.