Add GPU Mock Infrastructure #182
base: main
Conversation
Pull Request Overview
This PR introduces a mock NVIDIA GPU environment to support testing without physical GPUs, along with CI wiring. It adds a mock NVML C library, a Go-based “gpu-mockctl” tool to generate a mock driver filesystem, Kubernetes deployment artifacts (Helm chart and static manifests), documentation, and GitHub Actions to test the mock NVML.
- Adds a production-like mock NVML (C) plus comprehensive tests and build system
- Introduces gpu-mockctl CLI and Kubernetes deployment (Helm + manifests) to set up mock driver, toolkit, and tests
- Updates CI workflows and dependencies to build/test these components
Reviewed Changes
Copilot reviewed 73 out of 170 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| go.mod | Sets module Go version and adds dependencies (cli, go-nvml, x/sys) |
| pkg/gpu/mocknvml/src/*.c | Implements core NVML APIs and stubs; provides system/device/memory/MIG logic |
| pkg/gpu/mocknvml/data/devices.h | Mock A100 device inventory and attributes |
| pkg/gpu/mocknvml/Makefile | Build, test, and info targets for the mock NVML library |
| pkg/gpu/mocktopo/* | Adds a dgxa100 mock topology provider with an NVML wrapper |
| pkg/gpu/mockfs/* and pkg/gpu/mockdriver/* | Utilities to create mock /dev nodes, driver tree, and copy NVML libs |
| cmd/gpu-mockctl/* | Adds a CLI with logging, config, and tests |
| deployments/devel/gpu-mock/**/* | Dockerfile, Helm chart, static manifests, and test pods |
| docs/**/* | User guide, API reference, architecture docs |
| .github/workflows/* | Adds a workflow to compile and test mock NVML and updates CI |
```dockerfile
RUN CGO_ENABLED=1 \
    go build -a -ldflags '-linkmode external -extldflags "-static"' \
    -o /out/gpu-mockctl ./cmd/gpu-mockctl
```
Copilot AI (Oct 16, 2025):
Building a cgo-enabled binary with fully static linking typically fails on Debian-based images without static libc (libc6-dev-static) and related static archives. Either remove '-extldflags "-static"' and build dynamically, or install static libc (or use a musl-based image), or set CGO_ENABLED=0 if the binary doesn’t require cgo.
```c
}

// Get driver branch info
nvmlReturn_t DECLDIR nvmlSystemGetDriverBranch(nvmlSystemDriverBranchInfo_t *branchInfo, unsigned int length) {
```
Copilot AI (Oct 16, 2025):
nvmlSystemGetDriverBranch writes into the struct pointer directly by casting to char*, which does not respect the nvmlSystemDriverBranchInfo_t layout. It should copy into the struct field (e.g., branchInfo->branch) and bound checks should use the field’s size (or NVML buffer macro), not the raw struct pointer.
```c
    if (length < strlen(branch) + 1) {
        return NVML_ERROR_INSUFFICIENT_SIZE;
    }

    strncpy((char*)branchInfo, branch, length);
    ((char*)branchInfo)[length - 1] = '\0';
```
Copilot AI (Oct 16, 2025):
nvmlSystemGetDriverBranch writes into the struct pointer directly by casting to char*, which does not respect the nvmlSystemDriverBranchInfo_t layout. It should copy into the struct field (e.g., branchInfo->branch) and bound checks should use the field’s size (or NVML buffer macro), not the raw struct pointer.
Suggested change:
```diff
-    if (length < strlen(branch) + 1) {
-        return NVML_ERROR_INSUFFICIENT_SIZE;
-    }
-    strncpy((char*)branchInfo, branch, length);
-    ((char*)branchInfo)[length - 1] = '\0';
+    size_t branch_field_size = sizeof(branchInfo->branch);
+    // Use the minimum of the provided length and the branch field size for safety
+    size_t copy_size = (length < branch_field_size) ? length : branch_field_size;
+    if (copy_size < strlen(branch) + 1) {
+        return NVML_ERROR_INSUFFICIENT_SIZE;
+    }
+    strncpy(branchInfo->branch, branch, copy_size);
+    branchInfo->branch[copy_size - 1] = '\0';
```
```c
    if (branchInfo == NULL || length == 0) {
        return NVML_ERROR_INVALID_ARGUMENT;
    }

    const char *branch = "r550_00";

    if (length < strlen(branch) + 1) {
        return NVML_ERROR_INSUFFICIENT_SIZE;
    }

    strncpy((char*)branchInfo, branch, length);
    ((char*)branchInfo)[length - 1] = '\0';
```
Copilot AI (Oct 16, 2025):
nvmlSystemGetDriverBranch writes into the struct pointer directly by casting to char*, which does not respect the nvmlSystemDriverBranchInfo_t layout. It should copy into the struct field (e.g., branchInfo->branch) and bound checks should use the field’s size (or NVML buffer macro), not the raw struct pointer.
Suggested change:
```diff
-    if (branchInfo == NULL || length == 0) {
-        return NVML_ERROR_INVALID_ARGUMENT;
-    }
-    const char *branch = "r550_00";
-    if (length < strlen(branch) + 1) {
-        return NVML_ERROR_INSUFFICIENT_SIZE;
-    }
-    strncpy((char*)branchInfo, branch, length);
-    ((char*)branchInfo)[length - 1] = '\0';
+    if (branchInfo == NULL) {
+        return NVML_ERROR_INVALID_ARGUMENT;
+    }
+    const char *branch = "r550_00";
+    size_t branch_field_size = sizeof(branchInfo->branch);
+    if (branch_field_size < strlen(branch) + 1) {
+        return NVML_ERROR_INSUFFICIENT_SIZE;
+    }
+    strncpy(branchInfo->branch, branch, branch_field_size);
+    branchInfo->branch[branch_field_size - 1] = '\0';
```
```c
    const mock_device_info_t *dev = &mock_devices[index];

    pci->domain = dev->pci_domain;
    pci->bus = dev->pci_bus;
    pci->device = dev->pci_device;
    pci->pciDeviceId = dev->pci_device_id;
    pci->pciSubSystemId = dev->pci_subsystem_id;

    strncpy(pci->busId, dev->pci_bus_id_legacy, sizeof(pci->busId));
    pci->busId[sizeof(pci->busId) - 1] = '\0';

    strncpy(pci->busIdLegacy, dev->pci_bus_id_legacy, sizeof(pci->busIdLegacy));
    pci->busIdLegacy[sizeof(pci->busIdLegacy) - 1] = '\0';
```
Copilot AI (Oct 16, 2025):
pci->busId is being populated with the legacy bus ID. Use dev->pci_bus_id (extended format) for pci->busId, and keep dev->pci_bus_id_legacy for pci->busIdLegacy to align with NVML’s expected fields.
```go
func TestLogger(t *testing.T) {
	// Capture log output
	var buf bytes.Buffer
	log.SetOutput(&buf)
```
Copilot AI (Oct 16, 2025):
The test redirects the stdlib 'log' output, but the logger writes to its own configured io.Writer (default os.Stderr). As a result, the buffer never captures Info/Warning/Error output. Construct the logger with NewWithConfig and set Output to &buf to correctly capture logs in the test.
```go
	defer log.SetOutput(os.Stderr)

	// Test non-verbose logger
	l := New("test", false)
```
Copilot AI (Oct 16, 2025):
The test redirects the stdlib 'log' output, but the logger writes to its own configured io.Writer (default os.Stderr). As a result, the buffer never captures Info/Warning/Error output. Construct the logger with NewWithConfig and set Output to &buf to correctly capture logs in the test.
```go
	if len(cmd.Commands) != len(expectedCommands) {
		t.Errorf("Expected %d commands, got %d", len(expectedCommands), len(cmd.Commands))
	}
```
Copilot AI (Oct 16, 2025):
This asserts an exact command count of 2, but the root command also includes 'driver' and 'all', causing the test to fail despite being correct. Remove the strict length equality check or expand 'expectedCommands' to include all registered subcommands.
Suggested change (delete the strict count check):
```diff
-	if len(cmd.Commands) != len(expectedCommands) {
-		t.Errorf("Expected %d commands, got %d", len(expectedCommands), len(cmd.Commands))
-	}
```
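A less brittle version of this assertion checks that each expected subcommand is registered rather than pinning the total count. This is a sketch; the command names below are illustrative stand-ins for the names collected from `cmd.Commands`:

```go
package main

import "fmt"

// missingCommands reports which expected subcommand names are absent
// from the registered list, ignoring any extra registered commands.
func missingCommands(registered, expected []string) []string {
	have := make(map[string]bool, len(registered))
	for _, name := range registered {
		have[name] = true
	}
	var missing []string
	for _, name := range expected {
		if !have[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Illustrative values, not taken from the PR: the test passes even
	// when extra subcommands like "driver" and "all" are registered.
	registered := []string{"toolkit", "test", "driver", "all"}
	expected := []string{"toolkit", "test"}
	fmt.Println(missingCommands(registered, expected)) // []
}
```

In the real test, a non-empty result would feed `t.Errorf`, so adding new subcommands later does not break the test.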
```go
	driverVer := "550.54.15"
	files := []string{
		"libnvidia-ml.so." + driverVer,
		"libnvidia-ml.so.1",
		"libnvidia-ml.so",
	}
```
Copilot AI (Oct 16, 2025):
[nitpick] The driver version string '550.54.15' is duplicated across the codebase (e.g., Makefiles, tree.go, docs). Consider centralizing this into a single constant or configuration to avoid drift and ease upgrades.
```go
	// Match dgxa100 mock driver version
	driverVer := "550.54.15"
```
Copilot AI (Oct 16, 2025):
[nitpick] Hardcoding the driver version here duplicates the same string used elsewhere. Prefer a shared constant or config to keep all references in sync.
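One way to centralize the version, as both nitpicks suggest, is a single shared constant that every call site derives its names from. The package layout and names here are illustrative, not from the PR:

```go
package main

import "fmt"

// MockDriverVersion is the single source of truth for the mock driver
// version; Makefiles and docs could be checked against it in CI.
// (The constant name is illustrative, not from the PR.)
const MockDriverVersion = "550.54.15"

// SonameFiles derives the library file names from the shared constant
// instead of repeating the version string at each call site.
func SonameFiles() []string {
	return []string{
		"libnvidia-ml.so." + MockDriverVersion,
		"libnvidia-ml.so.1",
		"libnvidia-ml.so",
	}
}

func main() {
	fmt.Println(SonameFiles()[0]) // libnvidia-ml.so.550.54.15
}
```

A Makefile could then inject the same value via `-ldflags "-X ...=$(DRIVER_VERSION)"` or read it from one shared file, so the string exists in exactly one place.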
Force-pushed df93a19 to 50e6c5e.
Pull Request Overview
Copilot reviewed 73 out of 170 changed files in this pull request and generated 9 comments.
```c
static const mock_device_info_t mock_devices[8] = {
    {
        .uuid = "GPU-4404041a-04cf-1ccf-9e70-f139a9b1e23c",
        .name = "NVIDIA A100-SXM4-40GB",
        .pci_bus_id = "00000000:00:00.0",
        .pci_bus_id_legacy = "0000:00:00.0",
        .serial = "1563221000001",
```
Copilot AI (Oct 16, 2025):
nvmlDeviceGetCudaComputeCapability reads mock_devices[idx].cuda_compute_capability_major/minor, but device 0 and device 7 do not initialize these fields here. That will return 0,0 (and break tests expecting 8.0). Initialize cuda_compute_capability_major=8 and cuda_compute_capability_minor=0 for all 8 devices.
```c
        .power_limit = 400000,   // 400W
        .clock_graphics = 1410,  // 1410 MHz
        .clock_sm = 1410,        // 1410 MHz
        .clock_memory = 1593,    // 1593 MHz
    },
```
Copilot AI (Oct 16, 2025):
nvmlDeviceGetCudaComputeCapability reads mock_devices[idx].cuda_compute_capability_major/minor, but device 0 and device 7 do not initialize these fields here. That will return 0,0 (and break tests expecting 8.0). Initialize cuda_compute_capability_major=8 and cuda_compute_capability_minor=0 for all 8 devices.
```c
        .power_limit = 400000,
        .clock_graphics = 1410,
        .clock_sm = 1410,
        .clock_memory = 1593,
    }
```
Copilot AI (Oct 16, 2025):
nvmlDeviceGetCudaComputeCapability reads mock_devices[idx].cuda_compute_capability_major/minor, but device 0 and device 7 do not initialize these fields here. That will return 0,0 (and break tests expecting 8.0). Initialize cuda_compute_capability_major=8 and cuda_compute_capability_minor=0 for all 8 devices.
```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <nvml.h>
```
Copilot AI (Oct 16, 2025):
This test includes nvml.h via system include paths, but the Makefile compiles it without adding the repository's include directory. On hosts without system NVML headers, compilation will fail. Either change the include to "../include/nvml.h" (like the comprehensive test) or add an include path (e.g., -I./include) in the test build rule.
Suggested change:
```diff
-#include <nvml.h>
+#include "../include/nvml.h"
```
```c
    // Enumerate devices
    for (unsigned int i = 0; i < deviceCount && i < 8; i++) {
        nvmlDevice_t device;
```
Copilot AI (Oct 16, 2025):
This test includes nvml.h via system include paths, but the Makefile compiles it without adding the repository's include directory. On hosts without system NVML headers, compilation will fail. Either change the include to "../include/nvml.h" (like the comprehensive test) or add an include path (e.g., -I./include) in the test build rule.
pkg/gpu/mocknvml/Makefile (outdated)
```make
	$(CC) -o $(BUILD_DIR)/test/test_nvml test/test_nvml.c -L$(LIB_DIR) -lnvidia-ml -Wl,-rpath,$(LIB_DIR)
	@echo "Running basic test..."
	$(BUILD_DIR)/test/test_nvml

test-comprehensive: all
	@echo "Building comprehensive test suite..."
	@mkdir -p $(BUILD_DIR)/test
	$(CC) -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
	@echo "Running comprehensive tests..."
	$(BUILD_DIR)/test/test_nvml_comprehensive

test-valgrind: all
	@echo "Running tests with valgrind..."
	@mkdir -p $(BUILD_DIR)/test
	$(CC) -g -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
```
Copilot AI (Oct 16, 2025):
Compilation of test/test_nvml.c does not add the mock NVML header path, causing `#include <nvml.h>` to fail on systems without system headers. Add -I./include (or reuse $(CFLAGS)) in the test build rule.
Suggested change:
```diff
-	$(CC) -o $(BUILD_DIR)/test/test_nvml test/test_nvml.c -L$(LIB_DIR) -lnvidia-ml -Wl,-rpath,$(LIB_DIR)
+	$(CC) $(CFLAGS) -o $(BUILD_DIR)/test/test_nvml test/test_nvml.c -L$(LIB_DIR) -lnvidia-ml -Wl,-rpath,$(LIB_DIR)
 	@echo "Running basic test..."
 	$(BUILD_DIR)/test/test_nvml
 
 test-comprehensive: all
 	@echo "Building comprehensive test suite..."
 	@mkdir -p $(BUILD_DIR)/test
-	$(CC) -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
+	$(CC) $(CFLAGS) -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
 	@echo "Running comprehensive tests..."
 	$(BUILD_DIR)/test/test_nvml_comprehensive
 
 test-valgrind: all
 	@echo "Running tests with valgrind..."
 	@mkdir -p $(BUILD_DIR)/test
-	$(CC) -g -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
+	$(CC) $(CFLAGS) -g -o $(BUILD_DIR)/test/test_nvml_comprehensive test/test_nvml_comprehensive.c -L$(LIB_DIR) -lnvidia-ml -lpthread -Wl,-rpath,$(LIB_DIR)
```
```go
			"unsupported MACHINE_TYPE %q (only 'dgxa100'); set "+
				"ALLOW_UNSUPPORTED=true to use fallback",
```
Copilot AI (Oct 16, 2025):
The error suggests setting ALLOW_UNSUPPORTED=true to use a fallback, but this function does not implement any fallback behavior. Either implement the fallback (e.g., calling NewFallback with a default GPU count) when ALLOW_UNSUPPORTED is set, or remove the guidance from the error message to avoid misleading users.
Suggested change:
```diff
-			"unsupported MACHINE_TYPE %q (only 'dgxa100'); set "+
-				"ALLOW_UNSUPPORTED=true to use fallback",
+			"unsupported MACHINE_TYPE %q (only 'dgxa100' is supported)",
```
```yaml
      {{- with .Values.containerToolkit.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- if .Values.containerToolkit.requireMockDriver }}
      nodeSelector:
        nvidia.com/gpu.present: "true"
      {{- end }}
```
Copilot AI (Oct 16, 2025):
nodeSelector is rendered twice; the latter block will overwrite the former or produce duplicate keys in YAML. Merge selectors instead (e.g., always render a single nodeSelector map and conditionally add nvidia.com/gpu.present when required).
Suggested change:
```diff
-{{- with .Values.containerToolkit.nodeSelector }}
-nodeSelector:
-  {{- toYaml . | nindent 8 }}
-{{- end }}
-{{- if .Values.containerToolkit.requireMockDriver }}
-nodeSelector:
-  nvidia.com/gpu.present: "true"
-{{- end }}
+{{- /*
+Merge .Values.containerToolkit.nodeSelector (if any) with
+nvidia.com/gpu.present: "true" if .Values.containerToolkit.requireMockDriver is true.
+This ensures only one nodeSelector block is rendered.
+*/ -}}
+{{- $baseSelector := .Values.containerToolkit.nodeSelector | default (dict) -}}
+{{- if .Values.containerToolkit.requireMockDriver -}}
+{{- $mergedSelector := merge $baseSelector (dict "nvidia.com/gpu.present" "true") -}}
+nodeSelector:
+  {{- toYaml $mergedSelector | nindent 8 }}
+{{- else if $baseSelector -}}
+nodeSelector:
+  {{- toYaml $baseSelector | nindent 8 }}
+{{- end }}
```
```yaml
      # Label node for GPU support
      - name: label-node
        image: bitnami/kubectl:latest
```
Copilot AI (Oct 16, 2025):
Using the 'latest' tag makes deployments non-reproducible and can break unexpectedly. Pin to a specific, tested kubectl image tag (e.g., bitnami/kubectl:1.30.2) and keep it updated intentionally.
Suggested change:
```diff
-        image: bitnami/kubectl:latest
+        image: bitnami/kubectl:1.30.2
```
elezar left a comment:
This is quite difficult to review without a high-level overview of how the pieces fit together (PR Description?).
Also, in the case of the "mock" libnvidia-ml.so library, I don't think we should be reimplementing a mock in C. We should take our existing mocks implemented in Go and build these as a C library exposing the "correct" API. This library should also have SOME mechanism to be changed to handle different infrastructures and behaviours. This need not be the initial implementation, but some thought should be given as to how we would inject failures etc into the library.
On the filesystem changes. Does it make sense to take a CDI spec as input and use this to populate the filesystem in some way? This should in theory mean that if we use the mock filesystem (and mocked libnvidia-ml.so) as the input to CDI spec generation (as we would do in the device plugin), we get the input CDI spec.
On the proc path: If we limit ourselves to running in a container, then we may not need to worry about the "driver root" and where we create "/proc"? In this case we could manipulate the procfs in the container to look the way we need it to and ALSO create the required device nodes. (this is not something that I've validated though).
Pull Request Overview
Copilot reviewed 93 out of 186 changed files in this pull request and generated 2 comments.
```c
        .cuda_compute_capability_major = 8,
        .cuda_compute_capability_minor = 0
```
Copilot AI (Nov 11, 2025):
The CUDA compute capability fields are missing for device 0 (lines 54-86) while being present for other devices (e.g., lines 114-115). This inconsistency could lead to unexpected behavior if these fields are accessed for device 0.
```c
        .clock_sm = 1410,
        .clock_memory = 1593,
        .cuda_compute_capability_major = 8,
        .cuda_compute_capability_minor = 0
```
Copilot AI (Nov 11, 2025):
The CUDA compute capability fields are missing for the last device (device 7, lines 268-295). This inconsistency with other devices should be corrected to ensure uniform device properties.
Force-pushed 1e83a87 to 3221c4e.
Should we not remove the C source files?
Why are the built artifacts in the repo? Should there be a Make file to build them instead?
I need to add them to the .gitignore; this is just left over from my testing.
Is there a specific reason why this is a submodule?
This commit introduces a complete Go-based mock NVML library that replaces the previous C implementation.

Key improvements:

Architecture:
- Bridge layer (pkg/gpu/mocknvml/bridge/): CGo exports exposing 49 NVML C functions with proper type conversions and error handling
- Engine layer (pkg/gpu/mocknvml/engine/): Mock server using go-nvml's dgxa100.Server with singleton pattern and handle table
- Test suite: C test programs verifying library functionality

Implementation highlights:
- Leverages upstream go-nvml mock implementations (dgxa100)
- Type-safe Go code with memory-safe handle management
- Runtime configuration via environment variables
- Supports NVIDIA Device Plugin and DRA Driver requirements

Build system:
- Unified Makefile with both C and Go targets
- CGo-based shared library generation (libnvidia-ml.so)
- Proper library versioning and symlinking
- Cross-platform support (Linux, macOS via Docker)

Testing:
- Integration with existing gpu-mockctl infrastructure
- Kubernetes deployment via Helm charts
- E2E validation with NVIDIA Device Plugin

This provides a maintainable foundation for mock GPU testing in Kubernetes environments without requiring physical hardware.
Implements the 'gpu-mockctl cdi' command to parse Container Device Interface (CDI) specifications and extract mock GPU infrastructure configuration.

Features:
- Parses CDI v0.5.0 specifications (YAML/JSON)
- Auto-detects GPU architecture from nvidia.com/gpu.model annotation
- Extracts device nodes (nvidia0-N, nvidiactl, nvidia-uvm*)
- Generates /proc/driver/nvidia/gpus/* entries
- Outputs simplified JSON configuration for infrastructure setup

Command usage:

    gpu-mockctl cdi --spec cdi-spec.yaml --output mock-config.json

This enables declarative GPU topology definition via the industry-standard CDI format, providing flexibility for testing different hardware configurations without code changes.
Adds an intelligent entrypoint that detects and configures the mock GPU infrastructure based on the operating mode.

Operating modes:

1. Default mode: Uses built-in dgxa100 configuration (8 A100 GPUs)
   - Zero configuration required
   - Activated when no CDI spec is present
   - Sets MOCK_GPU_ARCH=dgxa100, MOCK_NVML_NUM_DEVICES=8

2. CDI mode: Uses declarative CDI specification
   - Activated when CDI spec file is detected
   - Calls gpu-mockctl cdi to parse specification
   - Creates device nodes dynamically
   - Sets up /proc filesystem entries
   - Configures environment from CDI content

The entrypoint provides clear logging of the selected mode and configuration, then executes the gpu-mockctl driver command.

Container updates:
- Modified Dockerfile to include jq for JSON parsing
- Set entrypoint.sh as container ENTRYPOINT
- Added infrastructure setup logic before driver execution
Provides reference CDI specifications for testing different GPU configurations.

Files:
- cdi-spec-a100-2gpu.yaml: Minimal 2-GPU A100 configuration
  * Useful for quick testing and validation
  * Lower resource footprint
  * Demonstrates minimal CDI structure
- cdi-spec-a100-8gpu.yaml: Full DGX A100 configuration
  * Simulates complete DGX A100 system (8 GPUs)
  * Matches default dgxa100 behavior
  * Comprehensive device node setup

Both specifications follow the CDI v0.5.0 standard and include:
- Proper device node definitions (nvidia0-N, nvidiactl, nvidia-uvm*)
- GPU model annotations for architecture detection
- Container edit specifications
- Required metadata

Usage:

    kubectl create configmap my-spec \
      --from-file=spec.yaml=examples/cdi-spec-a100-2gpu.yaml
    helm install gpu-mock ./helm/gpu-mock \
      --set cdi.enabled=true \
      --set cdi.configMapName=my-spec
As a general note here, it feels as if we're trying to do too many things in a single PR. Am I correct in stating the core problem of this project as "providing a representation of a GPU-enabled system to allow us to test our own components without actually having access to GPUs"? What are the use cases that we would like to cover here? For this PR specifically, there seem to be two core work items here (not in order of priority):
Other non-core questions include:
I would argue that each of these is large enough to be its own PR.
Force-pushed 3221c4e to df0941e.
This is good feedback, I'll break this down into multiple PRs.
Overview
Problem Statement
Previous limitations: `go-nvml` mocks …

What this PR solves: `go-nvml` mocks …

🏗️ Architecture
High-Level Design
Component Layers
1. Go-Based NVML Library (Commit 1)
Files: `pkg/gpu/mocknvml/{bridge/, engine/}`

Bridge Layer (`bridge/`): CGo exports exposing 49 NVML C functions
- `bridge.go`: Init, shutdown, system info, error handling
- `device.go`: Device enumeration, properties, UUID
- `memory.go`: Memory queries (device, BAR1)
- `pci.go`: PCI bus information
- `process.go`: Process enumeration
- `events.go`: Event monitoring stubs
- `mig.go`: MIG stubs (not supported)

Engine Layer (`engine/`): Mock server using `go-nvml`

Key Benefits:
- Built on `go-nvml/pkg/nvml/mock/dgxa100` (8 A100 GPUs)

2. CDI Parser (Commit 2)
Files: `cmd/gpu-mockctl/commands/cdi.go`

Parses CDI specifications and generates mock infrastructure configuration:
Features:
- Auto-detects architecture from the `nvidia.com/gpu.model` annotation
- Generates `/proc/driver/nvidia/gpus/*` entries

3. Entrypoint Orchestration (Commit 3)
Files: `deployments/devel/gpu-mock/container/entrypoint.sh`

Intelligent mode detection and setup:
Logs clearly indicate mode:
4. Helm Chart Enhancements (Commit 4)
Files: `deployments/devel/gpu-mock/helm/gpu-mock/{values.yaml, templates/}`

Adds CDI configuration options while maintaining backward compatibility:
DaemonSet changes:
- Runs `entrypoint.sh` instead of a direct `gpu-mockctl` call

5. CDI Spec Examples (Commit 5)
Files: `deployments/devel/gpu-mock/examples/cdi-spec-*.yaml`

Both follow the CDI v0.5.0 specification with proper device nodes, annotations, and metadata.
Usage
Quick Start (Default Mode - Zero Config)
What happens:
- Sets `MOCK_GPU_ARCH=dgxa100`, `MOCK_NVML_NUM_DEVICES=8`

CDI Mode (Custom Topology)
What happens:
- Detects `/config/cdi-spec.yaml` → CDI mode
- Runs `gpu-mockctl cdi` to parse the spec

Testing
Validated Scenarios
Test Commands
Commit Breakdown
- `cmd/gpu-mockctl/commands/cdi.go`
- `entrypoint.sh`, `Dockerfile`
- `examples/cdi-spec-*.yaml`

Each commit is self-contained and tells a clear story of incremental enhancement.
Reviewer Guide
Recommended review order:
Commit 1 - Understand the complete infrastructure
- `pkg/gpu/mocknvml/{bridge,engine}/` for the NVML library
- `cmd/gpu-mockctl/` for CLI tools
- `deployments/` for Kubernetes integration

Commit 2 - CDI parser logic
Commit 3 - Integration layer
Commit 4 - Examples (optional, for testing)
Key files to review:
- `pkg/gpu/mocknvml/bridge/bridge.go` - CGo layer
- `pkg/gpu/mocknvml/engine/engine.go` - Mock server
- `cmd/gpu-mockctl/commands/driver.go` - Driver deployment
- `deployments/.../entrypoint.sh` - Mode detection
- `cmd/gpu-mockctl/commands/cdi.go` - CDI parser