Amd device plugin #748
Merged
yeazelm
reviewed
Nov 14, 2025
arnaldo2792
reviewed
Nov 17, 2025
yeazelm
reviewed
Nov 20, 2025
bcressey
reviewed
Nov 21, 2025
bcressey
reviewed
Nov 21, 2025
yeazelm
reviewed
Nov 22, 2025
yeazelm
reviewed
Nov 24, 2025
bcressey
reviewed
Nov 25, 2025
Signed-off-by: Gaurav Sharma <[email protected]>
bcressey
approved these changes
Nov 25, 2025
Signed-off-by: Gaurav Sharma <[email protected]>
Contributor
Author
Make rocm-k8s-device-plugin depend on [email protected] to ensure the AMD GPU driver loads first.
yeazelm
approved these changes
Nov 26, 2025
Description of changes:
This PR adds support for AMD GPU detection and device plugin functionality to Bottlerocket:
https://github.com/ROCm/k8s-device-plugin/blob/763445e18f3838fa72b22e31a04ec25987334bff/Dockerfile#L15
Dependencies
Testing done
AMD GPU Instance (MI300X with 8 GPUs)
PCI device detection:
ROCm device plugin running:
bash-5.1# systemctl list-units | grep -i rocm
rocm-k8s-device-plugin.service loaded active running Start ROCm kubernetes device plugin
bash-5.1# systemctl status rocm-k8s-device-plugin.service
● rocm-k8s-device-plugin.service - Start ROCm kubernetes device plugin
  Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/rocm-k8s-device-plugin.service; enabled; preset: enabled)
  Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
    └─00-aws-config.conf
  Active: active (running) since Wed 2025-11-26 06:22:34 UTC; 2min 47s ago
  Invocation: ae2b73174a7245b5976c1b9eb7835695
  Process: 47801 ExecStartPre=/usr/bin/sleep 0.1 (code=exited, status=0/SUCCESS)
  Process: 47823 ExecStartPre=/usr/bin/test -S /var/lib/kubelet/device-plugins/kubelet.sock (code=exited, status=0/SUCCESS)
  Main PID: 47825 (rocm-device-plu)
  Tasks: 12 (limit: 629145)
  Memory: 26.9M (peak: 28M)
  CPU: 63ms
  CGroup: /system.slice/rocm-k8s-device-plugin.service
    └─47825 /usr/bin/rocm-device-plugin -logtostderr=true -stderrthreshold=INFO -v=5

Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799514 47825 amdgpu.go:261] Devices map: map[0000:51:00.0:map[card:0 computePartitionType:spx devID:10833827524210284120 memoryPartitionType:nps1 nodeId:2 numaNode:0 renderD:128] 0000:52:00.0:map[card:8 computePartitionType:spx devID:5789222631865278233 memoryPartitionType:nps1 nodeId:3 numaNode:0 renderD:136] 0000:62:00.0:map[card:16 computePartitionType:spx devID:1537282919108789315 memoryPartitionType:nps1 nodeId:4 numaNode:0 renderD:144] 0000:63:00.0:map[card:24 computePartitionType:spx devID:17650589009955793669 memoryPartitionType:nps1 nodeId:5 numaNode:0 renderD:152] 0000:73:00.0:map[card:32 computePartitionType:spx devID:17599352850648317852 memoryPartitionType:nps1 nodeId:6 numaNode:1 renderD:160] 0000:74:00.0:map[card:40 computePartitionType:spx devID:16824009056103876010 memoryPartitionType:nps1 nodeId:7 numaNode:1 renderD:168] 0000:84:00.0:map[card:48 computePartitionType:spx devID:7459436697213601231 memoryPartitionType:nps1 nodeId:8 numaNode:1 renderD:176] 0000:85:00.0:map[card:56 computePartitionType:spx devID:10369896732957786260 memoryPartitionType:nps1 nodeId:9 numaNode:1 renderD:184]]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799540 47825 amdgpu.go:278] Partition counts: map[spx_nps1:8]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799548 47825 plugin.go:254] Watching GPU with bus ID: 0000:52:00.0 NUMA Node: [0]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799558 47825 plugin.go:254] Watching GPU with bus ID: 0000:62:00.0 NUMA Node: [0]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799561 47825 plugin.go:254] Watching GPU with bus ID: 0000:63:00.0 NUMA Node: [0]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799564 47825 plugin.go:254] Watching GPU with bus ID: 0000:73:00.0 NUMA Node: [1]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799567 47825 plugin.go:254] Watching GPU with bus ID: 0000:74:00.0 NUMA Node: [1]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799570 47825 plugin.go:254] Watching GPU with bus ID: 0000:84:00.0 NUMA Node: [1]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799572 47825 plugin.go:254] Watching GPU with bus ID: 0000:85:00.0 NUMA Node: [1]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799575 47825 plugin.go:254] Watching GPU with bus ID: 0000:51:00.0 NUMA Node: [0]

GPU capacity advertised to Kubernetes:
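The plugin log prints one "Watching GPU with bus ID" line per device, so the number of those lines should match the GPU count the node advertises. A minimal sketch for tallying them by NUMA node follows. It is not part of this PR; the function name `gpus_by_numa` and the `SAMPLE` excerpt are illustrative, and the regular expression assumes the log format stays exactly as shown in the output above.

```python
import re
from collections import defaultdict

# Matches the "Watching GPU" lines as they appear in the journal output
# above. The pattern is an assumption based on that output only; the
# plugin's log format may differ between versions.
WATCH_RE = re.compile(r"Watching GPU with bus ID: (\S+) NUMA Node: \[(\d+)\]")


def gpus_by_numa(log_text):
    """Return {numa_node: sorted list of PCI bus IDs} for watched GPUs."""
    groups = defaultdict(list)
    for bus_id, node in WATCH_RE.findall(log_text):
        groups[int(node)].append(bus_id)
    return {node: sorted(ids) for node, ids in groups.items()}


# Three lines excerpted from the log above, with the journal prefix trimmed.
SAMPLE = """\
I1126 06:22:34.799548 47825 plugin.go:254] Watching GPU with bus ID: 0000:52:00.0 NUMA Node: [0]
I1126 06:22:34.799564 47825 plugin.go:254] Watching GPU with bus ID: 0000:73:00.0 NUMA Node: [1]
I1126 06:22:34.799575 47825 plugin.go:254] Watching GPU with bus ID: 0000:51:00.0 NUMA Node: [0]
"""

if __name__ == "__main__":
    groups = gpus_by_numa(SAMPLE)
    for node in sorted(groups):
        print(f"NUMA node {node}: {len(groups[node])} GPU(s) {groups[node]}")
```

Run against the full journal output for the service, the per-node totals can be compared with the `Partition counts` line (8 devices in this test) and with whatever capacity the node reports.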
All services and targets that rocm-k8s-device-plugin.service depends on:
All services that depend on [email protected]:
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.