@mgsharm
Contributor

@mgsharm mgsharm commented Nov 14, 2025

Description of changes:

This PR adds support for AMD GPU detection and device plugin functionality to Bottlerocket.

Dependencies

Testing done

AMD GPU Instance (MI300X with 8 GPUs)

PCI device detection:

bash-5.1# lspci -d 1002:75a3
51:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
52:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
62:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
63:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
73:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
74:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
84:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
85:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3
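
For anyone who wants to automate the PCI check above, here is a minimal sketch that parses `lspci` output and counts the matching devices. The helper names (`pci_addresses`, `check_gpu_count`) are hypothetical and not part of this PR; the sample text is rebuilt from the eight lines shown above.

```python
import re
from typing import List

# Bus numbers taken from the lspci output above; the rest of each line is identical.
BUSES = ["51", "52", "62", "63", "73", "74", "84", "85"]
SAMPLE = "\n".join(
    f"{bus}:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Device 75a3"
    for bus in BUSES
)

# bus:device.function, optionally preceded by a PCI domain such as "0000:".
_ADDR = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s", re.IGNORECASE)


def pci_addresses(lspci_output: str) -> List[str]:
    """Return the PCI address found at the start of each lspci line."""
    addresses = []
    for line in lspci_output.splitlines():
        match = _ADDR.match(line)
        if match:
            addresses.append(match.group(1))
    return addresses


def check_gpu_count(lspci_output: str, expected: int) -> bool:
    """True when the number of listed devices equals the expected GPU count."""
    return len(pci_addresses(lspci_output)) == expected


if __name__ == "__main__":
    print(pci_addresses(SAMPLE))
    print(check_gpu_count(SAMPLE, expected=8))
```

In practice the input would come from running `lspci -d 1002:75a3` on the host rather than from the hard-coded sample.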

ROCm device plugin running:

bash-5.1# systemctl list-units | grep -i rocm
  rocm-k8s-device-plugin.service loaded active running Start ROCm kubernetes device plugin
bash-5.1# systemctl status rocm-k8s-device-plugin.service
● rocm-k8s-device-plugin.service - Start ROCm kubernetes device plugin
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/rocm-k8s-device-plugin.service; enabled; preset: enabled)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf
     Active: active (running) since Wed 2025-11-26 06:22:34 UTC; 2min 47s ago
 Invocation: ae2b73174a7245b5976c1b9eb7835695
    Process: 47801 ExecStartPre=/usr/bin/sleep 0.1 (code=exited, status=0/SUCCESS)
    Process: 47823 ExecStartPre=/usr/bin/test -S /var/lib/kubelet/device-plugins/kubelet.sock (code=exited, status=0/SUCCESS)
   Main PID: 47825 (rocm-device-plu)
      Tasks: 12 (limit: 629145)
     Memory: 26.9M (peak: 28M)
        CPU: 63ms
     CGroup: /system.slice/rocm-k8s-device-plugin.service
             └─47825 /usr/bin/rocm-device-plugin -logtostderr=true -stderrthreshold=INFO -v=5

Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799514   47825 amdgpu.go:261] Devices map: map[0000:51:00.0:map[card:0 computePartitionType:spx devID:10833827524210284120 memoryPartitionType:nps1 nodeId:2 numaNode:0 renderD:128] 0000:52:00.0:map[card:8 computePartitionType:spx devID:5789222631865278233 memoryPartitionType:nps1 nodeId:3 numaNode:0 renderD:136] 0000:62:00.0:map[card:16 computePartitionType:spx devID:1537282919108789315 memoryPartitionType:nps1 nodeId:4 numaNode:0 renderD:144] 0000:63:00.0:map[card:24 computePartitionType:spx devID:17650589009955793669 memoryPartitionType:nps1 nodeId:5 numaNode:0 renderD:152] 0000:73:00.0:map[card:32 computePartitionType:spx devID:17599352850648317852 memoryPartitionType:nps1 nodeId:6 numaNode:1 renderD:160] 0000:74:00.0:map[card:40 computePartitionType:spx devID:16824009056103876010 memoryPartitionType:nps1 nodeId:7 numaNode:1 renderD:168] 0000:84:00.0:map[card:48 computePartitionType:spx devID:7459436697213601231 memoryPartitionType:nps1 nodeId:8 numaNode:1 renderD:176] 0000:85:00.0:map[card:56 computePartitionType:spx devID:10369896732957786260 memoryPartitionType:nps1 nodeId:9 numaNode:1 renderD:184]]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799540   47825 amdgpu.go:278] Partition counts: map[spx_nps1:8]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799548   47825 plugin.go:254] Watching GPU with bus ID: 0000:52:00.0 NUMA Node: [0]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799558   47825 plugin.go:254] Watching GPU with bus ID: 0000:62:00.0 NUMA Node: [0]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799561   47825 plugin.go:254] Watching GPU with bus ID: 0000:63:00.0 NUMA Node: [0]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799564   47825 plugin.go:254] Watching GPU with bus ID: 0000:73:00.0 NUMA Node: [1]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799567   47825 plugin.go:254] Watching GPU with bus ID: 0000:74:00.0 NUMA Node: [1]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799570   47825 plugin.go:254] Watching GPU with bus ID: 0000:84:00.0 NUMA Node: [1]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799572   47825 plugin.go:254] Watching GPU with bus ID: 0000:85:00.0 NUMA Node: [1]
Nov 26 06:22:34 ip-172-31-60-25.us-west-2.compute.internal rocm-device-plugin[47825]: I1126 06:22:34.799575   47825 plugin.go:254] Watching GPU with bus ID: 0000:51:00.0 NUMA Node: [0]
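
The "Watching GPU" lines above are easier to review when grouped by NUMA node. A rough sketch of that grouping follows; the function name `gpus_by_numa` is made up for illustration, and the regex assumes the log format stays exactly as shown in this journal excerpt.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Matches the message emitted at plugin.go:254 in the journal output above.
_WATCH = re.compile(r"Watching GPU with bus ID: (\S+) NUMA Node: \[(\d+)\]")

# Three of the eight lines from the journal, with the host prefix trimmed.
SAMPLE_LOG = """\
I1126 06:22:34.799548   47825 plugin.go:254] Watching GPU with bus ID: 0000:52:00.0 NUMA Node: [0]
I1126 06:22:34.799564   47825 plugin.go:254] Watching GPU with bus ID: 0000:73:00.0 NUMA Node: [1]
I1126 06:22:34.799575   47825 plugin.go:254] Watching GPU with bus ID: 0000:51:00.0 NUMA Node: [0]
"""


def gpus_by_numa(log_text: str) -> Dict[int, List[str]]:
    """Group watched GPU bus IDs by the NUMA node reported in the log."""
    groups = defaultdict(list)
    for bus_id, node in _WATCH.findall(log_text):
        groups[int(node)].append(bus_id)
    # Sort both the nodes and the bus IDs so the result is stable across runs.
    return {node: sorted(ids) for node, ids in sorted(groups.items())}


if __name__ == "__main__":
    print(gpus_by_numa(SAMPLE_LOG))
```

Sorting matters here because the plugin logs the devices in no particular order (0000:51:00.0 appears last in the excerpt above).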

GPU capacity advertised to Kubernetes:

  ➜ kubectl get nodes -o json | jq '.items[].status.capacity'
  {
    "amd.com/gpu": "8",
    "cpu": "192",
    "ephemeral-storage": "15471392Ki",
    "hugepages-1Gi": "0",
    "hugepages-2Mi": "0",
    "memory": "4206782968Ki",
    "pods": "110"
  }
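
As a sketch of how the capacity check could be scripted, the snippet below reads the `amd.com/gpu` value from one node's capacity object and compares it with the expected GPU count. The helper `advertised_gpus` is hypothetical; the JSON is copied from the output above.

```python
import json

# Capacity object for the single node shown above.
CAPACITY_JSON = """
{
  "amd.com/gpu": "8",
  "cpu": "192",
  "ephemeral-storage": "15471392Ki",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "4206782968Ki",
  "pods": "110"
}
"""


def advertised_gpus(capacity: dict, resource: str = "amd.com/gpu") -> int:
    """Return the advertised count for an extended resource, or 0 if it is absent."""
    # Kubernetes reports capacity quantities as strings, so convert explicitly.
    return int(capacity.get(resource, "0"))


if __name__ == "__main__":
    capacity = json.loads(CAPACITY_JSON)
    print(advertised_gpus(capacity) == 8)
```

A node where the device plugin has not registered would have no `amd.com/gpu` key at all, which is why the helper falls back to 0 instead of raising.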

All services and targets that rocm-k8s-device-plugin.service depends on:

bash-5.1# systemctl list-dependencies rocm-k8s-device-plugin.service
rocm-k8s-device-plugin.service
● ├─kubelet.service
● ├─[email protected]
● ├─system.slice
● └─sysinit.target
●   ├─dev-hugepages.mount
●   ├─dev-mqueue.mount
●   ├─kmod-static-nodes.service
●   ├─ldconfig.service
●   ├─proc-sys-fs-binfmt_misc.mount
●   ├─sys-fs-fuse-connections.mount
●   ├─sys-kernel-config.mount
●   ├─sys-kernel-debug.mount
●   ├─sys-kernel-tracing.mount
●   ├─systemd-journal-catalog-update.service
●   ├─systemd-journal-flush.service
●   ├─systemd-journald.service
●   ├─systemd-machine-id-commit.service
●   ├─systemd-modules-load.service
●   ├─systemd-network-generator.service
○   ├─systemd-pstore.service
●   ├─systemd-random-seed.service
○   ├─systemd-repart.service
●   ├─systemd-resolved.service
●   ├─systemd-sysctl.service
●   ├─systemd-sysusers.service
●   ├─systemd-tmpfiles-setup-dev-early.service
●   ├─systemd-tmpfiles-setup-dev.service
●   ├─systemd-tmpfiles-setup.service
●   ├─systemd-udev-load-credentials.service
●   ├─systemd-udev-trigger.service
●   ├─systemd-udevd.service
●   ├─local-fs.target
●   │ ├─\x2ebottlerocket.mount
●   │ ├─has-boot-ever-succeeded.service
●   │ ├─local.mount
●   │ ├─mnt.mount
●   │ ├─opt.mount
●   │ ├─prepare-local-fs.service
●   │ ├─prepare-opt.service
●   │ ├─prepare-var-lib-containerd.service
●   │ ├─prepare-var-lib-kubelet.service
●   │ ├─prepare-var.service
○   │ ├─repart-data-fallback.service
○   │ ├─repart-data-preferred.service
●   │ ├─repart-local.service
●   │ ├─selinux-policy-files.service
●   │ ├─systemd-remount-fs.service
●   │ ├─tmp.mount
●   │ ├─var.mount
●   │ └─x86_64\x2dbottlerocket\x2dlinux\x2dgnu-sys\x2droot-usr-lib-modules.mount
●   └─swap.target
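
To pick out only the direct dependencies from a tree like the one above, something along these lines may help. This is a sketch, not part of the PR: `direct_dependencies` is a hypothetical helper, it assumes the tree drawing characters are exactly as printed here, and the sample is a shortened copy of the output that leaves out the unit whose name is obscured in this page.

```python
import re
from typing import List

# Shortened copy of the tree above: three direct dependencies, two nested ones.
SAMPLE_TREE = """\
rocm-k8s-device-plugin.service
● ├─kubelet.service
● ├─system.slice
● └─sysinit.target
●   ├─dev-hugepages.mount
●   └─swap.target
"""

# A direct dependency has the state glyph, one space, then a branch marker.
# Nested entries have extra indentation before the branch marker and do not match.
_DIRECT = re.compile(r"^\S\s[├└]─(\S+)\s*$")


def direct_dependencies(tree_output: str) -> List[str]:
    """Return the first-level units from `systemctl list-dependencies` output."""
    units = []
    for line in tree_output.splitlines():
        match = _DIRECT.match(line)
        if match:
            units.append(match.group(1))
    return units


if __name__ == "__main__":
    print(direct_dependencies(SAMPLE_TREE))
```

Checking that a given unit name appears in the returned list is then a one-line membership test.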

All services that depend on [email protected]:

bash-5.1# systemctl list-dependencies --reverse [email protected]
[email protected]
● └─rocm-k8s-device-plugin.service

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@mgsharm mgsharm force-pushed the amd-device-plugin branch 5 times, most recently from 432bae0 to 919378b Compare November 17, 2025 07:56
@mgsharm mgsharm force-pushed the amd-device-plugin branch 2 times, most recently from dab0d0f to e8d2452 Compare November 18, 2025 18:31
@mgsharm mgsharm marked this pull request as ready for review November 24, 2025 07:34
Signed-off-by: Gaurav Sharma <[email protected]>
@mgsharm mgsharm force-pushed the amd-device-plugin branch 2 times, most recently from b1990ca to cd39740 Compare November 25, 2025 22:29
@mgsharm
Contributor Author

mgsharm commented Nov 26, 2025

Make rocm-k8s-device-plugin depend on [email protected] to ensure the AMD GPU driver loads first.

@mgsharm mgsharm merged commit 84dca2c into bottlerocket-os:develop Nov 26, 2025
2 checks passed