
Commit 6b8fd13

wbc6080 and tangming1996 committed
add GPU docs
Signed-off-by: wbc6080 <[email protected]> Co-authored-by: ming.tang <[email protected]>
1 parent a450f7a commit 6b8fd13

File tree

  • docs/advanced
  • i18n/zh/docusaurus-plugin-content-docs/current/advanced

2 files changed: +585 -0 lines changed

docs/advanced/gpu.md

Lines changed: 296 additions & 0 deletions
@@ -0,0 +1,296 @@
---
title: Edge Pods use GPU
sidebar_position: 5
---

## Abstract

With the development of edge AI, the demand for deploying GPU applications on edge nodes is steadily increasing. KubeEdge can manage GPU nodes with a few configuration steps and allocate Nvidia GPU resources to edge applications through the k8s-device-plugin component. If you need this feature, follow the steps below.
## Getting Started

### GPU runtime environment setup

Using an Nvidia GPU on an edge node requires setting up a GPU runtime environment first, which mainly involves the following steps:
1. Install the GPU driver

First, determine whether the edge node actually has an Nvidia GPU; you can check with the `lspci | grep NVIDIA` command. Download the appropriate driver for your specific GPU model and complete the installation. After the installation is complete, run the `nvidia-smi` command to verify that the driver is installed successfully.
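The exact procedure depends on the distribution and GPU model; the following is only a minimal sketch, assuming an Ubuntu edge node with the `ubuntu-drivers` helper (package names on other distributions will differ):

```shell
# Check whether the node has an Nvidia GPU.
lspci | grep -i nvidia
# Install the recommended driver automatically (assumes Ubuntu; adjust for your distribution).
sudo apt update && sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
# Reboot, then verify that the driver and the GPU are visible.
nvidia-smi
```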
2. Install a container runtime

Before connecting the GPU node to the KubeEdge cluster, you need to install a container runtime such as Docker or containerd. For installation guides, please refer to [Container Runtime](https://kubeedge.io/docs/setup/prerequisites/runtime). A minimal containerd installation sketch follows the tip below.

:::tip
Since KubeEdge v1.14, support for Dockershim has been removed, so the Docker runtime can no longer manage edge containers directly. If you still need to use Docker, install [cri-dockerd](https://kubeedge.io/docs/setup/prerequisites/runtime#docker-engine) after installing Docker.
:::
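As mentioned above, here is a minimal sketch for installing containerd from the distribution packages, assuming an Ubuntu/Debian edge node; refer to the linked guide for other runtimes and distributions:

```shell
# Install containerd from the distribution repositories (assumes Ubuntu/Debian).
sudo apt update && sudo apt install -y containerd
# Generate a default configuration so it can be adjusted later (e.g. for the nvidia runtime).
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml
# Enable and start the runtime.
sudo systemctl enable --now containerd
```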
3. Install Nvidia-Container-Toolkit

- If the edge node can access the external network directly, install the toolkit by following the [official documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html).
- If the edge node cannot access the external network directly, download the official [offline installation package](https://github.com/NVIDIA/nvidia-container-toolkit/releases) on a machine with network connectivity, transfer it to the edge node, and extract it. After extraction, the following files should appear in the directory:

```shell
root@edgenode:~/release-v1.16.0-rc.1-experimental/packages/ubuntu18.04/amd64# ls
libnvidia-container1_1.16.0~rc.1-1_amd64.deb      libnvidia-container-tools_1.16.0~rc.1-1_amd64.deb  nvidia-container-toolkit-operator-extensions_1.16.0~rc.1-1_amd64.deb
libnvidia-container1-dbg_1.16.0~rc.1-1_amd64.deb  nvidia-container-toolkit_1.16.0~rc.1-1_amd64.deb
libnvidia-container-dev_1.16.0~rc.1-1_amd64.deb   nvidia-container-toolkit-base_1.16.0~rc.1-1_amd64.deb
```
Execute the following command in this directory to complete the installation:

```shell
sudo apt install ./*
```
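You can confirm that the toolkit packages were installed by checking the CLI version (a quick sanity check, not part of the official installation steps):

```shell
# Print the installed toolkit version to confirm the packages were installed.
nvidia-ctk --version
```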
4. Configure the container runtime to support GPUs

After successfully installing Nvidia-Container-Toolkit, you can use `nvidia-ctk` to configure each container runtime to support GPUs.

```shell
# docker
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
# containerd
sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
```
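For Docker, this step registers the `nvidia` runtime in `/etc/docker/daemon.json` and, with `--set-as-default`, makes it the default runtime. The resulting file typically looks similar to the following (exact contents may vary with your existing configuration):

```shell
root@edgenode:~# cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
```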
5. Restart the container runtime

Restart the container runtime and confirm that the GPU runtime is configured.

```shell
# docker:
systemctl daemon-reload && systemctl restart docker
# Check whether the runtime was modified successfully.
root@nano-desktop:~# docker info | grep Runtime
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: nvidia

# containerd:
systemctl daemon-reload && systemctl restart containerd
# Check whether the runtime was modified successfully.
root@edgenode:~# cat /etc/containerd/config.toml | grep nvidia
      default_runtime_name = "nvidia"
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
```
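As an optional smoke test, you can run `nvidia-smi` inside a container to confirm that the runtime can expose the GPU to workloads. This is an illustrative example only; the CUDA image tag is a suggestion and should match your driver version and architecture:

```shell
# Run nvidia-smi inside a CUDA base container; the GPU table should be printed.
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
```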
Through the above steps, the edge node now has a GPU driver installed and a container runtime capable of allocating GPU devices. Next, the edge node can be joined to the KubeEdge cluster.
### Edge GPU node management

Managing edge Nvidia GPU nodes mainly involves the following steps:

1. Join the node to the cluster

It is recommended to use the keadm tool to join edge nodes to the KubeEdge cluster. The procedure is the same as for ordinary edge nodes; for details, please refer to [keadm join](https://kubeedge.io/docs/setup/install-with-keadm#setup-edge-side-kubeedge-worker-node). Here, the Docker and containerd container runtimes are used as examples:
```shell
# docker:
keadm join --cgroupdriver=systemd \
    --cloudcore-ipport="THE-EXPOSED-IP":10000 \
    --kubeedge-version=v1.17.0 \
    --token="YOUR TOKEN" \
    --remote-runtime-endpoint=unix:///var/run/cri-dockerd.sock
# containerd:
keadm join --cgroupdriver=cgroupfs \
    --cloudcore-ipport="THE-EXPOSED-IP":10000 \
    --kubeedge-version=v1.17.0 \
    --token="YOUR TOKEN" \
    --remote-runtime-endpoint=unix:///run/containerd/containerd.sock
```
Output:

```shell
...
KubeEdge edgecore is running, For logs visit: journalctl -u edgecore.service -xe
```
You can run the `systemctl status edgecore` command to confirm that EdgeCore is running successfully:

```shell
# systemctl status edgecore
● edgecore.service
     Loaded: loaded (/etc/systemd/system/edgecore.service; enabled; vendor preset: enabled)
     Active: active (running) since Wed 2022-10-26 11:26:59 CST; 6s ago
   Main PID: 2745865 (edgecore)
      Tasks: 13 (limit: 4915)
     CGroup: /system.slice/edgecore.service
             └─2745865 /usr/local/bin/edgecore
```
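You can also confirm from the cloud side that the edge node has joined the cluster and is in the `Ready` state:

```shell
# Run on the cloud side; the newly joined GPU edge node should be listed as Ready.
kubectl get node -o wide
```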
2. Deploy k8s-device-plugin

You can create the k8s-device-plugin DaemonSet from the following YAML file:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      containers:
        - env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
          image: nvcr.io/nvidia/k8s-device-plugin:v0.14.3
          imagePullPolicy: IfNotPresent
          name: nvidia-device-plugin-ctr
          resources: {}
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
            - mountPath: /var/lib/kubelet/device-plugins
              name: device-plugin
      dnsPolicy: ClusterFirst
      priorityClassName: system-node-critical
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
      volumes:
        - hostPath:
            path: /var/lib/kubelet/device-plugins
            type: ""
          name: device-plugin
```
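Save the manifest to a file (for example `nvidia-device-plugin.yaml`, an arbitrary name) and apply it from the cloud side:

```shell
kubectl apply -f nvidia-device-plugin.yaml
```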
Check whether k8s-device-plugin is deployed successfully:

```shell
# After deployment, check whether it is successfully deployed on the edge node
[root@master-01 ~]# kubectl get daemonsets.apps -n kube-system | grep nvidia
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvidia-device-plugin-daemonset   2         2         2       2            2           <none>          292d
[root@master-01 ~]# kubectl get po -n kube-system -owide | grep nvidia
nvidia-device-plugin-daemonset-d5nbc   1/1   Running   0   22m    10.88.0.4   nvidia-edge-node   <none>   <none>
nvidia-device-plugin-daemonset-qbwdd   1/1   Running   0   2d6h   10.88.0.2   nano-1iamih8np     <none>   <none>
```
After successfully deploying k8s-device-plugin, you can use the `kubectl describe node` command to verify whether the node's GPU information is reported correctly.

```shell
# Seeing the [nvidia.com/gpu] key under the Capacity and Allocatable fields indicates that the device-plugin
# is deployed successfully and the GPU information of the node has been reported.
[root@master-01 nvidia-test]# kubectl describe node {YOUR EDGENODE NAME}
Name:               nvidia-edge-node
Roles:              agent,edge
Labels:             beta.kubernetes.io/arch=amd64
...
Capacity:
  cpu:                12
  ephemeral-storage:  143075484Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             40917620Ki
  nvidia.com/gpu:     1
  pods:               110
Allocatable:
  cpu:                12
  ephemeral-storage:  131858365837
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             40815220Ki
  nvidia.com/gpu:     1
  pods:               110
```
If the `nvidia.com/gpu` resource appears in the node information, the edge GPU node has been successfully managed by the KubeEdge cluster, and GPU resources can be requested directly in an application's YAML file. You can deploy the test application below to verify the GPU allocation capability.
### Test GPU resource allocation capability

1. Deploy a GPU application

You can use the sample YAML below to deploy a PyTorch edge application that requests one GPU:
```yaml
kind: Deployment
apiVersion: apps/v1
metadata:
  name: test-gpu
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test-gpu
  template:
    metadata:
      labels:
        app: test-gpu
    spec:
      containers:
        - name: container-1
          image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-devel
          command:
            - tail
            - '-f'
            - /dev/null
          resources:
            limits:
              nvidia.com/gpu: '1'
            requests:
              nvidia.com/gpu: '1'
          imagePullPolicy: IfNotPresent
      nodeName: nvidia-edge-node
```
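To deploy the sample, save it (for example as `test-gpu.yaml`, an arbitrary file name), apply it, and check that the pod lands on the GPU node:

```shell
kubectl apply -f test-gpu.yaml
kubectl get pod -o wide | grep test-gpu
```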
2. Verify that the GPU is successfully mounted

Enter the container created by this application and call `torch.cuda.is_available()` in PyTorch to verify that the GPU is successfully mounted.
```shell
# docker
root@nano-desktop:~# docker ps
CONTAINER ID   IMAGE          COMMAND               CREATED          STATUS          PORTS     NAMES
e7e3804626a5   853b58c1dce6   "tail -f /dev/null"   53 seconds ago   Up 45 seconds             k8s_container-1_test-gpu-arm64-nano-7f8fd7f79f-hzvp5_default_64fb7a90-b0e6-4b46-a34f-8a06b24b9169_0
root@nano-desktop:~# docker exec -it e7e3804626a5 /bin/bash
root@test-gpu-arm64-nano-7f8fd7f79f-hzvp5:/# python3
Python 3.8.10 (default, Nov 14 2022, 12:59:47)
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True

# containerd
root@edgenode:~# crictl ps
CONTAINER       IMAGE           CREATED         STATE     NAME          ATTEMPT   POD ID          POD
de1f1e60abc0a   0dd75116a8ce8   2 minutes ago   Running   container-1   0         6beffb412af3f   test-gpu-6bfbdc9449-jfbrl
root@edgenode:~# crictl exec -it de1f1e60abc0a /bin/bash
root@test-gpu-6bfbdc9449-jfbrl:/workspace# python3
Python 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
```
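Alternatively, the same check can be run from the cloud side without logging in to the edge node, assuming `kubectl exec` to edge pods is enabled in your cluster (this relies on the KubeEdge CloudStream/EdgeStream tunnel being configured):

```shell
# Run the PyTorch CUDA check inside the pod of the test-gpu deployment.
kubectl exec -it deployment/test-gpu -- python3 -c "import torch; print(torch.cuda.is_available())"
```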
