2 changes: 1 addition & 1 deletion README.md
@@ -83,7 +83,7 @@ Now you can install the DRA driver's Helm chart into the Kubernetes cluster:
Submit workload:

```console
-kubectl apply -f ./demo/specs/quickstart/gpu-test2.yaml
+kubectl apply -f ./demo/specs/quickstart/v1/gpu-test2.yaml
```

If you're curious, have a look at [the `ResourceClaimTemplate`](https://github.com/jgehrcke/k8s-dra-driver-gpu/blob/526130fbaa3c8f5b1f6dcfd9ef01c9bdd5c229fe/demo/specs/quickstart/gpu-test2.yaml#L12) definition in this spec, and how the corresponding _single_ `ResourceClaim` is [being referenced](https://github.com/jgehrcke/k8s-dra-driver-gpu/blob/526130fbaa3c8f5b1f6dcfd9ef01c9bdd5c229fe/demo/specs/quickstart/gpu-test2.yaml#L46) by both containers.
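As a quick sanity check (a sketch, assuming a cluster with the driver installed and an allocatable GPU), one can confirm that a single `ResourceClaim` was generated from the template and that both containers were handed the same device:

```shell
# One ResourceClaim should have been generated from the template
kubectl get resourceclaims -n gpu-test2

# Both containers run `nvidia-smi -L`; their logs should report the same GPU UUID
kubectl logs -n gpu-test2 pod -c ctr0
kubectl logs -n gpu-test2 pod -c ctr1
```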
2 changes: 1 addition & 1 deletion demo/clusters/kind/scripts/common.sh
@@ -40,7 +40,7 @@ DRIVER_IMAGE_VERSION=$(from_versions_mk "VERSION")
# From https://github.com/kubernetes/kubernetes/tags
# See also https://hub.docker.com/r/kindest/node/tags
: ${KIND_K8S_REPO:="https://github.com/kubernetes/kubernetes.git"}
-: ${KIND_K8S_TAG:="v1.32.0"}
+: ${KIND_K8S_TAG:="v1.34.0"}

# The name of the kind cluster to create
: ${KIND_CLUSTER_NAME:="${DRIVER_NAME}-cluster"}
3 changes: 3 additions & 0 deletions demo/specs/quickstart/README.md
@@ -1,4 +1,6 @@
#### Apply the half-balanced mig-parted config

To enable MIG mode on your GPUs, you can use [nvidia-mig-parted](https://github.com/NVIDIA/mig-parted).
```console
sudo -E nvidia-mig-parted apply -f mig-parted-config.yaml -c half-balanced
```
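After applying the config, the result can be inspected on the host (a sketch; requires `nvidia-smi` on a MIG-capable node):

```shell
# Confirm MIG mode is now enabled on each GPU
nvidia-smi --query-gpu=index,mig.mode.current --format=csv

# List the MIG devices created by the half-balanced profile
nvidia-smi -L
```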
@@ -15,6 +17,7 @@ kubectl get pod -A

#### Deploy the 4 example apps discussed in the slides
```console
cd v1
kubectl apply --filename=gpu-test{1,2,3,4}.yaml
```
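To watch the example workloads come up (assuming the DRA driver is installed and GPUs are allocatable; each spec creates a namespace following the `gpu-testN` pattern):

```shell
# Check pod status in each example's namespace
for ns in gpu-test1 gpu-test2 gpu-test3 gpu-test4; do
  kubectl get pods -n "$ns"
done
```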

64 changes: 64 additions & 0 deletions demo/specs/quickstart/v1/gpu-test-mps.yaml
@@ -0,0 +1,64 @@
# One pod, 2 containers share GPU using MPS
---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test-mps
---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test-mps
  name: shared-gpu
spec:
  spec:
    devices:
      requests:
      - name: mps-gpu
        exactly:
          deviceClassName: gpu.nvidia.com
      config:
      - requests: ["mps-gpu"]
        opaque:
          driver: gpu.nvidia.com
          parameters:
            apiVersion: resource.nvidia.com/v1beta1
            kind: GpuConfig
            sharing:
              strategy: MPS
              mpsConfig:
                defaultActiveThreadPercentage: 50
                defaultPinnedDeviceMemoryLimit: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test-mps
  name: test-pod
  labels:
    app: pod
spec:
  containers:
  - name: mps-ctr0
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
    command: ["bash", "-c"]
    args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
    resources:
      claims:
      - name: shared-gpu
        request: mps-gpu
  - name: mps-ctr1
    image: nvcr.io/nvidia/k8s/cuda-sample:nbody-cuda11.6.0-ubuntu18.04
    command: ["bash", "-c"]
    args: ["trap 'exit 0' TERM; /tmp/sample --benchmark --numbodies=4226000 & wait"]
    resources:
      claims:
      - name: shared-gpu
        request: mps-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: shared-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
42 changes: 42 additions & 0 deletions demo/specs/quickstart/v1/gpu-test-vfiopci.yaml
@@ -0,0 +1,42 @@
# One pod, one container asking for 1 distinct GPU

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test-vfio

---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test-vfio
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: vfio.gpu.nvidia.com

---
apiVersion: v1
kind: Pod
metadata:
  # Must match the namespace of the Namespace object and claim template above
  # (was gpu-test-vfiopci, which would leave the claim template unresolvable)
  namespace: gpu-test-vfio
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
72 changes: 72 additions & 0 deletions demo/specs/quickstart/v1/gpu-test1.yaml
@@ -0,0 +1,72 @@
# Two pods, one container each
# Each container asking for 1 distinct GPU

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test1

---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test1
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test1
  name: pod2
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    resourceClaimTemplateName: single-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
54 changes: 54 additions & 0 deletions demo/specs/quickstart/v1/gpu-test2.yaml
@@ -0,0 +1,54 @@
# One pod, two containers
# Each asking for shared access to a single GPU

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test2

---
apiVersion: resource.k8s.io/v1
kind: ResourceClaimTemplate
metadata:
  namespace: gpu-test2
  name: single-gpu
spec:
  spec:
    devices:
      requests:
      - name: gpu
        exactly:
          deviceClassName: gpu.nvidia.com

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test2
  name: pod
  labels:
    app: pod
spec:
  containers:
  - name: ctr0
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  - name: ctr1
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimTemplateName: single-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
71 changes: 71 additions & 0 deletions demo/specs/quickstart/v1/gpu-test3.yaml
@@ -0,0 +1,71 @@
# One shared, global claim providing access to a GPU
# Two pods, each asking for access to the shared GPU

---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-test3

---
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  namespace: gpu-test3
  name: single-gpu
spec:
  devices:
    requests:
    - name: gpu
      exactly:
        deviceClassName: gpu.nvidia.com

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test3
  name: pod1
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: single-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

---
apiVersion: v1
kind: Pod
metadata:
  namespace: gpu-test3
  name: pod2
  labels:
    app: pod
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["nvidia-smi -L; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: shared-gpu
  resourceClaims:
  - name: shared-gpu
    resourceClaimName: single-gpu
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"