Hi everyone, I'm trying to get K8s working with RoCE for a distributed, multi-node PyTorch training job.
I'm struggling to get the NVIDIA Network Operator running. From my understanding, I need to do two things:
- Get RDMA verbs passed through to the containers
- Get a secondary network working so that my RoCE NICs can talk to each other
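For reference, the sanity checks I'm relying on from inside a test pod for these two pieces are roughly the following (ibv_devinfo only works if rdma-core is installed in the image):

ls /dev/infiniband/        # uverbs / rdma_cm device nodes present -> verbs are reaching the container
ibv_devinfo                # HCA visible to user-space verbs (requires rdma-core in the image)
ip addr show net1          # the Multus-attached secondary interface and its RoCE-side IP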
My setup:
6 bare-metal nodes of 8x H200s with RoCE interconnects; each node has 8 RoCE NICs.
I have confirmed that RoCE works by running ib_send_bw on the bare-metal nodes.
I have K8s running on the nodes with the GPU Operator installed, and the cluster already works as expected for single-node training.
My goal:
I'm trying to get host device network with RDMA working with the instructions from this doc:
https://docs.nvidia.com/networking/display/kubernetes2570/quick-start/host-device-rdma.html
I've installed the YAML configs shown in the document, and the nodes now report hostdev: 10.
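As I understand it, the configs on that page boil down to a NicClusterPolicy (device plugin plus secondary-network components) and a HostDeviceNetwork. The HostDeviceNetwork I applied is essentially the doc's example, roughly like this (reproduced from memory, so treat the exact field values and IPAM settings as approximate; the range matches the 192.168.3.x addresses below):

apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: nvidia.com/hostdev   # the resource exposed by the operator's device plugin
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.0/24"
    }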
I've created containers on two nodes, and I can see the net1 IP on both:
Pod A on node 1
547: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP group default qlen 1000
    link/ether c4:70:bd:be:30:22 brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.60/24 brd 192.168.3.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::c670:bdff:febe:3022/64 scope link
       valid_lft forever preferred_lft forever
Pod B on node 2
543: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP group default qlen 1000
    link/ether c4:70:bd:bd:6f:8e brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.210/24 brd 192.168.3.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::c670:bdff:febd:6f8e/64 scope link
       valid_lft forever preferred_lft forever
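Both pods come from a small test deployment along the lines of the quick-start; the parts that matter are the network annotation and the hostdev resource request (this is a paraphrase of my manifest rather than a verbatim copy):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hostdev-test
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hostdev-test
  template:
    metadata:
      labels:
        app: hostdev-test
      annotations:
        k8s.v1.cni.cncf.io/networks: hostdev-net    # attach the secondary (RoCE) network via Multus
    spec:
      containers:
      - name: test
        image: <rdma-capable test image>
        command: ["sleep", "infinity"]
        securityContext:
          capabilities:
            add: ["IPC_LOCK"]         # allows pinning memory for RDMA
        resources:
          limits:
            nvidia.com/hostdev: 1     # one RoCE NIC handed to the pod by the device plugin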
However, if I try to ping Pod B from Pod A, the ping fails with Destination Host Unreachable:
[root@hostdev-test-5c9cdd96cf-bhg7h /]# ping 192.168.3.210
PING 192.168.3.210 (192.168.3.210) 56(84) bytes of data.
From 192.168.3.60 icmp_seq=1 Destination Host Unreachable
From 192.168.3.60 icmp_seq=2 Destination Host Unreachable
From 192.168.3.60 icmp_seq=3 Destination Host Unreachable
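The fact that the replies come back as Destination Host Unreachable from Pod A's own address suggests ARP for the peer never resolves on net1. From inside Pod A I can run and share the output of checks like:

ip route get 192.168.3.210    # confirm traffic to Pod B is sent out of net1
ip neigh show dev net1        # see whether the ARP entry for 192.168.3.210 resolves or stays FAILED/INCOMPLETE
ping -I net1 192.168.3.210    # force the ping out of the secondary interface explicitly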
I believe it has to do with my bare-metal nodes' netplan configuration:
root@worker03:~# cat /etc/netplan/50-cloud-init.yaml
network:
  version: 2
  renderer: networkd
  bonds:
    bond0:
      addresses:
        - 10.2.201.28/24
      dhcp4: false
      dhcp6: false
      interfaces:
        - enp86s0f0
        - enp86s0f1
      nameservers:
        addresses:
          - 1.1.1.1
          - 8.8.8.8
      optional: false
      parameters:
        lacp-rate: fast
        mode: 802.3ad
        transmit-hash-policy: layer3+4
      routes:
        - to: default
          via: 10.2.201.254
  ethernets:
    enp86s0f0:
      dhcp4: false
      dhcp6: false
      optional: true
    enp86s0f1:
      dhcp4: false
      dhcp6: false
      optional: true
    rail1:
      ignore-carrier: true
      addresses: [172.16.0.44/31]
      routes:
        - to: 172.16.0.0/15
          via: 172.16.0.45
      mtu: 9216
    rail5:
      ignore-carrier: true
      addresses: [172.24.0.44/31]
      routes:
        - to: 172.24.0.0/15
          via: 172.24.0.45
      mtu: 9216
    rail2:
      ignore-carrier: true
      addresses: [172.18.0.44/31]
      routes:
        - to: 172.18.0.0/15
          via: 172.18.0.45
      mtu: 9216
    rail6:
      ignore-carrier: true
      addresses: [172.26.0.44/31]
      routes:
        - to: 172.26.0.0/15
          via: 172.26.0.45
      mtu: 9216
    rail3:
      ignore-carrier: true
      addresses: [172.20.0.44/31]
      routes:
        - to: 172.20.0.0/15
          via: 172.20.0.45
      mtu: 9216
    rail7:
      ignore-carrier: true
      addresses: [172.28.0.44/31]
      routes:
        - to: 172.28.0.0/15
          via: 172.28.0.45
      mtu: 9216
    rail4:
      ignore-carrier: true
      addresses: [172.22.0.44/31]
      routes:
        - to: 172.22.0.0/15
          via: 172.22.0.45
      mtu: 9216
    rail8:
      ignore-carrier: true
      addresses: [172.30.0.44/31]
      routes:
        - to: 172.30.0.0/15
          via: 172.30.0.45
      mtu: 9216
I'm pretty sure the issue is due to the netplan configuration and how the IPs are advertised, but I'm not sure how to get it to work with the NVIDIA Network Operator. Any help or pointers would be appreciated!
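To help narrow it down, these are the host-side checks I can run and report back on:

ip -br addr show | grep rail        # which rail interfaces still hold their /31 addresses (the hostdev plugin may have moved some into pods)
ip route show | grep 192.168.3      # whether the hosts have any route for the pods' 192.168.3.0/24 subnet over the rails
ping -I rail1 172.16.0.45           # reachability of rail1's point-to-point peer from the host
ibdev2netdev                        # (MLNX_OFED utility) mapping of RDMA devices to the rail interfaces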