Question on Host Device Networking Setup #1773

@cholical

Hi everyone, I'm trying to get K8s working with RoCE for a distributed, multi-node PyTorch training job.

I'm struggling to get the NVIDIA Network Operator running. As I understand it, I need to do two things:

  1. Get RDMA verbs passed through to the containers
  2. Get a secondary network working so that my RoCE NICs can talk to each other

My setup:

6 bare-metal nodes, each with 8x H200 GPUs and RoCE interconnects. Each node has 8 RoCE NICs.
I have confirmed that RoCE works using ib_send_bw on the bare-metal nodes.
I have K8s running on the nodes with the GPU Operator installed. The cluster already works as expected for single-node training.

My goal:

I'm trying to get host device network with RDMA working with the instructions from this doc:

https://docs.nvidia.com/networking/display/kubernetes2570/quick-start/host-device-rdma.html

I've applied the YAML configs shown in the document and now have hostdev: 10.

I've created containers on two nodes and I can see the net1 IP in both:

Pod A on node 1

547: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP group default qlen 1000
    link/ether c4:70:bd:be:30:22 brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.60/24 brd 192.168.3.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::c670:bdff:febe:3022/64 scope link
       valid_lft forever preferred_lft forever

Pod B on node 2

543: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9216 qdisc mq state UP group default qlen 1000
    link/ether c4:70:bd:bd:6f:8e brd ff:ff:ff:ff:ff:ff
    inet 192.168.3.210/24 brd 192.168.3.255 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::c670:bdff:febd:6f8e/64 scope link
       valid_lft forever preferred_lft forever

However, when I try to ping Pod B from Pod A, I get the following error:

[root@hostdev-test-5c9cdd96cf-bhg7h /]# ping 192.168.3.210
PING 192.168.3.210 (192.168.3.210) 56(84) bytes of data.
From 192.168.3.60 icmp_seq=1 Destination Host Unreachable
From 192.168.3.60 icmp_seq=2 Destination Host Unreachable
From 192.168.3.60 icmp_seq=3 Destination Host Unreachable
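
One way to read this ping output, sketched in Python with the addresses copied from the `net1` listings above: both pod addresses fall in the same /24, so Pod A treats Pod B as directly reachable on the link, and every error line is reported by Pod A's own address rather than by a router. That pattern usually points at neighbor (ARP) resolution failing on net1, though this is only an interpretation and not something the output proves. The helper name `error_source` is made up for this example.

```python
import ipaddress

# Addresses copied from the net1 output of Pod A and Pod B.
pod_a = ipaddress.ip_interface("192.168.3.60/24")
pod_b = ipaddress.ip_interface("192.168.3.210/24")

# Error lines copied from the ping output.
ping_lines = [
    "From 192.168.3.60 icmp_seq=1 Destination Host Unreachable",
    "From 192.168.3.60 icmp_seq=2 Destination Host Unreachable",
    "From 192.168.3.60 icmp_seq=3 Destination Host Unreachable",
]


def error_source(line):
    """Return the address that reported the error in a ping 'From ...' line."""
    return ipaddress.ip_address(line.split()[1])


# Pod B is inside Pod A's subnet, so no gateway is involved in reaching it.
same_subnet = pod_b.ip in pod_a.network

# Every error comes from Pod A's own address, i.e. it is generated locally.
local_errors = all(error_source(line) == pod_a.ip for line in ping_lines)

print("Pod B on-link from Pod A:", same_subnet)
print("Errors generated locally by Pod A:", local_errors)
```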

I believe it has to do with the netplan networking config on my bare-metal nodes:

root@worker03:~# cat /etc/netplan/50-cloud-init.yaml

network:
  version: 2
  renderer: networkd
  bonds:
    bond0:
      addresses:
      - 10.2.201.28/24
      dhcp4: false
      dhcp6: false
      interfaces:
      - enp86s0f0
      - enp86s0f1
      nameservers:
        addresses:
        - 1.1.1.1
        - 8.8.8.8
      optional: false
      parameters:
        lacp-rate: fast
        mode: 802.3ad
        transmit-hash-policy: layer3+4
      routes:
      - to: default
        via: 10.2.201.254
  ethernets:
    enp86s0f0:
      dhcp4: false
      dhcp6: false
      optional: true
    enp86s0f1:
      dhcp4: false
      dhcp6: false
      optional: true
    rail1:
      ignore-carrier: true
      addresses: [172.16.0.44/31]
      routes:
      - to: 172.16.0.0/15
        via: 172.16.0.45
      mtu: 9216
    rail5:
      ignore-carrier: true
      addresses: [172.24.0.44/31]
      routes:
      - to: 172.24.0.0/15
        via: 172.24.0.45
      mtu: 9216
    rail2:
      ignore-carrier: true
      addresses: [172.18.0.44/31]
      routes:
      - to: 172.18.0.0/15
        via: 172.18.0.45
      mtu: 9216
    rail6:
      ignore-carrier: true
      addresses: [172.26.0.44/31]
      routes:
      - to: 172.26.0.0/15
        via: 172.26.0.45
      mtu: 9216
    rail3:
      ignore-carrier: true
      addresses: [172.20.0.44/31]
      routes:
      - to: 172.20.0.0/15
        via: 172.20.0.45
      mtu: 9216
    rail7:
      ignore-carrier: true
      addresses: [172.28.0.44/31]
      routes:
      - to: 172.28.0.0/15
        via: 172.28.0.45
      mtu: 9216
    rail4:
      ignore-carrier: true
      addresses: [172.22.0.44/31]
      routes:
      - to: 172.22.0.0/15
        via: 172.22.0.45
      mtu: 9216
    rail8:
      ignore-carrier: true
      addresses: [172.30.0.44/31]
      routes:
      - to: 172.30.0.0/15
        via: 172.30.0.45
      mtu: 9216

I'm fairly sure the issue is related to the netplan config and how the IPs are advertised, but I'm not sure how to get it to work with the NVIDIA Network Operator. Any help or pointers would be appreciated!
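
To make the suspected mismatch concrete, here is a small Python sketch comparing the host rail addressing from the netplan above with the subnet the pods received on net1. It assumes the NICs handed to the pods are these same rail interfaces, which the post does not confirm. On the host, each rail is a /31 point-to-point link whose next hop is the other address of that /31, while the pods sit in a flat 192.168.3.0/24 that none of the rail routes cover.

```python
import ipaddress

# Subnet of the net1 addresses seen inside the pods.
pod_subnet = ipaddress.ip_network("192.168.3.0/24")

# (address, routed prefix, next hop) per rail, copied from the netplan on worker03.
rails = {
    "rail1": ("172.16.0.44/31", "172.16.0.0/15", "172.16.0.45"),
    "rail2": ("172.18.0.44/31", "172.18.0.0/15", "172.18.0.45"),
    "rail3": ("172.20.0.44/31", "172.20.0.0/15", "172.20.0.45"),
    "rail4": ("172.22.0.44/31", "172.22.0.0/15", "172.22.0.45"),
    "rail5": ("172.24.0.44/31", "172.24.0.0/15", "172.24.0.45"),
    "rail6": ("172.26.0.44/31", "172.26.0.0/15", "172.26.0.45"),
    "rail7": ("172.28.0.44/31", "172.28.0.0/15", "172.28.0.45"),
    "rail8": ("172.30.0.44/31", "172.30.0.0/15", "172.30.0.45"),
}

gateway_on_link = {}
covers_pod_subnet = {}
for name, (addr, route, gateway) in rails.items():
    iface = ipaddress.ip_interface(addr)
    # A /31 holds exactly two addresses: this host and its next hop.
    gateway_on_link[name] = ipaddress.ip_address(gateway) in iface.network
    # Does the prefix routed over this rail include the pod subnet?
    covers_pod_subnet[name] = ipaddress.ip_network(route).overlaps(pod_subnet)

for name in sorted(rails):
    print(name, "next hop on /31 link:", gateway_on_link[name],
          "| routes pod subnet:", covers_pod_subnet[name])
```

If that assumption holds, it would be consistent with a routed fabric in which a pod address outside the rail prefixes has no layer-2 neighbor to answer for it.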
