Client network failure after upgrade to 8.2 (from 8.1) #86
We recently upgraded BeeGFS from 8.1.0 to 8.2.1 and observed unstable client connections, making BeeGFS practically unusable. A downgrade of the clients to 8.1 resolved the issue. We have not downgraded the service daemons yet, but are considering it due to continued (minor?) health issues being reported; see below.

Q1: Is it safe to downgrade the mgmt, meta and storage services from 8.2.1 to 8.1.0?

Additional information

The most striking observation is that the client comes up fine after the upgrade, but quickly encounters connection issues to the service daemons and hangs a lot. The command:

On a v8.1 client node where we didn't update the client, the connection is stable, although it reports degraded after a service restart:

On a client node where we did an upgrade-downgrade cycle 8.1 -> 8.2 -> 8.1 it looks like this:

Output of command:

The IB driver version is DOCA 2.9 LTS (https://linux.mellanox.com/public/repo/doca/latest-2.9-LTS/rhel9.6/x86_64/). Modinfo reports:

We have syslog information. The most notable messages look like this:

I can't see anything else specifically providing further clues. These messages start pretty much right after the initial connection is established, and then the connections rotate around different interfaces.

Please let me know what I should look for. We still have a mix of client versions installed and can pull out a bit more info.

Best regards,
Replies: 8 comments 6 replies
On the server side, the health check is fine everywhere, with the exception that client connections are reported as degraded due to fallback connections being used.
Hi @frsc-ku ,
As long as you haven't enabled any new 8.2-specific functionality (for example, switched to the new …), it should be safe to downgrade.

If the system is currently stable with the services on 8.2.1 and the clients on 8.1.0, I recommend keeping that configuration while we investigate further. BeeGFS 8.1 clients are fully compatible with 8.2 servers.

If you still have a test client running 8.2.1, please try the following:

(1) Set …

(2) If the issue persists, set both …

(3) If instability continues, set …

Lastly, do the 8.2.1 mgmtd/meta/storage servers log anything notable, or do they appear to communicate normally with each other (no fallbacks or instability)?

~Joe
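For readers following along, interface pinning on a BeeGFS client is normally configured through interface list files referenced from `beegfs-client.conf`. A minimal sketch, assuming a single IB device; the file path and the interface name `ib0` are placeholders for illustration, not values from this thread:

```
# /etc/beegfs/conn-interfaces (placeholder path): one interface per line,
# highest priority first.
ib0
```

```
# In /etc/beegfs/beegfs-client.conf, point the client at that list and,
# if needed, restrict RDMA connections to the same interfaces:
connInterfacesFile     = /etc/beegfs/conn-interfaces
connRDMAInterfacesFile = /etc/beegfs/conn-interfaces
```

A client service restart is needed for changes to take effect.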
Dear Joe,

(1) Disabling IPv6 has no effect.

We would like to avoid pinning the client to a single interface, because we have 400G IB and 2x100G ETH on these hosts and would like to keep the 2x100G ETH connection as a fallback for the clients. For now we can live with that, but it would be great if this could be fixed.

Here is a session with collected output on a node with a v8.2.1 client.

Right after log-in:

Disabling IPv6 with …:

Setting only connInterfacesFile has no or minimal effect (I think only the mgmt connection becomes stable):

Setting both connInterfacesFile and connRDMAInterfacesFile stabilises the client connections, while the health error persists (the mgmt connection is erroneously reported as fallback):

This connectivity remains stable. Enabling IPv6 again does not change that. We are now running with these settings:

The health error, however, remains reported:

I also got the client that is co-located with the storage services up this way. For some reason I can't bind it to …

The only sign of change on the server side is that mgmtd does not report …
Hi @frsc-ku,

Thanks for the update. I agree this is not a permanent solution; I only intended it as a temporary measure to stabilize the system and help narrow down the issue. As you observed, based on …

(1) It looks like the management may not have the …

Then, if possible, on one client could you:

(2) Revert the …

(3) Reset back to …

(4) It looks like your metadata and storage servers have …

~Joe
Dear Joe,

(1) No, it's not set. According to the default selection policy described in https://doc.beegfs.io/latest/advanced_topics/network_tuning.html#network-interfaces, it is not necessary. Well, at least it wasn't until 8.2. I was considering blacklisting VLANs that should not be used, but other than that the default will do. I would prefer that the old behaviour be restored; otherwise, the documentation should be updated to reflect that an explicit priority must be defined to avoid random selection.

(2) Great that I can skip that one; it saved some time.

(3) I set …

Unfortunately, adding a second interface to the client's interfaces file makes the client connection unstable again (right after a service restart):

The connections now rotate around the 2 allowed interfaces. Keeping the mgmt interface config and removing the 2x100G ETH interface from the client config again (so that only the one IB device is allowed), the connection is stable again and the health error also disappears:

(4) No, no specific order is configured. They are using the default policy referred to in (1) and seem to adhere to it. It's only the mgmtd that has an opinion of its own now.
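Once interface selection works as intended again, an ordered interfaces file should be enough to express the IB-first preference with Ethernet fallback. A sketch, assuming the file lists one interface per line in priority order; the device names below are placeholders for the 400G IB and 2x100G Ethernet interfaces described above:

```
ib0
eth100g0
eth100g1
```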
Hi @frsc-ku,

Thanks for all the help testing. There are two issues at play here:

(1) When connRDMAInterfaces was not set, the client still tried to apply its own filtering logic and populated specific outbound RDMA interfaces instead of simply allowing "any", so that kernel routing would pick the right interface based on the remote server address for a given connection. Normally connRDMAInterfaces should only be set when you have multiple interfaces in the same IPoIB subnet and you want the client to load-balance connections across those interfaces (e.g., multi-rail support). This is why you observed the connections rotating around the two allowed interfaces. For your setup with multiple RDMA-capable interfaces in different subnets you normally don't want to use …

Thanks for trying this out anyway; it helps confirm we found the right issue.

(2) In 8.2, as part of introducing IPv6 support, new IP filtering logic was added that changed the default interface priority/sorting on the management node. In 8.1, interfaces were prioritized in the order they were returned by the kernel. In 8.2, in an attempt to make the sorting more deterministic, the default behavior changed to sort by IP address (which explains why …).

Our goal was for 8.2 to behave exactly the same as 8.1 and require no configuration changes (i.e., prioritizing the same interfaces as before the upgrade) while now including IPv6 addresses and expanding the …

We are currently going through final testing and then intend to publish an 8.2.2 release to fix these issues ASAP.

Thanks,
~Joe
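To make the sorting change concrete: ordering interface addresses numerically by IP, instead of keeping the kernel's discovery order, can promote a low-numbered address (such as a 169.254.x.x link-local) ahead of the intended interface. A toy illustration with made-up addresses, not BeeGFS code:

```shell
# Hypothetical kernel/discovery order (8.1 used this order as the priority):
printf '%s\n' 192.168.100.1 172.16.0.1 169.254.3.1

# 8.2-style deterministic sort by IP address: 169.254.3.1 now sorts first,
# so a link-local interface can become the top-priority choice.
printf '%s\n' 192.168.100.1 172.16.0.1 169.254.3.1 \
  | sort -t. -k1,1n -k2,2n -k3,3n -k4,4n
```

Which address wins under such a sort depends only on the numeric values, not on which interface was intended as primary, hence the interim workaround of pinning priorities explicitly.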
Dear Joe,

Thanks a lot for the update. It would be great if on our end we could revert the explicit configs back to the defaults (and I guess any user upgrading to 8.2 with default settings will be happy too).

Since you are considering a bug-fix release, could I ask you to try to include a solution for #81 as well? I would be interested in a config option that makes clients always wait for BeeGFS services to come up instead of erring out (IO hangs in D-state until a down service comes up). The current behaviour makes it impossible to perform maintenance without crashing jobs. If there is a not-too-complicated solution, for example, if …

Best regards,
I'm on v8.1.0 (some nodes on 8.0.1) and have not upgraded to 8.2, but I'm also facing this network connection issue:

sudo beegfs health net
Error: error forcing establishment of BeeGFS client/server connections: signal: killed
sudo beegfs health net
Error: error forcing establishment of BeeGFS client/server connections: exit status 1
sudo beegfs health net
Error: error forcing establishment of BeeGFS client/server connections: exit status 1

Restarting the client service did not help either:

sudo systemctl restart beegfs-client.service
sudo systemctl status beegfs-client
● beegfs-client.service - Start BeeGFS Client
Loaded: loaded (/lib/systemd/system/beegfs-client.service; enabled; vendor preset: enab>
Drop-In: /etc/systemd/system/beegfs-client.service.d
└─override.conf
Active: active (exited) since Thu 2025-11-20 14:34:20 IST; 5s ago
Process: 1761929 ExecStart=/usr/bin/numactl --cpunodebind=1 --membind=1 /etc/init.d/beeg>
Main PID: 1761929 (code=exited, status=0/SUCCESS)
CPU: 36ms
Nov 20 14:34:20 node3 systemd[1]: Starting Start BeeGFS Client...
Nov 20 14:34:20 node3 numactl[1761929]: Starting BeeGFS Client:
Nov 20 14:34:20 node3 numactl[1761929]: - Loading BeeGFS modules
Nov 20 14:34:20 node3 numactl[1761929]: - Mounting directories from /etc/beegfs/beegfs-mount>
Nov 20 14:34:20 node3 systemd[1]: Finished Start BeeGFS Client.
sudo beegfs health net
Error: error forcing establishment of BeeGFS client/server connections: signal: killed
sudo beegfs health net
Error: error forcing establishment of BeeGFS client/server connections: exit status 1
sudo beegfs health net
Error: error forcing establishment of BeeGFS client/server connections: exit status 1

However, the beegfs mount is accessible during this time:

ls /media/beegfs
testfile testfile1 testfile2 test.txt

But the beegfs mount is not visible in df:

df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 6.3G 4.1M 6.3G 1% /run
/dev/nvme0n1p2 94G 76G 13G 86% /
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
/dev/nvme0n1p1 975M 6.1M 968M 1% /boot/efi
192.168.1.91:/ 916G 573G 297G 66% /mnt
tmpfs 6.3G 92K 6.3G 1% /run/user/121
tmpfs 6.3G 72K 6.3G 1% /run/user/10034

However, when I retried:

df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 6.3G 4.1M 6.3G 1% /run
/dev/nvme0n1p2 94G 76G 13G 86% /
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
/dev/nvme0n1p1 975M 6.1M 968M 1% /boot/efi
tmpfs 6.3G 92K 6.3G 1% /run/user/121
beegfs_nodev 60T 36T 24T 61% /media/beegfs
tmpfs 6.3G 72K 6.3G 1% /run/user/10034

After the above:

sudo beegfs health net
========================================================================
Client ID: c8572-691B2290-node8 (beegfs://192.168.1.12 -> /media/beegfs)
========================================================================
---------------
Management Node
---------------
management [ID: 1]
Connections: ethernet: 1 (169.254.3.1:8008);
--------------
Metadata Nodes
--------------
node_meta_11 [ID: 11]
Connections: rdma: 1 (192.168.100.11:8005);
-------------
Storage Nodes
-------------
node_storage_3 [ID: 3]
Connections: rdma: 2 (192.168.100.3:8003);
node_storage_4 [ID: 4]
Connections: rdma: 2 (192.168.100.4:8003);
node_storage_5 [ID: 5]
Connections: rdma: 3 (192.168.100.5:8003);
node_storage_6 [ID: 6]
Connections: rdma: 3 (192.168.100.6:8003);
node_storage_7 [ID: 7]
Connections: ethernet: 1 (192.168.100.7:8003 [fallback route]);rdma: 3 (192.168.100.7:8003);
node_storage_8 [ID: 8]
Connections: ethernet: 1 (192.168.100.8:8003 [fallback route]);rdma: 3 (192.168.100.8:8003);
node_storage_9 [ID: 9]
Connections: ethernet: 1 (192.168.100.9:8003 [fallback route]);rdma: 4 (192.168.100.9:8003);
node_storage_10 [ID: 10]
Connections: ethernet: 3 (192.168.100.10:8003 [fallback route]);
node_storage_11 [ID: 11]
Connections: ethernet: 1 (192.168.100.11:8003 [fallback route]);rdma: 2 (192.168.100.11:8003);

Another issue is that sometimes some storage node connections fall back to Ethernet, and RDMA is not even listed; see node_storage_3, for example, below:

sudo beegfs health net
=========================================================================
Client ID: c10A3F-691AF53D-node9 (beegfs://192.168.1.12 -> /media/beegfs)
=========================================================================
---------------
Management Node
---------------
management [ID: 1]
Connections: ethernet: 1 (169.254.3.1:8008);
--------------
Metadata Nodes
--------------
node_meta_11 [ID: 11]
Connections: rdma: 1 (192.168.101.11:8005);
-------------
Storage Nodes
-------------
node_storage_3 [ID: 3]
Connections: ethernet: 2 (192.168.100.3:8003 [fallback route]);
node_storage_4 [ID: 4]
Connections: rdma: 2 (192.168.100.4:8003 [fallback route]);
node_storage_5 [ID: 5]
Connections: ethernet: 1 (192.168.101.5:8003 [fallback route]);rdma: 3 (192.168.101.5:8003);
node_storage_6 [ID: 6]
Connections: rdma: 3 (192.168.100.6:8003 [fallback route]);
node_storage_7 [ID: 7]
Connections: rdma: 4 (192.168.100.7:8003 [fallback route]);
node_storage_8 [ID: 8]
Connections: rdma: 3 (192.168.100.8:8003 [fallback route]);
node_storage_9 [ID: 9]
Connections: rdma: 3 (192.168.100.9:8003 [fallback route]);
node_storage_10 [ID: 10]
Connections: rdma: 3 (192.168.101.10:8003);
node_storage_11 [ID: 11]
Connections: rdma: 2 (192.168.101.11:8003);

Other notes: