Client network failure after upgrade to 8.2 (from 8.1) #86
We recently upgraded BeeGFS from 8.1.0 to 8.2.1 and observed unstable client connections, making BeeGFS practically unusable. A downgrade of the clients to 8.1 resolved the issue. We have not downgraded the service daemons yet, but are considering it due to continued (minor?) health issues being reported; see below.

Q1: Is it safe to downgrade the mgmt, meta and storage services from 8.2.1 to 8.1.0?

Additional information

The most striking observation is that the client comes up fine after the upgrade, but quickly encounters connection issues to the service daemons and hangs a lot. The command:

On a v8.1 client node where we didn't update the client, the connection is stable, although it reports degraded after a service restart:

On a client node where we did an upgrade-downgrade cycle 8.1 -> 8.2 -> 8.1 it looks like this:

Output of command:

The IB driver version is DOCA 2.9 LTS (https://linux.mellanox.com/public/repo/doca/latest-2.9-LTS/rhel9.6/x86_64/). Modinfo reports:

We have syslog information. The most notable messages look like this:

I can't see anything else specifically providing further clues. These messages start pretty much right after the initial connection is established, and then the connections rotate around different interfaces.

Please let me know what I should look for. We still have a mix of client versions installed and can pull out a bit more info.

Best regards,
Replies: 8 comments 6 replies
On the server side, the health check is fine everywhere, with the exception that client connections are reported as degraded due to fallback connections being used.
Hi @frsc-ku ,
As long as you haven't enabled any new 8.2-specific functionality (for example, switched to the new …), it should be safe to downgrade.

If the system is currently stable with the services on 8.2.1 and the clients on 8.1.0, I recommend keeping that configuration while we investigate further. BeeGFS 8.1 clients are fully compatible with 8.2 servers.

If you still have a test client running 8.2.1, please try the following:

(1) Set …

(2) If the issue persists, set both …

(3) If instability continues, set …

Lastly, do the 8.2.1 mgmtd/meta/storage servers log anything notable, or do they appear to communicate normally with each other (no fallbacks or instability)?

~Joe
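For readers following along, interface pinning on a BeeGFS client is normally configured through interface list files referenced from `beegfs-client.conf`. A minimal sketch, assuming a single IB device; the file path and the interface name `ib0` are placeholders for illustration, not values from this thread:

```
# /etc/beegfs/conn-interfaces (placeholder path): one interface per line,
# highest priority first.
ib0
```

```
# In /etc/beegfs/beegfs-client.conf, point the client at that list and,
# if needed, restrict RDMA connections to the same interfaces:
connInterfacesFile     = /etc/beegfs/conn-interfaces
connRDMAInterfacesFile = /etc/beegfs/conn-interfaces
```

A client service restart is needed for changes to take effect.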
Dear Joe,

(1) Disabling IPv6 has no effect.

We would like to avoid pinning the client to a single interface, because we have 400G IB and 2x100G ETH on these hosts and would like to keep the 2x100G ETH connection as a fallback for the clients. For now we can live with that, but it would be great if this could be fixed.

Here is a session with collected output on a node with a v8.2.1 client.

Right after log-in:

Disabling IPv6 with …:

Setting only connInterfacesFile has no or minimal effect (I think only the mgmt connection becomes stable):

Setting both connInterfacesFile and connRDMAInterfacesFile stabilises the client connections, while the health error persists (the mgmt connection is erroneously reported as fallback):

This connectivity remains stable. Enabling IPv6 again does not change that. We are now running with these settings:

The health error, however, remains reported:

I also got the client that is co-located with the storage services up this way. For some reason I can't bind it to …

The only sign of change on the server side is that mgmtd does not report …
Hi @frsc-ku,

Thanks for the update. I agree this is not a permanent solution; I only intended it as a temporary measure to stabilize the system and help narrow down the issue. As you observed, based on …

(1) It looks like the management may not have the …

Then, if possible, on one client could you:

(2) Revert the …

(3) Reset back to …

(4) It looks like your metadata and storage servers have …

~Joe
Dear Joe,

(1) No, it's not set. According to the default selection policy described in https://doc.beegfs.io/latest/advanced_topics/network_tuning.html#network-interfaces, it is not necessary. Well, at least it wasn't until 8.2. I was considering blacklisting VLANs that should not be used, but other than that the default will do. I would prefer that the old behaviour be restored; otherwise, the documentation should be updated to reflect that an explicit priority must be defined to avoid random selection.

(2) Great that I can skip that one; it saved some time.

(3) I set …

Unfortunately, adding a second interface to the client's interfaces file makes the client connection unstable again (right after a service restart):

The connections now rotate around the 2 allowed interfaces. Keeping the mgmt interface config and removing the 2x100G ETH interface from the client config again (so that only the one IB device is allowed), the connection is stable again and the health error also disappears:

(4) No, no specific order is configured. They are using the default policy referred to in (1) and seem to adhere to it. It's only the mgmtd that has an opinion of its own now.
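Once interface selection works as intended again, an ordered interfaces file should be enough to express the IB-first preference with Ethernet fallback. A sketch, assuming the file lists one interface per line in priority order; the device names below are placeholders for the 400G IB and 2x100G Ethernet interfaces described above:

```
ib0
eth100g0
eth100g1
```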
Hi @frsc-ku,

Thanks for all the help testing. There are two issues at play here:

(1) When connRDMAInterfaces was not set, the client still tried to apply its own filtering logic and populated specific outbound RDMA interfaces instead of simply allowing "any", so that kernel routing would pick the right interface based on the remote server address for a given connection. Normally connRDMAInterfaces should only be set when you have multiple interfaces in the same IPoIB subnet and you want the client to load-balance connections across those interfaces (e.g., multi-rail support). This is why you observed the connections rotating around the two allowed interfaces. For your setup with multiple RDMA-capable interfaces in different subnets you normally don't want to use …

Thanks for trying this out anyway; it helps confirm we found the right issue.

(2) In 8.2, as part of introducing IPv6 support, new IP filtering logic was added that changed the default interface priority/sorting on the management node. In 8.1, interfaces were prioritized in the order they were returned by the kernel. In 8.2, in an attempt to make the sorting more deterministic, the default behavior changed to sort by IP address (which explains why …).

Our goal was for 8.2 to behave exactly the same as 8.1 and require no configuration changes (i.e., prioritizing the same interfaces as before the upgrade) while now including IPv6 addresses and expanding the …

We are currently going through final testing and then intend to publish an 8.2.2 release to fix these issues ASAP.

Thanks,
~Joe
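To make the sorting change concrete: ordering interface addresses numerically by IP, instead of keeping the kernel's discovery order, can promote a low-numbered address (such as a 169.254.x.x link-local) ahead of the intended interface. A toy illustration with made-up addresses, not BeeGFS code:

```shell
# Hypothetical kernel/discovery order (8.1 used this order as the priority):
printf '%s\n' 192.168.100.1 172.16.0.1 169.254.3.1

# 8.2-style deterministic sort by IP address: 169.254.3.1 now sorts first,
# so a link-local interface can become the top-priority choice.
printf '%s\n' 192.168.100.1 172.16.0.1 169.254.3.1 \
  | sort -t. -k1,1n -k2,2n -k3,3n -k4,4n
```

Which address wins under such a sort depends only on the numeric values, not on which interface was intended as primary, hence the interim workaround of pinning priorities explicitly.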
Dear Joe,

Thanks a lot for the update. It would be great if on our end we could revert the explicit configs back to the defaults (and I guess any user upgrading to 8.2 with default settings will be happy too).

Since you are considering a bug-fix release, could I ask you to try to include a solution for #81 as well? I would be interested in a config option that makes clients always wait for BeeGFS services to come up instead of erring out (IO hangs in D-state until a down service comes up). The current behaviour makes it impossible to perform maintenance without crashing jobs. If there is a not-too-complicated solution, for example, if …

Best regards,
I'm on v8.1.0 (some nodes on 8.0.1) and have not upgraded to 8.2, but I'm also facing this network connection issue:

sudo beegfs health net
Error: error forcing establishment of BeeGFS client/server connections: signal: killed
sudo beegfs health net
Error: error forcing establishment of BeeGFS client/server connections: exit status 1
sudo beegfs health net
Error: error forcing establishment of BeeGFS client/server connections: exit status 1

Restarting the client service did not help either:

sudo systemctl restart beegfs-client.service
sudo systemctl status beegfs-client
● beegfs-client.service - Start BeeGFS Client
Loaded: loaded (/lib/systemd/system/beegfs-client.service; enabled; vendor preset: enab>
Drop-In: /etc/systemd/system/beegfs-client.service.d
└─override.conf
Active: active (exited) since Thu 2025-11-20 14:34:20 IST; 5s ago
Process: 1761929 ExecStart=/usr/bin/numactl --cpunodebind=1 --membind=1 /etc/init.d/beeg>
Main PID: 1761929 (code=exited, status=0/SUCCESS)
CPU: 36ms
Nov 20 14:34:20 node3 systemd[1]: Starting Start BeeGFS Client...
Nov 20 14:34:20 node3 numactl[1761929]: Starting BeeGFS Client:
Nov 20 14:34:20 node3 numactl[1761929]: - Loading BeeGFS modules
Nov 20 14:34:20 node3 numactl[1761929]: - Mounting directories from /etc/beegfs/beegfs-mount>
Nov 20 14:34:20 node3 systemd[1]: Finished Start BeeGFS Client.
sudo beegfs health net
Error: error forcing establishment of BeeGFS client/server connections: signal: killed
sudo beegfs health net
Error: error forcing establishment of BeeGFS client/server connections: exit status 1
sudo beegfs health net
Error: error forcing establishment of BeeGFS client/server connections: exit status 1

However, the beegfs mount is accessible during this time:

ls /media/beegfs
testfile testfile1 testfile2 test.txt

But the beegfs mount is not visible in df:

df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 6.3G 4.1M 6.3G 1% /run
/dev/nvme0n1p2 94G 76G 13G 86% /
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
/dev/nvme0n1p1 975M 6.1M 968M 1% /boot/efi
192.168.1.91:/ 916G 573G 297G 66% /mnt
tmpfs 6.3G 92K 6.3G 1% /run/user/121
tmpfs 6.3G 72K 6.3G 1% /run/user/10034

However, when I retried:

df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 6.3G 4.1M 6.3G 1% /run
/dev/nvme0n1p2 94G 76G 13G 86% /
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
/dev/nvme0n1p1 975M 6.1M 968M 1% /boot/efi
tmpfs 6.3G 92K 6.3G 1% /run/user/121
beegfs_nodev 60T 36T 24T 61% /media/beegfs
tmpfs 6.3G 72K 6.3G 1% /run/user/10034

After the above:

sudo beegfs health net
========================================================================
Client ID: c8572-691B2290-node8 (beegfs://192.168.1.12 -> /media/beegfs)
========================================================================
---------------
Management Node
---------------
management [ID: 1]
Connections: ethernet: 1 (169.254.3.1:8008);
--------------
Metadata Nodes
--------------
node_meta_11 [ID: 11]
Connections: rdma: 1 (192.168.100.11:8005);
-------------
Storage Nodes
-------------
node_storage_3 [ID: 3]
Connections: rdma: 2 (192.168.100.3:8003);
node_storage_4 [ID: 4]
Connections: rdma: 2 (192.168.100.4:8003);
node_storage_5 [ID: 5]
Connections: rdma: 3 (192.168.100.5:8003);
node_storage_6 [ID: 6]
Connections: rdma: 3 (192.168.100.6:8003);
node_storage_7 [ID: 7]
Connections: ethernet: 1 (192.168.100.7:8003 [fallback route]);rdma: 3 (192.168.100.7:8003);
node_storage_8 [ID: 8]
Connections: ethernet: 1 (192.168.100.8:8003 [fallback route]);rdma: 3 (192.168.100.8:8003);
node_storage_9 [ID: 9]
Connections: ethernet: 1 (192.168.100.9:8003 [fallback route]);rdma: 4 (192.168.100.9:8003);
node_storage_10 [ID: 10]
Connections: ethernet: 3 (192.168.100.10:8003 [fallback route]);
node_storage_11 [ID: 11]
Connections: ethernet: 1 (192.168.100.11:8003 [fallback route]);rdma: 2 (192.168.100.11:8003);

Another issue is that sometimes some storage node connections fall back to Ethernet, and RDMA is not even listed; see node_storage_3, for example, below:

sudo beegfs health net
=========================================================================
Client ID: c10A3F-691AF53D-node9 (beegfs://192.168.1.12 -> /media/beegfs)
=========================================================================
---------------
Management Node
---------------
management [ID: 1]
Connections: ethernet: 1 (169.254.3.1:8008);
--------------
Metadata Nodes
--------------
node_meta_11 [ID: 11]
Connections: rdma: 1 (192.168.101.11:8005);
-------------
Storage Nodes
-------------
node_storage_3 [ID: 3]
Connections: ethernet: 2 (192.168.100.3:8003 [fallback route]);
node_storage_4 [ID: 4]
Connections: rdma: 2 (192.168.100.4:8003 [fallback route]);
node_storage_5 [ID: 5]
Connections: ethernet: 1 (192.168.101.5:8003 [fallback route]);rdma: 3 (192.168.101.5:8003);
node_storage_6 [ID: 6]
Connections: rdma: 3 (192.168.100.6:8003 [fallback route]);
node_storage_7 [ID: 7]
Connections: rdma: 4 (192.168.100.7:8003 [fallback route]);
node_storage_8 [ID: 8]
Connections: rdma: 3 (192.168.100.8:8003 [fallback route]);
node_storage_9 [ID: 9]
Connections: rdma: 3 (192.168.100.9:8003 [fallback route]);
node_storage_10 [ID: 10]
Connections: rdma: 3 (192.168.101.10:8003);
node_storage_11 [ID: 11]
Connections: rdma: 2 (192.168.101.11:8003);

Other notes: