Description
Hi! I'm running a fresh cluster with 1 control node & 3 worker nodes. I'm seeing weird issues where my nodes just cannot communicate with each other. I've been banging my head against a wall for about 2 weeks now with barely any progress, so I'd really appreciate some help! My cluster is made up of a bunch of VPSs, mostly without the usual cloud infra. The control node has IPv4 & IPv6; the worker nodes only have IPv6 plus a private (IPv4) network shared with the control node. For context, the control node is called powerful-2. For this issue I'll be using my vault deployment (which is not working) as the example.
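For completeness, the node addressing above is just what kubelet registered for each node; I've been checking it with something like this (nothing cluster-specific in the commands themselves):

kubectl get nodes -o wide
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[*].address}{"\n"}{end}'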
Service that I'll be using:
apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: vault
    meta.helm.sh/release-namespace: vault
  creationTimestamp: "2025-11-23T22:28:58Z"
  labels:
    app.kubernetes.io/instance: vault
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: vault
    helm.sh/chart: vault-0.27.0
    vault-internal: "true"
  name: vault-internal
  namespace: vault
  resourceVersion: "3684"
  uid: 20382afc-a3a2-4e99-93db-5ba808eb2996
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv6
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 8200
    protocol: TCP
    targetPort: 8200
  - name: https-internal
    port: 8201
    protocol: TCP
    targetPort: 8201
  publishNotReadyAddresses: true
  selector:
    app.kubernetes.io/instance: vault
    app.kubernetes.io/name: vault
    component: server
  sessionAffinity: None
  type: ClusterIP

Pods
# kubectl get po -owide
vault-0 1/2 Running 0 5h27m 10.244.105.136 weak-1 <none> <none>
vault-1 1/2 Running 0 5h27m 10.244.217.199 weak-2 <none> <none>
vault-2 1/2 Running 0 5h27m fd40:10:200:0:4a36:85f5:61d4:897 powerful-2 <none> <none>
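For reference, this is roughly how I've been checking which addresses the headless Service actually publishes (using the standard kubernetes.io/service-name label that EndpointSlices carry):

kubectl get svc -n vault vault-internal
kubectl get endpointslices -n vault -l kubernetes.io/service-name=vault-internal -o yaml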
Note that the pods ARE connectable even though they are not ready (the Service sets publishNotReadyAddresses: true).
Also note that the pods seem to have both an IPv4 & an IPv6 address assigned to them:
metadata:
  annotations:
    cni.projectcalico.org/containerID: 18a5ae1ba0417a0a69b1765d10e57e1bcd73903e0af3af56702933fa256b24cb
    cni.projectcalico.org/podIP: 10.244.105.136/32
    cni.projectcalico.org/podIPs: 10.244.105.136/32,fd40:10:200:0:a6f7:5e60:d54c:70c7/128
    kubectl.kubernetes.io/restartedAt: "2025-11-24T14:05:25Z"
  name: vault-0
status:
  hostIP: 2a01:--node-ip--
  hostIPs:
  - ip: 2a01:--node-ip--
  podIP: 10.244.105.136
  podIPs:
  - ip: 10.244.105.136
  - ip: fd40:10:200:0:a6f7:5e60:d54c:70c7
---
metadata:
  annotations:
    cni.projectcalico.org/containerID: 13ff4b5c2905b4ca67ee27ac55673dea75cd249b704d5841e5f88945697bbaa7
    cni.projectcalico.org/podIP: 10.244.217.199/32
    cni.projectcalico.org/podIPs: 10.244.217.199/32,fd40:10:200:0:9e5:9a66:f74e:d9c6/128
    kubectl.kubernetes.io/restartedAt: "2025-11-24T14:05:25Z"
  name: vault-1
status:
  hostIP: 2a01:--node-ip--
  hostIPs:
  - ip: 2a01:--node-ip--
  podIP: 10.244.217.199
  podIPs:
  - ip: 10.244.217.199
  - ip: fd40:10:200:0:9e5:9a66:f74e:d9c6
---
metadata:
  annotations:
    cni.projectcalico.org/containerID: 4a006f9d49295ea4c44c5ff718486cb27a36b171e5041e6013b4d59d1b70b33e
    cni.projectcalico.org/podIP: 10.244.8.152/32
    cni.projectcalico.org/podIPs: 10.244.8.152/32,fd40:10:200:0:4a36:85f5:61d4:897/128
    kubectl.kubernetes.io/restartedAt: "2025-11-24T14:05:25Z"
  name: vault-2
status:
  hostIP: 2a01:--node-ip--
  hostIPs:
  - ip: 2a01:--node-ip--
  - ip: 10.0.0.2
  podIP: fd40:10:200:0:4a36:85f5:61d4:897
  podIPs:
  - ip: fd40:10:200:0:4a36:85f5:61d4:897
  - ip: 10.244.8.152

The test will be to run ping6 against vault-X.vault-internal from inside every pod; the three pods are all on different nodes.
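Concretely, the test looks roughly like this (a sketch; it assumes the server container is named vault and that the image ships busybox's ping6, so adjust as needed):

for src in vault-0 vault-1 vault-2; do
  for dst in vault-0 vault-1 vault-2; do
    # ping each peer via its headless-Service DNS name from inside the source pod
    kubectl exec -n vault $src -c vault -- ping6 -c 2 $dst.vault-internal
  done
done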
Expected Behavior
I expect all pods on all nodes to be able to communicate with each other equally.
Current Behavior
vault-2 (the one on the control node) is able to resolve all pods' addresses, but is only able to ping itself.
vault-0 & vault-1 are not even able to resolve the other pods' addresses.
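For the resolution side, I've been checking DNS directly with something like this (same container-name assumption as above, and assuming the image ships busybox's nslookup; any resolver tool works):

kubectl exec -n vault vault-0 -c vault -- nslookup vault-2.vault-internal
kubectl exec -n vault vault-2 -c vault -- nslookup vault-0.vault-internal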
Steps to Reproduce (for bugs)
n/a, see the intro part of the description above (before the Service YAML).
Context
This is a really big issue for me - no pod-to-pod connectivity is working! I can't really deploy anything until this is resolved.
As additional context, I see these weird errors in the calico-node pods on all nodes other than the control node:
2025-11-24 20:02:11.340 [WARNING][63] felix/client.go 175: Failed to connect to flow server error=rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [fd40:10:100::43e6]:7443: i/o timeout" target="dns:///goldmane.calico-system.svc:7443"
2025-11-24 20:02:11.341 [INFO][63] felix/client.go 251: Waiting before next connection attempt duration=10s
2025-11-24 20:02:21.342 [WARNING][63] felix/client.go 175: Failed to connect to flow server error=rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [fd40:10:100::43e6]:7443: i/o timeout" target="dns:///goldmane.calico-system.svc:7443"
2025-11-24 20:02:21.342 [INFO][63] felix/client.go 251: Waiting before next connection attempt duration=10s
2025-11-24 20:02:22.899 [WARNING][63] felix/client.go 224: Flow client buffer full, dropping flow
2025-11-24 20:02:31.343 [WARNING][63] felix/client.go 175: Failed to connect to flow server error=rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [fd40:10:100::43e6]:7443: i/o timeout" target="dns:///goldmane.calico-system.svc:7443"
2025-11-24 20:02:31.343 [INFO][63] felix/client.go 251: Waiting before next connection attempt duration=10s
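In case it's relevant, goldmane is the flow aggregator the operator deploys into calico-system; I've been looking at it with something like this (grep instead of a label selector, since I'm not sure what labels it carries):

kubectl get svc -n calico-system goldmane
kubectl get pods -n calico-system -o wide | grep -i goldmane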
Your Environment
- Calico version: v3.31.2
- Calico dataplane: iptables
- Orchestrator version (e.g. kubernetes, openshift, etc.): kubernetes
- Operating System and version: kernel 6.12.57+deb13-cloud-amd64
Calico was installed via the Tigera operator.
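I can also attach the operator's Installation resource if that helps, i.e. the output of (assuming the CR has the standard name default):

kubectl get installation.operator.tigera.io default -o yaml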
Thanks in advance for y'all's time, & I hope this can get resolved!