
All statuses reported as healthy, but connectivity does not work #11439

@ShadiestGoat

Description


Hi! I'm running a fresh cluster with 1 control node & 3 worker nodes. I'm seeing weird issues where my nodes just cannot communicate with each other. I've been banging my head against a wall for about two weeks now with barely any progress, so I'd really appreciate some help! My cluster is made up of a bunch of VPSs, but mostly without the usual cloud infra. The control node has IPv4 & IPv6; the worker nodes only have IPv6 plus a private (IPv4) network shared with the control node. For context, the control node is called powerful-2. For this issue I'll be using my vault deployment (which is not working) as the example.

Service that I'll be using:

apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: vault
    meta.helm.sh/release-namespace: vault
  creationTimestamp: "2025-11-23T22:28:58Z"
  labels:
    app.kubernetes.io/instance: vault
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: vault
    helm.sh/chart: vault-0.27.0
    vault-internal: "true"
  name: vault-internal
  namespace: vault
  resourceVersion: "3684"
  uid: 20382afc-a3a2-4e99-93db-5ba808eb2996
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv6
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 8200
    protocol: TCP
    targetPort: 8200
  - name: https-internal
    port: 8201
    protocol: TCP
    targetPort: 8201
  publishNotReadyAddresses: true
  selector:
    app.kubernetes.io/instance: vault
    app.kubernetes.io/name: vault
    component: server
  sessionAffinity: None
  type: ClusterIP
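
For reference: since this is a headless, single-stack IPv6 service with publishNotReadyAddresses: true, I'd expect each pod to get an AAAA record at vault-X.vault-internal.vault.svc.cluster.local even while unready. A rough sketch of how that can be checked (throwaway busybox pod; the image and pod name are just an illustration, not what I actually ran):

# query one of the per-pod DNS names from a temporary pod in the same namespace
kubectl run -n vault dnstest --rm -it --restart=Never --image=busybox -- \
  nslookup vault-0.vault-internal.vault.svc.cluster.local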

Pods

# kubectl get po -owide
NAME                                   READY   STATUS    RESTARTS        AGE     IP                                 NODE         NOMINATED NODE   READINESS GATES
vault-0                                1/2     Running   0               5h27m   10.244.105.136                     weak-1       <none>           <none>
vault-1                                1/2     Running   0               5h27m   10.244.217.199                     weak-2       <none>           <none>
vault-2                                1/2     Running   0               5h27m   fd40:10:200:0:4a36:85f5:61d4:897   powerful-2   <none>           <none>

Note that the pods ARE reachable even though they are not ready.
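
By "reachable" I mean a direct TCP connection to the pod IP on the vault port succeeds, at least from the node the pod is running on. Roughly along these lines (the nc invocation is just a sketch, vault-0's IPv4 address taken from the output above):

# from weak-1 (the node hosting vault-0), check the pod IP directly on port 8200
nc -zv -w 3 10.244.105.136 8200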

Also note that the pods seem to have both v4 & v6 assigned to them:

metadata:
  annotations:
    cni.projectcalico.org/containerID: 18a5ae1ba0417a0a69b1765d10e57e1bcd73903e0af3af56702933fa256b24cb
    cni.projectcalico.org/podIP: 10.244.105.136/32
    cni.projectcalico.org/podIPs: 10.244.105.136/32,fd40:10:200:0:a6f7:5e60:d54c:70c7/128
    kubectl.kubernetes.io/restartedAt: "2025-11-24T14:05:25Z"
  name: vault-0
status:
  hostIP: 2a01:--node-ip--
  hostIPs:
    - ip: 2a01:--node-ip--
  podIP: 10.244.105.136
  podIPs:
    - ip: 10.244.105.136
    - ip: fd40:10:200:0:a6f7:5e60:d54c:70c7
---
metadata:
  annotations:
    cni.projectcalico.org/containerID: 13ff4b5c2905b4ca67ee27ac55673dea75cd249b704d5841e5f88945697bbaa7
    cni.projectcalico.org/podIP: 10.244.217.199/32
    cni.projectcalico.org/podIPs: 10.244.217.199/32,fd40:10:200:0:9e5:9a66:f74e:d9c6/128
    kubectl.kubernetes.io/restartedAt: "2025-11-24T14:05:25Z"
  name: vault-1
status:
  hostIP: 2a01:--node-ip--
  hostIPs:
    - ip: 2a01:--node-ip--
  podIP: 10.244.217.199
  podIPs:
    - ip: 10.244.217.199
    - ip: fd40:10:200:0:9e5:9a66:f74e:d9c6
---
metadata:
  annotations:
    cni.projectcalico.org/containerID: 4a006f9d49295ea4c44c5ff718486cb27a36b171e5041e6013b4d59d1b70b33e
    cni.projectcalico.org/podIP: 10.244.8.152/32
    cni.projectcalico.org/podIPs: 10.244.8.152/32,fd40:10:200:0:4a36:85f5:61d4:897/128
    kubectl.kubernetes.io/restartedAt: "2025-11-24T14:05:25Z"
  name: vault-2
status:
  hostIP: 2a01:--node-ip--
  hostIPs:
    - ip: 2a01:--node-ip--
    - ip: 10.0.0.2
  podIP: fd40:10:200:0:4a36:85f5:61d4:897
  podIPs:
    - ip: fd40:10:200:0:4a36:85f5:61d4:897
    - ip: 10.244.8.152

The test is to run ping6 against vault-X.vault-internal from inside every pod; the three pods are each on a different node.
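
Concretely, the test looks roughly like this (the container name "vault" and the exact ping6 flags are assumptions; I run the equivalent from inside each of the three pods):

# from vault-0, try to reach every peer via the headless service DNS names
for target in vault-0 vault-1 vault-2; do
  kubectl exec -n vault vault-0 -c vault -- ping6 -c 3 "${target}.vault-internal"
done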

Expected Behavior

I expect all pods on all nodes to be able to communicate with each other equally.

Current Behavior

vault-2 (the one on the control node) is able to resolve all of the pods' addresses, but is only able to ping itself.
vault-0 & vault-1 are not even able to resolve the other pods' addresses.

Steps to Reproduce (for bugs)

n/a; see the cluster setup described at the top of the description.

Context

This is a really big issue for me: no cross-node connectivity works at all, and I can't really deploy anything until this is resolved.

Some additional context: I see these weird errors in the calico-node pods on all nodes other than the control node:

2025-11-24 20:02:11.340 [WARNING][63] felix/client.go 175: Failed to connect to flow server error=rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [fd40:10:100::43e6]:7443: i/o timeout" target="dns:///goldmane.calico-system.svc:7443"
2025-11-24 20:02:11.341 [INFO][63] felix/client.go 251: Waiting before next connection attempt duration=10s
2025-11-24 20:02:21.342 [WARNING][63] felix/client.go 175: Failed to connect to flow server error=rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [fd40:10:100::43e6]:7443: i/o timeout" target="dns:///goldmane.calico-system.svc:7443"
2025-11-24 20:02:21.342 [INFO][63] felix/client.go 251: Waiting before next connection attempt duration=10s
2025-11-24 20:02:22.899 [WARNING][63] felix/client.go 224: Flow client buffer full, dropping flow
2025-11-24 20:02:31.343 [WARNING][63] felix/client.go 175: Failed to connect to flow server error=rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [fd40:10:100::43e6]:7443: i/o timeout" target="dns:///goldmane.calico-system.svc:7443"
2025-11-24 20:02:31.343 [INFO][63] felix/client.go 251: Waiting before next connection attempt duration=10s
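
For what it's worth, goldmane.calico-system.svc is a regular ClusterIP service, so this looks like the same cross-node service connectivity problem rather than something vault-specific. A sketch of how I'd cross-check it (the service name comes from the log above; I'm assuming fd40:10:100::43e6 is its ClusterIP):

# confirm the service and its ClusterIP
kubectl get svc -n calico-system goldmane
# from one of the worker nodes (calico-node is host-networked), probe the VIP directly
nc -6 -zv -w 5 fd40:10:100::43e6 7443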

Your Environment

  • Calico version: v3.31.2
  • Calico dataplane: iptables
  • Orchestrator version (e.g. kubernetes, openshift, etc.): kubernetes
  • Operating System and version: 6.12.57+deb13-cloud-amd64

Installed via the Tigera operator.
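
In case the operator config matters here, these are the resources I can dump (resource names are the operator defaults, assuming nothing was renamed):

# tigera-operator managed config + the Calico IP pools
kubectl get installation default -o yaml
kubectl get ippools.crd.projectcalico.org -o yaml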

Thanks for y'all's time in advance & I hope this can get resolved!
