StatefulSet Pod Network Failure After Cluster Upgrade Due to Stale Interface State #11016

@sljslj

Description

During cluster upgrades, StatefulSet pods are recreated, and the new pod's interface gets the same caliXXX name as the old one. In the cases we observed, Calico occasionally processes the creation of the new pod's interface (idx=204) before the deletion of the old pod's interface (idx=83), and the pod's routes are lost.

Root Cause Analysis:

In function storeAndNotifyLinkInner:

  • When the new interface (idx=204) state changes to "up":
    m.ifaceIdxToInfo[204] = {State: up}
    m.ifaceNameToIdx[calidce2a7f6bc3] = 204
  • When the old interface (idx=83) state changes to "down":
    m.ifaceIdxToInfo[83] = {State: down}
    m.ifaceNameToIdx[calidce2a7f6bc3] = 83
    Routes are deleted due to down state.
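  • When the new interface (idx=204) triggers a state sync:
    oldState is read from m.ifaceIdxToInfo[204].State, which is still "up", so newState == oldState.
    The required route additions are skipped because the state appears unchanged, even though m.ifaceNameToIdx[calidce2a7f6bc3] now points at 83.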
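
The sequence above can be replayed with a small toy model. This is only a sketch of the behavior described in this issue, not Felix's actual code: the `monitor` type, the `routes` map, and the `storeAndNotify` and `replayUpgradeRace` functions are hypothetical stand-ins. Only the two map names are taken from the analysis above.

```go
package main

import "fmt"

// ifaceInfo is a hypothetical, stripped-down per-interface record.
type ifaceInfo struct {
	State string
}

// monitor is a toy stand-in for the interface monitor; routes is a
// hypothetical flag recording whether routes exist for an interface name.
type monitor struct {
	ifaceIdxToInfo map[int]*ifaceInfo
	ifaceNameToIdx map[string]int
	routes         map[string]bool
}

func newMonitor() *monitor {
	return &monitor{
		ifaceIdxToInfo: map[int]*ifaceInfo{},
		ifaceNameToIdx: map[string]int{},
		routes:         map[string]bool{},
	}
}

// storeAndNotify models the comparison described above: oldState is looked
// up by interface index only, and route changes happen only when the state
// differs from what was stored for that index.
func (m *monitor) storeAndNotify(name string, idx int, newState string) bool {
	oldState := ""
	if info := m.ifaceIdxToInfo[idx]; info != nil {
		oldState = info.State
	}
	m.ifaceIdxToInfo[idx] = &ifaceInfo{State: newState}
	m.ifaceNameToIdx[name] = idx
	if oldState == newState {
		return false // state looks unchanged, so nothing is reprogrammed
	}
	m.routes[name] = newState == "up"
	return true
}

// replayUpgradeRace feeds the events in the order seen in the logs and
// returns whether the final sync notified and whether routes exist.
func replayUpgradeRace() (notified bool, routed bool) {
	m := newMonitor()
	name := "calidce2a7f6bc3"
	m.storeAndNotify(name, 83, "up")             // old pod's interface before the upgrade
	m.storeAndNotify(name, 204, "up")            // new pod's interface is seen first
	m.storeAndNotify(name, 83, "down")           // old interface's deletion arrives late
	notified = m.storeAndNotify(name, 204, "up") // later state sync for the new interface
	return notified, m.routes[name]
}

func main() {
	notified, routed := replayUpgradeRace()
	fmt.Println(notified, routed) // prints "false false"
}
```

In this model the final sync is a no-op, so the name ends up without routes although interface 204 is up, which matches the symptom reported here.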

Evidence:

Calico Log

2025-09-09 10:58:02.685 [INFO][677] felix/int_dataplane.go 1443: Linux interface state changed. ifIndex=204 ifaceName="calidce2a7f6bc3" state="down"
2025-09-09 10:58:02.685 [INFO][677] felix/int_dataplane.go 1487: Linux interface addrs changed. addrs=set.Set{fe80::ecee:eeff:feee:eeee} ifaceName="calidce2a7f6bc3"
2025-09-09 10:58:02.685 [INFO][677] felix/int_dataplane.go 1443: Linux interface state changed. ifIndex=204 ifaceName="calidce2a7f6bc3" state="up"
2025-09-09 10:58:02.790 [INFO][677] felix/int_dataplane.go 1443: Linux interface state changed. ifIndex=83 ifaceName="calidce2a7f6bc3" state="down"
2025-09-09 10:58:02.790 [INFO][677] felix/int_dataplane.go 1487: Linux interface addrs changed. addrs=set.Set{} ifaceName="calidce2a7f6bc3"
2025-09-09 10:58:02.790 [INFO][677] felix/int_dataplane.go 1443: Linux interface state changed. ifIndex=83 ifaceName="calidce2a7f6bc3" state=""
2025-09-09 10:58:02.790 [INFO][677] felix/int_dataplane.go 1487: Linux interface addrs changed. addrs=<nil> ifaceName="calidce2a7f6bc3"
2025-09-09 10:58:03.090 [INFO][677] felix/route_table.go 1149: Spotted interface had changed index during resync. ifaceName="calidce2a7f6bc3" newIdx=204 oldIdx=83
...
2025-09-09 11:03:13.470 [INFO][677] felix/int_dataplane.go 2053: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"nce/nce-hofs-cluster-hofsosdservice-0" endpoint_id:"eth0" > endpoint:<state:"active" name:"calidce2a7f6bc3" profile_ids:"kns.nce" profile_ids:"ksa.nce.default" ipv4_nets:"172.20.193.168/32" > 
2025-09-09 11:03:13.472 [INFO][677] felix/endpoint_mgr.go 1439: Skipping configuration of interface because it is oper down. ifaceName="calidce2a7f6bc3"
...

Current interface state on node

[root@caasnode1 sopuser]# ip a s calidce2a7f6bc3
204: calidce2a7f6bc3@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-91b39a7e-dad5-970d-131f-62d70d04f96a
    inet6 fe80::ecee:eeff:feee:eeee/64 scope link 
       valid_lft forever preferred_lft forever

Related Code:

https://github.com/projectcalico/calico/blob/v3.29.1/felix/ifacemonitor/iface_monitor.go#L309

Expected Behavior

After a pod is recreated, routes for the new interface should be present, even if the old interface's deletion event is processed late.

Current Behavior

Pod network connectivity fails because the routes are missing after the interface is recreated with the same name.

Possible Solution

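Add a name check to the interface state comparison, so that the stored state is only reused when the name still maps to this index:

// Only trust the stored state if ifaceName currently maps to ifIndex;
// otherwise oldState keeps its zero value and the change is reprocessed.
if info := m.ifaceIdxToInfo[ifIndex]; info != nil && m.ifaceNameToIdx[ifaceName] == ifIndex {
	oldState = info.State
}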
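
To illustrate the effect of this check, here is a minimal sketch that replays the same event order against a toy model. It is not Felix's implementation and has not been tested against Felix. The `monitor` type, the `routes` map, and the `storeAndNotifyChecked` and `replayWithNameCheck` functions are hypothetical; only the two map names and the condition itself come from this issue.

```go
package main

import "fmt"

// ifaceInfo is a hypothetical, stripped-down per-interface record.
type ifaceInfo struct {
	State string
}

// monitor is a toy stand-in for the interface monitor; routes is a
// hypothetical flag recording whether routes exist for an interface name.
type monitor struct {
	ifaceIdxToInfo map[int]*ifaceInfo
	ifaceNameToIdx map[string]int
	routes         map[string]bool
}

// storeAndNotifyChecked reuses the stored state only when the name still
// maps to this index, as in the proposed condition. The check runs before
// the name-to-index map is updated.
func (m *monitor) storeAndNotifyChecked(name string, idx int, newState string) bool {
	oldState := ""
	if info := m.ifaceIdxToInfo[idx]; info != nil && m.ifaceNameToIdx[name] == idx {
		oldState = info.State
	}
	m.ifaceIdxToInfo[idx] = &ifaceInfo{State: newState}
	m.ifaceNameToIdx[name] = idx
	if oldState == newState {
		return false
	}
	m.routes[name] = newState == "up"
	return true
}

// replayWithNameCheck feeds the events in the order seen in the logs and
// returns whether the final sync notified and whether routes exist.
func replayWithNameCheck() (notified bool, routed bool) {
	m := &monitor{
		ifaceIdxToInfo: map[int]*ifaceInfo{},
		ifaceNameToIdx: map[string]int{},
		routes:         map[string]bool{},
	}
	name := "calidce2a7f6bc3"
	m.storeAndNotifyChecked(name, 83, "up")             // old pod's interface
	m.storeAndNotifyChecked(name, 204, "up")            // new pod's interface is seen first
	m.storeAndNotifyChecked(name, 83, "down")           // late deletion of the old interface
	notified = m.storeAndNotifyChecked(name, 204, "up") // later state sync
	return notified, m.routes[name]
}

func main() {
	notified, routed := replayWithNameCheck()
	fmt.Println(notified, routed) // prints "true true"
}
```

In this model the late "down" event for index 83 still removes the routes temporarily; the check only ensures that the next sync for index 204 is treated as a change and restores them.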

Steps to Reproduce (for bugs)

We encountered this issue while upgrading a 3-node cluster. The upgrade re-renders the Helm templates, which restarts a large number of pods at once. We do not yet have a reliable way to reproduce it.

Suspected Contributing Factors:

  1. Calico component overload (high CPU/memory usage)
  2. Frequent interface change events exceeding processing capacity

Your Environment

Calico version v3.29.1
Orchestrator version: kubernetes v1.31.1
Operating System and version: eulerosv2r13
