Description
During cluster upgrades, StatefulSet pods are recreated with the same caliXXX interface name. In the cases we observed, Calico (Felix) occasionally processes the creation of the new pod's interface (idx=204) before the deletion of the old pod's interface (idx=83), and the pod's routes are lost.
Root Cause Analysis:
In function storeAndNotifyLinkInner:
- When the new interface (idx=204) state changes to "up":
  m.ifaceIdxToInfo[204] = {State: up}
  m.ifaceNameToIdx[calidce2a7f6bc3] = 204
- When the old interface (idx=83) state changes to "down":
  m.ifaceIdxToInfo[83] = {State: down}
  m.ifaceNameToIdx[calidce2a7f6bc3] = 83
  Routes are deleted due to down state.
- When the new interface (idx=204) triggers a state sync:
  newState = oldState = m.ifaceIdxToInfo[204].State = "up"
  Required route additions are skipped because the state appears unchanged.
Evidence:
Calico Log
2025-09-09 10:58:02.685 [INFO][677] felix/int_dataplane.go 1443: Linux interface state changed. ifIndex=204 ifaceName="calidce2a7f6bc3" state="down"
2025-09-09 10:58:02.685 [INFO][677] felix/int_dataplane.go 1487: Linux interface addrs changed. addrs=set.Set{fe80::ecee:eeff:feee:eeee} ifaceName="calidce2a7f6bc3"
2025-09-09 10:58:02.685 [INFO][677] felix/int_dataplane.go 1443: Linux interface state changed. ifIndex=204 ifaceName="calidce2a7f6bc3" state="up"
2025-09-09 10:58:02.790 [INFO][677] felix/int_dataplane.go 1443: Linux interface state changed. ifIndex=83 ifaceName="calidce2a7f6bc3" state="down"
2025-09-09 10:58:02.790 [INFO][677] felix/int_dataplane.go 1487: Linux interface addrs changed. addrs=set.Set{} ifaceName="calidce2a7f6bc3"
2025-09-09 10:58:02.790 [INFO][677] felix/int_dataplane.go 1443: Linux interface state changed. ifIndex=83 ifaceName="calidce2a7f6bc3" state=""
2025-09-09 10:58:02.790 [INFO][677] felix/int_dataplane.go 1487: Linux interface addrs changed. addrs=<nil> ifaceName="calidce2a7f6bc3"
2025-09-09 10:58:03.090 [INFO][677] felix/route_table.go 1149: Spotted interface had changed index during resync. ifaceName="calidce2a7f6bc3" newIdx=204 oldIdx=83
...
2025-09-09 11:03:13.470 [INFO][677] felix/int_dataplane.go 2053: Received *proto.WorkloadEndpointUpdate update from calculation graph msg=id:<orchestrator_id:"k8s" workload_id:"nce/nce-hofs-cluster-hofsosdservice-0" endpoint_id:"eth0" > endpoint:<state:"active" name:"calidce2a7f6bc3" profile_ids:"kns.nce" profile_ids:"ksa.nce.default" ipv4_nets:"172.20.193.168/32" >
2025-09-09 11:03:13.472 [INFO][677] felix/endpoint_mgr.go 1439: Skipping configuration of interface because it is oper down. ifaceName="calidce2a7f6bc3"
...
Current interface state on node
[root@caasnode1 sopuser]# ip a s calidce2a7f6bc3
204: calidce2a7f6bc3@if2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-91b39a7e-dad5-970d-131f-62d70d04f96a
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
Related Code:
https://github.com/projectcalico/calico/blob/v3.29.1/felix/ifacemonitor/iface_monitor.go#L309
Expected Behavior
Routes should persist correctly after pod recreation.
Current Behavior
Pod network connectivity fails due to missing routes after interface recreation.
Possible Solution
Add name validation to the interface state comparison, so that the stored state is reused as oldState only when the name still maps to the same index:
if info := m.ifaceIdxToInfo[ifIndex]; info != nil && m.ifaceNameToIdx[ifaceName] == ifIndex {
    oldState = info.State
}
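As a rough check of this idea, the sketch below runs the same event order with and without the name validation. Everything here (`tracker`, `update`, `routesAfterRace`, the `checkName` switch, and the name-keyed `routesPresent` map standing in for a downstream consumer) is hypothetical and much simpler than Felix; it does not model link deletion or address updates, so it only suggests the check would help under these assumptions.

```go
package main

import "fmt"

type ifaceInfo struct{ State string }

// tracker is a hypothetical, cut-down model of the monitor's two maps.
type tracker struct {
	ifaceIdxToInfo map[int]*ifaceInfo
	ifaceNameToIdx map[string]int
	checkName      bool            // true enables the proposed name validation
	routesPresent  map[string]bool // what a name-keyed consumer would believe
}

func newTracker(checkName bool) *tracker {
	return &tracker{
		ifaceIdxToInfo: map[int]*ifaceInfo{},
		ifaceNameToIdx: map[string]int{},
		checkName:      checkName,
		routesPresent:  map[string]bool{},
	}
}

// update stores the new state and, if it differs from the old state,
// tells the name-keyed consumer to add ("up") or remove routes.
func (t *tracker) update(name string, idx int, newState string) {
	oldState := ""
	if info := t.ifaceIdxToInfo[idx]; info != nil &&
		(!t.checkName || t.ifaceNameToIdx[name] == idx) {
		oldState = info.State
	}
	t.ifaceIdxToInfo[idx] = &ifaceInfo{State: newState}
	t.ifaceNameToIdx[name] = idx
	if oldState != newState {
		t.routesPresent[name] = newState == "up"
	}
}

// routesAfterRace replays the out-of-order events and reports whether
// the consumer still has routes for the interface name afterwards.
func routesAfterRace(checkName bool) bool {
	const name = "calidce2a7f6bc3"
	t := newTracker(checkName)
	t.update(name, 83, "up")   // assumed initial state of the old interface
	t.update(name, 204, "up")  // new interface processed first
	t.update(name, 83, "down") // late event for the old interface
	t.update(name, 204, "up")  // state sync for the new interface
	return t.routesPresent[name]
}

func main() {
	fmt.Println("routes present without name check:", routesAfterRace(false))
	fmt.Println("routes present with name check:   ", routesAfterRace(true))
}
```

With the check enabled, the final sync of idx=204 sees that the name currently maps to 83, treats the old state as unknown, and re-emits "up". A fuller fix might also ignore the late "down" for an index the name no longer belongs to, which this sketch does not attempt.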
Steps to Reproduce (for bugs)
We encountered this issue during upgrades of a 3-node cluster. The upgrade process triggers mass pod restarts through Helm template rendering. No reliable reproduction method exists currently.
Suspected Contributing Factors:
- Calico component overload (high CPU/memory usage)
- Frequent interface change events exceeding processing capacity
Your Environment
Calico version: v3.29.1
Orchestrator version: Kubernetes v1.31.1
Operating System and version: eulerosv2r13