Open
Labels
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.
Description
After checking the documented breaking changes, I thought I had the upgrade under control; apparently not.
We run EKS 1.32 AWSManagedControlPlanes with 1.32 AWSManagedMachinePools, using custom AL2 AMIs.
The upgrade was planned in two stages: first to the latest v1beta1 release, then to the latest v1beta2 release, as recommended here.
So I did:
./clusterctl-v1.10.6 upgrade plan
Checking new release availability...
Latest release available for the v1beta1 API Version of Cluster API (contract):
NAME NAMESPACE TYPE CURRENT VERSION NEXT VERSION
bootstrap-kubeadm capi-kubeadm-bootstrap-system BootstrapProvider v1.7.3 v1.10.6
control-plane-kubeadm capi-kubeadm-control-plane-system ControlPlaneProvider v1.7.3 v1.10.6
cluster-api capi-system CoreProvider v1.7.3 v1.10.6
infrastructure-aws capa-system InfrastructureProvider v2.5.2 v2.9.1
You can now apply the upgrade by executing the following command:
clusterctl upgrade apply --contract v1beta1
So I ran the upgrade command to do the intermediate upgrade, and all providers were upgraded. However, both CAPI and CAPA started constantly logging reconciliation and connection errors.
Perhaps the cause is this, but I thought I had it under control because of this.
These are the logs. I tried to pick only the ones for one particular cluster; we have almost 30, all failing like this.
Logs from capa-controller-manager
I0919 10:58:42.605598 1 awsmanagedmachinepool_controller.go:202] "Reconciling AWSManagedMachinePool" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="82996b04-ef8f-4b26-b570-95f5010121cb" MachinePool="prod/services-prod-pool-ap-southeast-2a" cluster="prod/services.REDACTED"
I0919 10:58:42.605729 1 launchtemplate.go:81] "checking for existing launch template" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="82996b04-ef8f-4b26-b570-95f5010121cb" MachinePool="prod/services-prod-pool-ap-southeast-2a" cluster="prod/services.REDACTED"
[...]
I0919 10:58:45.429754 1 tags.go:128] "Reconciling ASG tags" controller="awsmanagedmachinepool" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="AWSManagedMachinePool" AWSManagedMachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="82996b04-ef8f-4b26-b570-95f5010121cb" MachinePool="prod/services-prod-pool-ap-southeast-2a" cluster="prod/services.REDACTED" cluster-name="services_ap-southeast-2_prod_alienvault_cloud" nodegroup-name="services-prod-pool-ap-southeast-2a"
Logs from capi-controller-manager
E0919 11:01:39.644472 1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2b" namespace="prod" name="services-prod-pool-ap-southeast-2b" reconcileID="dd96348e-37dc-4d9d-90f8-33b72cca5aa1"
E0919 11:01:42.691574 1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2c" namespace="prod" name="services-prod-pool-ap-southeast-2c" reconcileID="4b104a11-3d94-401f-b227-c89eceb45e71"
E0919 11:01:44.009112 1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2c" namespace="prod" name="services-prod-pool-ap-southeast-2c" reconcileID="35240758-7625-420d-85cc-517b095fa4f4"
E0919 11:01:52.674593 1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="5cd6d5a9-452a-474b-bcff-09ad0e98e6a1"
E0919 11:01:52.952752 1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" controllerGroup="cluster.x-k8s.io" controllerKind="MachinePool" MachinePool="prod/services-prod-pool-ap-southeast-2a" namespace="prod" name="services-prod-pool-ap-southeast-2a" reconcileID="36a5298a-d1d2-4e8c-a7e3-da275b13d90b"
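With almost 30 clusters logging the same way, it may help to tally the "Reconciler error" lines per MachinePool rather than read them one by one. The following is a minimal sketch, assuming the logs keep the `key="value"` layout shown above; the function name `tally_reconciler_errors` and the `SAMPLE_LINES` list are hypothetical, and the sample lines are trimmed copies of the capi-controller-manager lines above.

```python
import re
from collections import Counter

# Matches key="value" pairs as they appear in the structured log lines above.
PAIR = re.compile(r'([\w-]+)="([^"]*)"')


def tally_reconciler_errors(lines):
    """Count "Reconciler error" lines per MachinePool."""
    counts = Counter()
    for line in lines:
        if '"Reconciler error"' not in line:
            continue  # skip informational lines
        fields = dict(PAIR.findall(line))
        counts[fields.get("MachinePool", "<unknown>")] += 1
    return counts


# Trimmed copies of three log lines from this issue (fields shortened for brevity).
SAMPLE_LINES = [
    'E0919 11:01:39.644472 1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" MachinePool="prod/services-prod-pool-ap-southeast-2b"',
    'E0919 11:01:42.691574 1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" MachinePool="prod/services-prod-pool-ap-southeast-2c"',
    'E0919 11:01:44.009112 1 controller.go:347] "Reconciler error" err="Object prod/services.REDACTED is already owned by another MachinePool controller services-prod-pool-prometheus-ap-southeast-2" controller="machinepool" MachinePool="prod/services-prod-pool-ap-southeast-2c"',
]

for pool, count in sorted(tally_reconciler_errors(SAMPLE_LINES).items()):
    print(f"{pool}: {count} reconciler error(s)")
```

In practice the lines would come from the controller's log output instead of the inline sample.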
Logs from capi-kubeadm-bootstrap-controller-manager
I0919 10:57:44.297447 1 cluster_accessor.go:320] "Disconnecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="de112319-22c9-4bc8-a248-da3869cb4f13"
I0919 10:57:44.297492 1 cluster_accessor.go:327] "Disconnected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="de112319-22c9-4bc8-a248-da3869cb4f13"
I0919 10:57:44.298712 1 cluster_accessor.go:252] "Connecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="b212685a-8419-4acd-8ff3-7d893b41a2e3"
I0919 10:57:47.933214 1 cluster_accessor.go:274] "Connected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.REDACTED" namespace="prod" name="services.REDACTED" reconcileID="b212685a-8419-4acd-8ff3-7d893b41a2e3"
Logs from capi-kubeadm-control-plane-system
I0919 11:00:09.828007 1 cluster_accessor.go:320] "Disconnecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="f74b3271-9d4b-4b6a-95a7-7abe21839a7b"
I0919 11:00:09.828056 1 cluster_accessor.go:327] "Disconnected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="f74b3271-9d4b-4b6a-95a7-7abe21839a7b"
I0919 11:00:09.829332 1 cluster_accessor.go:252] "Connecting" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="95222f01-14a5-4e4b-bec3-372e95d9b983"
I0919 11:00:13.479651 1 cluster_accessor.go:274] "Connected" controller="clustercache" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="prod/services.ap-southeast-2.prod.alienvault.cloud" namespace="prod" name="services.ap-southeast-2.prod.alienvault.cloud" reconcileID="95222f01-14a5-4e4b-bec3-372e95d9b983"
This is the config of this particular cluster:
ap-southeast-2 cluster YAML
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: services.REDACTED
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "0"
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 192.168.0.0/16
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v2beta2
    kind: AWSManagedControlPlane
    name: services.REDACTED
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSManagedCluster
    name: services.REDACTED
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedCluster
metadata:
  name: services.REDACTED
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "10"
spec: {}
---
apiVersion: controlplane.cluster.x-k8s.io/v1beta2
kind: AWSManagedControlPlane
metadata:
  name: services.REDACTED
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "20"
spec:
  associateOIDCProvider: true
  eksClusterName: services_REDACTED_1
  region: ap-southeast-2
  version: v1.32.0
  network:
    vpc:
      id: vpc-XXXXXXXXXX
    subnets:
      - id: subnet-X
      - id: subnet-Y
      - id: subnet-Z
    securityGroupOverrides:
      node-eks-additional: sg-W
  endpointAccess:
    private: true
    public: false
  bastion:
    enabled: false
  oidcIdentityProviderConfig:
    identityProviderConfigName: Okta
    issuerUrl: https://.okta.com/oauth2/XXXXXXXXXXXX
    clientId: XXXXXXXXX
    usernameClaim: preferred_username
    groupsClaim: groups
    groupsPrefix: "okta:"
  logging:
    apiServer: false
    controllerManager: false
    audit: false
    authenticator: false
    scheduler: false
  iamAuthenticatorConfig:
    mapRoles:
      - username: "kubernetes-admin"
        rolearn: "arn:aws:iam::XXXXXXXXXXXX:role/saas-OktaAdmins"
        groups:
          - "system:masters"
  addons:
    - name: "kube-proxy"
      version: "v1.32.6-eksbuild.6"
      conflictResolution: "overwrite"
    - name: "vpc-cni"
      version: "v1.20.1-eksbuild.1"
      conflictResolution: "overwrite"
    - name: "aws-ebs-csi-driver"
      version: "v1.48.0-eksbuild.1"
      conflictResolution: "overwrite"
      serviceAccountRoleARN: "arn:aws:iam::XXXXXXXXXXXX:role/prod-AmazonEKS_EBS_CSI_DriverRole"
  vpcCni:
    env:
      - name: POD_SECURITY_GROUP_ENFORCING_MODE
        value: standard
      - name: ENABLE_POD_ENI
        value: "true"
      - name: ENABLE_PREFIX_DELEGATION
        value: "true"
  additionalTags:
    Owner: "EngOps"
    created_by: "https://bitbucket.org/redacted/capi-cluster"
    Environment: "prod"
  identityRef:
    kind: AWSClusterRoleIdentity
    name: prod
  roleAdditionalPolicies:
    - arn:aws:iam::aws:policy/AmazonEKSVPCResourceController
---
apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
kind: EKSConfig
metadata:
  name: services.REDACTED
  namespace: prod
spec:
  boostrapCommandOverride: "# Self-bootstrap embedded in AMI, doing nothing here for cluster"
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-prometheus-ap-southeast-2
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "30"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
    - ap-southeast-2a
    - ap-southeast-2b
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-prometheus-ap-southeast-2
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-prometheus-ap-southeast-2
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "30"
spec:
  eksNodegroupName: services-prod-pool-prometheus
  availabilityZones:
    - ap-southeast-2a
    - ap-southeast-2b
  scaling:
    minSize: 2
    maxSize: 4
  updateConfig:
    maxUnavailable: 1
  awsLaunchTemplate:
    instanceType: m5.large
    ami:
      id: ami-YYYYYY
  labels:
    usm.io/role: prometheus
  taints:
    - key: dedicated
      effect: no-schedule
      value: prometheus
  subnetIDs:
    - subnet-X
    - subnet-Y
  roleAdditionalPolicies:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-ap-southeast-2a
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "40"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
    - ap-southeast-2a
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-ap-southeast-2a
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-ap-southeast-2a
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "40"
spec:
  eksNodegroupName: services-prod-pool-ap-southeast-2a
  availabilityZones:
    - ap-southeast-2a
  scaling:
    minSize: 2
    maxSize: 25
  updateConfig:
    maxUnavailablePercentage: 40
  subnetIDs:
    - subnet-X
  awsLaunchTemplate:
    instanceType: m5.xlarge
    ami:
      id: ami-YYYYYY
  roleAdditionalPolicies:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-ap-southeast-2b
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "41"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
    - ap-southeast-2b
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-ap-southeast-2b
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-ap-southeast-2b
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "41"
spec:
  eksNodegroupName: services-prod-pool-ap-southeast-2b
  availabilityZones:
    - ap-southeast-2b
  scaling:
    minSize: 2
    maxSize: 25
  updateConfig:
    maxUnavailablePercentage: 40
  subnetIDs:
    - subnet-Y
  awsLaunchTemplate:
    instanceType: m5.xlarge
    ami:
      id: ami-YYYYYY
  roleAdditionalPolicies:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachinePool
metadata:
  name: services-prod-pool-ap-southeast-2c
  namespace: prod
  annotations:
    cluster.x-k8s.io/replicas-managed-by: "external-autoscaler"
    argocd.argoproj.io/sync-wave: "42"
spec:
  clusterName: services.REDACTED
  replicas: 2
  failureDomains:
    - ap-southeast-2c
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta2
          kind: EKSConfig
          name: services.REDACTED
          namespace: prod
        dataSecretName: services.REDACTED
      clusterName: services.REDACTED
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSManagedMachinePool
        name: services-prod-pool-ap-southeast-2c
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
kind: AWSManagedMachinePool
metadata:
  name: services-prod-pool-ap-southeast-2c
  namespace: prod
  annotations:
    argocd.argoproj.io/sync-wave: "42"
spec:
  eksNodegroupName: services-prod-pool-ap-southeast-2c
  availabilityZones:
    - ap-southeast-2c
  scaling:
    minSize: 2
    maxSize: 25
  updateConfig:
    maxUnavailablePercentage: 40
  subnetIDs:
    - subnet-Z
  awsLaunchTemplate:
    instanceType: m5.xlarge
    ami:
      id: ami-YYYYYY
  roleAdditionalPolicies:
    - arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
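
One thing that stands out when comparing the manifests with the capi-controller-manager error: the object named in the error, prod/services.REDACTED, has the same name as the single EKSConfig that all four MachinePools use as their bootstrap configRef. Whether that sharing is what triggers "already owned by another MachinePool controller" is an assumption, not something the logs confirm. A minimal sketch to list bootstrap configs referenced by more than one MachinePool is below; the helpers `machine_pool` and `shared_bootstrap_refs` are hypothetical, and the dictionaries only mirror the fields of the manifests above that the check needs.

```python
from collections import defaultdict


def machine_pool(name, config_name, namespace="prod"):
    """Build the minimal slice of a MachinePool manifest used by the check."""
    return {
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"template": {"spec": {"bootstrap": {"configRef": {
            "kind": "EKSConfig",
            "name": config_name,
            "namespace": namespace,
        }}}}},
    }


def shared_bootstrap_refs(pools):
    """Return bootstrap configRefs that more than one MachinePool points at."""
    by_ref = defaultdict(list)
    for pool in pools:
        ref = pool["spec"]["template"]["spec"]["bootstrap"].get("configRef")
        if not ref:
            continue  # pool bootstraps from dataSecretName only
        namespace = ref.get("namespace", pool["metadata"]["namespace"])
        by_ref[(ref["kind"], namespace, ref["name"])].append(pool["metadata"]["name"])
    return {key: names for key, names in by_ref.items() if len(names) > 1}


# The four MachinePools from the manifests above, all pointing at one EKSConfig.
POOLS = [
    machine_pool(name, "services.REDACTED")
    for name in (
        "services-prod-pool-prometheus-ap-southeast-2",
        "services-prod-pool-ap-southeast-2a",
        "services-prod-pool-ap-southeast-2b",
        "services-prod-pool-ap-southeast-2c",
    )
]

for (kind, namespace, name), owners in shared_bootstrap_refs(POOLS).items():
    print(f"{kind} {namespace}/{name} is referenced by {len(owners)} MachinePools:")
    for owner in owners:
        print(f"  - {owner}")
```

If the sharing does turn out to matter, one option to try would be a separate EKSConfig per MachinePool, but that is untested here.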