@supershal supershal commented Oct 21, 2025

What type of PR is this?
/kind bug

What this PR does / why we need it:

The nodeadmconfig template sets ControlPlaneEndpoint.Host from the AWSManagedCluster object. However, something intermittently resets ControlPlaneEndpoint.Host on the AWSManagedCluster object. The value is present when the first nodepool is created, but it is gone when we try to add another nodepool later. (I have not investigated what removes it.)

As a result, nodeadmconfig creation fails intermittently. CAPA reports the following error:

E1017 03:53:22.998693       1 controller.go:347] "Reconciler error" err="API server endpoint is required for nodeadm" controller="nodeadmconfig" controllerGroup="bootstrap.cluster.x-k8s.io" controllerKind="NodeadmConfig" NodeadmConfig="default/shalin-eks-test-cnaos-dedicated-8gb2t-t6646-lp5km" namespace="default" name="shalin-eks-test-cnaos-dedicated-8gb2t-t6646-lp5km" reconcileID="312659d4-3b7d-43e1-a372-115325cb918d"

The following logs come from the code in our CAPA fork that fetches ControlPlaneEndpoint.Host from the AWSManagedCluster. Note how the endpoint is reset to an empty string at some point.

I1017 03:56:46.131771       1 nodeadmconfig_controller.go:246] "Generating nodeadm userdata" controller="nodeadmconfig" controllerGroup="bootstrap.cluster.x-k8s.io" controllerKind="NodeadmConfig" NodeadmConfig="default/shalin-eks-md-0-pcw8g-5lpp7-pjrsc" namespace="default" name="shalin-eks-md-0-pcw8g-5lpp7-pjrsc" reconcileID="bdaa88ab-43df-42d0-b784-51bc2c133641" cluster="default_shalin-eks-7mf7c" endpoint=""
I1017 03:57:27.093679       1 nodeadmconfig_controller.go:246] "Generating nodeadm userdata" controller="nodeadmconfig" controllerGroup="bootstrap.cluster.x-k8s.io" controllerKind="NodeadmConfig" NodeadmConfig="default/shalin-eks-md-0-pcw8g-5lpp7-pjrsc" namespace="default" name="shalin-eks-md-0-pcw8g-5lpp7-pjrsc" reconcileID="50ce47ca-5bd8-4d5b-a166-dcb030b4628a" cluster="default_shalin-eks-7mf7c" endpoint="https://7BFB78C8C07464C26D46F5400168ED72.gr7.us-west-2.eks.amazonaws.com"
I1017 03:57:27.113737       1 nodeadmconfig_controller.go:246] "Generating nodeadm userdata" controller="nodeadmconfig" controllerGroup="bootstrap.cluster.x-k8s.io" controllerKind="NodeadmConfig" NodeadmConfig="default/shalin-eks-md-0-pcw8g-5lpp7-pjrsc" namespace="default" name="shalin-eks-md-0-pcw8g-5lpp7-pjrsc" reconcileID="fd37b849-0909-49f1-83d1-3fdfb1d24a84" cluster="default_shalin-eks-7mf7c" endpoint="https://7BFB78C8C07464C26D46F5400168ED72.gr7.us-west-2.eks.amazonaws.com"
I1017 03:57:28.762807       1 nodeadmconfig_controller.go:246] "Generating nodeadm userdata" controller="nodeadmconfig" controllerGroup="bootstrap.cluster.x-k8s.io" controllerKind="NodeadmConfig" NodeadmConfig="default/shalin-eks-test-cnaos-dedicated-8gb2t-t6646-lp5km" namespace="default" name="shalin-eks-test-cnaos-dedicated-8gb2t-t6646-lp5km" reconcileID="aedc3cca-cf31-41e7-9559-d67c4b202e34" cluster="default_shalin-eks-7mf7c" endpoint=""
I1017 03:58:37.484125       1 nodeadmconfig_controller.go:246] "Generating nodeadm userdata" controller="nodeadmconfig" controllerGroup="bootstrap.cluster.x-k8s.io" controllerKind="NodeadmConfig" NodeadmConfig="default/shalin-eks-test-cnaos-dedicated-8gb2t-t6646-lp5km" namespace="default" name="shalin-eks-test-cnaos-dedicated-8gb2t-t6646-lp5km" reconcileID="90bd7d4a-26e7-4c0e-94ad-680ae20094e5" cluster="default_shalin-eks-7mf7c" endpoint=""

This PR fetches the ControlPlaneEndpoint from the Cluster object, where it is always present.
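
A minimal sketch of the idea, not the actual diff: the types below are simplified, hypothetical stand-ins for the Cluster API and CAPA objects, and `apiServerEndpoint` is an illustrative helper name. Only the error text is taken from the CAPA log above. The point is that the host is read from the Cluster object and the AWSManagedCluster copy is not consulted.

```go
package main

import (
	"errors"
	"fmt"
)

// APIEndpoint is a hypothetical stand-in for the endpoint type used by
// Cluster API objects (host plus port).
type APIEndpoint struct {
	Host string
	Port int32
}

// Cluster is a simplified stand-in for the Cluster API Cluster object.
type Cluster struct {
	Name                 string
	ControlPlaneEndpoint APIEndpoint
}

// AWSManagedCluster is a simplified stand-in for the CAPA object whose
// endpoint was observed to be empty at times.
type AWSManagedCluster struct {
	Name                 string
	ControlPlaneEndpoint APIEndpoint
}

// errEndpointRequired mirrors the error message seen in the CAPA log.
var errEndpointRequired = errors.New("API server endpoint is required for nodeadm")

// apiServerEndpoint (hypothetical helper) returns the host recorded on the
// Cluster object and fails if that host is missing.
func apiServerEndpoint(cluster *Cluster) (string, error) {
	if cluster == nil || cluster.ControlPlaneEndpoint.Host == "" {
		return "", errEndpointRequired
	}
	return cluster.ControlPlaneEndpoint.Host, nil
}

func main() {
	// The managed cluster has lost its endpoint, as in the logs above.
	managed := AWSManagedCluster{Name: "demo"}
	// The Cluster object still carries it (placeholder host, not a real one).
	cluster := Cluster{
		Name:                 "demo",
		ControlPlaneEndpoint: APIEndpoint{Host: "https://example.invalid", Port: 443},
	}

	fmt.Printf("AWSManagedCluster host: %q\n", managed.ControlPlaneEndpoint.Host)

	host, err := apiServerEndpoint(&cluster)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("Cluster host used for nodeadm userdata: %q\n", host)
}
```

Keeping the empty-host check means a genuinely missing endpoint still surfaces as a reconcile error instead of producing userdata with a blank API server address.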

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

We still need to investigate what resets ControlPlaneEndpoint.Host on the AWSManagedCluster. That can be fixed in a separate PR.

Checklist:

  • squashed commits
  • includes documentation
  • includes emoji in title
  • adds unit tests
  • adds or updates e2e tests

Release note:


@supershal supershal changed the title fix: retrieve controlplane host from Cluster object 🐛 fix: retrieve controlplane host from Cluster object Oct 21, 2025

@dkoshkin dkoshkin left a comment

Please create a ticket to find the proper fix, but this works for now.
Thanks for the quick turnaround!

@supershal supershal changed the title 🐛 fix: retrieve controlplane host from Cluster object (🐛 ) fix: retrieve controlplane host from Cluster object Oct 22, 2025
@supershal supershal changed the title (🐛 ) fix: retrieve controlplane host from Cluster object 🐛fix: retrieve controlplane host from Cluster object Oct 22, 2025
@supershal

Created a Jira ticket in our internal system to investigate the root cause.

@supershal supershal merged commit 47da5ae into nutanix-cloud-native:main Oct 22, 2025
16 of 19 checks passed
faiq pushed a commit that referenced this pull request Nov 26, 2025
* deps: upgrade Kubernetes dependencies to v0.33.4

- Update core Kubernetes dependencies from v0.32.3 to v0.33.4:
  - k8s.io/api, k8s.io/apimachinery, k8s.io/client-go
  - k8s.io/apiserver, k8s.io/cli-runtime, k8s.io/kubectl
  - k8s.io/apiextensions-apiserver, k8s.io/component-base
- Upgrade prometheus/client_golang from v1.19.1 to v1.22.0
- Update cel.dev/expr from v0.18.0 to v0.19.1
- Upgrade google/cel-go from v0.22.0 to v0.23.2
- Update golang.org/x/time from v0.8.0 to v0.9.0
- Upgrade gRPC from v1.67.3 to v1.68.1
- Update OpenTelemetry packages to v1.33.0
- Refresh k8s.io/utils and other indirect dependencies
- Update kube-openapi and structured-merge-diff versions

* deps: update cluster-api to v1.11.1 and controller-runtime to v0.21.0

- Upgrade cluster-api from v1.10.2 to v1.11.1
- Upgrade controller-runtime from v0.20.4 to v0.21.0
- Update various golang.org/x/* packages
- Update testing dependencies (ginkgo, gomega)
- Update OpenTelemetry and other indirect dependencies

* WIP no IDE errors

* WIP IDE Errors

* Fix go dependencies

Signed-off-by: Borja Clemente <[email protected]>

* Update imports, code and generations to CAPI 1.11

- Update all imports to v1beta2 types except for conditions staying in
  v1beta1.
- Adapt source code to work with v1beta2 and deprecated conditions.
- Manually update conversions.

Signed-off-by: Borja Clemente <[email protected]>

* Update linting pkg alias and fix broken imports blocks

Signed-off-by: Borja Clemente <[email protected]>

* Remove unnecessary Paused constants

Signed-off-by: Borja Clemente <[email protected]>

* Fix import aliases

Signed-off-by: Borja Clemente <[email protected]>

* Fix broken imports

Signed-off-by: Borja Clemente <[email protected]>

* Revert public APIs back to v1beta1 while internally using v1beta2

Introducing v1beta2 on public types is a breaking change so they have to
stay in v1beta1. Internally though, migration to v1beta2 is happening
(except for conditions).

Signed-off-by: Borja Clemente <[email protected]>

* Revert infrav1 conditions to v1beta1 and consolidate imports

Signed-off-by: Borja Clemente <[email protected]>

* Consolidate conditions imports and fix linting

Signed-off-by: Borja Clemente <[email protected]>

* Fix regression in machine deployments without failure domain set

Signed-off-by: Borja Clemente <[email protected]>

* Revert missing public APIs to v1beta1

Signed-off-by: Borja Clemente <[email protected]>

* Consolidate infrav1beta1 imports into infrav1

Signed-off-by: Borja Clemente <[email protected]>

* Remove unused conditions constants

Signed-off-by: Borja Clemente <[email protected]>

* Fix setting wrong condition type

Signed-off-by: Borja Clemente <[email protected]>

* Cast v1beta1 conditions instead of creating a new constant

Signed-off-by: Borja Clemente <[email protected]>

* Revert changed public APIs and adapt internally to v1beta2

Signed-off-by: Borja Clemente <[email protected]>

* Resolve conflicts with main

Signed-off-by: Borja Clemente <[email protected]>

* Add deprecated CAPI imports linter rule

Add rule to allow using deprecated v1beta1 CAPI APIs and remove linter
comments everywhere.

Signed-off-by: Borja Clemente <[email protected]>

* Apply review corrections

Signed-off-by: Borja Clemente <[email protected]>

* Adjust e2e and metadata versions

Signed-off-by: Borja Clemente <[email protected]>

* Apply review feedback on awscluster_webhook

Signed-off-by: Borja Clemente <[email protected]>

* Fix unit tests

Signed-off-by: Borja Clemente <[email protected]>

* Review feedback

Signed-off-by: Borja Clemente <[email protected]>

* Apply review feedback

Signed-off-by: Borja Clemente <[email protected]>

* Add CRD RBAC to the awsmachine controller

Signed-off-by: Borja Clemente <[email protected]>

* e2e: add v1beta1 CAPI scheme to clients and adjust modifyFunc test to use the new field name

* Fix linting issues

Signed-off-by: Borja Clemente <[email protected]>

* Fix nodeDrainTimeoutSeconds field in clusterclass test

Signed-off-by: Borja Clemente <[email protected]>

* e2e: fix contract for CAPI

* fix path again

* e2e: fix contract for capa 9.99.99 (#3)

* e2e: use correct type for setting field (#4)

* rosa: deflake unit test (#5)

* rosa: deflake unit test

* fixup

* e2e: fix config metadata and contract version pinning (#6)

* e2e: fix config metadata file path

Signed-off-by: Borja Clemente <[email protected]>

* Bump KCP Template for clusterclass changes (#7)

---------

Signed-off-by: Borja Clemente <[email protected]>
Co-authored-by: Bryan Cox <[email protected]>
Co-authored-by: Christian Schlotter <[email protected]>