@tliu2021 tliu2021 commented Nov 5, 2025

This commit improves how the system handles hardware timeouts.

Better Timeout Handling: Moves hardware timeout detection into the Metal3 plugin (where it belongs) instead of the O-Cloud Manager. This keeps the code better organized.

Annotation Cleanup: Adds new logic to automatically clean up temporary metadata (annotations) from hardware objects, especially after a timeout. This prevents stale data from causing errors.

Day 2 Retry Logic: Improves the logic for retrying failed hardware configurations.

Updates documentation to explain the new timeout and retry features.

Assisted-by: Cursor/claude-4-sonnet

Move hardware timeout detection from O-Cloud Manager to Metal3 plugin.
The plugin now handles timeout detection and reports timeouts via
callbacks.

Signed-off-by: Tao Liu <[email protected]>
Enable retry of hardware configuration operations after timeouts or
failures by allowing spec changes to trigger new configuration attempts.

Key changes:
- Add webhook validation to prevent spec updates for Day 0 provisioning
  timeouts/failures (requires delete and recreate)
- Track ObservedConfigTransactionId to detect configuration spec changes
- Skip timeout checking when ConfigTransactionId changes (indicates retry)
- Add waitingForConfigStart logic to handle new configuration attempts
- Update shouldUpdateHardwareStatus to properly handle terminal states
- Add Day 2 retry test scenarios

Assisted-by: Cursor/claude-4-sonnet
Signed-off-by: Tao Liu <[email protected]>
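The ConfigTransactionId comparison described in the key changes can be sketched as below. This is a minimal illustration, not the actual plugin code; the type and field names are assumptions standing in for the real CRD types.

```go
package main

// Illustrative stand-ins for the real CRD spec/status types (names assumed).
type NodeAllocationRequestSpec struct {
	ConfigTransactionId int64
}

type NodeAllocationRequestStatus struct {
	ObservedConfigTransactionId int64
}

// isNewConfigAttempt reports whether the spec carries a transaction ID the
// controller has not yet observed. A changed ID signals a Day 2 retry:
// timeout checking is skipped and a fresh configuration attempt begins.
func isNewConfigAttempt(spec NodeAllocationRequestSpec, status NodeAllocationRequestStatus) bool {
	return spec.ConfigTransactionId != status.ObservedConfigTransactionId
}
```

Note the simplification in the last commit of this PR: because ObservedConfigTransactionId is a plain int64 rather than *int64, no nil check is needed before comparing.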
@openshift-ci openshift-ci bot requested review from donpenney and rauhersu November 5, 2025 16:19

openshift-ci bot commented Nov 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bartwensley for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tliu2021 commented Nov 5, 2025

/hold
testing-in-progress

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 5, 2025
@donpenney donpenney requested review from browsell and sakhoury and removed request for rauhersu November 5, 2025 17:14
pluginNamespace string,
bmh *metal3v1alpha1.BareMetalHost, profileName string, postInstall bool) (bool, error) {

// Clear any existing update annotations to ensure clean state
Collaborator

Is it possible this is called if an update is already in progress?

Collaborator Author

As a defensive measure, we explicitly clear any potential annotations before starting instead of assuming a clean state.

Collaborator

My question is: can we reach this code path if an update is already in progress?

Collaborator Author

Yes, this is for retry scenarios: if a previous request times out while an update is already in progress, the annotation should be cleared when that timeout is detected.
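The defensive cleanup discussed in this thread can be sketched as follows. The annotation key prefix and function name are illustrative assumptions; the real plugin operates on BareMetalHost metadata via the Kubernetes API.

```go
package main

import "strings"

// Illustrative annotation prefix; the real plugin uses its own keys.
const updateAnnotationPrefix = "example.io/update-"

// clearUpdateAnnotations removes any update-related annotations left over
// from a previous attempt (e.g. one that timed out mid-update) instead of
// assuming a clean state. It mutates the map in place and reports whether
// anything was removed.
func clearUpdateAnnotations(annotations map[string]string) bool {
	removed := false
	for k := range annotations {
		if strings.HasPrefix(k, updateAnnotationPrefix) {
			delete(annotations, k)
			removed = true
		}
	}
	return removed
}
```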

@donpenney
Collaborator

It looks like the CI job is failing because some generated code changes haven't been committed.


// Propagate timeout state to AllocatedNodes
if conditionReason == hwmgmtv1alpha1.TimedOut {
if err := propagateTimeoutToAllocatedNodes(
Collaborator

Why is this in the updateConditionAndSendCallback function, as opposed to being in the timeout handling code in HandleNodeAllocationRequest?

Collaborator Author

This keeps the timeout handler focused on detection and orchestration, while centralizing condition updates in updateConditionAndSendCallback. As discussed, propagateTimeoutToAllocatedNodes() is not needed and will be removed.

@tliu2021
Collaborator Author

/hold

@browsell
Collaborator

/test scorecard

- Consolidate hardware operation timestamps into single
  HardwareOperationStartTime field
- Simplify ObservedConfigTransactionId type (*int64 -> int64)
- Improve error handling with errors.Join and add elapsed duration to logs
- Add test coverage for edge cases and race conditions

Signed-off-by: Tao Liu <[email protected]>
Updates documentation to explain the new timeout and retry features.

Signed-off-by: Tao Liu <[email protected]>
@tliu2021
Collaborator Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 14, 2025

tliu2021 commented Nov 17, 2025

/hold
I found an issue: the NodeAllocationRequest and ProvisioningRequest states toggle when the BMH is in servicing error.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 17, 2025
When a BMH entered servicing error during Day 2 configuration,
the NodeAllocationRequest and ProvisioningRequest toggled
between the Failed and ConfigurationUpdateRequested states.

Root cause: The code had logic to skip status aggregation and
preserve terminal conditions, but:
1. It only handled TimedOut states, not Failed states
2. It didn't update observedGeneration before returning early

Fix: Update observedGeneration in two locations:
1. Before early return when skipping aggregation for terminal states
2. After status aggregation when reaching a terminal state
   (True, Failed, or TimedOut)

This ensures that once a terminal state is reached, observedGeneration
is always updated, preventing false spec change detection.

Signed-off-by: Tao Liu <[email protected]>
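The fix described in this commit can be sketched as below: once a condition reaches a terminal state, record the observed generation so a later reconcile does not mistake the unchanged spec for a new change. Type and function names here are illustrative assumptions.

```go
package main

// Status is an illustrative stand-in for the resource's status subfields.
type Status struct {
	ObservedGeneration int64
	ConditionReason    string
}

// Terminal condition reasons per the commit message: the fix covers Failed
// in addition to TimedOut (the original bug handled only TimedOut).
const (
	reasonCompleted = "True"
	reasonFailed    = "Failed"
	reasonTimedOut  = "TimedOut"
)

func isTerminal(reason string) bool {
	switch reason {
	case reasonCompleted, reasonFailed, reasonTimedOut:
		return true
	}
	return false
}

// syncObservedGeneration updates ObservedGeneration whenever the current
// reason is terminal. Calling it both before the early return (when status
// aggregation is skipped) and after aggregation reaches a terminal state
// ensures the generation is always recorded, preventing false spec-change
// detection on the next reconcile.
func syncObservedGeneration(st *Status, generation int64) {
	if isTerminal(st.ConditionReason) {
		st.ObservedGeneration = generation
	}
}
```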
@tliu2021
Collaborator Author

/test scorecard

@tliu2021
Collaborator Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 18, 2025