@tliu2021 tliu2021 commented Nov 5, 2025

This commit improves how the system handles hardware timeouts.

Better Timeout Handling: Moves hardware timeout detection into the Metal3 plugin (where it belongs) instead of the O-Cloud Manager. This keeps the code better organized.

Annotation Cleanup: Adds new logic to automatically clean up temporary metadata (annotations) from hardware objects, especially after a timeout. This prevents stale data from causing errors.

Day 2 Retry Logic: Improves the logic for retrying failed hardware configurations.

Updates documentation to explain the new timeout and retry features.

Assisted-by: Cursor/claude-4-sonnet

Move hardware timeout detection from O-Cloud Manager to Metal3 plugin.
The plugin now handles timeout detection and reports timeouts via
callbacks.

Signed-off-by: Tao Liu <[email protected]>
Enable retry of hardware configuration operations after timeouts or
failures by allowing spec changes to trigger new configuration attempts.

Key changes:
- Add webhook validation to prevent spec updates for Day 0 provisioning
  timeouts/failures (requires delete and recreate)
- Track ObservedConfigTransactionId to detect configuration spec changes
- Skip timeout checking when ConfigTransactionId changes (indicates retry)
- Add waitingForConfigStart logic to handle new configuration attempts
- Update shouldUpdateHardwareStatus to properly handle terminal states
- Add Day 2 retry test scenarios

Assisted-by: Cursor/claude-4-sonnet
Signed-off-by: Tao Liu <[email protected]>
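The ConfigTransactionId comparison described in the key changes can be sketched as below. This is a minimal illustration, not the actual plugin code; the type and field names are assumptions standing in for the real CRD types.

```go
package main

// Illustrative stand-ins for the real CRD spec/status types (names assumed).
type NodeAllocationRequestSpec struct {
	ConfigTransactionId int64
}

type NodeAllocationRequestStatus struct {
	ObservedConfigTransactionId int64
}

// isNewConfigAttempt reports whether the spec carries a transaction ID the
// controller has not yet observed. A changed ID signals a Day 2 retry:
// timeout checking is skipped and a fresh configuration attempt begins.
func isNewConfigAttempt(spec NodeAllocationRequestSpec, status NodeAllocationRequestStatus) bool {
	return spec.ConfigTransactionId != status.ObservedConfigTransactionId
}
```

Note the simplification in the last commit of this PR: because ObservedConfigTransactionId is a plain int64 rather than *int64, no nil check is needed before comparing.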
@openshift-ci openshift-ci bot requested review from donpenney and rauhersu November 5, 2025 16:19

openshift-ci bot commented Nov 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bartwensley for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tliu2021 commented Nov 5, 2025

/hold
testing-in-progress

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 5, 2025
@donpenney donpenney requested review from browsell and sakhoury and removed request for rauhersu November 5, 2025 17:14
pluginNamespace string,
bmh *metal3v1alpha1.BareMetalHost, profileName string, postInstall bool) (bool, error) {

// Clear any existing update annotations to ensure clean state
Collaborator

Is it possible this is called if an update is already in progress?

Collaborator Author

As a defensive measure, we explicitly clear any potential annotations before starting instead of assuming a clean state.

Collaborator

My question is: can we reach this code path if an update is already in progress?

Collaborator Author

Yes, this is for retry scenarios: if a previous request times out while an update is already in progress, the annotation should be cleared when that timeout is detected.
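The defensive cleanup discussed in this thread can be sketched as follows. The annotation key prefix and function name are illustrative assumptions; the real plugin operates on BareMetalHost metadata via the Kubernetes API.

```go
package main

import "strings"

// Illustrative annotation prefix; the real plugin uses its own keys.
const updateAnnotationPrefix = "example.io/update-"

// clearUpdateAnnotations removes any update-related annotations left over
// from a previous attempt (e.g. one that timed out mid-update) instead of
// assuming a clean state. It mutates the map in place and reports whether
// anything was removed.
func clearUpdateAnnotations(annotations map[string]string) bool {
	removed := false
	for k := range annotations {
		if strings.HasPrefix(k, updateAnnotationPrefix) {
			delete(annotations, k)
			removed = true
		}
	}
	return removed
}
```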

@donpenney
Collaborator

It looks like the CI job is failing because some generated code changes haven't been committed.


// Propagate timeout state to AllocatedNodes
if conditionReason == hwmgmtv1alpha1.TimedOut {
if err := propagateTimeoutToAllocatedNodes(
Collaborator

Why is this in the updateConditionAndSendCallback function, as opposed to being in the timeout handling code in HandleNodeAllocationRequest?

Collaborator Author

This keeps the timeout handler focused on detection and orchestration, while centralizing condition updates in updateConditionAndSendCallback. As discussed, propagateTimeoutToAllocatedNodes() is not needed and will be removed.

@tliu2021
Collaborator Author

/hold

@browsell
Collaborator

/test scorecard

- Consolidate hardware operation timestamps into single
  HardwareOperationStartTime field
- Simplify ObservedConfigTransactionId type (*int64 -> int64)
- Improve error handling with errors.Join and add elapsed duration to logs
- Add test coverage for edge cases and race conditions

Signed-off-by: Tao Liu <[email protected]>
Updates documentation to explain the new timeout and retry features.

Signed-off-by: Tao Liu <[email protected]>
@tliu2021
Collaborator Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 14, 2025

tliu2021 commented Nov 17, 2025

/hold
I found an issue: the NodeAllocationRequest and ProvisioningRequest states toggle when the BMH is in servicing error.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 17, 2025
When a BMH entered servicing error during Day 2 configuration,
the NodeAllocationRequest and ProvisioningRequest toggled
between the Failed and ConfigurationUpdateRequested states.

Root cause: The code had logic to skip status aggregation and
preserve terminal conditions, but:
1. It only handled TimedOut states, not Failed states
2. It didn't update observedGeneration before returning early

Fix: Update observedGeneration in two locations:
1. Before early return when skipping aggregation for terminal states
2. After status aggregation when reaching a terminal state
   (True, Failed, or TimedOut)

This ensures that once a terminal state is reached, observedGeneration
is always updated, preventing false spec change detection.

Signed-off-by: Tao Liu <[email protected]>
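The fix described in this commit can be sketched as below: once a condition reaches a terminal state, record the observed generation so a later reconcile does not mistake the unchanged spec for a new change. Type and function names here are illustrative assumptions.

```go
package main

// Status is an illustrative stand-in for the resource's status subfields.
type Status struct {
	ObservedGeneration int64
	ConditionReason    string
}

// Terminal condition reasons per the commit message: the fix covers Failed
// in addition to TimedOut (the original bug handled only TimedOut).
const (
	reasonCompleted = "True"
	reasonFailed    = "Failed"
	reasonTimedOut  = "TimedOut"
)

func isTerminal(reason string) bool {
	switch reason {
	case reasonCompleted, reasonFailed, reasonTimedOut:
		return true
	}
	return false
}

// syncObservedGeneration updates ObservedGeneration whenever the current
// reason is terminal. Calling it both before the early return (when status
// aggregation is skipped) and after aggregation reaches a terminal state
// ensures the generation is always recorded, preventing false spec-change
// detection on the next reconcile.
func syncObservedGeneration(st *Status, generation int64) {
	if isTerminal(st.ConditionReason) {
		st.ObservedGeneration = generation
	}
}
```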
@tliu2021
Collaborator Author

/test scorecard

@tliu2021
Collaborator Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 18, 2025