Move hardware timeout detection to Metal3 plugin #1369
base: main
Conversation
Move hardware timeout detection from O-Cloud Manager to Metal3 plugin. The plugin now handles timeout detection and reports timeouts via callbacks.

Signed-off-by: Tao Liu <[email protected]>
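As a rough illustration of the new flow, here is a minimal sketch of plugin-side timeout detection that reports through a callback. All names in it are hypothetical stand-ins, not the plugin's actual API:

```go
// Minimal sketch: the plugin itself decides when a hardware operation has
// timed out and reports it back to the O-Cloud Manager via a callback,
// instead of the manager polling for timeouts.
package main

import (
	"fmt"
	"time"
)

// TimeoutCallback stands in for the callback the plugin invokes to report
// a timeout back to the O-Cloud Manager.
type TimeoutCallback func(requestName string)

// CheckOperationTimeout reports a timeout through cb when the hardware
// operation that began at start has run longer than limit.
func CheckOperationTimeout(requestName string, start time.Time, limit time.Duration, cb TimeoutCallback) bool {
	if time.Since(start) > limit {
		cb(requestName)
		return true
	}
	return false
}

func main() {
	start := time.Now().Add(-45 * time.Minute) // pretend the operation began 45m ago
	CheckOperationTimeout("nar-sample", start, 30*time.Minute, func(name string) {
		fmt.Printf("timeout reported for %s\n", name)
	})
}
```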
Enable retry of hardware configuration operations after timeouts or failures by allowing spec changes to trigger new configuration attempts. Key changes:
- Add webhook validation to prevent spec updates for Day 0 provisioning timeouts/failures (requires delete and recreate)
- Track ObservedConfigTransactionId to detect configuration spec changes
- Skip timeout checking when ConfigTransactionId changes (indicates a retry; see the sketch after this list)
- Add waitingForConfigStart logic to handle new configuration attempts
- Update shouldUpdateHardwareStatus to properly handle terminal states
- Add Day 2 retry test scenarios

Assisted-by: Cursor/claude-4-sonnet
Signed-off-by: Tao Liu <[email protected]>
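A minimal sketch of the retry detection in the transaction-ID bullets above, assuming plain int64 values as in this PR; the function name is hypothetical:

```go
// isConfigRetry sketches the detection: a spec change bumps
// ConfigTransactionId, so a mismatch with the ObservedConfigTransactionId
// recorded in status means a new attempt has been requested and timeout
// checking for the old attempt should be skipped.
func isConfigRetry(specConfigTransactionID, observedConfigTransactionID int64) bool {
	return specConfigTransactionID != observedConfigTransactionID
}
```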
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/hold
api/hardwaremanagement/plugins/v1alpha1/node_allocation_requests.go (outdated review thread, resolved)
```go
	pluginNamespace string,
	bmh *metal3v1alpha1.BareMetalHost, profileName string, postInstall bool) (bool, error) {

	// Clear any existing update annotations to ensure clean state
```
Is it possible this is called if an update is already in progress?
As a defensive measure, we explicitly clear any potential annotations before starting instead of assuming a clean state.
My question is: can we reach this code path if an update is already in progress?
Yes, this is for retry scenarios: if a previous request times out while an update is already in progress, the annotation should be cleared when that timeout is detected.
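A simplified sketch of that defensive cleanup, assuming a placeholder annotation key (the real key and the client update call live in the plugin):

```go
package metal3plugin // illustrative placement

import (
	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
)

const updateAnnotationKey = "example.com/update-in-progress" // hypothetical key

// clearUpdateAnnotations deletes any stale update annotation from the BMH
// before a new configuration attempt starts, so a retry after a timeout is
// not confused by metadata left over from the interrupted update. The
// caller is expected to persist the change via the API client afterwards.
func clearUpdateAnnotations(bmh *metal3v1alpha1.BareMetalHost) {
	if bmh.Annotations == nil {
		return // nothing to clear
	}
	delete(bmh.Annotations, updateAnnotationKey)
}
```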
It looks like the CI job is failing because some generated code changes haven't been committed.
```go
	// Propagate timeout state to AllocatedNodes
	if conditionReason == hwmgmtv1alpha1.TimedOut {
		if err := propagateTimeoutToAllocatedNodes(
```
Why is this in the updateConditionAndSendCallback function, as opposed to being in the timeout handling code in HandleNodeAllocationRequest?
This keeps the timeout handler focused on detection and orchestration, while centralizing condition updates in updateConditionAndSendCallback. As discussed, propagateTimeoutToAllocatedNodes() is not needed and will be removed.
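For context, the separation being described might look roughly like this. Types, signatures, and the stubbed body are assumptions; only the function name and the TimedOut reason come from this thread:

```go
// Sketch of the division of labour: the timeout handler only detects and
// orchestrates, while every condition update (and its callback) goes
// through updateConditionAndSendCallback.
package main

import (
	"context"
	"fmt"
)

type NodeAllocationRequest struct{ Name string }

const TimedOut = "TimedOut" // stands in for hwmgmtv1alpha1.TimedOut

// updateConditionAndSendCallback centralizes condition updates and the
// callback to the O-Cloud Manager (stubbed here).
func updateConditionAndSendCallback(ctx context.Context, nar *NodeAllocationRequest, reason string) error {
	fmt.Printf("set condition %s on %s and send callback\n", reason, nar.Name)
	return nil
}

// handleTimeout stays focused on detection and orchestration and delegates
// all status writes to the central helper.
func handleTimeout(ctx context.Context, nar *NodeAllocationRequest) error {
	return updateConditionAndSendCallback(ctx, nar, TimedOut)
}

func main() {
	_ = handleTimeout(context.Background(), &NodeAllocationRequest{Name: "nar-sample"})
}
```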
/hold
Force-pushed from b907e10 to 15c8a98
/test scorecard
- Consolidate hardware operation timestamps into a single HardwareOperationStartTime field (see the sketch below)
- Simplify the ObservedConfigTransactionId type (*int64 -> int64)
- Improve error handling with errors.Join and add elapsed duration to logs
- Add test coverage for edge cases and race conditions

Signed-off-by: Tao Liu <[email protected]>
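The consolidated status fields might look roughly like this; the struct name, placement, and JSON tags are illustrative, with only the two field names and the *int64 -> int64 simplification taken from the commit message:

```go
package v1alpha1 // illustrative placement

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Illustrative status fragment, not the real type.
type NodeAllocationRequestStatus struct {
	// Single start time covering the in-flight hardware operation,
	// replacing the previous per-operation timestamps.
	HardwareOperationStartTime metav1.Time `json:"hardwareOperationStartTime,omitempty"`

	// Last configuration transaction ID the controller acted on, now a
	// plain int64 rather than *int64.
	ObservedConfigTransactionId int64 `json:"observedConfigTransactionId,omitempty"`
}
```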
Updates documentation to explain the new timeout and retry features.

Signed-off-by: Tao Liu <[email protected]>
Force-pushed from 15c8a98 to de84294
/hold cancel
/hold
When a BMH enters servicing error during Day 2 configuration, the NodeAllocationRequest and ProvisioningRequest were toggling between Failed and ConfigurationUpdateRequested states.

Root cause: the code had logic to skip status aggregation and preserve terminal conditions, but:
1. It only handled TimedOut states, not Failed states
2. It didn't update observedGeneration before returning early

Fix: update observedGeneration in two locations (sketched below):
1. Before the early return when skipping aggregation for terminal states
2. After status aggregation when reaching a terminal state (True, Failed, or TimedOut)

This ensures that once a terminal state is reached, observedGeneration is always updated, preventing false spec-change detection.

Signed-off-by: Tao Liu <[email protected]>
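A condensed sketch of the two update points; the types and helpers are minimal stand-ins, with only the field names and the terminal-state list taken from the commit message:

```go
package sketch

// Minimal stand-in for the real resource type.
type NodeAllocationRequest struct {
	Generation int64
	Status     struct {
		ObservedGeneration int64
		ConditionStatus    string // "True", "Failed", "TimedOut", or in-progress
	}
}

// isTerminal covers True, Failed, and TimedOut, per the commit message.
func isTerminal(nar *NodeAllocationRequest) bool {
	switch nar.Status.ConditionStatus {
	case "True", "Failed", "TimedOut":
		return true
	}
	return false
}

func aggregateStatus(nar *NodeAllocationRequest) { /* stub for the real aggregation */ }

func reconcileStatus(nar *NodeAllocationRequest) {
	if isTerminal(nar) {
		// (1) Early-return path: record the generation even though
		// aggregation is skipped, so an unchanged spec is not later
		// misread as a new configuration attempt.
		nar.Status.ObservedGeneration = nar.Generation
		return
	}
	aggregateStatus(nar)
	if isTerminal(nar) {
		// (2) Aggregation just reached a terminal state: record it here too.
		nar.Status.ObservedGeneration = nar.Generation
	}
}
```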
Force-pushed from faf9f9d to e4aeb63
/test scorecard
/hold cancel
This commit improves how the system handles hardware timeouts.
Better Timeout Handling: Moves hardware timeout detection into the Metal3 plugin (where it belongs) instead of the O-Cloud Manager. This keeps the code better organized.
Annotation Cleanup: Adds new logic to automatically clean up temporary metadata (annotations) from hardware objects, especially after a timeout. This prevents stale data from causing errors.
Day 2 Retry Logic Improvements: Improves the logic for retrying failed hardware configurations.
Documentation Updates: Explains the new timeout and retry features.
Assisted-by: Cursor/claude-4-sonnet