Commit faf9f9d

Fix NAR/PR state toggling when BMH is in servicing error
When a BMH entered servicing error during Day 2 configuration, the NodeAllocationRequest and ProvisioningRequest kept toggling between the Failed and ConfigurationUpdateRequested states.

Root cause: observedGeneration was only updated on configuration success. When configuration failed due to BMH errors, the generation mismatch persisted, causing the FSM to continuously re-trigger spec change handling.

Fix: Always update observedGeneration after processing a spec change, regardless of success or failure.

Signed-off-by: Tao Liu <[email protected]>
1 parent de84294 commit faf9f9d

1 file changed: +12 -4 lines changed

hwmgr-plugins/metal3/controller/metal3_nodeallocationrequest_controller.go

Lines changed: 12 additions & 4 deletions
@@ -724,10 +724,18 @@ func (r *NodeAllocationRequestReconciler) handleNodeAllocationRequestSpecChanged
 			err = updateErr
 		}
 	}
-	if status == metav1.ConditionTrue && reason == string(hwmgmtv1alpha1.ConfigApplied) {
-		if err := hwmgrutils.UpdateNodeAllocationRequestPluginStatus(ctx, r.Client, nodeAllocationRequest); err != nil {
-			return hwmgrutils.RequeueWithShortInterval(), fmt.Errorf("failed to update hwMgrPlugin observedGeneration Status: %w", err)
-		}
+
+	// Update observedGeneration to acknowledge the spec change, regardless of success or failure.
+	// This prevents infinite retry loops when configuration fails due to hardware errors.
+	// If we only update on success, a persistent hardware error (like BMH servicing error)
+	// will cause the FSM to continuously detect a spec change and re-trigger configuration attempts.
+	if updateErr := hwmgrutils.UpdateNodeAllocationRequestPluginStatus(ctx, r.Client, nodeAllocationRequest); updateErr != nil {
+		r.Logger.ErrorContext(ctx, "Failed to update hwMgrPlugin observedGeneration Status",
+			slog.String("nodeAllocationRequest", nodeAllocationRequest.Name),
+			slog.String("error", updateErr.Error()))
+		// Return error to trigger requeue
+		return hwmgrutils.RequeueWithShortInterval(),
+			fmt.Errorf("failed to update hwMgrPlugin observedGeneration Status: %w", updateErr)
 	}
 }
 
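The toggling described in the commit message can be reproduced with a small standalone model. This is only a sketch, not the plugin's actual FSM: the `request` type, `reconcile`, and `countTriggers` are hypothetical names invented for illustration, and the hardware failure is simulated by a callback that always returns false. It compares acknowledging the generation only on success (the old behavior) with acknowledging it unconditionally (the behavior after this commit).

```go
package main

import "fmt"

// request is a hypothetical stand-in for a NodeAllocationRequest; only the
// two generation counters matter for this illustration.
type request struct {
	generation         int64 // bumped whenever the spec changes
	observedGeneration int64 // last generation the controller acknowledged
}

// specChanged models the generation-mismatch check that re-triggers
// spec change handling.
func specChanged(r *request) bool {
	return r.generation != r.observedGeneration
}

// reconcile runs one pass. configure reports whether the (simulated) hardware
// configuration succeeded. ackOnFailure selects the old behavior (false) or
// the fixed behavior (true). It returns true if spec change handling ran.
func reconcile(r *request, configure func() bool, ackOnFailure bool) bool {
	if !specChanged(r) {
		return false
	}
	ok := configure()
	if ok || ackOnFailure {
		// Acknowledge the spec change so the next pass sees no mismatch.
		r.observedGeneration = r.generation
	}
	return true
}

// countTriggers reconciles n times against hardware that always fails and
// returns how many passes re-ran spec change handling.
func countTriggers(ackOnFailure bool, n int) int {
	r := &request{generation: 2, observedGeneration: 1}
	alwaysFail := func() bool { return false }
	triggers := 0
	for i := 0; i < n; i++ {
		if reconcile(r, alwaysFail, ackOnFailure) {
			triggers++
		}
	}
	return triggers
}

func main() {
	fmt.Println("ack only on success:", countTriggers(false, 5)) // prints "ack only on success: 5"
	fmt.Println("ack always:", countTriggers(true, 5))           // prints "ack always: 1"
}
```

In this model the failed configuration is attempted once per spec change when the generation is always acknowledged, whereas the success-only variant re-enters spec change handling on every pass for as long as the hardware error persists.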
