Commit 42f979f

Troubleshooting guide additions
Add troubleshooting guiding on BMH registration, inspection and provisioning errors, and CAPM3 errors related to missing hosts. Signed-off-by: erjavaskivuori <[email protected]>
1 parent e3f17f8 commit 42f979f


1 file changed

+149
-0
lines changed

docs/user-guide/src/troubleshooting.md

Lines changed: 149 additions & 0 deletions
@@ -88,3 +88,152 @@ of a forced deletion of its BareMetalHost object. If valid BMC credentials were
provided, Ironic will keep checking the power state of the host and enforcing
the last requested power state. The only solution is again to delete Ironic's
internal database.

## BMH registration errors

Registration can fail because the BMC credentials are incorrect or missing. The
error is reported in the BareMetalHost's status and in its Events.

Check the output of `kubectl describe bmh <name>` and the recent Events for
details.

Example output:

```text
Normal  RegistrationError  23s  metal3-baremetal-controller  Failed to get
power state for node 67ac51af-a6b3. Error: Redfish exception occurred.
Error: HTTP GET https://192.168.111.1:8000/redfish/v1/Systems/... returned code 401.
```
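
When many hosts are registered, it can help to scan all of them for reported errors in one pass. The following is a minimal sketch, not part of the Metal3 tooling: it assumes the JSON produced by `kubectl get bmh -A -o json` and reads the `status.errorType` and `status.errorMessage` fields. The function name `hosts_with_errors` is hypothetical.

```python
import json


def hosts_with_errors(bmh_list):
    """Return (namespace, name, errorType, errorMessage) for hosts reporting an error.

    `bmh_list` is the parsed JSON of `kubectl get bmh -A -o json`.
    """
    found = []
    for item in bmh_list.get("items", []):
        status = item.get("status", {})
        message = status.get("errorMessage") or ""
        if message:
            meta = item.get("metadata", {})
            found.append((
                meta.get("namespace", ""),
                meta.get("name", ""),
                status.get("errorType", ""),
                message,
            ))
    return found


# Usage (assumes the JSON was saved first, for example with
# `kubectl get bmh -A -o json > bmh.json`):
#
#   with open("bmh.json") as handle:
#       for row in hosts_with_errors(json.load(handle)):
#           print(*row, sep=" | ")
```

Hosts without an error message are skipped, so an empty result means no BareMetalHost currently reports an error.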

## BMH inspection errors

### The host is not able to communicate back results to Ironic

If the host cannot communicate with Ironic, inspection ends in a timeout.
Access to the host's serial logs is necessary to determine the exact issue.

Example output from `kubectl get bmh -A`:

```text
NAMESPACE   NAME     STATE        CONSUMER   ONLINE   ERROR              AGE
metal3      node-1   inspecting              true     inspection error   46m
```

BareMetalHost's events from `kubectl describe bmh <name> -n <namespace>`:

```text
Events:
  Type    Reason             Age    From                         Message
  ----    ------             ----   ----                         -------
  Normal  InspectionStarted  37m    metal3-baremetal-controller  Hardware inspection started
  Normal  InspectionError    7m12s  metal3-baremetal-controller  timeout reached while inspecting the node
```

### Incompatible configuration

This can occur when attempting to use virtual media or UEFI on hardware that
does not support it. The error is shown in the BareMetalHost's status and in
its events.

Example output from `kubectl get bmh -A`:

```text
NAMESPACE   NAME     STATE        CONSUMER   ONLINE   ERROR              AGE
metal3      node-1   inspecting              true     inspection error   8m17s
```

BareMetalHost's events:

```text
Normal  InspectionError  5s  metal3-baremetal-controller  Failed to inspect hardware. Reason: unable to start inspection:
Redfish exception occurred. Error: Setting boot mode to bios failed for node ceec28f5-cedb. Error: HTTP PATCH
https://192.168.111.1:8000/redfish/v1/Systems/... returned code 500.
```

## Provisioning errors

Errors during provisioning will be visible when listing the BareMetalHosts:

```text
NAMESPACE   NAME     STATE          CONSUMER      ONLINE   ERROR                AGE
metal3      node-1   provisioning   test1-dt8j2   true     provisioning error   149m
```

Check BareMetalHost's events for the specific reason.

Wrong image checksum example:

```text
Normal  ProvisioningError  10m  metal3-baremetal-controller  Image provisioning failed: Deploy
step deploy.write_image failed on node df880558-09da. Image failed to verify against checksum.
location: CENTOS_9_NODE_IMAGE.img; image ID: /dev/sda; image checksum: abcd1234; verification checksum: ...
```

No root device found example:

```text
Normal  ProvisioningStarted  15s  metal3-baremetal-controller  Image provisioning started for http://172.22.0.1/images/CENTOS_9_NODE_IMAGE.img
Normal  ProvisioningError    1s   metal3-baremetal-controller  Image provisioning failed: Deploy step deploy.write_image failed on node d25ce8de-914e-4146-a0c0-58825274572d. No suitable device was found for deployment using these hints {'name': 's== /dev/vdb'}
```
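
The two example events point at different parts of the BareMetalHost spec. As a rough illustration, the sketch below maps the message fragments quoted above to the field that is usually worth checking first. The mapping and the name `field_to_check` are assumptions made for this example; the list is not exhaustive, and `spec.image.checksumType` may also be relevant for checksum failures.

```python
# Message fragments copied from the example events above, paired with the
# BareMetalHost spec field to inspect first (an assumption of this sketch).
KNOWN_CAUSES = [
    ("Image failed to verify against checksum", "spec.image.checksum"),
    ("No suitable device was found for deployment", "spec.rootDeviceHints"),
]


def field_to_check(event_message):
    """Return the spec field to inspect for a known message, or None if unknown."""
    for fragment, field in KNOWN_CAUSES:
        if fragment in event_message:
            return field
    return None
```

For messages that match neither fragment the function returns `None`, and the full event text remains the best source of information.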

## No BareMetalHost available or matching

This condition appears in the Metal3Machine's status:

```text
Status:
  Conditions:
    Last Transition Time:  2025-08-15T10:53:05Z
    Message:               No available host found. Requeuing.. Object will be requeued after 30s
    Reason:                AssociateBMHFailed
    Severity:              Error
    Status:                False
    Type:                  AssociateBMH
```

CAPM3 controller logs when there are no available hosts:

```text
I0815 11:10:35.699004 1 metal3machine_manager.go:332] "No available host found. Requeuing." logger="controllers.Metal3Machine.Metal3Machine-controller" metal3-machine="metal3/test-no-match-2" machine="test-no-match-2" cluster="test1" metal3-cluster="test1"
```

CAPM3 controller logs when the annotated host is not found:

```text
I0815 06:08:54.687380 1 metal3machine_manager.go:788] "Annotated host not found" logger="controllers.Metal3Machine.Metal3Machine-controller" metal3-machine="metal3/test1-zxzn7-qvl6n" machine="test1-zxzn7-qvl6n" cluster="test1" metal3-cluster="test1" host="metal3/node-0"
```
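
To see why no host matches, it can be useful to list the hosts that look free. The sketch below is only an approximation of how CAPM3 picks a host, not its actual implementation: it assumes that a usable host is in the `available` (or legacy `ready`) state, has no `spec.consumerRef`, reports no error, and carries the labels requested in the Metal3Machine's `hostSelector.matchLabels`. The function name `selectable_hosts` is hypothetical, and `matchExpressions` are not handled.

```python
def selectable_hosts(bmh_list, match_labels=None):
    """Return "namespace/name" for hosts that look free and match the given labels.

    `bmh_list` is the parsed JSON of `kubectl get bmh -A -o json`.
    """
    match_labels = match_labels or {}
    names = []
    for item in bmh_list.get("items", []):
        meta = item.get("metadata", {})
        spec = item.get("spec", {})
        status = item.get("status", {})
        state = status.get("provisioning", {}).get("state", "")
        labels = meta.get("labels") or {}
        if state not in ("available", "ready"):
            continue  # still inspecting, provisioning, or already provisioned
        if spec.get("consumerRef"):
            continue  # already claimed by another consumer
        if status.get("errorMessage"):
            continue  # host is reporting an error
        if any(labels.get(key) != value for key, value in match_labels.items()):
            continue  # labels do not satisfy the selector
        names.append(f"{meta.get('namespace', '')}/{meta.get('name', '')}")
    return names
```

An empty result with no labels passed suggests that no host is free at all; an empty result only when labels are passed suggests a selector mismatch.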

## Provider ID is missing

This occurs when `cloudProviderEnabled` is set to `true` on the Metal3Cluster
but no external cloud provider is used. The Metal3Machine will remain stuck in
the Provisioning phase.

Example output from `kubectl get metal3machine -A`:

```text
NAMESPACE   NAME                AGE    PROVIDERID                           READY   CLUSTER   PHASE
metal3      test1-82ljr         160m   metal3://metal3/node-0/test1-82ljr   true    test1
metal3      test1-bv9mv-2f8th   35m                                                 test1
```

Metal3Machine's status:

```text
Status:
  Conditions:
    Reason:   NotReady
    Status:   False
    Type:     Available
    Message:  * NodeHealthy: Waiting for Metal3Machine to report spec.providerID
```
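
The affected machines are the ones with an empty PROVIDERID column. A minimal sketch for picking them out of the JSON form of the same listing (`kubectl get metal3machine -A -o json`) follows; the function name `missing_provider_id` is hypothetical.

```python
def missing_provider_id(m3m_list):
    """Return "namespace/name" for Metal3Machines whose spec.providerID is unset."""
    names = []
    for item in m3m_list.get("items", []):
        if not item.get("spec", {}).get("providerID"):
            meta = item.get("metadata", {})
            names.append(f"{meta.get('namespace', '')}/{meta.get('name', '')}")
    return names
```

Applied to the example listing above, this would report only `metal3/test1-bv9mv-2f8th`.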

## `nodeRef` missing

This is a CAPI-level issue. It can be caused by the host failing to boot the
image or the node failing to join the cluster. Access to the node or its serial
logs is needed to determine the exact cause. In particular, cloud-init logs can
help pinpoint the issue.

CAPM3 controller logs:

```text
I0815 11:10:36.545990 1 metal3labelsync_controller.go:150] "Could not find Node Ref on Machine object, will retry" logger="controllers.Metal3LabelSync.metal3-label-sync-controller" metal3-label-sync="metal3/node-0"
```
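
Since the reference lives on the CAPI Machine object, the Machines that have not joined yet can be listed from `kubectl get machines -A -o json`. The sketch below assumes the `status.nodeRef` and `status.phase` fields of the Machine; the function name `machines_without_node_ref` is hypothetical.

```python
def machines_without_node_ref(machine_list):
    """Return (namespace/name, phase) for Machines that have no status.nodeRef yet."""
    pending = []
    for item in machine_list.get("items", []):
        status = item.get("status", {})
        if not status.get("nodeRef"):
            meta = item.get("metadata", {})
            full_name = f"{meta.get('namespace', '')}/{meta.get('name', '')}"
            pending.append((full_name, status.get("phase", "")))
    return pending
```

The reported phase gives a first hint of how far each Machine got before the node failed to join.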
