Commit a70ab73

Troubleshooting guide additions
Add troubleshooting guidance on BMH registration, inspection and provisioning errors, and CAPM3 errors related to missing hosts. Signed-off-by: erjavaskivuori <[email protected]>
1 parent e3f17f8 commit a70ab73


1 file changed

+148
-0
lines changed

docs/user-guide/src/troubleshooting.md

Lines changed: 148 additions & 0 deletions
@@ -88,3 +88,151 @@ of a forced deletion of its BareMetalHost object. If valid BMC credentials were
provided, Ironic will keep checking the power state of the host and enforcing
the last requested power state. The only solution is again to delete
Ironic's internal database.

## BMH registration errors

Registration fails when the BMC credentials are wrong or missing. The error
shows up in the BareMetalHost's status and as Events.

Check both `kubectl describe bmh <name>` and recent Events for details.

Example output:

```text
Normal  RegistrationError  23s  metal3-baremetal-controller  Failed to get
power state for node 67ac51af-a6b3. Error: Redfish exception occurred.
Error: HTTP GET https://192.168.111.1:8000/redfish/v1/Systems/... returned code 401.
```

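If the Events have already expired, the same error is usually still recorded on the host object. The following is a minimal sketch, assuming the BareMetalHost exposes `status.errorType`, `status.errorMessage` and `spec.bmc.credentialsName`. The helper names `bmh_error` and `bmh_credentials_secret` are hypothetical and only wrap `kubectl get`:

```shell
# Hypothetical helper: print the error type and message recorded on a BareMetalHost.
# Usage: bmh_error <name> <namespace>
bmh_error() {
  kubectl get bmh "$1" -n "$2" \
    -o jsonpath='{.status.errorType}{": "}{.status.errorMessage}{"\n"}'
}

# Hypothetical helper: print the name of the Secret that holds the BMC credentials.
# Usage: bmh_credentials_secret <name> <namespace>
bmh_credentials_secret() {
  kubectl get bmh "$1" -n "$2" \
    -o jsonpath='{.spec.bmc.credentialsName}{"\n"}'
}
```

For example, `bmh_error node-1 metal3` prints the recorded error, and `bmh_credentials_secret node-1 metal3` prints the Secret whose username and password should be verified. An HTTP 401 from the BMC, as in the output above, means the BMC rejected the credentials it was given.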
## BMH inspection errors

### The host is not able to communicate back results to Ironic

If the host cannot communicate with Ironic, inspection ends in a timeout. Access
to serial logs is needed to determine the exact issue.

Example output from `kubectl get bmh -A`:

```text
NAMESPACE   NAME     STATE        CONSUMER   ONLINE   ERROR              AGE
metal3      node-1   inspecting              true     inspection error   46m
```

BareMetalHost's events from `kubectl describe bmh <name> -n <namespace>`:

```text
Events:
  Type    Reason             Age    From                         Message
  ----    ------             ----   ----                         -------
  Normal  InspectionStarted  37m    metal3-baremetal-controller  Hardware inspection started
  Normal  InspectionError    7m12s  metal3-baremetal-controller  timeout reached while inspecting the node
```

### Incompatible configuration

This can happen when the host is configured to use virtual media or UEFI on
hardware that does not support it. The error shows in the status and events.

Example output from `kubectl get bmh -A`:

```text
NAMESPACE   NAME     STATE        CONSUMER   ONLINE   ERROR              AGE
metal3      node-1   inspecting              true     inspection error   8m17s
```

BareMetalHost's events:

```text
Normal  InspectionError  5s  metal3-baremetal-controller  Failed to inspect hardware. Reason: unable to start inspection:
Redfish exception occurred. Error: Setting boot mode to bios failed for node ceec28f5-cedb. Error: HTTP PATCH
https://192.168.111.1:8000/redfish/v1/Systems/... returned code 500.
```

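To see which settings the host was asked to use, the BMC address and boot mode can be read from the BareMetalHost spec. This is a minimal sketch, assuming the fields `spec.bmc.address` and `spec.bootMode`. The helper name `bmh_boot_config` is hypothetical:

```shell
# Hypothetical helper: print the BMC address and the requested boot mode,
# separated by a tab, for one BareMetalHost.
# Usage: bmh_boot_config <name> <namespace>
bmh_boot_config() {
  kubectl get bmh "$1" -n "$2" \
    -o jsonpath='{.spec.bmc.address}{"\t"}{.spec.bootMode}{"\n"}'
}
```

The scheme of the BMC address shows whether a virtual media driver was requested. An empty second column means `bootMode` is unset and the operator's default applies.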
## Provisioning errors

Errors during provisioning will be visible when listing the BareMetalHosts:

```text
NAMESPACE   NAME     STATE          CONSUMER      ONLINE   ERROR                AGE
metal3      node-1   provisioning   test1-dt8j2   true     provisioning error   149m
```

Check the BareMetalHost's events for the specific reason.

Wrong image checksum example:

```text
Normal  ProvisioningError  10m  metal3-baremetal-controller  Image provisioning failed: Deploy
step deploy.write_image failed on node df880558-09da. Image failed to verify against checksum.
location: CENTOS_9_NODE_IMAGE.img; image ID: /dev/sda; image checksum: abcd1234; verification checksum: ...
```

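A checksum mismatch can be confirmed independently by hashing a local copy of the image and comparing the result with the value configured for the host. This is a minimal sketch, assuming a SHA-256 checksum and that `sha256sum` is installed (substitute `md5sum` or `sha512sum` for other checksum types). The helper name `verify_image_checksum` is hypothetical:

```shell
# Hypothetical helper: compare the SHA-256 digest of a local image file
# with an expected digest. Returns non-zero on mismatch.
# Usage: verify_image_checksum <image-file> <expected-sha256>
verify_image_checksum() {
  local actual
  # sha256sum prints "<digest>  <file>"; keep only the digest.
  actual="$(sha256sum "$1" | awk '{print $1}')"
  if [ "$actual" = "$2" ]; then
    echo "checksum matches"
  else
    echo "checksum mismatch: expected $2, got $actual"
    return 1
  fi
}
```

For example, `verify_image_checksum CENTOS_9_NODE_IMAGE.img <expected-sha256>` reports whether the image being served still matches the checksum given to the BareMetalHost.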
No root device found example:

```text
Normal  ProvisioningStarted  15s  metal3-baremetal-controller  Image provisioning started for http://172.22.0.1/images/CENTOS_9_NODE_IMAGE.img
Normal  ProvisioningError    1s   metal3-baremetal-controller  Image provisioning failed: Deploy step deploy.write_image failed on node d25ce8de-914e-4146-a0c0-58825274572d. No suitable device was found for deployment using these hints {'name': 's== /dev/vdb'}
```

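To find out why no device matched, it helps to compare the configured hints with the disks found during inspection. This is a minimal sketch, assuming the hints live in `spec.rootDeviceHints` and the inspection results are still available under `status.hardware.storage`. Both helper names are hypothetical:

```shell
# Hypothetical helper: print the root device hints configured on a BareMetalHost.
# Usage: bmh_root_device_hints <name> <namespace>
bmh_root_device_hints() {
  kubectl get bmh "$1" -n "$2" \
    -o jsonpath='{.spec.rootDeviceHints}{"\n"}'
}

# Hypothetical helper: list the disks recorded by inspection, one per line,
# as "<device name><TAB><size in bytes>".
# Usage: bmh_disks <name> <namespace>
bmh_disks() {
  kubectl get bmh "$1" -n "$2" \
    -o jsonpath='{range .status.hardware.storage[*]}{.name}{"\t"}{.sizeBytes}{"\n"}{end}'
}
```

In the example above the hint asks for `/dev/vdb`. If `bmh_disks` does not list a device with that name, the hint needs to be corrected.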
## No BareMetalHost available or matching

This shows in the Metal3Machine status:

```text
Status:
  Conditions:
    Last Transition Time:  2025-08-15T10:53:05Z
    Message:               No available host found. Requeuing.. Object will be requeued after 30s
    Reason:                AssociateBMHFailed
    Severity:              Error
    Status:                False
    Type:                  AssociateBMH
```

CAPM3 controller logs when there are no available hosts:

```text
I0815 11:10:35.699004 1 metal3machine_manager.go:332] "No available host found. Requeuing." logger="controllers.Metal3Machine.Metal3Machine-controller" metal3-machine="metal3/test-no-match-2" machine="test-no-match-2" cluster="test1" metal3-cluster="test1"
```

CAPM3 controller logs when the annotated host is not found:

```text
I0815 06:08:54.687380 1 metal3machine_manager.go:788] "Annotated host not found" logger="controllers.Metal3Machine.Metal3Machine-controller" metal3-machine="metal3/test1-zxzn7-qvl6n" machine="test1-zxzn7-qvl6n" cluster="test1" metal3-cluster="test1" host="metal3/node-0"
```

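To see which hosts could be picked, list each BareMetalHost with its state, consumer and labels, and compare the labels with the selector on the Metal3Machine. This is a minimal sketch, assuming the fields `status.provisioning.state` and `spec.consumerRef.name` on the BareMetalHost and `spec.hostSelector` on the Metal3Machine. Both helper names are hypothetical:

```shell
# Hypothetical helper: list all BareMetalHosts with state, consumer and labels.
# Usage: bmh_availability
bmh_availability() {
  kubectl get bmh -A \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,STATE:.status.provisioning.state,CONSUMER:.spec.consumerRef.name,LABELS:.metadata.labels'
}

# Hypothetical helper: print the host selector of one Metal3Machine.
# Usage: m3m_host_selector <name> <namespace>
m3m_host_selector() {
  kubectl get metal3machine "$1" -n "$2" \
    -o jsonpath='{.spec.hostSelector}{"\n"}'
}
```

A host is typically a candidate only when its state is `available`, its CONSUMER column is empty, and its labels satisfy the selector printed by `m3m_host_selector`.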
## Provider ID is missing

This happens if `noCloudProvider` is set to `false` on the Metal3Cluster while
no external cloud provider is used. The Metal3Machine will be stuck in Provisioning.

Example output from `kubectl get metal3machine -A`:

```text
NAMESPACE   NAME                AGE    PROVIDERID                           READY   CLUSTER   PHASE
metal3      test1-82ljr         160m   metal3://metal3/node-0/test1-82ljr   true    test1
metal3      test1-bv9mv-2f8th   35m                                                 test1
```

Metal3Machine's status:

```text
Status:
  Conditions:
    Reason:   NotReady
    Status:   False
    Type:     Available
    Message:  * NodeHealthy: Waiting for Metal3Machine to report spec.providerID
```

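The setting can be read back from the cluster object to confirm what CAPM3 is working with. This is a minimal sketch, assuming the field is exposed as `spec.noCloudProvider` on the Metal3Cluster. Other CAPM3 releases may name this setting differently, so an empty result is not conclusive. The helper name `m3cluster_no_cloud_provider` is hypothetical:

```shell
# Hypothetical helper: print the noCloudProvider setting of a Metal3Cluster.
# Usage: m3cluster_no_cloud_provider <name> <namespace>
m3cluster_no_cloud_provider() {
  kubectl get metal3cluster "$1" -n "$2" \
    -o jsonpath='{.spec.noCloudProvider}{"\n"}'
}
```

For example, `m3cluster_no_cloud_provider test1 metal3` prints the value set for the `test1` cluster shown above.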
## `nodeRef` missing

This is a CAPI-level issue. It can be caused by the host failing to boot the
image or the node failing to join the cluster. Access to the node or serial
logs is needed to determine the exact cause. Cloud-init logs are especially
helpful for pinpointing it.

CAPM3 controller logs:

```text
I0815 11:10:36.545990 1 metal3labelsync_controller.go:150] "Could not find Node Ref on Machine object, will retry" logger="controllers.Metal3LabelSync.metal3-label-sync-controller" metal3-label-sync="metal3/node-0"
```
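To find the affected Machines, list them together with their node reference. This is a minimal sketch, assuming the CAPI Machine resource `machines.cluster.x-k8s.io` with the fields `status.phase` and `status.nodeRef.name`. The helper name `machines_node_refs` is hypothetical:

```shell
# Hypothetical helper: list all CAPI Machines with their phase and node reference.
# Usage: machines_node_refs
machines_node_refs() {
  kubectl get machines.cluster.x-k8s.io -A \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase,NODE:.status.nodeRef.name'
}
```

Machines that show `<none>` in the NODE column have no `nodeRef` yet. Those are the ones whose hosts need their serial or cloud-init logs checked.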
