16 changes: 16 additions & 0 deletions docs/index.yml
@@ -214,6 +214,22 @@ navigation:
  path: playbooks/stuck_objects/waiting_for_network_config.md
- page: Adding New Machines to an Existing Site
  path: playbooks/stuck_objects/adding_new_machines.md
- page: State Machine Debugging
  path: playbooks/stuck_objects/state_machine_debugging.md
- page: Diagnostic Tools
  path: playbooks/stuck_objects/diagnostic_tools.md
- page: DPU Provisioning Failures
  path: playbooks/stuck_objects/dpu_provisioning_failures.md
- page: Host Ingestion Failures
  path: playbooks/stuck_objects/host_ingestion_failures.md
- page: Health Alerts and Overrides
  path: playbooks/stuck_objects/health_alerts.md
- page: Network Connectivity Issues
  path: playbooks/stuck_objects/network_connectivity.md
- page: Instance and Fabric Issues
  path: playbooks/stuck_objects/instance_and_fabric_issues.md
- page: Site Controller Health
  path: playbooks/stuck_objects/site_controller_health.md
- page: Troubleshooting noDpuLogsWarning Alerts
  path: playbooks/troubleshooting_noDpuLogsWarning_alerts.md
- section: Debugging Machine
164 changes: 164 additions & 0 deletions docs/playbooks/stuck_objects/diagnostic_tools.md
@@ -0,0 +1,164 @@
# Diagnostic Tools

Use this page as a command reference while investigating a stuck object or site
operation incident.

## CLI Setup

`nico-admin-cli` is the primary operator CLI for NICo site state. Build it from
the repository; the Cargo package is named `carbide-admin-cli`:

```bash
cargo build -p carbide-admin-cli
```

Common connection options:

| Option | Meaning |
|---|---|
| `-c <url>` | NICo API gRPC endpoint. |
| `-f json` | JSON output for scripting. |
| `API_URL` | Environment variable for the API URL. |
| `https_proxy=socks5://...` | SOCKS5 proxy when reaching the site from off-site. |
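
The options above can be combined in a small wrapper so that every command in a session targets the same site. This is a minimal sketch, not part of NICo: the function name `nico` and the `NICO_CLI` variable are hypothetical, and the endpoint in the usage comment is a placeholder. Only `-c`, `-f json`, and `API_URL` come from the table.

```bash
# Hypothetical wrapper: NICO_CLI and nico() are illustrative names, not part of NICo.
NICO_CLI="${NICO_CLI:-nico-admin-cli}"

# Run the CLI against the endpoint in API_URL, passing all other arguments through.
nico() {
  if [ -z "${API_URL:-}" ]; then
    echo "API_URL is not set" >&2
    return 1
  fi
  "$NICO_CLI" -c "$API_URL" "$@"
}

# Usage (placeholder endpoint):
#   export API_URL=https://<api-host>:<port>
#   nico -f json version
```

Failing early on a missing `API_URL` avoids running a command against an unintended default endpoint.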

## Common Commands

| Need | Command |
|---|---|
| API version or reachability | `nico-admin-cli version`, `nico-admin-cli ping` |
| All managed hosts | `nico-admin-cli managed-host show --all` |
| One managed host | `nico-admin-cli managed-host show <host-machine-id>` |
| Machine event history | `nico-admin-cli -f json machine show <machine-id>` |
| Debug bundle | `nico-admin-cli managed-host debug-bundle <machine-id> --start-time <time>` |
| Maintenance mode | `nico-admin-cli managed-host maintenance on --host <host-machine-id> --reference "INC-123"` |
| Health reports | `nico-admin-cli machine health-report show <machine-id>` |
| Site Explorer reports | `nico-admin-cli site-explorer get-report all` |
| Redfish browse | `nico-admin-cli redfish browse --address <bmc-ip> <uri>` |
| Network segments | `nico-admin-cli network-segment show` |
| InfiniBand partitions | `nico-admin-cli ib-partition show` |
| NVLink partitions | `nico-admin-cli nvl-partition show` |
| Compute allocation | `nico-admin-cli compute-allocation show` |

## Query State History

```bash
nico-admin-cli -c <api-url> -f json machine show <machine-id>
```

Use this to inspect state transitions, timestamps, and handler outcomes.
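
When comparing state before and after a mitigation, it can help to save each query to a timestamped file. The following is a sketch under stated assumptions: `snapshot_machine`, `NICO_CLI`, and the file naming scheme are hypothetical, the API endpoint is assumed to come from `API_URL`, and nothing is assumed about the JSON schema.

```bash
# Hypothetical helper: snapshot_machine and NICO_CLI are illustrative names.
NICO_CLI="${NICO_CLI:-nico-admin-cli}"

# Save the JSON output of "machine show" for one machine to
# <dir>/<machine-id>-<UTC timestamp>.json and print the path that was written.
snapshot_machine() {
  machine_id=$1
  out_dir=${2:-.}
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  out_file="$out_dir/$machine_id-$stamp.json"
  mkdir -p "$out_dir" || return 1
  "$NICO_CLI" -f json machine show "$machine_id" > "$out_file" || return 1
  echo "$out_file"
}
```

Two snapshots taken a few minutes apart can then be compared with `diff` to see whether the state or its timestamps changed.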

## Query Health

Aggregate state:

```bash
nico-admin-cli managed-host show <host-machine-id>
```

Per-source health reports:

```bash
nico-admin-cli machine health-report show <machine-id>
```

JSON output:

```bash
nico-admin-cli -f json machine health-report show <machine-id>
```

## Add or Remove Health Overrides

Mark a host healthy for allocation when an alert is a false positive:

```bash
nico-admin-cli machine health-override add <machine-id> \
  --template mark-healthy \
  --message "false positive INC-123"
```

Hold a host out of allocation:

```bash
nico-admin-cli machine health-override add <machine-id> \
  --template out-for-repair \
  --message "INC-123"
```

Remove an override:

```bash
nico-admin-cli machine health-override remove <machine-id> <source-name>
```
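
Overrides are easier to audit when each one carries an incident reference. The sketch below wraps the `out-for-repair` command shown above and refuses to run without one. `hold_out_for_repair` and `NICO_CLI` are hypothetical names; the subcommand and flags are copied from the commands in this section.

```bash
# Hypothetical helper: hold_out_for_repair and NICO_CLI are illustrative names.
NICO_CLI="${NICO_CLI:-nico-admin-cli}"

# Add an out-for-repair override, but only when an incident reference is given.
hold_out_for_repair() {
  machine_id=${1:-}
  incident=${2:-}
  if [ -z "$machine_id" ] || [ -z "$incident" ]; then
    echo "usage: hold_out_for_repair <machine-id> <incident-ref>" >&2
    return 2
  fi
  "$NICO_CLI" machine health-override add "$machine_id" \
    --template out-for-repair \
    --message "$incident"
}
```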

## Kubernetes Logs

Namespace names vary by site and deployment generation. Confirm the namespace
before copying commands.

```bash
kubectl get ns
kubectl -n <nico-namespace> get pods
kubectl -n <nico-namespace> logs deploy/nico-api --tail=500 | grep <machine-id>
```
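
For longer incidents, a time window is often more useful than a fixed line count. The sketch below assumes the same `deploy/nico-api` target as the command above; `nico_api_logs` and `KUBECTL` are hypothetical names. It uses the standard `kubectl logs --since` flag and `grep -F`, which treats the machine ID as a fixed string rather than a regular expression.

```bash
# Hypothetical helper: nico_api_logs and KUBECTL are illustrative names.
KUBECTL="${KUBECTL:-kubectl}"

# Print nico-api log lines from the last <since> (default 1h) that mention <machine-id>.
nico_api_logs() {
  namespace=${1:-}
  machine_id=${2:-}
  since=${3:-1h}
  if [ -z "$namespace" ] || [ -z "$machine_id" ]; then
    echo "usage: nico_api_logs <namespace> <machine-id> [since]" >&2
    return 2
  fi
  # -F matches the ID literally; -- guards against IDs that start with '-'.
  "$KUBECTL" -n "$namespace" logs deploy/nico-api --since="$since" \
    | grep -F -- "$machine_id"
}
```

The function returns a non-zero status when no line matches, which is `grep`'s normal behavior and can be used to detect an empty result in scripts.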

Common log sources:

| Component | What to look for |
|---|---|
| `nico-api` | State transitions, Redfish errors, Vault failures, health reports, gRPC errors. |
| `nico-dhcp` | DHCP lease and discovery issues. |
| `nico-pxe` | PXE and HTTP boot artifact requests. |
| Site Explorer | BMC endpoint discovery and scrape failures. |
| DPF operator | DPU provisioning custom resources and operator status. |
| `nico-dpu-agent` | DPU heartbeat, BGP, HBN, DHCP relay, and applied network config. |

## Loki and Grafana

Use a debug bundle when possible:

```bash
GRAFANA_AUTH_TOKEN=<token> \
nico-admin-cli managed-host debug-bundle <machine-id> \
  --start-time <time> \
  --grafana-url https://<grafana-host>
```

Use `logcli` directly when a bundle is not enough:

```bash
logcli --addr=http://localhost:3100 \
  --org-id=<org-id> \
  query \
  --timezone=UTC \
  --from="<YYYY-MM-DDTHH:MM:SSZ>" \
  --to="<YYYY-MM-DDTHH:MM:SSZ>" \
  --limit 0 \
  --forward \
  '{k8s_container_name="<container-name>"}'
```
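
The `--from` and `--to` values are usually a window around the time of the incident. The sketch below computes such a window in the timestamp format shown above. `loki_window` is a hypothetical helper, and it assumes GNU `date` (the `-d` option); BSD and macOS `date` use different flags.

```bash
# Hypothetical helper: loki_window is an illustrative name. Requires GNU date.
# Print "<from> <to>" as YYYY-MM-DDTHH:MM:SSZ, covering <minutes> (default 30)
# before and after the <center> timestamp.
loki_window() {
  center=$1
  minutes=${2:-30}
  # Convert to epoch seconds first so the offset is plain integer arithmetic.
  epoch=$(date -u -d "$center" +%s) || return 1
  from=$(date -u -d "@$((epoch - minutes * 60))" +%Y-%m-%dT%H:%M:%SZ)
  to=$(date -u -d "@$((epoch + minutes * 60))" +%Y-%m-%dT%H:%M:%SZ)
  echo "$from $to"
}

# Usage:
#   loki_window 2024-01-01T12:00:00Z 30
#   # prints: 2024-01-01T11:30:00Z 2024-01-01T12:30:00Z
```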

## On-Metal Host and DPU Logs

| Location | Use |
|---|---|
| `/var/log/nico/nico-scout.log` | Host discovery scout during ingestion. |
| `journalctl -u nico-dpu-agent` | DPU agent heartbeat, network config, BGP, HBN, and service health. |
| DPU BMC or rshim console | Use when SSH to the DPU fails. |

## Metrics

Metric names may retain the historical `carbide_*` prefix even when the service
name is now NICo.

| Metric | Use |
|---|---|
| `carbide_machines_per_state` | Count hosts by state. |
| `carbide_machines_time_in_state_seconds` | Average time in each state. |
| `carbide_machines_per_state_above_sla` | Hosts past state SLA. |
| `carbide_hosts_health_status_count` | Healthy vs alerting hosts. |
| `carbide_hosts_health_overrides_count` | Active overrides. |
| `carbide_dpus_up_count` / `carbide_dpus_healthy_count` | DPU agent presence and health. |
| `carbide_endpoint_exploration_*` | BMC discovery health. |
| `carbide_available_ips_count` | DHCP or IP pool pressure. |
| `carbide_gpus_usable_count` | GPU capacity for allocation. |
| `carbide_api_vault_requests_failed_total` | Credential pipeline failures. |
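
If these metrics are scraped into a Prometheus-compatible server, they can be queried from a shell during an incident. That is an assumption; this page does not say where the metrics are stored. `prom_query`, `CURL`, and `PROM_URL` are hypothetical names, and the endpoint in the usage comment is a placeholder. The sketch uses the standard Prometheus instant-query endpoint `/api/v1/query`.

```bash
# Hypothetical helper: prom_query, CURL, and PROM_URL are illustrative names.
CURL="${CURL:-curl}"

# Run an instant PromQL query against $PROM_URL/api/v1/query and print the raw JSON.
prom_query() {
  query=${1:-}
  if [ -z "${PROM_URL:-}" ] || [ -z "$query" ]; then
    echo "usage: PROM_URL=<url> prom_query '<promql>'" >&2
    return 2
  fi
  # -G sends the URL-encoded query as a GET parameter.
  "$CURL" -sS -G --data-urlencode "query=$query" "$PROM_URL/api/v1/query"
}

# Usage (placeholder endpoint):
#   export PROM_URL=http://<prometheus-host>:9090
#   prom_query 'carbide_machines_per_state_above_sla > 0'
```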
131 changes: 131 additions & 0 deletions docs/playbooks/stuck_objects/dpu_provisioning_failures.md
@@ -0,0 +1,131 @@
# DPU Provisioning Failures

Use this playbook when a DPU is stuck during discovery, initialization,
reprovisioning, secure boot setup, or network configuration.

## Where Failures Appear

DPU provisioning issues usually show up in two places:

| Layer | Examples |
|---|---|
| NICo state machine | `DpuDiscoveringState`, `DPUInit`, `DPUReprovision`, `Assigned/DPUReprovision`. |
| DPF operator resources | DPU device, provisioning, secure boot, and service state. |

Start with NICo state. Move to DPF resources when NICo is waiting on DPF.

```bash
nico-admin-cli managed-host show <host-machine-id>
nico-admin-cli -f json machine show <host-machine-id>
```

## Install Path

Know which install path is active before debugging.

| Path | How it works | Common blockers |
|---|---|---|
| BFB install over Redfish | NICo or DPF instructs the DPU BMC to install a BFB. | Redfish connectivity, BMC credentials, BFB availability. |
| UEFI HTTP boot | DPU boots over HTTP through `nico-pxe`. | DHCP, HTTP boot URL, TLS root CA, boot order, DPU NIC path. |
| Reprovision | Existing DPU is updated or reinstalled. | User approval, assigned instance state, BFB version, DPF status. |

## Common States

### `DpuDiscoveringState`

NICo is discovering the DPU and preparing it for provisioning.

Check:

- DPU BMC reachability.
- Redfish credentials and Vault access.
- Site Explorer reports for the DPU BMC.
- DPF device status if DPF owns the next step.

### `DPUInit`

NICo is installing or bringing up the DPU OS and services.

Check:

- DPU BMC power and console.
- DPU install method: BFB over Redfish or UEFI HTTP boot.
- `nico-pxe` logs for HTTP boot requests.
- DPF operator status.
- `nico-dpu-agent` startup logs once the OS boots.

### `WaitingForNetworkConfig`

NICo blocks state advancement until the DPU agent reports:

1. It is alive.
2. It applied the latest desired network config version.
3. Its DPU network health is acceptable.

```bash
nico-admin-cli managed-host show <host-machine-id>
nico-admin-cli machine network status
nico-admin-cli machine health-report show <dpu-machine-id>
```

If `Last seen` is stale or `HeartbeatTimeout` is present, inspect the DPU
directly:

```bash
journalctl -u nico-dpu-agent.service -e --no-pager
```
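
The three conditions at the top of this section behave like a gate: the first one that fails is the reason the host stays in `WaitingForNetworkConfig`. The sketch below models that gate for illustration only. NICo evaluates it internally from DPU agent reports; the function name, the argument layout, and the treatment of config versions as opaque strings are all assumptions.

```bash
# Illustrative model only; not NICo's implementation. Arguments are hypothetical:
#   $1 alive            "true" if the DPU agent heartbeat is recent
#   $2 applied_version  network config version the agent last applied
#   $3 desired_version  latest desired network config version
#   $4 network_healthy  "true" if DPU network health is acceptable
# Prints "ready" and returns 0 only when all three conditions hold.
network_config_ready() {
  [ "${1:-}" = "true" ] || { echo "blocked: agent not alive"; return 1; }
  [ -n "${2:-}" ] && [ "$2" = "${3:-}" ] || {
    echo "blocked: applied config ${2:-none} != desired ${3:-none}"
    return 1
  }
  [ "${4:-}" = "true" ] || { echo "blocked: DPU network health alert"; return 1; }
  echo "ready"
}

# Usage:
#   network_config_ready true v6 v7 true
#   # prints: blocked: applied config v6 != desired v7
```

Checking the conditions in this order mirrors the triage order: a dead agent cannot report a config version, and a stale config makes the health result less meaningful.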

### `DPUReprovision`

Reprovisioning may require approval when a host is assigned to an instance.

```bash
nico-admin-cli dpu reprovision list
nico-admin-cli dpu reprovision restart --id <host-machine-id>
```

If the host is assigned, confirm the tenant or user approval path before
forcing disruptive actions.

## Health Probes

Common DPU probe alerts:

| Probe | Meaning | First checks |
|---|---|---|
| `HeartbeatTimeout` | NICo has not received recent DPU agent health. | DPU booted, agent running, DPU can reach `nico-api`. |
| `BgpStats` | BGP peering is not healthy. | HBN container, FRR/BGP status, TOR or route-server reachability. |
| `ServiceRunning` | Required DPU service is down. | `crictl ps`, systemd status, HBN logs. |
| `DhcpRelay` / `DhcpServer` | Host-facing DHCP path is broken. | DPU agent logs, HBN, DHCP relay/server config. |

## DPU Console and Logs

If SSH to the DPU works:

```bash
ssh <dpu-oob-ip>
journalctl -u nico-dpu-agent.service -e --no-pager
```

If SSH fails, use DPU BMC or rshim access and check whether the DPU OS booted.

Useful on-DPU checks:

```bash
systemctl status nico-dpu-agent.service
journalctl -u nico-dpu-agent.service -e --no-pager
sudo crictl ps
sudo crictl exec -ti $(sudo crictl ps | grep doca-hbn | awk '{print $1}') vtysh -c 'show bgp summary'
```
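
The last command above assumes that exactly one running container matches `doca-hbn`. The sketch below resolves the container ID separately and fails with a clear message when HBN is not running. `hbn_container_id` and `CRICTL` are hypothetical names; `--name` and `-q` are standard `crictl ps` options.

```bash
# Hypothetical helper: hbn_container_id and CRICTL are illustrative names.
CRICTL="${CRICTL:-sudo crictl}"

# Print the ID of the first running container whose name matches doca-hbn,
# or fail with a message if none is running.
hbn_container_id() {
  # $CRICTL is intentionally unquoted so that "sudo crictl" splits into two words.
  id=$($CRICTL ps --name doca-hbn -q | head -n 1)
  if [ -z "$id" ]; then
    echo "no running doca-hbn container found" >&2
    return 1
  fi
  echo "$id"
}

# Usage:
#   sudo crictl exec -ti "$(hbn_container_id)" vtysh -c 'show bgp summary'
```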

## Mitigations

Use the least disruptive mitigation that addresses the root cause.

| Situation | Mitigation |
|---|---|
| DPU agent is stopped | `systemctl restart nico-dpu-agent.service` |
| Unit files changed | `systemctl daemon-reload && systemctl restart nico-dpu-agent.service` |
| DPU is unresponsive | Power cycle the host only after confirming tenant and operator impact. |
| Reprovision stuck | `nico-admin-cli dpu reprovision restart --id <host-machine-id>` |
| False health blocker | Add a temporary override only with incident context and remove it after recovery. |