ci(nightly-e2e): add onboard negative-path E2E test #2573

## Summary

A gap analysis of the last 3 weeks of bug fixes (Apr 7–27) identified onboard input validation and error handling as a systematic gap: **6 bugs** reached NV QA or community users because the nightly only exercises the happy-path onboard flow.

Additionally, PR #2607 wired `test-onboard-repair.sh` and `test-onboard-resume.sh` into the nightly pipeline on Apr 28, and **both have failed in every nightly run since**. The root cause is understood (see below), and fixing them is a primary goal of this issue.

---

## 🔴 Immediate goal: fix `onboard-repair-e2e` and `onboard-resume-e2e`

Both tests fail at Phase 2 with: `FAIL: First onboard exited 0 (expected 1)`

### Root cause

The tests set `NEMOCLAW_POLICY_MODE=invalid` to force onboard to fail at the policy step (step 8/8), creating an interrupted session file for resume/repair testing. This worked before PR #2434 (merged Apr 24), which **intentionally changed the behavior**: invalid policy modes now warn and fall back to suggested presets instead of calling `process.exit(1)`.

There are two policy-mode validation paths in `src/lib/onboard.ts`:

1. **Old `_setupPolicies()`** (line ~5605) — `process.exit(1)` on invalid mode (hard fail)
2. **New `setupPoliciesWithSelection()`** (line ~6240) — `console.warn()` + fallback to suggested presets (graceful)

The main `onboard()` function calls `setupPoliciesWithSelection()`, the graceful path, so `NEMOCLAW_POLICY_MODE=invalid` no longer causes a failure: onboard completes successfully (exit 0), which is why both tests now report `First onboard exited 0 (expected 1)`.

### Fix required

The tests need a **different fault injection mechanism** to create interrupted onboard state. Options:

1. **Add a test-only env var** (e.g., `NEMOCLAW_E2E_FORCE_FAIL_AT_STEP=policies`) that forces a failure at a named step — most reliable and explicit
2. **Kill the process mid-onboard** with a timed signal after sandbox creation but before policy application
3. **Inject a different real failure** (e.g., invalid API key format, unreachable endpoint) at an earlier step

Option 1 is recommended: it is deterministic, self-documenting, and does not couple test behavior to product error handling that may change again, as it did in #2434.
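
As a rough illustration of how Phase 2 could look under option 1, the sketch below wraps the first onboard in an exit-code check. `NEMOCLAW_E2E_FORCE_FAIL_AT_STEP` is the hypothetical variable proposed above and does not exist in the product yet; `expect_exit` is likewise an illustrative helper, not a function taken from the existing scripts.

```shell
#!/usr/bin/env bash
# Sketch only. NEMOCLAW_E2E_FORCE_FAIL_AT_STEP is the hypothetical env var
# from option 1; expect_exit is an illustrative helper.

# expect_exit WANT CMD...: run CMD and check that its exit status equals WANT.
expect_exit() {
  local want="$1"
  shift
  local got=0
  "$@" || got=$?
  if [ "$got" -ne "$want" ]; then
    echo "FAIL: exited $got (expected $want): $*" >&2
    return 1
  fi
  echo "PASS: exited $got as expected"
}

# Phase 2 of the resume/repair tests could then read (assumed usage):
#   expect_exit 1 env NEMOCLAW_E2E_FORCE_FAIL_AT_STEP=policies nemoclaw onboard
```

Passing the variable through `env` keeps it scoped to the single onboard invocation, so a later `nemoclaw onboard --resume` in the same script would run without the forced failure.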

### Nightly failure evidence

Failed in all 3 recent nightly runs:
- [25084401781](https://github.com/NVIDIA/NemoClaw/actions/runs/25084401781) (schedule, Apr 29 00:15 UTC)
- [25082763579](https://github.com/NVIDIA/NemoClaw/actions/runs/25082763579) (dispatch, Apr 28 23:21 UTC)
- [25079089844](https://github.com/NVIDIA/NemoClaw/actions/runs/25079089844) (dispatch, Apr 28 21:39 UTC)

### Related PRs
- #2434 — `fix(onboard): fall back to tier suggestions on bad NEMOCLAW_POLICY_MODE` (the change that broke the fault injection)
- #2607 — `fix(ci): wire 6 unwired E2E scripts into nightly pipeline` (wired these tests)

---

## Consolidated from #446: onboard resumability

Issue #446 documented that `nemoclaw onboard` was not resumable — partial failures required full cleanup and restart. The core problems were:

1. **No checkpoint/resume** — no way to skip already-completed steps after a mid-onboard failure
2. **Stale gateway blocks re-onboarding** — gateway left running on port 8080 after failure, next attempt gets `Error: Gateway already running on port 8080`
3. **PATH not updated after install** — `nemoclaw: command not found` immediately after install

Since #446 was filed, significant progress has been made:
- `nemoclaw onboard --resume` now exists and skips cached steps (preflight, gateway, sandbox)
- Session file (`~/.nemoclaw/onboard-session.json`) records interrupted state
- PRs #961, #1136, #2060 hardened installer and onboard resiliency

**What remains from #446** is ensuring this resume mechanism is **tested end-to-end** — which is exactly what `test-onboard-resume.sh` and `test-onboard-repair.sh` do, once their fault injection is fixed. The E2E coverage in this issue fully subsumes the remaining work from #446.
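
Once the first onboard is forced to fail, the tests can confirm that interrupted state was actually recorded before attempting `--resume`. A minimal sketch, assuming only the session file path listed above; `assert_session_recorded` is a hypothetical helper, and the file's JSON contents are deliberately not inspected because the schema is not described here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: checks only that a non-empty session file exists.
# The default path comes from this issue; the file's contents are not parsed.
assert_session_recorded() {
  local file="${1:-$HOME/.nemoclaw/onboard-session.json}"
  if [ ! -s "$file" ]; then
    echo "FAIL: no session file at $file" >&2
    return 1
  fi
  echo "PASS: session file present at $file"
}

# Assumed usage between the failed first run and the resume:
#   assert_session_recorded && nemoclaw onboard --resume
```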

---

## What it tests

Onboard error handling and input validation — the negative/edge-case paths that the current `cloud-e2e` happy-path onboard never exercises.

## Bugs it would have caught (Apr 7–27)

- #2434 / #2429 — bad `NEMOCLAW_POLICY_MODE` value crashes onboard or silently skips all presets
- #2428 — invalid Slack bot token accepted without validation
- #2389 — `nvapi-` prefix enforced on Anthropic and other non-NVIDIA API keys
- #2304 / #2177 — non-interactive preset selection ignored, presets don't apply correctly
- #2507 — Brave Search key validation failure aborts non-interactive onboard

## Proposed test cases

| # | Scenario | Assert |
|---|----------|--------|
| 1 | `NEMOCLAW_POLICY_MODE=restricted` (valid tier name, invalid mode) | Graceful fallback to tier suggestions, not `process.exit(1)` |
| 2 | `NEMOCLAW_POLICY_MODE=nonexistent` | Same graceful fallback |
| 3 | Invalid API key format for NVIDIA provider | Rejection message, not stack trace |
| 4 | Valid Anthropic key (no `nvapi-` prefix) for Anthropic provider | Accepted without error |
| 5 | Onboard on port already in use (18789 held by another process) | User-friendly port conflict error, not raw JS stack trace |
| 6 | Non-interactive onboard with specific `NEMOCLAW_POLICY_PRESETS` | Verify presets actually applied (not silently dropped) |
| 7 | Non-interactive onboard with `NEMOCLAW_PROVIDER=cloud` + specific model | Verify model selection honored |
| 8 | **Interrupted onboard → resume skips completed steps** (from `test-onboard-resume.sh`) | Resume skips preflight/gateway/sandbox, completes from failure point |
| 9 | **Missing sandbox → repair recreates it on resume** (from `test-onboard-repair.sh`) | Resume detects missing sandbox and recreates it |
| 10 | **Conflicting sandbox name on resume** (from `test-onboard-repair.sh`) | Resume rejects mismatched sandbox name |
| 11 | **Conflicting provider/model on resume** (from `test-onboard-repair.sh`) | Resume rejects mismatched provider/model |
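
Scenarios 3 and 5 assert a readable message rather than a raw stack trace. One way to check that is sketched below, assuming the onboard output has been captured to a log file. `assert_no_stack_trace` is a hypothetical helper, and the pattern is a heuristic for typical Node.js stack frames (`    at fn (file:line:col)`), not an exhaustive detector.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fails if LOG contains a line that looks like a
# Node.js stack frame, e.g. "    at main (/app/onboard.js:10:5)".
assert_no_stack_trace() {
  local log="$1"
  if grep -Eq '^[[:space:]]+at .+:[0-9]+:[0-9]+\)?$' "$log"; then
    echo "FAIL: raw stack trace found in $log" >&2
    return 1
  fi
  echo "PASS: no stack trace in $log"
}

# Assumed usage inside a scenario:
#   nemoclaw onboard >"$log" 2>&1 || true
#   assert_no_stack_trace "$log"
```

A scenario would normally pair this with a positive check for the expected message, since an empty log also contains no stack trace.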

## Implementation

1. **Fix fault injection** in `test-onboard-resume.sh` and `test-onboard-repair.sh` — replace `NEMOCLAW_POLICY_MODE=invalid` with a reliable mechanism (see options above)
2. New self-contained script (`test/e2e/test-onboard-negative-paths.sh`) for scenarios 1–7 (~300–400 lines)
3. New nightly job in `.github/workflows/nightly-e2e.yaml` with `NVIDIA_API_KEY` and `NEMOCLAW_NON_INTERACTIVE=1`

**Effort:** Medium — fault injection fix (~1 hour), new negative-path script (~half day), nightly YAML addition.

## Related issues

- #446 — Original onboard resumability issue (consolidated here, now closed)
- #2566 — Wire 4 existing self-contained scripts into nightly
- #2567 — Wire credential-sanitization and telegram-injection into nightly
- #2570 — Re-add cloud-experimental-e2e to nightly
- #2572 — Forward-proxy E2E for custom endpoints
- #2574 — Brev Launchable install-flow smoke test
- #2571 — Non-root sandbox smoke as PR gate
- #2564 — CodeRabbit E2E recommendations + selective nightly dispatch
