Summary
Gap analysis of the last 3 weeks of bug fixes (Apr 7–27) identified onboard input validation / error handling as a systematic gap — 6 bugs reached NV QA or community users because the nightly only exercises the happy-path onboard flow.
Additionally, PR #2607 wired test-onboard-repair.sh and test-onboard-resume.sh into the nightly pipeline on Apr 28, and both are persistently failing in every nightly run. The root cause is understood (see below) and fixing them is a primary goal of this issue.
🔴 Immediate goal: fix onboard-repair-e2e and onboard-resume-e2e
Both tests fail at Phase 2 with: FAIL: First onboard exited 0 (expected 1)
Root cause
The tests set NEMOCLAW_POLICY_MODE=invalid to force onboard to fail at the policy step (step 8/8), creating an interrupted session file for resume/repair testing. This worked before PR #2434 (merged Apr 24), which intentionally changed the behavior: invalid policy modes now warn and fall back to suggested presets instead of calling process.exit(1).
There are two policy-mode validation paths in src/lib/onboard.ts:
- Old
_setupPolicies() (line ~5605) — process.exit(1) on invalid mode (hard fail)
- New
setupPoliciesWithSelection() (line ~6240) — console.warn() + fallback to suggested presets (graceful)
The main onboard() function calls setupPoliciesWithSelection() — the graceful path. So NEMOCLAW_POLICY_MODE=invalid no longer causes a failure; onboard completes successfully (exit 0).
Fix required
The tests need a different fault injection mechanism to create interrupted onboard state. Options:
- Add a test-only env var (e.g.,
NEMOCLAW_E2E_FORCE_FAIL_AT_STEP=policies) that forces a failure at a named step — most reliable and explicit
- Kill the process mid-onboard with a timed signal after sandbox creation but before policy application
- Inject a different real failure (e.g., invalid API key format, unreachable endpoint) at an earlier step
Option 1 is recommended — it is deterministic, self-documenting, and does not couple test behavior to transient product error handling.
Nightly failure evidence
Failed in all 3 recent nightly runs:
Related PRs
Consolidated from #446: onboard resumability
Issue #446 documented that nemoclaw onboard was not resumable — partial failures required full cleanup and restart. The core problems were:
- No checkpoint/resume — no way to skip already-completed steps after a mid-onboard failure
- Stale gateway blocks re-onboarding — gateway left running on port 8080 after failure, next attempt gets
Error: Gateway already running on port 8080
- PATH not updated after install —
nemoclaw: command not found immediately after install
Since #446 was filed, significant progress has been made:
What remains from #446 is ensuring this resume mechanism is tested end-to-end — which is exactly what test-onboard-resume.sh and test-onboard-repair.sh do, once their fault injection is fixed. The E2E coverage in this issue fully subsumes the remaining work from #446.
What it tests
Onboard error handling and input validation — the negative/edge-case paths that the current cloud-e2e happy-path onboard never exercises.
Bugs it would have caught (Apr 7–27)
Proposed test cases
| # |
Scenario |
Assert |
| 1 |
NEMOCLAW_POLICY_MODE=restricted (valid tier name, invalid mode) |
Graceful fallback to tier suggestions, not process.exit(1) |
| 2 |
NEMOCLAW_POLICY_MODE=nonexistent |
Same graceful fallback |
| 3 |
Invalid API key format for NVIDIA provider |
Rejection message, not stack trace |
| 4 |
Valid Anthropic key (no nvapi- prefix) for Anthropic provider |
Accepted without error |
| 5 |
Onboard on port already in use (18789 held by another process) |
User-friendly port conflict error, not raw JS stack trace |
| 6 |
Non-interactive onboard with specific NEMOCLAW_POLICY_PRESETS |
Verify presets actually applied (not silently dropped) |
| 7 |
Non-interactive onboard with NEMOCLAW_PROVIDER=cloud + specific model |
Verify model selection honored |
| 8 |
Interrupted onboard → resume skips completed steps (from test-onboard-resume.sh) |
Resume skips preflight/gateway/sandbox, completes from failure point |
| 9 |
Missing sandbox → repair recreates it on resume (from test-onboard-repair.sh) |
Resume detects missing sandbox and recreates it |
| 10 |
Conflicting sandbox name on resume (from test-onboard-repair.sh) |
Resume rejects mismatched sandbox name |
| 11 |
Conflicting provider/model on resume (from test-onboard-repair.sh) |
Resume rejects mismatched provider/model |
Implementation
- Fix fault injection in
test-onboard-resume.sh and test-onboard-repair.sh — replace NEMOCLAW_POLICY_MODE=invalid with a reliable mechanism (see options above)
- New self-contained script (
test/e2e/test-onboard-negative-paths.sh) for scenarios 1–7 (~300–400 lines)
- New nightly job in
.github/workflows/nightly-e2e.yaml with NVIDIA_API_KEY and NEMOCLAW_NON_INTERACTIVE=1
Effort: Medium — fault injection fix (~1 hour), new negative-path script (~half day), nightly YAML addition.
Related issues
Summary
Gap analysis of the last 3 weeks of bug fixes (Apr 7–27) identified onboard input validation / error handling as a systematic gap — 6 bugs reached NV QA or community users because the nightly only exercises the happy-path onboard flow.
Additionally, PR #2607 wired
test-onboard-repair.shandtest-onboard-resume.shinto the nightly pipeline on Apr 28, and both are persistently failing in every nightly run. The root cause is understood (see below) and fixing them is a primary goal of this issue.🔴 Immediate goal: fix
onboard-repair-e2eandonboard-resume-e2eBoth tests fail at Phase 2 with:
FAIL: First onboard exited 0 (expected 1)Root cause
The tests set
NEMOCLAW_POLICY_MODE=invalidto force onboard to fail at the policy step (step 8/8), creating an interrupted session file for resume/repair testing. This worked before PR #2434 (merged Apr 24), which intentionally changed the behavior: invalid policy modes now warn and fall back to suggested presets instead of callingprocess.exit(1).There are two policy-mode validation paths in
src/lib/onboard.ts:_setupPolicies()(line ~5605) —process.exit(1)on invalid mode (hard fail)setupPoliciesWithSelection()(line ~6240) —console.warn()+ fallback to suggested presets (graceful)The main
onboard()function callssetupPoliciesWithSelection()— the graceful path. SoNEMOCLAW_POLICY_MODE=invalidno longer causes a failure; onboard completes successfully (exit 0).Fix required
The tests need a different fault injection mechanism to create interrupted onboard state. Options:
NEMOCLAW_E2E_FORCE_FAIL_AT_STEP=policies) that forces a failure at a named step — most reliable and explicitOption 1 is recommended — it is deterministic, self-documenting, and does not couple test behavior to transient product error handling.
Nightly failure evidence
Failed in all 3 recent nightly runs:
Related PRs
fix(onboard): fall back to tier suggestions on bad NEMOCLAW_POLICY_MODE(the change that broke the fault injection)fix(ci): wire 6 unwired E2E scripts into nightly pipeline(wired these tests)Consolidated from #446: onboard resumability
Issue #446 documented that
nemoclaw onboardwas not resumable — partial failures required full cleanup and restart. The core problems were:Error: Gateway already running on port 8080nemoclaw: command not foundimmediately after installSince #446 was filed, significant progress has been made:
nemoclaw onboard --resumenow exists and skips cached steps (preflight, gateway, sandbox)~/.nemoclaw/onboard-session.json) records interrupted stateWhat remains from #446 is ensuring this resume mechanism is tested end-to-end — which is exactly what
test-onboard-resume.shandtest-onboard-repair.shdo, once their fault injection is fixed. The E2E coverage in this issue fully subsumes the remaining work from #446.What it tests
Onboard error handling and input validation — the negative/edge-case paths that the current
cloud-e2ehappy-path onboard never exercises.Bugs it would have caught (Apr 7–27)
NEMOCLAW_POLICY_MODEvalue crashes onboard or silently skips all presetsnvapi-prefix enforced on Anthropic and other non-NVIDIA API keysProposed test cases
NEMOCLAW_POLICY_MODE=restricted(valid tier name, invalid mode)process.exit(1)NEMOCLAW_POLICY_MODE=nonexistentnvapi-prefix) for Anthropic providerNEMOCLAW_POLICY_PRESETSNEMOCLAW_PROVIDER=cloud+ specific modeltest-onboard-resume.sh)test-onboard-repair.sh)test-onboard-repair.sh)test-onboard-repair.sh)Implementation
test-onboard-resume.shandtest-onboard-repair.sh— replaceNEMOCLAW_POLICY_MODE=invalidwith a reliable mechanism (see options above)test/e2e/test-onboard-negative-paths.sh) for scenarios 1–7 (~300–400 lines).github/workflows/nightly-e2e.yamlwithNVIDIA_API_KEYandNEMOCLAW_NON_INTERACTIVE=1Effort: Medium — fault injection fix (~1 hour), new negative-path script (~half day), nightly YAML addition.
Related issues