ci(nightly-e2e): add onboard negative-path E2E test #2573

@jyaunches

Summary

Gap analysis of the last 3 weeks of bug fixes (Apr 7–27) identified onboard input validation / error handling as a systematic gap — 6 bugs reached NV QA or community users because the nightly only exercises the happy-path onboard flow.

Additionally, PR #2607 wired test-onboard-repair.sh and test-onboard-resume.sh into the nightly pipeline on Apr 28, and both have failed in every nightly run since then. The root cause is understood (see below), and fixing them is a primary goal of this issue.


🔴 Immediate goal: fix onboard-repair-e2e and onboard-resume-e2e

Both tests fail at Phase 2 with: FAIL: First onboard exited 0 (expected 1)

Root cause

The tests set NEMOCLAW_POLICY_MODE=invalid to force onboard to fail at the policy step (step 8/8), creating an interrupted session file for resume/repair testing. This worked before PR #2434 (merged Apr 24), which intentionally changed the behavior: invalid policy modes now warn and fall back to suggested presets instead of calling process.exit(1).

There are two policy-mode validation paths in src/lib/onboard.ts:

  1. Old _setupPolicies() (line ~5605) — process.exit(1) on invalid mode (hard fail)
  2. New setupPoliciesWithSelection() (line ~6240) — console.warn() + fallback to suggested presets (graceful)

The main onboard() function calls setupPoliciesWithSelection() — the graceful path. So NEMOCLAW_POLICY_MODE=invalid no longer causes a failure; onboard completes successfully (exit 0).

Fix required

The tests need a different fault injection mechanism to create interrupted onboard state. Options:

  1. Add a test-only env var (e.g., NEMOCLAW_E2E_FORCE_FAIL_AT_STEP=policies) that forces a failure at a named step — most reliable and explicit
  2. Kill the process mid-onboard with a timed signal after sandbox creation but before policy application
  3. Inject a different real failure (e.g., invalid API key format, unreachable endpoint) at an earlier step

Option 1 is recommended: it is deterministic, self-documenting, and does not couple test behavior to product error handling that can change again, as it did in PR #2434.
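A minimal sketch of how the test side of option 1 might look. It assumes the proposed NEMOCLAW_E2E_FORCE_FAIL_AT_STEP variable exists (it does not yet), and both expect_exit and fake_onboard are hypothetical names introduced here for illustration. fake_onboard only stands in for the real CLI so the sketch runs anywhere.

```shell
#!/usr/bin/env bash

# expect_exit <expected-status> <command> [args...]
# Hypothetical helper: runs the command, discards its output, and compares
# the exit status with the expected one.
expect_exit() {
  local expected="$1"
  shift
  local actual=0
  "$@" >/dev/null 2>&1 || actual=$?
  if [ "$actual" -ne "$expected" ]; then
    echo "FAIL: exited $actual (expected $expected): $*" >&2
    return 1
  fi
  echo "PASS: exited $actual: $*"
}

# Stand-in for `nemoclaw onboard` (assumption, not the real CLI). It fails only
# when the proposed variable names the "policies" step.
fake_onboard() {
  if [ "${NEMOCLAW_E2E_FORCE_FAIL_AT_STEP:-}" = "policies" ]; then
    echo "forced failure at step: policies" >&2
    return 1
  fi
  echo "onboard complete"
}

# Without the variable the stand-in succeeds; with it, the run fails on demand.
( unset NEMOCLAW_E2E_FORCE_FAIL_AT_STEP; expect_exit 0 fake_onboard )
( export NEMOCLAW_E2E_FORCE_FAIL_AT_STEP=policies; expect_exit 1 fake_onboard )
```

In the real scripts, the stand-in would be replaced by the actual onboard invocation. Unlike NEMOCLAW_POLICY_MODE=invalid, the expected exit status would then depend only on a switch that exists for testing.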

Nightly failure evidence

Failed in all 3 recent nightly runs:

Related PRs


Consolidated from #446: onboard resumability

Issue #446 documented that nemoclaw onboard was not resumable — partial failures required full cleanup and restart. The core problems were:

  1. No checkpoint/resume — no way to skip already-completed steps after a mid-onboard failure
  2. Stale gateway blocks re-onboarding — gateway left running on port 8080 after failure, next attempt gets Error: Gateway already running on port 8080
  3. PATH not updated after install: "nemoclaw: command not found" immediately after install

Since #446 was filed, significant progress has been made:

What remains from #446 is ensuring this resume mechanism is tested end-to-end — which is exactly what test-onboard-resume.sh and test-onboard-repair.sh do, once their fault injection is fixed. The E2E coverage in this issue fully subsumes the remaining work from #446.


What it tests

Onboard error handling and input validation — the negative/edge-case paths that the current cloud-e2e happy-path onboard never exercises.

Bugs it would have caught (Apr 7–27)

Proposed test cases

| # | Scenario | Assert |
|---|----------|--------|
| 1 | NEMOCLAW_POLICY_MODE=restricted (valid tier name, invalid mode) | Graceful fallback to tier suggestions, not process.exit(1) |
| 2 | NEMOCLAW_POLICY_MODE=nonexistent | Same graceful fallback |
| 3 | Invalid API key format for NVIDIA provider | Rejection message, not stack trace |
| 4 | Valid Anthropic key (no nvapi- prefix) for Anthropic provider | Accepted without error |
| 5 | Onboard on port already in use (18789 held by another process) | User-friendly port conflict error, not raw JS stack trace |
| 6 | Non-interactive onboard with specific NEMOCLAW_POLICY_PRESETS | Verify presets actually applied (not silently dropped) |
| 7 | Non-interactive onboard with NEMOCLAW_PROVIDER=cloud + specific model | Verify model selection honored |
| 8 | Interrupted onboard → resume skips completed steps (from test-onboard-resume.sh) | Resume skips preflight/gateway/sandbox, completes from failure point |
| 9 | Missing sandbox → repair recreates it on resume (from test-onboard-repair.sh) | Resume detects missing sandbox and recreates it |
| 10 | Conflicting sandbox name on resume (from test-onboard-repair.sh) | Resume rejects mismatched sandbox name |
| 11 | Conflicting provider/model on resume (from test-onboard-repair.sh) | Resume rejects mismatched provider/model |
11 Conflicting provider/model on resume (from test-onboard-repair.sh) Resume rejects mismatched provider/model

Implementation

  1. Fix fault injection in test-onboard-resume.sh and test-onboard-repair.sh — replace NEMOCLAW_POLICY_MODE=invalid with a reliable mechanism (see options above)
  2. New self-contained script (test/e2e/test-onboard-negative-paths.sh) for scenarios 1–7 (~300–400 lines)
  3. New nightly job in .github/workflows/nightly-e2e.yaml with NVIDIA_API_KEY and NEMOCLAW_NON_INTERACTIVE=1

Effort: Medium — fault injection fix (~1 hour), new negative-path script (~half day), nightly YAML addition.

Related issues


Labels

  04-25-regression: Issues raised from the Apr 25 weekend regression analysis
  CI/CD: Use this label to identify issues with NemoClaw CI/CD pipeline or GitHub Actions.
  E2E: End-to-end testing (Brev infrastructure, test cases, nightly failures, and coverage gaps)
