Skip to content

fix(machine-validation): skip reboot when disabled#2488

Open
iExalt wants to merge 1 commit into
NVIDIA:mainfrom
iExalt:fix/skip-disabled-machine-validation-reboot
Open

fix(machine-validation): skip reboot when disabled#2488
iExalt wants to merge 1 commit into
NVIDIA:mainfrom
iExalt:fix/skip-disabled-machine-validation-reboot

Conversation

@iExalt

@iExalt iExalt commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Description

NICO currently enters the machine-validation reboot path even when machine validation is disabled, issuing an unnecessary host restart and increasing startup latency.

This change:

  • skips disabled machine validation directly from RebootHost, before issuing a host reboot
  • reuses one completion path for disabled validation from both RebootHost and MachineValidating
  • preserves the existing wait when validation is disabled while a reboot is already in flight
  • preserves the upstream guard for validation runs that are no longer active
  • makes site-explorer fixtures branch on the observed machine state instead of the validation configuration
  • adds regression coverage for both disabled-validation paths

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Related to #2486 feat: skip machine validation reboot when validation is disabled

Breaking Changes

  • N/A

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Validation performed against current NVIDIA/infra-controller:main:

  • git diff --check
  • cargo fmt --all -- --check
  • cargo test -p carbide-machine-controller (19 passed)
  • cargo test -p carbide-api-core --no-default-features --no-run

Additional Notes

  • N/A

@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a67e899d-bfc7-4483-b657-689e7fc72e80

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

This PR refactors machine validation skip behavior when disabled: introduces a reusable skip helper that marks validations as skipped and transitions hosts appropriately, integrates it into the validation state handler, updates test infrastructure to handle the new terminal state, and verifies via regression tests that reboot invariance is preserved.

Changes

Machine Validation Skip Behavior

Layer / File(s) Summary
Skip validation helper
crates/machine-controller/src/handler/machine_validation.rs
New skip_machine_validation async helper implements the skip workflow: marks the validation Skipped, updates last_machine_validation_list metrics, and transitions the host to HostInit::Discovered { skip_reboot_wait: true } via transactional DB mutation. Supports both handler paths via a single, auditable implementation.
Handler integration of skip logic
crates/machine-controller/src/handler/machine_validation.rs
Handler now delegates skip behavior to the helper rather than inlining it. In RebootHost arm, when machine validation is disabled, the handler early-returns the skip helper result. In MachineValidating arm, the previous inline skip block is replaced with a helper call, reducing duplication.
Test fixture state handling
crates/api-core/src/tests/common/api_fixtures/site_explorer.rs
Test fixtures host_state_controller_iterations and host_state_controller_iterations_with_machine_validation now recognize HostInit::Discovered { skip_reboot_wait: true } as a terminal state. Removes unused reboot_completed import, extends stop-condition logic, and refactors branching to distinguish between validation-completion vs. skip paths and route post-validation state transitions accordingly.
Test cases for skip behavior
crates/api-core/src/tests/machine_validation.rs
Adds MachineStateHandlerBuilder import. Updates test_machine_validation_disabled to snapshot last_reboot_requested before and after controller iteration, asserting invariance. Introduces test_machine_validation_disabled_waits_for_in_flight_reboot to verify that disabling validation mid-flight suppresses reboot requests and transitions the validation to "skipped" completion.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The PR exhibits moderate-to-high complexity due to: (1) dense state-machine logic in handler and fixture refactoring, (2) heterogeneous changes across three files with interdependencies, (3) new test infrastructure requiring careful inspection of state branching and assertions, and (4) the need to verify that skip semantics correctly suppress reboot behavior and mark validation status without regression. The implementation is clean and avoids duplication, but the state transitions and test harness updates warrant thorough verification.

Possibly related issues

  • #2486: This PR directly implements the requested behavior of marking machine validation runs as skipped and transitioning hosts to HostInit::Discovered { skip_reboot_wait: true } without requesting a reboot when machine validation is disabled, fulfilling the feature requirements.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 57.14% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the primary change: skipping reboot execution when machine validation is disabled, which directly addresses the core issue of unnecessary host restarts.
Description check ✅ Passed The description is comprehensive and directly related to the changeset, covering the problem statement, implementation approach, test coverage, and validation steps.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@iExalt iExalt force-pushed the fix/skip-disabled-machine-validation-reboot branch from 77f0979 to 3793d9f Compare June 11, 2026 21:40
@iExalt iExalt marked this pull request as ready for review June 11, 2026 21:41
@iExalt iExalt requested a review from a team as a code owner June 11, 2026 21:41
@iExalt

iExalt commented Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@iExalt iExalt force-pushed the fix/skip-disabled-machine-validation-reboot branch from 3793d9f to 958d9bd Compare June 11, 2026 22:11
@copy-pr-bot

copy-pr-bot Bot commented Jun 11, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@iExalt iExalt force-pushed the fix/skip-disabled-machine-validation-reboot branch from 958d9bd to aad6829 Compare June 12, 2026 00:17
@ajf ajf requested a review from sunilkumar-nvidia June 12, 2026 04:33
@ajf

ajf commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

@sunilkumar-nvidia please review

@ajf

ajf commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

/ok to test aad6829

@ajf

ajf commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

fyi @iExalt GitHub can't verify your signed commits belong to you, they show as unverified.

@github-actions

Copy link
Copy Markdown

@sunilkumar-nvidia

Copy link
Copy Markdown
Contributor

Thanks for the PR, this avoids unnecessary reboots when machine validation is disabled and should save time in skipped-validation flows.

@iExalt

iExalt commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

fyi @iExalt GitHub can't verify your signed commits belong to you, they show as unverified.

this is a problem when using GH to rebase, let me rebase locally

@iExalt iExalt force-pushed the fix/skip-disabled-machine-validation-reboot branch 12 times, most recently from dd34735 to 0321659 Compare June 12, 2026 20:26
@iExalt iExalt force-pushed the fix/skip-disabled-machine-validation-reboot branch 10 times, most recently from 489aa00 to 81bc0e4 Compare June 12, 2026 23:35
Signed-off-by: Clement Liaw <clliaw@nvidia.com>
@iExalt iExalt force-pushed the fix/skip-disabled-machine-validation-reboot branch from 81bc0e4 to 7a21474 Compare June 13, 2026 00:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants