Skip to content

feat(orchestrator): run ic-gateway as a side-car to the replica in cloud engines#10499

Draft
pierugo-dfinity wants to merge 24 commits into
masterfrom
pierugo/orchestrator/cloud-engine-ic-gateway
Draft

feat(orchestrator): run ic-gateway as a side-car to the replica in cloud engines#10499
pierugo-dfinity wants to merge 24 commits into
masterfrom
pierugo/orchestrator/cloud-engine-ic-gateway

Conversation

@pierugo-dfinity

@pierugo-dfinity pierugo-dfinity commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Cloud engines should be self-contained, but today their nodes rely on external routing to reach the replica. This PR runs ic-gateway as a side-car alongside the replica on each node, so the gateway terminates external traffic locally and proxies it to the replica.
Launching ic-gateway is still gated behind a flag and only happens on cloud engines, so there's no behavioral change for now.

Refactoring the orchestrator's process manager

The orchestrator's process management was written around a single process (the replica), with a parallel, partially-duplicated path for ic-boundary. Adding a third managed process the same way would have meant copying that logic again.
Instead, process management is now generic. The upgrade loop no longer knows what it's running — it just starts and stops a manager that owns the set of node processes. Adding a future process is a matter of describing how it's built and registering it; no changes to the upgrade loop.

Scope: this refactor is confined to the upgrade loop, i.e. processes whose lifetime is tied to a subnet. Unassigned nodes and API BNs are unaffected.

Testing

A new system test cloud_engine_ic_gateway_test (now still manual because the feature flag is disabled) hits api/v2/status on port 80 of each node, confirming ic-gateway is up and correctly proxying to the replica.

Implementation details for reviewers
  • The upgrade loop now holds a MultipleProcessesManager and calls start_all/stop_all. It's process-agnostic, except for stop_replica which is (arguably) still needed during recoveries.
  • All per-process logic (including ic-boundary) moved into processes.rs. Each Process declares how it's built via build(&Self::Config, Self::Args), separating static config from dynamic arguments that can change across the orchestrator's lifetime, likely derived from the registry.
  • ProcessManager<Process> centralizes the logging, metrics, and OrchestratorResult handling that was previously duplicated. MultipleProcessesManager holds several of these and decides what to start/stop (e.g. ic-gateway only on cloud engines).
  • ic-boundary is an exception as it has some additional logic when the node's domain changes. Keep that logic by introducing IcBoundaryManager, a wrapper over ProcessManager<IcBoundaryProcess> exposing ensure_ic_boundary_running_and_restarted_on_domain_change.

To add a new process: define NewProcessConfig and NewProcess implementing Process, add a ProcessManager<NewProcess> field to MultipleProcessesManager, and wire it into the start/stop methods.

@pierugo-dfinity pierugo-dfinity added the CI_ALL_BAZEL_TARGETS Runs all bazel targets label Jun 17, 2026
@github-actions github-actions Bot added the feat label Jun 17, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Not ready to approve

MultipleProcessesManager::start_all does not actually stop ic-gateway when the launch gate is disabled, contradicting the intended behavior and potentially leaving the sidecar running unexpectedly.

Pull request overview

Refactors the orchestrator’s subnet-scoped process supervision from “replica-only + special-case ic-boundary” into a generic multi-process manager, and introduces ic-gateway as an (currently gated) sidecar process for CloudEngine nodes so public API traffic can be terminated locally on port 80 and proxied to the replica.

Changes:

  • Replace replica-specific process management in the upgrade loop with MultipleProcessesManager / per-process ProcessManager.
  • Add ic-gateway process definition + GuestOS packaging/env config; add a new (manual) system test verifying /api/v2/status via port 80 on CloudEngine nodes.
  • Rename orchestrator start-attempts metric to be per-process (orchestrator_processes_start_attempts_total) and update affected tests.
File summaries
File Description
rs/tests/driver/src/driver/test_env_api.rs Adjust metrics threshold matching to support prefix-based metric checks.
rs/tests/driver/src/driver/group.rs Update default orchestrator metrics name and document prefix-based matching expectations.
rs/tests/consensus/subnet_splitting_test.rs Update expected orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_nns_failover_nodes_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_same_nodes_with_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_same_nodes_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_same_nodes_enable_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_with_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_provision_write_access_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_local_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_enable_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_large_no_upgrade_with_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_with_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_test.rs Update removed orchestrator metric name.
rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_enable_chain_keys_test.rs Update removed orchestrator metric name.
rs/tests/consensus/replica_determinism_test.rs Update expected orchestrator metric name.
rs/tests/consensus/orchestrator/node_reassignment_test.rs Update expected orchestrator metric name.
rs/tests/consensus/orchestrator/cloud_engine_ic_gateway_test.rs New system test validating CloudEngine health on port 80 through ic-gateway.
rs/tests/consensus/orchestrator/Cargo.toml Add reqwest/serde_cbor deps and register new system-test binary.
rs/tests/consensus/orchestrator/BUILD.bazel Add Bazel target for the new system test (tagged manual).
rs/tests/consensus/cup_explorer_test.rs Update expected orchestrator metric name.
rs/orchestrator/src/upgrade.rs Switch upgrade loop to start/stop a multi-process manager instead of replica-only management.
rs/orchestrator/src/registry_helper.rs Add subnet-type lookup helper for process policy decisions.
rs/orchestrator/src/processes.rs New module: process definitions + generic ProcessManager + MultipleProcessesManager including gated ic-gateway.
rs/orchestrator/src/process_manager.rs Refactor to ProcessRunner and ensure child processes are spawned as their own process-group leaders.
rs/orchestrator/src/orchestrator.rs Wire new process configs/managers; pass env file paths via args; integrate with dashboard/boundary management.
rs/orchestrator/src/metrics.rs Replace replica-only start metric with per-process start/stop counters.
rs/orchestrator/src/lib.rs Register the new processes module.
rs/orchestrator/src/error.rs Remove now-unused domain-name-missing error variant.
rs/orchestrator/src/dashboard.rs Display both replica and ic-gateway PIDs via MultipleProcessesManager.
rs/orchestrator/src/boundary_node.rs Delegate ic-boundary lifecycle to IcBoundaryManager.
rs/orchestrator/src/args.rs Add required args for boundary/gateway env files; make ic_binary_directory non-optional.
rs/ic_os/release/BUILD.bazel Add ic-gateway to released/stripped binaries.
ic-os/guestos/envs/recovery/BUILD.bazel Increase size thresholds to accommodate image growth.
ic-os/guestos/envs/prod/BUILD.bazel Increase size thresholds to accommodate image growth.
ic-os/guestos/defs.bzl Include ic-gateway binary in GuestOS image dependencies.
ic-os/components/guestos/share/ic-gateway.env New env file configuring ic-gateway (port 80, proxy to 8080, domain).
ic-os/components/guestos/ic-replica.service Add orchestrator flags for boundary/gateway env file paths.
ic-os/components/guestos.bzl Install ic-gateway.env into /opt/ic/share.
Cargo.lock Add lock entries for new test dependencies.
Cargo.Bazel.json.lock Add ic-gateway as a generated binary crate.
bazel/rust.MODULE.bazel Add crate annotation to generate ic-gateway binary via Bazel.

Copilot's findings

  • Files reviewed: 41/43 changed files
  • Comments generated: 4

Note

Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread rs/orchestrator/src/processes.rs
Comment thread rs/orchestrator/src/processes.rs
Comment thread rs/orchestrator/src/processes.rs Outdated
Comment on lines +1141 to 1147
// Assume the metrics to check are prefix-free. This allows to specify a metric name
// prefix to check all metrics with that prefix.
let max_value = metrics_to_check
.get(name.split('(').next().unwrap())
.copied()
.iter()
.find(|(metric_name, _)| name.starts_with(**metric_name))
.map(|(_, max_value)| *max_value)
.unwrap_or_default();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is precisely why the documentation of default_[replica/orchestrator]_metrics mentions that the metrics should be prefix-free. Though I'm happy to revisit if the owners (IDX) think otherwise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI_ALL_BAZEL_TARGETS Runs all bazel targets feat

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants