feat(orchestrator): run ic-gateway as a side-car to the replica in cloud engines#10499
feat(orchestrator): run ic-gateway as a side-car to the replica in cloud engines#10499pierugo-dfinity wants to merge 24 commits into
ic-gateway as a side-car to the replica in cloud engines#10499Conversation
There was a problem hiding this comment.
⚠️ Not ready to approve
MultipleProcessesManager::start_all does not actually stop ic-gateway when the launch gate is disabled, contradicting the intended behavior and potentially leaving the sidecar running unexpectedly.
Pull request overview
Refactors the orchestrator’s subnet-scoped process supervision from “replica-only + special-case ic-boundary” into a generic multi-process manager, and introduces ic-gateway as an (currently gated) sidecar process for CloudEngine nodes so public API traffic can be terminated locally on port 80 and proxied to the replica.
Changes:
- Replace replica-specific process management in the upgrade loop with
MultipleProcessesManager/ per-processProcessManager. - Add
ic-gatewayprocess definition + GuestOS packaging/env config; add a new (manual) system test verifying/api/v2/statusvia port 80 on CloudEngine nodes. - Rename orchestrator start-attempts metric to be per-process (
orchestrator_processes_start_attempts_total) and update affected tests.
File summaries
| File | Description |
|---|---|
| rs/tests/driver/src/driver/test_env_api.rs | Adjust metrics threshold matching to support prefix-based metric checks. |
| rs/tests/driver/src/driver/group.rs | Update default orchestrator metrics name and document prefix-based matching expectations. |
| rs/tests/consensus/subnet_splitting_test.rs | Update expected orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_nns_failover_nodes_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_same_nodes_with_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_same_nodes_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_same_nodes_enable_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_with_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_provision_write_access_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_local_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_no_upgrade_enable_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_large_no_upgrade_with_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_with_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/subnet_recovery/sr_app_failover_nodes_enable_chain_keys_test.rs | Update removed orchestrator metric name. |
| rs/tests/consensus/replica_determinism_test.rs | Update expected orchestrator metric name. |
| rs/tests/consensus/orchestrator/node_reassignment_test.rs | Update expected orchestrator metric name. |
| rs/tests/consensus/orchestrator/cloud_engine_ic_gateway_test.rs | New system test validating CloudEngine health on port 80 through ic-gateway. |
| rs/tests/consensus/orchestrator/Cargo.toml | Add reqwest/serde_cbor deps and register new system-test binary. |
| rs/tests/consensus/orchestrator/BUILD.bazel | Add Bazel target for the new system test (tagged manual). |
| rs/tests/consensus/cup_explorer_test.rs | Update expected orchestrator metric name. |
| rs/orchestrator/src/upgrade.rs | Switch upgrade loop to start/stop a multi-process manager instead of replica-only management. |
| rs/orchestrator/src/registry_helper.rs | Add subnet-type lookup helper for process policy decisions. |
| rs/orchestrator/src/processes.rs | New module: process definitions + generic ProcessManager + MultipleProcessesManager including gated ic-gateway. |
| rs/orchestrator/src/process_manager.rs | Refactor to ProcessRunner and ensure child processes are spawned as their own process-group leaders. |
| rs/orchestrator/src/orchestrator.rs | Wire new process configs/managers; pass env file paths via args; integrate with dashboard/boundary management. |
| rs/orchestrator/src/metrics.rs | Replace replica-only start metric with per-process start/stop counters. |
| rs/orchestrator/src/lib.rs | Register the new processes module. |
| rs/orchestrator/src/error.rs | Remove now-unused domain-name-missing error variant. |
| rs/orchestrator/src/dashboard.rs | Display both replica and ic-gateway PIDs via MultipleProcessesManager. |
| rs/orchestrator/src/boundary_node.rs | Delegate ic-boundary lifecycle to IcBoundaryManager. |
| rs/orchestrator/src/args.rs | Add required args for boundary/gateway env files; make ic_binary_directory non-optional. |
| rs/ic_os/release/BUILD.bazel | Add ic-gateway to released/stripped binaries. |
| ic-os/guestos/envs/recovery/BUILD.bazel | Increase size thresholds to accommodate image growth. |
| ic-os/guestos/envs/prod/BUILD.bazel | Increase size thresholds to accommodate image growth. |
| ic-os/guestos/defs.bzl | Include ic-gateway binary in GuestOS image dependencies. |
| ic-os/components/guestos/share/ic-gateway.env | New env file configuring ic-gateway (port 80, proxy to 8080, domain). |
| ic-os/components/guestos/ic-replica.service | Add orchestrator flags for boundary/gateway env file paths. |
| ic-os/components/guestos.bzl | Install ic-gateway.env into /opt/ic/share. |
| Cargo.lock | Add lock entries for new test dependencies. |
| Cargo.Bazel.json.lock | Add ic-gateway as a generated binary crate. |
| bazel/rust.MODULE.bazel | Add crate annotation to generate ic-gateway binary via Bazel. |
Copilot's findings
- Files reviewed: 41/43 changed files
- Comments generated: 4
Note
Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Assume the metrics to check are prefix-free. This allows to specify a metric name | ||
| // prefix to check all metrics with that prefix. | ||
| let max_value = metrics_to_check | ||
| .get(name.split('(').next().unwrap()) | ||
| .copied() | ||
| .iter() | ||
| .find(|(metric_name, _)| name.starts_with(**metric_name)) | ||
| .map(|(_, max_value)| *max_value) | ||
| .unwrap_or_default(); |
There was a problem hiding this comment.
This is precisely why the documentation of default_[replica/orchestrator]_metrics mentions that the metrics should be prefix-free. Though I'm happy to revisit if the owners (IDX) think otherwise.
Cloud engines should be self-contained, but today their nodes rely on external routing to reach the replica. This PR runs
ic-gatewayas a side-car alongside the replica on each node, so the gateway terminates external traffic locally and proxies it to the replica.Launching
ic-gatewayis still gated behind a flag and only happens on cloud engines, so there's no behavioral change for now.Refactoring the orchestrator's process manager
The orchestrator's process management was written around a single process (the replica), with a parallel, partially-duplicated path for
ic-boundary. Adding a third managed process the same way would have meant copying that logic again.Instead, process management is now generic. The upgrade loop no longer knows what it's running — it just starts and stops a manager that owns the set of node processes. Adding a future process is a matter of describing how it's built and registering it; no changes to the upgrade loop.
Scope: this refactor is confined to the upgrade loop, i.e. processes whose lifetime is tied to a subnet. Unassigned nodes and API BNs are unaffected.
Testing
A new system test
cloud_engine_ic_gateway_test(now stillmanualbecause the feature flag is disabled) hitsapi/v2/statuson port 80 of each node, confirmingic-gatewayis up and correctly proxying to the replica.Implementation details for reviewers
MultipleProcessesManagerand callsstart_all/stop_all. It's process-agnostic, except forstop_replicawhich is (arguably) still needed during recoveries.ic-boundary) moved intoprocesses.rs. EachProcessdeclares how it's built viabuild(&Self::Config, Self::Args), separating static config from dynamic arguments that can change across the orchestrator's lifetime, likely derived from the registry.ProcessManager<Process>centralizes the logging, metrics, andOrchestratorResulthandling that was previously duplicated.MultipleProcessesManagerholds several of these and decides what to start/stop (e.g.ic-gatewayonly on cloud engines).ic-boundaryis an exception as it has some additional logic when the node's domain changes. Keep that logic by introducingIcBoundaryManager, a wrapper overProcessManager<IcBoundaryProcess>exposingensure_ic_boundary_running_and_restarted_on_domain_change.To add a new process: define
NewProcessConfigandNewProcessimplementingProcess, add aProcessManager<NewProcess>field toMultipleProcessesManager, and wire it into the start/stop methods.