@Chr1st0ph3rTurn3r

No description provided.


## Download Failover Resiliency

SSR images can be downloaded from a variety of sources, depending on software access mode (eg. internet-only, prefer-conductor, conductor-only, offline-mode): the HA peer, both conductor nodes, artifactory, and the mist proxy to artifactory (cloud deployments only).

Should Mist be capitalized?


SSR images can be downloaded from a variety of sources, depending on software access mode (eg. internet-only, prefer-conductor, conductor-only, offline-mode): the HA peer, both conductor nodes, artifactory, and the mist proxy to artifactory (cloud deployments only).

To improve resiliency to network connectivity issues, the SSR queries available versions from all sources before beginning the download. It compiles a list of sources where the requested version is available and begins the download. If more than 50% of requests to a source fail within a window of 10 requests, the SSR marks that source unavailable and moves on to the next source. The following priority order is used for sources:

In my mind the size of the window is more of an implementation detail and may be subject to change based on tuning. We may want to be less specific about that in case we decide to adjust it in the future. But this may be fine too. Not sure how likely we are to need to adjust it
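
To make the failover rule under discussion concrete, here is a minimal Python sketch of a per-source failure window. It takes the values quoted in the doc text (a window of 10 requests, a 50% threshold) and keeps both as parameters, since the window size may be tuned. `SourceHealth`, `pick_source`, and the source names passed to them are hypothetical, and waiting for a full window before judging a source is an assumption, not confirmed SSR behavior.

```python
from collections import deque


class SourceHealth:
    """Tracks recent request outcomes for one download source (hypothetical helper)."""

    def __init__(self, window=10, threshold=0.5):
        # Defaults mirror the numbers quoted in the doc text; both are tunable.
        self.window = window
        self.threshold = threshold
        self.results = deque(maxlen=window)  # oldest outcome drops off automatically

    def record(self, ok):
        """Record one request outcome: True for success, False for failure."""
        self.results.append(ok)

    def unavailable(self):
        """True when more than `threshold` of the last `window` requests failed."""
        # Assumption: a source is only judged once a full window has been seen.
        if len(self.results) < self.window:
            return False
        failures = sum(1 for ok in self.results if not ok)
        return failures / self.window > self.threshold


def pick_source(sources, health):
    """Return the first source, in priority order, that is still available."""
    for name in sources:
        if not health[name].unavailable():
            return name
    return None  # every source has been marked unavailable
```

Because the window and threshold are constructor arguments, the doc could describe the behavior without committing to specific numbers.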


In the event that all sources have reached the threshold of consecutive failures and a download attempt has returned an error, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off.

When the timeout is enabled, the SSR waits for a configurable amount of time (default is 10800s) for the download to complete. When the timeout value is reached, the download is marked as **Failed** and the retry delay begins.

This is not quite accurate. The retry delay will begin once we have marked all download sources as unavailable, as described in the failover resilience section. If enabled, once this timeout is hit, the download will be entirely stopped and marked as a failure. Or in other words, the retries happen inside of this timeout, not after it.


### Sequenced HA Download

The SSR supports sequenced downloading; one node of an HA pair downloads an image from the remote repository, and the other node waits for it to complete. Once that download is complete, the second node downloads it from the first. When targeting an HA router, the download is sequenced by default. To disable this sequencing, use `request system software download simultaneous disable`.

One note about "the second node downloads it from the first": the peer is the first place that an HA router will attempt to download from, so in most cases this is true. However, if the connection to the peer goes down for any reason, the router moves on and continues downloading from the conductor or remote sources. Not sure if that needs to be clarified or not.

Does the download happen over the HA sync connection or the HA fabric?

I believe it's the HA sync connection


## Configuration

Three components: Onboarding conductor, router, Operational conductor.

This is a customer-specific topology. We shouldn't limit this doc to just this use case. The doc should only talk about the router and conductor.


The next step in the process is to generate an onboarding token from conductor Web interface, command line, or using APIs. The generated tokens are signed by the conductor’s private key so that they cannot be altered once generated. The SSR supports two modes; Authority Wide and Router Specific tokens. These are mutually exclusive and are defined in the configuration.

#### Authority-Wide Tokens

This concept is removed from the FS and should be deleted from the doc. We will only support per router tokens.

@Chr1st0ph3rTurn3r requested review from BenMatase and agrawalkaushik and removed request for plessard128 November 24, 2025 18:13

@BenMatase left a comment

Feels like there is some duplicate information in the SCO doc.


### Prerequisites

- The `secure-conductor-onboarding mode` must be enabled

This makes it sound like the mode exists only at the authority level. We don't have a mode at the authority level at this time.


To provide a secure and mutually authenticated onboarding mechanism, the following information must be configured.

- Pre-shared key: The onboarding pre-shared key is a 48-character alpha-numeric string, configured at the authority or the router level. This key is mandatory for the SCO process.

Not at the authority level for now


- Pre-shared key: The onboarding pre-shared key is a 48-character alpha-numeric string, configured at the authority or the router level. This key is mandatory for the SCO process.
- Conductor Public certificate: A public-private key certificate.
- Conductor CA certificate: Optionally, you can configure a public certificate signed by a preferred CA signing authority.

Not optional

After the user generates an onboarding token, enter the token and other onboarding details in the onboarding UI or using CLI commands. There are two methods to onboard a router:
- Using the Command line: `secure-conductor-onboarding-token` command and `onboarding-config.json`.

- Using the Command line: `create secure-conductor-onboarding token` command and `onboarding-config.json`.

4. The router connects to the conductor over port 930 using the SSH keys exchanged in previous steps.
5. The router is prepped and initialized by the conductor. During this process, the system goes through the reboot cycle.
Once the secure SSH tunnels are established, the SCO workflow concludes. All future communication between the router and conductor will occur on standard SSR to conductor ports such as 930, 4505, 4506, etc.

If SCO happens, won't use 4505/4506 from that point on. Everything is over 930

`configure authority router secure-conductor-onboarding pre-shared-secret`
The pre-shared secret is a 48-character alpha-numeric string. When enabled, any empty PSK will auto generate a random 48-byte alphanumeric string using the FIPS-approved, highly secure DRBG function from OpenSSL. Once generated, the key does not automatically change. It can be updated by the user if necessary.

This is not complete yet

### Token Contents
The next step in the process is to generate an onboarding token from the conductor Web interface, command line, or using APIs. The generated tokens are signed by the conductor’s private key so that they cannot be altered once generated. The SSR supports two modes; Authority-wide and Router-specific tokens. These are mutually exclusive and are defined in the configuration.

I think this doc needs to be scrubbed of "authority wide" tokens for now

The following parameters are required, and are configured at the Router level.
`configure authority router secure-conductor-onboarding mode`

This might not match the func spec exactly, but the router-level path is `configure authority router system secure-conductor-onboarding`. This applies to the other paths in the doc.

### Auto-resume Download on WAN Failures

In the event that all sources have reached the threshold of consecutive failures and a download attempt has failed, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off.
In the event that all sources have reached the threshold of consecutive failures and a download attempt has returned an error, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off. Use the `software-update download enable-timeout` command to enable the retry feature.

The enable-timeout field is separate from retries. The only thing it enables is the timeout described in the next paragraph, and retries will happen regardless of whether the timeout is enabled

In the event that all sources have reached the threshold of consecutive failures and a download attempt has returned an error, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off. Use the `software-update download enable-timeout` command to enable the retry feature.

When the timeout is enabled, the SSR waits for a configurable amount of time (default is 10800s) for the download to complete. When the timeout value is reached, the download is marked as **Failed** and the retry delay begins.
When the timeout is enabled (software-update download enable-timeout true) the SSR will wait for a configurable amount of time (default is 10800s) for the download to complete. If the timeout value is reached without successfully downloading the software, the download is marked as "Failed".

Is it worth noting that the timeout is enabled by default?

The retry delay time is the longest time to wait between retry attempts. For example, the initial retry delay starts at 30 seconds. With each failure the delay is increased exponentially. However, when that calculated value reaches the maximum retry delay time, successive wait times for additional attempts do not exceed the maximium retry delay time. The default is 3600 seconds. A maximum number of times to retry can also be configured.

The retry timeout can be disabled. If it is disabled, the download will retry indefinitely.
If the retry timeout is disabled, the download will retry indefinitely

As mentioned above, the timeout is a separate mechanism from the retries, so I wouldn't necessarily describe it as a retry timeout. And the download would only retry indefinitely if the timeout is disabled and `attempts` is also configured to 0.
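
As a rough illustration of the retry delay described in the doc text, the sketch below generates a capped exponential schedule. The 30-second starting delay, the 3600-second cap, and the default of 10 attempts come from the quoted text. Doubling on each failure is an assumption (the doc only says the delay increases exponentially), and `retry_delays` is a hypothetical name.

```python
def retry_delays(initial=30, max_delay=3600, attempts=10):
    """Yield the wait, in seconds, before each retry.

    The delay doubles after every failure (an assumed growth factor) and
    never exceeds max_delay.
    """
    delay = initial
    for _ in range(attempts):
        yield min(delay, max_delay)
        delay *= 2


# With the defaults, the schedule reaches the cap on the eighth retry.
schedule = list(retry_delays())
```

Under these assumptions the first waits are 30, 60, and 120 seconds, and every wait from the eighth onward is held at 3600 seconds.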


### Sequenced HA Download

The SSR supports sequenced downloading; one node of an HA pair downloads an image from the remote repository, and the other node waits for it to complete. Once that download is complete, the second node downloads it from the first. When targeting an HA router, the download is sequenced by default. To disable this sequencing, use `request system software download simultaneous disable`.

Update: I ended up making the download unsequenced by default. I may change that in the future, but in the beta we're giving Swift, it will be unsequenced.
To do a sequenced download, you would use `request system software download router RouterName version SSR-X.Y.Z sequenced`.

After the user generates an onboarding token, enter the token and other onboarding details in the onboarding UI or using CLI commands. There are two methods to onboard a router:
- Using the Command line: `create secure-conductor-onboarding-token` command and `onboarding-config.json`.

The command still needs to be fixed

Contributor Author

What is wrong with it? I copied your command from the earlier review. Am I missing something?

To enable this feature on the conductor, verify the following:
- The `secure conductor onboarding mode` should not be disabled (see above).

This line should be removed. The conductor/whole authority doesn't have a mode

The CA certificate is read from disk at the location given in `secure-conductor-onboarding ca-certificate`.
## Token Management

This section is a dup of the Token Creation section and can be removed

In the event that all sources have reached the threshold of consecutive failures and a download attempt has returned an error, the SSR can be configured to wait for a specified amount of time and then retry the download. If a connection is successfully made, the download will resume where it left off.

When the timeout is enabled (software-update download enable-timeout true) the SSR will wait for a configurable amount of time (default is 10800s) for the download to complete. If the timeout value is reached without successfully downloading the software, the download is marked as "Failed".
The timeout is enabled by default (`software-update download enable-timeout true`). The SSR waits for a configurable amount of time (default is 10800s) for the download to complete. If the timeout value is reached without successfully downloading the software, the download is marked as "Failed".

This is accurate, but something I hadn't thought of when reviewing before is that the retry configuration in the paragraph below is probably more significant than the timeout configuration, so I might swap the two paragraphs.

When the timeout is enabled (software-update download enable-timeout true) the SSR will wait for a configurable amount of time (default is 10800s) for the download to complete. If the timeout value is reached without successfully downloading the software, the download is marked as "Failed".
The timeout is enabled by default (`software-update download enable-timeout true`). The SSR waits for a configurable amount of time (default is 10800s) for the download to complete. If the timeout value is reached without successfully downloading the software, the download is marked as "Failed".

The retry delay time is the longest time to wait between retry attempts. For example, the initial retry delay starts at 30 seconds. With each failure the delay is increased exponentially. However, when that calculated value reaches the maximum retry delay time, successive wait times for additional attempts do not exceed the maximium retry delay time. The default is 3600 seconds. A maximum number of times to retry can also be configured.

typo in maximium


If the retry timeout is disabled, the download will retry indefinitely

Use the command `configure authority router system software-update download enable-timeout [enabled]` to enable auto-resume. The command parameters are listed below:

The enable-timeout field doesn't really enable auto-resume. It's just a way you can tune the behavior to meet your needs. Maybe something along the lines of this would be more accurate?

Use the command `configure authority router system software-update download` to adjust the download retry behavior. The command parameters are listed below:

- `enable-timeout`: True/false, default is true. This enables a time limit for the overall download.
- `timeout`: Amount of time in seconds that the SSR waits for the software download to complete. When the timeout value is reached the download is marked as **Failed**, and the retry delay begins. The default download wait time is 10800s. Range is 1800s - 604800s.
- `attempts`: The maximum number of attempts to download before considering the download as failed. If set to 0, the SSR will retry the download until the timeout is hit. Default is 10.
- `max retry delay`: The maximum amount of time in seconds to wait in between retry attempts. The retry delay will start off low and back off exponentially up to this duration. Range is 0 to 86400s. Default is 3600s.

maximum-retry-delay
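
The review comments describe how these parameters interact: retries happen inside the overall timeout, `attempts` set to 0 removes the attempt limit, and the download only retries indefinitely when the timeout is also disabled. The Python sketch below models that reading. It is not the SSR implementation; `download_with_retries` and `try_once` are hypothetical names, the 30-second initial delay and doubling factor are assumptions, and the clock and sleep functions are injectable so the logic can be exercised without waiting.

```python
import time


def download_with_retries(try_once, enable_timeout=True, timeout=10800,
                          attempts=10, max_retry_delay=3600, initial_delay=30,
                          now=time.monotonic, sleep=time.sleep):
    """Return True on success, False once the attempt limit or timeout is hit."""
    # The timeout bounds the whole download, including all retry waits.
    deadline = now() + timeout if enable_timeout else None
    delay = initial_delay
    attempt = 0
    while True:
        attempt += 1
        if try_once():
            return True
        # attempts == 0 means no attempt limit; only the timeout can stop us.
        if attempts and attempt >= attempts:
            return False
        wait = min(delay, max_retry_delay)
        if deadline is not None and now() + wait >= deadline:
            return False  # the overall timeout would expire during this wait
        sleep(wait)
        delay *= 2  # assumed exponential growth factor
```

With `enable_timeout=False` and `attempts=0`, neither exit condition applies, which matches the case the reviewer describes as retrying indefinitely.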

MichaelBaj previously approved these changes Dec 11, 2025
4. fscrypt uses the FEK to automatically unlock the necessary encrypted directories.

This systemd service handles the subsequent boots of the SSR after Configuration Integrity has been enabled. It runs a series of integrity checks, and identifies when the system is ready to continue operation after successful unlocking of the encrypted directories. When it is run, it performs the following sequence:
If any of these steps fail, it is interpreted as an integrity event. Network activities are blocked. An emergency log is generated and broadcast to all consoles on the system that the system integrity is compromised and it must be reprovisioned. The SSR will repeatedly try to start the integrity service to unlock the encrypted directories and fail, each time writing the emergency log.

extra space after "system"


- The `secure-conductor-onboarding` must be enabled
- The `secure-conductor-onboarding public-key` field must be configured
- The `secure-conductor-onboarding ca-certificate` field must be configured

We have added config time validation that the conductor nodes must also have asset ids configured if SCO is enabled. Not sure if we want to call it out here.

`configure authority router system secure-conductor-onboarding mode`
- `disabled`: Default is true, must be false to enable.
- `psk-only`: Configured on devices with no TPM, but which require the Secure Conductor Onboarding workflow.

`psk-only` has been removed as an option. Now `weak` will generate a self-signed cert per authentication attempt for non-TPM devices.

To read the EK from the public cloud instance, run `tpm2_readpublic -c 0x81010001 -f DER -o /dev/stdout -Q | base64 -w0` and configure the contents in the endorsement-key field above.
:::
- Disable salt state on conductor.

This section can be removed. 4505 and 4506 will now be automatically closed after the SCO is enabled on the conductor and the conductors are restarted.

- `weak`: This setting enables SCO but allows the router to use a self-signed certificate. This conductor will skip the CA certificate validation for this router.
- `strong`: On SSR devices manufactured with a device ID (SSR400/SSR440), `strong` mode ensures that the asset-id matches the serial number field in the subject line of the router’s public certificate. For vTPM workflows, the router’s endorsement key must match the `endorsement-key` configuration.
`configure authority router system secure-conductor-onboarding pre-shared-secret`

This parameter is no longer required. It will be auto generated if not specified

- For devices with a built-in dev-id certificate
```
config authority router router1 system secure-conductor-onboarding mode strong
config authority router router1 system secure-conductor-onboarding pre-shared-secret (removed)

Do we want to call out that this is now optional?

- For Public cloud VMs with vTPM
```
config authority router router1 system secure-conductor-onboarding mode strong
config authority router router1 system secure-conductor-onboarding pre-shared-secret (removed)

Same here, this config is now optional

### Known Caveats
- During SCO onboarding of the router in an HA deployment, both the conductor nodes should be online and able to talk to each other.

This is no longer a caveat

exit
```
If any checks fail, the `create system connectivity` command returns an error explaining why. This command can be run as many times as needed for each node. All information to form the token is present in the configuration.

Suggested change
If any checks fail, the `create system connectivity` command returns an error explaining why. This command can be run as many times as needed for each node. All information to form the token is present in the configuration.
If any checks fail, the `create secure-conductor-onboarding token` command returns an error explaining why. This command can be run as many times as needed for each router. All information to form the token is present in the configuration.

- Enable ssh-only for asset resiliency.
`configure authority asset-connection-resiliency ssh-only true `
- Enable SCO for each router.

Suggested change
- Enable SCO for each router.
- Enable SCO for each router config on the conductor.

Don't love my wording, but I had someone confuse this as this config needs to be applied on the router itself. How can we make that more clear that this config is applied on the conductor?

:::note
In the current beta delivery (7.1.3-1r2) this step must be performed to disable ports 4505 and 4506 so any devices not using this feature will fail to onboard to the conductor.
Ports 4505 and 4506 are automatically closed after SCO is enabled on the conductor and the conductor is restarted.

I feel like this implies that the conductor is automatically restarted. Maybe

Ports 4505 and 4506 are automatically closed after SCO is enabled on the conductor once a user restarts the conductors.

not sure

`configure authority router system secure-conductor-onboarding mode`
- `disabled`: Default is true, must be false to enable.
- `psk-only`: Configured on devices with no TPM, but which require the Secure Conductor Onboarding workflow.

This line should be removed
