Commit 30fda79

New CC token verification mechanism (#3829)
### Summary

This PR introduces a new confidential computing token verification mechanism to replace the previous job-based verification approach. Previously, the verification mechanism was tied to specific jobs, which required generating a new set of tokens for each new job. This approach was inefficient and error-prone. The new mechanism provides a persistent, cross-site token validation system that ensures secure and consistent communication between components.

### Implementation Details

1. Client Registration

   When a client sends a registration request to the server:
   - The client includes its token in the request.
   - The server validates the client’s token.
   - The server responds with its own token.
   - The client validates the server’s token.

2. Periodic Cross-Site Validation

   Each site (server or client) periodically triggers a cross-site token validation event (e.g., every 5–10 minutes):
   - The initiating site (e.g., siteA) starts the validation event.
   - All sites, including siteA, generate new tokens for this event.
   - siteA validates tokens from all participating sites.

3. Failure Handling

   If any token validation fails:
   - The affected site will shut itself down.
   - Optionally, it may attempt to trigger a system-wide shutdown to prevent inconsistent states.

4. Benefits

   - Removes dependency on per-job token generation.
   - Enables periodic, automated validation to detect and isolate compromised sites.

### Types of changes

<!--- Put an `x` in all the boxes that apply, and remove the not applicable items -->
- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
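
### Illustrative sketch (not part of the commit)

The registration handshake and the periodic cross-site validation described in the summary can be sketched roughly as follows. This is a minimal illustration, not the NVFlare implementation: `Site`, `register`, `cross_site_validation`, and the toy token format are all hypothetical names invented here, and the `verify` callable stands in for a real attestation check. The sketch also assumes that the site detecting a failed validation is the one that shuts itself down, which is only one reading of "the affected site".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for a real attestation check (not an NVFlare API).
TokenVerifier = Callable[[str], bool]


@dataclass
class Site:
    """Toy model of a CC-enabled site (server or client)."""

    name: str
    verify: TokenVerifier      # decides whether a peer's token is trusted
    shut_down: bool = False
    _counter: int = 0

    def new_token(self) -> str:
        """Generate a fresh token for one validation event (toy format)."""
        self._counter += 1
        return f"{self.name}:token-{self._counter}"

    def shutdown(self) -> None:
        """Mark the site as stopped instead of continuing untrusted."""
        self.shut_down = True


def register(client: Site, server: Site) -> bool:
    """Mutual check: the server verifies the client, then the client verifies the server."""
    if not server.verify(client.new_token()):
        return False           # server rejects the registration
    if not client.verify(server.new_token()):
        client.shutdown()      # assumption: client stops if it cannot trust the server
        return False
    return True


def cross_site_validation(initiator: Site, sites: List[Site]) -> Dict[str, bool]:
    """One validation event: every site issues a new token, the initiator checks them all."""
    results = {site.name: initiator.verify(site.new_token()) for site in sites}
    if not all(results.values()):
        initiator.shutdown()   # assumption: the detecting site shuts itself down
    return results


if __name__ == "__main__":
    trust_all = lambda token: True
    server = Site("server", trust_all)
    site_1 = Site("site-1", trust_all)
    print(register(site_1, server))
    print(cross_site_validation(server, [server, site_1]))
```

In a real deployment the validation event would be driven by a timer (the summary mentions every 5–10 minutes) and the tokens would travel over the system's messaging layer; both are omitted here to keep the control flow visible.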
1 parent 02ae6e1 commit 30fda79

17 files changed

+786
-470
lines changed

docs/user_guide/confidential_computing/attestation.rst

Lines changed: 37 additions & 19 deletions
@@ -76,43 +76,61 @@ You can configure the CC attestation components during the provision step. See t
 Runtime Behavior
 ================
 
-The attestation workflow consists of several phases during job lifecycle:
+
+The Confidential Computing (CC) attestation workflow establishes continuous, system-wide trust between all federated learning participants.
 
 1. System Bootstrap
 -------------------
 
-When a participant (server or client) starts up, the ``CCManager`` responds to the ``EventType.SYSTEM_BOOTSTRAP`` event by generating its own CC token using the configured ``CCAuthorizers``.
+When the system starts, each CC-enabled site (server or client) initializes its confidential computing components and generates a CC token that identifies its trusted environment.
+
 
 2. Client Registration
 ----------------------
 
-When a client registers with the server, it includes its CC token as part of the registration data. If the registration is successful, the server collects and stores the client's CC token.
+During client registration:
+
+- The client sends its token to the server.
+
+- The server verifies the client’s token and responds with its own.
+
+- The client then validates the server’s token.
+
+This mutual verification ensures both sides trust each other before participating in any job.
 
-The server's ``CCManager`` maintains both its own CC token and the tokens of all registered clients.
 
-3. Job Deployment Verification
--------------------------------
+3. Continuous Cross-Site Validation
+-----------------------------------
 
-Once a job is submitted and scheduled for deployment, the server verifies the CC tokens of the clients listed in the job's deployment map, using its own security policy.
+After startup, all sites periodically perform cross-site token validation:
 
-If all client tokens in the deployment map pass verification, the server sends the verified tokens to those clients for peer verification.
+Each site generates new CC tokens at regular intervals.
 
-4. Peer Verification
---------------------
+Sites exchange tokens through a secure communication channel.
 
-Each client evaluates the received CC tokens (including the server's token and other clients' tokens) against its own security policy to decide whether it trusts the other participants.
+Every participant validates the tokens of all others.
 
-Based on this evaluation, the client may choose to accept or reject participation in the job.
+If any CC-enabled site fails token validation, the system will shut down to maintain a trusted environment.
+Sites that are not CC-enabled are skipped during attestation checks.
 
-If a client declines to join the job, the server excludes it from deployment.
 
-5. Job Scheduling
+4. Job Scheduling
 -----------------
 
-Finally, the server's job scheduler determines whether the job has sufficient resources to proceed. It finalizes the job's status based on:
+Before jobs run, the server confirms that all CC-enabled participants have valid, verified tokens.
+If validation fails, the system shuts down to prevent untrusted operation.
+Jobs involving untrusted code (for example, BYOC) are blocked in CC mode.
+
+5. Summary
+----------
+
+The attestation workflow provides:
+
+- Continuous, system-wide token verification
+
+- Mutual trust between server and clients
+
+- Automatic shutdown on attestation failure
 
-- Resource availability
-- Number of participants that passed verification
-- Any defined retry policies
+This ensures that all confidential computing participants operate only within secure and attested environments.
 
-This multi-stage verification process ensures that all participants in a federated learning job operate in trusted, attested environments.

docs/user_guide/confidential_computing/cc_deployment_guide.rst

Lines changed: 1 addition & 0 deletions
@@ -23,6 +23,7 @@ This guide covers the following deployment configuration:
 
 Prerequisites
 =============
+For a complete and thorough setup guide covering Hardware IT, Host OS Administration, and VM Administration, please refer to [NVIDIA's Deployment Guide for SecureAI](https://docs.nvidia.com/cc-deployment-guide-snp.pdf)
 
 Hardware Requirements
 ---------------------

examples/advanced/cc_provision/README.md

Lines changed: 6 additions & 0 deletions
@@ -140,3 +140,9 @@ tensorflow
 safetensors
 nv_attestation_sdk
 ```
+
+## 8. Notes on re-building initramfs with CVM image builder
+
+1. Before re-building the initramfs for the CVM, remove the ``initrd.img`` file from the ``image_builder/base_images/`` directory.
+This ensures the Image Builder regenerates a fresh initramfs during the build process.
+

examples/advanced/cc_provision/docker/Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -11,5 +11,5 @@ COPY code/ /local/custom
 COPY requirements.txt .
 RUN pip install -r requirements.txt
 
-ENTRYPOINT ["/user_config/nvflare/startup/sub_start.sh", "--verify"]
+ENTRYPOINT ["/user_config/nvflare/startup/sub_start.sh", "--once", "--verify"]
 

examples/advanced/cc_provision/jobs/hello-pt_cifar10_fedavg/app_server/config/config_fed_server.json

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@
 "id": "controller",
 "path": "nvflare.app_common.workflows.fedavg.FedAvg",
 "args": {
-"num_clients": 1,
+"num_clients": 2,
 "num_rounds": 2
 }
 }

examples/advanced/cc_provision/jobs/hello-pt_cifar10_fedavg/app_site-1/config/config_fed_client.json

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
 "executor": {
 "path": "nvflare.app_opt.pt.in_process_client_api_executor.PTInProcessClientAPIExecutor",
 "args": {
-"task_script_path": "/local/custom/hello-pt_cifar10_fl.py"
+"task_script_path": "/local/custom/client.py"
 }
 }
 }
Lines changed: 4 additions & 3 deletions
@@ -1,13 +1,14 @@
 {
 "name": "hello-pt_cifar10_fedavg",
 "resource_spec": {},
-"min_clients": 1,
+"min_clients": 2,
 "deploy_map": {
 "app_server": [
 "server"
 ],
 "app_site-1": [
-"site-1"
+"site-1",
+"site-2"
 ]
 }
-}
+}

examples/advanced/cc_provision/project.yml

Lines changed: 0 additions & 9 deletions
@@ -40,15 +40,6 @@ builders:
 # if not set, no app_validator is included in fed_server.json
 # app_validator: PATH_TO_YOUR_OWN_APP_VALIDATOR
 
-# download_job_url is set to http://download.server.com/ as default in fed_server.json. You can override this
-# to different url.
-# download_job_url: http://download.server.com/
-
 - path: nvflare.lighter.impl.cert.CertBuilder
 - path: nvflare.lighter.cc_provision.impl.cc.CCBuilder
 - path: nvflare.lighter.impl.signature.SignatureBuilder
-packager:
-path: nvflare.lighter.cc_provision.impl.onprem_packager.OnPremPackager
-args:
-# this needs to be replace with the real path of the image build scripts
-build_image_cmd: build_cvm_image.sh

nvflare/apis/fl_constant.py

Lines changed: 2 additions & 0 deletions
@@ -73,6 +73,7 @@ class ReservedKey(object):
 WORKSPACE_ROOT = "__workspace_root__"
 APP_ROOT = "__app_root__"
 CLIENT_NAME = "__client_name__"
+CLIENT_TYPE = "__client_type__"
 TASK_NAME = "__task_name__"
 TASK_DATA = "__task_data__"
 TASK_RESULT = "__task_result__"
@@ -125,6 +126,7 @@ class FLContextKey(object):
 EVENT_SCOPE = ReservedKey.EVENT_SCOPE
 EXCEPTIONS = ReservedKey.EXCEPTIONS
 CLIENT_NAME = ReservedKey.CLIENT_NAME
+CLIENT_TYPE = ReservedKey.CLIENT_TYPE
 WORKSPACE_ROOT = ReservedKey.WORKSPACE_ROOT
 CURRENT_RUN = ReservedKey.RUN_NUM
 APP_ROOT = ReservedKey.APP_ROOT
