```
▄████▄ ███████╗██████╗ ██╗ ██╗███████╗███╗ ███╗███████╗██████╗ █████╗ ██╗ ███╗ ███╗██╗
██▀██▀██ ██╔════╝██╔══██╗██║ ██║██╔════╝████╗ ████║██╔════╝██╔══██╗██╔══██╗██║ ████╗ ████║██║
██ ██ ██ █████╗ ██████╔╝███████║█████╗ ██╔████╔██║█████╗ ██████╔╝███████║██║ ██╔████╔██║██║
████████ ██╔══╝ ██╔═══╝ ██╔══██║██╔══╝ ██║╚██╔╝██║██╔══╝ ██╔══██╗██╔══██║██║ ██║╚██╔╝██║██║
██▄██▄██ ███████╗██║ ██║ ██║███████╗██║ ╚═╝ ██║███████╗██║ ██║██║ ██║███████╗██║ ╚═╝ ██║███████╗
▀ ▀▀ ▀ ╚══════╝╚═╝ ╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝╚═╝ ╚═╝╚══════╝╚═╝ ╚═╝╚══════╝
```
Confidential AI inference with hardware-backed attestation — multi-cloud
Run AI models where prompts and weights stay encrypted — even if the host is compromised. Deploys on AWS Nitro Enclaves, GCP Confidential Space (Intel TDX), and GPU TEEs (NVIDIA H100 CC-mode).
| Problem | Solution |
|---|---|
| Cloud hosts can see your data | TEE isolation — data decrypted only inside the enclave |
| "Trust me" isn't enough | Cryptographic attestation — verify code before sending secrets |
| No audit trail | Execution receipts — proof of what code processed your data |
Built for: Defense, GovCloud, Finance, Healthcare — anywhere "good enough" security isn't.
EphemeralML now includes AIR v1 (Attested Inference Receipt), a standards-aligned receipt format for proving a single AI inference happened in an attested confidential environment.
Naming / standards note:

- AIR here means Attested Inference Receipt (EphemeralML), not the IHE Radiology AI Results (AIR) profile.
- AIR v1 is an application-specific COSE/CWT + EAT-profile receipt format for confidential AI inference, including AI provenance claims such as `model_id`/`model_hash` and request/response hash binding.
- AIR v1 is not an implementation of IETF EAR. AIR v1 is workload-emitted execution evidence; EAR is verifier-emitted attestation results. They are complementary in a RATS-based architecture.
- Spec entrypoint: `spec/v1/README.md`
- Interop quick start: `spec/v1/interop-kit.md`
- CDDL schema: `spec/v1/cddl/air-v1.cddl`
- Conformance vectors: `spec/v1/vectors/`
- Implementation status / known gaps: `spec/v1/implementation-status.md`
AIR v1 is single-inference only (pipeline proof chaining is planned for vNEXT).
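The core idea of an AIR receipt — binding the exact request and response bytes to a model identity — can be sketched in a few lines. This is an illustration only: the normative field names and encoding live in `spec/v1/cddl/air-v1.cddl`, a std-library hasher stands in for SHA-256, and the Ed25519 signing and CBOR canonicalization steps are omitted.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative receipt shape; real AIR v1 claims are CBOR/CWT-encoded
// and Ed25519-signed inside the attested enclave.
#[derive(Debug)]
struct AirReceiptSketch {
    model_id: String,
    model_hash: u64,    // hash of the model artifact
    request_hash: u64,  // binds the exact input bytes
    response_hash: u64, // binds the exact output bytes
}

// Stand-in digest (real AIR uses a cryptographic hash, e.g. SHA-256).
fn digest(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    bytes.hash(&mut h);
    h.finish()
}

fn make_receipt(model_id: &str, model: &[u8], req: &[u8], resp: &[u8]) -> AirReceiptSketch {
    AirReceiptSketch {
        model_id: model_id.to_string(),
        model_hash: digest(model),
        request_hash: digest(req),
        response_hash: digest(resp),
    }
}

// Verification recomputes hashes from the bytes the client already holds.
fn binds(r: &AirReceiptSketch, req: &[u8], resp: &[u8]) -> bool {
    r.request_hash == digest(req) && r.response_hash == digest(resp)
}

fn main() {
    let r = make_receipt("minilm-l6", b"weights", b"prompt", b"embedding");
    assert!(binds(&r, b"prompt", b"embedding"));
    assert!(!binds(&r, b"tampered prompt", b"embedding"));
    println!("receipt binds request/response: ok");
}
```

Because the receipt also carries the enclave attestation, a verifier can check both "this exact input produced this exact output" and "it happened inside measured code."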
```
┌──────────────────────────────────────────┐
│ Pipeline Orchestrator │
┌─────────┐ HPKE │ ┌─────────┐ SecureChannel ┌────────┐ │
│ Client │◄───────────►│ │ Host │◄──────────────►│Enclave │ │
└─────────┘ encrypted │ │ (blind │ attestation- │Stage 0 │ │
│ │ relay) │ bound AEAD └────────┘ │
│ └─────────┘ │
└──────────────────────────────────────────┘
│ │ NSM
│ S3 ▼
┌──────┴──────┐ ┌───────────────┐
│ Encrypted │ │ AWS KMS │
│ Models │ │ (key release) │
└─────────────┘ └───────────────┘
```
```
┌─────────┐ TDX-attested ┌─────────────────────────────────────────┐
│ Client │◄────────────────►│ GCP Confidential Space CVM (TDX) │
└─────────┘ SecureChannel │ ┌───────────────────────────────────┐ │
│ │ EphemeralML Container │ │
│ │ - TDX attestation (configfs-tsm) │ │
│ │ - Inference + receipt signing │ │
│ │ - Direct HTTPS to GCS / Cloud KMS │ │
│ └───────────────────────────────────┘ │
└─────────────────────────────────────────┘
│ │ TDX quote
│ GCS ▼
┌──────┴──────┐ ┌──────────────────┐
│ Encrypted │ │ Cloud KMS (WIP) │
│ Models │ │ (key release) │
└─────────────┘ └──────────────────┘
```
```
┌─────────┐ TDX-attested ┌──────────────────────────────────────────────┐
│ Client │◄────────────────►│ GCP Confidential Space CVM (TDX + H100 CC) │
└─────────┘ SecureChannel │ ┌────────────────────────────────────────┐ │
│ │ EphemeralML Container (CUDA 12.2) │ │
│ │ - TDX attestation (configfs-tsm) │ │
│ │ - GGUF model loaded from GCS │ │
│ │ - GPU inference (candle-cuda, H100) │ │
│ │ - Receipt signing (Ed25519) │ │
│ └────────────────────────────────────────┘ │
└──────────────────────────────────────────────┘
│ │ TDX quote
│ GCS ▼
┌──────┴──────┐ ┌──────────────────┐
│ GGUF Model │ │ Cloud KMS (WIP) │
│ (≤16 GB) │ │ (key release) │
└─────────────┘ └──────────────────┘
```
Key insight: Host never has keys. On AWS, it just forwards ciphertext. On GCP, the entire CVM is the trust boundary — no host/enclave split, no VSock. GPU deployments use NVIDIA H100 in CC-mode (attestation confirms nvidia_gpu.cc_mode: ON). The pipeline layer (confidential-ml-pipeline) orchestrates multi-stage inference with per-stage attestation.
- ✅ Model weights (IP protection)
- ✅ Prompts & outputs (PII / classified data)
- ✅ Execution integrity (verified code)
- Attestation-gated key release — KMS releases DEK only if enclave measurements match policy (PCRs on Nitro, MRTD/RTMRs on TDX)
- Attestation-bound encrypted sessions — X25519 + HKDF + ChaCha20-Poly1305, host sees only ciphertext
- Ed25519 signed receipts — cryptographic proof of execution
- Cross-platform transport — `confidential-ml-transport` handles attestation-bound channels on both VSock (Nitro) and TCP (TDX)
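The pluggable-backend idea behind `confidential-ml-transport` can be sketched with a byte-stream trait: channel logic is written once, and Nitro (VSock) vs TDX (TCP) differ only in which backend implements it. Names below are illustrative, not the crate's real API; a length-prefixed frame stands in for the attestation-bound AEAD record (the real channel encrypts with ChaCha20-Poly1305 under keys derived via X25519 + HKDF).

```rust
use std::io::{self, Read, Write};

// Any byte stream works as a backend: TcpStream on TDX, a VSock
// stream on Nitro, or an in-memory cursor for tests.
trait ByteStream: Read + Write {}
impl<T: Read + Write> ByteStream for T {}

// Length-prefixed frame — a stand-in for an encrypted AEAD record.
fn send_frame(s: &mut dyn ByteStream, payload: &[u8]) -> io::Result<()> {
    s.write_all(&(payload.len() as u32).to_be_bytes())?;
    s.write_all(payload)
}

fn recv_frame(s: &mut dyn ByteStream) -> io::Result<Vec<u8>> {
    let mut len = [0u8; 4];
    s.read_exact(&mut len)?;
    let mut buf = vec![0u8; u32::from_be_bytes(len) as usize];
    s.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() {
    // In-memory backend exercises the same code path as TCP/VSock.
    let mut chan = io::Cursor::new(Vec::new());
    send_frame(&mut chan, b"ciphertext").unwrap();
    chan.set_position(0);
    assert_eq!(recv_frame(&mut chan).unwrap(), b"ciphertext");
}
```

Keeping the framing generic over `Read + Write` is what lets the same SecureChannel state machine run unchanged on both platforms.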
- ✓ Compromised host OS → Protected (enclave isolation)
- ✓ Malicious cloud admin → Protected (can't decrypt)
- ✓ Supply chain attack → Detected (PCR verification)
- ✓ Model swap attack → Prevented (signed manifests)
- AWS Nitro Enclave integration with real NSM attestation and PCR-bound KMS key release
- GCP Confidential Space integration with Intel TDX attestation, MRTD/RTMR measurement pinning, and Cloud KMS key release
- Pipeline orchestration via `confidential-ml-pipeline` — multi-stage inference with per-stage attestation, health checks, and graceful shutdown
- Cross-platform transport via `confidential-ml-transport` — attestation-bound SecureChannel with pluggable TCP/VSock backends
- S3 model storage (AWS) and GCS model storage (GCP) with client-side encryption
- Candle-based transformer inference (MiniLM, BERT, Llama)
- GGUF support for quantized models (int4, int8) — used for GPU inference (Llama 3 8B Q4_K_M)
- CUDA 12.2 GPU inference via candle-cuda on NVIDIA H100 CC-mode (a3-highgpu-1g)
- BF16/safetensors format enforcement (CPU path)
- Memory-optimized for TEE constraints
- Attested Inference Receipts (AIR) — Ed25519-signed, CBOR-canonical, binding input/output hashes to enclave attestation
- Policy update system with signature verification and hot-reload
- Model format validation (safetensors, dtype enforcement)
- 500+ tests across the workspace and CI (including pipeline integration and GCP tests)
- Deterministic builds for reproducibility
Measured on AWS EC2 m6i.xlarge (4 vCPU, 16GB RAM) with MiniLM-L6-v2 (22.7M params), 3 independent runs of 100 iterations each. Commit `b00bab1`. Paper (§7) uses canonical release-gate data from commit `057a85a`. Raw JSON available in GitHub Releases.
| Metric | Bare Metal | Nitro Enclave | Overhead |
|---|---|---|---|
| Mean latency | 78.55ms | 88.45ms | +12.6% |
| P95 latency | 79.09ms | 89.58ms | +13.3% |
| Throughput | 12.73 inf/s | 11.31 inf/s | -11.2% |
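The overhead column is plain relative-difference arithmetic against the bare-metal baseline; a quick cross-check (numbers taken from the table above):

```rust
// overhead% = (enclave - bare) / bare * 100
fn overhead_pct(bare: f64, enclave: f64) -> f64 {
    (enclave - bare) / bare * 100.0
}

fn main() {
    // Mean latency: 78.55 ms bare metal vs 88.45 ms in-enclave → +12.6%
    assert!((overhead_pct(78.55, 88.45) - 12.6).abs() < 0.1);
    // Throughput: 12.73 → 11.31 inf/s → -11.2%
    assert!((overhead_pct(12.73, 11.31) - (-11.2)).abs() < 0.1);
    println!("overhead columns check out");
}
```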
| Stage | Time |
|---|---|
| NSM Attestation | 88ms |
| KMS Key Release | 76ms |
| Model Fetch (S3→VSock) | 6,716ms |
| Model Decrypt + Load | 139ms |
| Total | 7,052ms |
| Operation | Latency | Frequency |
|---|---|---|
| COSE attestation verification | 3.012ms | Once per session |
| HPKE session setup | 0.10ms | Once per session |
| HPKE encrypt + decrypt (1KB) | 0.006ms | Per inference |
| Receipt sign (CBOR + Ed25519) | 0.022ms | Per inference |
| Total per-inference crypto | 0.028ms | Per inference |
| Component | Latency |
|---|---|
| Per-request crypto (encrypt+decrypt+receipt) | 0.164ms |
| Session setup (keygen+HPKE) | 0.138ms |
| TCP handshake (ClientHello→ServerHello→HPKE) | 0.153ms |
| Threads | Throughput | Mean Latency | Scaling Efficiency |
|---|---|---|---|
| 1 | 12.75 inf/s | 78ms | 100% |
| 2 | 14.73 inf/s | 136ms | 57.8% |
| 4 | 14.66 inf/s | 270ms | 28.8% |
| 8 | 14.57 inf/s | 546ms | 14.3% |
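Scaling efficiency in the table is measured throughput relative to ideal linear scaling from the single-thread baseline. A sketch of the computation (values from the table; small rounding differences are expected):

```rust
// efficiency% = throughput(n) / (n * throughput(1)) * 100
fn scaling_efficiency_pct(base_tput: f64, n: u32, tput: f64) -> f64 {
    tput / (n as f64 * base_tput) * 100.0
}

fn main() {
    // 2 threads: 14.73 inf/s vs ideal 2 * 12.75 → ~57.8%
    assert!((scaling_efficiency_pct(12.75, 2, 14.73) - 57.8).abs() < 0.1);
    // 8 threads: 14.57 inf/s vs ideal 8 * 12.75 → ~14.3%
    assert!((scaling_efficiency_pct(12.75, 8, 14.57) - 14.3).abs() < 0.1);
    println!("throughput plateaus; efficiency drops as expected");
}
```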
| Metric | Bare Metal | Enclave |
|---|---|---|
| Cost per 1M inferences | $4.19 | $4.72 |
| Enclave cost multiplier | — | 1.13x |
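The cost multiplier row is simply the ratio of the two per-million-inference costs:

```rust
// multiplier = enclave cost / bare-metal cost
fn cost_multiplier(bare: f64, enclave: f64) -> f64 {
    enclave / bare
}

fn main() {
    // $4.72 / $4.19 ≈ 1.13x
    assert!((cost_multiplier(4.19, 4.72) - 1.13).abs() < 0.005);
    println!("enclave premium ≈ 13%");
}
```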
- ~12.6% inference overhead — on par with AMD SEV-SNP BERT numbers (~16%), competitive with SGX/TDX
- Latest 3-model campaign (2026-02-05) — weighted mean overhead +12.9% (MiniLM-L6 +14.0%, MiniLM-L12 +12.9%, BERT-base +11.9%)
- Embedding quality preserved — near-identical embeddings (cosine similarity ≈ 1.0; tiny FP-level differences expected across CPU allocations)
- Per-inference crypto cost negligible — 0.028ms vs 88ms inference (0.03%)
- E2E crypto overhead — 0.164ms per request (0.19% of inference time)
- Throughput plateaus at ~14.7 inf/s — CPU-bound on 2 vCPUs; latency scales linearly with concurrency
- $4.72 per 1M inferences in enclave (1.13x bare metal cost)
- First published per-inference latency benchmark on AWS Nitro Enclaves
Measured on GCP a3-highgpu-1g (1x NVIDIA H100, TDX CC-mode ON) with Llama 3 8B Q4_K_M GGUF (4.6GB fetched from GCS at runtime).
| Metric | Value |
|---|---|
| Model | Llama 3 8B Q4_K_M (GGUF, 4.6GB) |
| Machine | a3-highgpu-1g (1x H100, TDX) |
| Boot to ready | ~3.5 min |
| 50 tokens generated | 12s (241ms/token) |
| Attestation | TDX quote, nvidia_gpu.cc_mode: ON |
| Receipt | Ed25519-signed, CBOR-canonical |
Critical: GCP Confidential Space GPU uses `cos-gpu-installer` v2.5.3, which installs driver 535.247.01. This driver supports CUDA <= 12.2 only. Using CUDA 12.6+ fails with `CUDA_ERROR_UNSUPPORTED_PTX_VERSION`. The `Dockerfile.gpu` must use `nvidia/cuda:12.2.2-devel-ubuntu22.04` as the base image.
See docs/benchmarks.md for methodology, competitive analysis, and literature comparison.
Verified on real Nitro hardware (m6i.xlarge, Feb 2026) using a KMS key with a `kms:RecipientAttestation:ImageSha384` condition and key-policy-only evaluation (no root account statement, no IAM bypass path).
Debug vs non-debug mode: Enclaves launched with `--debug-mode` have all PCR values zeroed in their attestation documents. PCR-conditioned KMS policies cannot match in debug mode — the condition compares the policy's PCR0 hash against all-zeros, which never matches. Production (non-debug) enclaves carry real PCR values derived from the EIF contents.
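The debug-mode failure is just an equality check that can never succeed. A sketch of the policy-side logic (KMS evaluates this server-side; this is an illustration, not AWS's code — PCR0 is a SHA-384 measurement, hence 48 bytes):

```rust
// The key policy pins an expected PCR0; release requires exact match.
fn key_release_allowed(policy_pcr0: &[u8; 48], attested_pcr0: &[u8; 48]) -> bool {
    attested_pcr0 == policy_pcr0
}

fn main() {
    let expected = [0xAB; 48];  // hypothetical PCR0 pinned in the key policy
    let debug_pcr0 = [0u8; 48]; // debug-mode attestation: all PCRs zeroed

    // Production enclave with the measured EIF: key released.
    assert!(key_release_allowed(&expected, &expected));
    // Debug-mode enclave: zeros never match a real measurement.
    assert!(!key_release_allowed(&expected, &debug_pcr0));
    println!("debug-mode enclaves can never satisfy a PCR-pinned policy");
}
```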
PCR0 enforcement evidence (non-debug mode):
| Scenario | Result |
|---|---|
| Correct PCR0, valid attestation | Success (key released) |
| Wrong PCR0, valid attestation | AccessDeniedException |
| No attestation (recipient absent) | AccessDeniedException |
| Malformed attestation (random bytes) | ValidationException |
| Bit-flipped attestation (1 byte changed) | ValidationException |
CloudTrail confirms non-zero `attestationDocumentEnclaveImageDigest` for successful calls and no recipient data for denied calls.
Replay semantics: KMS accepts replayed attestation documents — resubmitting a previously successful attestation doc produces another successful key release. KMS validates the COSE_Sign1 signature and PCR values but does not enforce freshness (no nonce binding or timestamp check on the attestation document itself).
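Since KMS itself does not enforce freshness, any replay protection must come from the relying party. A minimal sketch of a verifier that issues a per-session nonce, expects it echoed in the attestation document, and rejects reuse — illustrative only, not the EphemeralML verifier's actual API:

```rust
use std::collections::HashSet;

// Tracks nonces already presented, rejecting replays and stale docs.
struct FreshnessChecker {
    seen: HashSet<Vec<u8>>,
}

impl FreshnessChecker {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Accept a document only if it echoes the nonce we issued for this
    /// session AND that nonce has never been presented before.
    fn accept(&mut self, expected_nonce: &[u8], doc_nonce: &[u8]) -> bool {
        doc_nonce == expected_nonce && self.seen.insert(doc_nonce.to_vec())
    }
}

fn main() {
    let mut v = FreshnessChecker::new();
    assert!(v.accept(b"nonce-1", b"nonce-1"));  // first use: accepted
    assert!(!v.accept(b"nonce-1", b"nonce-1")); // replayed doc: rejected
    assert!(!v.accept(b"nonce-2", b"nonce-1")); // stale doc: rejected
}
```

Nitro attestation documents do carry a caller-supplied nonce field, so this check composes naturally with signature and PCR verification.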
Use the single-command gate on your Nitro EC2 instance:

```bash
./scripts/final_release_gate.sh --runs 3 --model-id minilm-l6
```

This chains:

- `scripts/run_final_kms_validation.sh` with `--require-kms`
- `scripts/check_kms_integrity.sh` against produced `run_*` directories
- Final manifest + summary output

For ad-hoc auditing of existing result directories:

```bash
./scripts/check_kms_integrity.sh benchmark_results_final/kms_validation_*/run_*
```

To publish benchmark evidence without requiring reader AWS access:
```bash
# 1) Package + scan for sensitive markers
./scripts/prepare_public_artifact.sh \
  --input-dir benchmark_results_final/kms_validation_20260205_234917 \
  --name kms_validation_20260205_234917.tar.gz

# 2) Upload to a GitHub Release tag
./scripts/publish_public_artifact.sh \
  --tag v1.0.0 \
  --artifact artifacts/public/kms_validation_20260205_234917.tar.gz
```

See docs/ARTIFACT_PUBLICATION.md for full details.
Run a working end-to-end demo locally — loads MiniLM-L6-v2, sends text, gets 384-dim embeddings + a signed Attested Execution Receipt:

```bash
bash scripts/demo.sh
```

Or manually:

```bash
# Terminal 1: Start enclave with model
cargo run --release --features mock --bin ephemeral-ml-enclave -- \
  --model-dir test_assets/minilm --model-id stage-0

# Terminal 2: Run host inference
cargo run --release --features mock --bin ephemeral-ml-host
```

Prerequisites: AWS account with Nitro Enclave support, Rust 1.75+, Terraform.
```bash
# 1. Provision infrastructure
cd infra/hello-enclave
terraform init && terraform apply

# 2. Build enclave image
docker build -f enclave/Dockerfile.enclave -t ephemeral-ml-enclave .
nitro-cli build-enclave --docker-uri ephemeral-ml-enclave:latest --output-file enclave.eif

# 3. Run
nitro-cli run-enclave --eif-path enclave.eif --cpu-count 2 --memory 4096
```

Prerequisites: GCP project with Confidential Computing API enabled, c3-standard-4 (TDX), Rust 1.75+.
```bash
# Build for GCP (no mock, no default features)
cargo build --release --no-default-features --features gcp -p ephemeral-ml-enclave

# Run on CVM (--gcp flag required to enter GCP code path)
./target/release/ephemeral-ml-enclave \
  --gcp --model-dir /app/model --model-id stage-0
```

Prerequisites: GCP project with a3-highgpu-1g quota, NVIDIA H100 CC-mode. Requires CUDA 12.2 (not 12.6+).
```bash
# Build GPU container (CUDA 12.2 base — required for CS driver 535.x)
docker build -f Dockerfile.gpu -t ephemeral-ml-gpu .

# Deploy to Confidential Space with GPU
bash scripts/gcp/deploy.sh --gpu \
  --model-source gcs \
  --model-format gguf
```

Expected boot timeline: ~3.5 min (image pull + cos-gpu-installer + model fetch from GCS). Llama 3 8B Q4_K_M generates 50 tokens in 12s.
See QUICKSTART.md and docs/build-matrix.md for detailed instructions.
| Component | Status | Tests |
|---|---|---|
| Pipeline Orchestrator | ✅ Production | 10 |
| Stage Executor | ✅ Production | 1 |
| NSM Attestation (AWS) | ✅ Production | 11 |
| TDX Attestation (GCP) | ✅ Production | — |
| KMS Integration (AWS) | ✅ Production | — |
| GCP KMS (WIP) | ⚠ Code exists, not wired into runtime | — |
| Inference Engine (Candle) | ✅ Production | 4 |
| Receipt Signing (Ed25519) | ✅ Production | 6 |
| Common / Types | ✅ Production | 42 |
| Host / Client | ✅ Production | 4 |
| Degradation Policies | ✅ Production | 3 |
| GCS Model Loader | ✅ Implemented | — |
| GPU Inference (H100 CC, CUDA 12.2) | ✅ Verified on hardware | — |
| TDX Verifier Bridge (Client) | ✅ Implemented | — |
v3.1 GPU Confidential — GPU inference on GCP Confidential Space (a3-highgpu-1g, NVIDIA H100 CC-mode) with Llama 3 8B Q4_K_M GGUF, CUDA 12.2, TDX attestation, and Ed25519-signed receipts. GCS loader supports up to 16GB models with Content-Length pre-check. CI green.
- `docs/design.md` — Architecture & threat model
- `docs/build-matrix.md` — Deployment modes, feature flags & build commands (AWS, GCP, mock)
- `docs/benchmarks.md` — Benchmark methodology, results & competitive analysis
- `docs/BENCHMARK_SPEC.md` — Benchmark specification (11-paper literature review)
- `QUICKSTART.md` — Deployment guide
- `SECURITY_DEMO.md` — Security walkthrough
- `scripts/run_final_kms_validation.sh` — Multi-run KMS-enforced benchmark validation
- `scripts/check_kms_integrity.sh` — Post-run KMS/commit/hardware integrity audit
- `scripts/final_release_gate.sh` — Single-command release gate for benchmark artifacts
Apache 2.0 — see LICENSE
Run inference like the host is already hacked.