diff --git a/docs/design/scalability-benchmarking.md b/docs/design/scalability-benchmarking.md new file mode 100644 index 000000000..c344b6009 --- /dev/null +++ b/docs/design/scalability-benchmarking.md @@ -0,0 +1,270 @@ +# Scalability Benchmarking + +**Status:** Draft — for team review +**Issue:** [#3671](https://github.com/near/mpc/issues/3671) +**Related:** [#1833](https://github.com/near/mpc/issues/1833) (test-framework vision), [#1825](https://github.com/near/mpc/issues/1825), [#399](https://github.com/near/mpc/issues/399) + +## Context + +Two of this year's objectives require benchmarking capabilities we don't yet have: + +1. **Avoid performance regressions** — catch a throughput/latency/asset-generation + regression introduced by a PR before it reaches production. +2. **Scale to 100 nodes** — gain confidence that the network sustains acceptable + signing performance as the participant set grows toward 100, including during + and after resharing and with degraded participants. + +Today we can spin up clusters and send load (`mpc-devnet`), and nodes export rich +Prometheus metrics, but nothing ties these together: load-test results are printed +to stdout and lost, there is no regression baseline, and no large-N scaling numbers +are recorded anywhere. This document proposes a design to close that gap and seeks +team agreement before implementation. + +## Goals + +- A repeatable way to measure end-to-end signing **throughput**, **latency**, and + **success rate** for a cluster of a given size under a given load shape. +- A **regression signal** on `main` that flags performance drops between commits. +- **Scaling curves** (performance vs. participant count). 100 nodes is the + target, not the starting point: early sweeps stop around 20 and the upper end + grows as bottlenecks are found and fixed. +- Coverage of the **system conditions** called out in #3671: steady state, + during/after resharing, several network sizes, some nodes down, and (stretch) + misbehaving nodes. +- Results stored somewhere they can be **analyzed over time** and **diffed across runs**. + +## Non-goals + +- A new general test framework. Scalability benchmarking slots into the unified + test-framework vision of #1833; it does not replace it. +- Component micro-benchmarks. Cryptographic primitive timing already lives in + `crates/threshold-signatures/benches/` (Criterion) and contract-gas regression in + `crates/contract/tests/sandbox/participants_gas.rs`. Those stay as-is and are + complementary to the system-level benchmarking proposed here. +- Production monitoring/alerting. Operator-facing metrics are covered by + [`docs/design/node-operator-metrics.md`](./node-operator-metrics.md). + +## What we already have + +| Asset | Location | Relevance | +| --- | --- | --- | +| Cluster lifecycle CLI (`mpc-devnet`) | `crates/devnet/` | Provisions GCP clusters via Terraform (infra-ops), deploys contract + Nomad jobs, does resharing/participant changes. | +| In-cluster localnet chain | `mpc-devnet ... deploy-chain` | Runs a `neard` validator inside the cluster — no testnet sync wait, no shared-RPC rate limits. Purpose-built for this (capacity at large N still needs validation, see harness ceiling). | +| Load generator | `crates/devnet/src/loadtest.rs` | `--qps`, `--duration`, `--parallel-sign-calls-per-domain`, multi-domain. Reports submitted/signed/end-to-end success rates. **stdout only — not persisted, and individual requests are not timed (no latency percentiles).** | +| Load-shape scenarios | `crates/devnet/scripts/loadtest-scenarios.sh` | sustained / high / spike shapes already scripted. | +| Node metrics | `crates/node/src/metrics.rs`, `requests/metrics.rs` | Latency **histograms** (`near_mpc_signature_time_elapsed`, `mpc_signature_request_response_latency_seconds`/`_blocks`, presig/triple/CKD timing), asset gauges, queue sizes, `mpc_num_fail_on_timeout_indexed`. Exposed at `GET /metrics` (port 8080). | +| Metrics scraping helper | `crates/e2e-tests/src/mpc_node.rs::get_metric` | Parses Prometheus text from `/metrics`. Reusable by a harness. | +| E2E harness | `crates/e2e-tests/` | Real `neard` sandbox + real `mpc-node` processes, small N. Good for cheap regression checks, not 100-node scale. | +| CI patterns | `.github/workflows/` | `warp-ubuntu-*` runners; `nightly_build.yml` shows the cron + `act10ns/slack` + `secrets.SLACK_WEBHOOK` failure-notify pattern; `changes.yml` path-filter gating. | + +## Key decision: extend `mpc-devnet`, do not build new + +The issue asks whether to "brush up Devnet or build something new." Recommendation: +**extend `mpc-devnet`.** It already owns cluster provisioning, the in-cluster +localnet chain (which removes the testnet-sync bottleneck that would otherwise +dominate benchmark turnaround), participant changes/resharing, and a working load +generator. Building a parallel system would duplicate all of that. This also keeps +us aligned with #1833, which wants fewer overlapping frameworks, not more. + +The work is therefore additive, not a rewrite: + +1. **Persist results.** Add a results-emitter to the load test: write a structured + JSON summary per run (see schema below) instead of only printing to stdout. +2. **Measure per-request latency in the load generator.** Record submission and + response timestamps per request and compute exact p50/p95/p99 client-side — + the node histograms are too coarse for this + (`mpc_signature_request_response_latency_seconds` has 1 s-wide buckets below + 10 s). Still scrape `/metrics` at run end for the internal breakdown + (presignature wait, signing time, store levels). +3. **A scenario manifest.** Promote `loadtest-scenarios.sh` to a declarative + manifest (cluster size, load shape, conditions, duration) so a run is fully + described by data and is reproducible. + +## Benchmark dimensions + +A benchmark run is the cross product of **cluster shape × load shape × domain mix +× condition**. + +- **Cluster size (participants):** e.g. 4, 8, 16, 20, 32, 64, 100; threshold + scales with it. Early phases sweep only to ~20. +- **Load shape:** sustained baseline, high steady-state, spike (warm→burst→cool) — + already in `loadtest-scenarios.sh`; parameterized by QPS, duration, parallel-sign batch. +- **Domain mix:** ECDSA (Secp256k1) is the asset-bound path (each signature + consumes a presignature, each presignature two triples); EdDSA (Ed25519) is + comparatively cheap. ECDSA-only is the minimum; mixed domains are closer to + production. Sign load only at first — CKD and foreign-chain verification are a + later extension. +- **System condition:** + - **Steady state** — baseline. + - **During resharing** — start load, trigger `vote-new-parameters`, measure the + dip and time-to-recovery while keys are redistributed. + - **After resharing** — steady state on the new participant set. + - **Some nodes down** — kill `f` nodes (up to threshold tolerance), measure + degraded throughput and whether the network keeps signing. + - **Misbehaving nodes** *(stretch)* — nodes that respond slowly or with garbage. + - **Injected WAN latency** *(stretch)* — `tc netem` delay between nodes; + single-zone links are unrealistically fast. Until then, testnet runs are the + cross-check. + +We do not need the full cross product on every run. Define a small **smoke matrix** +for nightly regression (small N, steady + spike) and a larger **scaling matrix** +run on-demand or weekly (the size sweep, plus resharing/nodes-down). + +**Pitfall — buffered assets flatter throughput.** A short ECDSA run can be served +entirely from stockpiled presignatures, measuring buffer depth rather than +sustainable rate. Distinguish **burst capacity** (buffer-absorbed spikes) from +**sustained capacity** (the generation-bound rate at store equilibrium): sustained +runs must last long enough for store levels to stabilize, and every run records +store levels at start/end so a draining buffer is visible. + +## Metrics / KPIs + +Recorded per run, sourced from node `/metrics` plus the load generator: + +- **Throughput:** sustained signatures/sec achieved vs. offered QPS. +- **Latency:** request→response p50/p95/p99, measured client-side by the load + generator; node histograms serve as cross-check and per-node breakdown. +- **Success rate:** end-to-end (already computed by `loadtest.rs`), plus + `mpc_num_fail_on_timeout_indexed` deltas. +- **Asset generation:** triple/presignature generation time, available-asset + gauges, and store levels at run start/end — the usual scaling bottleneck as N + grows. +- **Resharing recovery:** wall-clock from resharing start to throughput recovery. + Resharing is a strict superset of keygen, so this also covers keygen at scale. +- **Node resources:** per-node CPU, memory, network bandwidth. The P2P mesh is + O(N²) connections, making bandwidth a likely scaling ceiling. + +The scaling objective needs agreed **pass/fail targets** on these KPIs (e.g. +"p95 ≤ X s at Y QPS with 100 participants") — without them, "scales to 100" is +not falsifiable. See open questions. + +## Environments (tiered) + +- **Primary — in-cluster localnet.** Reproducible, no sync wait, no shared-RPC + limits. Used for regression gating and scaling curves where we want clean, + comparable numbers. +- **Secondary — testnet.** Periodic runs against a real cluster on testnet for + realism (real network latency, real RPC limits). Higher cost and noisier, so used + for validation rather than per-commit gating. + +The same harness and result schema serve both; only the target (`--mpc-network` +localnet cluster vs. `--mpc-contract` on testnet) differs. + +**Harness ceiling.** A single validator absorbs every sign and respond +transaction, and one load generator sustains the offered QPS. Measure the +harness's own ceiling (e.g. against a trivial contract) before trusting large-N +numbers, so harness saturation is not mistaken for MPC saturation. + +## Results storage (two complementary sinks) + +1. **Structured JSON/CSV summary per run — for regression gating.** One record per + run: commit SHA, timestamp, cluster size, condition, load shape, and the KPIs + above. Stored as CI artifacts and appended to a results branch/repo so runs are + diffable over time. Regression gating compares each fresh summary against the + checked-in baseline (see [Regression detection](#regression-detection)); the + accumulated history serves trend analysis. + + Proposed schema (sketch): + ```json + { + "commit": "b245cc29", "timestamp": "2026-06-26T00:00:00Z", + "env": "localnet", "instance_type": "n2d-standard-8", + "participants": 16, "threshold": 11, + "condition": "steady_state", "domains": ["Secp256k1"], + "load": { "qps": 10, "duration_s": 300, "batch": 10 }, + "repetition": 1, + "throughput_sps": 9.6, + "latency_s": { "p50": 1.8, "p95": 3.2, "p99": 4.1 }, + "success_rate": 0.998, + "timeouts": 1, + "presignatures": { "start": 8192, "end": 5120 } + } + ``` + +2. **Prometheus + Grafana — for live/time-series analysis.** Scrape cluster node + `/metrics` into a Prometheus instance (hosted in infra-ops) and visualize in + Grafana for during-run inspection and historical trends. Reuses the existing + metrics; no node changes required. Dashboards/scrape config live in infra-ops, + not this repo (consistent with `node-operator-metrics.md`). + +The JSON summary is the source of truth for pass/fail gating; Grafana is for humans +investigating *why* a run regressed. + +## Automation + +- **Manual, on any PR/commit.** An engineer runs the scenario manifest against a + devnet cluster (localnet or testnet) and gets a JSON summary — for investigating a + suspected regression or validating a perf-sensitive change. This is just the + extended CLI; no new infra. +- **Nightly on `main`.** A scheduled workflow (mirroring `nightly_build.yml`'s cron + + `act10ns/slack`/`SLACK_WEBHOOK` failure notification) runs the **smoke matrix** on a + localnet cluster, emits the JSON summary, and compares against the stored baseline. + Regression beyond a threshold → Slack alert + failed run. + - **Cluster lifecycle:** stop VMs between runs (a warm 24/7 cluster costs ~10×, + see costs); reset node state on start (`deploy-nomad --shutdown-and-reset`). + - **Triage:** a red nightly needs an owner who decides regression vs. noise, + then files the fix or updates the baseline. + - **Evidence capture:** on failure, bundle node logs and `/debug/tasks` output + as CI artifacts before teardown. + - **CI access:** devnet assumes a local infra-ops checkout and personal + `gcloud` auth; the workflow needs a service account and cluster RPC access — + part of this phase's scope. +- **Per-PR canary** *(optional)*. A small-N check on the e2e-tests harness catches + gross regressions pre-merge; the devnet benchmark stays authoritative. + +## Regression detection + +A run regresses if a KPI deviates from the baseline beyond a tolerance band +(to absorb cloud noise). Start simple — fixed percentage thresholds per KPI +(e.g. throughput −15%, p99 +25%, success rate < 99%) — and tighten once we observe +real run-to-run variance. + +**Baseline mechanism:** mirror the contract gas-threshold pattern +(`crates/contract/tests/sandbox/gas_thresholds.json`): a checked-in +`benchmark-baselines.json` keyed by matrix cell, holding expected KPI values and +per-KPI tolerances. The nightly job fails on breach; baseline updates are +deliberate PRs, so intentional performance changes are visible in review. + +**Variance control:** + +- Gated KPIs use the median of 3 repetitions of each smoke-matrix cell. +- Instance type is fixed per matrix cell and recorded in the result schema. +- Reused clusters are reset to a known state (`deploy-nomad --shutdown-and-reset`) + before a gated run, so RocksDB size and asset-store levels don't drift across + nights. + +**Attribution:** a nightly regression may span a day of merges; since a run is +fully described by manifest + commit, bisect by re-running the failing cell at +suspect commits. + +## Cost considerations + +GCP list prices (July 2026): `n2d-standard-8` — the type devnet already uses — +runs ≈ $0.34/h on-demand, ≈ $0.08/h spot; disks are negligible (~$0.014/h per +node for 100 GB pd-balanced). + +- **A 100-node run is cheap:** ~$35/h for the whole cluster, so a ~3 h session + (keygen + 3 repetitions) is ~$110–120, and a full size sweep (~1 h per size) is + ~$100 — even weekly that is ~$450/month. Cost is not a blocker at the target size. +- **Nightly smoke:** ~8 nodes × 2 h/night ≈ $170–200/month with VMs stopped + between runs; the same cluster warm 24/7 would be ~$1,400–2,000/month. +- **Spot** (~76% cheaper) suits exploratory runs only; preemption mid-run + invalidates a gated benchmark. +- **Quota:** 100 × n2d-standard-8 is 800 N2D vCPUs in one region — well above + default Compute Engine quotas. Request the increase ahead of the first large + sweep. +- Localnet-in-cluster avoids per-node testnet sync (~1h+ each), the dominant + turnaround cost for large N — strongly favored for the size sweep. + +## Open questions + +- Exact regression thresholds per KPI — needs a few baseline runs to set sensibly. +- Where the full run history accumulates for trend analysis (results branch vs. + dedicated repo vs. artifact retention) — gating itself needs only the checked-in + baseline file. +- Target SLOs: the pass/fail numbers for the 100-node objective (latency, + throughput, success rate). +- TEE representativeness: production nodes run in TDX; do gated numbers need + TDX-capable instances, or is a one-time TEE vs. non-TEE calibration run enough? +- Division of ownership for the Prometheus/Grafana piece between this repo and infra-ops. +- How much of this should converge with the #1833 framework vs. ship independently first.