
feat(aep-43): expand workload utilization metrics spec #100

Draft
baktun14 wants to merge 1 commit into main from feat/aep-43-utilization-metrics-spec

Conversation

@baktun14
Contributor

Summary

Rewrites AEP-43 — Workload Utilization Metrics from a one-paragraph stub into a full Interface specification.

Defines:

  • Metrics catalog: the minimum required CPU, memory, storage, network, and GPU metrics that every conformant provider exposes per lease, with normalized akash_lease_* names and mandatory labels (owner, dseq, gseq, oseq, provider, service, replica, plus GPU-specific labels).
  • Provider architecture: a new operator/metrics package; off-the-shelf exporters (cAdvisor, kube-state-metrics, dcgm-exporter, node_exporter); VictoriaMetrics single-node as the reference TSDB (Prometheus permitted) with a 30-day retention floor; per-tenant cardinality/QPS limits; and a safe MetricsClient interface that injects (owner, dseq, gseq, oseq) label filters at the gateway proxy.
  • API surface: a GET /lease/{dseq}/{gseq}/{oseq}/metrics REST endpoint and a matching gRPC LeaseMetricsService, with JWT auth per AEP-64 (adds a new metrics scope), a resolution table that scales from 15s up to 15m with the query range, and features.metrics advertised in ProviderStatus for discoverability.
  • Client aggregation pattern: the Console Metrics tab fans out across the providers hosting a deployment, merges series, and overlays deployment-update markers; the CLI subcommand provider-services lease-metrics covers headless use.
  • Security/operational considerations: three-layer tenant isolation, query-cost DoS controls, side-channel guarantees, JWT replay mitigations, and an explicit non-goal for v1 around on-chain metric attestation.
  • Phased migration: additive, with no SDL or on-chain change required. The feature stays behind a feature flag until a 4-week soak with opted-in providers completes.
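
To illustrate the label-filter injection described under Provider architecture, here is a minimal sketch in Python. It assumes the gateway builds each series selector itself from a metric name and a lease ID rather than rewriting arbitrary tenant-supplied queries. `LeaseID`, `build_selector`, and the metric name used in the usage note are hypothetical and do not come from the spec; only the `akash_lease_*` prefix and the four enforced labels do.

```python
import re
from dataclasses import dataclass
from typing import Dict, Optional

# Labels the gateway always sets itself; callers may not supply them.
ENFORCED = ("owner", "dseq", "gseq", "oseq")

# Only metrics in the akash_lease_* namespace may be queried (assumed policy).
_METRIC = re.compile(r"^akash_lease_[a-z0-9_]+$")
_LABEL = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")


@dataclass(frozen=True)
class LeaseID:
    """Hypothetical lease identifier taken from the authenticated request."""
    owner: str
    dseq: int
    gseq: int
    oseq: int


def _quote(value: str) -> str:
    # Escape backslashes, quotes, and newlines so a value cannot
    # terminate the matcher early and smuggle in extra matchers.
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return '"' + escaped + '"'


def build_selector(
    metric: str, lease: LeaseID, extra: Optional[Dict[str, str]] = None
) -> str:
    """Return a series selector that is always scoped to one lease."""
    if not _METRIC.match(metric):
        raise ValueError(f"metric {metric!r} is outside akash_lease_*")

    labels: Dict[str, str] = {}
    for name, value in (extra or {}).items():
        if name in ENFORCED:
            raise ValueError(f"label {name!r} is set by the gateway")
        if not _LABEL.match(name):
            raise ValueError(f"invalid label name {name!r}")
        labels[name] = value

    # Enforced labels are written last and come from the lease, never the caller.
    labels.update(
        owner=lease.owner,
        dseq=str(lease.dseq),
        gseq=str(lease.gseq),
        oseq=str(lease.oseq),
    )
    body = ",".join(f"{k}={_quote(v)}" for k, v in sorted(labels.items()))
    return f"{metric}{{{body}}}"
```

For example, `build_selector("akash_lease_cpu_usage_seconds_total", LeaseID("akash1owner", 42, 1, 1), {"service": "web"})` yields a selector pinned to that lease, while passing `{"owner": "someone-else"}` raises `ValueError`. Building the selector server-side, instead of patching a free-form query string, keeps the tenant filter out of the caller's control.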

Preserves original authors (Anil Murty, Artur Troian) and adds @baktun14 as primary author. Updates index.json via scripts/index.js.

Why

Tenants on Akash currently have no first-class way to observe realized utilization of their leases (CPU, RAM, storage, network, GPU/VRAM), which blocks right-sizing, peak-load awareness, diagnostic baselines, and cross-provider comparison. This AEP fills that gap and is a prerequisite for downstream work on autoscaling, provider SLAs/quality scoring, and the Provider Console capacity dashboards mentioned in AEP-32.

Test plan

  • CI passes (markdown/YAML/index validation).
  • node scripts/index.js produces a clean diff with the updated AEP-43 entry (description, updated author list, new discussions-to, updated date).
  • Render preview of spec/aep-43/README.md on GitHub for table/diagram readability.
  • Cross-link references to AEP-32, AEP-64, AEP-65 resolve correctly.

Rewrites the original AEP-43 draft from a one-paragraph summary into a
full Interface specification. Defines the per-lease metrics catalog
(CPU, memory, storage, network, GPU), provider-side architecture
(metrics-operator + VictoriaMetrics/Prometheus + gateway proxy),
versioned authenticated Provider REST/gRPC API, JWT scope addition,
client-side aggregation pattern, security/operational considerations,
SDL compatibility, backward compatibility, and a phased migration plan.
@baktun14 requested a review from a team as a code owner on May 11, 2026 19:36
@baktun14 marked this pull request as draft on May 11, 2026 19:49