feat(aep-43): expand workload utilization metrics spec #100
Draft
baktun14 wants to merge 1 commit into
Rewrites the original AEP-43 draft from a one-paragraph summary into a full Interface specification. Defines the per-lease metrics catalog (CPU, memory, storage, network, GPU), provider-side architecture (metrics-operator + VictoriaMetrics/Prometheus + gateway proxy), versioned authenticated Provider REST/gRPC API, JWT scope addition, client-side aggregation pattern, security/operational considerations, SDL compatibility, backward compatibility, and a phased migration plan.
Summary
Rewrites AEP-43 — Workload Utilization Metrics from a one-paragraph stub into a full Interface specification.
Defines:
- `akash_lease_*` names and mandatory labels (`owner`, `dseq`, `gseq`, `oseq`, `provider`, `service`, `replica`, plus GPU-specific labels).
- `operator/metrics` package, off-the-shelf exporters (cAdvisor, kube-state-metrics, dcgm-exporter, node_exporter), VictoriaMetrics single-node as the reference TSDB (Prometheus permitted), with a 30-day retention floor, per-tenant cardinality/QPS limits, and a safe `MetricsClient` interface that injects `(owner, dseq, gseq, oseq)` label filters at the gateway proxy.
- `GET /lease/{dseq}/{gseq}/{oseq}/metrics` REST endpoint and matching gRPC `LeaseMetricsService`, with JWT auth per AEP-64 (adds a new `metrics` scope), a resolution table from 15s up to 15m by range, and `features.metrics` advertised in `ProviderStatus` for discoverability.
- `provider-services lease-metrics` for headless use.

Preserves original authors (Anil Murty, Artur Troian) and adds @baktun14 as primary author. Updates `index.json` via `scripts/index.js`.
Why
Tenants on Akash currently have no first-class way to observe realized utilization of their leases (CPU, RAM, storage, network, GPU/VRAM), which blocks right-sizing, peak-load awareness, diagnostic baselines, and cross-provider comparison. This AEP fills that gap and is a prerequisite for downstream work on autoscaling, provider SLAs/quality scoring, and the Provider Console capacity dashboards mentioned in AEP-32.
Test plan
- `node scripts/index.js` produces a clean diff with the updated AEP-43 entry (description, updated author list, new `discussions-to`, updated date).
- `spec/aep-43/README.md` on GitHub for table/diagram readability.
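For reviewers who want a feel for the REST endpoint listed in the Summary, here is a rough sketch of how a client might assemble the request. It assumes the JWT is sent as a standard `Bearer` token and that the path is used exactly as written in this PR; the provider URL, the token value, and the helper name `build_metrics_request` are hypothetical, and no query parameters are shown because this description does not list them. The request is only constructed, not sent.

```python
import urllib.request


def build_metrics_request(provider_url: str, dseq: int, gseq: int, oseq: int,
                          jwt: str) -> urllib.request.Request:
    """Assemble (but do not send) a GET request for one lease's metrics."""
    base = provider_url.rstrip("/")
    # Path shape taken from the PR summary: /lease/{dseq}/{gseq}/{oseq}/metrics
    url = f"{base}/lease/{dseq}/{gseq}/{oseq}/metrics"
    req = urllib.request.Request(url, method="GET")
    # Assumption: the JWT (carrying the metrics scope) travels as a Bearer token.
    req.add_header("Authorization", f"Bearer {jwt}")
    req.add_header("Accept", "application/json")
    return req


req = build_metrics_request("https://provider.example:8443/", 123, 1, 1,
                            "example-token")
print(req.get_method(), req.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would perform the call against a provider that advertises `features.metrics`.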