Summary
The provider's chain-event publisher (pkg.akt.dev/go/util/events, repo akash-network/chain-sdk) only processes transaction events from each block and ignores block-level (FinalizeBlock/EndBlock) events. Leases closed by escrow depletion are closed in the escrow module's EndBlocker (no transaction), so their EventLeaseClosed is never published to the provider's bus. Provider cluster teardown is driven exclusively by that event, with no reconciliation fallback, so escrow-closed leases are never garbage-collected: the Kubernetes namespace and manifest CR persist forever, and deployment-monitor logs check result ok=false attempt=N indefinitely.
AEP-76 (uAKT→uACT, 2026-03-24) left a large population of legacy escrows depleted/unsettleable, producing a wave of escrow-EndBlocker lease closures and turning this latent bug into a fleet-wide leak. On one affected provider it stranded ~710 vCPU + 3 GPU at $0 and froze the bidengine (insufficient capacity on every order) - ~90% revenue loss until manually remediated. Reproduced on two independent clusters.
Confirmed end-to-end from source across akash-network/chain-sdk and akash-network/provider (versions below). Not a hypothesis.
Versions (all latest; nothing to upgrade to)
- Provider: provider-services v0.12.0 (ghcr.io/akash-network/provider:0.12.0) - newest release
- chain-sdk: pkg.akt.dev/go v0.2.7 → github.com/akash-network/chain-sdk (tag go/v0.2.7) - pinned by provider v0.12.0 go.mod
- CometBFT: github.com/akash-network/cometbft v0.38.21-akash.1 (0.38 → FinalizeBlockEvents)
- Chain: akashnet-2, post-AEP-76 (uact)
- Clusters: A = 13-node kubespray + Rook-Ceph; B = 6-node k3s
Root cause (confirmed from source)
chain-sdk go/util/events/publish.go (tag go/v0.2.7):
NewEvents subscribes to block headers only:
// query.go
func blkHeaderQuery() pubsub.Query {
    return tmquery.MustCompile(fmt.Sprintf("%s='%s'", tmtypes.EventTypeKey, tmtypes.EventNewBlockHeader))
}
On each new block header it calls processBlock, which only iterates transaction results:
// publish.go (~L133-152)
func (e *events) processBlock(height int64) {
    blkResults, err := e.client.BlockResults(e.ctx, &height)
    if err != nil { return }
    for _, tx := range blkResults.TxsResults { // <-- ONLY TxsResults
        if tx == nil { continue }
        for _, ev := range tx.Events {
            if mev, ok := processEvent(ev); ok { // processEvent handles *mtypes.EventLeaseClosed
                if err := e.bus.Publish(mev); err != nil { return }
            }
        }
    }
}
blkResults.FinalizeBlockEvents (CometBFT 0.38; equivalently Begin/EndBlock events) are never read. processEvent does handle *mtypes.EventLeaseClosed (publish.go ~L174) - it simply is never fed the EndBlock events.
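To make the gap concrete, here is a minimal, self-contained sketch that contrasts tx-only iteration with iteration that also reads block-level events. It is not the chain-sdk code. Event, TxResult and BlockResults are simplified stand-ins for the CometBFT types, and txEvents, allEvents and the event type strings are hypothetical names used only for illustration.

```go
package main

import "fmt"

// Event is a simplified stand-in for a CometBFT ABCI event (assumption:
// only the type string matters for this illustration).
type Event struct {
	Type string
}

// TxResult is a simplified stand-in for a per-transaction result.
type TxResult struct {
	Events []Event
}

// BlockResults carries only the two fields discussed in this report.
type BlockResults struct {
	TxsResults          []*TxResult
	FinalizeBlockEvents []Event
}

// txEvents collects events the way a tx-only loop does:
// block-level events are never visited.
func txEvents(res *BlockResults) []Event {
	var out []Event
	for _, tx := range res.TxsResults {
		if tx == nil {
			continue
		}
		out = append(out, tx.Events...)
	}
	return out
}

// allEvents collects transaction events first, then block-level events,
// so closes emitted outside any transaction are also seen.
func allEvents(res *BlockResults) []Event {
	out := txEvents(res)
	return append(out, res.FinalizeBlockEvents...)
}

func main() {
	res := &BlockResults{
		// One nil entry and one tx-initiated close.
		TxsResults: []*TxResult{nil, {Events: []Event{{Type: "lease-closed-by-tx"}}}},
		// One close emitted at block level (hypothetical label).
		FinalizeBlockEvents: []Event{{Type: "lease-closed-by-endblocker"}},
	}
	fmt.Println("tx-only:", len(txEvents(res)), "events")
	fmt.Println("tx + block-level:", len(allEvents(res)), "events")
}
```

In this toy block the tx-only pass sees one event and the combined pass sees two; the difference is exactly the block-level close.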
provider cluster/service.go (v0.12.0): the cluster service event loop (~L338-428) reacts to exactly event.ManifestReceived and *mtypes.EventLeaseClosed → s.teardownLease(ev.ID) (L379-381). There is no periodic on-chain reconciliation - teardown is 100% dependent on that bus event.
provider cluster/monitor.go (v0.12.0): deployment-monitor only ever escalates to broadcasting MsgCloseBid (L195-210); it never deletes the namespace/manifest, and it logs check result ok=false attempt=N (L120) and keeps retrying (L128-132) for as long as its deploymentManager exists. The manager is only destroyed via teardownLease.
Therefore: lease closed by escrow EndBlocker ⇒ EventLeaseClosed is in FinalizeBlockEvents ⇒ dropped by processBlock ⇒ never on the bus ⇒ service.go never calls teardownLease ⇒ namespace + manifest persist forever ⇒ deployment-monitor loops ok=false attempt=N forever.
Tenant MsgCloseDeployment / provider MsgCloseBid closes are transactions → their EventLeaseClosed is in TxsResults → caught → torn down normally. This is exactly why tx-initiated closes reap fine while escrow-depletion closes leak.
(The single standard assumption: escrow-depletion EventLeaseClosed is emitted from the escrow EndBlocker / FinalizeBlock, not a tx - normal Cosmos EndBlocker behaviour, and corroborated by 100% of the evidence below.)
Evidence
Live, Cluster B, 2026-05-18 ~16:46 UTC - monitor looping on closed leases, attempt incrementing in real time (e.g. …/26702105: 161→162):
INF check result attempt=149 cmp=deployment-monitor lease=akash1n4uut3v…/26309715/1/1/<provider> module=provider-cluster ok=false
INF check result attempt=162 cmp=deployment-monitor lease=akash1rfz266y…/26702105/1/1/<provider> module=provider-cluster ok=false
INF check result attempt=148 cmp=deployment-monitor lease=akash1vmzgel6…/25824764/1/1/<provider> module=provider-cluster ok=false
All confirmed closed on-chain via provider-services query deployment get, all via escrow depletion (escrow state=closed, funds=0); 25824764 has ibc/170C…+uact in transferred (straddled AEP-76). Namespaces persisted for days; removed only by an external script. No tx-initiated closes among the leaked set.
Quantified impact (Cluster A, 2026-05-18): 110 namespaces vs 60 active leases → 50 dangling, all verified closed/overdrawn (0 false positives). These held ~710 vCPU + 3 GPU at $0. inventory-service counted the zombies, so the bidengine logged reserving resources error="insufficient capacity" on every order and provider available capacity was ≈ 0. Capacity recovered to ~1,128 vCPU + 5 GPU only after externally deleting the namespaces and the manifest CRs (the provider reserves from the manifest CRs in the single lease ns, so deleting the namespace alone doesn't free the reservation) and restarting operator-inventory, then akash-provider. Only ~10% of capacity was earning while the leak persisted. Dangling-ns creation on A by month: 2026-02=1, 04=27, 05=22 (accelerating across AEP-76).
Secondary leak (likely same cause): stale open bids never closed - 1,853 (A) / 2,194 (B). EventOrderClosed/EventBidClosed emitted from EndBlock would be dropped the same way.
Suggested fix
- Primary (chain-sdk processBlock): also run blkResults.FinalizeBlockEvents through processEvent/e.bus.Publish (and BeginBlock/EndBlock results for pre-0.38). One focused change; processEvent already handles the typed events.
- Defense-in-depth (provider cluster/service): add a periodic reconciliation that lists the provider's active on-chain leases and tears down any managed deployment whose lease is no longer active - i.e. do internally what the community "dangling deployments" janitor script does. Protects against any future missed/dropped close event regardless of source.
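The defense-in-depth reconciliation could look roughly like the sketch below. It is only an illustration of the set-difference idea, not provider code: reconcile, listActive, teardown, errQuery and the lease ID strings are hypothetical, and a real implementation would plug in the provider's own lease query and teardownLease path and run the pass on a timer.

```go
package main

import (
	"errors"
	"fmt"
)

// errQuery is a hypothetical error used to demonstrate a failed chain query.
var errQuery = errors.New("on-chain lease query failed")

// reconcile tears down every managed lease that is missing from the active
// on-chain set and returns how many leases were torn down.
// listActive and teardown are hypothetical hooks (assumptions): the first
// returns the IDs of the provider's active leases, the second removes the
// cluster resources for one lease.
func reconcile(managed []string, listActive func() ([]string, error), teardown func(string)) (int, error) {
	active, err := listActive()
	if err != nil {
		// Skip the whole pass on a query error; an empty or partial result
		// must never be mistaken for "every lease is closed".
		return 0, fmt.Errorf("list active leases: %w", err)
	}
	activeSet := make(map[string]struct{}, len(active))
	for _, id := range active {
		activeSet[id] = struct{}{}
	}
	torn := 0
	for _, id := range managed {
		if _, ok := activeSet[id]; !ok {
			teardown(id)
			torn++
		}
	}
	return torn, nil
}

func main() {
	managed := []string{"lease-a", "lease-b", "lease-c"}
	teardown := func(id string) { fmt.Println("teardown", id) }

	// Only lease-a is still active on-chain, so lease-b and lease-c are stale.
	n, err := reconcile(managed, func() ([]string, error) { return []string{"lease-a"}, nil }, teardown)
	fmt.Println("torn down:", n, "err:", err)

	// A failed query tears down nothing.
	n, err = reconcile(managed, func() ([]string, error) { return nil, errQuery }, teardown)
	fmt.Println("torn down:", n, "err:", err)
}
```

The one deliberate design choice here is to skip the pass when the query fails, since tearing down live deployments on a transient RPC error would be worse than the leak itself.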
Note on the existing workaround
The only thing reaping these today is the community "dangling deployments" cleanup script on a cron - a janitor, not a fix. It also didn't scale (its bid pass was O(n²): ~6 jq/bid each re-parsing the whole bid array + a per-bid kubectl get pods; ~1.8k stale bids ⇒ never completed, which also blocked its namespace/manifest cleanup). We patched it to O(n) (>6 min hang → ~51 s) and published a fork. Providers should not need a janitor to avoid stranding most of their capacity.