Escrow-depletion lease closes are silently dropped: chain-sdk event publisher reads only TxsResults, never FinalizeBlockEvents → provider never tears down escrow-closed leases (mass leak post-AEP-76) #515

@cmartins88


Summary

The provider's chain-event publisher (pkg.akt.dev/go/util/events, repo akash-network/chain-sdk) only processes transaction events from each block and ignores block-level (FinalizeBlock/EndBlock) events. Leases closed by escrow depletion are closed in the escrow module's EndBlocker (no transaction), so their EventLeaseClosed is never published to the provider's bus. Provider cluster teardown is exclusively driven by that event with no reconciliation fallback, so escrow-closed leases are never garbage-collected: the Kubernetes namespace + manifest CR persist forever and deployment-monitor loops check result ok=false attempt=N indefinitely.

AEP-76 (uAKT→uACT, 2026-03-24) left a large population of legacy escrows depleted/unsettleable, producing a wave of escrow-EndBlocker lease closures and turning this latent bug into a fleet-wide leak. On one affected provider it stranded ~710 vCPU + 3 GPU at $0 and froze the bidengine (insufficient capacity on every order) - ~90% revenue loss until manually remediated. Reproduced on two independent clusters.

Confirmed end-to-end from source across akash-network/chain-sdk and akash-network/provider (versions below). Not a hypothesis.

Versions (all latest; nothing to upgrade to)

  • provider-services v0.12.0 (ghcr.io/akash-network/provider:0.12.0) - newest release
  • pkg.akt.dev/go v0.2.7, i.e. github.com/akash-network/chain-sdk (tag go/v0.2.7) - pinned by provider v0.12.0 go.mod
  • CometBFT github.com/akash-network/cometbft v0.38.21-akash.1 (0.38 → FinalizeBlockEvents)
  • Chain akashnet-2, post-AEP-76 (uact)
  • Clusters: A = 13-node kubespray + Rook-Ceph; B = 6-node k3s

Root cause (confirmed from source)

chain-sdk go/util/events/publish.go (tag go/v0.2.7):

NewEvents subscribes to block headers only:

// query.go
func blkHeaderQuery() pubsub.Query {
    return tmquery.MustCompile(fmt.Sprintf("%s='%s'", tmtypes.EventTypeKey, tmtypes.EventNewBlockHeader))
}

On each new block header it calls processBlock, which only iterates transaction results:

// publish.go  (~L133-152)
func (e *events) processBlock(height int64) {
    blkResults, err := e.client.BlockResults(e.ctx, &height)
    if err != nil { return }
    for _, tx := range blkResults.TxsResults {        // <-- ONLY TxsResults
        if tx == nil { continue }
        for _, ev := range tx.Events {
            if mev, ok := processEvent(ev); ok {       // processEvent handles *mtypes.EventLeaseClosed
                if err := e.bus.Publish(mev); err != nil { return }
            }
        }
    }
}

blkResults.FinalizeBlockEvents (CometBFT 0.38; equivalently Begin/EndBlock events) are never read. processEvent does handle *mtypes.EventLeaseClosed (publish.go ~L174) - it simply is never fed the EndBlock events.
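The gap can be modelled in isolation. The sketch below is illustrative only: Event, ExecTxResult and ResultBlockResults are local stand-ins that mirror just the CometBFT 0.38 fields named in this issue, and txEventsOnly / allEvents are hypothetical helper names, not chain-sdk functions. It shows that a loop over TxsResults alone cannot see a block-level event, while a loop that also walks FinalizeBlockEvents does.

```go
package main

// Local stand-ins (assumption): in the real code these would be CometBFT's
// abci.Event, abci.ExecTxResult and coretypes.ResultBlockResults. Only the
// fields discussed in this issue are reproduced so the sketch has no
// external dependencies.
type Event struct {
	Type string
}

type ExecTxResult struct {
	Events []Event
}

type ResultBlockResults struct {
	TxsResults          []*ExecTxResult
	FinalizeBlockEvents []Event
}

// txEventsOnly mirrors the shape of the current processBlock loop: it walks
// TxsResults and never looks at FinalizeBlockEvents.
func txEventsOnly(res *ResultBlockResults) []Event {
	var out []Event
	for _, tx := range res.TxsResults {
		if tx == nil {
			continue
		}
		out = append(out, tx.Events...)
	}
	return out
}

// allEvents (hypothetical) returns the transaction events followed by the
// block-level events. Appending block events last is a simplification of
// this sketch, not a statement about the order in which the chain emitted them.
func allEvents(res *ResultBlockResults) []Event {
	out := txEventsOnly(res)
	return append(out, res.FinalizeBlockEvents...)
}
```

With a block whose only lease close sits in FinalizeBlockEvents, txEventsOnly returns nothing for that lease, which matches the observed behaviour. In the real publisher each collected event would still go through processEvent and e.bus.Publish.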

provider cluster/service.go (v0.12.0): the cluster service event loop (~L338-428) reacts to exactly event.ManifestReceived and *mtypes.EventLeaseClosed → s.teardownLease(ev.ID) (L379-381). There is no periodic on-chain reconciliation - teardown is 100% dependent on that bus event.

provider cluster/monitor.go (v0.12.0): deployment-monitor only ever escalates to broadcasting MsgCloseBid (L195-210); it never deletes the namespace/manifest, and it logs check result ok=false attempt=N (L120) and keeps retrying (L128-132) for as long as its deploymentManager exists. The manager is only destroyed via teardownLease.

Therefore: lease closed by escrow EndBlocker ⇒ EventLeaseClosed is in FinalizeBlockEvents ⇒ dropped by processBlock ⇒ never on the bus ⇒ service.go never calls teardownLease ⇒ namespace + manifest persist forever ⇒ deployment-monitor loops ok=false attempt=N forever.

Tenant MsgCloseDeployment / provider MsgCloseBid closes are transactions → their EventLeaseClosed is in TxsResults → caught → torn down normally. This is exactly why tx-initiated closes reap fine while escrow-depletion closes leak.

(The single standard assumption: escrow-depletion EventLeaseClosed is emitted from the escrow EndBlocker / FinalizeBlock, not a tx - normal Cosmos EndBlocker behaviour, and corroborated by 100% of the evidence below.)

Evidence

Live, Cluster B, 2026-05-18 ~16:46 UTC - monitor looping on closed leases, attempt incrementing in real time (e.g. …/26702105: 161→162):

INF check result attempt=149 cmp=deployment-monitor lease=akash1n4uut3v…/26309715/1/1/<provider> module=provider-cluster ok=false
INF check result attempt=162 cmp=deployment-monitor lease=akash1rfz266y…/26702105/1/1/<provider> module=provider-cluster ok=false
INF check result attempt=148 cmp=deployment-monitor lease=akash1vmzgel6…/25824764/1/1/<provider> module=provider-cluster ok=false

All confirmed closed on-chain via provider-services query deployment get, all via escrow depletion (escrow state=closed, funds=0); 25824764 has ibc/170C…+uact in transferred (straddled AEP-76). Namespaces persisted for days; removed only by an external script. No tx-initiated closes among the leaked set.

Quantified impact (Cluster A, 2026-05-18): 110 namespaces vs 60 active leases → 50 dangling, all verified closed/overdrawn (0 false positives). ~710 vCPU + 3 GPU at $0; inventory-service counted the zombies → bidengine reserving resources error="insufficient capacity" on every order; provider available ≈ 0. Recovered to ~1,128 vCPU + 5 GPU only after externally deleting namespaces and the manifest CRs (provider reserves from the manifest CRs in the single lease ns; deleting the namespace alone doesn't free the reservation) and restarting operator-inventory and akash-provider. ~10% of capacity earning while it persisted. Dangling-ns creation on A by month: 2026-02=1, 04=27, 05=22 (accelerating across AEP-76).

Secondary leak (likely same cause): stale open bids never closed - 1,853 (A) / 2,194 (B). EventOrderClosed/EventBidClosed emitted from EndBlock would be dropped the same way.

Suggested fix

  1. Primary (chain-sdk processBlock): also run blkResults.FinalizeBlockEvents through processEvent/e.bus.Publish (and BeginBlock/EndBlock results for pre-0.38). One focused change; processEvent already handles the typed events.
  2. Defense-in-depth (provider cluster/service): add a periodic reconciliation that lists the provider's active on-chain leases and tears down any managed deployment whose lease is no longer active - i.e. do internally what the community "dangling deployments" janitor script does. Protects against any future missed/dropped close event regardless of source.
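
The reconciliation in item 2 amounts to a set difference between what the provider manages and what is still active on-chain. A minimal sketch, assuming a simplified string LeaseID and caller-supplied listActive / teardown callbacks (all hypothetical names; the provider's real lease ID type, chain query and teardownLease signature are not reproduced here):

```go
package main

import "sort"

// LeaseID is a simplified stand-in (assumption) for the provider's lease ID type.
type LeaseID string

// staleLeases returns the managed leases that are absent from the active
// on-chain set, sorted so that teardown order is deterministic.
func staleLeases(managed, active []LeaseID) []LeaseID {
	activeSet := make(map[LeaseID]struct{}, len(active))
	for _, id := range active {
		activeSet[id] = struct{}{}
	}
	var stale []LeaseID
	for _, id := range managed {
		if _, ok := activeSet[id]; !ok {
			stale = append(stale, id)
		}
	}
	sort.Slice(stale, func(i, j int) bool { return stale[i] < stale[j] })
	return stale
}

// reconcileOnce performs one reconciliation pass and reports how many leases
// were handed to teardown. listActive and teardown are hypothetical hooks for
// the on-chain query and the provider's teardown path.
func reconcileOnce(
	managed []LeaseID,
	listActive func() ([]LeaseID, error),
	teardown func(LeaseID),
) (int, error) {
	active, err := listActive()
	if err != nil {
		// Tear nothing down on a failed query: an empty result caused by an
		// error would otherwise make every managed lease look closed.
		return 0, err
	}
	stale := staleLeases(managed, active)
	for _, id := range stale {
		teardown(id)
	}
	return len(stale), nil
}
```

In a provider this pass would presumably run on a timer alongside the existing event loop, so a dropped close event delays teardown by at most one interval instead of leaking the deployment indefinitely.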

Note on the existing workaround

The only thing reaping these today is the community "dangling deployments" cleanup script on a cron - a janitor, not a fix. It also didn't scale (its bid pass was O(n²): ~6 jq/bid each re-parsing the whole bid array + a per-bid kubectl get pods; ~1.8k stale bids ⇒ never completed, which also blocked its namespace/manifest cleanup). We patched it to O(n) (>6 min hang → ~51 s) and published a fork. Providers should not need a janitor to avoid stranding most of their capacity.
