Escrow-depletion lease closes are silently dropped: `chain-sdk` event publisher reads only `TxsResults`, never `FinalizeBlockEvents` → provider never tears down escrow-closed leases (mass leak post-AEP-76)

## Summary

The provider's chain-event publisher (`pkg.akt.dev/go/util/events`, repo `akash-network/chain-sdk`) only processes **transaction** events from each block and **ignores block-level (`FinalizeBlock`/EndBlock) events**. Leases closed by **escrow depletion** are closed in the escrow module's `EndBlocker` (no transaction), so their `EventLeaseClosed` is never published to the provider's bus. Provider cluster teardown is **exclusively** driven by that event with **no reconciliation fallback**, so escrow-closed leases are **never garbage-collected**: the Kubernetes namespace + `manifest` CR persist forever and `deployment-monitor` loops `check result ok=false attempt=N` indefinitely.

AEP-76 (uAKT→uACT, 2026-03-24) left a large population of legacy escrows depleted/unsettleable, producing a wave of escrow-`EndBlocker` lease closures and turning this latent bug into a fleet-wide leak. On one affected provider it stranded **~710 vCPU + 3 GPU at $0** and froze the bidengine (`insufficient capacity` on every order) - ~90% revenue loss until manually remediated. Reproduced on two independent clusters.

Confirmed end-to-end from source across `akash-network/chain-sdk` and `akash-network/provider` (versions below). Not a hypothesis.

## Versions (all latest; nothing to upgrade to)

- `provider-services` **v0.12.0** (`ghcr.io/akash-network/provider:0.12.0`) - newest release
- `pkg.akt.dev/go` **v0.2.7** → `github.com/akash-network/chain-sdk` (tag `go/v0.2.7`) - pinned by provider v0.12.0 `go.mod`
- CometBFT `github.com/akash-network/cometbft v0.38.21-akash.1` (0.38 → `FinalizeBlockEvents`)
- Chain `akashnet-2`, post-AEP-76 (`uact`)
- Clusters: A = 13-node kubespray + Rook-Ceph; B = 6-node k3s

## Root cause (confirmed from source)

**`chain-sdk` `go/util/events/publish.go` (tag `go/v0.2.7`):**

`NewEvents` subscribes to block headers only:
```go
// query.go
func blkHeaderQuery() pubsub.Query {
    return tmquery.MustCompile(fmt.Sprintf("%s='%s'", tmtypes.EventTypeKey, tmtypes.EventNewBlockHeader))
}
```
On each new block header it calls `processBlock`, which **only iterates transaction results**:
```go
// publish.go  (~L133-152)
func (e *events) processBlock(height int64) {
    blkResults, err := e.client.BlockResults(e.ctx, &height)
    if err != nil { return }
    for _, tx := range blkResults.TxsResults {        // <-- ONLY TxsResults
        if tx == nil { continue }
        for _, ev := range tx.Events {
            if mev, ok := processEvent(ev); ok {       // processEvent handles *mtypes.EventLeaseClosed
                if err := e.bus.Publish(mev); err != nil { return }
            }
        }
    }
}
```
`blkResults.FinalizeBlockEvents` (CometBFT 0.38; equivalently Begin/EndBlock events) are **never read**. `processEvent` *does* handle `*mtypes.EventLeaseClosed` (publish.go ~L174) - it simply is never fed the EndBlock events.
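The gap can be illustrated with a minimal, self-contained Go sketch. Every type and function name below (`Event`, `TxResult`, `BlockResults`, `txEventsOnly`, `allEvents`) is a hypothetical stand-in introduced for illustration, not the real CometBFT or `chain-sdk` API; only the two field names `TxsResults` and `FinalizeBlockEvents` are taken from the source quoted above.

```go
package main

// Event is a hypothetical stand-in for an ABCI event; only the type is kept.
type Event struct {
	Type string
}

// TxResult is a hypothetical stand-in for one entry of TxsResults.
type TxResult struct {
	Events []Event
}

// BlockResults is a reduced, hypothetical mirror of the block-results
// response, keeping only the two fields discussed in this issue.
type BlockResults struct {
	TxsResults          []*TxResult
	FinalizeBlockEvents []Event
}

// txEventsOnly mirrors the traversal quoted above: it walks transaction
// results and never looks at block-level events.
func txEventsOnly(res *BlockResults) []Event {
	var out []Event
	for _, tx := range res.TxsResults {
		if tx == nil {
			continue
		}
		out = append(out, tx.Events...)
	}
	return out
}

// allEvents additionally walks the block-level events, so an event emitted
// outside any transaction (e.g. from an EndBlocker) is also returned.
func allEvents(res *BlockResults) []Event {
	out := txEventsOnly(res)
	out = append(out, res.FinalizeBlockEvents...)
	return out
}
```

With a block that carries one transaction event and one block-level event, `txEventsOnly` returns only the former while `allEvents` returns both, which is the difference between a tx-initiated close and an escrow-depletion close in this report.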

**`provider` `cluster/service.go` (v0.12.0):** the cluster service event loop (~L338-428) reacts to exactly `event.ManifestReceived` and `*mtypes.EventLeaseClosed` → `s.teardownLease(ev.ID)` (**L379-381**). There is **no periodic on-chain reconciliation** - teardown is 100% dependent on that bus event.

**`provider` `cluster/monitor.go` (v0.12.0):** `deployment-monitor` only ever escalates to broadcasting `MsgCloseBid` (L195-210); it **never deletes the namespace/`manifest`**, and it logs `check result ok=false attempt=N` (L120) and keeps retrying (L128-132) for as long as its `deploymentManager` exists. The manager is only destroyed via `teardownLease`.

**Therefore:** lease closed by escrow `EndBlocker` ⇒ `EventLeaseClosed` is in `FinalizeBlockEvents` ⇒ dropped by `processBlock` ⇒ never on the bus ⇒ `service.go` never calls `teardownLease` ⇒ namespace + `manifest` persist forever ⇒ `deployment-monitor` loops `ok=false attempt=N` forever.

Tenant `MsgCloseDeployment` / provider `MsgCloseBid` closes are **transactions** → their `EventLeaseClosed` is in `TxsResults` → caught → torn down normally. This is exactly why tx-initiated closes reap fine while escrow-depletion closes leak.

(One assumption is not verified from source: that the escrow-depletion `EventLeaseClosed` is emitted from the escrow `EndBlocker` / `FinalizeBlock` rather than from a tx. This is normal Cosmos `EndBlocker` behaviour and is consistent with all of the evidence below.)

## Evidence

**Live, Cluster B, 2026-05-18 ~16:46 UTC** - monitor looping on closed leases, `attempt` incrementing in real time (e.g. `…/26702105`: 161→162):
```
INF check result attempt=149 cmp=deployment-monitor lease=akash1n4uut3v…/26309715/1/1/<provider> module=provider-cluster ok=false
INF check result attempt=162 cmp=deployment-monitor lease=akash1rfz266y…/26702105/1/1/<provider> module=provider-cluster ok=false
INF check result attempt=148 cmp=deployment-monitor lease=akash1vmzgel6…/25824764/1/1/<provider> module=provider-cluster ok=false
```
All confirmed closed on-chain via `provider-services query deployment get`, **all via escrow depletion** (escrow `state=closed`, `funds=0`); `25824764` has `ibc/170C…`+`uact` in `transferred` (straddled AEP-76). Namespaces persisted for days; removed only by an external script. No tx-initiated closes among the leaked set.

**Quantified impact (Cluster A, 2026-05-18):**

- 110 namespaces vs 60 active leases → **50 dangling**, all verified closed/overdrawn (0 false positives).
- ~**710 vCPU + 3 GPU at $0**; `inventory-service` counted the zombies, so the bidengine logged `reserving resources error="insufficient capacity"` on every order and provider available capacity was ≈ 0.
- Recovered to **~1,128 vCPU + 5 GPU** only after externally deleting the namespaces **and** the `manifest` CRs, followed by an `operator-inventory`→`akash-provider` restart. The provider reserves from the `manifest` CRs in the single `lease` ns, so deleting a namespace alone doesn't free the reservation.
- Only ~10% of capacity was earning while the leak persisted.
- Dangling-ns creation on A by month: 2026-02=1, 04=27, 05=22 (accelerating across AEP-76).

**Secondary leak (likely same cause):** stale `open` bids never closed - 1,853 (A) / 2,194 (B). `EventOrderClosed`/`EventBidClosed` emitted from EndBlock would be dropped the same way.

## Suggested fix

1. **Primary (`chain-sdk` `processBlock`):** also run `blkResults.FinalizeBlockEvents` through `processEvent`/`e.bus.Publish` (and `BeginBlock`/`EndBlock` results for pre-0.38). One focused change; `processEvent` already handles the typed events.
2. **Defense-in-depth (`provider` `cluster/service`):** add a periodic reconciliation that lists the provider's active on-chain leases and tears down any managed deployment whose lease is no longer active - i.e. do internally what the community "dangling deployments" janitor script does. Protects against any future missed/dropped close event regardless of source.
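The core of the reconciliation in item 2 is a set difference: managed deployments minus active on-chain leases. A minimal sketch follows, assuming leases can be reduced to comparable string keys. `LeaseID` and `staleLeases` are hypothetical names introduced here, and how the two input lists are obtained (chain query, cluster listing) is deliberately left out.

```go
package main

import "sort"

// LeaseID is a hypothetical string key for a lease (for example
// "owner/dseq/gseq/oseq/provider"); the real provider uses a typed ID.
type LeaseID string

// staleLeases returns every managed lease that is absent from the active
// on-chain set, in sorted order. These are the teardown candidates.
func staleLeases(managed, active []LeaseID) []LeaseID {
	// Index the active set once so each lookup is O(1).
	activeSet := make(map[LeaseID]struct{}, len(active))
	for _, id := range active {
		activeSet[id] = struct{}{}
	}

	var stale []LeaseID
	for _, id := range managed {
		if _, ok := activeSet[id]; !ok {
			stale = append(stale, id)
		}
	}

	// Sort for deterministic logging and testing.
	sort.Slice(stale, func(i, j int) bool { return stale[i] < stale[j] })
	return stale
}
```

A real reconciler would need to skip the pass entirely when the on-chain query fails or returns a partial result, since an empty `active` list would otherwise mark every managed lease as stale.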

## Note on the existing workaround

The only thing reaping these today is the community "dangling deployments" cleanup script on a cron - a janitor, not a fix. It also didn't scale: its bid pass was O(n²) (~6 `jq` calls per bid, each re-parsing the whole bid array, plus a per-bid `kubectl get pods`), so with ~1.8k stale bids it never completed, which also blocked its namespace/manifest cleanup. We patched it to O(n) (>6 min hang → ~51 s) and published a fork. Providers should not need a janitor to avoid stranding most of their capacity.

