refactor(tracker): run tx-tracker GQL outside the DB transaction #309
Open
ipdae wants to merge 1 commit into
76b4e55 to
1bef842
Compare
track_tx() opened a session, SELECTed up to 200 unsettled claims, then held that transaction open while it fanned out a per-claim headless GQL batch (process()) and only committed at the very end. When a node was slow the session sat idle-in-transaction for minutes (observed 448s live via pg_stat_activity).

It is a plain SELECT (AccessShareLock, no FOR UPDATE) so it blocked no other query, but the long-lived transaction pinned the xmin horizon and prevented autovacuum from reclaiming dead tuples on claim / user_season_pass (user_season_pass had never been autovacuumed; claim/user_season_pass dead tuples were piling into the hundreds of thousands).

This is the same anti-pattern the API status paths fixed in #308, applied to the tracker:

1. Read the unsettled claims in a short transaction, capture (id, planet_id, tx_id) as plain scalars, and close the session before any RPC.
2. Run the headless GQL batch with no DB transaction held open.
3. Persist the resolved statuses in a fresh, short write transaction.

Because the read->write gap is wider now (the GQL batch runs outside the transaction), the write is guarded:

UPDATE ... WHERE id=:id AND tx_id=:tx_id AND tx_status IN (STAGED, INVALID)

This skips a claim re-staged with a new tx in between, and never clobbers a row already finalized to SUCCESS/FAILURE by another path with a stale GQL result. process() returns the same tx_id it was given.

Logging and the STAGED/INVALID selection are unchanged. Also drops a dead `import os`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
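The three phases and the guarded write described in the commit message can be sketched roughly as below. This is a minimal illustration, not the tracker's actual code: it uses the standard-library `sqlite3` module in place of the real Postgres session, a simplified `claim` table, and the hypothetical names `track_once` and `resolve_status` (standing in for `track_tx()` and `process()`).

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple

UNSETTLED = ("STAGED", "INVALID")


def resolve_status(planet_id: str, tx_id: str) -> Tuple[str, str]:
    """Hypothetical stand-in for the headless GQL call.

    Returns the same tx_id it was given, plus the resolved status.
    """
    return tx_id, "SUCCESS"


def track_once(
    db_path: str,
    resolver: Callable[[str, str], Tuple[str, str]] = resolve_status,
    limit: int = 200,
) -> int:
    """Run one tracker cycle and return the number of rows updated."""
    # Phase 1: short read. Keep only plain scalars, then close the
    # connection so nothing is held open during the RPC fan-out.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT id, planet_id, tx_id FROM claim "
            "WHERE tx_status IN (?, ?) ORDER BY id LIMIT ?",
            (*UNSETTLED, limit),
        ).fetchall()
    finally:
        conn.close()

    # Phase 2: resolve statuses with no DB connection or transaction open.
    with ThreadPoolExecutor(max_workers=10) as pool:
        resolved = list(pool.map(lambda r: (r[0], *resolver(r[1], r[2])), rows))

    # Phase 3: fresh, short write transaction with a guarded UPDATE.
    updated = 0
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on an exception
            for claim_id, tx_id, status in resolved:
                cur = conn.execute(
                    "UPDATE claim SET tx_status = ? "
                    "WHERE id = ? AND tx_id = ? AND tx_status IN (?, ?)",
                    (status, claim_id, tx_id, *UNSETTLED),
                )
                updated += cur.rowcount  # 0 when the guard skipped the row
    finally:
        conn.close()
    return updated
```

Because the guard matches on `tx_id` and the unsettled statuses, a claim that was re-staged or finalized while phase 2 was running simply yields a row count of 0 and is left for the next cycle.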
1bef842 to
52e16aa
Compare
This was referenced Jun 29, 2026
What
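`track_tx()` opened a session, `SELECT`ed up to 200 unsettled claims (`tx_status IN (STAGED, INVALID)`), then held that transaction open while it fanned out a per-claim headless GQL batch (`process()`), committing only at the very end. This splits it into three phases so no transaction is held across the RPC:

1. Read the unsettled claims in a short transaction, capture `(id, planet_id, tx_id)` as plain scalars, and `close()` the session before any RPC.
2. Run the headless GQL batch with no DB transaction held open.
3. Persist the resolved statuses in a fresh, short write transaction.

Why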
Found while investigating the mainnet seasonpass DB after the #308 deploy. Live `pg_stat_activity` showed this exact query sitting idle-in-transaction for 448s.

It's a plain `SELECT` (`AccessShareLock`, no `FOR UPDATE`), so it blocked no other query and caused no alarms, but the long-lived transaction pinned the xmin horizon and blocked autovacuum on the hot tables:

- `user_season_pass`: `last_autovacuum = None` (never autovacuumed), ~292k dead tuples
- `claim`: not vacuumed since the prior manual run, ~175k dead tuples

This is the same anti-pattern the API status paths fixed in #308, just living in the tracker (which #308 didn't touch). The tracker runs `track_tx()` on a 10s loop, so the xmin pin recurs continuously whenever a node is slow.

Correctness notes
- The write is guarded on `tx_id` as well as `id`: `UPDATE claim SET tx_status=... WHERE id=:id AND tx_id=:tx_id`. Decoupling the read from the write widens the read→write gap, so a claim that was re-staged with a new tx in between is now left untouched (picked up next cycle) instead of being clobbered with a stale status. `process()` returns the same `tx_id` it was given, so this is exactly the tx we resolved.
- `start_id`/`end_id`/`count` logging and the `STAGED`/`INVALID` selection are unchanged.
- `process()` semantics, planet conversion (`PlanetID(...)`), and thread pool (`max_workers=10`) are unchanged.
- Drops a dead `import os`.

Testing
The tracker has no existing test suite and the models use Postgres-only types (`ARRAY`/`ENUM`/`JSONB`), so an in-memory harness is disproportionate here. Verified via `py_compile` + black/isort/autoflake (pre-commit). A tracker test harness is a reasonable follow-up. Behavior is otherwise a straight restructuring of the existing read→GQL→write flow.

Independent of #306/#308 (touches `apps/tracker` only); targets `main` directly.

🤖 Generated with Claude Code
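For anyone wanting to re-check the symptoms described under Why after this lands, a diagnostic along the following lines may help. It is a sketch only: the two SQL strings use standard PostgreSQL statistics views (`pg_stat_activity`, `pg_stat_user_tables`), while the helper `long_idle_sessions`, the 60s threshold, and the assumption that rows arrive as dict-like mappings with `xact_age` as a `timedelta` are illustrative choices, not part of this PR.

```python
from datetime import timedelta
from typing import Iterable, List, Mapping

# Sessions currently idle inside an open transaction, oldest first.
IDLE_IN_TX_SQL = """
SELECT pid, state, now() - xact_start AS xact_age, backend_xmin, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start
"""

# Dead-tuple and autovacuum status for the two hot tables named above.
DEAD_TUPLES_SQL = """
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname IN ('claim', 'user_season_pass')
"""


def long_idle_sessions(
    rows: Iterable[Mapping],
    threshold: timedelta = timedelta(seconds=60),  # assumed alert threshold
) -> List[Mapping]:
    """Return the rows whose transaction has been idle for at least `threshold`.

    Assumes each row is a mapping with 'state' and 'xact_age' keys, e.g. the
    result of IDLE_IN_TX_SQL fetched through a dict-style cursor.
    """
    return [
        row
        for row in rows
        if row["state"] == "idle in transaction" and row["xact_age"] >= threshold
    ]
```

With the three-phase flow in place, the tracker's session should no longer show up in the first query for longer than a single short read or write.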