refactor(tracker): run tx-tracker GQL outside the DB transaction by ipdae · Pull Request #309 · planetarium/NineChronicles.SeasonPass

ipdae · 2026-06-29T09:49:19Z

What

track_tx() opened a session, SELECTed up to 200 unsettled claims (tx_status IN (STAGED, INVALID)), then held that transaction open while it fanned out a per-claim headless GQL batch (process()), committing only at the very end. This splits it into three phases so no transaction is held across the RPC:

1. Read the unsettled claims in a short transaction, capture (id, planet_id, tx_id) as plain scalars, and close() the session before any RPC.
2. GQL batch with no DB transaction held open.
3. Write the resolved statuses in a fresh, short transaction.
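The three phases above can be sketched roughly as follows. This is a minimal illustration, not the tracker's actual code: it uses the standard-library sqlite3 module as a stand-in for the real Postgres session, and `db_path`, `resolve`, and the reduced `claim` schema are hypothetical names introduced only for the example.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from contextlib import closing


def track_tx(db_path, resolve, limit=200, max_workers=10):
    """Sketch of the read -> RPC -> write split (sqlite3 stand-in, hypothetical schema)."""
    # Phase 1: short read; keep only plain scalars, then close the connection.
    with closing(sqlite3.connect(db_path)) as conn:
        claims = conn.execute(
            "SELECT id, planet_id, tx_id FROM claim "
            "WHERE tx_status IN ('STAGED', 'INVALID') ORDER BY id LIMIT ?",
            (limit,),
        ).fetchall()

    # Phase 2: per-claim lookups with no connection or transaction open.
    # `resolve` stands in for the headless GQL call and returns a status string.
    def lookup(claim):
        claim_id, planet_id, tx_id = claim
        return resolve(planet_id, tx_id), claim_id, tx_id

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        resolved = list(pool.map(lookup, claims))

    # Phase 3: fresh, short write transaction, guarded on tx_id.
    with closing(sqlite3.connect(db_path)) as conn:
        with conn:  # commits on success, rolls back on exception
            conn.executemany(
                "UPDATE claim SET tx_status = ? WHERE id = ? AND tx_id = ?",
                resolved,
            )
    return len(resolved)
```

The point of the shape is that a slow `resolve` call can only stretch phase 2, where nothing is connected to the database.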

Why

Found while investigating the mainnet seasonpass DB after the #308 deploy. Live pg_stat_activity showed this exact query sitting idle-in-transaction for 448s:

SELECT claim.* FROM claim WHERE tx_status IN ('STAGED','INVALID') ORDER BY claim.id LIMIT 200

It's a plain SELECT (AccessShareLock, no FOR UPDATE), so it blocked no other query and caused no alarms — but the long-lived transaction pinned the xmin horizon and blocked autovacuum on the hot tables:

- user_season_pass: last_autovacuum = None (never autovacuumed), ~292k dead tuples
- claim: not vacuumed since the prior manual run, ~175k dead tuples

This is the same anti-pattern the API status paths fixed in #308, just living in the tracker (which #308 didn't touch). The tracker runs track_tx() on a 10s loop, so the xmin pin recurs continuously whenever a node is slow.
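For anyone repeating the diagnosis, the checks can be sketched as below. The two SQL strings use standard PostgreSQL statistics views (pg_stat_activity and pg_stat_user_tables); the helper `long_idle_transactions`, its row format (dicts keyed by column name), and the 60-second default threshold are assumptions made for this illustration, not part of the tracker.

```python
from datetime import timedelta

# Sessions currently sitting idle inside an open transaction.
IDLE_IN_TXN_SQL = """
SELECT pid, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
"""

# Dead-tuple counts and last autovacuum time per table.
DEAD_TUPLES_SQL = """
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
"""


def long_idle_transactions(rows, now, threshold=timedelta(seconds=60)):
    """Return (pid, age) pairs for transactions open at least `threshold`, oldest first.

    `rows` is assumed to be the result of IDLE_IN_TXN_SQL as a list of dicts.
    """
    stale = [
        (row["pid"], now - row["xact_start"])
        for row in rows
        if row["xact_start"] is not None and now - row["xact_start"] >= threshold
    ]
    return sorted(stale, key=lambda item: item[1], reverse=True)
```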

Correctness notes

- The write now guards on tx_id as well as id: UPDATE claim SET tx_status=... WHERE id=:id AND tx_id=:tx_id. Decoupling the read from the write widens the read→write gap, so a claim that was re-staged with a new tx in between is now left untouched (picked up next cycle) instead of being clobbered with a stale status. process() returns the same tx_id it was given, so this is exactly the tx we resolved.
- start_id/end_id/count logging and the STAGED/INVALID selection are unchanged.
- The per-claim process() semantics, planet conversion (PlanetID(...)), and thread pool (max_workers=10) are unchanged.
- Also drops a dead import os.
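The effect of the tx_id guard can be seen in a small, self-contained sketch. It uses an in-memory sqlite3 table as a stand-in for the real claim model; `apply_status`, the reduced schema, and the `tx-old`/`tx-new` values are hypothetical and exist only for this example.

```python
import sqlite3


def apply_status(conn, claim_id, tx_id, status):
    """Apply a resolved status only if the row still carries the tx_id we resolved."""
    cur = conn.execute(
        "UPDATE claim SET tx_status = ? WHERE id = ? AND tx_id = ?",
        (status, claim_id, tx_id),
    )
    return cur.rowcount == 1  # False means the guard skipped the row


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claim (id INTEGER PRIMARY KEY, tx_id TEXT, tx_status TEXT)")
conn.execute("INSERT INTO claim VALUES (1, 'tx-old', 'STAGED')")

# Read phase captured this snapshot before the RPC batch ran.
snapshot_id, snapshot_tx = 1, "tx-old"

# Meanwhile another path re-staged the claim with a new tx.
conn.execute("UPDATE claim SET tx_id = 'tx-new', tx_status = 'STAGED' WHERE id = 1")

# The stale result for tx-old no longer matches, so the row is left alone.
applied = apply_status(conn, snapshot_id, snapshot_tx, "FAILURE")
row = conn.execute("SELECT tx_id, tx_status FROM claim WHERE id = 1").fetchone()
```

Here `applied` is False and the row keeps its new tx_id with STAGED status, so the next cycle picks it up with the current tx.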

Testing

The tracker has no existing test suite and the models use postgres-only types (ARRAY/ENUM/JSONB), so an in-memory harness is disproportionate here. Verified via py_compile + black/isort/autoflake (pre-commit). A tracker test harness is a reasonable follow-up. Behavior is otherwise a straight restructuring of the existing read→GQL→write flow.

Independent of #306/#308 (touches apps/tracker only) — targets main directly.

🤖 Generated with Claude Code

track_tx() opened a session, SELECTed up to 200 unsettled claims, then held that transaction open while it fanned out a per-claim headless GQL batch (process()) and only committed at the very end. When a node was slow the session sat idle-in-transaction for minutes (observed 448s live via pg_stat_activity). It is a plain SELECT (AccessShareLock, no FOR UPDATE) so it blocked no other query, but the long-lived transaction pinned the xmin horizon and prevented autovacuum from reclaiming dead tuples on claim / user_season_pass (user_season_pass had never been autovacuumed; claim/user_season_pass dead tuples were piling into the hundreds of thousands).

This is the same anti-pattern the API status paths fixed in #308, applied to the tracker:

1. Read the unsettled claims in a short transaction, capture (id, planet_id, tx_id) as plain scalars, and close the session before any RPC.
2. Run the headless GQL batch with no DB transaction held open.
3. Persist the resolved statuses in a fresh, short write transaction.

Because the read->write gap is wider now (the GQL batch runs outside the transaction), the write is guarded: UPDATE ... WHERE id=:id AND tx_id=:tx_id AND tx_status IN (STAGED, INVALID). This skips a claim re-staged with a new tx in between, and never clobbers a row already finalized to SUCCESS/FAILURE by another path with a stale GQL result. process() returns the same tx_id it was given.

Logging and the STAGED/INVALID selection are unchanged. Also drops a dead `import os`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ipdae force-pushed the yang/tx-tracker-gql-outside-txn branch from 76b4e55 to 1bef842 on June 29, 2026 10:01

ipdae changed the base branch from main to fix/gql-outside-db-txn on June 29, 2026 10:01

ipdae force-pushed the yang/tx-tracker-gql-outside-txn branch from 1bef842 to 52e16aa on June 29, 2026 10:10

This was referenced Jun 29, 2026

fix(api): resolve level before mutating target in status backfill (drop early row lock) #310 (Open)
feat(api): make /invalid-claim return a stuck-claim count instead of 503-if-any #311 (Open)

ipdae wants to merge 1 commit into fix/gql-outside-db-txn from yang/tx-tracker-gql-outside-txn
