
refactor(tracker): run tx-tracker GQL outside the DB transaction #309

Open
ipdae wants to merge 1 commit into fix/gql-outside-db-txn from yang/tx-tracker-gql-outside-txn


@ipdae ipdae commented Jun 29, 2026


What

track_tx() opened a session, SELECTed up to 200 unsettled claims (tx_status IN (STAGED, INVALID)), then held that transaction open while it fanned out a per-claim headless GQL batch (process()), committing only at the very end. This splits it into three phases so no transaction is held across the RPC:

  1. Read the unsettled claims in a short transaction, capture (id, planet_id, tx_id) as plain scalars, and close() the session before any RPC.
  2. GQL batch with no DB transaction held open.
  3. Write the resolved statuses in a fresh, short transaction.
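
As a rough illustration of the three phases above, here is a minimal sketch. It is not the tracker's actual code: it uses the standard-library sqlite3 module as a stand-in for the real Postgres session, and `resolve_status` and `track_tx_sketch` are hypothetical names, with `resolve_status` standing in for the headless GQL call made by process().

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def resolve_status(tx_id: str) -> str:
    # Hypothetical stand-in for the headless GQL lookup; no network involved.
    return "SUCCESS" if tx_id.endswith("ok") else "FAILURE"


def track_tx_sketch(db_path: str) -> int:
    # Phase 1: short read. Copy plain scalars, then close before any RPC.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT id, planet_id, tx_id FROM claim "
            "WHERE tx_status IN ('STAGED', 'INVALID') ORDER BY id LIMIT 200"
        ).fetchall()
    finally:
        conn.close()

    # Phase 2: fan out the slow calls with no connection or transaction open.
    with ThreadPoolExecutor(max_workers=10) as pool:
        statuses = list(pool.map(lambda row: resolve_status(row[2]), rows))

    # Phase 3: fresh, short write transaction.
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            for (claim_id, _planet_id, tx_id), status in zip(rows, statuses):
                conn.execute(
                    "UPDATE claim SET tx_status = ? WHERE id = ? AND tx_id = ?",
                    (status, claim_id, tx_id),
                )
    finally:
        conn.close()
    return len(rows)
```

The point of the shape is that the slow fan-out in phase 2 only touches the captured tuples, so a slow node can stretch the cycle without stretching any transaction.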

Why

Found while investigating the mainnet seasonpass DB after the #308 deploy. Live pg_stat_activity showed this exact query sitting idle-in-transaction for 448s:

SELECT claim.* FROM claim WHERE tx_status IN ('STAGED','INVALID') ORDER BY claim.id LIMIT 200

It's a plain SELECT (AccessShareLock, no FOR UPDATE), so it blocked no other query and caused no alarms — but the long-lived transaction pinned the xmin horizon and blocked autovacuum on the hot tables:

  • user_season_pass: last_autovacuum = None (never autovacuumed), ~292k dead tuples
  • claim: not vacuumed since the prior manual run, ~175k dead tuples

This is the same anti-pattern the API status paths fixed in #308, just living in the tracker (which #308 didn't touch). The tracker runs track_tx() on a 10s loop, so the xmin pin recurs continuously whenever a node is slow.
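
For anyone who wants to repeat the diagnosis, a sketch of the kind of check involved follows. The two queries use standard columns of PostgreSQL's pg_stat_activity and pg_stat_user_tables views; the `long_idle_sessions` helper and its 60-second threshold are hypothetical and only show how fetched rows might be filtered, they are not part of this PR.

```python
from datetime import datetime, timedelta, timezone

# Sessions currently sitting idle inside an open transaction, oldest first.
IDLE_IN_TXN_SQL = """
SELECT pid, state, xact_start, backend_xid, backend_xmin, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start
"""

# Dead-tuple counts and last autovacuum time for the two hot tables.
DEAD_TUPLES_SQL = """
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname IN ('claim', 'user_season_pass')
"""


def long_idle_sessions(rows, now, threshold=timedelta(seconds=60)):
    """Return (pid, age) for sessions idle in transaction longer than threshold."""
    flagged = []
    for row in rows:
        if row["state"] != "idle in transaction" or row["xact_start"] is None:
            continue
        age = now - row["xact_start"]
        if age > threshold:
            flagged.append((row["pid"], age))
    return flagged
```

Fed the rows from the first query, the helper would flag a session like the 448s one above while ignoring active or briefly idle sessions.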

Correctness notes

  • The write now guards on tx_id as well as id: UPDATE claim SET tx_status=... WHERE id=:id AND tx_id=:tx_id. Decoupling the read from the write widens the read→write gap, so a claim that was re-staged with a new tx in between is now left untouched (picked up next cycle) instead of being clobbered with a stale status. process() returns the same tx_id it was given, so this is exactly the tx we resolved.
  • start_id/end_id/count logging and the STAGED/INVALID selection are unchanged.
  • The per-claim process() semantics, planet conversion (PlanetID(...)), and thread pool (max_workers=10) are unchanged.
  • Also drops a dead import os.
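
The tx_id guard in the first note can be illustrated with a small sketch. It again uses sqlite3 in place of Postgres, and `apply_status` is a hypothetical helper rather than a function in the tracker; the only thing taken from this PR is the shape of the WHERE clause.

```python
import sqlite3


def apply_status(conn, claim_id, tx_id, status):
    """Guarded write: only touch the row if it still carries the tx we resolved."""
    cur = conn.execute(
        "UPDATE claim SET tx_status = ? WHERE id = ? AND tx_id = ?",
        (status, claim_id, tx_id),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claim (id INTEGER PRIMARY KEY, tx_id TEXT, tx_status TEXT)")
conn.execute("INSERT INTO claim VALUES (1, 'tx-old', 'STAGED')")

# Between the read and the write, another path re-stages the claim with a new tx.
conn.execute("UPDATE claim SET tx_id = 'tx-new', tx_status = 'STAGED' WHERE id = 1")

# The stale result for 'tx-old' matches no row, so nothing is overwritten.
applied = apply_status(conn, 1, "tx-old", "FAILURE")
conn.commit()
```

Here `applied` is False and the row keeps 'tx-new' / 'STAGED', which is the "left untouched, picked up next cycle" behavior described above.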

Testing

The tracker has no existing test suite and the models use postgres-only types (ARRAY/ENUM/JSONB), so an in-memory harness is disproportionate here. Verified via py_compile + black/isort/autoflake (pre-commit). A tracker test harness is a reasonable follow-up. Behavior is otherwise a straight restructuring of the existing read→GQL→write flow.

Independent of #306/#308 (touches apps/tracker only) — targets main directly.

🤖 Generated with Claude Code

@ipdae ipdae force-pushed the yang/tx-tracker-gql-outside-txn branch from 76b4e55 to 1bef842 Compare June 29, 2026 10:01
@ipdae ipdae changed the base branch from main to fix/gql-outside-db-txn June 29, 2026 10:01
track_tx() opened a session, SELECTed up to 200 unsettled claims, then held
that transaction open while it fanned out a per-claim headless GQL batch
(process()) and only committed at the very end. When a node was slow the
session sat idle-in-transaction for minutes (observed 448s live via
pg_stat_activity). It is a plain SELECT (AccessShareLock, no FOR UPDATE) so it
blocked no other query, but the long-lived transaction pinned the xmin horizon
and prevented autovacuum from reclaiming dead tuples on claim / user_season_pass
(user_season_pass had never been autovacuumed; claim/user_season_pass dead
tuples were piling into the hundreds of thousands).

This is the same anti-pattern the API status paths fixed in #308, applied to
the tracker:
  1. Read the unsettled claims in a short transaction, capture (id, planet_id,
     tx_id) as plain scalars, and close the session before any RPC.
  2. Run the headless GQL batch with no DB transaction held open.
  3. Persist the resolved statuses in a fresh, short write transaction.

Because the read->write gap is wider now (the GQL batch runs outside the
transaction), the write is guarded: UPDATE ... WHERE id=:id AND tx_id=:tx_id AND
tx_status IN (STAGED, INVALID). This skips a claim that was re-staged with a new
tx in between, and it never lets a stale GQL result clobber a row that another
path has already finalized to SUCCESS/FAILURE. process() returns the same tx_id
it was given.

Logging and the STAGED/INVALID selection are unchanged. Also drops a dead
`import os`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>