Fix critical pod readiness check bug causing immediate task termination#225

Open
ThomasBlock wants to merge 2 commits into
swanchain:mainfrom
ThomasBlock:main

Conversation

@ThomasBlock

The pod readiness check relied on the PodReady condition instead of the pod phase. As a result, pods were terminated within 7-9 seconds, before their containers could start.

Changes:

  • Replace incorrect condition-based readiness check with pod phase check
  • Use PodRunning phase status (correct for batch jobs without readiness probes)
  • Apply the fix in 3 locations: ubi_service.go (2x) and space_service.go (1x)
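
The difference between the two checks can be sketched in Go as follows. This is a minimal illustration, not the code changed in this PR: the types are local stand-ins that mirror names from k8s.io/api/core/v1 so the example runs without a cluster, and `hasReadyCondition` and `isRunning` are hypothetical helper names.

```go
package main

import "fmt"

// Local stand-ins mirroring names in k8s.io/api/core/v1 (assumption:
// real code would import that package instead of redefining these).
type PodPhase string

const (
	PodPending PodPhase = "Pending"
	PodRunning PodPhase = "Running"
)

type PodCondition struct {
	Type   string // e.g. "Ready"
	Status string // "True", "False" or "Unknown"
}

type PodStatus struct {
	Phase      PodPhase
	Conditions []PodCondition
}

// hasReadyCondition is the condition-based check: it reports true only
// when a "Ready" condition with status "True" is present.
func hasReadyCondition(s PodStatus) bool {
	for _, c := range s.Conditions {
		if c.Type == "Ready" && c.Status == "True" {
			return true
		}
	}
	return false
}

// isRunning is the phase-based check: it looks only at the pod phase.
func isRunning(s PodStatus) bool {
	return s.Phase == PodRunning
}

func main() {
	notReady := []PodCondition{{Type: "Ready", Status: "False"}}
	starting := PodStatus{Phase: PodPending, Conditions: notReady}
	running := PodStatus{Phase: PodRunning, Conditions: notReady}

	fmt.Println(hasReadyCondition(starting), isRunning(starting)) // false false
	fmt.Println(hasReadyCondition(running), isRunning(running))   // false true
}
```

In this sketch the second pod is already in the Running phase, yet the condition-based check still reports false. A wait loop built on that check would treat such a pod as not started.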

This bug caused all GPU/ZK tasks to fail immediately: the wait loop returned an error, which triggered namespace deletion in the deferred cleanup function.

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>