Skip to content

[connector] Fix tp_coordinator race condition when finish event arrives before start event#208

Open
lpdink wants to merge 1 commit into
alibaba:mainfrom
lpdink:feature_tp_coordinator_race_fix
Open

[connector] Fix tp_coordinator race condition when finish event arrives before start event#208
lpdink wants to merge 1 commit into
alibaba:mainfrom
lpdink:feature_tp_coordinator_race_fix

Conversation

@lpdink

@lpdink lpdink commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fix a race condition in TpCoordinatorServer where SendBlockFinishedEvent from TP Workers could arrive before SendBlockStartEvent from the Scheduler, causing an assert crash. The fix buffers out-of-order finish events and merges them when the start event arrives, aligning the save flow with the load flow's existing tolerant pattern.

Motivation

The vllm Connector's save flow uses ZMQ for async communication between the Scheduler and TP Workers:

  1. Scheduler calls start_write_cache (HTTP), then sends SendBlockStartEvent via ZMQ to register the save context
  2. TP Workers perform data transfer, then send SendBlockFinishedEvent via ZMQ to report results

These two messages come from different ZMQ PUSH sockets connecting to the same PULL socket. ZMQ does not guarantee cross-sender message ordering. When data transfer is very fast (small data, local storage), the finish event can arrive before the start event, triggering:

assert save_id in running_save  # AssertionError → process crash

This is not backend-specific — it can happen with any storage backend when transfer latency is low enough.

How It Works

Before (fragile):

Worker finish ──ZMQ──→ coordinator: assert save_id in running_save → CRASH
Scheduler start ──ZMQ──→ coordinator: (never reached)

After (tolerant):

Worker finish ──ZMQ──→ coordinator: save_id not found → buffer to pending_save_finishes
Scheduler start ──ZMQ──→ coordinator: create SaveContext → merge pending finishes → callback if all ranks done

This mirrors the existing load flow, which already uses a tolerant if load_id not in running_load: create pattern.

Changes

Category Files Description
Core fix common/tp_coordinator.py Replace assert with pending buffer; merge on start event arrival
Tests test/test_tp_coordinator.py 4 test cases: normal order, out-of-order, mixed order, duplicate finish

2 files, +296 lines

Known Limitations

  1. Pending buffer has no timeout cleanup: If SendBlockStartEvent never arrives (upstream bug), buffered finish events remain in memory until the coordinator thread exits. This is acceptable because a missing start event indicates a more serious issue that should be investigated.

  2. Only affects vllm connector: The TpCoordinatorServer is specific to the vllm connector's TP coordination. sglang and trtllm connectors use different mechanisms and are not affected.

  3. Normal-order path unchanged: When messages arrive in the expected order (start before finish), behavior is identical to the previous implementation — no performance impact.

Testing

Unit tests (4 cases, all pass):

Test Scenario Result
test_normal_order_callback_triggered Start → Finish(rank 0) → Finish(rank 1) ✅ Callback with 2 ranks
test_finish_before_start_no_crash Finish(rank 0) → Finish(rank 1) → Start ✅ Callback with 2 ranks
test_mixed_order_multiple_ranks Finish(rank 0) → Start → Finish(rank 1) ✅ Callback with 2 ranks
test_duplicate_finish_ignored Start → Finish(rank 0) → Finish(rank 0) → Finish(rank 1) ✅ Callback with 2 ranks (idempotent)

Full test suite: All 83 bazel tests pass, no regressions.

…ore start

- Buffer SendBlockFinishedEvent in pending_save_finishes when SendBlockStartEvent
  has not arrived yet, instead of asserting and crashing
- Merge buffered finish events when SendBlockStartEvent arrives
- Aligns save flow with load flow's existing tolerant pattern
- Add unit tests covering normal order, out-of-order, mixed order, and
  duplicate finish scenarios

@qoderai qoderai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

The core fix is sound — replacing the assert with a pending buffer correctly handles the ZMQ cross-sender ordering race, and the approach aligns well with the existing load-flow tolerant pattern. SaveContext.add_new_rank already deduplicates by rank, so duplicate finishes are handled safely in both the buffered and unbuffered paths.

Suggestions (non-blocking):

  1. Duplicated completion logic: The callback/completion block now appears in both the start-event and finish-event handlers. Extracting a helper would reduce the risk of them diverging over time.

  2. Test cleanup: tearDown doesn't join the coordinator thread or close ZMQ contexts/sockets, which can leak resources between tests.

  3. Fragile time.sleep(0.1): The server-readiness wait could fail under load; a polling approach would be more robust.

  4. Missing test coverage: No test exercises duplicate finish events buffered before the start event (both hits the pending_save_finishes path). add_new_rank deduplicates correctly, but a test confirming this would be valuable.


🤖 Generated by Qoder

Comment on lines +156 to +157
if running_save[save_id].get_size() == self._tp_world_size:
save_context = running_save.pop(save_id)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The completion-check and callback logic is now duplicated in two places: the start-event handler (lines 156-161) and the finish-event handler (lines 175-184). If someone updates one path but not the other, it could introduce subtle bugs. Consider extracting a small helper like _complete_save(save_id, content, running_save) to consolidate this.


🤖 Generated by Qoder

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wont fix.

self.assertEqual(session_id, "session-1")
self.assertEqual(save_context.get_size(), 2)

def tearDown(self):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tearDown sets _coordinator_running = False but doesn't join the coordinator thread or close the ZMQ client/server sockets. This can leave dangling threads and file descriptors between tests. Consider calling self.server._coordinator_thread.join(timeout=2) and closing self.client._zmq_context / the server's ZMQ context in tearDown.


🤖 Generated by Qoder

self.server = TpCoordinatorServer("127.0.0.1", self.port, 2, on_finished)
self.client = TpCoordinatorClient("127.0.0.1", self.port)
# Give coordinator thread time to bind
time.sleep(0.1)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time.sleep(0.1) is a fragile way to wait for server readiness — on a slow or loaded CI runner, 100ms may not be enough. A more robust approach is to poll with a short client connection attempt, or at minimum use a retry loop with a timeout.


🤖 Generated by Qoder

self.client = TpCoordinatorClient("127.0.0.1", self.port)
time.sleep(0.1)

def test_duplicate_finish_ignored(self):

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test only covers duplicate finish when start arrives first (the else branch in the finish handler). It doesn't test the case where duplicate finishes from the same rank are both buffered in pending_save_finishes before the start arrives. Since add_new_rank deduplicates by rank, this should work correctly, but it'd be good to have a test confirming that path — e.g., send two finish(rank=0) before start, then verify get_size() == 1 after start merges them.


🤖 Generated by Qoder

@github-actions github-actions Bot added the ai reviewed AI has reviewed this PR label Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai reviewed AI has reviewed this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant