[BugFix][Cherry-Pick] Fix race condition in async RL control request(#7430) #7470
jackyYang6 wants to merge 1 commit into PaddlePaddle:release/2.6
PaddlePaddle-bot left a comment
🤖 AI Code Review | 2026-04-17 19:25 CST
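📋 Review Summary
PR overview: Fixes a race condition in async RL control requests (pause/update_weights/resume) by moving response channel registration ahead of sending the request.
Scope of change: entrypoints/engine_client.py, run_control_method method
Impact tags: APIServer Engine
Issues
No blocking issues found.
Overall assessment
The fix is clear and correct. In the original code, send_json/send_pyobj ran before get_connection, so the engine could finish the control request and send back its response before the API server had registered the request_id in request_map. The response then went down the zmq_server cached_results cache path, where response_queue.get() could not receive it, and the call eventually timed out after 600s. This PR moves get_connection (which registers the response channel) and dealer.write ahead of sending the request, so the control request is sent only once the response channel is ready, which eliminates the race window at its root. Both paths (ZMQ_SEND_BATCH_DATA set to True or False) are logically correct, and the normal inference request flow is unaffected.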
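The failure mode described in this review can be illustrated with a toy router. This is only a sketch under stated assumptions: `ToyResponseRouter` and its methods are hypothetical stand-ins, not FastDeploy's zmq_server, and the toy assumes that a late registration does not drain the cache, which is the behavior the review describes.

```python
import queue


class ToyResponseRouter:
    """Hypothetical stand-in for a response router; not FastDeploy code."""

    def __init__(self):
        self.request_map = {}     # request_id -> queue of the waiting caller
        self.cached_results = {}  # request_id -> responses that had no waiter

    def register(self, request_id):
        # Assumption: registering does not replay cached responses.
        waiter = queue.Queue()
        self.request_map[request_id] = waiter
        return waiter

    def route(self, request_id, response):
        waiter = self.request_map.get(request_id)
        if waiter is None:
            # Nobody is registered yet, so the response is parked in the cache.
            self.cached_results.setdefault(request_id, []).append(response)
        else:
            waiter.put(response)


def early_response_is_lost():
    router = ToyResponseRouter()
    router.route("ctl-1", {"status": "ok"})  # engine answers first
    waiter = router.register("ctl-1")        # registration arrives too late
    return waiter.empty(), router.cached_results


def registered_response_is_delivered():
    router = ToyResponseRouter()
    waiter = router.register("ctl-2")        # channel is ready first
    router.route("ctl-2", {"status": "ok"})
    return waiter.get_nowait()
```

In the first function the waiter's queue stays empty even though the response exists, which corresponds to a blocking get() that can only end in a timeout.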
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## release/2.6 #7470 +/- ##
==============================================
Coverage ? 73.25%
==============================================
Files ? 376
Lines ? 52988
Branches ? 8276
==============================================
Hits ? 38815
Misses ? 11447
Partials ? 2726
This PR cherry-picks #7430 to release/2.6.
💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)
Motivation
In the async RL weight update flow (/v1/pause -> /v1/update_weights -> /v1/resume), /v1/resume could occasionally time out after 600s.
Example log:
At the same time, engine-side logs showed:
This indicates a race in the control path: the engine may finish a fast control request before the response channel registration is ready. In that case, the response first goes into the engine-side cache path, which can lead to the API server waiting on response_queue.get() until timeout.
Modifications
Updated run_control_method in fastdeploy/entrypoints/engine_client.py to register the control response channel before sending the control request.
Before this change, the control request could be sent before the corresponding response path was ready. After this change, response registration happens first, and the control request is sent afterwards, reducing the race window for fast control methods such as resume.
This change only affects the control request path and does not change the normal inference request flow.
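The ordering change can be sketched with asyncio as follows. This is a minimal illustration, not the actual run_control_method: `FakeEngine`, `run_control`, the `register_first` flag, and the short timeout are all hypothetical names and values chosen for the example, and the fake engine replies synchronously to model a very fast control method such as resume.

```python
import asyncio


class FakeEngine:
    """Hypothetical engine that answers a control request immediately."""

    def __init__(self):
        self.channels = {}  # request_id -> asyncio.Queue of the waiting caller
        self.orphaned = []  # responses that arrived before registration

    def register(self, request_id):
        channel = asyncio.Queue()
        self.channels[request_id] = channel
        return channel

    def unregister(self, request_id):
        self.channels.pop(request_id, None)

    def send(self, request_id, method):
        # The reply is produced right away, before send() returns.
        response = {"request_id": request_id, "method": method, "ok": True}
        channel = self.channels.get(request_id)
        if channel is None:
            self.orphaned.append(response)  # no waiter yet: caller never sees it
        else:
            channel.put_nowait(response)


async def run_control(engine, request_id, method, timeout=0.05, register_first=True):
    if register_first:
        # Order after the fix: the response channel exists before the request leaves.
        channel = engine.register(request_id)
        engine.send(request_id, method)
    else:
        # Order before the fix: a fast reply can arrive while nobody is listening.
        engine.send(request_id, method)
        channel = engine.register(request_id)
    try:
        return await asyncio.wait_for(channel.get(), timeout=timeout)
    finally:
        engine.unregister(request_id)
```

With `register_first=True` the call returns the response; with `register_first=False` the same fake engine makes the call raise a timeout, which mirrors the 600s wait at a much smaller scale.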
Usage or Command
Repro flow:
1. /v1/pause
2. /v1/update_weights
3. /v1/resume
Before the fix, /v1/resume could occasionally hit the timeout race.
After the fix, the control response can be matched and returned normally.
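A small driver for the repro flow might look like the sketch below. Several details are assumptions that the PR does not state: that the endpoints accept a POST with an empty JSON body, that success is HTTP 200, and the base URL in the usage note. `run_weight_update_cycle` and `http_post` are hypothetical helper names.

```python
import json
import urllib.request

# Order taken from the repro flow above.
CONTROL_SEQUENCE = ("/v1/pause", "/v1/update_weights", "/v1/resume")


def http_post(url, payload, timeout):
    # Assumption: the control endpoints accept a JSON POST body.
    data = json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def run_weight_update_cycle(base_url, post=http_post, timeout=600.0):
    """Call the three control endpoints in order; stop at the first non-200."""
    results = []
    for path in CONTROL_SEQUENCE:
        status = post(base_url.rstrip("/") + path, {}, timeout)
        results.append((path, status))
        if status != 200:
            break
    return results
```

Because the race is intermittent, such a driver would typically be run in a loop, for example `run_weight_update_cycle("http://localhost:8000")` repeated many times against a test deployment, while watching for a /v1/resume call that hangs.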
Accuracy Tests
No model output change. Accuracy testing is not required.
Unit Tests
No dedicated unit test was added in this PR.
This is a timing-sensitive control-path race fix and was verified through the async RL weight update flow.
Checklist
pre-commit before commit.
release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.