Skip to content

[BugFix][Cherry-Pick] Fix race condition in async RL control request(#7430)#7470

Open
jackyYang6 wants to merge 1 commit intoPaddlePaddle:release/2.6from
jackyYang6:fix/async-rl-2.6
Open

[BugFix][Cherry-Pick] Fix race condition in async RL control request(#7430)#7470
jackyYang6 wants to merge 1 commit intoPaddlePaddle:release/2.6from
jackyYang6:fix/async-rl-2.6

Conversation

@jackyYang6
Copy link
Copy Markdown
Contributor

This PR cherry-picks #7430 to release/2.6.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

💡 如若此PR是Cherry Pick,PR标题需遵循格式,在最开始加上[Cherry-Pick]标签,以及最后面加上原PR ID,例如[Cherry-Pick][CI] Add check trigger and logic(#5191)

Motivation

In the async RL weight update flow (/v1/pause -> /v1/update_weights -> /v1/resume), /v1/resume could occasionally time out after 600s.

Example log:

ERROR engine_client.py[line:571] Control request timed out: ControlResponse(..., error_message=Timeout waiting for control method response)

At the same time, engine-side logs showed:

WARNING zmq_server.py[line:270] req_id 'control-...' not in req_dict, caching response

This indicates a race in the control path: the engine may finish a fast control request before the response channel registration is ready. In that case, the response first goes into the engine-side cache path, which can lead to the API server waiting on response_queue.get() until timeout.

Modifications

Updated fastdeploy/entrypoints/engine_client.py in run_control_method to register the control response channel before sending the control request.

Before this change, the control request could be sent before the corresponding response path was ready. After this change, response registration happens first, and the control request is sent afterwards, reducing the race window for fast control methods such as resume.

This change only affects the control request path and does not change the normal inference request flow.

Usage or Command

Repro flow:

  1. /v1/pause
  2. /v1/update_weights
  3. /v1/resume

Before the fix, /v1/resume could occasionally hit the timeout race.
After the fix, the control response can be matched and returned normally.

Accuracy Tests

No model output change. Accuracy testing is not required.

Unit Tests

No dedicated unit test was added in this PR.
This is a timing-sensitive control-path race fix and was verified through the async RL weight update flow.

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot
Copy link
Copy Markdown

paddle-bot bot commented Apr 17, 2026

Thanks for your contribution!

Copy link
Copy Markdown

@PaddlePaddle-bot PaddlePaddle-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI Code Review | 2026-04-17 19:25 CST

📋 Review 摘要

PR 概述:修复异步 RL 控制请求(pause/update_weights/resume)中的竞态条件,将响应通道注册移到请求发送之前
变更范围entrypoints/engine_client.pyrun_control_method 方法
影响面 TagAPIServer Engine

问题

未发现阻塞性问题。

总体评价

修复方案清晰正确。原有代码中 send_json/send_pyobj 先于 get_connection 执行,导致引擎端可能在 API Server 注册 request_idrequest_map 之前就完成控制请求并发回响应,响应进入 zmq_servercached_results 缓存路径后无法被 response_queue.get() 接收,最终 600s 超时。本 PR 将 get_connection(注册响应通道)和 dealer.write 提前到请求发送之前,确保响应通道就绪后再发送控制请求,从根本上消除了竞态窗口。两条路径(ZMQ_SEND_BATCH_DATA 为 True/False)逻辑均正确,且不影响正常推理请求流程。

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release/2.6@185708b). Learn more about missing BASE report.

Additional details and impacted files
@@              Coverage Diff               @@
##             release/2.6    #7470   +/-   ##
==============================================
  Coverage               ?   73.25%           
==============================================
  Files                  ?      376           
  Lines                  ?    52988           
  Branches               ?     8276           
==============================================
  Hits                   ?    38815           
  Misses                 ?    11447           
  Partials               ?     2726           
Flag Coverage Δ
GPU 73.25% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants