[https://nvbugs/5508301][feat] Move D->H copies to a worker thread when confidential compute is active #8463
Conversation
📝 Walkthrough
This pull request introduces optional asynchronous host-copy threading for GPU sampling. When confidential compute is detected, a …
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant Creator as py_executor_creator
participant Util as _util
participant Executor as py_executor
participant Sampler as sampler
User->>Creator: create_py_executor_instance(...)
Creator->>Util: create_py_executor_instance(...)
Util->>Util: confidential_compute_enabled()
Util->>Util: instantiate_sampler(..., use_host_copy_thread)
Util->>Sampler: TorchSampler(..., use_host_copy_thread=True)
Sampler-->>Util: sampler instance
Util-->>Creator: py_executor with sampler
Creator->>Creator: maybe_start_sampler_host_copy_thread(sampler)
Creator->>Sampler: start_host_copy_thread()
Sampler-->>Sampler: spawn async host-copy thread
Creator->>Executor: start_worker()
Note over Creator,Sampler: During sampling phase
Executor->>Sampler: sample_async()
Sampler->>Sampler: launch async host transfer
Sampler-->>Executor: Future[SampleStateTensors]
Executor->>Executor: await or process Future result
Note over Creator,Sampler: During shutdown
Executor->>Executor: shutdown()
Executor->>Sampler: stop_host_copy_thread()
Sampler-->>Sampler: join host-copy thread
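Read as plain Python, the lifecycle above amounts to roughly the following. This is an illustrative sketch only, not the PR's implementation: the real TorchSampler constructor, sample_async signature, and SampleStateTensors type are more involved.

```python
from concurrent.futures import ThreadPoolExecutor


class HostCopySamplerSketch:
    """Minimal model of the start/sample/stop lifecycle in the diagram."""

    def __init__(self, use_host_copy_thread: bool = False):
        self._use_host_copy_thread = use_host_copy_thread
        self._host_copy_executor = None

    def start_host_copy_thread(self) -> None:
        # Called once by the executor creator before the worker starts.
        if self._use_host_copy_thread and self._host_copy_executor is None:
            # A single worker keeps D->H copies in submission order.
            self._host_copy_executor = ThreadPoolExecutor(max_workers=1)

    def host_copy_thread_active(self) -> bool:
        return self._host_copy_executor is not None

    def sample_async(self, device_tensors: dict):
        def copy_to_host() -> dict:
            # Stand-in for the real D->H transfer of sampled tensors.
            return dict(device_tensors)

        if self.host_copy_thread_active():
            # Hand back a Future; the executor resolves it later.
            return self._host_copy_executor.submit(copy_to_host)
        return copy_to_host()  # synchronous path, resolved immediately

    def stop_host_copy_thread(self) -> None:
        # Called from PyExecutor.shutdown(); joins the copy worker.
        if self._host_copy_executor is not None:
            self._host_copy_executor.shutdown(wait=True)
            self._host_copy_executor = None
```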
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
The changes span five files with interconnected logic across executor creation, sampler state management, and threading lifecycle. Key review areas include understanding: the confidential compute detection flow, how the …
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 5
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
1918-1929: Same property pattern issues as SampleState.
This repeats the shadow-field property pattern from SampleState (lines 59-72) with the same concerns:
- Unconventional for dataclasses
- Blocking .result() call without error handling
- Static analysis warnings
Apply the same refactoring suggestions from the earlier comment.
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
59-72: Consider refactoring the property pattern and adding error handling.
The shadow-field pattern (line 60: _host with init=False plus a property wrapper) is unconventional for dataclasses and triggers static analysis warnings. More importantly:
- The property getter (line 67) calls .result(), which blocks until the Future completes, potentially causing unexpected stalls in code that accesses state.host.
- No error handling for Future resolution failures.
Consider these alternatives:
- Option 1: Keep host as Future | SampleStateTensors and require callers to explicitly resolve futures (more explicit about blocking).
- Option 2: Add a separate resolve() method that returns the resolved host, keeping the property for non-blocking access.
- Option 3: If the property pattern is preferred, at minimum wrap .result() in a try-except.
Example for Option 3:
```diff
 @property
 def host(self) -> SampleStateTensors:
     if isinstance(self._host, Future):
-        return self._host.result()
+        try:
+            return self._host.result()
+        except Exception as e:
+            raise RuntimeError(f"Failed to resolve host tensor future: {e}") from e
     return self._host
```
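For comparison, a minimal sketch of Option 2, assuming the existing SampleState/SampleStateTensors split; the resolve_host and host_ready names are illustrative, not part of the current code:

```python
from concurrent.futures import Future
from dataclasses import dataclass, field
from typing import Union


@dataclass
class SampleStateTensors:
    """Stub standing in for the real host tensor container."""


@dataclass
class SampleState:
    _host: Union[SampleStateTensors, "Future[SampleStateTensors]"] = field(
        init=False, default=None)

    def host_ready(self) -> bool:
        # Non-blocking check: has the threaded D->H copy finished?
        return not isinstance(self._host, Future) or self._host.done()

    def resolve_host(self) -> SampleStateTensors:
        # Explicit, caller-visible blocking point (replaces a blocking property).
        if isinstance(self._host, Future):
            self._host = self._host.result()
        return self._host
```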
1551-1560: Note cross-method Future dependency.
Futures are stored in request.py_topk_logprobs_vals and request.py_topk_logprobs_indices, which are later resolved in handle_logprobs() (lines 1027-1035). This creates an implicit dependency where handle_logprobs() must be called after the Futures complete, but there is no explicit enforcement of this ordering. Consider adding documentation or assertions to make this dependency explicit.
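One way to make that ordering explicit, sketched under the attribute names mentioned above (the helper itself is hypothetical):

```python
from concurrent.futures import Future


def resolve_topk_logprob_futures(request) -> None:
    """Resolve pending top-k logprob Futures before handle_logprobs() reads them.

    Making the wait explicit here documents (and enforces) that sample_async()
    must have scheduled the D->H copies before logprobs are handled.
    """
    for attr in ("py_topk_logprobs_vals", "py_topk_logprobs_indices"):
        value = getattr(request, attr, None)
        if isinstance(value, Future):
            setattr(request, attr, value.result())  # blocks only if still pending
```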
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- tensorrt_llm/_torch/pyexecutor/_util.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/sampler.py (17 hunks)
- tensorrt_llm/_utils.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_utils.py
- tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_utils.py
- tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_utils.py
- tensorrt_llm/_torch/pyexecutor/sampler.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
- start_host_copy_thread (905-920)
- start_host_copy_thread (2002-2017)
tensorrt_llm/_torch/pyexecutor/_util.py (1)
tensorrt_llm/_utils.py (1)
- confidential_compute_enabled (1201-1209)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
- stop_host_copy_thread (922-925)
- stop_host_copy_thread (2019-2022)
tensorrt_llm/_torch/pyexecutor/sampler.py (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
- shutdown (461-477)
tensorrt_llm/_utils.py (1)
- nvtx_range (904-923)
🪛 Ruff (0.14.0)
tensorrt_llm/_torch/pyexecutor/_util.py
13-13: Redefinition of unused str_dtype_to_binding from line 12
Remove definition: str_dtype_to_binding
(F811)
13-13: Redefinition of unused torch_dtype_to_str from line 12
Remove definition: torch_dtype_to_str
(F811)
tensorrt_llm/_torch/pyexecutor/sampler.py
65-65: Redefinition of unused host from line 59
(F811)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
475-477: LGTM! Clean shutdown lifecycle for host-copy thread.
The defensive hasattr check ensures compatibility with samplers that don't implement host-copy threading, and the cleanup aligns with the lifecycle introduced in sampler.py.
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
184-186: LGTM! Appropriate lifecycle hook for host-copy thread startup.
The helper function correctly guards against samplers that don't support host-copy threading, and the placement (lines 693-695) ensures the thread starts before the worker.
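In sketch form, such a guarded hook looks like the following. The names mirror maybe_start_sampler_host_copy_thread and stop_host_copy_thread from this PR, but the bodies (and the shutdown-side helper) are illustrative:

```python
def maybe_start_sampler_host_copy_thread(sampler) -> None:
    # Samplers that don't support threaded host copies simply lack this method.
    if hasattr(sampler, "start_host_copy_thread"):
        sampler.start_host_copy_thread()


def maybe_stop_sampler_host_copy_thread(sampler) -> None:
    # Counterpart used during shutdown; joins the worker if one was started.
    if hasattr(sampler, "stop_host_copy_thread"):
        sampler.stop_host_copy_thread()
```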
tensorrt_llm/_torch/pyexecutor/_util.py (1)
854-854: Verify error handling in confidential_compute_enabled.
This line calls confidential_compute_enabled(), which queries NVML hardware state. Ensure the function in tensorrt_llm/_utils.py (lines 1201-1209) includes proper error handling to prevent exceptions from propagating here.
The review comment on _utils.py recommends adding try/except to confidential_compute_enabled(). Once that change is applied, this call site will be safe. If that recommendation is not implemented, consider adding error handling here:
```python
try:
    use_host_copy_thread = confidential_compute_enabled()
except Exception:
    logger.warning(
        "Failed to query confidential compute state, disabling host copy thread")
    use_host_copy_thread = False
```
tensorrt_llm/_torch/pyexecutor/sampler.py (5)
10-10: LGTM: Import added for threading support.
The ThreadPoolExecutor and Future imports are appropriate for the new threaded host-copy feature.
879-879: LGTM: New parameter properly typed and defaulted.
The use_host_copy_thread parameter addition is well-typed and has a safe default.
1027-1035: Consistent Future handling, but note blocking behavior.
The code correctly checks host_copy_thread_active() and resolves Futures with .result(). However, this blocks the calling thread until the host copy completes, which may impact performance if called in hot paths. Consider whether this blocking is acceptable or if callers should be refactored to handle Futures asynchronously.
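If the blocking turns out to matter in hot paths, one asynchronous alternative is a completion callback rather than an immediate .result(); a sketch only, with a hypothetical helper name:

```python
from concurrent.futures import Future


def schedule_when_ready(maybe_future, consume) -> None:
    """Run consume(host_tensors) without blocking the calling thread."""
    if isinstance(maybe_future, Future):
        # The callback runs on the host-copy worker thread once the copy lands,
        # so consume() must be safe to call off the main thread.
        maybe_future.add_done_callback(lambda f: consume(f.result()))
    else:
        consume(maybe_future)
```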
1368-1378: LGTM: Consistent conditional event handling.
The code correctly uses Futures to replace sampler_event when the host copy thread is active, avoiding redundant synchronization.
2183-2233: Well-structured threaded copy implementation.
The _copy_tensors_to_host function is well-organized and correctly handles:
- Event synchronization with copy_ready.synchronize() (line 2191)
- Proper non_blocking flag toggling based on threading mode (lines 2192-2194)
- Conditional logprobs handling (lines 2206-2209)
- Consistent Future vs. direct return based on thread mode (lines 2219-2233)
The approach cleanly encapsulates the copy logic and makes the threading behavior explicit.
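A rough sketch of that structure, assuming PyTorch tensors and a CUDA event recorded after sampling; this is illustrative and omits the logprobs and pinned-memory details of the real function:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict

import torch

_copy_executor = ThreadPoolExecutor(max_workers=1)  # dedicated host-copy worker


def copy_tensors_to_host(device_tensors: Dict[str, torch.Tensor],
                         copy_ready: torch.cuda.Event,
                         use_host_copy_thread: bool):
    """Return host tensors directly, or a Future that resolves to them."""

    def do_copy() -> Dict[str, torch.Tensor]:
        # Block until the GPU work that produced these tensors has finished.
        copy_ready.synchronize()
        # Worker-thread copies block so the Future resolves with finished data;
        # the inline path keeps non_blocking=True and relies on a later sync.
        non_blocking = not use_host_copy_thread
        return {
            name: t.to("cpu", non_blocking=non_blocking)
            for name, t in device_tensors.items()
        }

    if use_host_copy_thread:
        return _copy_executor.submit(do_copy)  # Future[Dict[str, Tensor]]
    return do_copy()  # resolved result
```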
Force-pushed from 9af940a to 447e521
Force-pushed from 447e521 to 78d62fe
ixlmar left a comment
No functional concerns, but some suggestions to reduce complexity introduced into other parts of sampler.py.
Force-pushed from d43f4a7 to 16ab0c6
/bot run --disable-fail-fast
PR_Github #24040 [ run ] triggered by Bot. Commit:
PR_Github #24040 [ run ] completed with state
Force-pushed from fed5100 to abef7de
/bot run --disable-fail-fast
PR_Github #24055 [ run ] triggered by Bot. Commit:
PR_Github #24055 [ run ] completed with state
Force-pushed from abef7de to 0b24a61
/bot run --disable-fail-fast
PR_Github #24323 [ run ] triggered by Bot. Commit:
PR_Github #24323 [ run ] completed with state
Force-pushed from 0b24a61 to 86f69b1
/bot run --disable-fail-fast
PR_Github #24337 [ run ] triggered by Bot. Commit:
Signed-off-by: Dan Hansen <[email protected]>
…nto the class itself to improve composability Signed-off-by: Dan Hansen <[email protected]>
Signed-off-by: Dan Hansen <[email protected]>
…re logically belongs Signed-off-by: Dan Hansen <[email protected]>
Signed-off-by: Dan Hansen <[email protected]>
Signed-off-by: Dan Hansen <[email protected]>
Force-pushed from 83894ea to 3818121
/bot run --disable-fail-fast
PR_Github #24826 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #24944 [ run ] triggered by Bot. Commit:
PR_Github #24826 [ run ] completed with state
PR_Github #24944 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #25065 [ run ] triggered by Bot. Commit:
PR_Github #25065 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #25141 [ run ] triggered by Bot. Commit:
PR_Github #25141 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #25222 [ run ] triggered by Bot. Commit:
PR_Github #25222 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #25284 [ run ] triggered by Bot. Commit:
PR_Github #25284 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #25366 [ run ] triggered by Bot. Commit:
PR_Github #25366 [ run ] completed with state
Move D->H copies to a worker thread when confidential compute is active
Summary by CodeRabbit
New Features
Bug Fixes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- --reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- --disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- --disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- --skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- --stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- --gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- --test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- --only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- --disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- --add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- --post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- --detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- --debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
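For example, typical comments combining the subcommands and flags documented above (as used earlier in this conversation) look like:

```
/bot run --disable-fail-fast
/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast
/bot skip --comment "Reason for skipping build/test"
/bot reuse-pipeline
```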