⚡️ Speed up method RFDetrForObjectDetectionTorch.post_process by 17% in PR #1250 (feature/inference-v1-models) #1276
+39 −19
⚡️ This pull request contains optimizations for PR #1250

If you approve this dependent PR, these changes will be merged into the original PR branch feature/inference-v1-models.

📄 17% (0.17x) speedup for RFDetrForObjectDetectionTorch.post_process in inference/v1/models/rfdetr/rfdetr_object_detection_pytorch.py

⏱️ Runtime: 25.0 milliseconds → 21.3 milliseconds (best of 11 runs)

📝 Explanation and details
Here is a faster version, focused on replacing Python loops with vectorized tensor operations, batching work, minimizing temporary allocations, and avoiding repeated computation. Function signatures and return values are unchanged.

Key speedups:

- `torch.tensor(orig_sizes, ...)` is created directly on the target device, with an explicit dtype, for a small gain.
- Filtering uses `torch.where` and `index_select`, which are generally faster and more robust than boolean indexing when batches are empty or on CUDA.
- Results are batched so there is one `.append` per `Detections`, and empty tensors are created in a way that does not allocate new memory when there is nothing to keep.

This preserves the original function signatures and inputs/outputs, and will run faster especially on GPU (but also on CPU).
✅ Correctness verification report:
🌀 Generated Regression Tests Details
To edit these changes, run `git checkout codeflash/optimize-pr1250-2025-05-14T12.42.25` and push.