[mlir-tensorrt] Integrate internal changes #651

christopherbate · 2025-06-06T21:57:15Z

--
b7c0ac3c95aac78b563cfcf1926fcfb0de20ef67 by Chris Bate [email protected]:

[compiler] Fix bufferization for trtrt.enqueue

This fixes a bug where trtrt.enqueue could incorrectly allow the same buffer to be used as both an input and an output. This is technically possible since for many TensorRT programs, the inputs will be read before any outputs are written. However, in general we cannot actually make this assumption, which is codified in the bufferization infrastructure via the bufferizesToElementwiseAccess method of the BufferizableOpInterface. Previously, we were returning true for this method for trtrt.enqueue despite not analyzing the corresponding function.

Since the ability to re-use an input pointer avoids an allocation, it can be critical to good performance in certain cases, especially in loops. Therefore, in the future, we need to add back the ability to specify true when some basic analysis reduces the risk of that being incorrect.

--
6fab2aa80a7776fe9afd15a59fcce916b61570cd by Chris Bate [email protected]:

Refactor registration of dialects and passes

Improve uniformity of registration for dialects/passes/extensions.
Remove some CMake flags which we never set to OFF and probably never should.
Improve format of preprocessor-gaurded code blocks by generating a project-wide header Features.h header containing some convenience macros that look a lot nicer than #ifdef..#endif blocks.

--
96f45d6c6f3f2f1354984944f194fdc8a857ff34 by Sagar Shelke [email protected]:

[tensorrt] Fix a bug in trtSetWeights function in adaptor.

This MR fixes a bug in trtSetWeights method in NvInferAdaptor. This method stores weights in weight map in order to keep them alive until engine is built. Previously, a new weight in map was created but original weight was never copied. This created issues like getting zeros in the output to all the way getting NaNs. Change copies original weights into map.

--
08fc170f769d9cedce2541da14b101e0947175e2 by Chris Bate [email protected]:

[compiler] Add explicit memory space assignment

Previously, the 'plan-module-bufferize' pass used the encodings of tensor types to infer the memory space of memref types. However, not every tensor had an encoding. There was a notion of a default memory space (#plan.memory_space<device>) which was used to deduce the memory space when no tensor encoding was present. However, this could result in situations where the program could not be bufferized correctly, for example if we had tensor.cast operations that added or removed the encoding. In such cases, we are relying on brittle logic of the bufferization infrastructure to somehow deduce the correct memory space, and this may not always work.

Therefore, this change adds a new pass plan-assign-memory-spaces which explicitly assigns a #plan.memory_space encoding to all tensor types. In proceeds in two steps. The first step simply assigns the 'device' space to all tensors. Then the second step performs some minor optimizations to avoid unnecessary host-device transfers after bufferization. An end-to-end test case for the bufferization pipeline is added which was previously failing. Interestingly, the addition of the new pass results in better bufferization for cases where we can't detensorize while loops.

--
1be85552b63d7a06d9e27b7d9bd57e2e62589356 by Christopher Bate [email protected]:

[compiler] Add additional IR printing flags to the compiler API

Exposes -mlir-elide-elementsattrs-if-larger and -mlir-elide-resource-strings-if-larger to the compiler API via DebugOptions provider.

--
cfc7a93555f2778711610468b2f56e9f2fb28148 by Chris Bate [email protected]:

[tensorrt] Make the TensorRT builder logger global

This change makes the TensorRT builder logger a global singleton. This change has been made due to an issue with the TRT API -- the logger does not actually get associated with the TRT object created via createInferBuilder or createInferRuntime -- instead the TRT API maintains a global logger that is populated with push/op mechanism, except the stack size can only be 1. So if the Builder object lifetime overlaps the Runtime object lifetime, then the Runtime will ignore the logger passed to createInferRuntime and instead use the global logger populated by createInferBuilder. This is clearly problematic if the API user (e.g. MLIR-TRT compiler) thinks they can destroy the logger used to create the Builder object after the Builder object is destroyed. In such a case, the Runtime will still try to use the builder's logger, which results in use-after-free.

To reduce the likelihood of this ocurring, we make the builder logger a global singleton. Note, however, that we can't guarantee that the logger is not destroyed when the compiler library is closed via dlclose, and we can't make any guaruntees about global shutdown order.

--
4f215aa9513d7751572cd83d8575f251c5a61edb by Christopher Bate [email protected]:

NFC: Fix two compiler warnings

'captured structured bindings are a C++20 extension'
Unused function when MLIR_TRT_ENABLE_NCCL=ON

--
6f4cc4ad0eb6a4c554e5e36fc23a4b3b56ba607e by Christopher Bate [email protected]:

NFC: update test runner configs to take into account system memory

This change updates LIT parallelism setting logic to parallelism based on system memory.

GitOrigin-RevId: 6f4cc4ad0eb6a4c554e5e36fc23a4b3b56ba607e

-- b7c0ac3c95aac78b563cfcf1926fcfb0de20ef67 by Chris Bate <[email protected]>: [compiler] Fix bufferization for `trtrt.enqueue` This fixes a bug where `trtrt.enqueue` could incorrectly allow the same buffer to be used as both an input and an output. This is technically possible since for many TensorRT programs, the inputs will be read before any outputs are written. However, in general we cannot actually make this assumption, which is codified in the bufferization infrastructure via the `bufferizesToElementwiseAccess` method of the BufferizableOpInterface. Previously, we were returning true for this method for `trtrt.enqueue` despite not analyzing the corresponding function. Since the ability to re-use an input pointer avoids an allocation, it can be critical to good performance in certain cases, especially in loops. Therefore, in the future, we need to add back the ability to specify `true` when some basic analysis reduces the risk of that being incorrect. -- 6fab2aa80a7776fe9afd15a59fcce916b61570cd by Chris Bate <[email protected]>: Refactor registration of dialects and passes - Improve uniformity of registration for dialects/passes/extensions. - Remove some CMake flags which we never set to OFF and probably never should. - Improve format of preprocessor-gaurded code blocks by generating a project-wide header `Features.h` header containing some convenience macros that look a lot nicer than `#ifdef..#endif` blocks. -- 96f45d6c6f3f2f1354984944f194fdc8a857ff34 by Sagar Shelke <[email protected]>: [tensorrt] Fix a bug in `trtSetWeights` function in adaptor. This MR fixes a bug in `trtSetWeights` method in NvInferAdaptor. This method stores weights in weight map in order to keep them alive until engine is built. Previously, a new weight in map was created but original weight was never copied. This created issues like getting zeros in the output to all the way getting NaNs. Change copies original weights into map. -- 08fc170f769d9cedce2541da14b101e0947175e2 by Chris Bate <[email protected]>: [compiler] Add explicit memory space assignment Previously, the 'plan-module-bufferize' pass used the encodings of tensor types to infer the memory space of memref types. However, not every tensor had an encoding. There was a notion of a default memory space (`#plan.memory_space<device>`) which was used to deduce the memory space when no tensor encoding was present. However, this could result in situations where the program could not be bufferized correctly, for example if we had `tensor.cast` operations that added or removed the encoding. In such cases, we are relying on brittle logic of the bufferization infrastructure to somehow deduce the correct memory space, and this may not always work. Therefore, this change adds a new pass `plan-assign-memory-spaces` which explicitly assigns a `#plan.memory_space` encoding to all tensor types. In proceeds in two steps. The first step simply assigns the 'device' space to all tensors. Then the second step performs some minor optimizations to avoid unnecessary host-device transfers after bufferization. An end-to-end test case for the bufferization pipeline is added which was previously failing. Interestingly, the addition of the new pass results in better bufferization for cases where we can't detensorize while loops. -- 1be85552b63d7a06d9e27b7d9bd57e2e62589356 by Christopher Bate <[email protected]>: [compiler] Add additional IR printing flags to the compiler API Exposes `-mlir-elide-elementsattrs-if-larger` and `-mlir-elide-resource-strings-if-larger` to the compiler API via DebugOptions provider. -- cfc7a93555f2778711610468b2f56e9f2fb28148 by Chris Bate <[email protected]>: [tensorrt] Make the TensorRT builder logger global This change makes the TensorRT builder logger a global singleton. This change has been made due to an issue with the TRT API -- the logger does not actually get associated with the TRT object created via `createInferBuilder` or `createInferRuntime` -- instead the TRT API maintains a global logger that is populated with push/op mechanism, except the stack size can only be 1. So if the Builder object lifetime overlaps the Runtime object lifetime, then the Runtime will ignore the logger passed to `createInferRuntime` and instead use the global logger populated by `createInferBuilder`. This is clearly problematic if the API user (e.g. MLIR-TRT compiler) thinks they can destroy the logger used to create the Builder object after the Builder object is destroyed. In such a case, the Runtime will still try to use the builder's logger, which results in use-after-free. To reduce the likelihood of this ocurring, we make the builder logger a global singleton. Note, however, that we can't guarantee that the logger is not destroyed when the compiler library is closed via `dlclose`, and we can't make any guaruntees about global shutdown order. -- 4f215aa9513d7751572cd83d8575f251c5a61edb by Christopher Bate <[email protected]>: NFC: Fix two compiler warnings - 'captured structured bindings are a C++20 extension' - Unused function when MLIR_TRT_ENABLE_NCCL=ON -- 6f4cc4ad0eb6a4c554e5e36fc23a4b3b56ba607e by Christopher Bate <[email protected]>: NFC: update test runner configs to take into account system memory This change updates LIT parallelism setting logic to parallelism based on system memory. GitOrigin-RevId: 19b40f42d66916f0322fe8e078cbc072890cb56e

christopherbate self-assigned this Jun 6, 2025

christopherbate requested a review from shelkesagar29 as a code owner June 6, 2025 21:57

christopherbate force-pushed the move_internal_changes-dev branch 2 times, most recently from 1036c76 to 5495e99 Compare June 7, 2025 01:56

christopherbate force-pushed the move_internal_changes-dev branch from 5495e99 to 0917d53 Compare June 7, 2025 03:27

pranavm-nvidia approved these changes Jun 9, 2025

View reviewed changes

christopherbate merged commit 30f9816 into main Jun 9, 2025
1 check passed

christopherbate deleted the move_internal_changes-dev branch June 9, 2025 16:31

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[mlir-tensorrt] Integrate internal changes #651

[mlir-tensorrt] Integrate internal changes #651

Uh oh!

christopherbate commented Jun 6, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[mlir-tensorrt] Integrate internal changes #651

[mlir-tensorrt] Integrate internal changes #651

Uh oh!

Conversation

christopherbate commented Jun 6, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants