Skip to content

Conversation

@christopherbate
Copy link
Collaborator

--
b7c0ac3c95aac78b563cfcf1926fcfb0de20ef67 by Chris Bate [email protected]:

[compiler] Fix bufferization for trtrt.enqueue

This fixes a bug where trtrt.enqueue could incorrectly allow the same buffer to be used as both an input and an output. This is technically possible since for many TensorRT programs, the inputs will be read before any outputs are written. However, in general we cannot actually make this assumption, which is codified in the bufferization infrastructure via the bufferizesToElementwiseAccess method of the BufferizableOpInterface. Previously, we were returning true for this method for trtrt.enqueue despite not analyzing the corresponding function.

Since the ability to re-use an input pointer avoids an allocation, it can be critical to good performance in certain cases, especially in loops. Therefore, in the future, we need to add back the ability to specify true when some basic analysis reduces the risk of that being incorrect.

--
6fab2aa80a7776fe9afd15a59fcce916b61570cd by Chris Bate [email protected]:

Refactor registration of dialects and passes

  • Improve uniformity of registration for dialects/passes/extensions.
  • Remove some CMake flags which we never set to OFF and probably never should.
  • Improve format of preprocessor-gaurded code blocks by generating a project-wide header Features.h header containing some convenience macros that look a lot nicer than #ifdef..#endif blocks.

--
96f45d6c6f3f2f1354984944f194fdc8a857ff34 by Sagar Shelke [email protected]:

[tensorrt] Fix a bug in trtSetWeights function in adaptor.

This MR fixes a bug in trtSetWeights method in NvInferAdaptor. This method stores weights in weight map in order to keep them alive until engine is built. Previously, a new weight in map was created but original weight was never copied. This created issues like getting zeros in the output to all the way getting NaNs. Change copies original weights into map.

--
08fc170f769d9cedce2541da14b101e0947175e2 by Chris Bate [email protected]:

[compiler] Add explicit memory space assignment

Previously, the 'plan-module-bufferize' pass used the encodings of tensor types to infer the memory space of memref types. However, not every tensor had an encoding. There was a notion of a default memory space (#plan.memory_space<device>) which was used to deduce the memory space when no tensor encoding was present. However, this could result in situations where the program could not be bufferized correctly, for example if we had tensor.cast operations that added or removed the encoding. In such cases, we are relying on brittle logic of the bufferization infrastructure to somehow deduce the correct memory space, and this may not always work.

Therefore, this change adds a new pass plan-assign-memory-spaces which explicitly assigns a #plan.memory_space encoding to all tensor types. In proceeds in two steps. The first step simply assigns the 'device' space to all tensors. Then the second step performs some minor optimizations to avoid unnecessary host-device transfers after bufferization. An end-to-end test case for the bufferization pipeline is added which was previously failing. Interestingly, the addition of the new pass results in better bufferization for cases where we can't detensorize while loops.

--
1be85552b63d7a06d9e27b7d9bd57e2e62589356 by Christopher Bate [email protected]:

[compiler] Add additional IR printing flags to the compiler API

Exposes -mlir-elide-elementsattrs-if-larger and -mlir-elide-resource-strings-if-larger to the compiler API via DebugOptions provider.

--
cfc7a93555f2778711610468b2f56e9f2fb28148 by Chris Bate [email protected]:

[tensorrt] Make the TensorRT builder logger global

This change makes the TensorRT builder logger a global singleton. This change has been made due to an issue with the TRT API -- the logger does not actually get associated with the TRT object created via createInferBuilder or createInferRuntime -- instead the TRT API maintains a global logger that is populated with push/op mechanism, except the stack size can only be 1. So if the Builder object lifetime overlaps the Runtime object lifetime, then the Runtime will ignore the logger passed to createInferRuntime and instead use the global logger populated by createInferBuilder. This is clearly problematic if the API user (e.g. MLIR-TRT compiler) thinks they can destroy the logger used to create the Builder object after the Builder object is destroyed. In such a case, the Runtime will still try to use the builder's logger, which results in use-after-free.

To reduce the likelihood of this ocurring, we make the builder logger a global singleton. Note, however, that we can't guarantee that the logger is not destroyed when the compiler library is closed via dlclose, and we can't make any guaruntees about global shutdown order.

--
4f215aa9513d7751572cd83d8575f251c5a61edb by Christopher Bate [email protected]:

NFC: Fix two compiler warnings

  • 'captured structured bindings are a C++20 extension'
  • Unused function when MLIR_TRT_ENABLE_NCCL=ON

--
6f4cc4ad0eb6a4c554e5e36fc23a4b3b56ba607e by Christopher Bate [email protected]:

NFC: update test runner configs to take into account system memory

This change updates LIT parallelism setting logic to parallelism based on system memory.

GitOrigin-RevId: 6f4cc4ad0eb6a4c554e5e36fc23a4b3b56ba607e

@christopherbate christopherbate self-assigned this Jun 6, 2025
@christopherbate christopherbate force-pushed the move_internal_changes-dev branch 2 times, most recently from 1036c76 to 5495e99 Compare June 7, 2025 01:56
--
b7c0ac3c95aac78b563cfcf1926fcfb0de20ef67 by Chris Bate <[email protected]>:

[compiler] Fix bufferization for `trtrt.enqueue`

This fixes a bug where `trtrt.enqueue` could incorrectly allow the same buffer
to be used as both an input and an output. This is technically possible since
for many TensorRT programs, the inputs will be read before any outputs are
written. However, in general we cannot actually make this assumption, which is
codified in the bufferization infrastructure via the
`bufferizesToElementwiseAccess` method of the BufferizableOpInterface.
Previously, we were returning true for this method for `trtrt.enqueue` despite
not analyzing the corresponding function.

Since the ability to re-use an input pointer avoids an allocation, it
can be critical to good performance in certain cases, especially in
loops. Therefore, in the future, we need to add back the ability to
specify `true` when some basic analysis reduces the risk of that being
incorrect.

--
6fab2aa80a7776fe9afd15a59fcce916b61570cd by Chris Bate <[email protected]>:

Refactor registration of dialects and passes

- Improve uniformity of registration for dialects/passes/extensions.
- Remove some CMake flags which we never set to OFF and probably never
  should.
- Improve format of preprocessor-gaurded code blocks by generating a
  project-wide header `Features.h` header containing some convenience
  macros that look a lot nicer than `#ifdef..#endif` blocks.

--
96f45d6c6f3f2f1354984944f194fdc8a857ff34 by Sagar Shelke <[email protected]>:

[tensorrt] Fix a bug in `trtSetWeights` function in adaptor.

This MR fixes a bug in `trtSetWeights` method in NvInferAdaptor. This method
stores weights in weight map in order to keep them alive until engine is built.
Previously, a new weight in map was created but original weight was never copied.
This created issues like getting zeros in the output to all the way getting NaNs.
Change copies original weights into map.

--
08fc170f769d9cedce2541da14b101e0947175e2 by Chris Bate <[email protected]>:

[compiler] Add explicit memory space assignment

Previously, the 'plan-module-bufferize' pass used the encodings of
tensor types to infer the memory space of memref types. However, not
every tensor had an encoding. There was a notion of a default memory
space (`#plan.memory_space<device>`) which was used to deduce the
memory space when no tensor encoding was present. However, this could
result in situations where the program could not be bufferized correctly,
for example if we had `tensor.cast` operations that added or removed the
encoding. In such cases, we are relying on brittle logic of the bufferization
infrastructure to somehow deduce the correct memory space, and this
may not always work.

Therefore, this change adds a new pass `plan-assign-memory-spaces` which
explicitly assigns a `#plan.memory_space` encoding to all tensor types.
In proceeds in two steps. The first step simply assigns the 'device'
space to all tensors. Then the second step performs some minor
optimizations to avoid unnecessary host-device transfers after
bufferization. An end-to-end test case for the bufferization pipeline
is added which was previously failing. Interestingly, the addition of the
new pass results in better bufferization for cases where we can't
detensorize while loops.

--
1be85552b63d7a06d9e27b7d9bd57e2e62589356 by Christopher Bate <[email protected]>:

[compiler] Add additional IR printing flags to the compiler API

Exposes `-mlir-elide-elementsattrs-if-larger` and `-mlir-elide-resource-strings-if-larger`
to the compiler API via DebugOptions provider.

--
cfc7a93555f2778711610468b2f56e9f2fb28148 by Chris Bate <[email protected]>:

[tensorrt] Make the TensorRT builder logger global

This change makes the TensorRT builder logger a global singleton. This
change has been made due to an issue with the TRT API -- the logger does
not actually get associated with the TRT object created via `createInferBuilder`
or `createInferRuntime` -- instead the TRT API maintains a global logger
that is populated with push/op mechanism, except the stack size can only
be 1. So if the Builder object lifetime overlaps the Runtime object lifetime,
then the Runtime will ignore the logger passed to `createInferRuntime`
and instead use the global logger populated by `createInferBuilder`. This
is clearly problematic if the API user (e.g. MLIR-TRT compiler) thinks they
can destroy the logger used to create the Builder object after the Builder
object is destroyed. In such a case, the Runtime will still try to use
the builder's logger, which results in use-after-free.

To reduce the likelihood of this ocurring, we make the builder logger
a global singleton. Note, however, that we can't guarantee that the
logger is not destroyed when the compiler library is closed via `dlclose`,
and we can't make any guaruntees about global shutdown order.

--
4f215aa9513d7751572cd83d8575f251c5a61edb by Christopher Bate <[email protected]>:

NFC: Fix two compiler warnings

- 'captured structured bindings are a C++20 extension'
- Unused function when MLIR_TRT_ENABLE_NCCL=ON

--
6f4cc4ad0eb6a4c554e5e36fc23a4b3b56ba607e by Christopher Bate <[email protected]>:

NFC: update test runner configs to take into account system memory

This change updates LIT parallelism setting logic to parallelism based
on system memory.

GitOrigin-RevId: 19b40f42d66916f0322fe8e078cbc072890cb56e
@christopherbate christopherbate force-pushed the move_internal_changes-dev branch from 5495e99 to 0917d53 Compare June 7, 2025 03:27
@christopherbate christopherbate merged commit 30f9816 into main Jun 9, 2025
1 check passed
@christopherbate christopherbate deleted the move_internal_changes-dev branch June 9, 2025 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants