Add option to memory map .ORT model loads#28164
…ap benchmarking

- Clean up benchmark_mmap_ort.py: remove unused code; simplify the multi-process approach to use native perf_test processes instead of Python wrappers
- Add a `--hold_ms_after_session_creation` flag to onnxruntime_perf_test to keep sessions alive for multi-process memory measurement
- Print a `SESSION_READY` marker when holding so the benchmark script knows when to sample

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Thanks for a great PR. I will address a few things here and will ask @skottmckay for review.
The comment about copying the initializers is not entirely correct. We strive to create tensors on top of the flatbuffer memory. |
Pull request overview
This PR adds an opt-in session configuration to load .ort (ORT-format) models via memory-mapped I/O, so the InferenceSession owns the mapping and can optionally keep initializer tensors referencing the mapped flatbuffer bytes.
Changes:
- Introduces `session.use_memory_mapped_ort_model` and wires it into ORT-format model loading.
- Keeps the memory mapping alive for the session when using direct initializers, and releases it after `Initialize()` otherwise.
- Adds unit tests for memory-mapped `.ort` loading and extends `onnxruntime_perf_test`, plus a Python benchmark helper for multi-process memory measurement.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tools/python/benchmark_mmap_ort.py | Adds a developer benchmark script to compare standard vs mmap ORT model loads (single- and multi-process). |
| onnxruntime/test/perftest/test_configuration.h | Adds hold_ms_after_session_creation to keep perf-test processes alive for memory measurements. |
| onnxruntime/test/perftest/main.cc | Implements SESSION_READY signaling + sleep when -n is used with the new hold flag. |
| onnxruntime/test/perftest/command_args_parser.cc | Adds --hold_ms_after_session_creation flag wiring and help text. |
| onnxruntime/test/framework/ort_model_only_test.cc | Adds coverage for .ort loading via mmap (with/without initializer-bytes usage). |
| onnxruntime/core/session/inference_session.h | Stores an Env::MappedMemoryPtr in InferenceSession for mmap-backed ORT-format model bytes. |
| onnxruntime/core/session/inference_session.cc | Adds mmap load path for ORT-format models and releases mapping during Initialize() cleanup when applicable. |
| include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h | Defines the new session.use_memory_mapped_ort_model config key and updates related docs. |
tianleiwu left a comment:
Review Summary
Clean, well-scoped feature. The implementation cleverly leverages existing infrastructure: `ort_format_model_bytes_data_holder_.empty()` naturally enables the initializer-from-bytes path for mmap without additional flags. The cleanup in `Initialize()` correctly resets `ort_format_model_mapped_memory_` when initializers don't use the bytes. `reinterpret_cast` is correct for `char*` to `uint8_t*` (agreeing with yuslepukhin's comment).
A couple of suggestions below. The doc comment style (// vs ///) and benchmark timeout concerns from the earlier automated review are also worth addressing.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
tianleiwu left a comment:
Thanks for the updates. I found one remaining measurement issue in the helper script: on POSIX, the multi-process private metric falls back to RSS, which double-counts shared mmap pages and can hide the benefit this benchmark is meant to measure. I left that inline.
I did not open a duplicate comment for the existing file sizing/opening thread; I replied there with what still seems unresolved after the latest changes.
chwarr left a comment:
I won't have time for an in depth review for a few weeks. Please don't block waiting for me.
tianleiwu left a comment:
I found one low-level bounds-check issue in the new file-mapping validation. The mmap loading flow and tests otherwise look reasonable to me, but the requested-end arithmetic should be made overflow-safe before merge.
tianleiwu left a comment:
All previously raised concerns have been addressed in the latest commits:
- Overflow-safe bounds check: both the POSIX and Windows `MapFileIntoMemory` now use `SafeInt<size_t>(offset) + length` instead of raw arithmetic, preventing silent wrap-around before the file-size comparison.
- POSIX memory measurement: the benchmark script now uses `memory_full_info().uss` on POSIX for accurate private memory accounting, with the RSS fallback properly labeled.
- Robustness: timeout handling, `try/finally` cleanup, and `errors="replace"` for decode safety.
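The wrap-around hazard in the first bullet can be illustrated outside C++ by simulating unsigned 64-bit `size_t` arithmetic. The masking below is purely illustrative, not ORT code; the "SafeInt-style" check mirrors the idea of detecting overflow before the file-size comparison:

```python
MASK64 = (1 << 64) - 1  # simulate unsigned 64-bit (size_t) arithmetic


def naive_in_bounds(offset, length, file_size):
    # Raw size_t arithmetic: offset + length can silently wrap to a
    # small value and then pass the comparison.
    return ((offset + length) & MASK64) <= file_size


def safe_in_bounds(offset, length, file_size):
    # SafeInt-style check: detect the overflow before comparing.
    end = offset + length  # Python ints do not wrap
    if end > MASK64:
        return False  # would have overflowed size_t
    return end <= file_size


file_size = 4096
offset = MASK64  # pathological offset near SIZE_MAX
length = 2       # offset + length wraps to 1 in 64-bit arithmetic

print(naive_in_bounds(offset, length, file_size))  # True  (incorrectly accepts)
print(safe_in_bounds(offset, length, file_size))   # False (correctly rejects)
```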
The implementation is clean and well-scoped. The session owns the mapping via `Env::MappedMemoryPtr` (RAII), the lifecycle is correct (freed in `Initialize()` unless initializers reference the mapped bytes), and the test coverage is adequate.
See chwarr's comment above:

> I won't have time for an in depth review for a few weeks. Please don't block waiting for me.
This cherry-picks the following commits for the release:

| Commit ID | PR Number | Commit Title |
|-----------|-----------|--------------|
| 9d1492a | #28164 | Add option to memory map .ORT model loads |

Co-authored-by: Kevin Taha <tahakevin@gmail.com>
Co-authored-by: Kevin Taha <kevintaha@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Dmitri Smirnov <dmitrism@microsoft.com>
Co-authored-by: Dmitri Smirnov <yuslepukhin@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Addressing issue #25524 (MS internal: 60577894)
Today, the closest callers can get to loading a model from a shared resource is mapping the model themselves and using `use_ort_model_bytes_directly`, which also makes the caller responsible for keeping the mapping valid. These changes introduce `use_memory_mapped_ort_model`, a session option that uses memory-mapped I/O to load ORT-format models directly inside ONNX Runtime; in this case the mapping is owned by the `InferenceSession`. The implementation is simple and minimal, reuses ORT's existing platform-agnostic memory-mapping helpers, and, if we choose to make this the default behavior, could mean automatic memory savings for multi-process usage.
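As background, memory-mapped I/O is what enables the multi-process savings: processes that map the same read-only file share the underlying physical pages. A minimal stdlib sketch of the mechanism (this is not ORT's internal helper, and the `b"ORTM"` header is a made-up stand-in, not the real .ort file format):

```python
import mmap
import os
import tempfile

# Write a small stand-in "model" file, then map it read-only.
# This illustrates the OS mechanism the session option relies on;
# it does not use onnxruntime itself.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        f.write(b"ORTM" + bytes(range(16)))

    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            # The mapped bytes can back tensors without copying; a second
            # process mapping the same file would share these physical
            # pages with this one instead of duplicating them.
            header = bytes(mapped[:4])
            print(header)  # b'ORTM'
        finally:
            mapped.close()
finally:
    os.remove(path)
```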
Note about memory implications & sharing model bytes:
The reality of this change is that using `use_memory_mapped_ort_model` alone doesn't have a long-running memory usage advantage, because ORT will ultimately copy the model bytes from the mapped pages into Tensors. Using it together with `session.use_ort_model_bytes_for_initializers` ensures that initializers point directly at the flatbuffer bytes and avoids the extra copy; this would be the expected usage for multi-process sharing of a single model. It also raises questions about what the default behavior should be. The changes in this PR are conservative and retain all existing defaults at this time.
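The pairing described above can be sketched via the Python API. This assumes a build that contains this PR's config key; the model path is a placeholder, and the whole block is guarded so it degrades to a no-op where onnxruntime is unavailable:

```python
# Sketch only: pair the new mmap option with the existing
# bytes-for-initializers option so initializer tensors reference the
# mapped flatbuffer pages instead of copies.
KEY_MMAP = "session.use_memory_mapped_ort_model"  # added by this PR
KEY_DIRECT_INIT = "session.use_ort_model_bytes_for_initializers"  # existing key

try:
    import onnxruntime as ort

    so = ort.SessionOptions()
    so.add_session_config_entry(KEY_MMAP, "1")
    so.add_session_config_entry(KEY_DIRECT_INIT, "1")
    # sess = ort.InferenceSession("model.ort", sess_options=so)  # placeholder path
except Exception:
    # onnxruntime not installed (or key unsupported in this build);
    # the key names above are still the relevant configuration surface.
    pass
```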
Changes

- … on initializer gating to note mmap case
Benchmark Examples
Note that the benchmark is largely written by GHCP and may not be perfect, but I've validated some of its results.
Single-Proc
Here is a sample result from a single-process benchmark using resnet50 (converted to ORT format). Note that these measure peaks during construction and not end-states, and the measurements may be imperfect.
`python tools/python/benchmark_mmap_ort.py --perf-test build\Windows\Release\Release\onnxruntime_perf_test.exe --model resnet50.ort --iterations 15`

Multi-Proc
The multi-proc benchmark shows that total memory savings for shared models are only obtained alongside `use_ort_model_bytes_for_initializers`.
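The multi-process measurement relies on the `SESSION_READY` handshake added to onnxruntime_perf_test: spawn N holding processes, wait until each has printed the marker, then sample memory. The marker string and the hold-flag behavior come from this PR; everything else below (the Python stand-in child in place of the real perf_test binary) is illustrative:

```python
import subprocess
import sys

# Stand-in child that mimics onnxruntime_perf_test run with
# --hold_ms_after_session_creation: print SESSION_READY, then hold.
CHILD = "import time; print('SESSION_READY', flush=True); time.sleep(0.5)"


def wait_for_ready(proc):
    """Scan child stdout until the SESSION_READY marker appears."""
    for line in proc.stdout:
        if "SESSION_READY" in line:
            return True
    return False  # child exited without signaling


procs = [
    subprocess.Popen([sys.executable, "-c", CHILD],
                     stdout=subprocess.PIPE, text=True)
    for _ in range(3)
]
try:
    ready = all(wait_for_ready(p) for p in procs)
    # Once all children have signaled, every session exists simultaneously;
    # this is the point where the benchmark samples per-process memory.
    print(ready)  # True
finally:
    for p in procs:
        p.wait()
```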