Commit 718ca7f

Authored by duli2012, xadupre, yufenglee, edgchen1, zhanghuanrong
Second round of cherry-pick (#6083)
* Fix PR #5550, reverted in #5911 (performance improvement for operator Transpose) (#5916)
  * Improve the implementation of the Transpose operator
  * Fix the issue mentioned in #5911
  * Add a unit test for function DoTransposeImpl
* Make operator TreeEnsemble 5x faster for batches of size 100,000 (#5965)
  * Improve processing time by 10x
  * Extend unit test coverage
  * Better implementation for the multi-regression case
  * Better comments; keep parallelization by trees when there are not enough trees
* Initialize a structure in operator ReduceSum (#6005)
  * Fix an initialization issue
* Fuse MatMulIntegerToFloat only when scales are scalar (#6008)
  MatMulIntegerToFloat fusion fuses per-row and per-column MatMulInteger, which is not supported by the MatMulIntegerToFloat kernel yet. Limit the fusion to per-matrix scales only until per-channel is fully supported.
* Disable Python 3.9 for the training Python packaging build (#6012)
  Python 3.9 is not supported by the PyTorch dependency.
* Fix two bugs: 1) Calibrator should check model inputs; 2) quantize_inputs forgot to use parameter initializer_use_weight_qtyp (#6017)
* Bump highlight.js from 10.2.1 to 10.4.1 in /nodejs
  Bumps [highlight.js](https://github.com/highlightjs/highlight.js) from 10.2.1 to 10.4.1.
  - [Release notes](https://github.com/highlightjs/highlight.js/releases)
  - [Changelog](https://github.com/highlightjs/highlight.js/blob/master/CHANGES.md)
  - [Commits](highlightjs/highlight.js@10.2.1...10.4.1)
  Signed-off-by: dependabot[bot] <[email protected]>
* Work around the build break on macOS (#6069)
  * Fix the build break in the macOS release
  * Revert the Android change
* Bump up the API version for the 1.6 release (#6076)
* Update version to 1.6.0 (#6041)
  * Update version to 1.6.0
  * Add v1.5.3 info
  * Update WindowsAI and ONNX versions
* Revert "Fuse MatMulIntegerToFloat only when scales are scalar (#6008)"
  This reverts commit beb950e.

Co-authored-by: Xavier Dupré <[email protected]>
Co-authored-by: Yufeng Li <[email protected]>
Co-authored-by: Edward Chen <[email protected]>
Co-authored-by: Zhang Lei <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Pranav Sharma <[email protected]>
Co-authored-by: Du Li <duli@OrtTrainingDev0.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
1 parent c38f762 commit 718ca7f
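The commit message above notes that MatMulIntegerToFloat fusion is limited to per-matrix (scalar) scales because the fused kernel did not yet support per-row or per-column scales. The following is a hypothetical Python sketch of that distinction, not the ONNX Runtime implementation; the function names are illustrative only:

```python
import numpy as np

def matmul_integer_to_float(a_int8, b_int8, a_scale, b_scale):
    """Dequantize the int32 output of an integer matmul.

    a_scale and b_scale are assumed to be scalars (per-matrix quantization),
    so dequantization collapses to a single elementwise multiply. Per-row or
    per-column scales would require axis-specific broadcasting instead.
    """
    acc = a_int8.astype(np.int32) @ b_int8.astype(np.int32)
    return acc.astype(np.float32) * (a_scale * b_scale)

def is_scalar_scale(scale):
    # The fusion described in #6008 would only apply when this holds.
    return np.ndim(scale) == 0 or np.size(scale) == 1
```

With a scalar scale the check passes and the fused path applies; a per-row scale tensor such as `np.array([0.5, 0.25])` fails the check and the fusion is skipped.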

File tree: 20 files changed, +672 −217 lines changed


VERSION_NUMBER

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-1.5.2
+1.6.0

docs/Versioning.md

Lines changed: 7 additions & 5 deletions
@@ -26,11 +26,13 @@ For more details on ONNX Release versions, see [this page](https://github.com/on
 
 | ONNX Runtime release version | ONNX release version | ONNX opset version | ONNX ML opset version | Supported ONNX IR version | [Windows ML Availability](https://docs.microsoft.com/en-us/windows/ai/windows-ml/release-notes/)|
 |------------------------------|--------------------|--------------------|----------------------|------------------|------------------|
-| 1.5.2 | **1.7** down to 1.2 | 12 | 2 | 6 | Windows AI 1.5+ |
-| 1.5.1 | **1.7** down to 1.2 | 12 | 2 | 6 | Windows AI 1.5+ |
-| 1.4.0 | **1.7** down to 1.2 | 12 | 2 | 6 | Windows AI 1.4+ |
-| 1.3.1 | **1.7** down to 1.2 | 12 | 2 | 6 | Windows AI 1.4+ |
-| 1.3.0 | **1.7** down to 1.2 | 12 | 2 | 6 | Windows AI 1.3+ |
+| 1.6.0 | **1.8** down to 1.2 | 13 | 2 | 7 | Windows AI 1.6+ |
+| 1.5.3 | **1.7** down to 1.2 | 12 | 2 | 7 | Windows AI 1.5+ |
+| 1.5.2 | **1.7** down to 1.2 | 12 | 2 | 7 | Windows AI 1.5+ |
+| 1.5.1 | **1.7** down to 1.2 | 12 | 2 | 7 | Windows AI 1.5+ |
+| 1.4.0 | **1.7** down to 1.2 | 12 | 2 | 7 | Windows AI 1.4+ |
+| 1.3.1 | **1.7** down to 1.2 | 12 | 2 | 7 | Windows AI 1.4+ |
+| 1.3.0 | **1.7** down to 1.2 | 12 | 2 | 7 | Windows AI 1.3+ |
 | 1.2.0<br>1.1.2<br>1.1.1<br>1.1.0 | **1.6** down to 1.2 | 11 | 2 | 6 | Windows AI 1.3+ |
 | 1.0.0 | **1.6** down to 1.2 | 11 | 2 | 6 | Windows AI 1.3+ |
 | 0.5.0 | **1.5** down to 1.2 | 10 | 1 | 5 | Windows AI 1.3+ |

docs/python/README.rst

Lines changed: 10 additions & 0 deletions
@@ -8,6 +8,16 @@ For more information on ONNX Runtime, please see `aka.ms/onnxruntime <https://ak
 Changes
 -------
 
+1.6.0
+^^^^^
+
+Release Notes : https://github.com/Microsoft/onnxruntime/releases/tag/v1.6.0
+
+1.5.3
+^^^^^
+
+Release Notes : https://github.com/Microsoft/onnxruntime/releases/tag/v1.5.3
+
 1.5.2
 ^^^^^
 

include/onnxruntime/core/session/onnxruntime_c_api.h

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 #include <string.h>
 
 // This value is used in structures passed to ORT so that a newer version of ORT will still work with them
-#define ORT_API_VERSION 5
+#define ORT_API_VERSION 6
 
 #ifdef __cplusplus
 extern "C" {
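As the comment in the header says, `ORT_API_VERSION` lets a newer runtime keep serving clients compiled against an older version. The real mechanism is `OrtGetApiBase()->GetApi(ORT_API_VERSION)` in the C API; the following is only an illustrative Python sketch of that contract, with hypothetical table contents:

```python
# One API table per version; a caller asks for exactly the version it was
# compiled against, so a newer runtime can still serve older clients.
API_TABLES = {
    5: {"CreateEnv", "Run"},                 # hypothetical contents
    6: {"CreateEnv", "Run", "NewFeature"},   # hypothetical contents
}

def get_api(requested_version, latest_supported=6):
    # A runtime serves any version it still knows about; a version newer
    # than the runtime itself returns None, mirroring GetApi() returning
    # nullptr for an unsupported version.
    if requested_version > latest_supported:
        return None
    return API_TABLES.get(requested_version)
```

A client built against version 5 still gets a working table from a version-6 runtime, while a client demanding version 7 gets nothing and can fail gracefully.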

nodejs/package-lock.json

Lines changed: 5 additions & 5 deletions
Some generated files are not rendered by default.

nodejs/package.json

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 {
   "name": "onnxruntime",
   "description": "Node.js binding of ONNXRuntime",
-  "version": "1.5.2",
+  "version": "1.6.0",
   "main": "./lib/index.js",
   "types": "./types/lib/index.d.ts",
   "scripts": {
@@ -69,4 +69,4 @@
   "dependencies": {
     "prebuild-install": "^5.3.5"
   }
-}
+}

onnxruntime/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 For more information on ONNX Runtime, please see `aka.ms/onnxruntime <https://aka.ms/onnxruntime/>`_
 or the `Github project <https://github.com/microsoft/onnxruntime/>`_.
 """
-__version__ = "1.5.2"
+__version__ = "1.6.0"
 __author__ = "Microsoft"
 
 from onnxruntime.capi._pybind_state import get_all_providers, get_available_providers, get_device, set_seed, \

onnxruntime/core/providers/cpu/ml/tree_ensemble_classifier.cc

Lines changed: 1 addition & 1 deletion
@@ -139,7 +139,7 @@ template <typename T>
 TreeEnsembleClassifier<T>::TreeEnsembleClassifier(const OpKernelInfo& info)
     : OpKernel(info),
       tree_ensemble_(
-          100,
+          80,
           50,
           info.GetAttrOrDefault<std::string>("aggregate_function", "SUM"),
           info.GetAttrsOrDefault<float>("base_values"),

onnxruntime/core/providers/cpu/ml/tree_ensemble_common.h

Lines changed: 139 additions & 85 deletions
@@ -262,126 +262,180 @@ void TreeEnsembleCommon<ITYPE, OTYPE>::ComputeAgg(concurrency::ThreadPool* ttp,
   const ITYPE* x_data = X->template Data<ITYPE>();
   OTYPE* z_data = Z->template MutableData<OTYPE>();
   int64_t* label_data = label == nullptr ? nullptr : label->template MutableData<int64_t>();
+  auto max_num_threads = concurrency::ThreadPool::DegreeOfParallelism(ttp);
 
   if (n_targets_or_classes_ == 1) {
     if (N == 1) {
       ScoreValue<OTYPE> score = {0, 0};
-      if (n_trees_ <= parallel_tree_) {
+      if (n_trees_ <= parallel_tree_) { /* section A: 1 output, 1 row and not enough trees to parallelize */
         for (int64_t j = 0; j < n_trees_; ++j) {
           agg.ProcessTreeNodePrediction1(score, *ProcessTreeNodeLeave(roots_[j], x_data));
         }
-      } else {
-        std::vector<ScoreValue<OTYPE>> scores_t(n_trees_, {0, 0});
+      } else { /* section B: 1 output, 1 row and enough trees to parallelize */
+        std::vector<ScoreValue<OTYPE>> scores(n_trees_, {0, 0});
         concurrency::ThreadPool::TryBatchParallelFor(
             ttp,
             SafeInt<int32_t>(n_trees_),
-            [this, &scores_t, &agg, x_data](ptrdiff_t j) {
-              agg.ProcessTreeNodePrediction1(scores_t[j], *ProcessTreeNodeLeave(roots_[j], x_data));
+            [this, &scores, &agg, x_data](ptrdiff_t j) {
+              agg.ProcessTreeNodePrediction1(scores[j], *ProcessTreeNodeLeave(roots_[j], x_data));
             },
             0);
 
-        for (auto it = scores_t.cbegin(); it != scores_t.cend(); ++it) {
+        for (auto it = scores.cbegin(); it != scores.cend(); ++it) {
          agg.MergePrediction1(score, *it);
        }
      }
-
       agg.FinalizeScores1(z_data, score, label_data);
-    } else {
-      if (N <= parallel_N_) {
-        ScoreValue<OTYPE> score;
-        size_t j;
-
-        for (int64_t i = 0; i < N; ++i) {
-          score = {0, 0};
-          for (j = 0; j < static_cast<size_t>(n_trees_); ++j) {
-            agg.ProcessTreeNodePrediction1(score, *ProcessTreeNodeLeave(roots_[j], x_data + i * stride));
-          }
-
-          agg.FinalizeScores1(z_data + i * n_targets_or_classes_, score,
-                              label_data == nullptr ? nullptr : (label_data + i));
+    } else if (N <= parallel_N_) { /* section C: 1 output, 2+ rows but not enough rows to parallelize */
+      ScoreValue<OTYPE> score;
+      size_t j;
+
+      for (int64_t i = 0; i < N; ++i) {
+        score = {0, 0};
+        for (j = 0; j < static_cast<size_t>(n_trees_); ++j) {
+          agg.ProcessTreeNodePrediction1(score, *ProcessTreeNodeLeave(roots_[j], x_data + i * stride));
         }
-      } else {
-        concurrency::ThreadPool::TryBatchParallelFor(
-            ttp,
-            SafeInt<int32_t>(N),
-            [this, &agg, x_data, z_data, stride, label_data](ptrdiff_t i) {
-              ScoreValue<OTYPE> score = {0, 0};
-              for (size_t j = 0; j < static_cast<size_t>(n_trees_); ++j) {
-                agg.ProcessTreeNodePrediction1(score, *ProcessTreeNodeLeave(roots_[j], x_data + i * stride));
-              }
 
-              agg.FinalizeScores1(z_data + i * n_targets_or_classes_, score,
-                                  label_data == nullptr ? nullptr : (label_data + i));
-            },
-            0);
+        agg.FinalizeScores1(z_data + i, score,
+                            label_data == nullptr ? nullptr : (label_data + i));
       }
+    } else if (n_trees_ > max_num_threads) { /* section D: 1 output, 2+ rows and enough trees to parallelize */
+      auto num_threads = std::min<int32_t>(max_num_threads, SafeInt<int32_t>(n_trees_));
+      std::vector<ScoreValue<OTYPE>> scores(num_threads * N);
+      concurrency::ThreadPool::TrySimpleParallelFor(
+          ttp,
+          num_threads,
+          [this, &agg, &scores, num_threads, x_data, N, stride](ptrdiff_t batch_num) {
+            auto work = concurrency::ThreadPool::PartitionWork(batch_num, num_threads, this->n_trees_);
+            for (int64_t i = 0; i < N; ++i) {
+              scores[batch_num * N + i] = {0, 0};
+            }
+            for (auto j = work.start; j < work.end; ++j) {
+              for (int64_t i = 0; i < N; ++i) {
+                agg.ProcessTreeNodePrediction1(scores[batch_num * N + i], *ProcessTreeNodeLeave(roots_[j], x_data + i * stride));
+              }
+            }
+          });
+
+      concurrency::ThreadPool::TrySimpleParallelFor(
+          ttp,
+          num_threads,
+          [&agg, &scores, num_threads, label_data, z_data, N](ptrdiff_t batch_num) {
+            auto work = concurrency::ThreadPool::PartitionWork(batch_num, num_threads, N);
+            for (auto i = work.start; i < work.end; ++i) {
+              for (int64_t j = 1; j < num_threads; ++j) {
+                agg.MergePrediction1(scores[i], scores[j * N + i]);
+              }
              agg.FinalizeScores1(z_data + i, scores[i],
+                                  label_data == nullptr ? nullptr : (label_data + i));
+            }
+          });
+    } else { /* section E: 1 output, 2+ rows, parallelization by rows */
+      concurrency::ThreadPool::TryBatchParallelFor(
+          ttp,
+          SafeInt<int32_t>(N),
+          [this, &agg, x_data, z_data, stride, label_data](ptrdiff_t i) {
+            ScoreValue<OTYPE> score = {0, 0};
+            for (size_t j = 0; j < static_cast<size_t>(n_trees_); ++j) {
+              agg.ProcessTreeNodePrediction1(score, *ProcessTreeNodeLeave(roots_[j], x_data + i * stride));
+            }
+
+            agg.FinalizeScores1(z_data + i, score,
+                                label_data == nullptr ? nullptr : (label_data + i));
+          },
+          0);
     }
   } else {
-    if (N == 1) {
-      std::vector<ScoreValue<OTYPE>> scores(n_targets_or_classes_, {0, 0});
-      if (n_trees_ <= parallel_tree_) {
+    if (N == 1) { /* section A2: 2+ outputs, 1 row, not enough trees to parallelize */
+      if (n_trees_ <= parallel_tree_) { /* section A2 */
+        std::vector<ScoreValue<OTYPE>> scores(n_targets_or_classes_, {0, 0});
         for (int64_t j = 0; j < n_trees_; ++j) {
           agg.ProcessTreeNodePrediction(scores, *ProcessTreeNodeLeave(roots_[j], x_data));
         }
-      } else {
-        // split the work into one block per thread so we can re-use the 'private_scores' vector as much as possible
-        // TODO: Refine the number of threads used
-        auto num_threads = std::min<int32_t>(concurrency::ThreadPool::DegreeOfParallelism(ttp), SafeInt<int32_t>(n_trees_));
-        OrtMutex merge_mutex;
+        agg.FinalizeScores(scores, z_data, -1, label_data);
+      } else { /* section B2: 2+ outputs, 1 row, enough trees to parallelize */
+        auto num_threads = std::min<int32_t>(max_num_threads, SafeInt<int32_t>(n_trees_));
+        std::vector<std::vector<ScoreValue<OTYPE>>> scores(num_threads);
         concurrency::ThreadPool::TrySimpleParallelFor(
             ttp,
             num_threads,
-            [this, &agg, &scores, &merge_mutex, num_threads, x_data](ptrdiff_t batch_num) {
-              std::vector<ScoreValue<OTYPE>> private_scores(n_targets_or_classes_, {0, 0});
+            [this, &agg, &scores, num_threads, x_data](ptrdiff_t batch_num) {
+              scores[batch_num].resize(n_targets_or_classes_, {0, 0});
               auto work = concurrency::ThreadPool::PartitionWork(batch_num, num_threads, n_trees_);
               for (auto j = work.start; j < work.end; ++j) {
-                agg.ProcessTreeNodePrediction(private_scores, *ProcessTreeNodeLeave(roots_[j], x_data));
+                agg.ProcessTreeNodePrediction(scores[batch_num], *ProcessTreeNodeLeave(roots_[j], x_data));
              }
-
-              std::lock_guard<OrtMutex> lock(merge_mutex);
-              agg.MergePrediction(scores, private_scores);
             });
+        for (size_t i = 1; i < scores.size(); ++i) {
+          agg.MergePrediction(scores[0], scores[i]);
+        }
+        agg.FinalizeScores(scores[0], z_data, -1, label_data);
       }
-
-      agg.FinalizeScores(scores, z_data, -1, label_data);
-    } else {
-      if (N <= parallel_N_) {
-        std::vector<ScoreValue<OTYPE>> scores(n_targets_or_classes_);
-        size_t j;
-
-        for (int64_t i = 0; i < N; ++i) {
-          std::fill(scores.begin(), scores.end(), ScoreValue<OTYPE>({0, 0}));
-          for (j = 0; j < roots_.size(); ++j) {
-            agg.ProcessTreeNodePrediction(scores, *ProcessTreeNodeLeave(roots_[j], x_data + i * stride));
-          }
-
-          agg.FinalizeScores(scores, z_data + i * n_targets_or_classes_, -1,
-                             label_data == nullptr ? nullptr : (label_data + i));
+    } else if (N <= parallel_N_) { /* section C2: 2+ outputs, 2+ rows, not enough rows to parallelize */
+      std::vector<ScoreValue<OTYPE>> scores(n_targets_or_classes_);
+      size_t j;
+
+      for (int64_t i = 0; i < N; ++i) {
+        std::fill(scores.begin(), scores.end(), ScoreValue<OTYPE>({0, 0}));
+        for (j = 0; j < roots_.size(); ++j) {
+          agg.ProcessTreeNodePrediction(scores, *ProcessTreeNodeLeave(roots_[j], x_data + i * stride));
        }
-      } else {
-        // split the work into one block per thread so we can re-use the 'scores' vector as much as possible
-        // TODO: Refine the number of threads used.
-        auto num_threads = std::min<int32_t>(concurrency::ThreadPool::DegreeOfParallelism(ttp), SafeInt<int32_t>(N));
-        concurrency::ThreadPool::TrySimpleParallelFor(
-            ttp,
-            num_threads,
-            [this, &agg, num_threads, x_data, z_data, label_data, N, stride](ptrdiff_t batch_num) {
-              size_t j;
-              std::vector<ScoreValue<OTYPE>> scores(n_targets_or_classes_);
-              auto work = concurrency::ThreadPool::PartitionWork(batch_num, num_threads, N);
-
-              for (auto i = work.start; i < work.end; ++i) {
-                std::fill(scores.begin(), scores.end(), ScoreValue<OTYPE>({0, 0}));
-                for (j = 0; j < roots_.size(); ++j) {
-                  agg.ProcessTreeNodePrediction(scores, *ProcessTreeNodeLeave(roots_[j], x_data + i * stride));
-                }
-
-                agg.FinalizeScores(scores,
-                                   z_data + i * n_targets_or_classes_, -1,
-                                   label_data == nullptr ? nullptr : (label_data + i));
-              }
-            });
+
+        agg.FinalizeScores(scores, z_data + i * n_targets_or_classes_, -1,
+                           label_data == nullptr ? nullptr : (label_data + i));
      }
+    } else if (n_trees_ >= max_num_threads) { /* section: D2: 2+ outputs, 2+ rows, enough trees to parallelize*/
+      auto num_threads = std::min<int32_t>(max_num_threads, SafeInt<int32_t>(n_trees_));
+      std::vector<std::vector<ScoreValue<OTYPE>>> scores(num_threads * N);
+      concurrency::ThreadPool::TrySimpleParallelFor(
+          ttp,
+          num_threads,
+          [this, &agg, &scores, num_threads, x_data, N, stride](ptrdiff_t batch_num) {
+            auto work = concurrency::ThreadPool::PartitionWork(batch_num, num_threads, this->n_trees_);
+            for (int64_t i = 0; i < N; ++i) {
+              scores[batch_num * N + i].resize(n_targets_or_classes_, {0, 0});
+            }
+            for (auto j = work.start; j < work.end; ++j) {
+              for (int64_t i = 0; i < N; ++i) {
+                agg.ProcessTreeNodePrediction(scores[batch_num * N + i], *ProcessTreeNodeLeave(roots_[j], x_data + i * stride));
+              }
+            }
+          });
+
+      concurrency::ThreadPool::TrySimpleParallelFor(
+          ttp,
+          num_threads,
+          [this, &agg, &scores, num_threads, label_data, z_data, N](ptrdiff_t batch_num) {
+            auto work = concurrency::ThreadPool::PartitionWork(batch_num, num_threads, N);
+            for (auto i = work.start; i < work.end; ++i) {
+              for (int64_t j = 1; j < num_threads; ++j) {
+                agg.MergePrediction(scores[i], scores[j * N + i]);
+              }
+              agg.FinalizeScores(scores[i], z_data + i * this->n_targets_or_classes_, -1,
+                                 label_data == nullptr ? nullptr : (label_data + i));
+            }
+          });
+    } else { /* section E2: 2+ outputs, 2+ rows, parallelization by rows */
+      auto num_threads = std::min<int32_t>(max_num_threads, SafeInt<int32_t>(N));
+      concurrency::ThreadPool::TrySimpleParallelFor(
+          ttp,
+          num_threads,
+          [this, &agg, num_threads, x_data, z_data, label_data, N, stride](ptrdiff_t batch_num) {
+            size_t j;
+            std::vector<ScoreValue<OTYPE>> scores(n_targets_or_classes_);
+            auto work = concurrency::ThreadPool::PartitionWork(batch_num, num_threads, N);
+
+            for (auto i = work.start; i < work.end; ++i) {
+              std::fill(scores.begin(), scores.end(), ScoreValue<OTYPE>({0, 0}));
+              for (j = 0; j < roots_.size(); ++j) {
+                agg.ProcessTreeNodePrediction(scores, *ProcessTreeNodeLeave(roots_[j], x_data + i * stride));
              }
+
+              agg.FinalizeScores(scores,
+                                 z_data + i * n_targets_or_classes_, -1,
+                                 label_data == nullptr ? nullptr : (label_data + i));
            }
          });
    }
  }
} // namespace detail
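The "parallelize by trees" scheme in sections D/D2 above works in two passes: each worker takes a contiguous slice of trees and accumulates partial scores for every row, then a second pass merges the per-worker partials row by row. A minimal Python sketch of that idea, with plain functions standing in for the tree kernels (names are illustrative, not the ONNX Runtime API):

```python
from concurrent.futures import ThreadPoolExecutor

def score_rows(trees, rows, num_threads=4):
    n = len(rows)
    # partials[t][i]: score accumulated by worker t for row i.
    # Each worker writes only its own sub-list, so no locking is needed,
    # unlike the mutex-based merge this diff removes.
    partials = [[0.0] * n for _ in range(num_threads)]

    def run(batch_num):
        # Contiguous partition of the trees, like ThreadPool::PartitionWork.
        start = batch_num * len(trees) // num_threads
        end = (batch_num + 1) * len(trees) // num_threads
        for tree in trees[start:end]:
            for i, row in enumerate(rows):
                partials[batch_num][i] += tree(row)

    with ThreadPoolExecutor(max_workers=num_threads) as ex:
        list(ex.map(run, range(num_threads)))

    # Merge phase: sum the per-worker partials for each row (SUM aggregation).
    return [sum(partials[t][i] for t in range(num_threads)) for i in range(n)]

# Example: 8 "trees" that each score a row x as x * k.
trees = [lambda x, k=k: x * k for k in range(8)]
print(score_rows(trees, [1.0, 2.0]))  # sum(range(8)) = 28 → [28.0, 56.0]
```

Replacing the mutex-guarded merge with a per-worker score buffer plus a separate merge pass is what lets the trees be processed without contention.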

onnxruntime/core/providers/cpu/ml/treeregressor.cc

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ template <typename T>
 TreeEnsembleRegressor<T>::TreeEnsembleRegressor(const OpKernelInfo& info)
     : OpKernel(info),
       tree_ensemble_(
-          100,
+          80,
           50,
           info.GetAttrOrDefault<std::string>("aggregate_function", "SUM"),
           info.GetAttrsOrDefault<float>("base_values"),
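The two constructor arguments changed to 80 and 50 are the thresholds (parallel_tree_ and parallel_N_) that select among sections A–E in tree_ensemble_common.h. A hedged sketch of the implied dispatch; the function and label names are illustrative, not the real API:

```python
def choose_strategy(n_rows, n_trees, max_num_threads,
                    parallel_tree=80, parallel_n=50):
    # Mirrors the if/else-if chain in TreeEnsembleCommon::ComputeAgg:
    # small single-row jobs stay sequential, big ones parallelize by
    # trees, and large batches parallelize by rows.
    if n_rows == 1:
        return "A: sequential" if n_trees <= parallel_tree else "B: parallel by trees"
    if n_rows <= parallel_n:
        return "C: sequential over rows"
    if n_trees > max_num_threads:
        return "D: parallel by trees, then merge per row"
    return "E: parallel by rows"
```

For the batch size named in the commit message (100,000 rows) and a few hundred trees, this dispatch lands in section D, the newly added tree-parallel path.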
