Integrate `MetadataPayloadExchange` into `Shuffler` #649

pentschev · 2025-11-10T22:43:49Z

Follow-up on #437 by switching Shuffler to use the new MetadataPayloadExchange. This helps simplify the code in Shuffler, and simultaneously remove the need for an "ack" message, opening also the possiblity for using an Active Message-based communicator (TBD).

madsbk

Overall looks great, @pentschev.

It would be good to run some benchmarks. I assume we expect the performance to remain unchanged?

cpp/src/shuffler/shuffler.cpp

madsbk · 2025-11-11T07:16:38Z

cpp/src/shuffler/shuffler.cpp

+                            size, br_->stream_pool().get_stream(), br_
+                        );
+                    },
+                    statistics


statistics might have been moved in line 349. Change line 349 to statistics_{statistics).

But statistics is a shared_ptr and that only gets moved after we initialized MetadataPayloadExchange, by that time MetadataPayloadExchange has already a copy of the shared_ptr and increase its refcount. I think this is not necessary.

No, the order in which members are initialized in a constructor is guaranteed (it follows their declaration order in the class), but the evaluation order of the arguments passed to those members is not. Therefore, std::move(statistics) might be evaluated first.

pentschev

Thanks @madsbk , I ran bench_shuffle with 4 workers (I don't currently have exclusive access to an 8xH100 machine) as you suggested, and below are some results:

Before (run 1)

[PRINT:0:0] means: 0.01 sec | local throughput: 134.29 GiB/s | global throughput: 537.17 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4
[PRINT:1:0] means: 0.01 sec | local throughput: 128.22 GiB/s | global throughput: 512.87 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4
[PRINT:3:0] means: 0.01 sec | local throughput: 127.86 GiB/s | global throughput: 511.45 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4
[PRINT:2:0] means: 0.01 sec | local throughput: 124.80 GiB/s | global throughput: 499.18 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4

After (run 1)

[PRINT:2:0] means: 0.01 sec | local throughput: 132.01 GiB/s | global throughput: 528.06 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4
[PRINT:1:0] means: 0.01 sec | local throughput: 127.69 GiB/s | global throughput: 510.74 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4
[PRINT:0:0] means: 0.01 sec | local throughput: 129.64 GiB/s | global throughput: 518.55 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4
[PRINT:3:0] means: 0.01 sec | local throughput: 128.10 GiB/s | global throughput: 512.41 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4

As you may notice, there's some variability that makes it difficult to draw a line on whether performance is better or worse, but I think it's fair to say it has no regressions, the minimum throughput is generally higher, but maximum is lower. If we run multiple times we may see a different picture, in the results below we see both minimum and maximum throughput higher than before.

Before (run 2)

[PRINT:1:0] means: 0.01 sec | local throughput: 131.85 GiB/s | global throughput: 527.39 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4
[PRINT:0:0] means: 0.01 sec | local throughput: 131.35 GiB/s | global throughput: 525.39 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4
[PRINT:2:0] means: 0.01 sec | local throughput: 124.80 GiB/s | global throughput: 499.19 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4
[PRINT:3:0] means: 0.01 sec | local throughput: 124.88 GiB/s | global throughput: 499.53 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4

After (run 2)

[PRINT:1:0] means: 0.01 sec | local throughput: 129.12 GiB/s | global throughput: 516.48 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4
[PRINT:0:0] means: 0.01 sec | local throughput: 129.59 GiB/s | global throughput: 518.36 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4
[PRINT:2:0] means: 0.01 sec | local throughput: 134.02 GiB/s | global throughput: 536.08 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4
[PRINT:3:0] means: 0.01 sec | local throughput: 129.40 GiB/s | global throughput: 517.61 GiB/s | in_parts: 1 | out_parts: 2 | nranks: 4

With that said, I propose we get this merged when we're happy with the way the code looks and tests pass and observe performance over the next few days in our nightly runs to ensure there were at least no regressions. WDYT?

pentschev · 2025-11-11T09:50:51Z

cpp/src/shuffler/shuffler.cpp

+                            size, br_->stream_pool().get_stream(), br_
+                        );
+                    },
+                    statistics


But statistics is a shared_ptr and that only gets moved after we initialized MetadataPayloadExchange, by that time MetadataPayloadExchange has already a copy of the shared_ptr and increase its refcount. I think this is not necessary.

pentschev added 2 commits November 10, 2025 14:42

Integrate MetadataPayloadExchange in Shuffler

6a4a494

Fix send of zero-sized payloads

095c14e

pentschev self-assigned this Nov 10, 2025

pentschev added the improvement Improves an existing functionality label Nov 10, 2025

pentschev requested a review from a team as a code owner November 10, 2025 22:43

pentschev added the non-breaking Introduces a non-breaking change label Nov 10, 2025

pentschev requested a review from a team as a code owner November 10, 2025 22:43

madsbk requested changes Nov 11, 2025

View reviewed changes

Add a helper function to convert Chunks into Messages

5cf22a1

pentschev commented Nov 11, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Integrate `MetadataPayloadExchange` into `Shuffler` #649

Integrate `MetadataPayloadExchange` into `Shuffler` #649

pentschev commented Nov 10, 2025

Uh oh!

madsbk left a comment

Uh oh!

Uh oh!

madsbk Nov 11, 2025

Uh oh!

pentschev Nov 11, 2025

Uh oh!

madsbk Nov 12, 2025

Uh oh!

pentschev left a comment

Uh oh!

pentschev Nov 11, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Integrate MetadataPayloadExchange into Shuffler #649

Are you sure you want to change the base?

Integrate MetadataPayloadExchange into Shuffler #649

Conversation

pentschev commented Nov 10, 2025

Uh oh!

madsbk left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

madsbk Nov 11, 2025

Choose a reason for hiding this comment

Uh oh!

pentschev Nov 11, 2025

Choose a reason for hiding this comment

Uh oh!

madsbk Nov 12, 2025

Choose a reason for hiding this comment

Uh oh!

pentschev left a comment

Choose a reason for hiding this comment

Uh oh!

pentschev Nov 11, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Integrate `MetadataPayloadExchange` into `Shuffler` #649

Integrate `MetadataPayloadExchange` into `Shuffler` #649