About the issue of interleaved pipeline scheduling #1558

lxgsbqylbk · 2025-04-29T14:34:57Z

I reviewed the Megatron-LM documentation related to pipeline parallelism. Is the following understanding correct?

Assume I have 4 GPUs, num_layers=8, and pipeline_parallel_size=4 (one pipeline stage per GPU). Without interleaving, each GPU is assigned a block of consecutive layers:
GPU0: layers [0,1]
GPU1: layers [2,3]
GPU2: layers [4,5]
GPU3: layers [6,7]

If I enable virtual_pipeline_size=2 (interleaved), the paper says each physical pipeline stage will be assigned non-consecutive layers:
GPU0: layers [0,4]
GPU1: layers [1,5]
GPU2: layers [2,6]
GPU3: layers [3,7]

However, when I ran two training jobs (one with vpp_size=1, the other with vpp_size=2) and inspected the final checkpoint weights, I found that in the vpp_size=2 run GPU0 still held layers 0 and 1 and GPU1 still held layers 2 and 3. The only difference was that the layers were split across separate model chunks (model0, model1).
This seems inconsistent with what the paper describes. Why is that?
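
For reference, the two layouts listed above can be reproduced with a small standalone sketch. This is not Megatron-LM code: `assign_layers` is a hypothetical helper written only for illustration. It assumes four pipeline stages (one per GPU, as the layer lists imply) and that the layers divide evenly across stages and model chunks.

```python
def assign_layers(num_layers: int, pp_size: int, vpp_size: int = 1) -> list[list[int]]:
    """Return, for each pipeline rank, the global layer indices it would hold.

    Hypothetical helper: it mirrors the layout described in the post,
    not any particular library implementation.
    """
    if num_layers % (pp_size * vpp_size) != 0:
        raise ValueError("num_layers must be divisible by pp_size * vpp_size")

    # Layers in one model chunk on one pipeline rank.
    layers_per_chunk = num_layers // (pp_size * vpp_size)
    # Layers covered by one full pass over all pipeline ranks.
    layers_per_pass = num_layers // vpp_size

    assignment = []
    for pp_rank in range(pp_size):
        layers = []
        for vpp_rank in range(vpp_size):
            # Chunk `vpp_rank` on this rank starts after `vpp_rank` full passes,
            # plus the chunks owned by the lower pipeline ranks in this pass.
            offset = vpp_rank * layers_per_pass + pp_rank * layers_per_chunk
            layers.extend(range(offset, offset + layers_per_chunk))
        assignment.append(layers)
    return assignment


if __name__ == "__main__":
    for vpp in (1, 2):
        print(f"vpp_size={vpp}")
        for rank, layers in enumerate(assign_layers(8, 4, vpp)):
            print(f"  GPU{rank}: layers {layers}")
```

With vpp_size=1 each rank gets one contiguous block. With vpp_size=2 each rank gets two single-layer chunks that are num_layers / vpp_size = 4 layers apart, which is the non-consecutive layout expected from the paper.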

Replies: 0 comments
