Support llama3 autoparallel + pipelining #1657
base: autoparallel
Conversation
So far just tested locally: `LOG_RANK=4 CONFIG_FILE=././torchtitan/models/deepseek_v3/train_configs/debug_model.toml ./run_train.sh --model.name llama3_auto_parallel --parallelism.pipeline_parallel_degree 2 --training.steps 100`. Runs and loss converges. Left one TODO about global batch size and gradient accumulation.
```python
pp_degree = job_config.parallelism.pipeline_parallel_degree
```
Unused `pp_degree` config; we should probably raise an error when it's not the local world size.
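A minimal sketch of the validation suggested above, assuming the helper name is hypothetical and that the local world size comes from the `LOCAL_WORLD_SIZE` environment variable that `torchrun` sets:

```python
import os


def validate_pp_degree(job_config):
    # Hypothetical check: the review suggests erroring out when the
    # configured PP degree does not match the local world size
    # (the exact invariant is an assumption, not confirmed in the PR).
    pp_degree = job_config.parallelism.pipeline_parallel_degree
    local_world_size = int(os.environ.get("LOCAL_WORLD_SIZE", "1"))
    if pp_degree != local_world_size:
        raise ValueError(
            f"pipeline_parallel_degree ({pp_degree}) must equal "
            f"the local world size ({local_world_size})"
        )
```

This would fail fast at startup instead of silently ignoring the config value.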
```python
spmd_dims.append("tp")
spmd_mesh = world_mesh[spmd_dims]

dp_degree = 1
```
Same here; the config could specify `dp_degree` instead of hardcoding it.
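A sketch of what reading the degree from the config could look like, assuming a torchtitan-style `data_parallel_shard_degree` field where `-1` means "infer from the world size" (both the field name and that convention are assumptions here):

```python
def resolve_dp_degree(job_config, world_size):
    # Sketch: derive dp_degree from the parallelism config rather than
    # hardcoding 1. A value of -1 is assumed to mean "use all ranks
    # for data parallelism".
    dp = job_config.parallelism.data_parallel_shard_degree
    if dp == -1:
        dp = world_size
    return dp
```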
```diff
-    inputs, target=targets, losses=losses, input_batch=inputs
+    # TODO: input_batch kwarg only needed for CP, but
+    # autoparallel doesn't accept kwargs in its forward
+    inputs, target=targets, losses=losses  # , input_batch=inputs
```
Curious, why does CP need `input_batch`?
I assumed you would know. Am I wrong?
```python
pp_degree = job_config.parallelism.pipeline_parallel_degree
local_batch_size = job_config.training.local_batch_size
spmd_batch_size = local_batch_size
```
Oops, this is a bug for the non-PP case: it should be `local_batch_size * dp_degree`, and it should go in an `else` branch.
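The fix described above could be sketched as follows (`pp_enabled` and the exact scaling semantics are assumptions drawn from the review comment, not verified against the autoparallel frontend):

```python
def compute_spmd_batch_size(local_batch_size, dp_degree, pp_enabled):
    # With PP enabled, the schedule splits the local batch into
    # microbatches, so the SPMD graph keeps the local batch size.
    if pp_enabled:
        return local_batch_size
    # Non-PP case: the SPMD graph traces the full data-parallel
    # group's batch, so scale by dp_degree (the suggested fix).
    return local_batch_size * dp_degree
```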