
Add support for resharding for fbgemm configs #2387


Open
wants to merge 1 commit into main from fbgemm-reshard

Conversation

jerryzh168
Contributor

Summary:
Added transpose and cat op support, as well as some custom transpose/reshape/unflatten support for resharding.

In the future we should probably provide examples of using distributed checkpointing for resharding.

Test Plan:
python test/dtypes/test_fbgemm_int4.py -k test_transpose
python test/dtypes/test_fbgemm_int4.py -k test_cat
python test/dtypes/test_fbgemm_fp8.py -k test_transpose
python test/dtypes/test_fbgemm_fp8.py -k test_cat

Reviewers:

Subscribers:

Tasks:

Tags:


pytorch-bot bot commented Jun 16, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2387

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 23c46f4 with merge base 6243040:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Jun 16, 2025
@jerryzh168 requested a review from drisspg on June 16, 2025 20:44
@jerryzh168 added the topic: improvement label on June 16, 2025
@jerryzh168 force-pushed the fbgemm-reshard branch 2 times, most recently from 6525c15 to 9e128c1 on June 16, 2025 20:49

drisspg commented Jun 16, 2025

Why are these ops needed? Is it for DCP (distributed checkpointing)?

cat_weight1 = torch.cat([linear1.weight, linear2.weight], dim=0)
cat_weight2 = torch.cat([linear1.weight, linear2.weight], dim=1)
self.assertEqual(cat_weight1.shape, (512, 128))
self.assertEqual(cat_weight2.shape, (256, 256))
Contributor

@drisspg Jun 16, 2025


Can you also assert equality of bits?

Contributor Author


sure
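For reference, a minimal sketch (not the PR's actual test code) of what asserting bit equality after cat could look like, assuming this is the fp8 path and the weight subclass exposes its raw data as float8_data (per the Tensor Attributes docstring quoted further down); the 256-row split assumes each linear has a (256, 128) weight:

# bit-for-bit comparison of the concatenated raw data against the inputs (sketch only)
self.assertTrue(
    torch.equal(
        cat_weight1.float8_data[:256].view(torch.uint8),
        linear1.weight.float8_data.view(torch.uint8),
    )
)
self.assertTrue(
    torch.equal(
        cat_weight1.float8_data[256:].view(torch.uint8),
        linear2.weight.float8_data.view(torch.uint8),
    )
)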

data_to_scale_dim: the dim mapping from float8_data to scale, e.g.
float8_data: (batch_size, output_channel, input_channel)
scale: (batch_size, output_channel) (since it's per row quantization)
data_to_scale_dim: {0: 0, 1: 1}
Contributor


This explanation isn't very helpful / I don't know what this is doing.

Contributor Author


this is a bit confusing, removed
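For context, a small illustration (not from the PR) of the shapes the removed docstring was describing, for per-row fp8 quantization of a batched weight; the sizes below are made up:

import torch

batch_size, output_channel, input_channel = 4, 256, 128
float8_data = torch.zeros(batch_size, output_channel, input_channel, dtype=torch.float8_e4m3fn)
scale = torch.zeros(batch_size, output_channel, dtype=torch.float32)  # one scale per row
# data_to_scale_dim maps each float8_data dim to the corresponding scale dim;
# the input_channel dim (2) has no entry because a row shares a single scale across it
data_to_scale_dim = {0: 0, 1: 1}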

)

def _transpose_and_reshape(self):
"""This is added for resharding support, since the resharding logic for the model we are
Contributor


Do these next two functions need to be methods, or can they be implementations of the actual ops?

Contributor Author


These should be methods; they're specific to the hack we are doing.

assert len(self.shape) == 3, (
    f"Only expected to be used when the Tensor is 3D, got {len(self.shape)}"
)
dim0, dim1, dim2 = self.shape
Contributor


I don't understand this.

Contributor Author


This is specific to the hack: we transpose the weight first and then quantize, so (dim0, dim2, dim1) is the original shape.

We are restoring the tensor to its original shape here for resharding.
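A shape-bookkeeping sketch of what is being described (illustrative sizes, not the PR's implementation): the weight is transposed before quantization, so the quantized tensor's shape is (dim0, dim2, dim1) relative to the checkpoint's (dim0, dim1, dim2), and resharding needs the original layout back:

import torch

original = torch.randn(8, 256, 128)                   # (dim0, dim1, dim2) as stored in the checkpoint
to_quantize = original.transpose(1, 2).contiguous()   # (dim0, dim2, dim1), the layout used for quantization
dim0, dim1, dim2 = to_quantize.shape                  # dims as seen by the quantized tensor
restored_shape = (dim0, dim2, dim1)                   # transpose back to the checkpoint layout for resharding
assert restored_shape == original.shape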

@jerryzh168 force-pushed the fbgemm-reshard branch 2 times, most recently from bed8189 to 4e562e2 on June 17, 2025 01:22
TODO: needs padding for cutlass kernels

Tensor Attributes:
float8_data: float8 raw data, dtype torchao.float8.config.e4m3_dtype
Contributor


To clarify, does this mean there is a dependency on torchao.float8.config.e4m3_dtype? If so, I think the dependency should be refactored away to a common utility; it's not expected for that config to affect anything other than the torchao.float8 workflow.
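One possible shape for that refactor (a sketch, not an existing torchao module; the module path and constant name below are assumptions) is to host the dtype in a shared utility so the fbgemm tensor subclasses don't import the float8 training config:

# hypothetical shared location, e.g. torchao/utils.py
import torch

e4m3_dtype = torch.float8_e4m3fn  # standard e4m3 dtype; a ROCm build might swap in torch.float8_e4m3fnuz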

Groupwise int4 weight only quantization
Tensor Attributes:
packed_weight: packed int4 weight, either 2D (N, K/2) or 3D (B, N, K/2), last dimension is packed
Contributor


When is a weight a 3D tensor, and why is the batch dimension in here? Could you share a specific example?
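As a toy illustration (not torchao's packing code, and the nibble order here is an assumption), the K/2 last dimension comes from packing two int4 values into each byte, which works the same way for the 2D and 3D layouts:

import torch

B, N, K = 2, 256, 128
w_int4 = torch.randint(0, 16, (B, N, K), dtype=torch.uint8)   # unpacked int4 values in [0, 16)
packed = (w_int4[..., ::2] << 4) | w_int4[..., 1::2]          # pack two nibbles per byte
assert packed.shape == (B, N, K // 2)                         # matches the (B, N, K/2) layout above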

Labels
CLA Signed · topic: improvement
4 participants