Add multicast tensor #346
Conversation
stack-info: PR: #346, branch: joydddd/stack/17
Force-pushed from 1e986c5 to 0bcfcca
Force-pushed from 0bcfcca to 5609bbf
stack-info: PR: #346, branch: joydddd/stack/17
Force-pushed from 5609bbf to 1749db5
from .._compiler.variable_origin import Origin


class MulticastTensor(NamedTuple):
If we don't inherit from NamedTuple does this still work?
Yep, removed the inheritance.
Actually, no, we might want to keep the NamedTuple inheritance. fake_fn for hl ops is called during both type propagation and device_ir tracing. For type propagation, multicast tensors are passed in as the original MulticastTensor type, while in device_ir tracing we call prepare_args to unpack the MulticastTensor into tuples before calling fake_fn. It is nicer to make MulticastTensor a NamedTuple so that in both cases it is a tuple.
Unless we find a better way to deal with the MulticastTensor constructor in device_ir, so that we don't need to unpack it into a tuple.
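For reference, a minimal sketch of the trade-off discussed above (field names taken from the diff in this PR; the constructor call is illustrative, not the PR's code): a NamedTuple behaves as a structured type during type propagation while still unpacking as a plain tuple for prepare_args and the tracer.

```python
from typing import NamedTuple

import torch


class MulticastTensor(NamedTuple):
    tensor_like: torch.Tensor  # host tensor describing shape/dtype
    dev_ptrs: torch.Tensor     # pointer table, one entry per device


mt = MulticastTensor(torch.empty(16), torch.zeros(4, dtype=torch.int64))
tensor_like, dev_ptrs = mt    # unpacks like any tuple
assert isinstance(mt, tuple)  # tuple-based checks in the tracer still pass
```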
Force-pushed from 1749db5 to fd02b59
Force-pushed from fd02b59 to ba94f3f
@@ -289,6 +295,134 @@ def codegen_store(
    )


class MulticastIndexingStrategy:
    @staticmethod
Can you add more detail on the semantics of this indexing strategy?
return state.device_function.indexing_strategy.codegen_store(
    state, tensor, [*subscript], value, extra_mask
)
if isinstance(tensor, tuple):
I still don't totally follow why we convert to a tuple instead of keeping the multicast tensor type.
The MulticastTensor constructor is not a device function and should not show up in the device IR fx graph (we have no lowering path for it). Therefore, to keep the tracer from seeing it, we unpack it into a tuple before entering the tracer.
Any ideas on how to do this in a nicer way to avoid type checking for a tuple? Maybe add a typename item in the tuple and check for that?
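A rough sketch of the "typename item" idea floated above (the names here are made up for illustration, not part of the PR): tag the unpacked tuple with a sentinel so codegen can tell it apart from an ordinary tuple instead of relying on a bare isinstance(tensor, tuple) check.

```python
# Hypothetical sentinel-tagging scheme; _MULTICAST_TAG and these helpers
# are illustrative names, not the PR's actual API.
_MULTICAST_TAG = "helion_multicast_tensor"


def pack_multicast(tensor_like, dev_ptrs) -> tuple:
    # Unpack MulticastTensor into a tagged tuple before it reaches the tracer.
    return (_MULTICAST_TAG, tensor_like, dev_ptrs)


def is_multicast(obj: object) -> bool:
    # Cheap structural check usable at codegen time.
    return isinstance(obj, tuple) and len(obj) == 3 and obj[0] == _MULTICAST_TAG
```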
Will multicast tensors work with arbitrary operations (for example inductor ones)?
Are the semantics always that we repeat an op once for every tensor?
I worry that needing to add an `if multicast` case to every op will add a lot of complexity. Could we implement this as an FX graph pass that duplicates every op once per sub-tensor?
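For what it's worth, a hedged sketch of that kind of pass (the node.meta["multicast"] convention is invented here purely for illustration): each op that consumes a multicast node is cloned once per sub-tensor and the per-device results are stacked back together.

```python
import torch
import torch.fx as fx


def duplicate_multicast_ops(graph: fx.Graph) -> None:
    """Clone every op that consumes a multicast node, once per sub-tensor.

    Assumes a made-up convention: a multicast node lists its per-device
    sub-tensor nodes in node.meta["multicast"].
    """
    for node in list(graph.nodes):
        if node.op != "call_function":
            continue
        mcast_inputs = [
            a for a in node.args if isinstance(a, fx.Node) and "multicast" in a.meta
        ]
        if not mcast_inputs:
            continue
        src = mcast_inputs[0]
        with graph.inserting_before(node):
            clones = []
            for sub in src.meta["multicast"]:  # one fx.Node per device tensor
                args = tuple(sub if a is src else a for a in node.args)
                clones.append(graph.call_function(node.target, args, dict(node.kwargs)))
            # Re-pack the per-device results so downstream users see one tensor.
            stacked = graph.call_function(torch.stack, (clones,))
        node.replace_all_uses_with(stacked)
        graph.erase_node(node)
```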
def has_multicast_tensor_with_rdim(self, graph: torch.fx.Graph) -> bool:
    """Check if a graph contains multicast tensors with rdim inputs."""

    def is_multicast_with_rdim(node: torch.fx.Node) -> bool:
Move to global scope?
Not sure what you mean by moving to global scope? We check whether self.rdim of the roller matches any of the multicast dims. @yf225 can you take a look?
tensor_like: torch.Tensor
dev_ptrs: torch.Tensor
Can you talk about the motivation for this representation?
Multicast tensors are meant to be host tensors that can only be accessed by memory operations (namely hl.load, hl.store, hl.atomic_add, hl.signal & hl.wait). For these cases we handle indexing, masks and offset calculation manually. Arbitrary operations (for example inductor ones) apply only to device tensors returned from these memory operations. Any pointers to where I should check to make sure multicast tensors are not used for non-memory ops?
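One possible shape of such a check, purely as a hedged sketch (the function, its call site, and the name-based type test are assumptions, not code from this PR):

```python
# Hypothetical guard during type propagation: only the memory ops listed
# above may receive a MulticastTensor argument.
MEMORY_OPS = {"load", "store", "atomic_add", "signal", "wait"}


def check_multicast_usage(op_name: str, args: tuple) -> None:
    if op_name in MEMORY_OPS:
        return
    # In real code this would be isinstance(a, MulticastTensor); the name
    # check just keeps the sketch self-contained.
    if any(type(a).__name__ == "MulticastTensor" for a in args):
        raise TypeError(
            f"MulticastTensor arguments are only supported by memory ops, got {op_name!r}"
        )
```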
Primary use case for this is a shared tensor on symmetric memory, i.e. each device holds a version of the tensor. I'm considering adding more APIs to create MulticastTensors. Future extensions: a multimem pointer (https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=multimem#data-movement-and-conversion-instructions-multimem) could be an alternative backend for MulticastTensor instead of a dev_ptrs tensor. However, Triton does not currently support that and we don't know the interface.
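To illustrate the dev_ptrs representation (a hedged sketch; make_dev_ptrs and the int64 pointer dtype are assumptions, not this PR's API): each entry is the raw address of one peer's copy of the tensor, e.g. buffers handed out by a symmetric-memory allocator.

```python
import torch


def make_dev_ptrs(peer_buffers: list[torch.Tensor], device: torch.device) -> torch.Tensor:
    # Collect each peer's raw data pointer into a 1D tensor of addresses.
    ptrs = [t.data_ptr() for t in peer_buffers]
    return torch.tensor(ptrs, dtype=torch.int64, device=device)
```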
But if you load a multicast tensor, don't you get a multicast device tensor? (Which you can run ops on.)
Force-pushed from ba94f3f to 7cc53a9
Force-pushed from 7cc53a9 to bf0db57
Ah, I misunderstood. How does this work with multiple dimensions? Can you have a 3D multicast tensor? Or are we requiring 1D tensors on each device, so the entire thing acts like a 2D tensor? Does the same thing happen in reverse on store?
It works with any number of dimensions for dev_ptrs & the example tensor.
Yep, it happens in reverse on store: for a 1D dev_ptrs & a 1D example tensor tile, the store value must be 2D.
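A store-direction sketch of that shape rule, reusing the hl.multicast_like API from the tests in this PR (the kernel itself is illustrative, not one of the PR's tests): with 1D dev_ptrs and a 1D tile, the stored value is 2D and is broadcast across devices.

```python
import torch

import helion
import helion.language as hl


@helion.kernel
def multicast_store_kernel(
    dev_ptrs: torch.Tensor,        # (M,) pointers, one per device
    example_tensor: torch.Tensor,  # (N,) tensor describing each device's copy
    x: torch.Tensor,               # (N,) values to replicate to every device
) -> None:
    N = example_tensor.size(0)
    for tile in hl.tile(N):
        ptr_tile = dev_ptrs[:]
        tensors = hl.multicast_like(example_tensor, ptr_tile)
        x_tile = x[tile]                 # (BLOCK_N,)
        tensors[tile] = x_tile[None, :]  # broadcast to (M, BLOCK_N) on store
```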
stack-info: PR: #346, branch: joydddd/stack/17
Force-pushed from bf0db57 to 8ee7615
def test_multicast_load_2d_tensors(self):
    @helion.kernel
    def multicast_load_kernel(
        dev_ptrs: torch.Tensor,
        example_tensor: torch.Tensor,
    ) -> torch.Tensor:
        M = dev_ptrs.size(0)
        N1, N2 = example_tensor.size()
        out = torch.empty(M, N1, N2, dtype=torch.bfloat16, device=dev_ptrs.device)

        for tile1, tile2 in hl.tile([N1, N2]):
            ptr_tile = dev_ptrs[:]
            tensors = hl.multicast_like(example_tensor, ptr_tile)
            out[:, tile1, tile2] = tensors[tile1, tile2]
        return out
e.g. tensors are 2D.
def test_multicast_load_2d_dev_ptrs(self):
    @helion.kernel
    def multicast_load_kernel_2d(
        dev_ptrs: torch.Tensor,
        example_tensor: torch.Tensor,
    ) -> torch.Tensor:
        M1, M2 = dev_ptrs.size()
        N = example_tensor.size(0)
        out = torch.empty(M1, M2, N, dtype=torch.bfloat16, device=dev_ptrs.device)

        for tile in hl.tile(N, block_size=4):
            ptr_tile = dev_ptrs[:, :]
            tensors = hl.multicast_like(example_tensor, ptr_tile)
            out[:, :, tile] = tensors[tile]
        return out
e.g. dev_ptrs are 2D
x_tile = x[tile]
tensors[tile] = x_tile[None, :]
Broadcast x_tile to all tensors at store.
x = hl.arange(M)
tensors[i] = x
Do the reverse of stack at store.
Stacked PRs:
Add multicast tensor