
Commit 742d1dd

Merge branch 'main' into user/nzmora/add_mem_logs
2 parents 90eeb01 + cf8a1d2 commit 742d1dd

File tree

21 files changed: +494 -214 lines changed

examples/disaggregated/README.md

Lines changed: 34 additions & 2 deletions
@@ -204,7 +204,39 @@ srun -A <account> -p <partition> -t <time> \
 Additionally, we offer a fully executable script—please refer to [Disaggregated SLURM Scripts](./slurm/simple_example/).
 
 
-## Dynamic scaling (Prototype)
+## Dynamic scaling
+
+### Service discovery method
+
+The disaggregated server also supports dynamic service discovery and auto-scaling of context/generation servers. To enable this, set the `disagg_cluster` section in the configurations of both the context/generation servers and the disaggregated server. In this case, the context/generation servers must be started with an extra command-line option `--server-role=[context|generation]`, and the `context_servers`/`generation_servers` sections must be removed from the disaggregated server's config. You can simplify the context/generation servers' configs by passing only `--disagg_cluster_uri=<disagg_cluster_uri>` on the command line (the disaggregated server's config must still contain the `disagg_cluster` section). Omitted fields fall back to the defaults shown below.
+
+```yaml
+disagg_cluster:
+  cluster_uri: <your_cluster_uri>
+  cluster_name: ""
+  minimal_instances:
+    context_servers: 1
+    generation_servers: 1
+  heartbeat_interval_sec: 5
+  inactive_interval_sec: 10
+```
+- `cluster_uri`: the HTTP address of the disaggregated server, e.g. `http://<your-disagg-server-host>:<your-disagg-server-port>`, or a pre-configured Etcd server address such as `etcd://<your-etcd-host>:2379`.
+- `cluster_name`: optional namespace to isolate multiple disagg clusters in Etcd.
+- `minimal_instances`: the equivalent of `num_instances` in the auto-scaling concept; the disaggregated server rejects requests when
+the number of active context/generation servers is below the corresponding threshold.
+- `heartbeat_interval_sec`: frequency at which context/generation servers send heartbeats to the disaggregated server.
+- `inactive_interval_sec`: a server is marked inactive if no heartbeat is received within this interval (set it higher than the heartbeat interval).
+
+Note that the disaggregated server and all context/generation servers should use the same `disagg_cluster` configuration values; otherwise the disaggregated server may not be able to track keep-alive heartbeats or detect inactivity of the other servers properly. If the `disagg_cluster` section is specified, the `context_servers`/`generation_servers` sections must not be present in the disaggregated server's config.
+
+Additionally, we offer a fully executable script—please refer to [Disaggregated SLURM Scripts](./slurm/service_discovery_example/).
+
+#### Dynamically adding servers
+
+To add servers dynamically, start more context/generation workers with the same `disagg_cluster` configuration; the disaggregated server will discover the new servers and dispatch requests to them automatically. If a context/generation server becomes inactive, the disaggregated server will also detect this and stop routing requests to it.
+
+
+### Metadata server method (Prototype)
 
 Currently, trtllm supports dynamic addition and removal of servers by leveraging ETCD. To enable this feature, you should start the context and generation servers with an additional flag ```--metadata_server_config_file``` and ```--server_role```.
 Before launching the context and generation servers, you should first start the ETCD server. By default, the ETCD server listens for client requests at ```localhost:2379```.
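For illustration, a minimal single-node sketch of the service-discovery method added in the hunk above. The flags mirror the SLURM example introduced by this commit; the hostnames, ports, and config file names are placeholders rather than values from the diff.

```bash
# Illustrative sketch only; flags are taken from the SLURM example in this
# commit, while hostnames, ports, and file names are placeholders.
DISAGG_CLUSTER_URI="http://localhost:8000"

# Disaggregated server: its YAML carries the disagg_cluster section and no
# context_servers/generation_servers sections.
trtllm-serve disaggregated -c disagg_config.yaml &

# Workers register themselves via --disagg_cluster_uri and --server-role,
# so their configs need no disagg_cluster section.
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --host 0.0.0.0 --port 8001 \
    --extra_llm_api_options ctx_extra-llm-api-config.yaml \
    --disagg_cluster_uri ${DISAGG_CLUSTER_URI} --server-role context &

trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --host 0.0.0.0 --port 8002 \
    --extra_llm_api_options gen_extra-llm-api-config.yaml \
    --disagg_cluster_uri ${DISAGG_CLUSTER_URI} --server-role generation &
```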
@@ -240,7 +272,7 @@ refersh_interval: 10.0
 
 The ```hostname``` and ```port``` must match those used when starting the ETCD server. The ```health_check_timeout``` parameter specifies how long a server will be considered dead if no healthy response is received. By default, trtllm will perform two checks before marking a server as dead. The ```refresh_interval``` parameter determines how often the latest server list is fetched from the ETCD server.
 
-### Dynamically adding servers
+#### Dynamically adding servers
 
 Users can add servers by directly launching them with trtllm-serve. For example, you can start an additional generation server as follows:
 
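As a hedged sketch of the metadata-server flow (separate from the README's own example, which follows in the unchanged lines): the flags `--metadata_server_config_file` and `--server_role` and the config fields come from the text above, while the file name, role spellings, model, ports, and the `health_check_timeout` value are assumptions for illustration.

```bash
# Illustrative sketch only; names and values below are assumed.
etcd &   # by default etcd listens for client requests at localhost:2379

# Metadata server config with the fields described above.
cat > metadata_config.yaml << EOL
hostname: localhost
port: 2379
health_check_timeout: 5.0
refresh_interval: 10.0
EOL

trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host 0.0.0.0 --port 8001 \
    --metadata_server_config_file metadata_config.yaml --server_role context &
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host 0.0.0.0 --port 8002 \
    --metadata_server_config_file metadata_config.yaml --server_role generation &
```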
Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+#!/bin/bash
+#SBATCH --partition=${partition}
+#SBATCH --account=${account}
+#SBATCH --job-name=${job_name}
+#SBATCH --time=02:00:00
+
+container_image="${container_image:-}"
+mount_paths="${mount_paths:-}"
+work_path="${work_path:-}"
+enable_etcd="${enable_etcd:-0}"
+disagg_port="8000"
+ctx_port="8001"
+gen_port="8002"
+
+# use the first node as the disaggregated server node
+disagg_server_node=$(head -n 1 <(scontrol show hostnames $SLURM_JOB_NODELIST))
+
+if [[ "$enable_etcd" == "1" ]]; then
+    # you can optionally launch an etcd server; the container image must have etcd installed
+    disagg_cluster_uri="etcd://${disagg_server_node}:2379"
+    srun --container-image=${container_image} \
+        --container-mounts=${mount_paths} \
+        -w $disagg_server_node -N 1 --ntasks-per-node=1 \
+        --mpi=pmix \
+        bash -c "etcd" &
+    sleep 5 # wait for etcd to start
+else
+    # or use the disaggregated server's http address as the built-in service discovery server
+    disagg_cluster_uri="http://${disagg_server_node}:${disagg_port}"
+fi
+
+cat >${work_path}/disagg_config.yaml << EOL
+hostname: localhost
+port: ${disagg_port}
+backend: pytorch
+disagg_cluster:
+  cluster_uri: ${disagg_cluster_uri}
+  cluster_name: example_cluster
+EOL
+
+cat >${work_path}/ctx_extra-llm-api-config.yaml << EOL
+disable_overlap_scheduler: True
+cache_transceiver_config:
+  backend: UCX
+  max_tokens_in_buffer: 2048
+EOL
+
+cat >${work_path}/gen_extra-llm-api-config.yaml << EOL
+cache_transceiver_config:
+  backend: UCX
+  max_tokens_in_buffer: 2048
+EOL
+
+# Launch a proxy without any context/generation servers.
+srun --container-image=${container_image} \
+    --container-mounts=${mount_paths} \
+    -w $disagg_server_node -N 1 --ntasks-per-node=1 \
+    --mpi=pmix \
+    bash -c "trtllm-llmapi-launch trtllm-serve disaggregated -c ${work_path}/disagg_config.yaml" &
+
+# Launch a context server with `tp_size=8` using two 4-GPU nodes; it registers itself through disagg_cluster_uri.
+srun --container-image=${container_image} \
+    --container-mounts=${mount_paths} \
+    -N 2 --ntasks-per-node=4 \
+    --mpi=pmix \
+    bash -c "trtllm-llmapi-launch trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tp_size 8 --host 0.0.0.0 --port ${ctx_port} --extra_llm_api_options ${work_path}/ctx_extra-llm-api-config.yaml --disagg_cluster_uri ${disagg_cluster_uri} --server-role context" &
+
+# Launch a generation server with `tp_size=4` using one 4-GPU node.
+srun --container-image=${container_image} \
+    --container-mounts=${mount_paths} \
+    -N 1 --ntasks-per-node=4 \
+    --mpi=pmix \
+    bash -c "trtllm-llmapi-launch trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --tp_size 4 --host 0.0.0.0 --port ${gen_port} --extra_llm_api_options ${work_path}/gen_extra-llm-api-config.yaml --disagg_cluster_uri ${disagg_cluster_uri} --server-role generation" &
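For reference, a sketch of how this batch script might be submitted; the script file name is a placeholder, and the exported variables are the ones read at the top of the script. Since all three `srun` steps are backgrounded, the batch script should also be kept alive (for example with a trailing `wait`) for as long as the servers are needed.

```bash
# Hypothetical submission; the script name and values are placeholders.
sbatch --partition=<partition> --account=<account> --job-name=disagg_service_discovery \
    --export=ALL,container_image=<image>,mount_paths=<host_dir:container_dir>,work_path=<shared_dir>,enable_etcd=0 \
    service_discovery_example.sbatch
```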

tensorrt_llm/_torch/models/modeling_qwen2vl.py

Lines changed: 2 additions & 79 deletions
@@ -2,10 +2,8 @@
 import os
 from typing import Any, Dict, List, Optional, Tuple, Union
 
-import numpy as np
 import torch
 import torch.nn as nn
-from PIL import Image
 from torch.nn import functional as F
 from transformers import (AutoProcessor, AutoTokenizer, PretrainedConfig,
                           PreTrainedModel)
@@ -31,7 +29,6 @@
                         ExtraProcessedInputs, InputProcessor,
                         MultimodalPlaceholderMetadata,
                         MultimodalPlaceholderPlacement, TextPrompt,
-                        default_multimodal_input_loader,
                         register_input_processor)
 from ...logger import logger
 from ...sampling_params import SamplingParams
@@ -95,6 +92,8 @@ def __init__(self,
                 model_config: PretrainedConfig,
                 tokenizer: AutoTokenizer,
                 trust_remote_code: bool = True):
+
+        super().__init__()
         self.model_config = model_config
         self.tokenizer = tokenizer if tokenizer is not None else AutoTokenizer.from_pretrained(
             model_path)
@@ -284,81 +283,6 @@ def get_rope_index(
             mrope_position_deltas, device=input_ids.device).unsqueeze(1)
         return position_ids, mrope_position_deltas
 
-    def get_dummy_text(self, input_seq_len: int) -> str:
-        ids = np.random.randint(
-            low=0,
-            high=int(
-                self.model_config.vocab_size), # high is exclusive in NumPy
-            size=input_seq_len,
-        ).tolist()
-        return self.tokenizer.decode(ids, skip_special_tokens=True)
-
-    def get_dummy_image(self, max_width: int, max_height: int):
-        image = Image.new("RGB", (max_width, max_height), color=255)
-        return image
-
-    def get_dummy_prompt(self, input_seq_len: int):
-        text = ""
-        # we use the max resolution as starting point
-        img_max_dim = 3584
-        image = self.get_dummy_image(max_width=img_max_dim,
-                                     max_height=img_max_dim)
-
-        test_mm_prompt = default_multimodal_input_loader(
-            tokenizer=self.tokenizer,
-            model_dir=self.model_path,
-            model_type=self.model_config.model_type,
-            modality="image",
-            prompts=[text],
-            media=[[image]],
-            image_data_format="pt")[0]
-
-        prompt_token_ids_single_img, _ = self(test_mm_prompt, None)
-
-        # if the max img resolution results in a number of tokens greater then
-        # input_seq_len, we keep lowering the resolution such as to find the
-        # max resolution such as it does not exceed the input_seq_len
-        while len(prompt_token_ids_single_img) > input_seq_len:
-            # reduce img resolution
-            img_max_dim = img_max_dim >> 1
-
-            image = self.get_dummy_image(max_width=img_max_dim,
-                                         max_height=img_max_dim)
-
-            test_mm_prompt = default_multimodal_input_loader(
-                tokenizer=self.tokenizer,
-                model_dir=self.model_path,
-                model_type=self.model_config.model_type,
-                modality="image",
-                prompts=[text],
-                media=[[image]],
-                image_data_format="pt")[0]
-
-            prompt_token_ids_single_img, _ = self(test_mm_prompt, None)
-
-        len_prompt_tokens_ids = len(prompt_token_ids_single_img)
-        # There are corner cases where if we strictly try to generate a text based
-        # on how many tokens we need to complete the input_seq_len, the output of
-        # default_multimodal_input_loader may give more tokens then the input_seq_len and this
-        # can lead to errors.
-        # That is why we try to clip the variable text_token_left to a lower threshold
-        # but close enough to the actual input_seq_len
-        text_generation_perc_threshold = 0.95
-        text_token_left = int((input_seq_len - len_prompt_tokens_ids) *
-                              text_generation_perc_threshold)
-
-        if text_token_left > 0:
-            text = self.get_dummy_text(text_token_left)
-
-        return default_multimodal_input_loader(
-            tokenizer=self.tokenizer,
-            model_dir=self.model_path,
-            model_type=self.model_config.model_type,
-            modality="image",
-            prompts=[text],
-            media=[[image]],
-            image_data_format="pt")[0]
-
     def _preprocess(self, text: dict[str, any], mm_data: dict[str, any],
                     mm_processor_kwargs: Dict[str, Any]):
         images = mm_data.get("image")
@@ -1018,7 +942,6 @@ def forward(
 
             mm_embeds = find_input_mm_embeds(
                 mm_embeds, multimodal_params[:num_context_requests])
-
        if not self.model_config.pretrained_config.disable_fuse_rope:
            mrope_config = self.prepare_mrope_config(
                multimodal_params, num_context_requests)
tensorrt_llm/_torch/modules/fused_moe/fused_moe_cutlass.py

Lines changed: 67 additions & 23 deletions
@@ -6,11 +6,12 @@
 
 from tensorrt_llm._mnnvl_utils import MnnvlMemory, MnnvlMoe
 from tensorrt_llm._torch.distributed.moe_alltoall import MoeAlltoAll
+from tensorrt_llm.logger import logger
 
 from ...distributed import allgather
 from ...model_config import ModelConfig
 from ...utils import AuxStreamType, EventType, Fp4QuantizedTensor, ceil_div
-from .interface import MoE
+from .interface import AlltoallMethodType, MoE
 
 # isort: off
 from .quantization import (
@@ -140,28 +141,44 @@ def __init__(
         self.has_been_profiled_min_latency = False
 
         # TODO: AlltoAll code is largely duplicated with WideEPMoE. Consider refactor and reuse in the future.
+        self.alltoall_method_type = self.select_alltoall_method_type()
+        logger.info_once(
+            f"{self.__class__.__name__} selects alltoall_method_type {self.alltoall_method_type!r}",
+            key="alltoall_method_type")
         self.alltoall_workspace = None
         self.alltoall_prepare_workspace = None
+        self.use_low_precision_combine = False
         if self.enable_alltoall:
-            if self.moe_alltoall_backend == "mnnvllatency":
-                MnnvlMemory.initialize()
-                self.alltoall_workspace = MnnvlMoe.get_moe_workspaces(
-                    model_config.mapping)
-                self.alltoall_prepare_workspace = MnnvlMoe.get_moe_prepare_workspace(
-                    model_config.mapping)
-            elif self.moe_alltoall_backend == "mnnvlthroughput":
-                workspace_mb = int(
-                    os.environ.get("TRTLLM_MOE_A2A_WORKSPACE_MB", "512"))
-                self.moe_a2a = MoeAlltoAll(
-                    mapping=self.mapping,
-                    max_num_tokens_per_rank=model_config.max_num_tokens,
-                    top_k=self.routing_method.experts_per_token,
-                    num_experts=self.num_experts,
-                    workspace_size_per_rank=workspace_mb * 1024 * 1024,
+            self.use_low_precision_combine = model_config.use_low_precision_moe_combine
+
+            if self.alltoall_method_type == AlltoallMethodType.MNNVL:
+                if self.moe_alltoall_backend == "mnnvllatency":
+                    MnnvlMemory.initialize()
+                    self.alltoall_workspace = MnnvlMoe.get_moe_workspaces(
+                        model_config.mapping)
+                    self.alltoall_prepare_workspace = MnnvlMoe.get_moe_prepare_workspace(
+                        model_config.mapping)
+                elif self.moe_alltoall_backend == "mnnvlthroughput":
+                    workspace_mb = int(
+                        os.environ.get("TRTLLM_MOE_A2A_WORKSPACE_MB", "512"))
+                    self.moe_a2a = MoeAlltoAll(
+                        mapping=self.mapping,
+                        max_num_tokens_per_rank=model_config.max_num_tokens,
+                        top_k=self.routing_method.experts_per_token,
+                        num_experts=self.num_experts,
+                        workspace_size_per_rank=workspace_mb * 1024 * 1024,
+                    )
+                else:
+                    raise ValueError(
+                        f"Unsupported moe alltoall backend: {self.moe_alltoall_backend}"
+                    )
+            elif self.alltoall_method_type == AlltoallMethodType.DeepEP or self.alltoall_method_type == AlltoallMethodType.DeepEPLowLatency:
+                raise NotImplementedError(
+                    "DeepEP and DeepEPLowLatency are not supported for CutlassFusedMoE yet"
                 )
             else:
-                raise ValueError(
-                    f"Unsupported moe alltoall backend: {self.moe_alltoall_backend}"
+                raise NotImplementedError(
+                    f"Not available alltoall method type: {self.alltoall_method_type!r}"
                 )
 
         # If True, the router weight will be multiplied on the input rather than at the end of FC2
@@ -204,13 +221,38 @@ def has_int8_woq_per_channel(self):
         return self.quant_config.layer_quant_mode.is_int8_weight_only(
         ) and not self.quant_config.layer_quant_mode.has_per_group_scaling()
 
+    def select_alltoall_method_type(self) -> AlltoallMethodType:
+        all2all_method_type = os.environ.get("TRTLLM_FORCE_ALLTOALL_METHOD")
+        if all2all_method_type is not None:
+            if AlltoallMethodType[all2all_method_type] in [
+                    AlltoallMethodType.DeepEP,
+                    AlltoallMethodType.DeepEPLowLatency
+            ]:
+                raise NotImplementedError(
+                    "DeepEP and DeepEPLowLatency are not supported for CutlassFusedMoE yet"
+                )
+            return AlltoallMethodType[all2all_method_type]
+
+        if not self.mapping.enable_attention_dp:
+            return AlltoallMethodType.NotEnabled
+
+        if self.mapping.tp_size == 1:
+            return AlltoallMethodType.NotEnabled
+
+        if os.environ.get("TRTLLM_MOE_DISABLE_ALLTOALLV", "0") == "1":
+            return AlltoallMethodType.NotEnabled
+
+        if not (self.mapping.moe_ep_size > self.routing_method.experts_per_token
+                and MnnvlMemory.supports_mnnvl()):
+            return AlltoallMethodType.NotEnabled
+
+        return AlltoallMethodType.MNNVL
+
     @cached_property
     def enable_alltoall(self):
-        return (self.mapping.moe_ep_size > self.routing_method.experts_per_token
-                and self.mapping.enable_attention_dp
-                and self.mapping.tp_size > 1
-                and os.environ.get("TRTLLM_MOE_DISABLE_ALLTOALLV", "0") != "1"
-                and MnnvlMemory.supports_mnnvl())
+        """ enable_alltoall (bool): whether to enable alltoall instead of allgather/reducescatter
+        """
+        return self.alltoall_method_type != AlltoallMethodType.NotEnabled
 
     @cached_property
     def moe_alltoall_backend(self):
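The selection logic above is controlled by environment variables that already appear in this file; a quick, hedged reference sketch (the values shown are examples, except the 512 MiB workspace default read in `__init__`):

```bash
# Force a specific all-to-all method; DeepEP variants raise NotImplementedError
# for CutlassFusedMoE, per select_alltoall_method_type above.
export TRTLLM_FORCE_ALLTOALL_METHOD=MNNVL

# Disable MoE all-to-all even when attention DP, EP size, and MNNVL support
# would otherwise enable it.
export TRTLLM_MOE_DISABLE_ALLTOALLV=1

# Per-rank workspace size in MiB for the "mnnvlthroughput" backend (default 512).
export TRTLLM_MOE_A2A_WORKSPACE_MB=512
```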
@@ -510,6 +552,8 @@ def forward_chunk(
                     ep_rank=self.ep_rank,
                     ep_size=self.ep_size,
                     top_k=top_k,
+                    use_low_precision_combine=self.
+                    use_low_precision_combine,
                     token_count=token_count)
             elif self.moe_alltoall_backend == "mnnvlthroughput":
                 hidden = final_hidden_states.shape[-1]
