
[WIP] Multimodal SSM + TP #338


Draft: wants to merge 44 commits into base: raymond/debug_mm
Commits (44)
a4509d7
fix dim name (#331)
RaymondLi0 Jul 18, 2025
82eed2b
TP mamba
jlamypoirier Jul 21, 2025
4e310c7
TP mamba
jlamypoirier Jul 22, 2025
3cc4118
fix
jlamypoirier Jul 22, 2025
9f7f75c
fix
jlamypoirier Jul 22, 2025
4054e04
fixes
jlamypoirier Jul 23, 2025
0014cc6
fix
jlamypoirier Jul 23, 2025
47ad548
fixes
jlamypoirier Jul 23, 2025
6a074fa
fixes
jlamypoirier Jul 23, 2025
d66651f
Update external
jlamypoirier Jul 23, 2025
4e67fbf
Adds lm_eval to evaluations (#282)
bigximik Jul 24, 2025
50083ba
SSM debugging
jlamypoirier Jul 24, 2025
5006328
Merge branch 'main' into tp_mamba
jlamypoirier Jul 24, 2025
13176bd
Merge branch 'debug_mamba' into tp_mamba
jlamypoirier Jul 24, 2025
7b32699
stuff
jlamypoirier Jul 24, 2025
73f591f
Merge branch 'debug_mamba' into tp_mamba
jlamypoirier Jul 24, 2025
1feccc8
stuff
jlamypoirier Jul 24, 2025
e528b50
misc
jlamypoirier Jul 24, 2025
b49c42f
misc
jlamypoirier Jul 24, 2025
bb4dcd9
Merge branch 'debug_mamba' into tp_mamba
jlamypoirier Jul 24, 2025
c1b7f44
misc
jlamypoirier Jul 24, 2025
31f5d41
misc
jlamypoirier Jul 24, 2025
051bb07
Merge branch 'debug_mamba' into tp_mamba
jlamypoirier Jul 24, 2025
0a9ff25
misc
jlamypoirier Jul 24, 2025
e7d9636
Parallel discrete mamba 2
jlamypoirier Jul 24, 2025
60093cd
Merge branch 'tp_mamba' into raymond/debug_mm_tp_mamba
RaymondLi0 Jul 25, 2025
f88fb2f
rename vit layer to block
RaymondLi0 Jul 25, 2025
22296b3
block_index
RaymondLi0 Jul 25, 2025
c14b764
Mamba 2, misc
jlamypoirier Jul 25, 2025
fa21174
flexible import
RaymondLi0 Jul 25, 2025
a3e5bde
Merge branch 'tp_mamba' into raymond/debug_mm_tp_mamba
RaymondLi0 Jul 25, 2025
d3cc158
update import
RaymondLi0 Jul 28, 2025
6d245c0
fix automodel export
RaymondLi0 Jul 28, 2025
61ecb5d
try: remove assert for TP and distillation
RaymondLi0 Jul 28, 2025
2565dac
more verbose config
RaymondLi0 Jul 28, 2025
7a7f12c
use local token_ids instead of modifying batch
RaymondLi0 Jul 29, 2025
743b42c
fix allreduce
RaymondLi0 Jul 29, 2025
c7247dc
fix
RaymondLi0 Jul 29, 2025
3074ec9
revert images_sizes conversion to np array
RaymondLi0 Jul 29, 2025
c4cdd86
debug logs
RaymondLi0 Jul 30, 2025
24d7a05
rm debug logs
RaymondLi0 Aug 5, 2025
37ddef4
changes for stp reverse-kl
RaymondLi0 Aug 5, 2025
a0d7a09
reverse kl: add clamping
RaymondLi0 Aug 6, 2025
72945bc
add loss mask for vision. should also handle padded sequences
RaymondLi0 Aug 6, 2025
2 changes: 1 addition & 1 deletion .github/workflows/ci.yaml
@@ -32,7 +32,7 @@ jobs:
          pip install pybind11
          FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE FLASH_ATTENTION_FORCE_BUILD=TRUE MAMBA_SKIP_CUDA_BUILD=TRUE \
          MAMBA_FORCE_BUILD=TRUE CAUSAL_CONV1D_FORCE_BUILD=TRUE CAUSAL_CONV1D_SKIP_CUDA_BUILD=TRUE \
          pip install --no-build-isolation -e ".[CORE,OPTIONAL,HUGGINGFACE,SSM,VISION,DEV,DOCS]"
          pip install --no-build-isolation -e ".[CORE,OPTIONAL,HUGGINGFACE,SSM,VISION,GENERATION,DEV,DOCS]"
      - name: Run tests
        run: pytest -v -ra .

2 changes: 1 addition & 1 deletion .github/workflows/docs.yaml
@@ -34,7 +34,7 @@ jobs:
          pip install pybind11
          FLASH_ATTENTION_SKIP_CUDA_BUILD=TRUE FLASH_ATTENTION_FORCE_BUILD=TRUE MAMBA_SKIP_CUDA_BUILD=TRUE \
          MAMBA_FORCE_BUILD=TRUE CAUSAL_CONV1D_FORCE_BUILD=TRUE CAUSAL_CONV1D_SKIP_CUDA_BUILD=TRUE \
          pip install --no-build-isolation -e ".[CORE,OPTIONAL,HUGGINGFACE,SSM,VISION,DEV,DOCS]"
          pip install --no-build-isolation -e ".[CORE,OPTIONAL,HUGGINGFACE,SSM,VISION,GENERATION,DEV,DOCS]"
      - name: Build the documentation
        run: mkdocs build

2 changes: 1 addition & 1 deletion Dockerfile
@@ -37,7 +37,7 @@ COPY --chmod=777 ./fast_llm/__init__.py fast_llm/
COPY --chmod=777 ./fast_llm/csrc/ fast_llm/csrc/

# Install dependencies within the virtual environment.
RUN pip install --no-cache-dir --no-build-isolation -e ".[CORE,OPTIONAL,HUGGINGFACE,SSM,VISION,DEV]" triton==3.1.0
RUN pip install --no-cache-dir --no-build-isolation -e ".[CORE,OPTIONAL,HUGGINGFACE,SSM,VISION,GENERATION,DEV]" triton==3.1.0

# Copy the remaining source code with universal write permissions.
COPY --chmod=777 ./Megatron-LM Megatron-LM
2 changes: 1 addition & 1 deletion Megatron-LM
134 changes: 134 additions & 0 deletions docs/user_guide/evaluators.md
@@ -0,0 +1,134 @@
# Evaluations

Fast-LLM allows you to perform various evaluations during training or as a separate evaluation step. In both cases, you need to use your training config with `training.evaluators` specified.

For evaluators used during training, both `interval` and `offset` must be specified. Then, start training as usual with:

`fast-llm train gpt --config path/to/training/config.yaml`

To perform evaluation as a separate step, use the same training config. Depending on the training progress, either the start model or the latest checkpoint will be loaded, and `interval` and `offset` will be ignored. To start evaluation:

`fast-llm evaluate gpt --config path/to/training/config.yaml`

## Currently Supported Evaluators

- `loss`
- `lm_eval`

## Loss Evaluator

To set up loss evaluation, specify a dataset to be used in the `data.datasets` section of the config. You must also define the loss evaluator in the `training.evaluators` config section. See example below.

```yaml
training:
  evaluations:
    stack_3b:
      interval: 10
      evaluator:
        type: loss
        iterations: 10
        dataset_name: stack_3b
    fineweb:
      evaluator:
        type: loss
        iterations: 10
        dataset_name: fineweb
      interval: 10
data:
  datasets:
    stack_3b:
      type: memmap
      path: path/to/memmap/dataset
    fineweb:
      type: memmap
      path: path/to/memmap/dataset1
```

## Evaluation Harness (`lm_eval`) Evaluator

**Note:** Only data parallelism is currently supported for the `lm_eval` evaluator.

To run `lm_eval` evaluations, version `0.4.9` of `lm_eval` must be installed along with all dependencies required for your evaluation tasks.

The following environment variables may need to be set:

- `HF_HOME`: Path for Hugging Face data caching
- `WANDB_API_KEY_PATH`: Path to a file containing your Weights & Biases API key (if logging to W&B)
- `HUGGINGFACE_API_KEY_PATH`: Path to a file containing your Hugging Face hub token
- `NLTK_DATA`: Path to a directory that will contain downloaded NLTK packages (needed for some tasks)
- `HF_ALLOW_CODE_EVAL=1`: Required for some evaluation tasks

You may need to specify additional environment variables depending on the `lm_eval` tasks you want to run.
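For example, a minimal shell setup before launching training might look like the sketch below; the paths are illustrative placeholders, not required locations:

```bash
# Hugging Face cache location (illustrative path)
export HF_HOME=/data/hf_cache
# Files holding the W&B API key and Hugging Face hub token (illustrative paths)
export WANDB_API_KEY_PATH=/secrets/wandb_api_key
export HUGGINGFACE_API_KEY_PATH=/secrets/hf_token
# Download target for NLTK packages needed by some tasks
export NLTK_DATA=/data/nltk_data
# Required by some code-evaluation tasks
export HF_ALLOW_CODE_EVAL=1

fast-llm train gpt --config path/to/training/config.yaml
```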

To configure an `lm_eval` task, the evaluator config provides the following fields:

### Model Config

The model instantiated for training is reused for evaluation, so you don't need to specify it separately. However, there are some parameters specific to `lm_eval`. See `fast_llm/engine/evaluation/config.EvaluatorLmEvalConfig` for details.

### CLI Parameters for `lm_eval`

All other parameters are specified as if you were calling the `lm_eval` CLI, using a list of strings. Some CLI parameters are ignored or restricted, specifically those related to model loading, W&B, batch sizes, and device setup, as these are managed by the rest of the Fast-LLM configuration.

Also, the tokenizer must be specified in `data.tokenizer`. If the tokenizer does not have a `bos_token`, it must be specified explicitly in `data.tokenizer.bos_token`. Although `lm_eval` does not use the `bos_token` directly, it is still required because the same tokenizer is used by other Fast-LLM components.

Below is an example of the config:

```yaml
training:
  evaluations:
    lm_eval_tasks1:
      interval: 10
      evaluator:
        type: lm_eval
        cli_args:
          - --tasks
          - gsm8k,xnli_en,wikitext,ifeval
          - --output_path
          - /path/to/lm_eval/output
data:
  tokenizer:
    path: path/to/the/tokenizer
```

It is also possible to run different tasks with different intervals and offsets, for example to run slower or more comprehensive tasks less frequently:

```yaml
training:
  evaluations:
    gsm8k:
      interval: 20
      evaluator:
        type: lm_eval
        cli_args:
          - --tasks
          - gsm8k
          - --output_path
          - /path/to/lm_eval/output
          - --limit
          - "64"
    ifeval:
      offset: 10
      interval: 40
      evaluator:
        type: lm_eval
        cli_args:
          - --tasks
          - ifeval
          - --output_path
          - /path/to/lm_eval/output
          - --limit
          - "32"
    faster_tasks:
      interval: 10
      evaluator:
        type: lm_eval
        cli_args:
          - --tasks
          - xnli_en,wikitext
          - --output_path
          - /path/to/lm_eval/output
data:
  tokenizer:
    path: path/to/the/tokenizer
```
3 changes: 3 additions & 0 deletions fast_llm/cli.py
@@ -7,6 +7,7 @@
from fast_llm.engine.config_utils.logging import configure_logging
from fast_llm.engine.config_utils.run import log_main_rank
from fast_llm.engine.config_utils.runnable import RunnableConfig
from fast_llm.utils import set_global_variables

# Import these submodules to ensure classes are added to the dynamic class registry.
import fast_llm.data.auto # isort: skip
@@ -20,6 +21,8 @@
def fast_llm_main_wrapper():
    # (Pre-)configure logging
    configure_logging()
    # Set global and environment variables before third-party imports.
    set_global_variables()
    try:
        yield
    except Exception as e:
2 changes: 1 addition & 1 deletion fast_llm/config.py
@@ -735,7 +735,7 @@ def _get_class_name(cls) -> str:
    @classmethod
    def from_dict(
        cls,
        default: "Config| dict[str, typing.Any]]",
        default: "Config| dict[str, typing.Any]",
        *updates: "Config| dict[str | tuple[str, ...], typing.Any]",
        strict: bool = True,
        update_type: UpdateType = UpdateType.override,
121 changes: 121 additions & 0 deletions fast_llm/core/distributed.py
@@ -8,10 +8,13 @@

import contextlib
import datetime
import io
import logging
import pickle
import typing

import torch
import torch.monitor
from torch._C._distributed_c10d import Work
from torch.distributed import ( # noqa
    ProcessGroup,
@@ -46,6 +49,7 @@ def broadcast(
        return work
    else:
        work.wait()
        return None


def check_parallel_match(tensor: torch.Tensor, group: ProcessGroup | None, name: str) -> None:
@@ -110,6 +114,7 @@ def send(tensor: torch.Tensor, dst: int, group: ProcessGroup, async_op=False, ta
        return work
    else:
        work.wait()
        return None


def recv(tensor: torch.Tensor, src: int, group: ProcessGroup, async_op=False, tag: int = 0) -> Work | None:
@@ -119,6 +124,7 @@ def recv(tensor: torch.Tensor, src: int, group: ProcessGroup, async_op=False, ta
        return work
    else:
        work.wait()
        return None


@contextlib.contextmanager
@@ -133,3 +139,118 @@ def set_generator(generator: torch.Generator) -> typing.Generator[None, None, No
    finally:
        generator.set_state(default_generator.get_state())
        default_generator.set_state(old_state)


def gather(
    tensor: torch.Tensor,
    gather_list: list[torch.Tensor] | None = None,
    group: ProcessGroup | None = None,
    async_op: bool = False,
    dst: int = 0,
):
    assert group is not None
    opts = torch.distributed.GatherOptions()
    opts.rootRank = dst
    work = group.gather([gather_list] if dst == group.rank() else [], [tensor], opts)

    if async_op:
        return work
    elif work is not None:
        work.wait()
    return None


def scatter(
    tensor: torch.Tensor,
    scatter_list: list[torch.Tensor] | None = None,
    group: ProcessGroup | None = None,
    async_op: bool = False,
    src: int = 0,
):
    assert group is not None
    opts = torch.distributed.ScatterOptions()
    opts.rootRank = src
    opts.asyncOp = async_op
    work = group.scatter(
        [tensor if not tensor.is_complex() else torch.view_as_real(tensor)],
        [[t if not t.is_complex() else torch.view_as_real(t) for t in scatter_list]] if src == group.rank() else [],
        opts,
    )
    if async_op:
        return work
    elif work is not None:
        work.wait()
    return None


def _object_to_tensor(obj: typing.Any) -> torch.Tensor:
    f = io.BytesIO()
    pickle.Pickler(f).dump(obj)
    return torch.tensor(torch.UntypedStorage.from_buffer(f.getvalue(), dtype=torch.uint8), dtype=torch.uint8)


def _tensor_to_object(tensor: torch.Tensor) -> typing.Any:
    return pickle.Unpickler(io.BytesIO(tensor.numpy(force=True).tobytes())).load()


def gather_object(
    obj: typing.Any,
    group: ProcessGroup | None = None,
    dst: int = 0,
) -> list[typing.Any] | None:
    assert group is not None
    group_rank = group.rank()
    group_size = group.size()
    device = torch.cuda.current_device()

    obj_tensor = _object_to_tensor(None if group_rank == dst else obj)
    sizes = torch.full([group.size()], len(obj_tensor), dtype=torch.int64, device=device)
    all_gather_into_tensor(sizes, sizes[group.rank()], group=group)
    sizes = sizes.tolist()
    max_size = max(sizes)

    input_tensor = torch.empty(max_size, dtype=torch.uint8, device=device)

    if group_rank == dst:
        output_tensors = list(torch.empty(max_size * group_size, dtype=torch.uint8, device=device).chunk(group_size))
        gather(input_tensor, output_tensors, dst=dst, group=group)
        return [
            obj if rank_ == dst else _tensor_to_object(tensor[:size])
            for rank_, (tensor, size) in enumerate(zip(output_tensors, sizes, strict=True))
        ]
    else:
        input_tensor[: obj_tensor.numel()].copy_(obj_tensor)
        gather(input_tensor, None, dst=dst, group=group)
        return None


def scatter_object(
    scatter_object_input_list: typing.Optional[list[typing.Any]] = None,
    group: ProcessGroup | None = None,
    src: int = 0,
) -> typing.Any:
    assert group is not None
    group_rank = group.rank()
    group_size = group.size()
    device = torch.cuda.current_device()

    if group_rank == src:
        tensor_list = [
            _object_to_tensor(None if rank_ == src else obj) for rank_, obj in enumerate(scatter_object_input_list)
        ]
        sizes = [tensor.numel() for tensor in tensor_list]
        max_size = max(sizes)
        size_tensor = torch.tensor([[size, max_size] for size in sizes], dtype=torch.int64, device=device)
        scatter(size_tensor[group_rank], list(size_tensor.unbind()), src=src, group=group)
        scatter_list = list(torch.empty(max_size * group_size, dtype=torch.uint8, device=device).chunk(group_size))
        for scatter_tensor, tensor, size in zip(scatter_list, tensor_list, sizes, strict=True):
            scatter_tensor[:size].copy_(tensor)
        scatter(scatter_list[src], scatter_list, src=src, group=group)
        return scatter_object_input_list[src]
    else:
        size_tensor = torch.empty(2, dtype=torch.int64, device=device)
        scatter(size_tensor, None, src=src, group=group)
        size, max_size = size_tensor.tolist()
        output_tensor = torch.empty(max_size, dtype=torch.uint8, device=device)
        scatter(output_tensor, None, src=src, group=group)
        return _tensor_to_object(output_tensor[:size])
1 change: 0 additions & 1 deletion fast_llm/data/dataset/gpt/memmap.py
@@ -177,7 +177,6 @@ def _init(
        assert self._num_pixels == num_pixels
        if num_tokens is not None:
            assert self._num_tokens == num_tokens
        self._image_sizes = np.array(self._image_sizes, dtype=np.int32)

    def __getstate__(self) -> tuple[str, pathlib.Path, int | None, int | None]:
        return (self._name, self._prefix, self._num_documents, self._num_tokens, self._num_pixels)
2 changes: 1 addition & 1 deletion fast_llm/data/dataset/gpt/sampled.py
@@ -143,7 +143,7 @@ def _sample(self) -> None:
        # Get the document sizes, the main information needed for sampling.
        document_sizes, image_sizes = self._indexed_dataset.get_document_sizes()
        document_sizes = torch.from_numpy(document_sizes).to(self._device)
        if image_sizes.any():
        if image_sizes:
            image_token_sizes = []
            for i, sizes in enumerate(image_sizes):
                image_token_sizes.append(
2 changes: 1 addition & 1 deletion fast_llm/data/preparator/gpt_memmap/prepare.py
@@ -458,7 +458,7 @@ def _split_and_blend_dataset_configs(
        text_sizes, image_sizes = dataset.get_document_sizes()
        tokens_cumsum = text_sizes.cumsum()
        Assert.eq(tokens_cumsum[-1], dataset_config.num_tokens)
        if image_sizes.any():
        if image_sizes:
            num_pixels_cumsum = np.cumsum([x.prod(axis=1).sum() for x in image_sizes])
            # We use the patch sizes only for the purposes of even splitting and blending weights.
            # We can always use a different patch size for training without any significant impact