
Conversation

@hengtaoguo
Collaborator

@hengtaoguo hengtaoguo commented Nov 6, 2025

Description

  • Add image/video/audio preprocessing utils for Qwen3-Omni in MaxText.multimodal.qwen3_omni_preprocessor.preprocess_mm_data_qwen3_omni(), returning a Qwen3OmniPreprocessorOutput dataclass that bundles all preprocessed data (pixel_values, pixel_grid_thw, video_values, video_grid_thw, video_second_per_grid, audio_values, audio_mask); see the dataclass sketch after this list.
  • Add unit test comparing MaxText implementation with Qwen3-Omni's processor on HuggingFace.
  • [WIP] Refactor multimodal_utils.py:
    • MaxText.multimodal.utils: Commonly used basic functions such as image loading and normalization.
    • MaxText.multimodal.{MODEL}_preprocessor.py: Model-specific preprocessing utils.
    • MaxText.multimodal.preprocessor.py: A centralized preprocess_mm_data() function will route to model-specific preprocessing logic based on the model name.
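
For reference, a minimal sketch of the returned dataclass (field names taken from the list above; the exact types are assumptions, not the PR's actual code):

from dataclasses import dataclass

import numpy as np


@dataclass
class Qwen3OmniPreprocessorOutput:
  """Preprocessed Qwen3-Omni inputs (sketch; types assumed)."""

  pixel_values: None | np.ndarray = None        # flattened image patches
  pixel_grid_thw: None | np.ndarray = None      # (t, h, w) patch grid per image
  video_values: None | np.ndarray = None        # flattened video patches
  video_grid_thw: None | np.ndarray = None      # (t, h, w) patch grid per video
  video_second_per_grid: None | float = None    # seconds spanned by one temporal grid step
  audio_values: None | np.ndarray = None        # audio features (e.g. a log-mel spectrogram)
  audio_mask: None | np.ndarray = None          # mask of valid audio frames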

Tests

Passing unit tests for MaxText preprocess_mm_data_qwen3_omni vs HuggingFace Qwen3OmniMoeProcessor:

python -m unittest tests.check_qwen3_embedding_vs_reference.TestQwen3OmniPreprocessing

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

@hengtaoguo hengtaoguo force-pushed the hengtaoguo-pre branch 2 times, most recently from e6ff3dd to 674d9d9 Compare November 13, 2025 21:55
@hengtaoguo hengtaoguo force-pushed the hengtaoguo-pre branch 3 times, most recently from abd6cf5 to 71ba0b8 Compare November 19, 2025 06:14
@eitanporat
Collaborator

Is the functionality implemented on CPU in numpy in the torch variant? If so, is there a reason not to reuse it?

@eitanporat
Collaborator

could you add the new requirements to the pyproject toml (decord and librosa)?

@eitanporat eitanporat left a comment

hi @hengtaoguo, I left some comments

image_masks=image_masks,
)
# TODO(hengtaoguo): Add support for other multimodal models such as Llama4, refactor if needed
elif cfg.model_name in ["qwen3-omni-30b-a3b"]:

what is the reasoning behind this change?

if config.model_name in ["gemma3-4b", "gemma3-12b", "gemma3-27b"]:

images = [multimodal_utils.load_image_from_path(p) for p in config.image_path.split(",")]
processor_outputs = multimodal_utils.pre_process_gemma3_image(images)

maybe rename the functions to preprocess_mm_data_gemma3?

maybe it would be better to use a factory pattern here, e.g.:
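
A hypothetical registry-based sketch (the preprocess_mm_data_* names follow the rename suggested above and are assumptions, not the PR's code):

# Hypothetical registry mapping model names to their preprocessors.
_PREPROCESSORS = {
    "gemma3-4b": preprocess_mm_data_gemma3,
    "gemma3-12b": preprocess_mm_data_gemma3,
    "gemma3-27b": preprocess_mm_data_gemma3,
    "llama4-17b-16e": preprocess_mm_data_llama4,
    "llama4-17b-128e": preprocess_mm_data_llama4,
    "qwen3-omni-30b-a3b": preprocess_mm_data_qwen3_omni,
}


def preprocess_mm_data(config):
  """Route to the model-specific preprocessor registered for config.model_name."""
  try:
    preprocess_fn = _PREPROCESSORS[config.model_name]
  except KeyError as e:
    raise ValueError(f"No multimodal preprocessor registered for {config.model_name!r}") from e
  return preprocess_fn(config)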

elif config.model_name in ["qwen3-omni-30b-a3b"]:
from MaxText.multimodal.qwen3_omni_processor import preprocess_mm_data_qwen3_omni # pylint: disable=import-outside-toplevel

processor_outputs = preprocess_mm_data_qwen3_omni(config)

why does it accept a config?

FRAME_FACTOR = 2
FPS = 2.0
FPS_MIN_FRAMES = 4
FPS_MAX_FRAMES = 768

should move to the config

max_pixels=max_pixels,
)

with jax.default_device(jax.devices("cpu")[0]):

Why is this necessary? I am curious.
Could it support multiple CPUs, for example?


I have the same question. Could we replace jax.image.resize with other alternatives, to avoid jax grabbing a TPU for preprocessing? Perhaps we can apply the resize function from PIL on video frames? We did the same for gemma3 (https://github.com/AI-Hypercomputer/maxtext/blob/main/src/MaxText/multimodal_utils.py#L359-L367) and llama4
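
For illustration, a minimal CPU-only sketch with PIL (the helper name and the (num_frames, H, W, C) layout are assumptions):

import numpy as np
from PIL import Image


def resize_video_frames(frames: np.ndarray, height: int, width: int) -> np.ndarray:
  """Resize (num_frames, H, W, C) uint8 frames with PIL, entirely on CPU."""
  resized = [
      # PIL's resize takes (width, height), in that order.
      np.asarray(Image.fromarray(frame).resize((width, height), resample=Image.Resampling.BILINEAR))
      for frame in frames
  ]
  return np.stack(resized)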

The normalized images.
"""
images -= np.asarray(mean)
images /= np.asarray(std)

Curious why these are called mean and std, as they aren't used as the mean and std of the images.

They transform the mean and std in the following way: if mean[images] = m and std[images] = s, and x = normalize_images(images, mean, std), then mean[x] = (m - mean) / std and std[x] = s / std.
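
A toy numpy illustration of that point (hypothetical values):

import numpy as np

images = np.array([0.0, 2.0, 4.0])  # mean m = 2.0, std s ~= 1.63
mean, std = 1.0, 2.0

x = (images - np.asarray(mean)) / np.asarray(std)  # what normalize_images computes
print(x.mean())  # 0.5   == (m - mean) / std
print(x.std())   # ~0.82 == s / std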

"""
if device != "cpu":
raise ValueError(
f"Got device `{device}` for feature extraction, but feature extraction on CUDA accelerator "

I guess this was copied from torch, but this error message should be changed.

raise ValueError("db_range must be greater than zero")
spectrogram_array = np.clip(spectrogram_array, a_min=spectrogram_array.max() - db_range, a_max=None)

return spectrogram

you probably have an error here... it will raise: the clipped result is assigned to spectrogram_array, but the function returns spectrogram.
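
Presumably the intent was something like this sketch (assuming spectrogram_array holds the dB-scaled spectrogram):

spectrogram_array = np.clip(spectrogram_array, a_min=spectrogram_array.max() - db_range, a_max=None)

return spectrogram_array  # return the clipped array, not the undefined `spectrogram`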

mt_audio, mt_audio_mask = pre_process_audio_qwen3_omni(mt_audio)
processor_outputs.audio_values = mt_audio
processor_outputs.audio_mask = mt_audio_mask


this will break if you want audio and not video!

processor_outputs.pixel_grid_thw = pixel_grid_thw
processor_outputs.num_images = len(images)

if config.video_path is not None:

Did it work for you? In base.yml I see config.video_path = "", so this doesn't actually work for me.

@github-actions

🤖 Hi @hengtaoguo, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions github-actions bot left a comment

📋 Review Summary

This Pull Request introduces comprehensive preprocessing utilities for Qwen3-Omni multimodal data, including image, video, and audio. The changes involve adding new configuration parameters, a new centralized preprocessor, and model-specific preprocessing logic with corresponding unit tests. The new test coverage for Qwen3-Omni preprocessing is good, and the overall structure for multimodal handling is improving.

🔍 General Feedback

  • The refactoring aims to deprecate the old multimodal_utils.py in favor of a new MaxText.multimodal.utils.py and a centralized preprocessor.py. While the direction is positive, some remnants of the old multimodal_utils.py are still in use, and there are duplicate dataclass definitions. A clearer plan for complete deprecation and migration would be beneficial.
  • The Qwen3-Omni embedding in decoders.py is currently a placeholder (pass), indicating future work is needed for full integration.
  • Specific implementation details in the qwen3_omni_processor.py, such as hardcoded temporal dimensions for image processing, could benefit from further clarification or configurability.

mask=bidirectional_mask,
image_masks=image_masks,
)
# TODO(hengtaoguo): Add support for other multimodal models such as Llama4, refactor if needed


🟡 This pass statement indicates that multimodal embedding for qwen3-omni-30b-a3b is not yet implemented. While it unblocks the current integration, it's crucial to implement the actual embedding logic for this model to ensure full multimodal functionality. Consider creating a follow-up task to address this.

"""Multimodal data preprocessor router."""

from MaxText import multimodal_utils # TODO(hengtaoguo): deprecate this file and refactor to MaxText/multimodal/utils.py


🟡 The TODO comment on line 18 indicates a future refactoring to deprecate MaxText/multimodal_utils.py. It would be beneficial to have a clear plan or follow-up issue for this refactoring to ensure multimodal_utils.py is eventually removed and its relevant functions are moved to MaxText/multimodal/utils.py to keep the codebase clean and organized.


images = [multimodal_utils.load_image_from_path(p) for p in config.image_path.split(",")]
processor_outputs = multimodal_utils.pre_process_gemma3_image(images)
elif config.model_name in ["llama4-17b-16e", "llama4-17b-128e"]:


🟡 The PreprocessorOutput is imported from MaxText.multimodal_utils. Since a new MaxText.multimodal.utils has been introduced, it would be more consistent to use PreprocessorOutput from MaxText.multimodal.utils instead of the old multimodal_utils. This also aligns with the TODO to deprecate multimodal_utils.py.

Suggested change
elif config.model_name in ["llama4-17b-16e", "llama4-17b-128e"]:

from MaxText.multimodal import utils as mm_utils
from MaxText.multimodal.qwen3_omni_processor import Qwen3OmniPreprocessorOutput  # To resolve a potential circular dependency
# TODO(hengtaoguo): deprecate this file and refactor to MaxText/multimodal/utils.py


def preprocess_mm_data(config):
  """Preprocesses multimodal data based on the provided configuration.

  Routes to the appropriate preprocessing function based on the model name.

  Args:
    config: A `pyconfig.Config` object containing configuration parameters.

  Returns:
    A `PreprocessorOutput` object containing the processed multimodal data.
  """
  processor_outputs = mm_utils.PreprocessorOutput()  # Using the new utils


images_in = [image] if isinstance(image, np.ndarray) else image
images_out = []
grids_thw = []


🟡 In pre_process_qwen3_image, the line grid_t = 2 // temporal_patch_size hardcodes the initial temporal dimension to 2. It's unclear why 2 is chosen as the dividend here without further explanation. This might limit flexibility or be a potential source of error if temporal_patch_size is greater than 2, resulting in grid_t becoming 0.

Consider adding a comment explaining the rationale behind this fixed value, or make it configurable if different temporal dimensions are expected for image processing in the future.
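
A hedged sketch of one such comment-plus-fix (this mirrors the HuggingFace Qwen2-VL convention, where a still image is duplicated to temporal_patch_size frames; it is an assumption, not the PR's code):

# A still image is duplicated to temporal_patch_size frames before patching,
# so it always occupies exactly one temporal grid step.
grid_t = 1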

# Slaney-style mel is scaled to be approx constant energy per channel
enorm = 2.0 / (filter_freqs[2 : num_mel_filters + 2] - filter_freqs[:num_mel_filters])
mel_filters *= np.expand_dims(enorm, 0)


🟠 In preprocess_mm_data_qwen3_omni, the load_image_from_path function is called from mm_utils (which refers to MaxText.multimodal_utils). To align with the ongoing refactoring, this should be updated to use the load_image_from_path function from the new MaxText.multimodal.utils.

Suggested change
if config.image_path is not None:
  images = [mm_utils.load_image_from_path(p) for p in config.image_path.split(",")]
  pixel_values, pixel_grid_thw = pre_process_qwen3_image(images, config)
  processor_outputs.pixel_values = pixel_values
  processor_outputs.pixel_grid_thw = pixel_grid_thw
  processor_outputs.num_images = len(images)

pixel_values: None | np.ndarray = None
pixel_mask: None | np.ndarray = None
aspect_ratios: None | np.ndarray = None
num_images: int = 0


🟠 The PreprocessorOutput dataclass in src/MaxText/multimodal_utils.py seems to be a duplicate of the one introduced in src/MaxText/multimodal/utils.py. Given the TODO to deprecate multimodal_utils.py, it would be cleaner to remove this duplicated PreprocessorOutput and ensure all parts of the codebase use MaxText.multimodal.utils.PreprocessorOutput.

pixel_values=image_tiles,
pixel_mask=image_mask,
aspect_ratios=aspect_ratios_array,
)


🟡 The pre_process_image function in MaxText/multimodal_utils.py now accepts an optional config parameter, but its functionality is largely superseded by MaxText.multimodal.preprocessor.preprocess_mm_data. As part of the refactoring to deprecate multimodal_utils.py, consider either removing this function entirely if it's no longer needed, or clearly defining its role and ensuring it delegates to the new preprocessor if it must remain for backward compatibility during the transition.

@hengtaoguo
Collaborator Author

Is the functionality implemented on CPU in numpy in the torch variant? If so, is there a reason not to reuse it?

This has been a long-standing constraint: we intentionally exclude torch from our dependencies, so we cannot reuse torch's resize functions and need to reimplement everything in numpy/jnp.


IMAGE_MEAN = 127.5
IMAGE_STD = 127.5
IMAGE_FACTOR = 28
MIN_PIXELS = 4 * 28 * 28

Add some comments to explain the constants? Same for the video constants. It's not easy to tell from the names.



-def pre_process_image(image, model_name):
+def pre_process_image(image, model_name, config=None):

Why is config needed here?

"video_start": video_start,
"video_end": video_end,
}
vr = decord.VideoReader(video_path)

Add try/except to handle errors?
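
For example (a sketch; decord's exact exception type is not assumed here, so this catches broadly):

try:
  vr = decord.VideoReader(video_path)
except Exception as e:  # decord raises its own error type for missing/corrupt files
  raise ValueError(f"Failed to open video file: {video_path}") from e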

Returns:
np.ndarray: The loaded audio waveform.
"""
audio = librosa.load(data_path, sr=SAMPLE_RATE)[0]

Add try/except to handle errors?
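
For example (a sketch along the same lines):

try:
  audio = librosa.load(data_path, sr=SAMPLE_RATE)[0]
except Exception as e:  # librosa/soundfile raise on missing or undecodable files
  raise ValueError(f"Failed to load audio from: {data_path}") from e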

processor_outputs.num_videos = 1 # Only one video for now.

if config.audio_path is not None or (config.video_path is not None and config.use_audio_in_video):
mt_audio = _load_audio(config.video_path)

Should this be _load_audio(config.audio_path) when audio_path is provided?
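
Something like this sketch, perhaps (assuming audio_path should take precedence when it is set):

if config.audio_path is not None:
  mt_audio = _load_audio(config.audio_path)
elif config.video_path is not None and config.use_audio_in_video:
  mt_audio = _load_audio(config.video_path)  # fall back to the video's audio track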

@eitanporat eitanporat mentioned this pull request Nov 20, 2025
images_out.append(img_np)
grids_thw.append(img_grid_thw)

return images_out[0][0, :, :], grids_thw[0]

Why only return the first item here?
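
If all items are meant to be returned, a hedged sketch (assuming each images_out entry carries the leading singleton dimension implied by the [0, :, :] indexing above):

return (
    np.concatenate([img[0] for img in images_out], axis=0),  # patches from every image
    np.stack(grids_thw),                                     # one (t, h, w) row per image
)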
