opti-qwen2-vl-pre-process #1094

Conversation
Summary of Changes
Hello @SangChengC, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly optimizes the image preprocessing pipeline for Qwen2-VL models by refactoring the image processing logic to leverage BaseImageProcessorFast and torch-based operations.
Code Review
This pull request focuses on optimizing the vision pre-processing pipeline for Qwen2-VL models. The changes include refactoring the image processing logic to be more efficient by leveraging BaseImageProcessorFast and torch-based operations, which should yield significant performance improvements. Additionally, TCP_NODELAY is enabled for rpyc connections to reduce network latency. My review includes suggestions to improve code correctness, reusability, and maintainability, such as correcting a type hint, removing a hardcoded device to prevent redundant operations, and flagging the use of private library attributes which could be a future maintenance risk.
    min_pixels=self.min_pixels,
    max_pixels=self.max_pixels,
image_arr = np.asarray(image, dtype=np.uint8)
image_data = torch.from_numpy(image_arr).permute(2, 0, 1).contiguous().to("cuda", non_blocking=True)
The preprocess method hardcodes moving the image tensor to "cuda". This makes the function less flexible and tightly coupled to a specific device. Furthermore, the calling code in qwen2_5_visual.py and qwen2_visual.py also moves the tensors to "cuda", leading to redundant device transfers. It's better to keep this preprocessing function device-agnostic and let the caller manage device placement.
Suggested change:
-image_data = torch.from_numpy(image_arr).permute(2, 0, 1).contiguous().to("cuda", non_blocking=True)
+image_data = torch.from_numpy(image_arr).permute(2, 0, 1).contiguous()
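For illustration, a self-contained sketch of the caller-managed device placement the comment recommends; `preprocess_cpu` and the dummy image below are hypothetical stand-ins, not code from this PR:

```python
import torch

# Stand-in for the device-agnostic preprocessing suggested above: it returns a
# CPU tensor and leaves device placement entirely to the caller. (Illustrative only.)
def preprocess_cpu(image_arr: torch.Tensor) -> torch.Tensor:
    return image_arr.permute(2, 0, 1).contiguous()  # HWC -> CHW, stays on CPU

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

image_arr = torch.zeros(224, 224, 3, dtype=torch.uint8)  # dummy HWC image
image_data = preprocess_cpu(image_arr)
# Single, explicit transfer decided by the caller, not by the preprocessing code.
image_data = image_data.to(device, non_blocking=True)
```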
 def resize_image(
     image_file: Image.Image, factor: int = IMAGE_FACTOR, min_pixels: int = MIN_PIXELS, max_pixels: int = MAX_PIXELS
-) -> tuple[Image.Image, int, int]:
+) -> tuple[Image.Image]:
The return type hint for resize_image is tuple[Image.Image], but the function returns a single Image.Image object. This mismatch can be confusing and might be flagged by static type checkers. It should be corrected to Image.Image to accurately reflect the return value.
Suggested change:
 def resize_image(
     image_file: Image.Image, factor: int = IMAGE_FACTOR, min_pixels: int = MIN_PIXELS, max_pixels: int = MAX_PIXELS
-) -> tuple[Image.Image]:
+) -> Image.Image:
 def preprocess(self, image) -> Tuple[torch.Tensor, torch.Tensor]:
     if self.do_convert_rgb:
         image = convert_to_rgb(image)
-    image = to_numpy_array(image)
-    input_data_format = infer_channel_dimension_format(image)
-    height, width = get_image_size(image, channel_dim=input_data_format)
-    resized_height, resized_width = height, width
-    if self.do_resize:
-        resized_height, resized_width = smart_resize(
-            height,
-            width,
-            factor=self.patch_size * self.merge_size,
-            min_pixels=self.min_pixels,
-            max_pixels=self.max_pixels,
-        )
+    image_arr = np.asarray(image, dtype=np.uint8)
+    image_data = torch.from_numpy(image_arr).permute(2, 0, 1).contiguous().to("cuda", non_blocking=True)
+    grouped_images, grouped_images_index = group_images_by_shape(
+        [image_data], disable_grouping=self.disable_grouping
+    )
+    resized_images_grouped = {}
+    for shape, stacked_images in grouped_images.items():
+        height, width = stacked_images.shape[-2:]
+        if self.do_resize:
+            resized_height, resized_width = smart_resize(
+                height,
+                width,
+                factor=self.patch_size * self.merge_size,
+                min_pixels=self.min_pixels,
+                max_pixels=self.max_pixels,
+            )
+            stacked_images = self.resize(
+                image=stacked_images,
+                size=SizeDict(height=resized_height, width=resized_width),
+                interpolation=self.interpolation,
+            )
+        resized_images_grouped[shape] = stacked_images
+    resized_images = reorder_images(resized_images_grouped, grouped_images_index)
+
+    # Group images by size for further processing
+    # Needed in case do_resize is False, or resize returns images with different sizes
+    grouped_images, grouped_images_index = group_images_by_shape(
+        resized_images, disable_grouping=self.disable_grouping
+    )
+    processed_images_grouped = {}
+    processed_grids = {}
+    for shape, stacked_images in grouped_images.items():
+        resized_height, resized_width = stacked_images.shape[-2:]
+        # Fused rescale and normalize
+        patches = self.rescale_and_normalize(
+            stacked_images, self.do_rescale, self.rescale_factor, self.do_normalize, self.image_mean, self.image_std
+        )
-        image = resize(
-            image, size=(resized_height, resized_width), resample=self.resample, input_data_format=input_data_format
-        )
+        if patches.ndim == 4:
+            # add a temporal dimension if we have images
+            patches = patches.unsqueeze(1)
+        if patches.shape[1] % self.temporal_patch_size != 0:
+            repeats = patches[:, -1:].repeat(1, self.temporal_patch_size - 1, 1, 1, 1)
+            patches = torch.cat([patches, repeats], dim=1)
+        batch_size, grid_t, channel = patches.shape[:3]
+        grid_t = grid_t // self.temporal_patch_size
+        grid_h, grid_w = resized_height // self.patch_size, resized_width // self.patch_size
+
+        patches = (
+            patches.view(
+                batch_size,
+                grid_t,
+                self.temporal_patch_size,
+                channel,
+                grid_h // self.merge_size,
+                self.merge_size,
+                self.patch_size,
+                grid_w // self.merge_size,
+                self.merge_size,
+                self.patch_size,
+            )
+            .permute(0, 1, 4, 7, 5, 8, 3, 2, 6, 9)
+            .contiguous()
+        )
-    if self.do_rescale:
-        image = self.rescale(image, scale=self.rescale_factor, input_data_format=input_data_format)
-    if self.do_normalize:
-        image = self.normalize(
-            image=image, mean=self.image_mean, std=self.image_std, input_data_format=input_data_format
-        )
+        # Reorder dimensions to group grid and patch information for subsequent flattening.
+        # (batch, grid_t, grid_h, grid_w, merge_h, merge_w, channel, temp_patch_size, patch_h, patch_w)
+        flatten_patches = patches.view(
+            batch_size,
+            grid_t * grid_h * grid_w,
+            channel * self.temporal_patch_size * self.patch_size * self.patch_size,
+        )
-    image = to_channel_dimension_format(image, self.data_format, input_channel_dim=input_data_format)
-    patches = np.array([image])
-    if patches.shape[0] == 1:
-        # why to copy image 2 times. use self.temporal_patch_size = 2.
-        patches = np.tile(patches, (self.temporal_patch_size, 1, 1, 1))
-    channel = patches.shape[1]
-    grid_t = patches.shape[0] // self.temporal_patch_size
-    grid_h, grid_w = resized_height // self.patch_size, resized_width // self.patch_size
-    patches = patches.reshape(
-        grid_t,
-        self.temporal_patch_size,
-        channel,
-        grid_h // self.merge_size,
-        self.merge_size,
-        self.patch_size,
-        grid_w // self.merge_size,
-        self.merge_size,
-        self.patch_size,
-    )
-    patches = patches.transpose(0, 3, 6, 4, 7, 2, 1, 5, 8)
-    flatten_patches = patches.reshape(
-        grid_t * grid_h * grid_w, channel * self.temporal_patch_size * self.patch_size * self.patch_size
-    )
-    image_grid_thw = (grid_t, grid_h, grid_w)
-    pixel_values = torch.as_tensor(flatten_patches)
-    grid_thw = torch.as_tensor([image_grid_thw])
+        processed_images_grouped[shape] = flatten_patches
+        processed_grids[shape] = [[grid_t, grid_h, grid_w]] * batch_size
+
+    processed_images = reorder_images(processed_images_grouped, grouped_images_index)
+    processed_grids = reorder_images(processed_grids, grouped_images_index)
+    pixel_values = torch.cat(processed_images, dim=0)  # (num_patches_total, C*T*ps*ps)
+    image_grid_thw = torch.as_tensor(processed_grids)
-    return pixel_values, grid_thw
+    return pixel_values, image_grid_thw
The preprocess method is implemented to handle a single image, but its implementation uses primitives like group_images_by_shape which are designed for batch processing. This is a missed opportunity for a significant performance improvement. By modifying the preprocess method to accept a list of images, you could leverage true batch processing and avoid calling it in a loop from the encode methods. While the calling encode methods are mostly outside the current diff, consider this for a future refactoring to further boost performance.
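A rough, self-contained sketch of the shape-grouping idea behind the suggested batch preprocessing; `group_by_shape` here is a simplified stand-in for the `group_images_by_shape` helper used in the diff, and the toy tensors are illustrative only:

```python
from collections import defaultdict
import torch

def group_by_shape(tensors):
    """Group CHW tensors by (H, W) so same-sized images can be stacked and
    processed together, instead of one preprocess call per image."""
    groups = defaultdict(list)
    for idx, t in enumerate(tensors):
        groups[tuple(t.shape[-2:])].append((idx, t))
    return groups

# Toy usage: the two 224x224 images land in one stacked batch of shape (2, 3, 224, 224).
images = [torch.rand(3, 224, 224), torch.rand(3, 224, 224), torch.rand(3, 448, 336)]
for shape, items in group_by_shape(images).items():
    batch = torch.stack([t for _, t in items])  # (N, C, H, W)
    print(shape, batch.shape)
```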
 self.enable_multimodal = args.enable_multimodal
 if self.enable_multimodal:
     self.cache_client = rpyc.connect("localhost", args.cache_port, config={"allow_pickle": True})
+    self.cache_client._channel.stream.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
Accessing private attributes like _channel and stream.sock to set socket options is risky. This implementation detail of rpyc might change in future versions, which would break this code. While this may be necessary if no public API is available on the client side, it's a maintainability concern. Consider adding a comment explaining why this is done and that it might need updates if rpyc is upgraded.
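One way to act on this suggestion, sketched under the assumption that rpyc still exposes no public client-side hook for setting socket options; the helper name `_enable_tcp_nodelay` is hypothetical, not part of this PR:

```python
import socket
import rpyc

def _enable_tcp_nodelay(conn: "rpyc.Connection") -> None:
    # NOTE: reaches into rpyc internals (_channel.stream.sock); there is no
    # public client-side API for socket options, so this may need updating
    # if rpyc changes its transport layout in a future release.
    try:
        sock = conn._channel.stream.sock
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    except AttributeError:
        # Fall back silently if the internal layout changed; latency is then
        # merely not optimized rather than the connection breaking.
        pass
```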
 self.zmq_recv_socket = context.socket(zmq.PULL)
 self.zmq_recv_socket.bind(f"{args.zmq_mode}127.0.0.1:{args.visual_port}")
 self.cache_client = rpyc.connect("localhost", args.cache_port, config={"allow_pickle": True})
+self.cache_client._channel.stream.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
 weight_dir = kvargs["weight_dir"]
 self.vit_rank_id = kvargs["vit_rank_id"]
 self.cache_client = rpyc.connect("localhost", self.cache_port, config={"allow_pickle": True})
+self.cache_client._channel.stream.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
 while repeat_count < 20:
     try:
         con = rpyc.connect("localhost", port, config={"allow_pickle": True})
+        con._channel.stream.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)