
support cogvideox i2v & adapt to the changes in commit 607f968 #59


Open

wants to merge 2 commits into main

Conversation

Watebear
Contributor

@Watebear Watebear commented Jun 4, 2025

Contributing Guidelines

We have prepared a pre-commit hook to enforce consistent code formatting across the project. If your code complies with the standards, you should not see any errors. You can clean up your code by following the steps below:

  1. Install the required dependencies:
    pip install ruff pre-commit
  2. Then, run the following command before committing (a sample configuration is sketched after this list):
    pre-commit run --all-files
  3. Finally, please double-check your code to ensure it complies with the following additional specifications as much as possible:
  • Avoid hard-coding local paths: Make sure your submissions do not include hard-coded local paths, as these paths are specific to individual development environments and can cause compatibility issues. Use relative paths or configuration files instead.
  • Clear error handling: Implement clear error-handling mechanisms in your code so that error messages can accurately indicate the location of the problem, possible causes, and suggested solutions, facilitating quick debugging.
  • Detailed comments and documentation: Add comments to complex code sections and provide comprehensive documentation to explain the functionality of the code, input-output requirements, and potential error scenarios.
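
As a reference for step 2 above, a minimal .pre-commit-config.yaml wiring ruff into the hook might look like the sketch below. This is an assumption for illustration; the repository's actual configuration may pin different versions or hooks.

# Hypothetical sketch; the project's real .pre-commit-config.yaml may differ.
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4  # pin to whichever version the project standardizes on
    hooks:
      - id: ruff          # lint with ruff
        args: [--fix]     # optionally auto-fix what ruff can
      - id: ruff-format   # apply ruff's formatter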

Thank you for your contributions!

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Hello @Watebear, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, providing a summary of this pull request by Watebear. This PR primarily focuses on adding support for the Image-to-Video (I2V) functionality of the CogVideoX model. It also includes adaptations to align with changes introduced in a specific commit (607f968), which seem to involve modifications to the model's input processing and overall workflow to accommodate the I2V task. Key changes include introducing a new configuration file for I2V, modifying the model and scheduler to handle image latents alongside video latents, and refactoring the runner to manage the I2V specific pipeline.

Highlights

  • CogVideoX I2V Support: Adds core functionality to support Image-to-Video generation using the CogVideoX model.
  • New I2V Configuration: Introduces a dedicated JSON configuration file (configs/cogvideox_i2v.json) containing parameters specific to the CogVideoX I2V task.
  • Image Latent Handling: Modifies the model's inference process and the scheduler's preparation steps to correctly encode, process, and integrate image latents for I2V generation.
  • Runner Refactoring: Refactors the CogVideoX runner to streamline model loading and implement the I2V specific workflow, including image encoding and adjusted target shape calculation.
  • Adaptation to Upstream Changes: Includes necessary code adjustments to be compatible with changes introduced in commit 607f968, particularly in how model inputs and embeddings are handled.

Changelog

  • README.md
    • Added a link to the CogVideoX1.5-5B-I2V model on Hugging Face.
  • configs/cogvideox_i2v.json
    • Added a new configuration file specifically for CogVideoX I2V inference, defining parameters like dimensions, steps, VAE factors, and transformer settings.
  • lightx2v/models/networks/cogvideox/infer/pre_infer.py
    • Modified the infer method to accept an optional ofs parameter.
    • Added logic to compute and add an offset/image embedding (ofs_emb) to the time embedding if ofs is provided.
  • lightx2v/models/networks/cogvideox/model.py
    • Added conditional logic in the infer method to concatenate image latents (self.scheduler.image_latents) with video latents along dimension 2 when the task is 'i2v'.
    • Passed the calculated ofs_emb to the pre_infer.infer call.
    • Removed commented-out boolean values next to if conditions for use_dynamic_cfg and do_classifier_free_guidance.
  • lightx2v/models/networks/cogvideox/weights/pre_weights.py
    • Added conditional loading of ofs_embedding_linear_1 and ofs_embedding_linear_2 weights if ofs_embed_dim is present in the configuration.
  • lightx2v/models/runners/cogvideox/cogvidex_runner.py
    • Imported load_image and torch.
    • Refactored model loading by combining load_transformer, load_image_encoder, load_text_encoder, and load_vae into a single load_model method.
    • Modified run_text_encoder to remove the img parameter.
    • Removed placeholder methods run_vae_encoder and get_encoder_output_i2v.
    • Added run_image_encoder method to load, preprocess, and encode the input image using the VAE.
    • Added run_input_encoder_local_t2v and run_input_encoder_local_i2v methods to handle encoder outputs based on the task.
    • Added run_vae_decoder_local method to decode latents using the VAE.
    • Updated set_target_shape to calculate target and padding shapes specifically for the 'i2v' task.
    • Renamed save_video_func to save_video.
  • lightx2v/models/schedulers/cogvideox/scheduler.py
    • Added a retrieve_latents helper function (copied from diffusers; see the sketch after this changelog) to extract latents from encoder outputs.
    • Modified the prepare method to call prepare_image_latents if the task is 'i2v'.
    • Added the prepare_image_latents method to process image encoder output, including retrieving, permuting, scaling (with a note about the original model's scaling), padding, and potentially prepending the first frame.
  • lightx2v/models/video_encoders/hf/cogvideox/autoencoder_ks_cogvidex.py
    • Minor whitespace changes in _encode and encode methods.
  • lightx2v/models/video_encoders/hf/cogvideox/model.py
    • Removed type hint comments (# type: ignore).
    • Added an encode method that wraps the internal _encode call and enables tiling.
  • scripts/run_cogvideox_i2v.sh
    • Added a new shell script to demonstrate running CogVideoX I2V inference with example paths and parameters.
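
For readers following the I2V latent flow, here is a sketch of the retrieve_latents helper referenced above, as it appears in recent diffusers releases; the copy vendored in this PR may differ slightly.

def retrieve_latents(encoder_output, generator=None, sample_mode="sample"):
    # Sketch of the diffusers helper; the version copied into scheduler.py may vary.
    if hasattr(encoder_output, "latent_dist") and sample_mode == "sample":
        return encoder_output.latent_dist.sample(generator)  # stochastic sample from the VAE posterior
    elif hasattr(encoder_output, "latent_dist") and sample_mode == "argmax":
        return encoder_output.latent_dist.mode()  # deterministic mode of the posterior
    elif hasattr(encoder_output, "latents"):
        return encoder_output.latents  # latents already materialized by the encoder
    else:
        raise AttributeError("Could not access latents of provided encoder_output")
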
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature               Command              Description
Code Review           /gemini review       Performs a code review for the current pull request in its current state.
Pull Request Summary  /gemini summary      Provides a summary of the current pull request in its current state.
Comment               @gemini-code-assist  Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                  /gemini help         Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


From still image, motion flows,
Diffusion magic, the video grows.
Latents shift and blend,
A visual story to send.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully introduces support for CogVideoX Image-to-Video (I2V) and adapts to changes from a previous commit. The core logic for handling I2V, including image latent preparation and offset embeddings, appears well-integrated. The refactoring in the runner for model loading is a good improvement.

I've identified a few areas that could benefit from further clarification or changes, primarily concerning a magic number, a specific padding logic, and adherence to contribution guidelines regarding hardcoded paths in the new script. Addressing these will enhance the code's clarity, maintainability, and usability.

Summary of Findings

  • Magic Number for Offset Embedding: The value 2.0 is used for ofs_emb initialization in lightx2v/models/networks/cogvideox/model.py. The reason for this specific value should be clarified or made a configurable parameter.
  • Hardcoded Paths in Shell Script: The new script scripts/run_cogvideox_i2v.sh contains hardcoded local paths for PYTHONPATH and the input image, which violates the project's contributing guidelines.
  • Padding Logic for I2V Frame Dimensions: A specific padding logic, NewLength = CurrentLength + (CurrentLength % PatchSize), is used in both the runner and scheduler for I2V tasks. While this makes the dimension even when PatchSize is 2, its general applicability for other PatchSize values, and its intent when the dimension is already correctly padded, need clarification.
  • Hardcoded Device/Dtype in Runner: The run_image_encoder method in CogvideoxRunner hardcodes cuda device and bfloat16 dtype, which could be made more flexible by using global configurations.

Merge Readiness

The pull request makes significant progress in adding I2V support for CogVideoX. However, there are several medium-severity issues related to magic numbers, hardcoded paths, and potentially unclear padding logic that should be addressed before merging. These changes will improve the code's robustness, maintainability, and adherence to project guidelines. I am unable to approve the pull request directly; please ensure these points are considered and have other reviewers approve the changes before merging.

latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
if self.config.task in ["i2v"]:
    ofs_emb = None if self.config.transformer_ofs_embed_dim is None else latent_model_input.new_full((1,), fill_value=2.0)

medium

The ofs_emb is initialized with fill_value=2.0 when self.config.transformer_ofs_embed_dim is not None. Could you clarify the significance of this specific value 2.0? If it's a critical parameter, consider defining it as a named constant or making it configurable for better readability and maintainability.
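
As one possible shape for that suggestion, the value could be lifted into a named constant; the constant name below is hypothetical.

# Hypothetical constant name; the value 2.0 is carried over from the PR unchanged.
COGVIDEOX_I2V_OFS_VALUE = 2.0

ofs_emb = None if self.config.transformer_ofs_embed_dim is None else latent_model_input.new_full((1,), fill_value=COGVIDEOX_I2V_OFS_VALUE)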

raise NotImplementedError("I2V inference is not implemented for Cogvideox.")

def run_image_encoder(self, config, vae_model):
    image = load_image(image=self.config.image_path)
    image = vae_model.video_processor.preprocess(image, height=config.height, width=config.width).to(torch.device("cuda"), dtype=torch.bfloat16)

medium

The device and dtype for image preprocessing are hardcoded here to torch.device("cuda") and torch.bfloat16. While this might be the common case, would it be beneficial to align this with a global device/dtype configuration from self.config for greater flexibility, similar to how init_device is determined in get_init_device?
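
A minimal sketch of that suggestion, reusing the runner's existing get_init_device and assuming a dtype entry on self.config (the dtype key is an assumption, not confirmed project API):

# Sketch only: the config dtype key is assumed; fall back to the current defaults.
init_device = self.get_init_device()  # the existing device helper mentioned above
dtype = getattr(self.config, "dtype", torch.bfloat16)
image = vae_model.video_processor.preprocess(image, height=config.height, width=config.width).to(init_device, dtype=dtype)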

if self.config.task in ["t2v"]:
    self.config.target_shape = target_shape
elif self.config.task in ["i2v"]:
    self.config.target_shape = target_shape[:1] + (target_shape[1] + target_shape[1] % self.config.patch_size_t,) + target_shape[2:]

medium

The frame dimension of target_shape for I2V is adjusted using target_shape[1] + target_shape[1] % self.config.patch_size_t.

Given that target_shape[1] (derived from num_frames on lines 67-73) should already be a multiple of self.config.patch_size_t due to the padding logic on lines 71-73, target_shape[1] % self.config.patch_size_t would be 0, making this addition redundant.

If self.config.patch_size_t is always 2 (as in cogvideox_i2v.json), this logic effectively ensures the dimension is even (L if L is even, L+1 if L is odd).

Could you clarify the intent here? Is this an additional specific padding requirement for CogVideoX I2V, or is there a scenario where target_shape[1] might not be a multiple of patch_size_t at this point?
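
To make the two cases concrete, here is the arithmetic with patch_size_t = 2:

# Worked example for patch_size_t = 2: L + (L % 2) pads odd lengths to even.
for frames in (12, 13):
    print(frames, "->", frames + frames % 2)  # 12 -> 12, 13 -> 14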

Comment on lines +175 to +177
if self.config.patch_size_t is not None:
    first_frame = image_latents[:, : image_latents.size(1) % self.config.patch_size_t, ...]
    image_latents = torch.cat([first_frame, image_latents], dim=1)

medium

This padding logic results in a new length L_new = L + (L % P), where L is the current number of frames and P is patch_size_t.

If patch_size_t is 2 (as per cogvideox_i2v.json), this makes the frame dimension even. This is the same pattern observed in the runner's set_target_shape for I2V.

Is this specific way of padding (making the frame count L + L % P) a requirement for subsequent operations in CogVideoX, particularly when patch_size_t might not be 2? If patch_size_t could be other values (e.g., 3), L + L % P (e.g., 80 + 80%3 = 80+2=82) doesn't make the length a multiple of P (82 is not divisible by 3). Clarifying the rationale would be helpful.
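
If the intended invariant is that the frame count be a multiple of patch_size_t for any P, a generalized padding sketch could look like the following (an alternative for discussion, assuming a 5-D (B, F, C, H, W) latent layout; this is not the PR's current behavior):

# Alternative sketch: pad the frame dimension up to the next multiple of patch_size_t.
P = self.config.patch_size_t
pad = (-image_latents.size(1)) % P  # frames needed to reach the next multiple of P
if pad:
    first_frame = image_latents[:, :1, ...].repeat(1, pad, 1, 1, 1)  # repeat the first frame pad times
    image_latents = torch.cat([first_frame, image_latents], dim=1)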

export ENABLE_PROFILING_DEBUG=true
export ENABLE_GRAPH_MODE=false

export PYTHONPATH=/mtc/wushuo/VideoGen/diffusers:$PYTHONPATH

medium

This line hardcodes a local path /mtc/wushuo/VideoGen/diffusers into PYTHONPATH. According to the contributing guidelines, hard-coding local paths should be avoided.

Could this be removed or made a configurable variable at the top of the script, similar to lightx2v_path and model_path?
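
One way to do that, mirroring how lightx2v_path and model_path are already set at the top of the script (the variable name is illustrative):

# Illustrative: only extend PYTHONPATH when the user points at a local diffusers checkout.
diffusers_path=${DIFFUSERS_PATH:-}
if [ -n "$diffusers_path" ]; then
    export PYTHONPATH=$diffusers_path:$PYTHONPATH
fi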

--model_cls cogvideox \
--task i2v \
--model_path $model_path \
--image_path /mtc/wushuo/VideoGen/CogVideo/input.jpg \

medium

The --image_path argument is hardcoded to /mtc/wushuo/VideoGen/CogVideo/input.jpg. This also seems to be a local path specific to a development environment, which goes against the contributing guidelines.

Consider making this a script argument or a placeholder variable that the user needs to set, to improve the script's general usability.
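
For example, the script could require the user to supply the image explicitly and fail fast otherwise (the variable name is illustrative); the launch command would then pass --image_path $image_path instead of the hardcoded file:

# Illustrative: refuse to run without a user-provided input image.
image_path=${IMAGE_PATH:?"Set IMAGE_PATH to your input image, e.g. ./input.jpg"}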
