support cogvideox i2v & adapt to the changes in commit 607f968 #59
Conversation
Hello @Watebear, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, providing a summary of this pull request by Watebear. This PR primarily focuses on adding support for the Image-to-Video (I2V) functionality of the CogVideoX model. It also includes adaptations to align with changes introduced in a specific commit (607f968), which seem to involve modifications to the model's input processing and overall workflow to accommodate the I2V task. Key changes include introducing a new configuration file for I2V, modifying the model and scheduler to handle image latents alongside video latents, and refactoring the runner to manage the I2V specific pipeline.
Highlights
- CogVideoX I2V Support: Adds core functionality to support Image-to-Video generation using the CogVideoX model.
- New I2V Configuration: Introduces a dedicated JSON configuration file (`configs/cogvideox_i2v.json`) containing parameters specific to the CogVideoX I2V task.
- Image Latent Handling: Modifies the model's inference process and the scheduler's preparation steps to correctly encode, process, and integrate image latents for I2V generation.
- Runner Refactoring: Refactors the CogVideoX runner to streamline model loading and implement the I2V specific workflow, including image encoding and adjusted target shape calculation.
- Adaptation to Upstream Changes: Includes necessary code adjustments to be compatible with changes introduced in commit 607f968, particularly in how model inputs and embeddings are handled.
Changelog
- README.md
  - Added a link to the CogVideoX1.5-5B-I2V model on Hugging Face.
- configs/cogvideox_i2v.json
  - Added a new configuration file specifically for CogVideoX I2V inference, defining parameters like dimensions, steps, VAE factors, and transformer settings.
- lightx2v/models/networks/cogvideox/infer/pre_infer.py
  - Modified the `infer` method to accept an optional `ofs` parameter.
  - Added logic to compute and add an offset/image embedding (`ofs_emb`) to the time embedding if `ofs` is provided.
- lightx2v/models/networks/cogvideox/model.py
  - Added conditional logic in the `infer` method to concatenate image latents (`self.scheduler.image_latents`) with video latents along dimension 2 when the task is 'i2v'.
  - Passed the calculated `ofs_emb` to the `pre_infer.infer` call.
  - Removed commented-out boolean values next to `if` conditions for `use_dynamic_cfg` and `do_classifier_free_guidance`.
- lightx2v/models/networks/cogvideox/weights/pre_weights.py
  - Added conditional loading of `ofs_embedding_linear_1` and `ofs_embedding_linear_2` weights if `ofs_embed_dim` is present in the configuration.
- lightx2v/models/runners/cogvideox/cogvidex_runner.py
  - Imported `load_image` and `torch`.
  - Refactored model loading by combining `load_transformer`, `load_image_encoder`, `load_text_encoder`, and `load_vae` into a single `load_model` method.
  - Modified `run_text_encoder` to remove the `img` parameter.
  - Removed placeholder methods `run_vae_encoder` and `get_encoder_output_i2v`.
  - Added a `run_image_encoder` method to load, preprocess, and encode the input image using the VAE.
  - Added `run_input_encoder_local_t2v` and `run_input_encoder_local_i2v` methods to handle encoder outputs based on the task.
  - Added a `run_vae_decoder_local` method to decode latents using the VAE.
  - Updated `set_target_shape` to calculate target and padding shapes specifically for the 'i2v' task.
  - Renamed `save_video_func` to `save_video`.
- lightx2v/models/schedulers/cogvideox/scheduler.py
  - Added a `retrieve_latents` helper function (copied from `diffusers`) to extract latents from encoder outputs.
  - Modified the `prepare` method to call `prepare_image_latents` if the task is 'i2v'.
  - Added the `prepare_image_latents` method to process image encoder output, including retrieving, permuting, scaling (with a note about the original model's scaling), padding, and potentially prepending the first frame.
- lightx2v/models/video_encoders/hf/cogvideox/autoencoder_ks_cogvidex.py
  - Minor whitespace changes in the `_encode` and `encode` methods.
- lightx2v/models/video_encoders/hf/cogvideox/model.py
  - Removed type hint comments (`# type: ignore`).
  - Added an `encode` method that wraps the internal `_encode` call and enables tiling.
- scripts/run_cogvideox_i2v.sh
  - Added a new shell script to demonstrate running CogVideoX I2V inference with example paths and parameters.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a `.gemini/` folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
From still image, motion flows,
Diffusion magic, the video grows.
Latents shift and blend,
A visual story to send.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
Code Review
This pull request successfully introduces support for CogVideoX Image-to-Video (I2V) and adapts to changes from a previous commit. The core logic for handling I2V, including image latent preparation and offset embeddings, appears well-integrated. The refactoring in the runner for model loading is a good improvement.
I've identified a few areas that could benefit from further clarification or changes, primarily concerning a magic number, a specific padding logic, and adherence to contribution guidelines regarding hardcoded paths in the new script. Addressing these will enhance the code's clarity, maintainability, and usability.
Summary of Findings
- Magic Number for Offset Embedding: The value `2.0` is used for `ofs_emb` initialization in `lightx2v/models/networks/cogvideox/model.py`. The reason for this specific value should be clarified, or it should be made a configurable parameter.
- Hardcoded Paths in Shell Script: The new script `scripts/run_cogvideox_i2v.sh` contains hardcoded local paths for `PYTHONPATH` and the input image, which violates the project's contributing guidelines.
- Padding Logic for I2V Frame Dimensions: A specific padding logic, `NewLength = CurrentLength + (CurrentLength % PatchSize)`, is used in both the runner and scheduler for I2V tasks. While this makes the dimension even if `PatchSize` is 2, its general applicability and intent for other `PatchSize` values, or when the dimension is already correctly padded, needs clarification.
- Hardcoded Device/Dtype in Runner: The `run_image_encoder` method in `CogvideoxRunner` hardcodes the `cuda` device and `bfloat16` dtype, which could be made more flexible by using global configurations.
Merge Readiness
The pull request makes significant progress in adding I2V support for CogVideoX. However, there are several medium-severity issues related to magic numbers, hardcoded paths, and potentially unclear padding logic that should be addressed before merging. These changes will improve the code's robustness, maintainability, and adherence to project guidelines. I am unable to approve the pull request directly; please ensure these points are considered and have other reviewers approve the changes before merging.
```python
latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
if self.config.task in ["i2v"]:
    ofs_emb = None if self.config.transformer_ofs_embed_dim is None else latent_model_input.new_full((1,), fill_value=2.0)
```
The `ofs_emb` is initialized with `fill_value=2.0` when `self.config.transformer_ofs_embed_dim` is not None. Could you clarify the significance of this specific value `2.0`? If it is a critical parameter, consider defining it as a named constant or making it configurable for better readability and maintainability.
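One low-risk way to address this is sketched below. Note that `OFS_DEFAULT_FILL_VALUE` and the config key `"ofs_embed_fill_value"` are illustrative names invented for this example, not identifiers that exist in the codebase:

```python
# Hypothetical sketch: hoist the magic number into a named constant that an
# explicit config entry can override. Names here are illustrative only.
OFS_DEFAULT_FILL_VALUE = 2.0


def ofs_fill_value(config: dict) -> float:
    """Return the fill value for ofs_emb, preferring an explicit config entry."""
    return config.get("ofs_embed_fill_value", OFS_DEFAULT_FILL_VALUE)


# The call site would then read:
#   ofs_emb = latent_model_input.new_full((1,), fill_value=ofs_fill_value(self.config))
print(ofs_fill_value({}))
print(ofs_fill_value({"ofs_embed_fill_value": 1.0}))
```

This keeps the current behavior as the default while documenting where the value comes from.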
```python
raise NotImplementedError("I2V inference is not implemented for Cogvideox.")

def run_image_encoder(self, config, vae_model):
    image = load_image(image=self.config.image_path)
    image = vae_model.video_processor.preprocess(image, height=config.height, width=config.width).to(torch.device("cuda"), dtype=torch.bfloat16)
```
The device and dtype for image preprocessing are hardcoded here to `torch.device("cuda")` and `torch.bfloat16`. While this might be the common case, would it be beneficial to align this with a global device/dtype configuration from `self.config` for greater flexibility, similar to how `init_device` is determined in `get_init_device`?
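A minimal sketch of what config-driven selection could look like. The config keys `"device"` and `"dtype"` and their fallbacks are assumptions about the runner's config schema, not the project's actual fields:

```python
# Hypothetical sketch: resolve device/dtype names from the runner config
# instead of hardcoding them. Key names are assumptions, not the real schema.
def resolve_runtime(config: dict) -> tuple[str, str]:
    device = config.get("device", "cuda")
    dtype_name = config.get("dtype", "bfloat16")
    return device, dtype_name


# In the runner this would become roughly:
#   device, dtype_name = resolve_runtime(self.config)
#   image = preprocessed.to(torch.device(device), dtype=getattr(torch, dtype_name))
print(resolve_runtime({}))
print(resolve_runtime({"device": "cpu", "dtype": "float32"}))
```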
```python
if self.config.task in ["t2v"]:
    self.config.target_shape = target_shape
elif self.config.task in ["i2v"]:
    self.config.target_shape = target_shape[:1] + (target_shape[1] + target_shape[1] % self.config.patch_size_t,) + target_shape[2:]
```
The frame dimension of `target_shape` for I2V is adjusted using `target_shape[1] + target_shape[1] % self.config.patch_size_t`.

Given that `target_shape[1]` (derived from `num_frames` on lines 67-73) should already be a multiple of `self.config.patch_size_t` due to the padding logic on lines 71-73, `target_shape[1] % self.config.patch_size_t` would be 0, making this addition redundant.

If `self.config.patch_size_t` is always 2 (as in `cogvideox_i2v.json`), this logic effectively ensures the dimension is even (`L` if `L` is even, `L + 1` if `L` is odd).

Could you clarify the intent here? Is this an additional specific padding requirement for CogVideoX I2V, or is there a scenario where `target_shape[1]` might not be a multiple of `patch_size_t` at this point?
```python
if self.config.patch_size_t is not None:
    first_frame = image_latents[:, : image_latents.size(1) % self.config.patch_size_t, ...]
    image_latents = torch.cat([first_frame, image_latents], dim=1)
```
This padding logic, `first_frame = image_latents[:, : image_latents.size(1) % self.config.patch_size_t, ...]` followed by `image_latents = torch.cat([first_frame, image_latents], dim=1)`, results in a new length `L_new = L + (L % P)`, where `L` is the current number of frames and `P` is `patch_size_t`.

If `patch_size_t` is 2 (as per `cogvideox_i2v.json`), this makes the frame dimension even. This is the same pattern observed in the runner's `set_target_shape` for I2V.

Is this specific way of padding (making the frame count `L + L % P`) a requirement for subsequent operations in CogVideoX, particularly when `patch_size_t` might not be 2? If `patch_size_t` could be other values (e.g. 3), `L + L % P` (e.g. `80 + 80 % 3 = 82`) does not make the length a multiple of `P` (82 is not divisible by 3). Clarifying the rationale would be helpful.
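The slicing pattern in question can be reproduced with plain lists; this is a torch-free illustration of the same indexing along the frame dimension, not the project's code:

```python
def pad_frames(frames: list, patch_size_t: int) -> list:
    # Mirrors image_latents[:, : size(1) % P, ...] followed by
    # torch.cat([first_frame, image_latents], dim=1): prepend the
    # first (L % P) frames to the sequence.
    extra = len(frames) % patch_size_t
    return frames[:extra] + frames


assert pad_frames([0, 1, 2], 2) == [0, 0, 1, 2]      # L=3, P=2: prepend one frame
assert pad_frames([0, 1, 2, 3], 2) == [0, 1, 2, 3]   # L=4, P=2: unchanged
assert len(pad_frames(list(range(80)), 3)) == 82     # L=80, P=3: 82, not a multiple of 3
```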
```shell
export ENABLE_PROFILING_DEBUG=true
export ENABLE_GRAPH_MODE=false

export PYTHONPATH=/mtc/wushuo/VideoGen/diffusers:$PYTHONPATH
```
```shell
--model_cls cogvideox \
--task i2v \
--model_path $model_path \
--image_path /mtc/wushuo/VideoGen/CogVideo/input.jpg \
```
The `--image_path` argument is hardcoded to `/mtc/wushuo/VideoGen/CogVideo/input.jpg`. This also seems to be a local path specific to a development environment, which goes against the contributing guidelines.

Consider making this a script argument or a placeholder variable that the user needs to set, to improve the script's general usability.
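One way to do that is sketched below; the variable names and fallback path are illustrative, not the actual contents of `run_cogvideox_i2v.sh`:

```shell
# Hypothetical sketch: read the image path from the first CLI argument or an
# IMAGE_PATH environment variable, with a placeholder the user must replace.
image_path="${1:-${IMAGE_PATH:-/path/to/input.jpg}}"
echo "using image: ${image_path}"
# ...then pass it through:  --image_path "${image_path}" \
```

Invoking the script as `bash run_cogvideox_i2v.sh my_input.jpg` would then override the placeholder without editing the file.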
Contributing Guidelines
We have prepared a `pre-commit` hook to enforce consistent code formatting across the project. If your code complies with the standards, you should not see any errors; otherwise, you can clean up your code by following the steps below.
Thank you for your contributions!