Releases: PKU-YuanGroup/UniWorld
Release v2.0.0
🚀 Introducing UniWorld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback!
🌟 Surpassing GPT-Image-1 on multiple benchmarks, it demonstrates superior fine-grained control and strong handling of complex language instructions! The new model, training framework, and evaluation results are now fully open-source!
✨ Key Highlights
- 🧠 We introduce UniWorld-R1, the industry's first post-training framework for image editing based on reinforcement learning (RL) policy optimization. It leverages our novel DiffusionNFT (Diffusion Negative-aware FineTuning) technique for more efficient training and compatibility with high-order samplers.
- 🏆 We pioneer the use of a Multi-modal Large Language Model (MLLM) as a training-free reward model: its output logits provide fine-grained feedback that significantly improves the model's alignment with human intent (a reward sketch follows this list).
- 🥇 UniWorld-V2 achieves new SOTA results, scoring an impressive 7.83 on GEdit-Bench (surpassing GPT-Image-1's 7.53) and leading on ImgEdit with 4.49, outperforming all known open- and closed-source models.
- 🎨 We demonstrate unprecedented fine-grained controllability, including editing complex artistic Chinese characters, precise spatial editing with "Redbox Control," and realistic global light-and-shadow fusion, all of which remain challenging for traditional SFT models.
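To make the MLLM-as-reward idea above concrete, here is a minimal sketch, not the exact UniWorld-R1 implementation, of how an off-the-shelf MLLM can score an edit with no reward-model training: the MLLM is asked whether the edited image follows the instruction, and the probability it assigns to a "Yes" answer, read straight from its next-token logits, becomes the reward. The Qwen2-VL checkpoint, the prompt wording, and the `mllm_reward` helper are illustrative assumptions.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Assumption: any instruction-following MLLM could be used; Qwen2-VL is only an example.
MODEL_ID = "Qwen/Qwen2-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID)
mllm = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

@torch.no_grad()
def mllm_reward(edited_image: Image.Image, instruction: str) -> float:
    """Score an edited image by how strongly the MLLM answers 'Yes' to a
    verification question, read directly from its next-token logits
    (training-free: no reward model is fitted)."""
    question = (
        f'The image was edited with the instruction: "{instruction}". '
        "Does the result faithfully follow the instruction? Answer Yes or No."
    )
    messages = [{"role": "user",
                 "content": [{"type": "image"},
                             {"type": "text", "text": question}]}]
    prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[edited_image], return_tensors="pt").to(mllm.device)

    next_token_logits = mllm(**inputs).logits[0, -1]  # logits of the first answer token
    yes_id = processor.tokenizer.encode("Yes", add_special_tokens=False)[0]
    no_id = processor.tokenizer.encode("No", add_special_tokens=False)[0]
    # Soft reward in [0, 1]: relative probability of "Yes" versus "No".
    return torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)[0].item()
```

The resulting scalar is what the DiffusionNFT-based policy-optimization step would consume; reading the soft Yes/No probability instead of a sampled answer is what provides the fine-grained, implicit feedback described above.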
🔭 Future Work
- Continue collecting data and explore joint training with Visual Language Models (VLMs).
- Integrate higher-resolution semantic encoders, or adopt VLM techniques such as multi-scale image gridding to increase the input image resolution (sketched below).
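As a rough illustration of the multi-scale image gridding mentioned above (a technique borrowed from VLMs; the exact scheme to be adopted is not specified here), the sketch below builds one downscaled global view plus a grid of high-resolution tiles, so that a fixed-resolution semantic encoder can cover a larger input image. The 384-pixel tile size and the 2x2 grid are assumptions.

```python
from PIL import Image

def multiscale_grid_views(image: Image.Image, encoder_res: int = 384, grid: int = 2) -> list[Image.Image]:
    """Return a coarse global view plus grid x grid high-resolution tiles,
    so a fixed-resolution encoder sees both global context and local detail."""
    # Global view: the whole image downscaled to the encoder's native resolution.
    views = [image.resize((encoder_res, encoder_res), Image.Resampling.BICUBIC)]

    # Local views: upscale so each grid cell matches encoder_res, then crop
    # non-overlapping tiles.
    big = image.resize((encoder_res * grid, encoder_res * grid), Image.Resampling.BICUBIC)
    for row in range(grid):
        for col in range(grid):
            left, top = col * encoder_res, row * encoder_res
            views.append(big.crop((left, top, left + encoder_res, top + encoder_res)))
    return views
```

Each view would be encoded separately and the resulting tokens concatenated, giving the generator an effectively higher-resolution picture of the input image.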
Release v1.0.0
🚀 UniWorld: a unified model that skips VAEs and uses semantic features from SigLIP! Using just 1% of BAGEL's data, it outperforms BAGEL on image editing and excels in understanding & generation.
🌟 The data, models, and training & evaluation scripts are now open-source!
Key features:
- We observe that GPT-4o likely employs non-mandatory VAE injection, which makes it difficult to preserve low-level features consistent with the reference image; UniWorld instead skips the VAE and conditions on SigLIP semantic features (see the sketch after this list).
- We demonstrate remarkable image perception capabilities, surpassing those of GPT-4o.
- We used only 2.7M data samples—just 0.1% of BAGEL—achieving high efficiency. All data, training and evaluation code, and models have been fully open-sourced.
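To illustrate the VAE-free, SigLIP-conditioned setup referenced in the first key feature, here is a minimal sketch using the Hugging Face transformers SigLIP classes. The specific checkpoint and the way the generator consumes the tokens are assumptions rather than the exact UniWorld configuration.

```python
import torch
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

# Assumption: a SigLIP checkpoint standing in for the high-resolution semantic encoder.
ENCODER_ID = "google/siglip-so400m-patch14-384"
image_processor = SiglipImageProcessor.from_pretrained(ENCODER_ID)
vision_tower = SiglipVisionModel.from_pretrained(ENCODER_ID, torch_dtype=torch.bfloat16).eval()

@torch.no_grad()
def encode_reference(image: Image.Image) -> torch.Tensor:
    """Return per-patch semantic tokens of the reference image.
    These tokens condition the generator directly; no VAE latents are injected."""
    pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
    features = vision_tower(pixel_values=pixel_values.to(vision_tower.dtype)).last_hidden_state
    return features  # shape: (1, num_patches, hidden_size)

# A diffusion/flow generator (not shown) would attend to these tokens alongside the
# text-instruction embeddings to produce the edited image.
```

Here the per-patch semantic tokens take the place of the VAE latents that a conventional editing pipeline would inject.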
Future work:
- Continue collecting data and perform joint training with a VLM.
- Integrate higher-resolution semantic encoders, or adopt VLM techniques such as multi-scale image gridding to increase the input image resolution.