[2022 CVPR Oral] High-Resolution Image Synthesis with Latent Diffusion Models

Image generation research falls broadly into two directions

1. Generative Adversarial Networks (GANs)
  Pros : can generate high-resolution images of good quality, fast sampling
  Cons : hard to optimize, fails to capture the full data distribution

2. likelihood-based methods
  Pros : well-behaved optimization
  Cons : training takes a long time because capacity is spent modeling imperceptible high-frequency details in the image
  likelihood-based methods include VAEs, flow-based models, ARMs, and DMs
    1. Variational Autoencoders (VAEs), Flow-based models
      Pros : fast sampling
      Cons : poor sample quality

    2. Autoregressive models (ARMs)
      Pros : strong density estimation performance
      Cons : slow sampling because sampling is sequential
      To generate high-resolution images at a reasonable sampling speed, 2-stage approaches have been studied
      (compress the image into a latent, then train an ARM on the latent)
      However, computational cost forces a high compression rate, so sample quality is poor

    3. Diffusion Probabilistic Models (DMs)
      Pros : strong density estimation performance, good sample quality
      Cons : expensive to train, slow sampling


Authors' goal : reduce the computational demands of DMs without impairing their performance

The drawbacks of DMs come from sequential sampling + optimization & inference in pixel space

Sequential sampling is inherent to DMs and cannot be avoided, but there is no reason to stay in pixel space
→ the 2-stage approach used for ARMs looks like a very good fit


<details><summary>2-stage approach details</summary>
<img width="50%" alt="image" src="https://github.com/user-attachments/assets/d03fca06-b63e-48fe-8f3b-d2e743c8f131">

> `Figure 2`
> In rate-distortion terms, the training of a diffusion model in pixel space can be split into 2 stages
> 1. perceptual compression stage
>   removes high-frequency details but still learns little semantic variation
> 2. semantic compression stage
>   the actual generative model learns the semantic and conceptual composition of the data

> Training proceeds in 2 stages
> 1. train autoencoder
>   compress into a lower-dimensional representational space that is perceptually equivalent to the data space
> 2. train diffusion model
>   train the diffusion model on the latents produced by the autoencoder

> 2-stage ARMs used a high compression rate because of computational cost, which hurt sample quality
> i.e., high-fidelity reconstruction is impossible under excessive perceptual compression
> To reach the authors' goal of perceptually equivalent + efficient computation, a suitable compression rate has to be used
> 
> A U-Net architecture is used for the diffusion model so that the compression level can be chosen freely
> (LDMs scale more gently to higher dimensional latent spaces due to their convolutional backbone free to choose the level of compression)

> There is prior work that jointly trains the autoencoder + diffusion model, but it needed many extra loss terms, carefully weighted, to train at all
> The authors say they do not train jointly, for the sake of faithful reconstruction
> Not training jointly also makes training more efficient
> → take a well-trained universal autoencoder and train only the diffusion model on each dataset

</details>


The paper's 3 main contributions are as follows

1. Proposes Latent Diffusion Models (LDMs), diffusion models trained in latent space rather than pixel space
  LDMs perform comparably to pixel-based diffusion approaches while lowering training & inference computational cost
2. Uses a U-Net architecture as the LDM backbone
  With a CNN backbone, it scales well to high-dimensional data, unlike purely transformer-based approaches
  → the compression level can be chosen more freely, which leads to good sample quality
  → can generate high-resolution images larger than those seen during training
3. Proposes a general-purpose conditioning mechanism based on cross-attention
  Can be trained in a general way on inputs of various modalities

Brief summary of only the parts that seem important


## 1. Method
<img width="49%" alt="image" src="https://github.com/user-attachments/assets/2e8ceb69-b90a-4b35-95de-50bb01f8c925">
<img width="50%" alt="image" src="https://github.com/user-attachments/assets/8bf6160c-75a6-4f86-8af8-d313726427d3">

<details><summary>Perceptual Image Compression</summary>
<img width="100%" alt="image" src="https://github.com/user-attachments/assets/955b5ae5-b4a1-4768-8f67-218b399c0781">

> The autoencoder is trained with a perceptual loss and a patch-based adversarial objective

> Regularization is used to prevent a high-variance latent space
> Choose one of KL-reg or VQ-reg
> 1. KL-reg : applies a slight KL-penalty between the learned latent and a standard normal (VAE)
> 2. VQ-reg : attaches a vector quantization layer to the decoder (VQ-VAE)
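>
> A minimal sketch of the KL-reg penalty, assuming a diagonal Gaussian posterior parameterized by `mu` and `logvar`. The function name and the `kl_weight` value are hypothetical; the notes only say the penalty is "slight".

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))

# Hypothetical small weight, standing in for the "slight" penalty.
kl_weight = 1e-6

mu = [0.5, -1.0, 0.0]
logvar = [0.0, 0.2, -0.3]
kl = kl_to_standard_normal(mu, logvar)
reg_loss = kl_weight * kl  # would be added to the perceptual / adversarial terms
```

> With such a small weight the latent is only weakly pulled toward a standard normal, which is consistent with the high latent variance discussed later for KL-reg autoencoders.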

</details>


<details><summary>Latent Diffusion Models</summary>
<img width="50%" alt="image" src="https://github.com/user-attachments/assets/fa293c03-c628-4b30-a807-c4179c4a185f">

> Train a diffusion model on the latents
> Uses a time-conditional UNet
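>
> For reference, the training objective can be written as below. This is a sketch of the noise-prediction loss as I read it in the paper; $\mathcal{E}$ (the encoder), $z_t$ (the noised version of the latent $z = \mathcal{E}(x)$) and $\epsilon_\theta$ (the time-conditional UNet) are notation assumed here, not defined in these notes.

$$
L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\ \epsilon \sim \mathcal{N}(0, 1),\ t}\left[ \lVert \epsilon - \epsilon_\theta(z_t, t) \rVert_2^2 \right]
$$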

> inference pipeline
> 1. sample Gaussian noise
> 2. use the LDM to turn the Gaussian noise into a latent
> 3. use the decoder of the autoencoder to turn the latent into an image
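>
> The three steps above can be sketched as follows. This is a toy under stated assumptions: a DDPM-style ancestral sampler with a linear beta schedule is assumed (the notes do not fix a sampler), and `toy_eps_model` and `toy_decoder` are hypothetical placeholders for the trained UNet and the autoencoder decoder.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) and cumulative alpha products."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return betas, alpha_bars

def sample_latent(eps_model, latent_size, num_steps, rng):
    betas, alpha_bars = make_schedule(num_steps)
    # 1. sample Gaussian noise in latent space
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]
    # 2. sequentially denoise the noise into a latent
    for t in reversed(range(num_steps)):
        eps = eps_model(z, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(zi - coef * ei) / math.sqrt(1.0 - betas[t])
                for zi, ei in zip(z, eps)]
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the last step
        z = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return z

def toy_eps_model(z, t):
    """Placeholder for the trained time-conditional UNet: predicts zero noise."""
    return [0.0 for _ in z]

def toy_decoder(z, f=4):
    """Placeholder for the autoencoder decoder: repeats each latent value f times."""
    return [zi for zi in z for _ in range(f)]

rng = random.Random(0)
latent = sample_latent(toy_eps_model, latent_size=16, num_steps=50, rng=rng)
image = toy_decoder(latent)  # 3. decode the latent into "pixels"
```

> The decoder runs only once, after the loop, so all sequential denoising steps operate on the smaller latent.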

</details>


<details><summary>Conditioning Mechanisms</summary>
<img width="50%" alt="image" src="https://github.com/user-attachments/assets/347111ea-f6a9-4ffa-b9f7-adbadb3a1b37">
<img width="49%" alt="image" src="https://github.com/user-attachments/assets/dc355fbc-fc52-4fa0-8dc3-8f7e6218fd6d">

> Train a conditional image generator by applying cross-attention to the UNet backbone
> The domain specific encoder $\tau$ and the LDM $\epsilon$ are jointly optimized
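>
> A minimal sketch of the cross-attention step, assuming the usual scaled dot-product form softmax(QK^T / sqrt(d)) V, with queries coming from the latent features and keys / values from the encoded condition $\tau(y)$. The learned projection matrices are omitted, and all names and toy values are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d)) V, with Q from the latent and K, V from the condition."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

# Toy shapes: 2 latent positions attend over 3 condition tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
attended = cross_attention(queries, keys, values)
```

> Because the condition enters only through the keys and values, the same layer works for any modality that can be encoded into a sequence of tokens.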

> How Stable Diffusion differs from LDM : it uses a frozen CLIP text encoder instead of training the text encoder
> (Stable Diffusion has the same architecture as Latent Diffusion but uses a frozen CLIP Text Encoder instead of training the text encoder jointly with the diffusion model)

</details>


## 2. Experiments
### 2.1. Base

<details><summary>On Perceptual Compression Tradeoffs</summary>
<img width="100%" alt="image" src="https://github.com/user-attachments/assets/f43a9b98-ba0b-4e17-9ab6-606c958a9a99">

> `Table 8`
> reconstruction performance of the first stage model (autoencoder) per config
> the autoencoder maps an (H, W, 3) image to an (H/f, W/f, c) latent
> (f : downsampling factor)
> Note that the autoencoders are trained on the OpenImages dataset
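>
> A small sketch of the shape bookkeeping, assuming H and W are divisible by f. The function names and the choice c = 3 are hypothetical; Table 8 lists several values of c.

```python
def latent_shape(height, width, f, c):
    """Shape of the latent for an (H, W, 3) image under downsampling factor f."""
    if height % f or width % f:
        raise ValueError("H and W are assumed to be divisible by f")
    return (height // f, width // f, c)

def spatial_positions(height, width, f):
    """Number of spatial positions the diffusion model has to denoise."""
    h, w, _ = latent_shape(height, width, f, 1)
    return h * w

for f in (1, 2, 4, 8, 16, 32):
    print(f, latent_shape(256, 256, f, c=3), spatial_positions(256, 256, f))
```

> The number of spatial positions shrinks by f squared, which is where the compute savings of larger f come from.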

<img width="50%" alt="image" src="https://github.com/user-attachments/assets/79e79924-83cd-445d-9d92-8eefd24751c3">

> `Figure 6`
> LDMs with various autoencoder configs are trained on ImageNet
> small downsampling factors (f = 1, 2) : training is slow
> → the diffusion model has to learn perceptual compression itself, so training takes long
> (the drawback of likelihood-based methods mentioned above)
> 
> overly large downsampling factors (f = 32) : training plateaus after relatively few steps, but fidelity is poor
> → overly strong first stage compression causes information loss, which caps the achievable quality
>
> moderate downsampling factors (f = 4, 8, 16) strike a good balance between efficiency and perceptually faithful results

<img width="50%" alt="image" src="https://github.com/user-attachments/assets/e183859b-804f-4752-b3b4-50cfd484db16">

> `Figure 7`
> LDMs with moderate downsampling factors (f = 4, 8) can sample good-quality images efficiently
> On complex datasets such as ImageNet, quality degrades when the compression rate is too high

</details>



<details><summary>Image Generation with Latent Diffusion</summary>
<img width="100%" alt="image" src="https://github.com/user-attachments/assets/e1858001-8d16-4fd7-8dfd-37c013791c9c">
<img width="50%" alt="image" src="https://github.com/user-attachments/assets/fa0b765a-f5f8-4357-be41-195a4cb3f0fa">

> `Table 1`
> LDMs show comparable FID
> Thanks to the mode-covering advantage of likelihood-based training, Precision and Recall are better than those of GAN-based methods

</details>



<details><summary>post-hoc image guidance</summary>

<img width="100%" alt="image" src="https://github.com/user-attachments/assets/de685202-5154-4119-8286-8953c7b3ae82">

> `Figure 14`
> Generating 512 x 512 images with an unconditional LDM trained on 256 x 256 images gives poor image quality
> → how can plausible 512 x 512 images be generated?
> Generate a 256 x 256 image first, then generate the 512 x 512 image with post-hoc image guidance

<img width="100%" alt="image" src="https://github.com/user-attachments/assets/61bd7136-bf66-45cb-a041-0496502877d1">

> post-hoc image guidance
> Think of it as swapping out the classifier in the existing classifier guidance scheme; the procedure is as follows
> 
> 1. An unconditional LDM trained on 256 x 256 images is available
> 2. Draw a noisy latent from a Gaussian distribution, sized so that a 512 x 512 image can be generated
> 3. Convert the noisy latent at timestep t into a 512 x 512 image with the autoencoder decoder
> 4. Apply 2x bicubic downsampling to the 512 x 512 noisy image
> 5. Guide with the perceptual loss between the downsampled 256 x 256 noisy image and the 256 x 256 image
> (LPIPS is used as the perceptual loss)
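>
> Steps 3 to 5 can be sketched as a loss function. Everything here is a stand-in chosen to keep the example runnable: nearest-neighbour upsampling replaces the decoder, 2x2 average pooling replaces bicubic downsampling, and mean squared error replaces LPIPS. In actual guidance the gradient of this loss with respect to the noisy latent would steer each sampling step; that gradient is not computed here.

```python
def decode_nearest(latent, f):
    """Stand-in decoder: nearest-neighbour upsampling of a 2D latent by factor f."""
    return [[v for v in row for _ in range(f)] for row in latent for _ in range(f)]

def downsample_2x(image):
    """Stand-in for 2x bicubic downsampling: 2x2 average pooling."""
    return [
        [(image[i][j] + image[i][j + 1] + image[i + 1][j] + image[i + 1][j + 1]) / 4.0
         for j in range(0, len(image[0]), 2)]
        for i in range(0, len(image), 2)
    ]

def mse(a, b):
    """Stand-in for LPIPS: mean squared error between two equally sized images."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

def guidance_loss(noisy_latent, target_image, f):
    decoded = decode_nearest(noisy_latent, f)  # step 3: latent -> large image
    small = downsample_2x(decoded)             # step 4: 2x downsampling
    return mse(small, target_image)            # step 5: compare with the guide image

# Toy sizes: a 4x4 latent decodes to 8x8 with f = 2 and is compared at 4x4.
latent = [[float(i + j) for j in range(4)] for i in range(4)]
target = [row[:] for row in latent]
loss = guidance_loss(latent, target, f=2)
```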

</details>


### 2.2. Conditional LDM using Cross-Attention

<details><summary>conditional LDM using cross-attention - text</summary>

> text-to-image task
> input : text prompt
> 
> A 1.45B-parameter KL-regularized LDM is trained on LAION-400M
> Uses the BERT tokenizer and a Transformer text encoder

<img width="100%" alt="image" src="https://github.com/user-attachments/assets/07eefb88-1944-46e2-9f8c-771227d7aeab">
<img width="50%" alt="image" src="https://github.com/user-attachments/assets/54f4fe38-175a-4cd0-9712-8e1fbd7c1e0a">

> `Table 2`
> our model improves upon powerful AR and GAN-based methods
> applying classifier-free diffusion guidance greatly boosts sample quality
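>
> A minimal sketch of classifier-free guidance, assuming the common parameterization in which the guided prediction extrapolates from the unconditional toward the conditional noise estimate. The function name, the `scale` parameter and the toy values are hypothetical.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Extrapolate from the unconditional toward the conditional noise prediction."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [1.0, 1.0, 0.0]
guided = classifier_free_guidance(eps_uncond, eps_cond, scale=2.0)
```

> Under this form, `scale = 1` recovers the plain conditional prediction and larger values push samples further toward the condition.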

</details>



<details><summary>conditional LDM using cross-attention - semantic layout</summary>

> layout-to-image synthesis task
> input : layout-text pairs
>
> Images are generated via cross-attention between the latent corresponding to the layout and the text embedding
> The text embedding is obtained with a Transformer text encoder

<img width="100%" alt="image" src="https://github.com/user-attachments/assets/0bc8bbb6-9cbc-4b14-84a1-df4582975763">
<img width="100%" alt="image" src="https://github.com/user-attachments/assets/9d73acc6-9539-4d54-be59-61119a0a3af5">

> `Table 9`
> train from scratch on COCO : reaches the performance of recent state-of-the-art models
> pre-train on OpenImages & fine-tune on COCO : surpasses the performance of recent state-of-the-art models

</details>



<details><summary>conditional LDM using cross-attention - class</summary>

> class-conditional image synthesis task
> input : class
> 
> The class is encoded with an embedding layer, then images are generated via cross-attention with the latent

<img width="50%" alt="image" src="https://github.com/user-attachments/assets/326b78b3-f1da-4680-950c-54475e1f7533">

> `Table 3`
> we outperform the state of the art diffusion model ADM while significantly reducing computational requirements and parameter count

</details>


### 2.3. Conditional LDM using Concat

<details><summary>Convolutional Sampling Beyond 256 x 256</summary>

> image-to-image translation tasks : semantic synthesis, super-resolution, inpainting
> An LDM trained on each task can generate high-resolution images larger than those seen during training
> (our model generalizes to larger resolutions and can generate images up to the megapixel regime when evaluated in a convolutional manner)


<img width="100%" alt="image" src="https://github.com/user-attachments/assets/3ad92a4e-8a5c-4f5f-a80d-e75a24f11f86">

> However, when generating 512 x 512 or 1024 x 1024 images, the signal-to-noise ratio has a large effect on output quality
> (signal-to-noise ratio induced by the variance of the latent space ($\mathrm{Var}(z)/\sigma_t^2$) significantly affects the results for convolutional sampling)
> 
> The latent space of a KL-regularized autoencoder has a high SNR
> → the model allocates most of the semantic detail early in the reverse denoising process

<img width="100%" alt="image" src="https://github.com/user-attachments/assets/661e1fc7-d65d-4a2c-95a6-da8ddcece817">

> The latent space is rescaled by its component-wise standard deviation to lower the SNR

<img width="100%" alt="image" src="https://github.com/user-attachments/assets/42248713-cd47-4077-a802-8625d7a1f5bf">

> `Figure 15`
> The KL-reg autoencoder has a high SNR, so image quality is poor without rescaling
> Either use VQ-reg, or use KL-reg with rescaling

<img width="100%" alt="image" src="https://github.com/user-attachments/assets/2c9fdba4-d8c0-49bd-8fc3-02fec9125064">
<img width="100%" alt="image" src="https://github.com/user-attachments/assets/5d9b73e2-def5-474f-855e-18ccd805619f">
<img width="100%" alt="image" src="https://github.com/user-attachments/assets/5d1edbf8-39d0-4fce-a4e5-951c1b318d38">

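> In the code, when encoding with the encoder of the KL-reg autoencoder, the latent is multiplied by a scale factor to make it smaller
> and when decoding with the decoder, it is divided by the scale factor to make it larger again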
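>
> The multiply-on-encode / divide-on-decode pattern can be sketched as below. This assumes the scale factor is the reciprocal of the latent standard deviation and, for simplicity, uses a single global value rather than a component-wise one. All function names and the toy latents are hypothetical.

```python
import random
import statistics

def estimate_scale_factor(latent_values):
    """Scale factor = 1 / std of the latents, so rescaled latents have unit std."""
    return 1.0 / statistics.pstdev(latent_values)

def to_diffusion_space(z, scale_factor):
    return [scale_factor * v for v in z]  # applied right after encoding

def from_diffusion_space(z, scale_factor):
    return [v / scale_factor for v in z]  # applied right before decoding

rng = random.Random(0)
# Toy latents with a large std, mimicking a weakly regularized KL autoencoder.
z = [rng.gauss(0.0, 5.0) for _ in range(10_000)]
scale = estimate_scale_factor(z)
z_scaled = to_diffusion_space(z, scale)
z_back = from_diffusion_space(z_scaled, scale)
```

> Shrinking the latent reduces $\mathrm{Var}(z)$, which lowers the SNR at every noise level $\sigma_t$ without changing the noise schedule.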

</details>



<details><summary>conditional LDM using concat - semantic map</summary>
<img width="50%" alt="image" src="https://github.com/user-attachments/assets/4900abe7-5a63-496e-94e7-593b89c23272">

> semantic synthesis task
> input : semantic map
> 
> autoencoder (VQ-reg, f = 4)
> 256 x 256 random crops of 384 x 384 images are used for training
> the 256 x 256 semantic map is downsampled and concatenated to the latent representation to train the LDM
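>
> A shape-level sketch of concat conditioning, assuming (C, H, W) nested lists, nearest-neighbour downsampling of the semantic map, and hypothetical channel counts (3 latent channels, 1 conditioning channel).

```python
def downsample_nearest(channel_map, f):
    """Keep every f-th pixel so the map matches the latent resolution (assumed scheme)."""
    return [row[::f] for row in channel_map[::f]]

def concat_channels(latent, cond):
    """Concatenate two (C, H, W) nested lists along the channel axis."""
    if len(latent[0]) != len(cond[0]) or len(latent[0][0]) != len(cond[0][0]):
        raise ValueError("spatial sizes must match before concatenation")
    return latent + cond

f = 4
size = 256
latent = [[[0.0] * (size // f) for _ in range(size // f)] for _ in range(3)]  # (3, 64, 64)
semantic_map = [[[1.0] * size for _ in range(size)]]                           # (1, 256, 256)
cond = [downsample_nearest(ch, f) for ch in semantic_map]                      # (1, 64, 64)
unet_input = concat_channels(latent, cond)                                     # (4, 64, 64)
```

> Unlike cross-attention, this keeps the condition spatially aligned with the latent, which suits image-to-image tasks.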

</details>



<details><summary>conditional LDM using concat - low-resolution image</summary>

> super-resolution task
> input : low-resolution image

> LDM-SR
> autoencoder (VQ-reg, f = 4)
> image degradation : bicubic interpolation with 4x downsampling
> the low-resolution image is concatenated to the latent representation to train the LDM

> LDM-BSR
> LDM-SR does not generalize to preprocessing other than bicubic downsampling
> To super-resolve a wide range of real-world images, downsampling is done with the BSR-degradation process
> BSR-degradation process : applies (JPEG compression noise, camera sensor noise, different image interpolations for downsampling, Gaussian blur kernels, Gaussian noise) in a random order to an image

<img width="49%" alt="image" src="https://github.com/user-attachments/assets/90baa3d0-8463-4a0d-9cfb-125a94496805">
<img width="50%" alt="image" src="https://github.com/user-attachments/assets/90c9fc92-4937-4fbe-9f9a-88d8027c2a85">

> `Table 5`
> LDM-SR shows competitive performance on FID, IS
> PSNR and SSIM scores are not good
> PSNR and SSIM favor blurriness over imperfectly aligned high frequency details, so they are poorly aligned with human perception
> PSNR and SSIM can be improved with post-hoc image guidance


<img width="100%" alt="image" src="https://github.com/user-attachments/assets/43e497a9-9663-4b0e-a517-499b128082ac">

> `Figure 18`
> LDM-BSR produces images much sharper than LDM-SR, making it suitable for real-world applications

</details>



<details><summary>conditional LDM using concat - masked image</summary>

> inpainting task
> input : image with masked regions
> 
> train on 256 x 256 image, sample 512 x 512 image

<img width="49%" alt="image" src="https://github.com/user-attachments/assets/fb1d1eb4-323e-42e1-87ec-787f622a0075">
<img width="50%" alt="image" src="https://github.com/user-attachments/assets/eb287abb-4e19-42fc-bb04-8fa29aa6fb61">

> `Table 7`
> our model with attention improves the overall image quality as measured by FID
> 
> One might expect a larger LDM to perform better
> Yet the big LDM performs worse than the base LDM
> why? → the authors suspect the additional attention modules
> (discrepancy in the quality of samples produced at resolutions 256 x 256 and 512 x 512, which we hypothesize to be caused by the additional attention modules)
> Personal guess : perhaps this comes from overfitting to 256 x 256, the resolution seen during training
> 
> The authors overcome this by fine-tuning the big LDM at 512 x 512 for half an epoch
> why effective? → allows the model to adjust to the new feature statistics

</details>


## 3. Limitations

- sequential sampling process
  computational requirements are lower than for pixel-based DMs
  but it is still slower than GANs because of the sequential sampling process
- using autoencoder model
  the image quality loss of the f = 4 autoencoding models is very small
  but it can become a bottleneck for tasks that require fine-grained accuracy in pixel space
  i.e., the authors note that LDMs may not be suitable for tasks that require high precision
  (we assume that our super-resolution models are already somewhat limited in this respect)
