[2023 NeurIPS] SEGA: Instructing Text-to-Image Models using Semantic Guidance #229

Text-to-image diffusion models perform well, but they rarely produce the result the user wants on the first try.

How can users obtain the results they want?

The two most obvious approaches are running multiple iterations and changing the input prompt.
Running multiple iterations → impossible to predict when the desired result will appear
Changing the input prompt → small changes to the text prompt often lead to entirely different images

Editing a reference image instead seems more desirable.

Previous works (inpainting, fine-tuning, embedding optimization, ...)
→ require a semantic mask, change the reference image substantially, cannot edit diverse concepts, require training, ...

The authors propose Semantic Guidance (SEGA), which overcomes these limitations of prior work.

The paper's four main contributions are:

  1. They propose SEGA, which enables sophisticated semantic control.
    SEGA derives semantic guidance directions from the noise estimates, similar in spirit to classifier-free guidance.
    no additional training, no extensions to the architecture, no external guidance
    calculated within a single forward pass, architecture-agnostic
    (compatible with latent and pixel-based diffusion models)
  2. The semantic vectors obtained with SEGA have four properties:
    robustness : can incorporate arbitrary concepts into the original image
    uniqueness : can be calculated once and subsequently applied to other images
    monotonicity : the edit scales monotonically with the guidance strength
    isolation : different vectors do not interfere with each other
  3. SEGA performs well across a wide range of tasks.
  4. SEGA helps us better understand diffusion models:
    How does the model represent abstract concepts?
    How does their interpretation reflect on the generated image?

Below is a brief summary of the parts I consider most important.

1. Semantic Guidance

Semantic Guidance on Concepts

Intuition

Figure 1
The latent space can be viewed as a composition of arbitrary sub-spaces, each representing a semantic concept.
ex. king - male + female = queen

Isolating Semantics in Diffusion

How can we obtain a semantic vector for an arbitrary semantic concept?
The most obvious candidate is classifier-free guidance.

Figure 2
$\epsilon_{\theta}(z_t, c_e) - \epsilon_{\theta}(z_t)$
Using classifier-free guidance, we can obtain a latent vector for a concept description $e$.
The numerical values of this latent vector follow a Gaussian distribution.
Dimensions with large magnitudes can be regarded as highly related to concept $e$.
In other words, the latent dimensions in the upper and lower tails of the Gaussian distribution are the ones that encode concept $e$.

The authors empirically confirm that using only 1-5% of the $\epsilon$-estimate's dimensions suffices to change the image as desired.
They call this space of sparse noise-estimate vectors the semantic space.

SEGA thus differs from classifier-free guidance in that it does not use all dimensions of the noise estimate.
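
The tail-thresholding step above can be sketched numerically. This is a minimal NumPy illustration, not the authors' code: the arrays stand in for the model's noise estimates, and all names are mine.

```python
import numpy as np

# Illustrative stand-ins for eps_theta(z_t) and eps_theta(z_t, c_e);
# real noise estimates would come from the diffusion model's U-Net.
rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 64, 64))
eps_concept = rng.normal(size=(4, 64, 64))

def semantic_vector(eps_uncond, eps_concept, lam=0.95):
    """Keep only the tail dimensions of the guidance difference.

    lam is the quantile threshold: lam=0.95 keeps roughly the top 5%
    of dimensions by absolute magnitude and zeroes out the rest.
    """
    psi = eps_concept - eps_uncond
    cutoff = np.quantile(np.abs(psi), lam)
    return np.where(np.abs(psi) >= cutoff, psi, 0.0)

gamma = semantic_vector(eps_uncond, eps_concept, lam=0.95)
sparsity = np.mean(gamma != 0)  # roughly 0.05: only ~5% of dimensions act
```

Because the difference values are roughly Gaussian, a quantile cut on the absolute value is exactly the upper/lower-tail selection described above.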

One Direction

Semantic guidance extends the classifier-free guidance estimate with an additional term $\gamma(z_t, c_e)$ built from the semantic space:

$\bar{\epsilon}_{\theta}(z_t, c_p, c_e) = \epsilon_{\theta}(z_t) + s_g \left( \epsilon_{\theta}(z_t, c_p) - \epsilon_{\theta}(z_t) \right) + \gamma(z_t, c_e)$

where $\gamma$ is the difference $\epsilon_{\theta}(z_t, c_e) - \epsilon_{\theta}(z_t)$ scaled by $s_e$ on the tail dimensions selected by the threshold $\lambda$ and zeroed elsewhere.
Given a prompt $p$ and a concept $e$, images are generated with this modified noise estimate.

Two parameters are introduced to control the diffusion process more finely
→ a warm-up parameter $\delta$ and momentum $\nu_t$

No semantic guidance is applied during the warm-up timesteps → controls how strongly the image is edited.
Momentum reinforces dimensions that recur across many timesteps.
(Momentum accumulates from the very first step, regardless of warm-up.)
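
The warm-up/momentum interplay can be sketched as follows. This is an illustrative schedule with made-up default values (`delta`, `beta`, `s_m` are my names, not the paper's implementation):

```python
import numpy as np

def guided_term(gamma_t, nu_t, t, delta=10, beta=0.6, s_m=0.3):
    """Return the guidance applied at step t and the updated momentum.

    During warm-up (t < delta) no guidance is applied to the noise
    estimate, but the momentum term nu still accumulates from step 0;
    after warm-up, momentum reinforces dimensions that recur over time.
    """
    applied = np.zeros_like(gamma_t) if t < delta else gamma_t + s_m * nu_t
    nu_next = beta * nu_t + (1.0 - beta) * gamma_t  # updates regardless of warm-up
    return applied, nu_next

# During warm-up nothing is applied, yet momentum already moves:
applied0, nu0 = guided_term(np.ones(8), np.zeros(8), t=0)
# After warm-up the accumulated momentum is added to the guidance:
applied15, _ = guided_term(np.ones(8), nu0, t=15)
```

The key design point is visible in the code: the momentum update runs on every step, while the gate `t < delta` only controls whether the term reaches the noise estimate.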

Beyond One Direction

Equation 10
Semantic guidance for multiple concepts $e_i$ is computed as the sum of the per-concept guidance terms.
This enables image generation that reflects multiple concepts at once.
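
A hedged NumPy sketch of summing per-concept terms, in the spirit of Equation 10 (the concept arrays, scales, and names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
eps_uncond = rng.normal(size=4096)  # stand-in for eps_theta(z_t)

def sparse_psi(eps_c, eps_uncond, lam=0.95):
    """Sparse semantic vector for one concept (top-tail dimensions only)."""
    psi = eps_c - eps_uncond
    cutoff = np.quantile(np.abs(psi), lam)
    return np.where(np.abs(psi) >= cutoff, psi, 0.0)

# Each concept gets its own conditioned estimate, scale, and direction;
# a negative scale corresponds to guiding away from the concept.
concepts = {
    "glasses": (rng.normal(size=4096), +6.0),
    "beard":   (rng.normal(size=4096), -6.0),
}

gamma_total = sum(s * sparse_psi(eps_c, eps_uncond)
                  for eps_c, s in concepts.values())

# Isolation in action: each sparse vector touches ~5% of dimensions,
# so two independent concepts rarely collide.
masks = [sparse_psi(eps_c, eps_uncond) != 0 for eps_c, _ in concepts.values()]
overlap = np.mean(masks[0] & masks[1])
```

This also previews the isolation property discussed below: because each vector is ~95% zeros, the summed terms barely interact.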

Properties of Semantic Space (Figures 3 and 4)

Robustness (Figure 3 (a))
SEGA incorporates arbitrary concepts well into the original image.
The target concept is integrated properly even without specifying where in the original image it should go.

Uniqueness (Figure 3 (b))
A semantic guidance vector computed for one concept can be applied to other images.
ex. the glasses guidance vector obtained from the left-most image can be applied to other face images

Since the $\epsilon$-estimates vary strongly with the initial noise latent, transfer only works when the initial seed is the same.
Transfer also fails when the change in image composition is too large
(ex. human face → animal or inanimate object)

Monotonicity (Figure 3 (c))
The larger the strength of the semantic guidance vector, the larger the magnitude of the semantic concept in the image.
(ex. scaling up the smile guidance increases the strength of the smile in the image)

Isolation (Figure 4)
Because each concept vector uses only a tiny fraction of the dimensions, different concepts remain largely isolated.
That is, different concept vectors do not interfere with each other
→ enables image generation with multiple concepts applied

2. Experimental Evaluation

Empirical Results

Table 1 - user study
For positive guidance, SEGA faithfully adds the target concept to the image.
Two outliers: bangs and bald.
bangs : low rate of annotator consensus → the authors assume non-native English-speaking annotators are unfamiliar with the term bangs
bald : long hair often makes up a large portion of a portrait and thus requires more substantial changes to the image → requires stronger hyperparameters

Negative guidance, which removes existing attributes from an image, works similarly well.
Guiding away from beard, bald, or gray hair usually resulted in a substantial reduction of the respective feature but failed to remove it entirely
→ the hyperparameters were probably not strong enough

Table 2 - user study
The per-attribute success rate remains similar with four distinct edit concepts instead of one
→ demonstrates the isolation of semantic guidance vectors

To assess the potential influence of SEGA manipulations on overall image quality, the authors calculate FID scores against the FFHQ reference dataset:
Stable Diffusion : 117.73 (small artifacts often present in facial images)
Stable Diffusion + SEGA : 59.86 (the additional guidance signal frequently removed uncanny artifacts, resulting in overall better quality)

Table 3 - inappropriate-image-prompts (I2P) benchmark
SEGA suppresses inappropriate content by guiding the generation away from the inappropriate concepts.
SEGA performs strong mitigation at inference time for both architectures, further highlighting the capabilities and versatility of the approach.

Comparisons

Table 4 - user study
Four manipulation categories: composition of multiple edits, minor changes, style transfer, and removal of specific objects from a scene.
SEGA clearly outperforms Prompt2Prompt and Disentanglement on all examined editing tasks.
Compared to Composable Diffusion, SEGA again has significantly higher success rates for multi-conditioning and minor changes, while achieving comparable performance for style transfer and object removal.

Faithfulness to the original image composition:
SEGA : 83.33%
Composable Diffusion : 13.33%
→ SEGA is generally preferred over other methods in both edit capabilities and perceived fidelity

Qualitative Results

Figure 6
For style transfer, the entire image has to change while the image composition stays the same
→ requires a slightly lower threshold of $\lambda \approx 0.9$

SEGA faithfully applies the styles of famous artists, as well as artistic epochs and drawing techniques.
Changing the prompt instead significantly alters the image composition
→ highlights the advantage of semantic control, which allows versatile yet robust changes

3. Appendix

Numerical Properties of Noise Estimates

Figure 7
The numerical values of the unconditioned estimate $\epsilon_{\theta}(z_t)$, the text-conditioned estimate $\epsilon_{\theta}(z_t, c_p)$, and the edit-conditioned estimate $\epsilon_{\theta}(z_t, c_e)$ all follow a Gaussian distribution.
This is likely because the model is trained to estimate Gaussian-sampled noise.

Further intuition and ablations on hyperparameters (Figure 9)

Threshold $\lambda$ (Figure 9)
SEGA automatically identifies the regions of the image relevant to a concept.
The threshold controls how much of the image counts as relevant:
values close to 1 leave the image almost unchanged, while lower values alter larger regions.

$\lambda \geq 0.95$ → suitable for most image edits
$\lambda \in [0.8, 0.9]$ → suitable when the entire image must change, such as style transfer
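
Since the noise-estimate values are roughly Gaussian (Figure 7), $\lambda$ maps almost directly to the fraction of dimensions an edit may touch. A quick numerical check (illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)
psi = rng.normal(size=100_000)  # guidance difference; ~Gaussian per Figure 7

def edited_fraction(psi, lam):
    """Fraction of dimensions surviving the quantile threshold lam."""
    cutoff = np.quantile(np.abs(psi), lam)
    return np.mean(np.abs(psi) >= cutoff)

fractions = {lam: edited_fraction(psi, lam) for lam in (0.95, 0.9, 0.8)}
# edited fraction is roughly 1 - lam: about 5%, 10%, and 20% of dimensions
```

This makes the two recommendations above concrete: $\lambda \geq 0.95$ confines the edit to a few percent of dimensions, while $\lambda \in [0.8, 0.9]$ frees up enough of the estimate to restyle the whole image.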

Scale $s_e$
The larger the scale, the larger the magnitude of the concept's expression.
The scale behaves very robustly over a large range of values
(values around 3 or 4 are sufficient for delicate changes).

Interesting point: the scale is less susceptible to producing image artifacts at high values
(images can be generated with values of 20+ without quality degradation).

Warm-up $\delta$
The overall image composition is largely generated in the early diffusion steps, with later ones only refining smaller details.
The number of warm-up steps can therefore steer the granularity of compositional changes.

To change details but not composition (ex. style transfer) → $\delta \geq 15$
For compositional editing → $\delta \leq 5$

Momentum
Most edits produce satisfactory results without momentum, but image fidelity can improve further when it is used.
Its main use case is in combination with warm-up:
higher momentum enables longer warm-up periods together with more radical changes.

Figure 11 - interplay between warm-up and semantic guidance scale
Figure 12 - interplay between warm-up and threshold
Figure 13 - interplay between threshold and semantic guidance scale
Figure 14 - interplay between the momentum scale and momentum

Further Examples of Semantic Space's Properties

Figure 8 - uniqueness of guidance vectors
The guidance vector for glasses is successfully applied to three images and fails to add glasses to three other images.
The unsuccessfully edited images on the right retain the same image fidelity
→ SEGA's guidance vectors do not introduce artifacts
(only the concept-vector transfer that uniqueness refers to fails; applying SEGA directly to those images still works)

Figure 15
Multiple concept vectors scale monotonically and independently of each other.

User Study

Why do a user study for the empirical evaluation?
A classifier trained on CelebA is not reliably accurate on the synthetic images.

Figure 16
SEGA edits remove artifacts and improve the overall uncanny feel of the images.

Comparisons to related methods

Figure 19, 20, 21 - comparisons with Composable Diffusion, Prompt2Prompt, Disentanglement

Figure 22 - comparison with Negative Prompting
Negative prompting offers no control over the strength of the negative guidance term.
It often leads to unnecessarily large changes to the image.
It uses a single prompt for multiple concepts → one of the concepts may not be removed in the generated image.

Interpreting the Diffusion Model

Figure 18
Uncovering how the underlying DM interprets more complex concepts:
adding the concept carbon emissions → a much older vehicle
reducing the concept safety → a convertible with no roof and likely increased horsepower
→ demonstrates deeper natural-language and image understanding that goes beyond descriptive image captions
