[2023 NeurIPS] SEGA: Instructing Text-to-Image Models using Semantic Guidance #229

Text-to-image diffusion models perform well, but they rarely produce the result the user wants on the first try.

How can users obtain the results they want?

The two most obvious approaches are running multiple iterations and changing the input prompt.
Running multiple iterations → impossible to predict when the desired result will appear
Changing the input prompt → small changes to the text prompt often lead to entirely different images

Editing a reference image instead seems more desirable.

Previous works (inpainting, fine-tuning, embedding optimization, ...)
→ require a semantic mask, change the reference image substantially, cannot edit diverse concepts, require training, ...

The authors propose Semantic Guidance (SEGA), which overcomes these limitations of prior work.

The paper's four main contributions are:

  1. They propose SEGA, which enables sophisticated semantic control.
    SEGA derives semantic guidance directions from the noise estimates, similar in spirit to classifier-free guidance.
    no additional training, no extensions to the architecture, no external guidance
    calculated within a single forward pass, architecture-agnostic
    (compatible with latent and pixel-based diffusion models)
  2. The semantic vectors obtained with SEGA have four properties:
    robustness : can incorporate arbitrary concepts into the original image
    uniqueness : can be calculated once and subsequently applied to other images
    monotonicity : the edit scales monotonically with the guidance strength
    isolation : different vectors do not interfere with each other
  3. SEGA performs well across a wide range of tasks.
  4. SEGA helps us better understand diffusion models:
    How does the model represent abstract concepts?
    How does their interpretation reflect on the generated image?

Below is a brief summary of the parts I consider most important.

1. Semantic Guidance

Semantic Guidance on Concepts

Intuition

Figure 1
The latent space can be viewed as a composition of arbitrary sub-spaces, each representing a semantic concept.
ex. king - male + female = queen

Isolating Semantics in Diffusion

How can we obtain a semantic vector for an arbitrary semantic concept?
The most obvious candidate is classifier-free guidance.

Figure 2
$\epsilon_{\theta}(z_t, c_e) - \epsilon_{\theta}(z_t)$
Using classifier-free guidance, we can obtain a latent vector for a concept description $e$.
The numerical values of this latent vector follow a Gaussian distribution.
Dimensions with large magnitudes can be regarded as highly related to concept $e$.
In other words, the latent dimensions in the upper and lower tails of the Gaussian distribution are the ones that encode concept $e$.

The authors empirically confirm that using only 1-5% of the $\epsilon$-estimate's dimensions suffices to change the image as desired.
They call this space of sparse noise-estimate vectors the semantic space.

SEGA thus differs from classifier-free guidance in that it does not use all dimensions of the noise estimate.
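
The tail-thresholding step above can be sketched numerically. This is a minimal NumPy illustration, not the authors' code: the arrays stand in for the model's noise estimates, and all names are mine.

```python
import numpy as np

# Illustrative stand-ins for eps_theta(z_t) and eps_theta(z_t, c_e);
# real noise estimates would come from the diffusion model's U-Net.
rng = np.random.default_rng(0)
eps_uncond = rng.normal(size=(4, 64, 64))
eps_concept = rng.normal(size=(4, 64, 64))

def semantic_vector(eps_uncond, eps_concept, lam=0.95):
    """Keep only the tail dimensions of the guidance difference.

    lam is the quantile threshold: lam=0.95 keeps roughly the top 5%
    of dimensions by absolute magnitude and zeroes out the rest.
    """
    psi = eps_concept - eps_uncond
    cutoff = np.quantile(np.abs(psi), lam)
    return np.where(np.abs(psi) >= cutoff, psi, 0.0)

gamma = semantic_vector(eps_uncond, eps_concept, lam=0.95)
sparsity = np.mean(gamma != 0)  # roughly 0.05: only ~5% of dimensions act
```

Because the difference values are roughly Gaussian, a quantile cut on the absolute value is exactly the upper/lower-tail selection described above.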

One Direction

Semantic guidance extends the classifier-free guidance estimate with an additional term $\gamma(z_t, c_e)$ built from the semantic space:

$\bar{\epsilon}_{\theta}(z_t, c_p, c_e) = \epsilon_{\theta}(z_t) + s_g \left( \epsilon_{\theta}(z_t, c_p) - \epsilon_{\theta}(z_t) \right) + \gamma(z_t, c_e)$

where $\gamma$ is the difference $\epsilon_{\theta}(z_t, c_e) - \epsilon_{\theta}(z_t)$ scaled by $s_e$ on the tail dimensions selected by the threshold $\lambda$ and zeroed elsewhere.
Given a prompt $p$ and a concept $e$, images are generated with this modified noise estimate.

Two parameters are introduced to control the diffusion process more finely
→ a warm-up parameter $\delta$ and momentum $\nu_t$

No semantic guidance is applied during the warm-up timesteps → controls how strongly the image is edited.
Momentum reinforces dimensions that recur across many timesteps.
(Momentum accumulates from the very first step, regardless of warm-up.)
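
The warm-up/momentum interplay can be sketched as follows. This is an illustrative schedule with made-up default values (`delta`, `beta`, `s_m` are my names, not the paper's implementation):

```python
import numpy as np

def guided_term(gamma_t, nu_t, t, delta=10, beta=0.6, s_m=0.3):
    """Return the guidance applied at step t and the updated momentum.

    During warm-up (t < delta) no guidance is applied to the noise
    estimate, but the momentum term nu still accumulates from step 0;
    after warm-up, momentum reinforces dimensions that recur over time.
    """
    applied = np.zeros_like(gamma_t) if t < delta else gamma_t + s_m * nu_t
    nu_next = beta * nu_t + (1.0 - beta) * gamma_t  # updates regardless of warm-up
    return applied, nu_next

# During warm-up nothing is applied, yet momentum already moves:
applied0, nu0 = guided_term(np.ones(8), np.zeros(8), t=0)
# After warm-up the accumulated momentum is added to the guidance:
applied15, _ = guided_term(np.ones(8), nu0, t=15)
```

The key design point is visible in the code: the momentum update runs on every step, while the gate `t < delta` only controls whether the term reaches the noise estimate.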

Beyond One Direction

Equation 10
Semantic guidance for multiple concepts $e_i$ is computed as the sum of the per-concept guidance terms.
This enables image generation that reflects multiple concepts at once.
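
A hedged NumPy sketch of summing per-concept terms, in the spirit of Equation 10 (the concept arrays, scales, and names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
eps_uncond = rng.normal(size=4096)  # stand-in for eps_theta(z_t)

def sparse_psi(eps_c, eps_uncond, lam=0.95):
    """Sparse semantic vector for one concept (top-tail dimensions only)."""
    psi = eps_c - eps_uncond
    cutoff = np.quantile(np.abs(psi), lam)
    return np.where(np.abs(psi) >= cutoff, psi, 0.0)

# Each concept gets its own conditioned estimate, scale, and direction;
# a negative scale corresponds to guiding away from the concept.
concepts = {
    "glasses": (rng.normal(size=4096), +6.0),
    "beard":   (rng.normal(size=4096), -6.0),
}

gamma_total = sum(s * sparse_psi(eps_c, eps_uncond)
                  for eps_c, s in concepts.values())

# Isolation in action: each sparse vector touches ~5% of dimensions,
# so two independent concepts rarely collide.
masks = [sparse_psi(eps_c, eps_uncond) != 0 for eps_c, _ in concepts.values()]
overlap = np.mean(masks[0] & masks[1])
```

This also previews the isolation property discussed below: because each vector is ~95% zeros, the summed terms barely interact.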

Properties of Semantic Space (Figures 3 and 4)

Robustness (Figure 3 (a))
SEGA incorporates arbitrary concepts well into the original image.
The target concept is integrated properly even without specifying where in the original image it should go.

Uniqueness (Figure 3 (b))
A semantic guidance vector computed for one concept can be applied to other images.
ex. the glasses guidance vector obtained from the left-most image can be applied to other face images

Since the $\epsilon$-estimates vary strongly with the initial noise latent, transfer only works when the initial seed is the same.
Transfer also fails when the change in image composition is too large
(ex. human face → animal or inanimate object)

Monotonicity (Figure 3 (c))
The larger the strength of the semantic guidance vector, the larger the magnitude of the semantic concept in the image.
(ex. scaling up the smile guidance increases the strength of the smile in the image)

Isolation (Figure 4)
Because each concept vector uses only a tiny fraction of the dimensions, different concepts remain largely isolated.
That is, different concept vectors do not interfere with each other
→ enables image generation with multiple concepts applied

2. Experimental Evaluation

Empirical Results

Table 1 - user study
For positive guidance, SEGA faithfully adds the target concept to the image.
Two outliers: bangs and bald.
bangs : low rate of annotator consensus → the authors assume non-native English-speaking annotators are unfamiliar with the term bangs
bald : long hair often makes up a large portion of a portrait and thus requires more substantial changes to the image → requires stronger hyperparameters

Negative guidance, which removes existing attributes from an image, works similarly well.
Guiding away from beard, bald, or gray hair usually resulted in a substantial reduction of the respective feature but failed to remove it entirely
→ the hyperparameters were probably not strong enough

Table 2 - user study
The per-attribute success rate remains similar with four distinct edit concepts instead of one
→ demonstrates the isolation of semantic guidance vectors

To assess the potential influence of SEGA manipulations on overall image quality, the authors calculate FID scores against the FFHQ reference dataset:
Stable Diffusion : 117.73 (small artifacts often present in facial images)
Stable Diffusion + SEGA : 59.86 (the additional guidance signal frequently removed uncanny artifacts, resulting in overall better quality)

Table 3 - inappropriate-image-prompts (I2P) benchmark
SEGA suppresses inappropriate content by guiding the generation away from the inappropriate concepts.
SEGA performs strong mitigation at inference time for both architectures, further highlighting the capabilities and versatility of the approach.

Comparisons

Table 4 - user study
Four manipulation categories: composition of multiple edits, minor changes, style transfer, and removal of specific objects from a scene.
SEGA clearly outperforms Prompt2Prompt and Disentanglement on all examined editing tasks.
Compared to Composable Diffusion, SEGA again has significantly higher success rates for multi-conditioning and minor changes, while achieving comparable performance for style transfer and object removal.

Faithfulness to the original image composition:
SEGA : 83.33%
Composable Diffusion : 13.33%
→ SEGA is generally preferred over other methods in both edit capabilities and perceived fidelity

Qualitative Results

Figure 6
For style transfer, the entire image has to change while the image composition stays the same
→ requires a slightly lower threshold of $\lambda \approx 0.9$

SEGA faithfully applies the styles of famous artists, as well as artistic epochs and drawing techniques.
Changing the prompt instead significantly alters the image composition
→ highlights the advantage of semantic control, which allows versatile yet robust changes

3. Appendix

Numerical Properties of Noise Estimates

Figure 7
The numerical values of the unconditioned estimate $\epsilon_{\theta}(z_t)$, the text-conditioned estimate $\epsilon_{\theta}(z_t, c_p)$, and the edit-conditioned estimate $\epsilon_{\theta}(z_t, c_e)$ all follow a Gaussian distribution.
This is likely because the model is trained to estimate Gaussian-sampled noise.

Further intuition and ablations on hyperparameters (Figure 9)

Threshold $\lambda$ (Figure 9)
SEGA automatically identifies the regions of the image relevant to a concept.
The threshold controls how much of the image counts as relevant:
values close to 1 leave the image almost unchanged, while lower values alter larger regions.

$\lambda \geq 0.95$ → suitable for most image edits
$\lambda \in [0.8, 0.9]$ → suitable when the entire image must change, such as style transfer
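
Since the noise-estimate values are roughly Gaussian (Figure 7), $\lambda$ maps almost directly to the fraction of dimensions an edit may touch. A quick numerical check (illustrative, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)
psi = rng.normal(size=100_000)  # guidance difference; ~Gaussian per Figure 7

def edited_fraction(psi, lam):
    """Fraction of dimensions surviving the quantile threshold lam."""
    cutoff = np.quantile(np.abs(psi), lam)
    return np.mean(np.abs(psi) >= cutoff)

fractions = {lam: edited_fraction(psi, lam) for lam in (0.95, 0.9, 0.8)}
# edited fraction is roughly 1 - lam: about 5%, 10%, and 20% of dimensions
```

This makes the two recommendations above concrete: $\lambda \geq 0.95$ confines the edit to a few percent of dimensions, while $\lambda \in [0.8, 0.9]$ frees up enough of the estimate to restyle the whole image.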

Scale $s_e$
The larger the scale, the larger the magnitude of the concept's expression.
The scale behaves very robustly over a large range of values
(values around 3 or 4 are sufficient for delicate changes).

Interesting point: the scale is less susceptible to producing image artifacts at high values
(images can be generated with values of 20+ without quality degradation).

Warm-up $\delta$
The overall image composition is largely generated in the early diffusion steps, with later ones only refining smaller details.
The number of warm-up steps can therefore steer the granularity of compositional changes.

To change details but not composition (ex. style transfer) → $\delta \geq 15$
For compositional editing → $\delta \leq 5$

Momentum
Most edits produce satisfactory results without momentum, but image fidelity can improve further when it is used.
Its main use case is in combination with warm-up:
higher momentum enables longer warm-up periods together with more radical changes.

Figure 11 - interplay between warm-up and semantic guidance scale
Figure 12 - interplay between warm-up and threshold
Figure 13 - interplay between threshold and semantic guidance scale
Figure 14 - interplay between the momentum scale and momentum

Further Examples of Semantic Space's Properties

Figure 8 - uniqueness of guidance vectors
The guidance vector for glasses is successfully applied to three images and fails to add glasses to three other images.
The unsuccessfully edited images on the right retain the same image fidelity
→ SEGA's guidance vectors do not introduce artifacts
(only the concept-vector transfer that uniqueness refers to fails; applying SEGA directly to those images still works)

Figure 15
Multiple concept vectors scale monotonically and independently of each other.

User Study

Why do a user study for the empirical evaluation?
A classifier trained on CelebA is not reliably accurate on the synthetic images.

Figure 16
SEGA edits remove artifacts and improve the overall uncanny feel of the images.

Comparisons to related methods

Figure 19, 20, 21 - comparisons with Composable Diffusion, Prompt2Prompt, Disentanglement

Figure 22 - comparison with Negative Prompting
Negative prompting offers no control over the strength of the negative guidance term.
It often leads to unnecessarily large changes to the image.
It uses a single prompt for multiple concepts → one of the concepts may not be removed in the generated image.

Interpreting the Diffusion Model

Figure 18
Uncovering how the underlying DM interprets more complex concepts:
adding the concept carbon emissions → a much older vehicle
reducing the concept safety → a convertible with no roof and likely increased horsepower
→ demonstrates deeper natural-language and image understanding that goes beyond descriptive image captions
