ControlNet

Created

2024/07/21 00:53

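Introduction

Last time, we looked at how a Diffusion Model generates images. It is a remarkable technology! This time, I would like to introduce an even more interesting application of Diffusion Models.

As Diffusion Models keep improving, they allow increasingly precise, targeted edits. Take a look at the images below.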

Image Reference Github: fondant-usecase

DCI-VTON-Virtual-Try-On

Applications

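Let's look at how Diffusion Models can be used in interior design and virtual try-on. We can specify a particular region or goal and have the model change just that region. In virtual try-on, that means changing the clothes a model is wearing. It also makes "personalized virtual try-on" possible: a user provides their own avatar or photo and previews how a garment would look on their body. One drawback of online shopping compared with offline shopping has been that customers cannot tell whether clothes will fit them. This technology helps overcome that limitation.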

Metown Interview

Metown is also working to solve this problem with 3D virtual try-on (we will save the discussion of 3D technology for another time).

Paper and Model Overview

The paper we review today is "Adding Conditional Control to Text-to-Image Diffusion Models", which introduces ControlNet.

As the title suggests, this work lets us "control" a Diffusion Model, which is what makes the examples above possible. ControlNet specializes an existing Stable Diffusion model for a particular task. Beyond the initial text prompt, we can now supply additional inputs and edit images with much more freedom.

That is the main motivation and the key takeaway of the model.

It is a really cool technique! Let's now look at the paper in detail.

Model Architecture

“2023/0/14 - released ControlNet 1.1.”

Looking at the ControlNet architecture, we can see that the original Stable Diffusion model is frozen. An external network then feeds into the main model, letting it incorporate modifications or additional information.

Following Koiboi's review (YouTube), this architecture can be viewed as a kind of hypernetwork: a smaller network built on top of an existing, powerful model to influence its behavior.

“HyperNetwork is an approach that originated in the Natural Language Processing (NLP) community [25], with the aim of training a small recurrent neural network to influence the weights of a larger one. It has been applied to image generation with generative adversarial networks (GANs) [4, 18]. Heathen et al. [26] and Kurumuz [43] implement HyperNetworks for Stable Diffusion [72] to change the artistic style of its output images”

- page 2 of ControlNet

Hypernetworks in Brief

“Hypernetwork is a fine-tuning technique developed by Novel AI, an early adopter of Stable Diffusion. It is a small neural network attached to a Stable Diffusion model to modify its style.”

The upper image shows the cross-attention module of the Stable Diffusion model, and the lower image shows the hypernetwork design, which adds extra layers that modify the keys and values.

Reference: Stable Diffusion Art Blog

Math Review

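Unlike the previous papers, which relied heavily on math, this work focuses on a new idea, integrating an external network into Stable Diffusion, so it contains relatively few formulas. Below are the main training mechanism and the mathematical notation it introduces:

As described in the paper, F(x;Θ) is a trained network with parameters Θ that transforms an input feature map x into another feature map y. In the context of image generation models, x and y are 2D feature maps of shape (h, w, c): height, width, and number of channels.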

ControlNet Layer

As mentioned above, the core of ControlNet is adding an external network on top of the frozen layers of Stable Diffusion. Θ denotes the parameters of the original block, and Θc denotes the parameters of the trainable clone that is added. Here F(x;Θ) represents the original (frozen) Stable Diffusion block.

The zero convolution layers, written Z(⋅;⋅) in the paper, are special 1 × 1 convolution layers whose weights and biases are both initialized to zero. Two of these zero convolution layers connect the trainable clone to the original Stable Diffusion block and produce yc, the output of the ControlNet block.
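Written out, the two definitions above combine as follows. This is a restatement in LaTeX using the paper's symbols, where c is the conditioning input and Θz1, Θz2 are the parameters of the two zero convolutions:

```latex
y = \mathcal{F}(x; \Theta)

y_c = \mathcal{F}(x; \Theta)
    + \mathcal{Z}\big(\mathcal{F}(x + \mathcal{Z}(c; \Theta_{z1}); \Theta_c); \Theta_{z2}\big)
```

The first term is the frozen block, and the second term is the trainable clone wrapped between the two zero convolutions.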

ControlNet Layer - Clone Layer

“In the first training step, since both the weight and bias parameters of a zero convolution layer are initialized to zero, both of the Z (·; ·) terms in Equation (2) evaluate to zero, and yc = y. In this way, harmful noise cannot influence the hidden states of the neural network layers in the trainable copy when the training starts”
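The claim in this quote can be checked with a small, framework-free sketch. It is not the authors' implementation: feature maps are plain nested lists, `block` is a hypothetical stand-in for F(x;Θ), and the parameter values are arbitrary. It only illustrates that with zero-initialized 1 × 1 convolutions the ControlNet block output equals the frozen block output, whatever the condition is.

```python
# A feature map here is a list of pixels; each pixel is a list of channel values.

def conv1x1(fmap, weight, bias):
    """1x1 convolution: the same linear map over channels, applied at every pixel."""
    return [
        [sum(w * v for w, v in zip(row, pixel)) + b for row, b in zip(weight, bias)]
        for pixel in fmap
    ]

def zero_conv_params(channels):
    """Weights and biases of a zero convolution: everything starts at 0."""
    return [[0.0] * channels for _ in range(channels)], [0.0] * channels

def block(fmap, scale):
    """Hypothetical stand-in for F(x; theta): a simple nonlinear per-channel map."""
    return [[max(0.0, scale * v) + 1.0 for v in pixel] for pixel in fmap]

def add(a, b):
    """Element-wise sum of two feature maps of the same shape."""
    return [[u + v for u, v in zip(pa, pb)] for pa, pb in zip(a, b)]

def controlnet_block(x, c, theta, theta_c, z1, z2):
    """y_c = F(x; theta) + Z(F(x + Z(c; z1); theta_c); z2)."""
    y = block(x, theta)                     # frozen original block
    h = add(x, conv1x1(c, *z1))             # condition enters through the first zero conv
    y_copy = block(h, theta_c)              # trainable copy
    return add(y, conv1x1(y_copy, *z2))     # second zero conv gates the copy's output

x = [[0.5, -1.0], [2.0, 0.25]]              # 2 pixels, 2 channels
c = [[9.0, 9.0], [-3.0, 4.0]]               # arbitrary conditioning features
z1, z2 = zero_conv_params(2), zero_conv_params(2)

y = block(x, 1.5)
y_c = controlnet_block(x, c, 1.5, 1.5, z1, z2)
print(y_c == y)
```

Once training moves the zero convolution weights away from zero, the second term starts to contribute and yc departs from y.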

Conditioning

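Looking at the architecture again, the authors explain that the ControlNet structure is applied to each encoder level of the Stable Diffusion U-Net. Specifically, ControlNet creates "a trainable copy of the 12 encoding blocks and 1 middle block of Stable Diffusion."

To add these ControlNet layers to Stable Diffusion, the authors say they "convert each input conditioning image (e.g., edge, pose, depth, etc.) from an input size of 512 × 512 into a 64 × 64 feature space vector that matches the size of Stable Diffusion." This conditioning vector cf is what appears in Equation (4).

This cf can be any of "various conditioning controls, e.g., edges, depth, segmentation, human pose, etc."

Additional note:

E(·): a tiny network of four convolution layers with 4 × 4 kernels and 2 × 2 strides (activated by ReLU, using 16, 32, 64, 128 channels respectively, initialized with Gaussian weights and trained jointly with the full model) that encodes the image-space condition into the feature-space conditioning vector cf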
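The 512 × 512 to 64 × 64 conversion is a factor-of-8 spatial reduction, the same factor that appears as `h // 8` in the `sample_log` excerpt from the official code later in this post. A minimal sketch of that arithmetic follows; the helper name `latent_shape` is hypothetical, and the channel count of 4 is an assumption based on the usual Stable Diffusion latent size rather than something stated in this post.

```python
def latent_shape(channels, height, width, downsample=8):
    """Shape of the latent feature map for an image-space input of the given size."""
    return (channels, height // downsample, width // downsample)

# 4 latent channels is an assumption (the usual Stable Diffusion setting).
shape_512 = latent_shape(4, 512, 512)
shape_768 = latent_shape(4, 768, 512)
print(shape_512, shape_768)
```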

Training

z0: image input

zt: noisy-input

t: time step

ct: text prompt (condition)

cf: task-specific condition

εθ: network predicting the noise added to the noisy image zt

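The authors also mention that they randomly replace 50% of the text prompts ct with empty strings, which strengthens ControlNet's ability to recognize the semantics of the input conditioning image directly, in place of the prompt.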
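A minimal sketch of this prompt dropping is shown below. It is not taken from the official training script: the function name `drop_prompts` is hypothetical, and dropping each prompt independently with probability 0.5 is an assumption about how "50%" is realized.

```python
import random

def drop_prompts(prompts, drop_prob=0.5, rng=None):
    """Replace each prompt with "" independently with probability drop_prob."""
    if rng is None:
        rng = random.Random()
    return ["" if rng.random() < drop_prob else p for p in prompts]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
prompts = ["a house by the lake"] * 10_000
dropped = drop_prompts(prompts, drop_prob=0.5, rng=rng)

# Roughly half of the prompts should now be empty strings.
empty_fraction = sum(p == "" for p in dropped) / len(dropped)
print(round(empty_fraction, 2))
```

When the prompt is empty, the only remaining signal about the image content is the conditioning image, so the network has to learn to read it.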

Ultimately, the training objective minimizes the difference between the actual noise ε added to the original image z0 and the noise εθ predicted by the network.
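Using the symbols listed above, the objective can be written in LaTeX as follows (a restatement of the paper's training loss):

```latex
\mathcal{L} = \mathbb{E}_{z_0,\, t,\, c_t,\, c_f,\, \epsilon \sim \mathcal{N}(0, 1)}
\Big[ \big\| \epsilon - \epsilon_\theta(z_t, t, c_t, c_f) \big\|_2^2 \Big]
```

The only difference from the usual text-conditioned diffusion loss is the extra argument cf passed to εθ.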

Code Snippet

Here are a few excerpts from the official ControlNet implementation repository.

ControlNet Github

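import torch

# Import paths as used in the official repo (cldm/cldm.py)
from ldm.modules.diffusionmodules.openaimodel import UNetModel
from ldm.modules.diffusionmodules.util import timestep_embedding


class ControlledUnetModel(UNetModel):
    def forward(self, x, timesteps=None, context=None, control=None, only_mid_control=False, **kwargs):
        hs = []
        # The original Stable Diffusion encoder and middle block run without gradients (frozen)
        with torch.no_grad():
            t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False)
            emb = self.time_embed(t_emb)
            h = x.type(self.dtype)
            for module in self.input_blocks:
                h = module(h, emb, context)
                hs.append(h)  # keep encoder outputs for the skip connections
            h = self.middle_block(h, emb, context)

        # The last control entry belongs to the middle block
        if control is not None:
            h += control.pop()

        # Decoder: the remaining control entries are added to the skip connections
        for i, module in enumerate(self.output_blocks):
            if only_mid_control or control is None:
                h = torch.cat([h, hs.pop()], dim=1)
            else:
                h = torch.cat([h, hs.pop() + control.pop()], dim=1)
            h = module(h, emb, context)

        h = h.type(x.dtype)
        return self.out(h)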

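We can observe the following:

input x: corresponds to the noisy input zt

t_emb: the time step t is embedded into t_emb

middle block: takes the time embedding and the context as arguments

If control (cf) is provided (i.e., not None), its last entry is added to h after the middle block, and the remaining entries are added to the skip connections passed to the output blocks (unless only_mid_control is set)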
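The order in which `control.pop()` consumes the list is easy to miss, so here is a small sketch that mirrors the list handling in `ControlledUnetModel.forward` with string labels instead of tensors. The labels (`enc0`, `ctrl0`, `ctrl_mid`) and the helper name `merge_controls` are hypothetical, and the sketch assumes control is provided.

```python
def merge_controls(skips, mid, control, only_mid_control=False):
    """Pair control entries with the middle block and the skip connections."""
    skips, control = list(skips), list(control)   # copy, because pop() mutates
    mid = f"{mid}+{control.pop()}"                # last entry goes to the middle block
    decoder_inputs = []
    while skips:
        skip = skips.pop()                        # deepest encoder output first
        if not only_mid_control:
            skip = f"{skip}+{control.pop()}"      # matching control entry
        decoder_inputs.append(skip)
    return mid, decoder_inputs

skips = [f"enc{i}" for i in range(12)]                      # 12 encoder outputs
control = [f"ctrl{i}" for i in range(12)] + ["ctrl_mid"]    # 12 + 1 control outputs

mid, decoder_inputs = merge_controls(skips, "mid", control)
print(mid)
print(decoder_inputs[0], decoder_inputs[-1])
```

Because both lists are popped from the end, each encoder output is paired with the control entry produced at the same depth, which is why the control list has exactly 13 entries.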

class ControlLDM(LatentDiffusion):
    def __init__(self, control_stage_config, control_key, only_mid_control, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # The ControlNet itself is built from its own config
        self.control_model = instantiate_from_config(control_stage_config)
        self.control_key = control_key
        self.only_mid_control = only_mid_control
        # One scale per control output: 12 encoder blocks + 1 middle block
        self.control_scales = [1.0] * 13

    ...

    def apply_model(self, x_noisy, t, cond, *args, **kwargs):
        assert isinstance(cond, dict)
        diffusion_model = self.model.diffusion_model

        cond_txt = torch.cat(cond['c_crossattn'], 1)

        if cond['c_concat'] is None:
            # No conditioning image: plain Stable Diffusion prediction
            eps = diffusion_model(x=x_noisy, timesteps=t, context=cond_txt, control=None, only_mid_control=self.only_mid_control)
        else:
            # Run the ControlNet on the hint, scale its outputs, and pass them to the U-Net
            control = self.control_model(x=x_noisy, hint=torch.cat(cond['c_concat'], 1), timesteps=t, context=cond_txt)
            control = [c * scale for c, scale in zip(control, self.control_scales)]
            eps = diffusion_model(x=x_noisy, timesteps=t, context=cond_txt, control=control, only_mid_control=self.only_mid_control)

        return eps

    ...

    @torch.no_grad()
    def sample_log(self, cond, batch_size, ddim, ddim_steps, **kwargs):
        ddim_sampler = DDIMSampler(self)
        b, c, h, w = cond["c_concat"][0].shape
        # The latent resolution is 1/8 of the conditioning image resolution
        shape = (self.channels, h // 8, w // 8)
        samples, intermediates = ddim_sampler.sample(ddim_steps, batch_size, shape, cond, verbose=False, **kwargs)
        return samples, intermediates

class DDIMSampler(object):
    def __init__(self, model, schedule="linear", **kwargs):
        super().__init__()
        self.model = model
        self.ddpm_num_timesteps = model.num_timesteps
        self.schedule = schedule

    ...

    @torch.no_grad()
    def p_sample_ddim(self, x, c, t, index, repeat_noise=False, use_original_steps=False, quantize_denoised=False,
                      temperature=1., noise_dropout=0., score_corrector=None, corrector_kwargs=None,
                      unconditional_guidance_scale=1., unconditional_conditioning=None,
                      dynamic_threshold=None):
        b, *_, device = *x.shape, x.device

        if unconditional_conditioning is None or unconditional_guidance_scale == 1.:
            # With ControlLDM as the model, this call also runs the control model
            model_output = self.model.apply_model(x, t, c)

        ...

        # direction pointing to x_t
        dir_xt = (1. - a_prev - sigma_t**2).sqrt() * e_t
        noise = sigma_t * noise_like(x.shape, device, repeat_noise) * temperature
        if noise_dropout > 0.:
            noise = torch.nn.functional.dropout(noise, p=noise_dropout)
        x_prev = a_prev.sqrt() * pred_x0 + dir_xt + noise
        return x_prev, pred_x0
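The last lines of `p_sample_ddim` can be illustrated with a scalar sketch of one DDIM update. The `dir_xt` and `x_prev` lines mirror the excerpt; the `pred_x0` line is the standard DDIM estimate of the clean sample and is an assumption about the part elided above, not a quote from it. The function name `ddim_step` and the numbers are hypothetical.

```python
import math

def ddim_step(x_t, e_t, a_t, a_prev, sigma_t=0.0, noise=0.0):
    """One DDIM update on scalars; a_t and a_prev are cumulative alpha values."""
    # Standard DDIM estimate of the clean sample (assumed, see the note above)
    pred_x0 = (x_t - math.sqrt(1.0 - a_t) * e_t) / math.sqrt(a_t)
    # Direction pointing to x_t, as in the excerpt
    dir_xt = math.sqrt(1.0 - a_prev - sigma_t ** 2) * e_t
    x_prev = math.sqrt(a_prev) * pred_x0 + dir_xt + sigma_t * noise
    return x_prev, pred_x0

# Build a noisy sample from a known clean value and a known noise value.
x0, eps, a_t, a_prev = 2.0, 0.5, 0.5, 0.9
x_t = math.sqrt(a_t) * x0 + math.sqrt(1.0 - a_t) * eps

# If the network predicted the noise perfectly (e_t == eps), the step
# recovers x0 and lands on the less noisy sample for a_prev.
x_prev, pred_x0 = ddim_step(x_t, eps, a_t, a_prev)
print(pred_x0, x_prev)
```

With `sigma_t = 0` the update is deterministic, which is the setting usually associated with DDIM sampling.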

From the excerpts above, we can see that self.control_model is instantiated from the config. apply_model runs the control model, scales its outputs with control_scales, and passes them to the diffusion model. sample_log then generates samples by running the DDIM (Denoising Diffusion Implicit Models) sampler, which calls apply_model at each step.
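The scaling step in `apply_model` is a plain element-wise pairing of control outputs with `control_scales`. The sketch below reproduces that list comprehension with numbers standing in for tensors; the helper name `scale_controls` and the example scale values are hypothetical.

```python
def scale_controls(control, control_scales):
    """Multiply each control output by its own scale, as in apply_model."""
    return [c * scale for c, scale in zip(control, control_scales)]

control = [2.0] * 13                                  # stand-ins for the 13 control tensors

unchanged = scale_controls(control, [1.0] * 13)       # the default set in __init__
halved = scale_controls(control, [0.5] * 13)          # weaker conditioning everywhere
no_mid = scale_controls(control, [1.0] * 12 + [0.0])  # silence only the middle-block entry
print(unchanged == control, halved[0], no_mid[-1])
```

Setting a scale to 0 removes that control entry's contribution, so the corresponding block behaves like plain Stable Diffusion.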

Conclusion

ControlNet gives us much more freedom in controlling and making use of Diffusion Models.

“Additionally, we report that the training of ControlNet is robust and scalable on datasets of different sizes, and that for some tasks like depth-to-image conditioning, training ControlNets on a single NVIDIA RTX 3090Ti GPU can achieve results competitive with industrial models trained on large computation clusters.”

- page 2 of ControlNet

Efficiency

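The authors also emphasize the efficiency of this design. As they note, it is a very efficient way to customize a Diffusion model. Because an external conditioning network is simply added on top of the pretrained Stable Diffusion layers, only this external network needs to be trained, rather than retraining the whole architecture on a large compute cluster. In a broad sense, this amounts to fine-tuning Stable Diffusion to make it controllable.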

Written By: Terry (Taehan) Kim,

Reviewed by Metown Corp. (Sangbin Jeon, PhD Researcher)

References and Further Readings: