https://gigazine.net/news/20220914-stable-diffusion-illustrated/

2022年8月に無料で一般公開された画像生成AI「Stable Diffusion」は、NVIDIA製GPUを搭載したCPUあるいはGoogle Colaboratoryのようなオンライン実行環境を整えれば、任意の文字列や誰でも画像を生成することができます。そんなStable Diffusionがどのようにして画像を生成しているのかについて、AIについてTwitterで解説を行うAI Pubが説明しています。

// Stable Diffusion, Explained //

You've seen the Stable Diffusion AI art all over Twitter.

But how does Stable Diffusion _work_?

A thread explaining diffusion models, latent space representations, and context injection:

1/15 pic.twitter.com/VX9UVmUaKJ
— AI Pub (@ai__pub)

Stable Diffusionがどのように画像を生成しているのかについては、以下の画像をクリックすると見ることができるGIFアニメーションを見ると、おおまかに理解できます。

そもそもStable Diffusionの「Diffusion」とは「拡散」という意味です。この拡散とは、画像にランダムで小さなノイズを繰り返し追加するプロセスのことで、以下の画像だと左から右に進める処理です。Stable Diffusionでは、さらにこの逆となる右から左、つまりノイズを画像に変換する処理も行います。

First, a one-tweet summary of diffusion models (DMs).

Diffusion is the process of adding small, random noise to an image, repeatedly. (Left-to-right)

Diffusion models reverse this process, turning noise into images, bit-by-bit. (Right-to-left)

Photo credit: @AssemblyAI

2/15 pic.twitter.com/Jk34utlZxE
— AI Pub (@ai__pub)

そして、このノイズを画像に変換する工程を、学習済みのニューラルネットワークが担当します。ニュートラルネットワークが学習するのは関数f(x,t)で、これはxのノイズをわずかに除去し、t-1回目でどのように見えるかを生成するものです。

How do DMs turn noise into images?

By training a neural network to do so gradually.

With the sequence of noised images = x_1, x_2, ... x_T,

The neural net learns a function f(x,t) that denoises x "a little bit", producing what x would look like at time step t-1.

3/15 pic.twitter.com/QZJX89X6Rk
— AI Pub (@ai__pub)

純粋なノイズをキレイな画像に変化させるには、この関数を何度も適用すればいいというわけです。Stable Diffusionの処理はいわばf(f(f(f(....f(N, T), T-1), T-2)..., 2, 1)というように、関数の入れ子状態になっています。このNは純粋なノイズ、Tはステップ数になります。

To turn pure noise into an HD image, just apply f several times!

The output of a diffusion model really is just
f(f(f(f(....f(N, T), T-1), T-2)..., 2, 1)
where N is pure noise, and T is the number of diffusion steps.

The neural net f is typically implemented as a U-net.

4/15 pic.twitter.com/lxYvucaCGt
— AI Pub (@ai__pub)

もちろん一連の作業を512×512ピクセルで行うのは計算負荷が非常に高く、コストもかかります。

The key idea behind Stable Diffusion:

Training and computing a diffusion model on large 512 x 512 images is _incredibly_ slow and expensive.

Instead, let's do the computation on _embeddings_ of images, rather than on images themselves.

5/15 pic.twitter.com/h8qRyxeqY1
— AI Pub (@ai__pub)

そのため、この計算負荷を下げるため、実際のピクセル空間を使うのではなく、より低い次元である潜在空間を利用します。具体的には、エンコーダーを使って画像Xを潜在空間表現であるz(x)に圧縮し、xではなくz(x)で拡散(Diffusion Process)とノイズ除去(Denoising U-Net)を実行する流れ。以下の図のεがエンコーダーで、Dがデコーダーです。

So, Stable Diffusion works in two steps.

Step 1: Use an encoder to compress an image "x" into a lower-dimensional, latent-space representation "z(x)"

Step 2: run diffusion and denoising on z(x), rather than x.

Diagram below!

6/15 pic.twitter.com/SPHCyZDOSD
— AI Pub (@ai__pub)

Stable Diffusionは、さらに関数の変数として文字列(プロンプト)を入力できます。プロンプトを入力することで、ノイズ除去の方向がある程度定まることになります。

But where does the text prompt come in?

I lied! SD does NOT learn a function f(x,t) to denoise x a "little bit" back in time.

It actually learns a function f(x, t, y), with y the "context" to guide the denoising of x.

Below, y is the image label "arctic fox".

8/15 pic.twitter.com/z4WVWJ8NVu
— AI Pub (@ai__pub)

Stable Diffusionはプロンプトという「コンテクスト」を、ノイズ除去プロセスで行われる単純結合(Simple concatenation)とデコード直前のCross-attentionで介入させます。

But how does SD process context?

The "context" y, alongside the time step t, can be injected into the latent space representation z(x) either by:

1) Simple concatenation
2) Cross-attention

Stable diffusion uses both.

10/15 pic.twitter.com/WaDmy0RyIB
— AI Pub (@ai__pub)

また、コンテクストとして文字列のほかに画像を扱うことができるというのもStable Diffusionの大きな特徴です。Stable Diffusionは、画像データから画像修復と画像合成の同時を行います。

The cool part not talked about on Twitter: the context mechanism is incredibly flexible.

Instead of y = an image label,

Let y = a masked image, or y = a scene segmentation.

SD trained on this different data, can now do image inpainting and semantic image synthesis!

11/15 pic.twitter.com/wzCF8OOV0p
— AI Pub (@ai__pub)