Diffusion Models

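📌 Core Definition (What)

In one sentence: a diffusion model is a generative model that produces data by learning to reverse a noise-adding process. During training, noise is added to images step by step until only noise remains; at generation time, the model starts from pure noise and denoises step by step until an image emerges.

It consists of two processes:

Forward diffusion: Gaussian noise is added step by step until the data becomes pure noise.
Reverse diffusion: the model learns to undo this process and recover data from noise.

Representative models: DDPM (Denoising Diffusion Probabilistic Models), Stable Diffusion, DALL-E 2/3, Midjourney.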

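🏠 Everyday Analogy (Analogy)

🎨 "Restoring an old painting"

Imagine a famous painting:

Damage (forward diffusion): ink is splashed on the painting, layer after layer, until the original can no longer be recognized (pure noise).
Restoration (reverse diffusion): a restorer learns to remove the ink layer by layer until the painting reappears.

The key point: the restorer does not know what the original painting looked like! The restorer has only learned "the pattern by which the ink was splashed on", and then works backwards.

A diffusion model is this restorer:

During training: it learns "how the noise was added"
During generation: it starts from pure noise and "denoises" its way to an image, one step at a time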

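🌫️ Diffusion Process Visualization (Interactive)

Interactive demo (not reproduced here): it steps from the original image at t = 0 to pure noise at t = 20 and reports the signal retention √ᾱ_t (100.0% at t = 0) and the noise strength √(1-ᾱ_t) (0.0% at t = 0).

📐 Diffusion formula:

x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε

💡 Forward diffusion turns an image into noise; reverse diffusion recovers an image from noise.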

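🎯 Why It Matters (Why)

Comparison with other generative models

Model	Strengths	Weaknesses
GAN	Fast generation, high quality	Unstable training, mode collapse
VAE	Stable training, structured latent space	Blurry samples
Diffusion	Very high sample quality, stable training	Slow generation (many denoising steps)

Why diffusion models came out ahead:

Quality: on image-synthesis benchmarks, sample quality has matched or surpassed that of GANs.
Stability: training minimizes a simple regression loss with no adversarial game, so it largely avoids the mode collapse seen in GANs.
Flexibility: conditioning is easy to add (for example text-to-image).
Controllability: guidance steers the direction of generation.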

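📊 Mathematical Principles (Math)

1. Forward diffusion process

Starting from data $x_0$, noise is added step by step:

Single noising step

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t \mathbf{I})

$\beta_t$ : the noise schedule, which controls how much noise each step adds
$t = 1, 2, ..., T$ (for example T=1000)
For large enough $T$, the distribution of $x_T$ is approximately $\mathcal{N}(0, \mathbf{I})$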
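
The claim that the chain ends in roughly standard Gaussian noise can be checked numerically. The sketch below is a toy illustration, not part of the original page: it uses scalar "data" instead of images, assumes a linear schedule from 1e-4 to 0.02 over 1000 steps, and the helper names `linear_betas` and `noise_chain` are made up for this example.

```python
import math
import random

def linear_betas(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * i for i in range(num_steps)]

def noise_chain(x0, betas, rng):
    """Apply q(x_t | x_{t-1}) once per beta and return x_T."""
    x = x0
    for beta in betas:
        # x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * z,  z ~ N(0, 1)
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
betas = linear_betas()

# Every chain starts from the same scalar "data point" x0 = 3.0
finals = [noise_chain(3.0, betas, rng) for _ in range(1000)]
mean = sum(finals) / len(finals)
var = sum((v - mean) ** 2 for v in finals) / len(finals)
print(f"mean of x_T = {mean:.3f}, variance of x_T = {var:.3f}")
```

Even though every chain starts at 3.0, the final samples should have a mean near 0 and a variance near 1, which is what lets generation start from plain Gaussian noise.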

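Reparameterization trick: because a sum of independent Gaussians is again Gaussian, the steps compose in closed form, so $x_t$ can be sampled directly from $x_0$:

Direct sampling

x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})

$\alpha_t = 1 - \beta_t$
$\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$ (cumulative product)
Training therefore never has to run the chain step by step; any timestep can be sampled directly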
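
To get a feel for the closed-form coefficients, the following sketch tabulates the signal weight $\sqrt{\bar{\alpha}_t}$ and the noise weight $\sqrt{1 - \bar{\alpha}_t}$ at a few timesteps. It assumes the same linear schedule as the `SimpleDiffusion` class in this document (1e-4 to 0.02 over 1000 steps); the helper name `linear_alpha_bars` is hypothetical.

```python
import math

def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_1..alpha_bar_T for a linear beta schedule."""
    step = (beta_end - beta_start) / (num_steps - 1)
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        running *= 1.0 - (beta_start + step * i)   # multiply in alpha_t = 1 - beta_t
        alpha_bars.append(running)
    return alpha_bars

alpha_bars = linear_alpha_bars()
for t in (1, 250, 500, 750, 1000):
    a_bar = alpha_bars[t - 1]                      # list index is t - 1
    print(f"t={t:4d}  signal={math.sqrt(a_bar):.4f}  noise={math.sqrt(1.0 - a_bar):.4f}")
```

The signal weight starts just below 1 and decays toward 0 while the noise weight rises toward 1; the squares of the two weights always sum to 1.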

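2. Reverse diffusion process

Learn to undo the noising:

Denoising distribution

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 \mathbf{I})

$\mu_\theta$ : the mean predicted by the neural network
$\sigma_t^2$ : usually fixed to $\beta_t$ or $\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$
In practice the network predicts the noise $\epsilon_\theta(x_t, t)$, and the mean is computed from it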
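
For reference, the link between the predicted noise and the predicted mean can be written out explicitly. This is the standard DDPM parameterization (Ho et al., 2020), stated here with the symbols $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t$ from the forward process; it matches the update used in the sampling loop of section 4.

```latex
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \, \epsilon_\theta(x_t, t) \right)
```

So a good noise estimate directly yields a good estimate of where $x_{t-1}$ should be centered.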

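3. Training objective

The simplified loss function:

Denoising loss

\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]

$\epsilon$ : the noise that was actually added
$\epsilon_\theta$ : the noise predicted by the network
Goal: given $x_t$ and $t$, the network learns to predict the noise $\epsilon$ that was mixed into $x_0$ to produce $x_t$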
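
Why is predicting the noise enough? A small scalar sketch may help: if the noise is predicted perfectly, the loss is zero and the clean value $x_0$ can be recovered exactly by inverting the direct-sampling formula. The values `alpha_bar_t = 0.5` and `x0 = 0.8` are arbitrary choices for illustration, and `implied_x0` is a hypothetical helper name.

```python
import math
import random

rng = random.Random(0)
alpha_bar_t = 0.5            # assumed cumulative product at some step t
x0 = 0.8                     # one "pixel" of clean data
eps = rng.gauss(0.0, 1.0)    # the noise that is actually mixed in

# Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
x_t = math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

def loss(eps_pred):
    """Squared error between true and predicted noise (one scalar sample)."""
    return (eps - eps_pred) ** 2

def implied_x0(eps_pred):
    """Clean value implied by a noise prediction (inverts the forward formula)."""
    return (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)

print("perfect prediction:", loss(eps), implied_x0(eps))
print("naive guess of 0:  ", loss(0.0), implied_x0(0.0))
```

A wrong noise guess gives both a positive loss and a wrong implied $x_0$, so minimizing the noise error is equivalent to learning to denoise.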

4. Sampling process

Start from $x_T \sim \mathcal{N}(0, \mathbf{I})$ and denoise step by step:

for t = T, T-1, ..., 1:
    z ~ N(0, I) if t > 1 else z = 0
    x_{t-1} = (1/√α_t) * (x_t - (β_t/√(1-ᾱ_t)) * ε_θ(x_t, t)) + σ_t * z
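
The loop above can be tried end to end without training anything, under one strong assumption: if the whole "dataset" is a single scalar value, the ideal noise predictor is known in closed form. The sketch below plugs that oracle (the hypothetical `oracle_eps`) in place of $\epsilon_\theta$ and uses $\sigma_t^2 = \beta_t$ with an assumed linear schedule; it is a toy check of the update rule, not a realistic sampler.

```python
import math
import random

def linear_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Return (betas, alpha_bars) for an assumed linear schedule."""
    step = (beta_end - beta_start) / (num_steps - 1)
    betas = [beta_start + step * i for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars

def oracle_eps(x_t, alpha_bar, target):
    """Exact noise when every data point equals `target` (stand-in for eps_theta)."""
    return (x_t - math.sqrt(alpha_bar) * target) / math.sqrt(1.0 - alpha_bar)

def sample(target, rng, num_steps=1000):
    betas, alpha_bars = linear_schedule(num_steps)
    x = rng.gauss(0.0, 1.0)                       # x_T ~ N(0, 1)
    for i in reversed(range(num_steps)):          # i = t - 1
        beta, alpha_bar = betas[i], alpha_bars[i]
        eps = oracle_eps(x, alpha_bar, target)
        z = rng.gauss(0.0, 1.0) if i > 0 else 0.0 # no noise on the last step
        # x_{t-1} = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t) + sigma_t * z
        x = (x - beta / math.sqrt(1.0 - alpha_bar) * eps) / math.sqrt(1.0 - beta)
        x += math.sqrt(beta) * z                  # sigma_t^2 = beta_t
    return x

x0 = sample(target=0.5, rng=random.Random(0))
print(x0)
```

With a perfect noise predictor, the sampler lands on the data value (0.5 here) up to floating-point error; a trained network only approximates this oracle.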

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDiffusion:
    def __init__(self, num_steps=1000, beta_start=1e-4, beta_end=0.02):
        self.num_steps = num_steps

        # Linear noise schedule
        self.betas = torch.linspace(beta_start, beta_end, num_steps)
        self.alphas = 1 - self.betas
        self.alpha_bars = torch.cumprod(self.alphas, dim=0)

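    def forward_diffusion(self, x0, t):
        """Forward diffusion: x0 -> xt in a single step."""
        # Look up alpha_bar for each sample's timestep. The schedule lives on
        # the CPU, so index it there and move the result to x0's device.
        alpha_bar = self.alpha_bars[t.cpu()].to(x0.device).view(-1, 1, 1, 1)

        # Sample the noise
        noise = torch.randn_like(x0)

        # Reparameterization
        xt = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1 - alpha_bar) * noise

        return xt, noise

    def training_step(self, model, x0):
        """One training step: returns the noise-prediction loss."""
        batch_size = x0.shape[0]

        # Sample a random timestep per example, on the same device as the data
        t = torch.randint(0, self.num_steps, (batch_size,), device=x0.device)

        # Forward diffusion
        xt, noise = self.forward_diffusion(x0, t)

        # Predict the noise
        noise_pred = model(xt, t)

        # MSE loss
        loss = F.mse_loss(noise_pred, noise)

        return loss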

    @torch.no_grad()
    def sample(self, model, shape):
        """Generate images starting from pure noise."""
        device = next(model.parameters()).device

        # Start from pure noise
        x = torch.randn(shape, device=device)

        # Denoise step by step (index t here corresponds to step t + 1 in the math)
        for t in reversed(range(self.num_steps)):
            t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)

            # Predict the noise
            noise_pred = model(x, t_batch)

            # Look up the schedule values for this step
            alpha = self.alphas[t]
            alpha_bar = self.alpha_bars[t]
            beta = self.betas[t]

            # Add fresh noise on every step except the last one
            if t > 0:
                noise = torch.randn_like(x)
            else:
                noise = torch.zeros_like(x)

            # Denoising update, using sigma_t^2 = beta_t
            x = (1 / torch.sqrt(alpha)) * (
                x - (beta / torch.sqrt(1 - alpha_bar)) * noise_pred
            ) + torch.sqrt(beta) * noise

        return x

# A tiny convolutional noise predictor (a stand-in for the much larger U-Net used in practice)
class SimpleNoisePredictor(nn.Module):
    def __init__(self, channels=3, dim=64):
        super().__init__()
        # One learned embedding per timestep; 1000 must match num_steps in SimpleDiffusion
        self.time_embed = nn.Embedding(1000, dim)
        self.net = nn.Sequential(
            nn.Conv2d(channels, dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(dim, channels, 3, padding=1),
        )

    def forward(self, x, t):
        # Simplified conditioning: the first `channels` embedding dims are added to
        # the output as a per-channel bias (real models inject time into every block)
        t_emb = self.time_embed(t)[:, :x.shape[1], None, None]
        return self.net(x) + t_emb

from diffusers import StableDiffusionPipeline
import torch

# Load a pretrained model
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Text-to-image generation
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut.png")

# Use a negative prompt to steer away from unwanted traits
image = pipe(
    prompt="a beautiful sunset over mountains, 4k, detailed",
    negative_prompt="blurry, low quality, distorted",
    num_inference_steps=50,
    guidance_scale=7.5,
).images[0]

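⚠️ Common Pitfalls (Pitfalls)

Text-guided generation uses classifier-free guidance (CFG), not a separate classifier:

# Predict both the conditional and the unconditional noise, then extrapolate
noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_cond - noise_pred_uncond)

The larger guidance_scale is, the more closely the image follows the prompt, but the less natural it tends to look. Values around 7 to 9 are recommended as a starting point.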
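
The effect of guidance_scale is easiest to see on toy numbers. The sketch below applies the CFG formula to two made-up three-element "noise predictions" (plain Python lists standing in for tensors); the function name `classifier_free_guidance` and the values are illustrative assumptions.

```python
def classifier_free_guidance(noise_uncond, noise_cond, guidance_scale):
    """Combine unconditional and conditional noise predictions element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]

noise_uncond = [0.10, -0.20, 0.30]   # made-up prediction without the prompt
noise_cond = [0.30, -0.10, 0.20]     # made-up prediction with the prompt

for scale in (0.0, 1.0, 7.5):
    guided = classifier_free_guidance(noise_uncond, noise_cond, scale)
    print(scale, guided)
```

A scale of 0 ignores the prompt, a scale of 1 reproduces the conditional prediction, and larger values extrapolate past it along the direction from unconditional to conditional, which is why very large scales can look unnatural.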

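Pixel space (DDPM): operates directly on image pixels, so compute cost limits the resolution
Latent space (Stable Diffusion): compresses the image with a VAE first, which is more efficient

Image (512x512x3) --VAE encode--> latent (64x64x4) --diffusion--> denoised latent --VAE decode--> image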
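
A quick back-of-the-envelope calculation shows how much smaller the latent is, using only the two shapes quoted in the pipeline above; the helper name `numel` is just for this example.

```python
import math

pixel_shape = (512, 512, 3)    # image shape from the pipeline above
latent_shape = (64, 64, 4)     # latent shape from the pipeline above

def numel(shape):
    """Number of values in a tensor of the given shape."""
    return math.prod(shape)

ratio = numel(pixel_shape) // numel(latent_shape)
print(numel(pixel_shape), numel(latent_shape), ratio)
```

The denoising network therefore processes 48 times fewer values per step in latent space, which is where most of the efficiency gain comes from.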

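🔗 Related Concepts

Neural networks - the noise-prediction network
Loss functions - the MSE denoising loss
Probability theory - Gaussian distribution basics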

📚 Further Resources

DDPM original paper (Ho et al., 2020)
Stable Diffusion explained (illustrated guide by Jay Alammar)
Diffusers library (official Hugging Face implementation)