目录

第四章:架构与模型组件(TorchCode)

第四章:架构与模型组件

本章介绍如何将基础组件组装为完整的模型模块:GPT-2 Block、SwiGLU MLP、Conv2d、ViT Patch Embedding、LoRA、Mixture of Experts。


GPT-2 Block 是 GPT 系列模型的基本构建单元,采用 Pre-Norm 架构,包含因果自注意力和前馈网络,通过残差连接组合。

输入 x
  ├──────────────────────────┐
  │                          │ (残差连接)
  ▼                          │
LayerNorm(x)                 │
  │                          │
  ▼                          │
Causal Self-Attention        │
  │                          │
  ▼                          │
  + ◄────────────────────────┘
  ├──────────────────────────┐
  │                          │ (残差连接)
  ▼                          │
LayerNorm(x)                 │
  │                          │
  ▼                          │
MLP (Linear→GELU→Linear)    │
  │                          │
  ▼                          │
  + ◄────────────────────────┘
输出

Pre-Norm vs Post-Norm

  • Pre-Norm:先归一化再做注意力/MLP(GPT-2、LLaMA 使用)
  • Post-Norm:先做注意力/MLP 再归一化(原始 Transformer 论文)
  • Pre-Norm 训练更稳定,不需要 warmup
import torch
import torch.nn as nn
import math

class GPT2Block(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

        # 注意力投影
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.num_heads = num_heads
        self.dk = d_model // num_heads

        # MLP: d → 4d → d
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def causal_attn(self, x):
        B, S, _ = x.shape
        q = self.W_q(x).view(B, S, self.num_heads, self.dk).transpose(1, 2)
        k = self.W_k(x).view(B, S, self.num_heads, self.dk).transpose(1, 2)
        v = self.W_v(x).view(B, S, self.num_heads, self.dk).transpose(1, 2)

        scores = (q @ k.transpose(-1, -2)) / math.sqrt(self.dk)
        mask = torch.triu(torch.full((S, S), float('-inf'), device=x.device), diagonal=1)
        scores = scores + mask
        attn = torch.softmax(scores, dim=-1) @ v
        return self.W_o(attn.transpose(1, 2).contiguous().view(B, S, -1))

    def forward(self, x):
        x = x + self.causal_attn(self.ln1(x))  # 残差 + 注意力
        x = x + self.mlp(self.ln2(x))           # 残差 + MLP
        return x

# 测试
block = GPT2Block(d_model=64, num_heads=4)
x = torch.randn(2, 8, 64)
print(block(x).shape)  # (2, 8, 64)
print("参数量:", sum(p.numel() for p in block.parameters()))
  • MLP 隐藏层维度 = 4 × d_model(经验值)
  • 使用 GELU 激活(而非 ReLU)
  • 残差连接确保梯度流通

SwiGLU 是现代 LLM(LLaMA、Mistral、PaLM)使用的前馈网络,用门控机制替代了 GPT-2 的简单 FFN。

SwiGLU(x)=down_proj(SiLU(gate_proj(x))up_proj(x))\text{SwiGLU}(x) = \text{down\_proj}\big(\text{SiLU}(\text{gate\_proj}(x)) \odot \text{up\_proj}(x)\big)

其中 SiLU(Swish)= xσ(x)x \cdot \sigma(x)\odot 是逐元素乘法。

标准 FFN (GPT-2)SwiGLU MLP (LLaMA)
Linear(d, 4d) → GELU → Linear(4d, d)gate_proj(d, d_ff) + up_proj(d, d_ff) → SiLU·gate → down_proj(d_ff, d)
2 个线性层3 个线性层
无门控门控机制
  • gate_proj 决定"哪些信息通过"(经过 SiLU 激活后作为门)
  • up_proj 提供"要传递的内容"
  • 两者逐元素相乘,实现选择性信息传递
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff)
        self.up_proj = nn.Linear(d_model, d_ff)
        self.down_proj = nn.Linear(d_ff, d_model)

    def forward(self, x):
        gate = F.silu(self.gate_proj(x))  # SiLU = x * sigmoid(x)
        up = self.up_proj(x)
        return self.down_proj(gate * up)

# 测试
mlp = SwiGLUMLP(d_model=64, d_ff=128)
x = torch.randn(2, 8, 64)
print(mlp(x).shape)  # (2, 8, 64)
# 参数量:3 个线性层 = 64*128 + 64*128 + 128*64 + biases
print("参数量:", sum(p.numel() for p in mlp.parameters()))

卷积操作用滑动窗口(卷积核)在输入特征图上提取局部特征,是 CNN 的核心操作。

对于输入 (B, C_in, H, W) 和卷积核 (C_out, C_in, kH, kW)

out[b,cout,i,j]=cinm,nx[b,cin,is+m,js+n]w[cout,cin,m,n]+bias[cout]\text{out}[b, c_{out}, i, j] = \sum_{c_{in}} \sum_{m,n} x[b, c_{in}, i \cdot s + m, j \cdot s + n] \cdot w[c_{out}, c_{in}, m, n] + \text{bias}[c_{out}]Hout=H+2paddingkHstride+1H_{out} = \lfloor\frac{H + 2 \cdot \text{padding} - kH}{\text{stride}}\rfloor + 1
import torch
import torch.nn.functional as F

def my_conv2d(x, weight, bias=None, stride=1, padding=0):
    if padding > 0:
        x = F.pad(x, [padding] * 4)

    B, C_in, H, W = x.shape
    C_out, _, kH, kW = weight.shape
    H_out = (H - kH) // stride + 1
    W_out = (W - kW) // stride + 1

    output = torch.zeros(B, C_out, H_out, W_out, device=x.device)
    for i in range(H_out):
        for j in range(W_out):
            h_start = i * stride
            w_start = j * stride
            patch = x[:, :, h_start:h_start+kH, w_start:w_start+kW]
            # patch: (B, C_in, kH, kW), weight: (C_out, C_in, kH, kW)
            output[:, :, i, j] = (patch.unsqueeze(1) * weight.unsqueeze(0)).sum(dim=(2, 3, 4))

    if bias is not None:
        output += bias.view(1, -1, 1, 1)
    return output

# 测试
x = torch.randn(1, 3, 8, 8)
w = torch.randn(16, 3, 3, 3)
out = my_conv2d(x, w, stride=1, padding=1)
print(out.shape)  # (1, 16, 8, 8)
ref = F.conv2d(x, w, padding=1)
print("匹配:", torch.allclose(out, ref, atol=1e-4))

Vision Transformer(ViT)的第一步:将图像分割为不重叠的 patch,每个 patch 展平后通过线性层投影为嵌入向量。

  1. (B, C, H, W) 图像分割为 (H/P × W/P) 个 patch,每个 patch 大小 (C, P, P)
  2. 展平每个 patch 为 (C × P × P) 维向量
  3. 线性投影到 embed_dim
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size, patch_size, in_channels, embed_dim):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        patch_dim = in_channels * patch_size * patch_size
        self.proj = nn.Linear(patch_dim, embed_dim)

    def forward(self, x):
        B, C, H, W = x.shape
        P = self.patch_size
        # 重排为 patches: (B, num_patches, C*P*P)
        x = x.unfold(2, P, P).unfold(3, P, P)  # (B, C, H/P, W/P, P, P)
        x = x.contiguous().view(B, C, -1, P, P)  # (B, C, N, P, P)
        x = x.permute(0, 2, 1, 3, 4).contiguous().view(B, -1, C * P * P)
        return self.proj(x)

# 测试:32×32 图像,8×8 patch,3 通道,64 维嵌入
pe = PatchEmbedding(32, 8, 3, 64)
x = torch.randn(2, 3, 32, 32)
print(pe(x).shape)       # (2, 16, 64)  — 16 个 patch
print(pe.num_patches)    # 16
  • ViT(Vision Transformer)
  • CLIP 的视觉编码器
  • 任何将图像输入 Transformer 的场景

LoRA 是一种参数高效微调方法。冻结预训练权重,只训练两个低秩矩阵 A 和 B,使得微调参数量大幅减少。

h=W0x+αrBAxh = W_0 x + \frac{\alpha}{r} B A x
  • W0W_0:冻结的预训练权重 (out, in)
  • AA:低秩矩阵 (rank, in),随机初始化
  • BB:低秩矩阵 (out, rank),零初始化
  • α/r\alpha / r:缩放因子

训练开始时 BA=0BA = 0,LoRA 不改变原始模型的输出。随着训练进行,B 逐渐学习到有意义的值。

原始 Linear(1024, 1024) 有 ~1M 参数。LoRA rank=8 只需 1024×8 + 8×1024 = 16K 参数(1.6%)。

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank, alpha=1.0):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.linear.weight.requires_grad_(False)
        if self.linear.bias is not None:
            self.linear.bias.requires_grad_(False)

        self.lora_A = nn.Parameter(torch.randn(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        base = self.linear(x)
        lora = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return base + lora

# 测试
layer = LoRALinear(16, 8, rank=4)
x = torch.randn(2, 16)
print(layer(x).shape)  # (2, 8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"可训练参数: {trainable}, 总参数: {total}, 比例: {trainable/total:.1%}")

MoE 层包含多个"专家"(独立的 MLP),通过路由网络为每个 token 选择 top-k 个专家,只激活部分专家进行计算。

输入 token
Router (Linear → Softmax)
    ├── Expert 0 (MLP)  ← 权重 0.7
    ├── Expert 1 (MLP)  ← 权重 0.3
    ├── Expert 2 (MLP)  ← 未选中
    └── Expert 3 (MLP)  ← 未选中
加权求和输出
  • 增加模型容量而不成比例增加计算量
  • Mixtral 8x7B:8 个专家,每次选 2 个,总参数 47B 但每个 token 只用 ~13B
  • 实现"条件计算":不同 token 走不同路径
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, d_model, d_ff, num_experts, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        B, S, D = x.shape
        x_flat = x.view(-1, D)  # (B*S, D)

        # 路由:选择 top-k 专家
        router_logits = self.router(x_flat)  # (B*S, num_experts)
        topk_vals, topk_idx = router_logits.topk(self.top_k, dim=-1)
        topk_weights = torch.softmax(topk_vals, dim=-1)  # 归一化权重

        # 计算每个选中专家的输出并加权求和
        output = torch.zeros_like(x_flat)
        for k in range(self.top_k):
            expert_idx = topk_idx[:, k]  # (B*S,)
            weight = topk_weights[:, k].unsqueeze(-1)  # (B*S, 1)
            for e_idx in range(len(self.experts)):
                mask = (expert_idx == e_idx)
                if mask.any():
                    output[mask] += weight[mask] * self.experts[e_idx](x_flat[mask])

        return output.view(B, S, D)

# 测试
moe = MixtureOfExperts(32, 64, num_experts=4, top_k=2)
x = torch.randn(2, 8, 32)
print(moe(x).shape)  # (2, 8, 32)
  • Mixtral 8x7B、Switch Transformer、GShard
  • 需要大模型容量但计算预算有限的场景

相关内容