第四章：架构与模型组件（TorchCode）

Steven 收录于类别 PyTorch 和系列 TorchCode 系列

2026-04-01 2026-04-03 约 2909 字预计阅读 13 分钟

系列 - TorchCode 系列

第四章：架构与模型组件

本章介绍如何将基础组件组装为完整的模型模块：GPT-2 Block、SwiGLU MLP、Conv2d、ViT Patch Embedding、LoRA、Mixture of Experts。

4.1 GPT-2 Transformer Block

是什么

GPT-2 Block 是 GPT 系列模型的基本构建单元，采用 Pre-Norm 架构，包含因果自注意力和前馈网络，通过残差连接组合。

架构（Pre-Norm）

输入 x
  │
  ├──────────────────────────┐
  │                          │ (残差连接)
  ▼                          │
LayerNorm(x)                 │
  │                          │
  ▼                          │
Causal Self-Attention        │
  │                          │
  ▼                          │
  + ◄────────────────────────┘
  │
  ├──────────────────────────┐
  │                          │ (残差连接)
  ▼                          │
LayerNorm(x)                 │
  │                          │
  ▼                          │
MLP (Linear→GELU→Linear)    │
  │                          │
  ▼                          │
  + ◄────────────────────────┘
  │
  ▼
输出

Pre-Norm vs Post-Norm

Pre-Norm：先归一化再做注意力/MLP（GPT-2、LLaMA 使用）
Post-Norm：先做注意力/MLP 再归一化（原始 Transformer 论文）
Pre-Norm 训练更稳定，不需要 warmup

代码示例

import torch
import torch.nn as nn
import math

class GPT2Block(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

        # 注意力投影
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
        self.num_heads = num_heads
        self.dk = d_model // num_heads

        # MLP: d → 4d → d
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def causal_attn(self, x):
        B, S, _ = x.shape
        q = self.W_q(x).view(B, S, self.num_heads, self.dk).transpose(1, 2)
        k = self.W_k(x).view(B, S, self.num_heads, self.dk).transpose(1, 2)
        v = self.W_v(x).view(B, S, self.num_heads, self.dk).transpose(1, 2)

        scores = (q @ k.transpose(-1, -2)) / math.sqrt(self.dk)
        mask = torch.triu(torch.full((S, S), float('-inf'), device=x.device), diagonal=1)
        scores = scores + mask
        attn = torch.softmax(scores, dim=-1) @ v
        return self.W_o(attn.transpose(1, 2).contiguous().view(B, S, -1))

    def forward(self, x):
        x = x + self.causal_attn(self.ln1(x))  # 残差 + 注意力
        x = x + self.mlp(self.ln2(x))           # 残差 + MLP
        return x

# 测试
block = GPT2Block(d_model=64, num_heads=4)
x = torch.randn(2, 8, 64)
print(block(x).shape)  # (2, 8, 64)
print("参数量:", sum(p.numel() for p in block.parameters()))

关键设计决策

MLP 隐藏层维度 = 4 × d_model（经验值）
使用 GELU 激活（而非 ReLU）
残差连接确保梯度流通

4.2 SwiGLU MLP

是什么

SwiGLU 是现代 LLM（LLaMA、Mistral、PaLM）使用的前馈网络，用门控机制替代了 GPT-2 的简单 FFN。

数学定义

\text{SwiGLU}(x) = \text{down\_proj}\big(\text{SiLU}(\text{gate\_proj}(x)) \odot \text{up\_proj}(x)\big)

其中 SiLU（Swish）= $x \cdot \sigma(x)$ ， $\odot$ 是逐元素乘法。

与标准 FFN 的区别

标准 FFN (GPT-2)	SwiGLU MLP (LLaMA)
Linear(d, 4d) → GELU → Linear(4d, d)	gate_proj(d, d_ff) + up_proj(d, d_ff) → SiLU·gate → down_proj(d_ff, d)
2 个线性层	3 个线性层
无门控	门控机制

门控机制的直觉

gate_proj 决定"哪些信息通过"（经过 SiLU 激活后作为门）
up_proj 提供"要传递的内容"
两者逐元素相乘，实现选择性信息传递

代码示例

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_ff)
        self.up_proj = nn.Linear(d_model, d_ff)
        self.down_proj = nn.Linear(d_ff, d_model)

    def forward(self, x):
        gate = F.silu(self.gate_proj(x))  # SiLU = x * sigmoid(x)
        up = self.up_proj(x)
        return self.down_proj(gate * up)

# 测试
mlp = SwiGLUMLP(d_model=64, d_ff=128)
x = torch.randn(2, 8, 64)
print(mlp(x).shape)  # (2, 8, 64)
# 参数量：3 个线性层 = 64*128 + 64*128 + 128*64 + biases
print("参数量:", sum(p.numel() for p in mlp.parameters()))

4.3 2D Convolution（二维卷积）

是什么

卷积操作用滑动窗口（卷积核）在输入特征图上提取局部特征，是 CNN 的核心操作。

数学定义

对于输入 (B, C_in, H, W) 和卷积核 (C_out, C_in, kH, kW)：

\text{out}[b, c_{out}, i, j] = \sum_{c_{in}} \sum_{m,n} x[b, c_{in}, i \cdot s + m, j \cdot s + n] \cdot w[c_{out}, c_{in}, m, n] + \text{bias}[c_{out}]

输出尺寸计算

H_{out} = \lfloor\frac{H + 2 \cdot \text{padding} - kH}{\text{stride}}\rfloor + 1

代码示例

import torch
import torch.nn.functional as F

def my_conv2d(x, weight, bias=None, stride=1, padding=0):
    if padding > 0:
        x = F.pad(x, [padding] * 4)

    B, C_in, H, W = x.shape
    C_out, _, kH, kW = weight.shape
    H_out = (H - kH) // stride + 1
    W_out = (W - kW) // stride + 1

    output = torch.zeros(B, C_out, H_out, W_out, device=x.device)
    for i in range(H_out):
        for j in range(W_out):
            h_start = i * stride
            w_start = j * stride
            patch = x[:, :, h_start:h_start+kH, w_start:w_start+kW]
            # patch: (B, C_in, kH, kW), weight: (C_out, C_in, kH, kW)
            output[:, :, i, j] = (patch.unsqueeze(1) * weight.unsqueeze(0)).sum(dim=(2, 3, 4))

    if bias is not None:
        output += bias.view(1, -1, 1, 1)
    return output

# 测试
x = torch.randn(1, 3, 8, 8)
w = torch.randn(16, 3, 3, 3)
out = my_conv2d(x, w, stride=1, padding=1)
print(out.shape)  # (1, 16, 8, 8)
ref = F.conv2d(x, w, padding=1)
print("匹配:", torch.allclose(out, ref, atol=1e-4))

4.4 ViT Patch Embedding

是什么

Vision Transformer（ViT）的第一步：将图像分割为不重叠的 patch，每个 patch 展平后通过线性层投影为嵌入向量。

算法步骤

将 (B, C, H, W) 图像分割为 (H/P × W/P) 个 patch，每个 patch 大小 (C, P, P)
展平每个 patch 为 (C × P × P) 维向量
线性投影到 embed_dim 维

代码示例

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size, patch_size, in_channels, embed_dim):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        patch_dim = in_channels * patch_size * patch_size
        self.proj = nn.Linear(patch_dim, embed_dim)

    def forward(self, x):
        B, C, H, W = x.shape
        P = self.patch_size
        # 重排为 patches: (B, num_patches, C*P*P)
        x = x.unfold(2, P, P).unfold(3, P, P)  # (B, C, H/P, W/P, P, P)
        x = x.contiguous().view(B, C, -1, P, P)  # (B, C, N, P, P)
        x = x.permute(0, 2, 1, 3, 4).contiguous().view(B, -1, C * P * P)
        return self.proj(x)

# 测试：32×32 图像，8×8 patch，3 通道，64 维嵌入
pe = PatchEmbedding(32, 8, 3, 64)
x = torch.randn(2, 3, 32, 32)
print(pe(x).shape)       # (2, 16, 64)  — 16 个 patch
print(pe.num_patches)    # 16

适用场景

ViT（Vision Transformer）
CLIP 的视觉编码器
任何将图像输入 Transformer 的场景

4.5 LoRA（Low-Rank Adaptation）

是什么

LoRA 是一种参数高效微调方法。冻结预训练权重，只训练两个低秩矩阵 A 和 B，使得微调参数量大幅减少。

数学定义

h = W_0 x + \frac{\alpha}{r} B A x

$W_0$ ：冻结的预训练权重 (out, in)
$A$ ：低秩矩阵 (rank, in)，随机初始化
$B$ ：低秩矩阵 (out, rank)，零初始化
$\alpha / r$ ：缩放因子

为什么 B 零初始化

训练开始时 $BA = 0$ ，LoRA 不改变原始模型的输出。随着训练进行，B 逐渐学习到有意义的值。

参数效率

原始 Linear(1024, 1024) 有 ~1M 参数。LoRA rank=8 只需 1024×8 + 8×1024 = 16K 参数（1.6%）。

代码示例

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank, alpha=1.0):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.linear.weight.requires_grad_(False)
        if self.linear.bias is not None:
            self.linear.bias.requires_grad_(False)

        self.lora_A = nn.Parameter(torch.randn(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        base = self.linear(x)
        lora = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return base + lora

# 测试
layer = LoRALinear(16, 8, rank=4)
x = torch.randn(2, 16)
print(layer(x).shape)  # (2, 8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"可训练参数: {trainable}, 总参数: {total}, 比例: {trainable/total:.1%}")

4.6 Mixture of Experts（MoE，混合专家）

是什么

MoE 层包含多个"专家"（独立的 MLP），通过路由网络为每个 token 选择 top-k 个专家，只激活部分专家进行计算。

架构

输入 token
    │
    ▼
Router (Linear → Softmax)
    │
    ├── Expert 0 (MLP)  ← 权重 0.7
    ├── Expert 1 (MLP)  ← 权重 0.3
    ├── Expert 2 (MLP)  ← 未选中
    └── Expert 3 (MLP)  ← 未选中
    │
    ▼
加权求和输出

为什么需要

增加模型容量而不成比例增加计算量
Mixtral 8x7B：8 个专家，每次选 2 个，总参数 47B 但每个 token 只用 ~13B
实现"条件计算"：不同 token 走不同路径

代码示例

import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    def __init__(self, d_model, d_ff, num_experts, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x):
        B, S, D = x.shape
        x_flat = x.view(-1, D)  # (B*S, D)

        # 路由：选择 top-k 专家
        router_logits = self.router(x_flat)  # (B*S, num_experts)
        topk_vals, topk_idx = router_logits.topk(self.top_k, dim=-1)
        topk_weights = torch.softmax(topk_vals, dim=-1)  # 归一化权重

        # 计算每个选中专家的输出并加权求和
        output = torch.zeros_like(x_flat)
        for k in range(self.top_k):
            expert_idx = topk_idx[:, k]  # (B*S,)
            weight = topk_weights[:, k].unsqueeze(-1)  # (B*S, 1)
            for e_idx in range(len(self.experts)):
                mask = (expert_idx == e_idx)
                if mask.any():
                    output[mask] += weight[mask] * self.experts[e_idx](x_flat[mask])

        return output.view(B, S, D)

# 测试
moe = MixtureOfExperts(32, 64, num_experts=4, top_k=2)
x = torch.randn(2, 8, 32)
print(moe(x).shape)  # (2, 8, 32)

适用场景

Mixtral 8x7B、Switch Transformer、GShard
需要大模型容量但计算预算有限的场景

目录

目录

第四章：架构与模型组件（TorchCode）

第四章：架构与模型组件

4.1 GPT-2 Transformer Block

是什么

架构（Pre-Norm）

Pre-Norm vs Post-Norm

代码示例

关键设计决策

4.2 SwiGLU MLP

是什么

数学定义

与标准 FFN 的区别

门控机制的直觉

代码示例

4.3 2D Convolution（二维卷积）

是什么

数学定义

输出尺寸计算

代码示例

4.4 ViT Patch Embedding

是什么

算法步骤

代码示例

适用场景

4.5 LoRA（Low-Rank Adaptation）

是什么

数学定义

为什么 B 零初始化

参数效率

代码示例

4.6 Mixture of Experts（MoE，混合专家）

是什么

架构

为什么需要

代码示例

适用场景

相关内容

目录

第四章：架构与模型组件（TorchCode）

第四章：架构与模型组件

4.1 GPT-2 Transformer Block

是什么

架构（Pre-Norm）

Pre-Norm vs Post-Norm

代码示例

关键设计决策

4.2 SwiGLU MLP

是什么

数学定义

与标准 FFN 的区别

门控机制的直觉

代码示例

4.3 2D Convolution（二维卷积）

是什么

数学定义

输出尺寸计算

代码示例

4.4 ViT Patch Embedding

是什么

算法步骤

代码示例

适用场景

4.5 LoRA（Low-Rank Adaptation）

是什么

数学定义

为什么 B 零初始化

参数效率

代码示例

4.6 Mixture of Experts（MoE，混合专家）

是什么

架构

为什么需要

代码示例

适用场景

相关内容

总览：TorchCode 知识架构与学习路径

PyTorch lr曲线

PyTorch 激活函数

PyTorch 分布式训练与操作工具技术文档

第一章：激活函数与基础组件（TorchCode）