第四章:架构与模型组件(TorchCode)
系列 - TorchCode 系列
目录
第四章:架构与模型组件
本章介绍如何将基础组件组装为完整的模型模块:GPT-2 Block、SwiGLU MLP、Conv2d、ViT Patch Embedding、LoRA、Mixture of Experts。
4.1 GPT-2 Transformer Block
是什么
GPT-2 Block 是 GPT 系列模型的基本构建单元,采用 Pre-Norm 架构,包含因果自注意力和前馈网络,通过残差连接组合。
架构(Pre-Norm)
输入 x
│
├──────────────────────────┐
│ │ (残差连接)
▼ │
LayerNorm(x) │
│ │
▼ │
Causal Self-Attention │
│ │
▼ │
+ ◄────────────────────────┘
│
├──────────────────────────┐
│ │ (残差连接)
▼ │
LayerNorm(x) │
│ │
▼ │
MLP (Linear→GELU→Linear) │
│ │
▼ │
+ ◄────────────────────────┘
│
▼
输出Pre-Norm vs Post-Norm
- Pre-Norm:先归一化再做注意力/MLP(GPT-2、LLaMA 使用)
- Post-Norm:先做注意力/MLP 再归一化(原始 Transformer 论文)
- Pre-Norm 训练更稳定,不需要 warmup
代码示例
import torch
import torch.nn as nn
import math
class GPT2Block(nn.Module):
def __init__(self, d_model, num_heads):
super().__init__()
self.ln1 = nn.LayerNorm(d_model)
self.ln2 = nn.LayerNorm(d_model)
# 注意力投影
self.W_q = nn.Linear(d_model, d_model)
self.W_k = nn.Linear(d_model, d_model)
self.W_v = nn.Linear(d_model, d_model)
self.W_o = nn.Linear(d_model, d_model)
self.num_heads = num_heads
self.dk = d_model // num_heads
# MLP: d → 4d → d
self.mlp = nn.Sequential(
nn.Linear(d_model, 4 * d_model),
nn.GELU(),
nn.Linear(4 * d_model, d_model),
)
def causal_attn(self, x):
B, S, _ = x.shape
q = self.W_q(x).view(B, S, self.num_heads, self.dk).transpose(1, 2)
k = self.W_k(x).view(B, S, self.num_heads, self.dk).transpose(1, 2)
v = self.W_v(x).view(B, S, self.num_heads, self.dk).transpose(1, 2)
scores = (q @ k.transpose(-1, -2)) / math.sqrt(self.dk)
mask = torch.triu(torch.full((S, S), float('-inf'), device=x.device), diagonal=1)
scores = scores + mask
attn = torch.softmax(scores, dim=-1) @ v
return self.W_o(attn.transpose(1, 2).contiguous().view(B, S, -1))
def forward(self, x):
x = x + self.causal_attn(self.ln1(x)) # 残差 + 注意力
x = x + self.mlp(self.ln2(x)) # 残差 + MLP
return x
# 测试
block = GPT2Block(d_model=64, num_heads=4)
x = torch.randn(2, 8, 64)
print(block(x).shape) # (2, 8, 64)
print("参数量:", sum(p.numel() for p in block.parameters()))关键设计决策
- MLP 隐藏层维度 = 4 × d_model(经验值)
- 使用 GELU 激活(而非 ReLU)
- 残差连接确保梯度流通
4.2 SwiGLU MLP
是什么
SwiGLU 是现代 LLM(LLaMA、Mistral、PaLM)使用的前馈网络,用门控机制替代了 GPT-2 的简单 FFN。
数学定义
其中 SiLU(Swish)= , 是逐元素乘法。
与标准 FFN 的区别
| 标准 FFN (GPT-2) | SwiGLU MLP (LLaMA) |
|---|---|
| Linear(d, 4d) → GELU → Linear(4d, d) | gate_proj(d, d_ff) + up_proj(d, d_ff) → SiLU·gate → down_proj(d_ff, d) |
| 2 个线性层 | 3 个线性层 |
| 无门控 | 门控机制 |
门控机制的直觉
gate_proj决定"哪些信息通过"(经过 SiLU 激活后作为门)up_proj提供"要传递的内容"- 两者逐元素相乘,实现选择性信息传递
代码示例
import torch
import torch.nn as nn
import torch.nn.functional as F
class SwiGLUMLP(nn.Module):
def __init__(self, d_model, d_ff):
super().__init__()
self.gate_proj = nn.Linear(d_model, d_ff)
self.up_proj = nn.Linear(d_model, d_ff)
self.down_proj = nn.Linear(d_ff, d_model)
def forward(self, x):
gate = F.silu(self.gate_proj(x)) # SiLU = x * sigmoid(x)
up = self.up_proj(x)
return self.down_proj(gate * up)
# 测试
mlp = SwiGLUMLP(d_model=64, d_ff=128)
x = torch.randn(2, 8, 64)
print(mlp(x).shape) # (2, 8, 64)
# 参数量:3 个线性层 = 64*128 + 64*128 + 128*64 + biases
print("参数量:", sum(p.numel() for p in mlp.parameters()))4.3 2D Convolution(二维卷积)
是什么
卷积操作用滑动窗口(卷积核)在输入特征图上提取局部特征,是 CNN 的核心操作。
数学定义
对于输入 (B, C_in, H, W) 和卷积核 (C_out, C_in, kH, kW):
输出尺寸计算
代码示例
import torch
import torch.nn.functional as F
def my_conv2d(x, weight, bias=None, stride=1, padding=0):
if padding > 0:
x = F.pad(x, [padding] * 4)
B, C_in, H, W = x.shape
C_out, _, kH, kW = weight.shape
H_out = (H - kH) // stride + 1
W_out = (W - kW) // stride + 1
output = torch.zeros(B, C_out, H_out, W_out, device=x.device)
for i in range(H_out):
for j in range(W_out):
h_start = i * stride
w_start = j * stride
patch = x[:, :, h_start:h_start+kH, w_start:w_start+kW]
# patch: (B, C_in, kH, kW), weight: (C_out, C_in, kH, kW)
output[:, :, i, j] = (patch.unsqueeze(1) * weight.unsqueeze(0)).sum(dim=(2, 3, 4))
if bias is not None:
output += bias.view(1, -1, 1, 1)
return output
# 测试
x = torch.randn(1, 3, 8, 8)
w = torch.randn(16, 3, 3, 3)
out = my_conv2d(x, w, stride=1, padding=1)
print(out.shape) # (1, 16, 8, 8)
ref = F.conv2d(x, w, padding=1)
print("匹配:", torch.allclose(out, ref, atol=1e-4))4.4 ViT Patch Embedding
是什么
Vision Transformer(ViT)的第一步:将图像分割为不重叠的 patch,每个 patch 展平后通过线性层投影为嵌入向量。
算法步骤
- 将
(B, C, H, W)图像分割为(H/P × W/P)个 patch,每个 patch 大小(C, P, P) - 展平每个 patch 为
(C × P × P)维向量 - 线性投影到
embed_dim维
代码示例
import torch
import torch.nn as nn
class PatchEmbedding(nn.Module):
def __init__(self, img_size, patch_size, in_channels, embed_dim):
super().__init__()
self.patch_size = patch_size
self.num_patches = (img_size // patch_size) ** 2
patch_dim = in_channels * patch_size * patch_size
self.proj = nn.Linear(patch_dim, embed_dim)
def forward(self, x):
B, C, H, W = x.shape
P = self.patch_size
# 重排为 patches: (B, num_patches, C*P*P)
x = x.unfold(2, P, P).unfold(3, P, P) # (B, C, H/P, W/P, P, P)
x = x.contiguous().view(B, C, -1, P, P) # (B, C, N, P, P)
x = x.permute(0, 2, 1, 3, 4).contiguous().view(B, -1, C * P * P)
return self.proj(x)
# 测试:32×32 图像,8×8 patch,3 通道,64 维嵌入
pe = PatchEmbedding(32, 8, 3, 64)
x = torch.randn(2, 3, 32, 32)
print(pe(x).shape) # (2, 16, 64) — 16 个 patch
print(pe.num_patches) # 16适用场景
- ViT(Vision Transformer)
- CLIP 的视觉编码器
- 任何将图像输入 Transformer 的场景
4.5 LoRA(Low-Rank Adaptation)
是什么
LoRA 是一种参数高效微调方法。冻结预训练权重,只训练两个低秩矩阵 A 和 B,使得微调参数量大幅减少。
数学定义
- :冻结的预训练权重
(out, in) - :低秩矩阵
(rank, in),随机初始化 - :低秩矩阵
(out, rank),零初始化 - :缩放因子
为什么 B 零初始化
训练开始时 ,LoRA 不改变原始模型的输出。随着训练进行,B 逐渐学习到有意义的值。
参数效率
原始 Linear(1024, 1024) 有 ~1M 参数。LoRA rank=8 只需 1024×8 + 8×1024 = 16K 参数(1.6%)。
代码示例
import torch
import torch.nn as nn
class LoRALinear(nn.Module):
def __init__(self, in_features, out_features, rank, alpha=1.0):
super().__init__()
self.linear = nn.Linear(in_features, out_features)
self.linear.weight.requires_grad_(False)
if self.linear.bias is not None:
self.linear.bias.requires_grad_(False)
self.lora_A = nn.Parameter(torch.randn(rank, in_features))
self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
self.scaling = alpha / rank
def forward(self, x):
base = self.linear(x)
lora = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
return base + lora
# 测试
layer = LoRALinear(16, 8, rank=4)
x = torch.randn(2, 16)
print(layer(x).shape) # (2, 8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"可训练参数: {trainable}, 总参数: {total}, 比例: {trainable/total:.1%}")4.6 Mixture of Experts(MoE,混合专家)
是什么
MoE 层包含多个"专家"(独立的 MLP),通过路由网络为每个 token 选择 top-k 个专家,只激活部分专家进行计算。
架构
输入 token
│
▼
Router (Linear → Softmax)
│
├── Expert 0 (MLP) ← 权重 0.7
├── Expert 1 (MLP) ← 权重 0.3
├── Expert 2 (MLP) ← 未选中
└── Expert 3 (MLP) ← 未选中
│
▼
加权求和输出为什么需要
- 增加模型容量而不成比例增加计算量
- Mixtral 8x7B:8 个专家,每次选 2 个,总参数 47B 但每个 token 只用 ~13B
- 实现"条件计算":不同 token 走不同路径
代码示例
import torch
import torch.nn as nn
class MixtureOfExperts(nn.Module):
def __init__(self, d_model, d_ff, num_experts, top_k=2):
super().__init__()
self.top_k = top_k
self.router = nn.Linear(d_model, num_experts)
self.experts = nn.ModuleList([
nn.Sequential(
nn.Linear(d_model, d_ff),
nn.ReLU(),
nn.Linear(d_ff, d_model),
)
for _ in range(num_experts)
])
def forward(self, x):
B, S, D = x.shape
x_flat = x.view(-1, D) # (B*S, D)
# 路由:选择 top-k 专家
router_logits = self.router(x_flat) # (B*S, num_experts)
topk_vals, topk_idx = router_logits.topk(self.top_k, dim=-1)
topk_weights = torch.softmax(topk_vals, dim=-1) # 归一化权重
# 计算每个选中专家的输出并加权求和
output = torch.zeros_like(x_flat)
for k in range(self.top_k):
expert_idx = topk_idx[:, k] # (B*S,)
weight = topk_weights[:, k].unsqueeze(-1) # (B*S, 1)
for e_idx in range(len(self.experts)):
mask = (expert_idx == e_idx)
if mask.any():
output[mask] += weight[mask] * self.experts[e_idx](x_flat[mask])
return output.view(B, S, D)
# 测试
moe = MixtureOfExperts(32, 64, num_experts=4, top_k=2)
x = torch.randn(2, 8, 32)
print(moe(x).shape) # (2, 8, 32)适用场景
- Mixtral 8x7B、Switch Transformer、GShard
- 需要大模型容量但计算预算有限的场景