Understanding the PPO Loss Function from a Backpropagation Perspective

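Viewed through backpropagation, the PPO loss function is a three-channel cooperative optimization system: three gradient flows (the clipped surrogate, the value loss, and the entropy term) act jointly on the Actor (policy network) and the Critic (value network).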

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t[\min(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)]
$$

$$
L(\theta, \phi) = -L^{\mathrm{CLIP}}(\theta) + c_1 L^{\mathrm{VF}}(\phi) - c_2 H(\pi_\theta)
$$

$$
L^{\mathrm{VF}}(\phi) = \mathbb{E}_t[(V_{\phi}(s_t) - R_t)^2]
$$

$$
H(\pi_\theta) = \mathbb{E}_t[H(\pi_\theta(\cdot \mid s_t))]
$$

$$
H(\pi_\theta(\cdot \mid s_t)) = -\sum_a \pi_\theta(a \mid s_t)\log \pi_\theta(a \mid s_t)
$$

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$
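
As a worked step, the gradient that backpropagation sends through the ratio can be written out explicitly. This assumes, as is standard for PPO, that $\pi_{\theta_{\mathrm{old}}}$ is a fixed snapshot and receives no gradient:

$$
\nabla_\theta r_t(\theta) = \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)} = r_t(\theta)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$

Whenever the unclipped term is the active branch of the min, the per-sample gradient of $L^{\mathrm{CLIP}}$ is therefore $\hat{A}_t\, r_t(\theta)\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)$.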

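When $\hat{A}_t > 0$: the objective grows with $r_t(\theta)$ only up to $1+\epsilon$; past that point the clipped term is selected and the gradient with respect to $\theta$ is zero.

When $\hat{A}_t < 0$: the objective grows as $r_t(\theta)$ shrinks only down to $1-\epsilon$; below that point the clipped term is selected and the gradient is zero.

👉 min + clip implement adaptive gradient clipping: the gradient is switched off only for samples whose ratio has already moved far enough in the direction the advantage favors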
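
The two advantage cases can be checked numerically. The sketch below is illustrative only: the function names and the value `eps=0.2` are assumptions, not taken from the text, and the derivative is estimated by finite differences rather than by an autodiff framework.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def surrogate_grad_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d(objective)/d(ratio)."""
    upper = clipped_surrogate(ratio + h, advantage, eps)
    lower = clipped_surrogate(ratio - h, advantage, eps)
    return (upper - lower) / (2.0 * h)


# Probe both advantage signs at ratios below, inside, and above the clip range.
for advantage in (1.0, -1.0):
    for ratio in (0.5, 1.0, 1.5):
        grad = surrogate_grad_wrt_ratio(ratio, advantage)
        print(f"A={advantage:+.0f}  r={ratio:.1f}  d/dr={grad:+.3f}")
```

The derivative is zero only when the ratio lies outside the clip range on the side the advantage is already pushing toward. A ratio that has drifted the wrong way still receives a full gradient.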

$$
\nabla_\phi L^{\mathrm{VF}}(\phi) = \mathbb{E}_t[2 (V_\phi(s_t) - R_t)\nabla_\phi V_\phi(s_t)]
$$
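
To make the value-channel gradient concrete, here is a minimal sketch that assumes a hypothetical linear critic $V_\phi(s) = \phi \cdot s$, so that $\nabla_\phi V_\phi(s_t) = s_t$. The toy states, returns, and the step size `0.1` are invented for illustration.

```python
def value(phi, state):
    """Hypothetical linear critic: V_phi(s) = phi . s."""
    return sum(p * x for p, x in zip(phi, state))


def value_loss(phi, states, returns):
    """Batch estimate of L^VF = E_t[(V_phi(s_t) - R_t)^2]."""
    errors = [value(phi, s) - r for s, r in zip(states, returns)]
    return sum(e * e for e in errors) / len(errors)


def value_loss_grad(phi, states, returns):
    """Analytic gradient: E_t[2 (V_phi(s_t) - R_t) * grad_phi V_phi(s_t)].

    For the linear critic, grad_phi V_phi(s_t) is simply s_t.
    """
    grad = [0.0] * len(phi)
    for s, r in zip(states, returns):
        error = value(phi, s) - r
        for i, x in enumerate(s):
            grad[i] += 2.0 * error * x / len(states)
    return grad


states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
returns = [1.0, 2.0, 2.5]
phi = [0.0, 0.0]

# A few plain gradient-descent steps should shrink the regression loss.
initial_loss = value_loss(phi, states, returns)
for _ in range(50):
    grad = value_loss_grad(phi, states, returns)
    phi = [p - 0.1 * g for p, g in zip(phi, grad)]
final_loss = value_loss(phi, states, returns)
print(initial_loss, final_loss)
```

The error term $V_\phi(s_t) - R_t$ sets both the sign and the size of each sample's contribution, which is why this channel behaves like ordinary least-squares regression.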

$$
H(\pi_\theta(\cdot \mid s_t)) = -\sum_a \pi_\theta(a \mid s_t)\log \pi_\theta(a \mid s_t)
$$

👉 Pushes the action distribution toward uniform, which keeps the policy from converging prematurely
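
The pull toward a uniform distribution can be seen in a small sketch. It assumes a softmax policy over three discrete actions in a single state; the logits, step size, and function names are illustrative, not from the text. For $p = \mathrm{softmax}(z)$ the entropy gradient is $\partial H / \partial z_i = -p_i(\log p_i + H)$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """H = -sum_a p(a) log p(a), skipping zero-probability actions."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_grad_wrt_logits(logits):
    """Gradient of the softmax entropy with respect to each logit.

    For p = softmax(z): dH/dz_i = -p_i * (log p_i + H).
    Assumes no probability underflows to exactly zero.
    """
    probs = softmax(logits)
    h = entropy(probs)
    return [-p * (math.log(p) + h) for p in probs]


logits = [2.0, 0.5, -1.0]
before = entropy(softmax(logits))

# Gradient ascent on the entropy alone flattens the distribution.
for _ in range(200):
    grad = entropy_grad_wrt_logits(logits)
    logits = [z + 0.5 * g for z, g in zip(logits, grad)]
after = entropy(softmax(logits))
print(before, after, math.log(3))
```

In the full loss the entropy enters as $-c_2 H(\pi_\theta)$, so minimizing the loss performs exactly this ascent, scaled by $c_2$.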

Actor:

$$
\nabla_\theta L = -\nabla_\theta L^{\mathrm{CLIP}}(\theta) - c_2 \nabla_\theta H(\pi_\theta)
$$

Critic:

$$
\nabla_\phi L = c_1 \nabla_\phi L^{\mathrm{VF}}
$$
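
Putting the three channels together, the scalar that backpropagation starts from could be assembled roughly as follows. This is a sketch under stated assumptions: `ppo_loss_terms` is a hypothetical helper, the defaults `eps=0.2`, `c1=0.5`, `c2=0.01` are common choices rather than values given in the text, and the inputs are plain per-sample lists instead of network outputs.

```python
import math


def ppo_loss_terms(new_logp, old_logp, advantages, values, returns,
                   entropies, eps=0.2, c1=0.5, c2=0.01):
    """Return (total loss, L^CLIP, L^VF, mean entropy) for one batch."""
    n = len(advantages)

    # Channel 1: clipped surrogate objective (to be maximized).
    clip_obj = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_t = pi_new / pi_old
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        clip_obj += min(ratio * adv, clipped * adv) / n

    # Channel 2: squared-error value loss (to be minimized).
    value_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / n

    # Channel 3: mean per-state entropy (to be maximized).
    entropy = sum(entropies) / n

    # L = -L^CLIP + c1 * L^VF - c2 * H
    total = -clip_obj + c1 * value_loss - c2 * entropy
    return total, clip_obj, value_loss, entropy


total, clip_obj, vf, ent = ppo_loss_terms(
    new_logp=[math.log(0.5), math.log(0.2)],
    old_logp=[math.log(0.4), math.log(0.4)],
    advantages=[1.0, -1.0],
    values=[0.5, 1.0],
    returns=[1.0, 1.0],
    entropies=[0.6, 0.7],
)
print(total, clip_obj, vf, ent)
```

In this toy batch both samples fall outside the clip range on the favored side, so both contribute the clipped value to the objective and neither would send a policy gradient.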

In essence, PPO:

👉 builds a constrained multi-objective optimization system inside backpropagation

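It simultaneously achieves bounded policy updates (the clipped surrogate), value regression (the value loss), and sustained exploration (the entropy term).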