RL——MBPO

本文简单介绍 MBPO（Model-Based Policy Optimization）模型

参考链接：
- 原始论文：When to Trust Your Model: Model-Based Policy Optimization, UC Berkeley, NeurIPS 2019

MBPO 基本思想

主要用于解决 Online RL 的问题，属于 Model-based 方法

MBPO 方法详情

MBPO 训练代码
- 在收集到的奖励上减去一个误差项
- 训练时，Policy Optimization使用的是SAC方法：
其中，关键变量定义如下：

\(\eta[\pi]\) denotes the returns of the policy in the true MDP, whereas \(\hat{\eta}[\pi]\) denotes the returns of the policy under our model. Such a statement guarantees that, as long as we improve by at least C under the model, we can guarantee improvement on the true MDP.
- \(\eta[\pi]\) 定义为真实MDP的累计折扣收益：
  $$ \eta[\pi] = \mathbb{E}_\pi\Big[ \sum_{t=0}^\infty \gamma^t r(s_t, a_t) \Big]$$
- \(\hat{\eta}[\pi]\) 定义为模型返回的累计折扣收益
- \(\eta[\pi]\) 和 \(\hat{\eta}[\pi]\) 的关系：
  $$ \eta[\pi] \gt \hat{\eta}[\pi] - C $$
- 关于 \(C(\epsilon_m, \epsilon_\pi)\) 的定义：
- \(\epsilon_m\) : 模型误差
  $$ \epsilon_m = \max_t\mathbb{E}_{s\sim \pi_\text{D,t}} [D_{TV}(p(s’,r|s,a)\Vert p_\theta(s’,r|s,a))] $$
  - \(s\sim \pi_\text{D,t}\) 表示时间步 \(t\) 采样到状态 \(s\)
- \( \epsilon_\pi\) : 策略偏移
  $$ \max_s D_{TV}(\pi \Vert \pi_\text{D}) \lt \epsilon_\pi $$

附录：动手学强化学习中介绍的 MBPO

训练流程
- 没有考虑减去误差项