RL: MOPO

This post gives a brief introduction to MOPO (Model-Based Offline Policy Optimization).


Basic Idea of MOPO

  • MOPO targets the Offline RL setting and belongs to the family of model-based methods.
  • A learned model predicts state transitions and rewards, and an error (uncertainty) estimator is trained alongside it; whenever a predicted reward is actually used, the uncertainty estimate is applied as a correction to that reward (see the formula below).
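
As a one-line summary of this correction, following the MOPO paper's notation, where \(\hat r\) is the model-predicted reward, \(u(s,a)\) the uncertainty estimate, and \(\lambda\) the penalty coefficient:

\[
\tilde r(s,a) = \hat r(s,a) - \lambda\, u(s,a)
\]

The policy is then trained on model rollouts using \(\tilde r\) in place of \(\hat r\).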

MOPO Method Details

  • MOPO training framework
    • The basic idea is to apply a correction on top of the raw reward: rewards with large predictive variance receive a larger penalty, which prevents the overestimation problem.
  • MOPO training procedure (the Ensemble Uncertainty form, i.e. the Ensemble Penalty form, is described here; the other variants are No Penalty and True Penalty). A standalone sketch of this procedure follows after the code excerpt below.
    • Train N probabilistic dynamics models on the collected static dataset.
    • Whenever an interaction with the environment is needed, randomly pick one of the dynamics models to generate the transition.
    • When correcting the reward, use the maximum over all models of the Frobenius norm of the predicted covariance as the penalty term, where the Frobenius norm of a matrix is defined as \(\Vert A \Vert_F = \sqrt{\sum_{i=1}^{N}\sum_{j=1}^{M}|a_{ij}|^2}\)
      • The mean over models can also be used instead of the maximum: \(u^{\text{mean}}(s,a) = \frac{1}{N}\sum_{i=1}^N\Vert\Sigma_\phi^i(s,a)\Vert_F\)
  • Key code (https://github.com/tianheyu927/mopo/blob/master/mopo/models/fake_env.py#L80-L102)
    if self.penalty_coeff != 0:
        if not self.penalty_learned_var:
            # Penalty from the disagreement of the ensemble's mean predictions (observation part only)
            ensemble_means_obs = ensemble_model_means[:, :, 1:]
            mean_obs_means = np.mean(ensemble_means_obs, axis=0)  # average predictions over models
            diffs = ensemble_means_obs - mean_obs_means
            normalize_diffs = False
            if normalize_diffs:
                obs_dim = next_obs.shape[1]
                obs_sigma = self.model.scaler.cached_sigma[0, :obs_dim]
                diffs = diffs / obs_sigma
            # np.linalg.norm computes the L2 (Euclidean) norm by default; other norms can be selected via the ord argument
            dists = np.linalg.norm(diffs, axis=2)  # distance in obs space
            penalty = np.max(dists, axis=0)        # max distances over models
        else:
            # Ensemble Uncertainty form: max over models of the norm of the predicted std
            penalty = np.amax(np.linalg.norm(ensemble_model_stds, axis=2), axis=0)

        penalty = np.expand_dims(penalty, 1)
        assert penalty.shape == rewards.shape
        unpenalized_rewards = rewards
        penalized_rewards = rewards - self.penalty_coeff * penalty
    else:
        penalty = None
        unpenalized_rewards = rewards
        penalized_rewards = rewards
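
The excerpt above runs inside FakeEnv.step in the repo. As a self-contained illustration of the same steps (sample from N probabilistic dynamics models, pick one model at random per interaction, penalize the reward by the ensemble uncertainty), here is a minimal NumPy sketch. It is not the authors' code: the function name penalized_rollout_step, the argument names, and the assumed array layout (num_models, batch, 1 + obs_dim, reward in index 0 of the last axis as in the excerpt above) are illustrative only.

    import numpy as np

    def penalized_rollout_step(ensemble_means, ensemble_stds, penalty_coeff, rng=None):
        """One imagined rollout step with a MOPO-style ensemble penalty.

        ensemble_means / ensemble_stds have shape (num_models, batch, 1 + obs_dim),
        with the predicted reward stored in index 0 of the last axis.
        """
        rng = np.random.default_rng() if rng is None else rng
        num_models, batch, _ = ensemble_means.shape

        # Sample from every model's Gaussian, then keep one randomly chosen model
        # per batch element ("randomly pick one dynamics model per interaction").
        samples = ensemble_means + rng.standard_normal(ensemble_means.shape) * ensemble_stds
        model_idx = rng.integers(num_models, size=batch)
        chosen = samples[model_idx, np.arange(batch)]   # (batch, 1 + obs_dim)
        rewards = chosen[:, :1]                         # (batch, 1)
        next_obs = chosen[:, 1:]                        # (batch, obs_dim)

        # Ensemble-uncertainty penalty: max over models of the norm of the predicted std;
        # replace np.amax with np.mean over axis=0 to get the u_mean variant.
        penalty = np.amax(np.linalg.norm(ensemble_stds, axis=2), axis=0)   # (batch,)
        penalized_rewards = rewards - penalty_coeff * penalty[:, None]

        return next_obs, penalized_rewards, penalty

    # Toy usage with random numbers standing in for the ensemble's outputs:
    N, B, obs_dim = 5, 4, 3
    means = np.random.randn(N, B, 1 + obs_dim)
    stds = 0.1 * np.abs(np.random.randn(N, B, 1 + obs_dim))
    next_obs, pr, pen = penalized_rollout_step(means, stds, penalty_coeff=1.0)
    print(next_obs.shape, pr.shape, pen.shape)   # (4, 3) (4, 1) (4,)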