This note gives a brief introduction to MOPO (Model-Based Offline Policy Optimization).
- Reference links:
Basic Idea of MOPO
- MOPO targets the offline RL setting and is a model-based method.
- A learned model predicts state transitions and rewards, and an error (uncertainty) estimator is trained alongside it; when the predicted reward is actually used, it is corrected with the output of this error estimator.
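- Written out (following the MOPO paper's notation, which these notes do not state explicitly), the corrected reward the policy is trained on is \(\tilde{r}(s,a) = \hat{r}(s,a) - \lambda\,u(s,a)\), where \(\hat{r}\) is the model-predicted reward, \(u(s,a)\) is the uncertainty estimate, and \(\lambda\) is the penalty coefficient (penalty_coeff in the code excerpt further below).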
MOPO Method Details
- MOPO training framework
- The basic idea is to apply a correction on top of the raw reward: rewards predicted with large variance are penalized more heavily, which prevents overestimation.
- MOPO training code (shown here in the ensemble-uncertainty form, i.e. the ensemble-penalty form; other variants include the no-penalty and true-penalty forms)
- Train N probabilistic dynamics models on the collected static dataset
- Whenever an interaction with the environment is needed, pick one of the dynamics models at random to generate the transition
- When correcting the reward, the penalty term is the maximum over the models of the Frobenius norm of the predicted covariance, \(u^{\max}(s,a) = \max_{i=1,\dots,N}\Vert\Sigma_\phi^i(s,a)\Vert_F\), where the Frobenius norm of a matrix is defined as \(\Vert A \Vert_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n}|a_{ij}|^2}\)
- The mean over models can also be used instead of the maximum: \(u^{\text{mean}}(s,a) = \frac{1}{N}\sum_{i=1}^N\Vert\Sigma_\phi^i(s,a)\Vert_F\); a NumPy sketch of both variants is given after the code excerpt below
- Key code (https://github.com/tianheyu927/mopo/blob/master/mopo/models/fake_env.py#L80-L102)
```python
if self.penalty_coeff != 0:
    if not self.penalty_learned_var:
        # Penalty from ensemble disagreement: distance of each model's
        # predicted observation mean from the ensemble-average prediction.
        ensemble_means_obs = ensemble_model_means[:, :, 1:]
        mean_obs_means = np.mean(ensemble_means_obs, axis=0)    # average predictions over models
        diffs = ensemble_means_obs - mean_obs_means
        normalize_diffs = False
        if normalize_diffs:
            obs_dim = next_obs.shape[1]
            obs_sigma = self.model.scaler.cached_sigma[0, :obs_dim]
            diffs = diffs / obs_sigma
        # distance in obs space; by default np.linalg.norm computes the Euclidean
        # (2-)norm, here taken along the observation dimension (axis=2)
        dists = np.linalg.norm(diffs, axis=2)
        penalty = np.max(dists, axis=0)                         # max distances over models
    else:
        # Penalty from the learned variance: max over models of the norm of the predicted stds
        penalty = np.amax(np.linalg.norm(ensemble_model_stds, axis=2), axis=0)

    penalty = np.expand_dims(penalty, 1)
    assert penalty.shape == rewards.shape
    unpenalized_rewards = rewards
    penalized_rewards = rewards - self.penalty_coeff * penalty
else:
    penalty = None
    unpenalized_rewards = rewards
    penalized_rewards = rewards
```
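- To tie the \(u^{\max}\) and \(u^{\text{mean}}\) formulas above to the penalty_learned_var branch of the excerpt, here is a minimal, self-contained NumPy sketch. The function name ensemble_penalty, the array shapes, and the penalty_coeff value are illustrative assumptions, not part of the MOPO repository.

```python
import numpy as np

def ensemble_penalty(ensemble_stds: np.ndarray, mode: str = "max") -> np.ndarray:
    """Uncertainty penalty from an ensemble of probabilistic dynamics models.

    ensemble_stds: predicted standard deviations, shape (N, batch, dim)
                   (N models, one std vector per sample; shapes are assumptions).
    Returns a penalty of shape (batch, 1).
    """
    # 2-norm of each model's predicted std vector, as in the
    # penalty_learned_var branch of the excerpt above.
    per_model_norms = np.linalg.norm(ensemble_stds, axis=2)      # (N, batch)
    if mode == "max":       # max over models, mirroring u^max
        penalty = np.max(per_model_norms, axis=0)
    elif mode == "mean":    # average over models, mirroring u^mean
        penalty = np.mean(per_model_norms, axis=0)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return penalty[:, None]                                      # (batch, 1)


# Toy usage with made-up shapes: 7 models, batch of 4, 11-dimensional output.
rng = np.random.default_rng(0)
ensemble_stds = rng.uniform(0.01, 0.5, size=(7, 4, 11))
rewards = rng.normal(size=(4, 1))

penalty_coeff = 1.0   # the lambda in the penalized reward; value is illustrative
penalized_rewards = rewards - penalty_coeff * ensemble_penalty(ensemble_stds, mode="max")
```

- Using the mean instead of the max gives a less conservative penalty (the mean is never larger than the max); the repository excerpt above uses the max form.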