RL: Soft Q-Learning


Recap of the standard reinforcement learning setting

  • Objective definition:
    $$
    \begin{align}
    J(\theta) &= \mathbb{E}_{\tau\sim \pi_\theta}[R(\tau)] \\
    &= \mathbb{E}_{s_0, a_0,\ldots \sim \pi_\theta}\Big[\sum\limits_{t=0}^{\infty}\gamma^t r(s_t, a_t) \Big] \\
    &= \sum_{t=0}^\infty \mathbb{E}_{(s_t,a_t) \sim \rho_{\pi_\theta}} [r(s_t, a_t)]
    \end{align}
    $$
    • The first line is the form used in standard policy-gradient derivations and is closest to the underlying idea
    • The second line is the form used in the TRPO paper; it is the expanded version of the first line
    • The third line, \(\sum_{t=0}^\infty \mathbb{E}_{(s_t,a_t) \sim \rho_{\pi_\theta}} [r(s_t, a_t)]\), is the form used in the Soft Q-Learning and SAC papers and differs from the notation in TRPO and earlier work. Going from the second line to the third amounts to exchanging the order of summation and expectation: starting from \(s_0\), each \((s_t,a_t)\) follows the state-action distribution induced by executing policy \(\pi_\theta\)
  • Definition of the Q-value:
    $$
    \begin{align}
    Q^\pi(s_t,a_t) &= \mathbb{E}_{s_{t+1},a_{t+1},\cdots \sim \pi}\Big[\sum\limits_{l=0}^\infty\gamma^lr(s_{t+l}, a_{t+l})\Big] \vert_{(s_t,a_t)} \\
    &= r(s_t, a_t) + \mathbb{E}_{s_{t+1},a_{t+1},\cdots \sim \pi}\Big[\sum\limits_{l=1}^\infty\gamma^lr(s_{t+l}, a_{t+l})\Big] \vert_{(s_t,a_t)} \\
    &= r(s_t, a_t) + \sum\limits_{l=1}^\infty \mathbb{E}_{(s_{t+l},a_{t+l}) \sim \rho_\pi}[\gamma^lr(s_{t+l}, a_{t+l})] \vert_{(s_t,a_t)}
    \end{align}
    $$
    • "\(\vert_{(s_t,a_t)}\)" denotes conditioning on \((s_t,a_t)\) (a conditional expectation); when no ambiguity arises, "\(\vert_{(s_t,a_t)}\)" may be omitted
  • Definition of the V-value (a Monte Carlo sanity check of these definitions follows this list):
    $$
    \begin{align}
    V^\pi(s_t) &= \mathbb{E}_{a_t,s_{t+1},a_{t+1},\cdots \sim \pi}\Big[\sum\limits_{l=0}^\infty\gamma^lr(s_{t+l}, a_{t+l})\Big] \vert_{s_t} \\
    &= \mathbb{E}_{a_t \sim \pi}[Q^\pi(s_t, a_t)]
    \end{align}
    $$
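
As a quick sanity check on the definitions above, here is a minimal Monte Carlo sketch in Python/NumPy: it estimates \(Q^\pi\) by rolling out trajectories and then verifies \(V^\pi(s) \approx \mathbb{E}_{a\sim\pi}[Q^\pi(s,a)]\). The 2-state, 2-action MDP (`P`, `r`) and the policy `pi` are made-up illustrative numbers, not taken from any paper.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP, purely for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],                   # r[s, a]
              [0.5, 2.0]])
pi = np.array([[0.7, 0.3],                  # pi(a|s): an arbitrary stochastic policy
               [0.4, 0.6]])
gamma, horizon, n_rollouts = 0.9, 100, 2000
rng = np.random.default_rng(0)

def discounted_return(s, a):
    """One Monte Carlo sample of sum_l gamma^l r(s_{t+l}, a_{t+l}) given (s_t, a_t) = (s, a)."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):                # truncate the infinite sum
        g += discount * r[s, a]
        discount *= gamma
        s = rng.choice(2, p=P[s, a])        # s_{t+1} ~ p(.|s_t, a_t)
        a = rng.choice(2, p=pi[s])          # a_{t+1} ~ pi(.|s_{t+1})
    return g

q_mc = np.array([[np.mean([discounted_return(s, a) for _ in range(n_rollouts)])
                  for a in range(2)] for s in range(2)])
v_mc = (pi * q_mc).sum(axis=1)              # V(s) = E_{a~pi}[Q(s, a)]
print("Q estimate:\n", q_mc, "\nV estimate:", v_mc)
```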

Soft Q-Learning

  • Objective definition
    $$ J(\phi) = \sum_{t=0}^T \mathbb{E}_{(s_t, a_t) \sim \rho_{\pi_\phi}} [r(s_t, a_t) + \alpha \mathcal{H}(\pi_\phi(\cdot\vert s_t))] $$
    • The entropy term added to the objective is where the name "Soft" comes from. Compared with standard RL, the only addition is the extra term \(\alpha \mathcal{H}(\pi_\phi(\cdot\vert s_t))\); when no ambiguity arises we also abbreviate it as \(\alpha \mathcal{H}(\pi_\phi(s_{t}))\). For convenience, the derivations below often take \(\alpha=1\); this case can be recovered by multiplying both the reward and the entropy term by \(\frac{1}{\alpha}\), i.e. rescaling the reward to \(r/\alpha\)
  • Definition of the soft Q-value
    $$
    \begin{align}
    Q_{\text{soft}}^\pi(s_t, a_t) &= r(s_t, a_t) + \mathbb{E}_{s_{t+1},a_{t+1},\cdots \sim \pi}\Big[\sum\limits_{l=1}^\infty\gamma^l(r(s_{t+l}, a_{t+l}) +\alpha\mathcal{H}(\pi(s_{t+l})))\Big] \\
    &= r(s_t, a_t) + \sum\limits_{l=1}^\infty \mathbb{E}_{(s_{t+l},a_{t+l}) \sim \rho_\pi}[\gamma^l(r(s_{t+l}, a_{t+l}) + \alpha\mathcal{H}(\pi(s_{t+l})))]
    \end{align}
    $$
    • For \(Q_{\text{soft}}^\pi(s_t, a_t)\), the pair \((s_t,a_t)\) has already been realized: \(a_t\) is a fixed action, so the entropy term at time \(t\) is taken to be \(0\), i.e. no \(\alpha\mathcal{H}(\pi(s_{t}))\) term appears in the definition
  • Definition of the soft V-value, together with the derivation expressing it in terms of the soft Q-value (a numerical check of the final identity follows this list)
    $$
    \begin{align}
    V_{\text{soft}}^\pi(s_t) &= \mathbb{E}_{a_t,s_{t+1},a_{t+1},\cdots \sim \pi}\Big[\sum\limits_{l=0}^\infty\gamma^l(r(s_{t+l}, a_{t+l}) + \alpha\mathcal{H}(\pi(s_{t+l})))\Big] \\
    &= \mathbb{E}_{a_t \sim \pi}\Big[r(s_t, a_t) + \alpha\mathcal{H}(\pi(s_{t})) + \mathbb{E}_{s_{t+1},a_{t+1},\cdots \sim \pi}\Big[\sum\limits_{l=1}^\infty\gamma^l(r(s_{t+l}, a_{t+l}) +\alpha\mathcal{H}(\pi(s_{t+l})))\Big]\Big] \\
    &= \mathbb{E}_{a_t \sim \pi}\Big[r(s_t, a_t) + \mathbb{E}_{s_{t+1},a_{t+1},\cdots \sim \pi}\Big[\sum\limits_{l=1}^\infty\gamma^l(r(s_{t+l}, a_{t+l}) +\alpha\mathcal{H}(\pi(s_{t+l})))\Big] + \alpha\mathcal{H}(\pi(s_{t}))\Big] \\
    &= \mathbb{E}_{a_t \sim \pi}[Q_{\text{soft}}^\pi(s_t, a_t)] + \alpha\mathcal{H}(\pi(s_{t})) \\
    &= \mathbb{E}_{a_t \sim \pi} [Q_{\text{soft}}^\pi(s_t, a_t)] - \alpha \mathbb{E}_{a_t \sim \pi} [\log \pi(a_t \vert s_t)]\\
    &= \mathbb{E}_{a_t \sim \pi} [Q_{\text{soft}}^\pi(s_t, a_t) - \alpha \log \pi(a_t \vert s_t)]
    \end{align}
    $$
  • Derivation expressing the soft Q-value in terms of the soft V-value
    $$
    \begin{aligned}
    Q_{\text{soft}}^\pi(s_t, a_t) &= r(s_t, a_t) + \mathbb{E}_{s_{t+1},a_{t+1},\cdots \sim \pi}\Big[\sum\limits_{l=1}^\infty\gamma^l(r(s_{t+l}, a_{t+l}) +\alpha\mathcal{H}(\pi(s_{t+l})))\Big] \\
    &= r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1},a_{t+1},\cdots \sim \pi}\Big[\sum\limits_{l=1}^\infty\gamma^{l-1}(r(s_{t+l}, a_{t+l}) +\alpha\mathcal{H}(\pi(s_{t+l})))\Big] \\
    &= r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1},a_{t+1},\cdots \sim \pi}\Big[\sum\limits_{l=0}^\infty\gamma^l(r(s_{t+1+l}, a_{t+1+l}) +\alpha\mathcal{H}(\pi(s_{t+1+l})))\Big] \\
    &= r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t, a_t)} \Big[\mathbb{E}_{a_{t+1},s_{t+2},a_{t+2},\cdots \sim \pi}\Big[\sum\limits_{l=0}^\infty\gamma^l(r(s_{t+1+l}, a_{t+1+l}) + \alpha\mathcal{H}(\pi(s_{t+1+l})))\Big]\Big] \\
    &= r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t, a_t)} [V_{\text{soft}}^\pi(s_{t+1})] \\
    \end{aligned}
    $$
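
As referenced above, here is a minimal numerical check of the two equivalent forms of the soft value, \(V_{\text{soft}}^\pi(s) = \mathbb{E}_{a \sim \pi}[Q_{\text{soft}}^\pi(s,a) - \alpha \log \pi(a \vert s)] = \mathbb{E}_{a \sim \pi}[Q_{\text{soft}}^\pi(s,a)] + \alpha\mathcal{H}(\pi(\cdot\vert s))\), for a single state with a discrete action set; the Q-values and the policy below are made-up numbers.

```python
import numpy as np

alpha = 0.5
q_soft = np.array([1.0, 2.0, 0.5])   # made-up Q_soft(s, a) for three actions
pi = np.array([0.2, 0.5, 0.3])       # made-up pi(a|s), sums to 1

entropy = -(pi * np.log(pi)).sum()                      # H(pi(.|s))
v_a = (pi * (q_soft - alpha * np.log(pi))).sum()        # E_a[Q - alpha*log pi]
v_b = (pi * q_soft).sum() + alpha * entropy             # E_a[Q] + alpha*H
print(v_a, v_b, np.isclose(v_a, v_b))                   # the two forms agree
```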

Soft Bellman Expectation Equation

  • From the derivations above, the soft Bellman expectation equations are (a tabular soft policy evaluation sketch follows this list):
    $$
    \begin{aligned}
    Q_{\text{soft}}^\pi(s_t, a_t) &= r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t, a_t)} [V_{\text{soft}}^\pi(s_{t+1})] \\
    Q_{\text{soft}}^\pi(s_t, a_t) &= r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t, a_t)} [\mathbb{E}_{a_{t+1} \sim \pi} [Q_{\text{soft}}^\pi(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \vert s_{t+1})]] \\
    Q_{\text{soft}}^\pi(s_t, a_t) &= r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t, a_t), a_{t+1} \sim \pi} [Q_{\text{soft}}^\pi(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \vert s_{t+1})] \\
    V_{\text{soft}}^\pi(s_t) &= \mathbb{E}_{a_t \sim \pi} [Q_{\text{soft}}^\pi(s_t, a_t) - \alpha \log \pi(a_t \vert s_t)] \\
    V_{\text{soft}}^\pi(s_t) &= \mathbb{E}_{a_t \sim \pi}[Q_{\text{soft}}^\pi(s_t, a_t)] + \alpha\mathcal{H}(\pi(s_{t}))
    \end{aligned}
    $$
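
As mentioned above, the following sketch runs tabular soft policy evaluation, iterating the two backups of the soft Bellman expectation equation to a fixed point for a fixed stochastic policy. The 2-state MDP and the policy are the same made-up illustrative numbers used earlier.

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.5, 2.0]])    # r[s, a]
pi = np.array([[0.7, 0.3], [0.4, 0.6]])   # fixed stochastic policy pi(a|s)
gamma, alpha = 0.9, 0.5

q = np.zeros((2, 2))
for _ in range(500):                                     # iterate to (near) convergence
    v = (pi * (q - alpha * np.log(pi))).sum(axis=1)      # V_soft(s) = E_a[Q_soft - alpha*log pi]
    q = r + gamma * P @ v                                # Q_soft(s,a) = r + gamma*E_{s'}[V_soft(s')]
print("Q_soft:\n", q, "\nV_soft:", v)
```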

Soft Bellman Optimality Equation

  • First, we show that under the following policy improvement step (where \(\tilde{\pi}\) is the new policy and \(\pi\) is the old policy),
    $$
    \tilde{\pi}(\cdot | s) \propto \exp \left(Q_{\text{soft}}^{\pi}(s, \cdot) \right), \quad \forall s.
    $$
  • the soft Q-value improves monotonically, i.e.:
    $$
    Q_{\text{soft}}^{\tilde{\pi}}(s, a) \geq Q_{\text{soft}}^{\pi}(s, a), \quad \forall s, a.
    $$
  • The proof of this monotonic improvement is as follows:
    • The proof does not include the temperature \(\alpha\); to include it, simply multiply every entropy term \(\mathcal{H}(\pi(\cdot\vert\cdot))\) by \(\alpha\)
    • It relies on the following inequality:
      $$
      \mathcal{H}(\pi(\cdot\vert s)) + \mathbb{E}_{a\sim \pi}[Q_{\text{soft}}^\pi(s,a)] \le \mathcal{H}(\tilde{\pi}(\cdot\vert s)) + \mathbb{E}_{a\sim \tilde{\pi}}[Q_{\text{soft}}^\pi(s,a)]
      $$
    • The inequality above in turn follows from the identity below: the identity holds with any policy substituted for \(\pi\) in the entropy and the expectation (with \(Q_{\text{soft}}^\pi\) and \(\tilde{\pi}\) held fixed), and since \(D_{\text{KL}} \ge 0\) with equality exactly when that policy equals \(\tilde{\pi}\), the left-hand side, viewed as a function of that policy, is maximized at \(\tilde{\pi}\), which gives the inequality above:
      $$
      \mathcal{H}(\pi(\cdot\vert s)) + \mathbb{E}_{a\sim \pi}[Q_{\text{soft}}^\pi(s,a)] = - D_{\text{KL}}(\pi(\cdot\vert s) \,\Vert\, \tilde{\pi}(\cdot\vert s)) + \log \int \exp \left(Q_{\text{soft}}^{\pi}(s, a)\right) da
      $$
      • The proof of this identity is as follows:
        $$
        \begin{align}
        \text{RHS} &= - \int \pi(a \vert s) \log \frac{\pi(a \vert s)}{\tilde{\pi}(a \vert s)} da + \log \int \exp \left(Q_{\text{soft}}^{\pi}(s, a')\right) da' \\
        &= - \int \pi(a \vert s) \log \pi(a \vert s) da + \int \pi(a \vert s) \log {\tilde{\pi}(a \vert s)} da + \int \pi(a \vert s) \Big(\log \int \exp \left(Q_{\text{soft}}^{\pi}(s, a')\right) da'\Big) da \\
        &= - \int \pi(a \vert s) \log \pi(a \vert s) da + \int \pi(a \vert s) \log \frac{\exp \left(Q_{\text{soft}}^{\pi}(s, a) \right)}{\int \exp \left(Q_{\text{soft}}^{\pi}(s, a')\right) da'} da + \int \pi(a \vert s) \Big(\log \int \exp \left(Q_{\text{soft}}^{\pi}(s, a')\right) da'\Big) da \\
        &= - \int \pi(a \vert s) \log \pi(a \vert s) da + \int \pi(a \vert s) \log \exp \left(Q_{\text{soft}}^{\pi}(s, a) \right) da \\
        &= \mathcal{H}(\pi(\cdot\vert s)) + \mathbb{E}_{a\sim \pi}[Q_{\text{soft}}^\pi(s,a)] \\
        &= \text{LHS}
        \end{align}
        $$
  • Then, from the identity
    $$
    \begin{align}
    \mathcal{H}(\pi(\cdot\vert s)) + \mathbb{E}_{a\sim \pi}[Q_{\text{soft}}^\pi(s,a)] = - D_{\text{KL}}(\pi(\cdot\vert s) \,\Vert\, \tilde{\pi}(\cdot\vert s)) + \log \int \exp \left(Q_{\text{soft}}^{\pi}(s, a)\right) da
    \end{align}
    $$
    • where \(\tilde{\pi}\) is the policy obtained from the update \(\tilde{\pi}(\cdot\vert s) = \frac{\exp \left(Q_{\text{soft}}^{\pi}(s, \cdot) \right)}{\int \exp \left(Q_{\text{soft}}^{\pi}(s, a)\right) da}\)
    • The derivation above omits the temperature, i.e. it takes \(\alpha = 1\); with the temperature included, \(\tilde{\pi}(\cdot\vert s) = \frac{\exp \left(\frac{1}{\alpha}Q_{\text{soft}}^{\pi}(s, \cdot) \right)}{\int \exp \left(\frac{1}{\alpha}Q_{\text{soft}}^{\pi}(s, a)\right) da}\)
    • Question: if the policy entropy is 0, or the temperature is 0, does this optimal form still reduce to \(\pi^*(s) = \mathop{\arg\max}_a Q(s,a)\)?
      • When the policy entropy is 0, yes: the policy must then be deterministic, and by the formula above the only admissible action is the one with the largest Q-value
      • When the temperature is 0, the reward is the same as in DQN, so the optimum should be \(\pi^*(s) = \mathop{\arg\max}_a Q(s,a)\). Using the temperature-aware form \(\tilde{\pi}(\cdot\vert s) = \frac{\exp \left(\frac{1}{\alpha}Q_{\text{soft}}^{\pi}(s, \cdot) \right)}{\int \exp \left(\frac{1}{\alpha}Q_{\text{soft}}^{\pi}(s, a)\right) da}\), as \(\alpha\rightarrow 0\) the policy concentrates on the action with the largest Q-value, which is equivalent to DQN (see the numerical sketch after this list)
  • Once the policy has converged, i.e. after iterating until \(\tilde{\pi} = \pi = \pi^*\), we have \(D_{\text{KL}}(\pi^*(\cdot\vert s) \,\Vert\, \tilde{\pi}(\cdot\vert s)) = 0\), and therefore
    $$ \mathcal{H}(\pi^*(\cdot\vert s)) + \mathbb{E}_{a\sim \pi^*}[Q_{\text{soft}}^{\pi^*}(s,a)] = \log \int \exp \left(Q_{\text{soft}}^{\pi^*}(s, a)\right) da $$
  • Including the temperature (so that \(\pi^* \propto \exp(Q_{\text{soft}}^{\pi^*}/\alpha)\)), the corresponding identity is:
    $$ \alpha\mathcal{H}(\pi^*(\cdot\vert s)) + \mathbb{E}_{a\sim \pi^*}[Q_{\text{soft}}^{\pi^*}(s,a)] = \alpha\log \int \exp \left(\frac{1}{\alpha}Q_{\text{soft}}^{\pi^*}(s, a)\right) da $$
  • Dividing both sides by \(\alpha\) gives
    $$ \mathcal{H}(\pi^*(\cdot\vert s)) + \mathbb{E}_{a\sim \pi^*}[\frac{1}{\alpha}Q_{\text{soft}}^{\pi^*}(s,a)] = \log \int \exp \left(\frac{1}{\alpha}Q_{\text{soft}}^{\pi^*}(s, a)\right) da $$
  • Since
    $$
    V_{\text{soft}}^\pi(s_t) = \mathbb{E}_{a_t \sim \pi}[Q_{\text{soft}}^\pi(s_t, a_t)] + \alpha\mathcal{H}(\pi(s_{t}))
    $$
  • we have
    $$
    \frac{1}{\alpha}V_{\text{soft}}^\pi(s_t) = \mathbb{E}_{a_t \sim \pi}[\frac{1}{\alpha} Q_{\text{soft}}^\pi(s_t, a_t)] + \mathcal{H}(\pi(s_{t}))
    $$
  • Combining the last two displays, the V-value \(V^{\pi^*}(s)\) of the optimal policy \(\pi^*\) is:
    $$ V_{\text{soft}}^{\pi^*}(s_t) = \alpha \log \int \exp \left(\frac{1}{\alpha}Q_{\text{soft}}^{\pi^*}(s_t, a)\right) da $$
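
Here is the numerical sketch referenced in the list above (made-up Q-values for a single state with discrete actions). It checks the identity \(\alpha\mathcal{H}(\pi^*(\cdot\vert s)) + \mathbb{E}_{a\sim\pi^*}[Q] = \alpha\log\int\exp\left(\frac{1}{\alpha}Q\right)da\) for the softmax policy \(\pi^* \propto \exp(Q/\alpha)\), and shows that as \(\alpha\rightarrow 0\) the policy concentrates on \(\arg\max_a Q\) and the soft value approaches \(\max_a Q\), the DQN limit discussed above.

```python
import numpy as np
from scipy.special import logsumexp, softmax

q = np.array([1.0, 3.0, 2.5, -0.5])        # made-up Q_soft(s, a) for one state

for alpha in (1.0, 0.5, 0.1, 0.01):
    pi_star = softmax(q / alpha)                      # pi*(a|s) proportional to exp(Q/alpha)
    entropy = -(pi_star * np.log(pi_star)).sum()
    lhs = alpha * entropy + (pi_star * q).sum()       # alpha*H(pi*) + E_{pi*}[Q]
    v_soft = alpha * logsumexp(q / alpha)             # alpha * log sum_a exp(Q/alpha)
    print(f"alpha={alpha}: identity holds: {np.isclose(lhs, v_soft)}, "
          f"V_soft={v_soft:.3f}, pi*={np.round(pi_star, 3)}")

print("max_a Q =", q.max())   # as alpha -> 0, V_soft -> max_a Q and pi* -> one-hot argmax
```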

Summary of Bellman Equations

Bellman Equation

$$
\begin{align}
Q^\pi(s_t,a_t) &= r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t, a_t)}[V^\pi(s_{t+1})] \\
V^\pi(s_t) &= \mathbb{E}_{a_t \sim \pi}[Q^\pi(s_t, a_t)]
\end{align}
$$

Bellman Optimality Equation

$$
\begin{align}
Q^{\pi^*}(s_t,a_t) &= r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t, a_t)}[V^{\pi^*}(s_{t+1})] \\
V^{\pi^*}(s_t) &= \max_{a_t}[Q^{\pi^*}(s_t, a_t)]
\end{align}
$$

Soft Bellman Equation

$$
\begin{aligned}
Q_{\text{soft}}^\pi(s_t, a_t) &= r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t, a_t)} [V_{\text{soft}}^\pi(s_{t+1})] \\
V_{\text{soft}}^\pi(s_t) &= \mathbb{E}_{a_t \sim \pi} [Q_{\text{soft}}^\pi(s_t, a_t) - \alpha \log \pi(a_t \vert s_t)] \\
V_{\text{soft}}^\pi(s_t) &= \mathbb{E}_{a_t \sim \pi}[Q_{\text{soft}}^\pi(s_t, a_t)] + \alpha\mathcal{H}(\pi(s_{t}))
\end{aligned}
$$

Soft Bellman Optimality Equation

$$
\begin{aligned}
Q_{\text{soft}}^{\pi^*}(s_t, a_t) &= r(s_t, a_t) + \gamma \mathbb{E}_{s_{t+1} \sim p(s_{t+1}|s_t, a_t)} [V_{\text{soft}}^{\pi^*}(s_{t+1})] \\
V_{\text{soft}}^{\pi^*}(s_t) &= \alpha \log \int \exp \left(\frac{1}{\alpha}Q_{\text{soft}}^{\pi^*}(s_t, a)\right) da
\end{aligned}
$$
  • To be updated
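
To make the summary concrete, here is a minimal tabular soft value iteration sketch that repeatedly applies the soft Bellman optimality backup, on the same made-up 2-state MDP used earlier; replacing the `logsumexp` line with `q.max(axis=1)` recovers standard value iteration, consistent with the \(\alpha \rightarrow 0\) discussion above.

```python
import numpy as np
from scipy.special import logsumexp, softmax

P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0], [0.5, 2.0]])    # r[s, a]
gamma, alpha = 0.9, 0.5

q = np.zeros((2, 2))
for _ in range(500):
    v = alpha * logsumexp(q / alpha, axis=1)   # V*(s) = alpha * log sum_a exp(Q*(s,a)/alpha)
    q = r + gamma * P @ v                      # Q*(s,a) = r + gamma * E_{s'}[V*(s')]

pi_star = softmax(q / alpha, axis=1)           # optimal max-entropy policy: pi* proportional to exp(Q*/alpha)
print("Q*:\n", q, "\nV*:", v, "\npi*:\n", pi_star)
```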