Core Contributions
- An end-to-end auction mechanism
- Guarantees Incentive Compatibility (IC) under a single ad slot and a Symmetric Nash Equilibrium (SNE) under multiple ad slots
- Validated both offline and online
Problem Formulation
- The auction design problem is defined as a multi-objective optimization problem:
$$ \mathop{\text{maximize}}_{\mathbf{\mathcal{M}}} \mathbb{E}_{\mathbf{b} \sim \mathcal{D}}\Big[\sum_{j=1}^L w_j\times f_j(\mathbf{b};\mathbf{\mathcal{M}})\Big]$$
- \(\mathbf{b}\): the advertisers' bid vector
- \(L\): the number of optimization objectives
- \(\mathcal{D}\): the distribution of advertiser bids; \(\mathbf{b}\) is sampled from \(\mathcal{D}\)
- \(w_j\): the weight of objective \(j\), a tunable factor controlling the relative importance of the objectives
- Note: strictly speaking, the formulation is only complete with incentive compatibility and individual rationality constraints added
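As a concrete (toy) illustration of the objective above, the expectation over \(\mathcal{D}\) can be estimated by Monte Carlo sampling. Everything here — `sample_bids`, the `revenue` and `clicks` metrics, and the weights — is a hypothetical stand-in, not the paper's setup:

```python
import random

def expected_weighted_objective(sample_bids, mechanism, metrics, weights, n_samples=10_000):
    """Monte Carlo estimate of E_{b~D}[ sum_j w_j * f_j(b; M) ]."""
    total = 0.0
    for _ in range(n_samples):
        b = sample_bids()  # draw one bid vector from the distribution D
        total += sum(w * f(b, mechanism) for w, f in zip(weights, metrics))
    return total / n_samples

# Toy setup: two bidders with uniform bids, two objectives
# (a revenue proxy and a constant click proxy).
random.seed(0)
sample_bids = lambda: [random.uniform(0, 1) for _ in range(2)]
revenue = lambda b, m: max(b)   # illustrative: winner pays own bid
clicks  = lambda b, m: 0.1      # illustrative: constant CTR proxy
est = expected_weighted_objective(sample_bids, None, [revenue, clicks], [1.0, 0.5])
```

With these stand-ins the true value is \(\mathbb{E}[\max(b_1,b_2)] + 0.5 \times 0.1 = 2/3 + 0.05\), so the estimate lands near 0.72.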
Key Theoretical Results
Theorem 1: Incentive Compatibility (IC)
- A single-slot auction mechanism \(\mathcal{M}\langle\mathcal{R},\mathcal{P}\rangle\) is Incentive-Compatible if and only if it satisfies two rules:
- The allocation scheme \(\mathcal{R}\) is monotone (a higher bid wins with higher probability; this rule is also called Monotone Allocation);
- The pricing rule \(\mathcal{P}\) charges each winner the minimum bid needed to keep their winning status (this price is called the Critical Bid, and the rule Critical Bid based Pricing):
$$
\begin{align}
\mathcal{R}_i(z, \mathbf{b}_{-i}) &\ge \mathcal{R}_i(b_i, \mathbf{b}_{-i}) \ \text{if} \ z > b_i \quad \text{(Monotone Allocation)} \\
\mathcal{P}_i &= \inf_{\{z \,\vert\, \mathcal{R}_i(z, \mathbf{b}_{-i}) = \mathcal{R}_i(b_i, \mathbf{b}_{-i})\}} z \quad \text{(Critical Bid based Pricing)}
\end{align}
$$
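To make the two rules tangible, here is a minimal single-slot sketch (not from the paper) where allocation is "highest bid wins". Bisection on the monotone allocation function recovers the critical bid, which is exactly the highest competing bid — i.e. the second price:

```python
def allocate(bid, others):
    """Single-slot monotone allocation: the highest bid wins (1 = win, 0 = lose)."""
    return 1 if bid > max(others) else 0

def critical_bid(others, lo=0.0, hi=100.0, iters=60):
    """Smallest z keeping the winner's allocation unchanged (the inf in Theorem 1),
    found by bisection, which is valid because the allocation is monotone."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if allocate(mid, others):
            hi = mid
        else:
            lo = mid
    return hi

others = [3.0, 5.0, 2.0]
p = critical_bid(others)   # converges to 5.0, the highest competing bid
```

Charging this critical bid is precisely what makes the single-slot Vickrey (second-price) auction truthful.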
- Note: Theorem 1 only applies to the single-slot auction setting
Theorem 2: Symmetric Nash Equilibrium (SNE)
- An auction mechanism \(\mathcal{M}\langle\mathcal{R},\mathcal{P}\rangle\) satisfies a Symmetric Nash Equilibrium (SNE) if and only if:
- In equilibrium, every bidder prefers the slot currently allocated to them over any other slot
$$ \beta_i(v_i - p_i) \ge \beta_j(v_j - p_j) $$
- where \(\beta_i\) is the inherent predicted CTR of slot \(i\); intuitively, \(\beta_i(v_i - p_i)\) is the slot's click probability times the gap between the click value and the click cost
- My understanding: this targets the CPC billing setting, where advertisers value traffic by clicks and also pay per click, so the main value difference across slots lies in the CTR
- Note: Theorem 2 extends to the multi-slot auction setting
Deep GSP Auction
Allocation and Pricing Rule Design
- Allocation rule \(\mathcal{R}\): rank by Rank Score (a non-decreasing function of the bid), \(r_i = R_\theta(b_i, \mathbf{x}_i)\); the Top-K advertisers in the ranking win the K ad slots
- This follows the classical GSP mechanism, using a rank score that is non-decreasing in the bid
- Pricing rule \(\mathcal{P}\): \(p_i = R_\theta^{-1}(r_{i+1},\mathbf{x}_i)\)
- The payment of advertiser \(i\) is obtained by inverting the rank score of the next-ranked advertiser
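A minimal sketch of this allocation rule, with the inverse operation left abstract; the ad list, scores, and `K` are illustrative:

```python
def run_auction(candidates, K):
    """candidates: list of (ad_id, rank_score). The Top-K win; each winner's
    payment is derived from the next-ranked score r_{i+1} (the inverse operation
    that maps r_{i+1} back to a bid is abstracted away here)."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    winners = []
    for i in range(min(K, len(ranked) - 1)):
        ad_id, r_i = ranked[i]
        r_next = ranked[i + 1][1]  # r_{i+1}: score of the next-ranked advertiser
        winners.append((ad_id, r_i, r_next))
    return winners

winners = run_auction([("a", 9.0), ("b", 7.5), ("c", 4.0), ("d", 2.0)], K=2)
```

Here `winners` pairs each of the top-2 ads with its own score and the next-ranked score used for pricing.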
Point-wise Monotonicity Loss
- Encourages monotonicity of the allocation rule by adding a penalty on non-monotonic behavior to the loss function
- The monotonicity loss term is defined as:
$$ \mathcal{L}_{\text{mono}} = \sum_{i=1}^N \max(0, -\nabla_b R_\theta(b_i, \mathbf{x}_i)) $$
- Intuition: non-monotonicity means a negative gradient, so penalizing negative gradients pushes the Rank Score \(R_\theta(b_i, \mathbf{x}_i)\) toward \(\nabla_b R_\theta(b_i, \mathbf{x}_i) \ge 0\), i.e. non-decreasing in the bid \(b_i\). This monotonicity is only encouraged by the loss, not strictly guaranteed
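A runnable sketch of this loss, approximating the bid-gradient with central finite differences (in a real training loop it would come from the autodiff framework); the toy `score` function is hypothetical:

```python
def mono_loss(rank_score, bids, features, eps=1e-4):
    """Point-wise monotonicity penalty: hinge on a negative bid-gradient.
    The gradient is approximated here by central finite differences."""
    loss = 0.0
    for b, x in zip(bids, features):
        grad = (rank_score(b + eps, x) - rank_score(b - eps, x)) / (2 * eps)
        loss += max(0.0, -grad)  # only negative gradients are penalized
    return loss

# Hypothetical rank score: increasing in b when x >= 0, decreasing when x < -1.
score = lambda b, x: b * (1.0 + x)
```

On monotone samples the penalty is zero; on a decreasing sample (e.g. `x = -2.0`) it is positive, which is exactly the "guide, don't guarantee" behavior described above.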
Approximate Inverse Operation
- Approximates the inverse operation in the pricing rule
- Step 1: decompose the Rank Score \(R_\theta(b_i, \mathbf{x}_i)\) as:
$$ r_i = R_\theta(b_i, \mathbf{x}_i) = b_i \times \pi_\theta(b_i, \mathbf{x}_i) $$
- Here \(\pi_\theta(b_i, \mathbf{x}_i)\) is a non-linear function of \(b_i\)
- Step 2: the monotonicity loss correspondingly becomes:
$$ \mathcal{L}_{\text{mono}} = \sum_{i=1}^N \max(0, -(\pi_\theta(b_i, \mathbf{x}_i) + b_i\nabla_b \pi_\theta(b_i, \mathbf{x}_i))) $$
- This is the product rule applied to \(b_i \times \pi_\theta(b_i, \mathbf{x}_i)\)
- Step 3: the approximate inverse operation:
$$ p_i = \frac{r_{i+1}}{\pi_\theta(b_i, \mathbf{x}_i)} $$
- This treats \(\pi_\theta(b_i, \mathbf{x}_i)\) as a constant with respect to \(b_i\), which makes the inverse trivial; solving the exact inverse would require inverting the neural network, which has no closed-form inverse
- The paper observes that \(\pi_\theta(b_i, \mathbf{x}_i)\) is not sensitive to the bid:
We have observed from an industrial data set that the non-linear function \(\pi_\theta(b_i, \mathbf{x}_i)\) is not so sensitive to the bid (please refer to the experiment results in Section 4.2.3 for more details). Thus, in payment calculation, we regard \(\pi_\theta(b_i, \mathbf{x}_i)\) as a constant w.r.t. \(b_i\)
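A small sketch of the decomposition and the approximate inverse. The `pi` function here is hypothetical and deliberately independent of the bid — the limiting case the paper's observation justifies — in which case the approximation is exact:

```python
def rank_score(b, x, pi):
    """Decomposed rank score: r = b * pi(b, x)."""
    return b * pi(b, x)

def approx_payment(r_next, b_i, x_i, pi):
    """Approximate inverse: treat pi as a constant w.r.t. b,
    so p_i = r_{i+1} / pi(b_i, x_i)."""
    return r_next / pi(b_i, x_i)

# Hypothetical pi that ignores the bid entirely (perfectly "insensitive").
pi = lambda b, x: 2.0 * x
r2 = rank_score(3.0, 1.0, pi)                     # next-ranked score: 6.0
p1 = approx_payment(r2, b_i=5.0, x_i=1.0, pi=pi)  # 6.0 / 2.0 = 3.0
```

Because `pi` truly ignores the bid, bidding exactly `p1` reproduces the next-ranked score `r2`, i.e. the approximate inverse coincides with the exact one.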
Discussion of Deep GSP
Theorem 3: Nash Equilibrium of Deep GSP
- The Deep GSP auction admits a non-empty set of Symmetric Nash Equilibrium states
Theorem 3. There exists a non-empty set of Symmetric Nash Equilibrium (SNE) states in the Deep GSP auction.
- Understanding: under the Deep GSP auction, there exists at least one state that satisfies a Symmetric Nash Equilibrium
Deep GSP Implementation
- Motivation: some performance metrics are hard to define with rigorous mathematical formulas, so they are learned from actual system feedback
As introduced in Section 2.2, some performance metrics are not feasible to have rigorous mathematical analyses, and we can only evaluate these metrics via actual feedback from the system after deploying the auction mechanism.
Reinforcement Learning Formulation
- State:
- Ad information
- Advertiser information
- User features, etc.
- Action:
- The output of the deep Rank Score model is the (continuous) action; the policy is the Rank Score model itself, i.e. the policy is \(R_\theta\)
- Reward:
$$ re_i = F(\mathbf{b};\mathcal{M}) - \eta \times \max(0, (1-\epsilon)\mu_i(\mathcal{M_0}) - \mu_i(\mathcal{M})) $$
- Understanding:
- \(F(\mathbf{b};\mathcal{M})\) is the original objective;
- the term \(- \eta \times \max(0, (1-\epsilon)\mu_i(\mathcal{M_0}) - \mu_i(\mathcal{M}))\) pushes toward satisfying \(\mu_i(\mathcal{M}) \ge (1-\epsilon)\mu_i(\mathcal{M_0})\)
- State transition:
- Training is a single-step decision problem, so state transitions need not be considered
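The reward above can be sketched directly; the `eta` and `eps` values are illustrative:

```python
def reward(F_value, mu_new, mu_benchmark, eta=1.0, eps=0.1):
    """Per-ad reward: objective value minus a hinge penalty that activates
    only when the advertiser's utility mu_new drops below (1 - eps) of the
    benchmark mechanism's utility mu_benchmark."""
    penalty = max(0.0, (1 - eps) * mu_benchmark - mu_new)
    return F_value - eta * penalty

within_tolerance = reward(10.0, mu_new=9.5, mu_benchmark=10.0)  # no penalty
below_tolerance  = reward(10.0, mu_new=8.0, mu_benchmark=10.0)  # penalized
```

With `eps = 0.1` the lower bound is 9.0, so a utility of 9.5 incurs no penalty while 8.0 is penalized by 1.0.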
More RL Implementation Details
The RL objective is:
$$ R_\theta^* = \mathop{\arg\max}_{R_\theta} \mathbb{E}_{\mathbf{b}\sim \mathcal{D}}[re_i|R_\theta] $$
DDPG is used for training; the Critic network and Policy network losses are:
$$
\begin{align}
\text{Critic Net:} \quad \mathcal{L}(Q_\theta) &= \frac{1}{N}\sum_i(y_i - Q_\theta(s_i, a_i))^2 \\
\text{Policy Net:} \quad \mathcal{L}(R_\theta) &= \frac{1}{N}\sum_i(-Q_\theta(s_i, R_\theta(s_i)) + \gamma \times \mathcal{L}_{\text{mono}})
\end{align}
$$
- where \(y_i = re_i\)
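A toy scalar illustration of the two losses — not the paper's network implementation; `Q`, `R`, and the samples are made-up stand-ins:

```python
def critic_loss(Q, samples):
    """MSE between the target y_i (= re_i, single-step: no bootstrapping)
    and the critic's estimate Q(s, a)."""
    return sum((y - Q(s, a)) ** 2 for s, a, y in samples) / len(samples)

def policy_loss(Q, R, states, mono, gamma=0.1):
    """Maximize Q(s, R(s)) (minimize its negative) while adding the
    monotonicity penalty weighted by gamma."""
    return sum(-Q(s, R(s)) + gamma * mono for s in states) / len(states)

# Toy linear critic and policy over scalar states/actions.
Q = lambda s, a: s * a
R = lambda s: 2.0 * s
samples = [(1.0, 2.0, 2.5), (2.0, 1.0, 1.0)]
c_loss = critic_loss(Q, samples)                     # ((0.5)^2 + (-1)^2) / 2
p_loss = policy_loss(Q, R, [1.0, 2.0], mono=0.0)     # (-2 - 8) / 2
```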
At implementation time, there are two differences from the original DDPG:
- Deep GSP is a single-step decision problem, so the critic target is simply \(y_i = re_i\) with no bootstrapping
- The monotonicity loss \(\mathcal{L}_{\text{mono}}\) is added to the Policy network loss
The Deep GSP reinforcement learning framework (Figure 2):
Experimental Metrics
General Business Metrics
- Revenue Per Mille (RPM), Add-to-Cart Rate (ACR), CTR, CVR, GMV Per Mille (GPM), etc.
Monotonicity Metric
- Measures the monotonic relationship between bids and the corresponding Deep Rank model outputs, defined as:
$$ \mathcal{T}_m = \frac{1}{n}\sum \rho(\text{rank}_{\text{bids}}, \text{rank}_{\text{outputs}}) $$
- where the Spearman's rank correlation coefficient \(\rho\) is defined as:
$$ \rho = 1 - \frac{6 \sum d_i^2}{n(n^2-1)} $$
- \(d_i\) is the rank difference of each sample pair. When the two rankings agree, every difference is 0 and \(\rho\) attains its maximum 1; when the rankings are exactly reversed, \(\rho\) can be shown to attain its minimum -1
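A minimal implementation of the rank-difference formula, assuming no ties among the values:

```python
def spearman(xs, ys):
    """Spearman's rank correlation via rho = 1 - 6 * sum(d_i^2) / (n(n^2 - 1)),
    valid when there are no tied values."""
    n = len(xs)
    rank = lambda vs: {v: i for i, v in enumerate(sorted(vs))}
    rx, ry = rank(xs), rank(ys)
    d2 = sum((rx[x] - ry[y]) ** 2 for x, y in zip(xs, ys))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

bids    = [1.0, 2.0, 3.0, 4.0]
outputs = [0.1, 0.4, 0.5, 0.9]   # same ordering as the bids
```

Agreeing orderings give the maximum 1, and exactly reversed orderings give the minimum -1, matching the two extreme cases described above.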
Payment Error Rate
- Payment Error Rate (PER): the approximate inverse may introduce payment errors. The authors first use binary search to find the exact payment \(p_i^*\), then use \(\frac{p_i}{p_i^*}\) to measure the PER
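A sketch of how the exact payment \(p_i^*\) could be recovered by binary search on a monotone rank score; the rank score `R` here, with a mild bid-dependence inside \(\pi\), is hypothetical:

```python
def exact_payment(r_next, x_i, rank_score, lo=0.0, hi=1e6, iters=80):
    """Binary search for p* such that rank_score(p*, x_i) = r_next.
    Valid because the rank score is monotone in the bid."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if rank_score(mid, x_i) >= r_next:
            hi = mid
        else:
            lo = mid
    return hi

# Hypothetical rank score R(b, x) = b * pi(b, x) with pi(b, x) = x + 0.01*b.
R = lambda b, x: b * (x + 0.01 * b)
p_star = exact_payment(6.0, 1.0, R)          # exact payment p*
approx = 6.0 / (1.0 + 0.01 * 5.0)            # approximate payment at b_i = 5.0
per = approx / p_star                        # payment error ratio p_i / p_i^*
```

Because `pi` depends only weakly on the bid, the ratio stays close to 1 (here about 1.006).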
Incentive Compatibility
- The authors use the Individual Stage-IC (i-SIC) metric to quantify incentive compatibility (the metric comes from A Data-Driven Metric of Incentive Compatibility). Assume the advertiser's utility is \(\hat{u}(b) = b \times x(b) - p(b)\) (where \(x(\cdot)\) is the allocation probability) and the advertiser's value follows distribution \(F\). Then i-SIC is defined as:
$$ \text{i-SIC} = \lim_{\alpha \to 0} \frac{\mathbb{E}_{v\sim F}[\hat{u}((1+\alpha)v)] - \mathbb{E}_{v\sim F}[\hat{u}((1-\alpha)v)] }{2\alpha \times \mathbb{E}_{v \sim F}[v \times x(v)]} $$
- i-SIC quantifies IC by applying small perturbations to bids and evaluating the resulting bidder utilities; it can be computed directly via black-box simulation on auction logs
- i-SIC ranges from 0 to 1 (0 for first-price, 1 for second-price); the larger the value, the better the IC property
We now utilize the i-SIC metric [7] to evaluate incentive compatibility (IC) of Deep GSP. The i-SIC metric is between 0 and 1, and the larger value means the better IC property
- One can show that i-SIC is 0 under first-price billing and 1 under second-price billing
- Under first-price billing, the payment equals the bid, so as \(\alpha \to 0\):
- the allocation probabilities satisfy \(x((1+\alpha)v) = x((1-\alpha)v) = x(v)\)
- the payments are \(p((1+\alpha)v) = (1+\alpha)v \times x(v) \) and \( p((1-\alpha)v) = (1-\alpha)v \times x(v)\), hence:
$$
\begin{align}
\mathbb{E}_{v\sim F}[&\hat{u}((1+\alpha)v)] - \mathbb{E}_{v\sim F}[\hat{u}((1-\alpha)v)] \\
&= \mathbb{E}_{v\sim F}[(1+\alpha)v \times x(v) - (1+\alpha)v \times x(v) - ((1-\alpha)v \times x(v)-(1-\alpha)v\times x(v))] \\
&= \mathbb{E}_{v\sim F}[0] \\
&= 0
\end{align}
$$
- Under second-price billing, if the bidder wins the auction, then as \(\alpha \to 0\):
- the allocation probabilities satisfy \(x((1+\alpha)v) = x((1-\alpha)v) = x(v)\)
- the payments satisfy \(p((1+\alpha)v) = p((1-\alpha)v) = p(v)\), hence:
$$
\begin{align}
\mathbb{E}_{v\sim F}[\hat{u}((1+\alpha)v)] - \mathbb{E}_{v\sim F}[\hat{u}((1-\alpha)v)] &= \mathbb{E}_{v\sim F}[(1+\alpha)v \times x(v) - (1-\alpha)v \times x(v)] \\
&= \mathbb{E}_{v\sim F}[2\alpha v \times x(v) ] \\
&= 2\alpha \mathbb{E}_{v\sim F}[ v \times x(v) ]
\end{align}
$$
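The two cases above can be checked by black-box simulation with a small finite \(\alpha\); the fixed competing bid and the value range are illustrative:

```python
import random

def i_sic(x, p, values, alpha=1e-3):
    """Finite-alpha estimate of i-SIC from allocation x(b) and payment p(b),
    averaging the utility u(b) = b*x(b) - p(b) over sampled values."""
    up = sum(((1 + alpha) * v) * x((1 + alpha) * v) - p((1 + alpha) * v) for v in values)
    dn = sum(((1 - alpha) * v) * x((1 - alpha) * v) - p((1 - alpha) * v) for v in values)
    norm = sum(v * x(v) for v in values)
    return (up - dn) / (2 * alpha * norm)

random.seed(0)
values = [random.uniform(0.5, 1.0) for _ in range(1000)]
other = 0.3                               # fixed competing bid: every sampled bid wins
x = lambda b: 1.0 if b > other else 0.0
first_price  = i_sic(x, lambda b: b * x(b), values)      # pay your own bid
second_price = i_sic(x, lambda b: other * x(b), values)  # pay the competing bid
```

The simulation reproduces the derivation: the first-price estimate is 0 and the second-price estimate is 1.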
Baselines
GSP
- Generalized Second Price auction (GSP): ranks by expected Cost Per Mille (eCPM); the payment is back-computed from the next-ranked advertiser's rank score
uGSP
- Utility-based Generalized Second Price auction (uGSP), first proposed in Optimising trade-offs among stakeholders in ad auctions (Microsoft Research, 2014), extends GSP with the allocation ranking formula:
$$ r_i(b_i) = \lambda_1 \times b_i \times pCTR_i + o_i $$
- where \(o_i\) denotes other utility metrics; for example, the authors use \(o_i = \lambda_2 \times pCTR_i + \lambda_3 \times pCVR_i\)
- The pricing rule is the same as GSP's: the payment is back-computed from the next-ranked advertiser's rank score
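A small sketch of the two rank scores; the \(\lambda\) values are illustrative:

```python
def gsp_rank_score(bid, pctr):
    """Classical GSP ranks by eCPM = bid * pCTR."""
    return bid * pctr

def ugsp_rank_score(bid, pctr, pcvr, lam1=1.0, lam2=0.5, lam3=0.5):
    """uGSP rank score: r_i = lam1 * b_i * pCTR_i + o_i,
    with o_i = lam2 * pCTR_i + lam3 * pCVR_i (the authors' choice of o_i)."""
    return lam1 * bid * pctr + lam2 * pctr + lam3 * pcvr
```

With \(\lambda_2 = \lambda_3 = 0\), uGSP degenerates to GSP's eCPM ranking, which makes the "extension of GSP" relationship explicit.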
Appendix: What is Smooth Transition?
The following draws on: 基于深度学习的广告拍卖机制论文阅读笔记(1)
Smooth Transition (ST): when the weights of the metrics in the multi-objective optimization change, the advertisers' optimization objective should not fluctuate noticeably;
Clarification: it is not the advertisers' "optimization objective" that should not fluctuate noticeably, but rather the advertisers' "performance metrics". Original paper text:
As the importance of different performance metrics can vary over time due to business needs, we further introduce the smooth transition constraint to ensure the advertiser performance metrics not fluctuate too much when the auction mechanism switches among candidate mechanisms to achieve different optimization objectives
Another desirable property we want to achieve is Smooth Transition (ST). As discussed in Section 1, the performance objectives of the ad platform may vary due to the change of the business logic. If the new optimization objective is quite different from the previous one, the resulting auction mechanism would significantly affect the advertisers’ utilities [2]. This introduces the chaos of the auction environment. To stabilize advertisers’ utility change under different mechanisms, we choose a benchmark mechanism \(\mathcal{M}_0\), and require advertisers’ utility under the new mechanism should not be less than \(1-\epsilon\) of that under \(\mathcal{M}_0\). The benchmark mechanism \(\mathcal{M}_0\) could be the currently deployed mechanism.
$$ \mu_i(\mathcal{M}) \ge (1-\epsilon)\mu_i(\mathcal{M_0}) $$
where we set a lower bound for advertiser \(i\)’s utility \(\mu_i\) when selecting a new auction mechanism \(\mathcal{M}\). The lower bound \(\bar{\mu}(\mathcal{M}_0)\) could be set as the average utility over a certain period under the benchmark mechanism \(\mathcal{M}_0\). The parameter \(\epsilon\) is a tolerant utility loss ratio for advertisers (0 ≤ \(\epsilon\) ≤ 1). By choosing an appropriate \(\epsilon\), the advertiser’s utility would not fluctuate too much when the auction mechanism is switched towards optimizing another objective.
My understanding: when switching among candidate auction mechanisms to pursue different optimization objectives, advertiser performance should not fluctuate much (no large regression relative to the previous baseline), which allows a fairly smooth migration between mechanisms.