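ML: AUC and GAUC

This post introduces AUC and GAUC.

References

图解AUC和GAUC (An Illustrated Guide to AUC and GAUC), Zhihu

Code implementation

Three ways to compute AUC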

# encoding=utf8
from sklearn.metrics import roc_auc_score

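## Implementation 1: O(N log N)
def calculate_auc1(labels, predictions):
    # Sort the labels by predicted score, highest score first.
    # Note: ties get no half credit here; a positive tied with a negative
    # is counted as a full win, unlike implementation 2.
    sorted_labels = [l for _, l in sorted(zip(predictions, labels), reverse=True)]
    # Count the positive and negative samples.
    positive_count = sum(labels)
    negative_count = len(labels) - positive_count

    neg_found_count = 0
    pos_gt_neg_count = 0
    # For each positive, add the number of negatives ranked below it,
    # i.e. all negatives not yet seen in the descending scan.
    for label in sorted_labels:
        if label == 1:
            pos_gt_neg_count += negative_count - neg_found_count
        else:
            neg_found_count += 1

    # AUC = (positive, negative) pairs ordered correctly / all such pairs.
    auc = 1.0 * pos_gt_neg_count / (positive_count * negative_count)

    return auc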

## Implementation 2: O(N^2)
def calculate_auc2(labels, predictions):
    # Indexes of the positive and negative samples.
    pos_indexes = [i for i in range(len(labels)) if labels[i] == 1]
    neg_indexes = [i for i in range(len(labels)) if labels[i] == 0]
    p = len(pos_indexes)
    n = len(neg_indexes)

    # Compare every (positive, negative) pair: 1 for a win, 0.5 for a tie.
    pos_gt_neg_count = 0
    for i in pos_indexes:
        for j in neg_indexes:
            if predictions[i] > predictions[j]:
                pos_gt_neg_count += 1
            elif predictions[i] == predictions[j]:
                pos_gt_neg_count += 0.5
    return pos_gt_neg_count/(p*n)

## Implementation 3: O(N log N)
def calculate_auc3(labels, predictions):
    # Sort (prediction, label) pairs by prediction in ASCENDING order.
    # Note: ties get no half credit here either.
    sorted_predictions = [[p,l] for p, l in sorted(zip(predictions, labels))]
    # Count the positive and negative samples.
    positive_count = sum(labels)
    negative_count = len(labels) - positive_count
    # Sum the ranks of the positive samples; ranks start at 1.
    positive_count_indexes_sum = sum([i+1 for i in range(len(sorted_predictions)) if sorted_predictions[i][1] == 1])
    return (positive_count_indexes_sum - 0.5*positive_count*(positive_count+1))/(positive_count*negative_count)

# Ground-truth labels
labels = [1, 1, 0, 0, 1, 1]
# Predicted scores
predictions = [0.2, 0.8, 0.3, 0.4, 0.5, 0.6]

# AUC from implementation 1
auc1 = calculate_auc1(labels, predictions)
print("AUC1:", auc1)

# AUC from implementation 2
auc2 = calculate_auc2(labels, predictions)
print("AUC2:", auc2)

# AUC from implementation 3
auc3 = calculate_auc3(labels, predictions)
print("AUC3:", auc3)

# AUC from scikit-learn, for comparison
auc = roc_auc_score(labels, predictions)
print("AUC:", auc)
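
The sort-based implementations above (1 and 3) count a positive that ties with a negative as a full win, while the pairwise implementation 2 gives the pair 0.5. One way to keep O(N log N) cost and still match the pairwise result is to assign tied scores their average rank. The following is a minimal sketch of that idea; the function names `auc_with_ties` and `auc_pairwise` and the tied sample data are illustrative and not part of the original post.

```python
from itertools import groupby


def auc_with_ties(labels, scores):
    """Rank-sum AUC, using the average rank for samples with tied scores."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")

    # Sort ascending by score; ranks are 1-based.
    ordered = sorted(zip(scores, labels), key=lambda pair: pair[0])
    pos_rank_sum = 0.0
    rank = 1
    for _, tied in groupby(ordered, key=lambda pair: pair[0]):
        tied = list(tied)
        # Each sample in a tied group gets the mean of the ranks the group spans.
        avg_rank = rank + (len(tied) - 1) / 2
        pos_rank_sum += avg_rank * sum(1 for _, y in tied if y == 1)
        rank += len(tied)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def auc_pairwise(labels, scores):
    """O(N^2) reference: a tied (positive, negative) pair counts 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data with a three-way tie at 0.5.
tied_labels = [1, 0, 1, 0, 1]
tied_scores = [0.5, 0.5, 0.9, 0.1, 0.5]
print(auc_with_ties(tied_labels, tied_scores))  # 5/6, about 0.833
print(auc_pairwise(tied_labels, tied_scores))   # 5/6, about 0.833
```

When no scores are tied, the average rank equals the plain rank, so this reduces to the formula used in implementation 3.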

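SQL implementation

For details, see: 深入理解AUC (Understanding AUC in Depth)
Idea behind the derivation:
- For each positive sample, compute the probability that it outscores a negative sample (number of negatives with a lower predicted score / total number of negatives)
- Average these probabilities over all positive samples
Full derivation: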
$$
\begin{align}
AUC &= \frac{1}{N_+} \sum_{j=1}^{N_+}\frac{r_j - j}{N_-} \\
&= \frac{\sum_{j=1}^{N_+} r_j - \sum_{j=1}^{N_+} j}{N_+ N_-} \\
&= \frac{\sum_{j=1}^{N_+} r_j - N_+(N_+ + 1)/2}{N_+ N_-}
\end{align}
$$
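- Note: the formula assumes the samples have been sorted by predicted score in ascending order (rank 1 is the lowest score), so sort before applying it
- Notation: $N_+$ and $N_-$ are the numbers of positive and negative samples. For the $j$-th positive sample (counting positives in ascending score order), let $r_j$ be its rank among all samples. Then $(r_j-1)$ samples precede it, of which $(j - 1)$ are positive and $(r_j-j)$ are negative, so the probability that this positive outscores a negative is $\frac{r_j - j}{N_-}$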
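
As a quick check of the formula, take the sample data from the Python section: labels `[1, 1, 0, 0, 1, 1]` with predictions `[0.2, 0.8, 0.3, 0.4, 0.5, 0.6]`. Ranking by predicted score in ascending order (rank 1 is the lowest score) puts the four positives at ranks 1, 4, 5 and 6, with $N_+ = 4$ and $N_- = 2$:

$$
AUC = \frac{(1 + 4 + 5 + 6) - 4 \cdot (4 + 1)/2}{4 \cdot 2} = \frac{16 - 10}{8} = 0.75
$$

This agrees with counting pairs directly: the positives scored 0.5, 0.6 and 0.8 each outscore both negatives and the positive scored 0.2 outscores none, giving 6 correctly ordered pairs out of 8.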

SQL implementation (overall AUC)

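select
    (ry - 0.5*n1*(n1+1))/n0/n1 as auc
from(
    select
        sum(if(y=0, 1, 0)) as n0,   -- number of negative samples
        sum(if(y=1, 1, 0)) as n1,   -- number of positive samples
        sum(if(y=1, r, 0)) as ry    -- sum of the ranks of the positive samples
    from(
        -- r: 1-based rank by score in ascending order (ties are ordered arbitrarily)
        select y, row_number() over(order by score asc) as r
        from(
            select y, score
            from some.table
        )A
    )B
)C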

SQL implementation (per scene, with PCOC)

select
    scene,
    (ry - 0.5*n1*(n1+1))/n0/n1 as auc,
    n1/(n1+n0) as ctr,              -- observed CTR
    pctr,                           -- mean predicted CTR
    pctr/(n1/(n1+n0)) as pcoc       -- predicted CTR over observed CTR
from(
    select
        scene,
        sum(if(y=0, 1, 0)) as n0,
        sum(if(y=1, 1, 0)) as n1,
        sum(if(y=1, r, 0)) as ry,
        avg(score) as pctr
    from(
        select scene, score, y, row_number() over(partition by scene order by score asc) as r
        from(
            select scene, y, score
            from some.table
        )A
    )B
    group by scene
)C
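
The per-scene query above can be mirrored in Python, and the per-group AUC values can then be combined into a GAUC. The post does not spell out a GAUC formula, so this sketch assumes a common definition: the sample-count-weighted average of per-group AUC, skipping groups that contain only one class. In practice the grouping key is often the user rather than the scene. All function names and the sample rows are hypothetical, and the pairwise AUC is used only to keep the example short.

```python
from collections import defaultdict


def auc_or_none(labels, scores):
    """Pairwise AUC (ties count 0.5); None if either class is missing."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_scene_metrics(rows):
    """rows: iterable of (scene, y, score), mirroring the SQL columns."""
    by_scene = defaultdict(list)
    for scene, y, score in rows:
        by_scene[scene].append((y, score))

    metrics = {}
    for scene, samples in by_scene.items():
        labels = [y for y, _ in samples]
        scores = [s for _, s in samples]
        ctr = sum(labels) / len(labels)    # observed CTR
        pctr = sum(scores) / len(scores)   # mean predicted CTR
        metrics[scene] = {
            "n": len(samples),
            "auc": auc_or_none(labels, scores),
            "ctr": ctr,
            "pctr": pctr,
            "pcoc": pctr / ctr if ctr > 0 else None,
        }
    return metrics


def gauc(metrics):
    """Assumed definition: sample-count-weighted mean of per-group AUC."""
    valid = [m for m in metrics.values() if m["auc"] is not None]
    total = sum(m["n"] for m in valid)
    return sum(m["n"] * m["auc"] for m in valid) / total if total else None


# Hypothetical rows; the "push" scene has no positives, so it is skipped in GAUC.
rows = [
    ("feed", 1, 0.9), ("feed", 0, 0.2), ("feed", 1, 0.4), ("feed", 0, 0.6),
    ("search", 1, 0.7), ("search", 0, 0.3),
    ("push", 0, 0.1), ("push", 0, 0.2),
]
scene_metrics = per_scene_metrics(rows)
for scene, m in scene_metrics.items():
    print(scene, m)
print("GAUC:", gauc(scene_metrics))
```

Returning `None` for single-class groups avoids the division by zero that the SQL would hit when `n0` or `n1` is 0 for a scene.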