ML: AUC and GAUC

This article introduces AUC and GAUC.


Reference links


Implementation in code

  • Three ways to compute AUC
    # encoding=utf8
    from sklearn.metrics import roc_auc_score

    ## Implementation 1: O(N log N)
    def calculate_auc1(labels, predictions):
        # Sort the labels by predicted score, from highest to lowest
        sorted_labels = [l for _, l in sorted(zip(predictions, labels), reverse=True)]
        print(sorted_labels)
        # Count the positive and negative samples
        positive_count = sum(labels)
        negative_count = len(labels) - positive_count

        neg_found_count = 0
        pos_gt_neg_count = 0
        # For each positive, add the number of negatives ranked below it
        # (the negatives not yet seen). Note: a positive/negative pair with
        # equal scores is counted as a full win here, not 0.5.
        for label in sorted_labels:
            if label == 1:
                pos_gt_neg_count += negative_count - neg_found_count
            else:
                neg_found_count += 1

        # Compute AUC
        auc = 1.0 * pos_gt_neg_count / (positive_count * negative_count)

        return auc

    ## Implementation 2: O(N^2)
    def calculate_auc2(labels, predictions):
        pos_indexes = [i for i in range(len(labels)) if labels[i] == 1]
        neg_indexes = [i for i in range(len(labels)) if labels[i] == 0]
        p = len(pos_indexes)
        n = len(neg_indexes)

        # Compare every (positive, negative) pair; a tie counts as 0.5
        pos_gt_neg_count = 0
        for i in pos_indexes:
            for j in neg_indexes:
                if predictions[i] > predictions[j]:
                    pos_gt_neg_count += 1
                elif predictions[i] == predictions[j]:
                    pos_gt_neg_count += 0.5
        return pos_gt_neg_count / (p * n)

    ## Implementation 3: O(N log N)
    def calculate_auc3(labels, predictions):
        # Sort (prediction, label) pairs by predicted score, from lowest to
        # highest. Note: the order here is ascending.
        sorted_predictions = [[p, l] for p, l in sorted(zip(predictions, labels))]
        print(sorted_predictions)
        # Count the positive and negative samples
        positive_count = sum(labels)
        negative_count = len(labels) - positive_count
        # Sum the ranks of the positive samples. Note: ranks start at 1.
        # As in implementation 1, tied scores are not given average ranks.
        positive_count_indexes_sum = sum(
            i + 1 for i in range(len(sorted_predictions)) if sorted_predictions[i][1] == 1
        )
        return (positive_count_indexes_sum - 0.5 * positive_count * (positive_count + 1)) / (positive_count * negative_count)

    # Ground-truth labels
    labels = [1, 1, 0, 0, 1, 1]
    # Predicted scores
    predictions = [0.2, 0.8, 0.3, 0.4, 0.5, 0.6]

    # Compute AUC with implementation 1
    auc1 = calculate_auc1(labels, predictions)
    print("AUC1:", auc1)

    # Compute AUC with implementation 2
    auc2 = calculate_auc2(labels, predictions)
    print("AUC2:", auc2)

    # Compute AUC with implementation 3
    auc3 = calculate_auc3(labels, predictions)
    print("AUC3:", auc3)

    # Compute AUC with the scikit-learn library function
    auc = roc_auc_score(labels, predictions)
    print("AUC:", auc)
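
Of the three implementations above, only the O(N^2) pairwise version gives 0.5 credit to a positive/negative pair with equal scores. The following is a minimal sketch of one way to get the same tie handling in O(N log N): tied scores share their average rank before the rank-sum formula is applied. The function name `auc_with_tied_ranks` is hypothetical and not part of the original code.

```python
def auc_with_tied_ranks(labels, predictions):
    """Rank-sum AUC in which tied scores share their average rank."""
    # Indices of the samples, ordered by ascending predicted score
    order = sorted(range(len(predictions)), key=lambda i: predictions[i])
    ranks = [0.0] * len(predictions)

    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next score equals the current one
        while (end + 1 < len(order)
               and predictions[order[end + 1]] == predictions[order[start]]):
            end += 1
        # The group occupies 1-based ranks start+1 .. end+1; use their mean
        avg_rank = (start + 1 + end + 1) / 2.0
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1

    n_pos = sum(1 for l in labels if l == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined unless both classes are present")

    pos_rank_sum = sum(r for r, l in zip(ranks, labels) if l == 1)
    return (pos_rank_sum - 0.5 * n_pos * (n_pos + 1)) / (n_pos * n_neg)


# Same data as above (no ties): 6 of the 8 positive/negative pairs are ordered correctly
print(auc_with_tied_ranks([1, 1, 0, 0, 1, 1], [0.2, 0.8, 0.3, 0.4, 0.5, 0.6]))  # 0.75
# One tied positive/negative pair at 0.5 contributes half a win: 3.5 / 4
print(auc_with_tied_ranks([1, 0, 1, 0], [0.5, 0.5, 0.7, 0.1]))  # 0.875
```

On data without ties this should agree with the three implementations above; with ties it should agree with the pairwise version.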

SQL implementation

  • For details, see: 深入理解AUC

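  • Derivation idea:

    • For each positive sample, compute the probability that it outscores a negative sample (number of negatives with a lower predicted score / total number of negatives)
    • Average these probabilities over all positive samples
  • Full derivation: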
    $$
    \begin{align}
    AUC &= \frac{1}{N_+} \sum_{j=1}^{N_+}\frac{(r_j - j)}{N_-} \\
    &= \frac{\sum_{j=1}^{N_+}r_j - \sum_{j=1}^{N_+} j}{N_+N_-} \\
    &= \frac{\sum_{j=1}^{N_+} r_j - N_+(N_+ + 1)/2}{N_+ N_-}
    \end{align}
    $$

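    • Note: the formula assumes the samples have been ranked by predicted score in ascending order (rank 1 is the lowest score), so sort the data before applying it
    • Notation: \(N_+\) and \(N_-\) are the numbers of positive and negative samples. Let \(r_j\) be the rank of the \(j\)-th positive sample. There are \((r_j-1)\) samples ranked before it, of which \((j - 1)\) are positives and \((r_j-j)\) are negatives, so the probability that this positive's predicted score exceeds that of a negative is \(\frac{(r_j - j)}{N_-}\)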
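
    As a rough illustration of the derivation, the sketch below computes the per-positive probabilities \((r_j - j)/N_-\) under an ascending ranking, averages them, and compares the result with the closed-form rank-sum expression. The function name `auc_from_ranks` is hypothetical, and the sketch assumes there are no tied scores.

```python
def auc_from_ranks(labels, predictions):
    """Return (per-positive probabilities, their mean, closed-form AUC)."""
    # Rank all samples by ascending score; ranks start at 1
    ranked = sorted(zip(predictions, labels))
    n_pos = sum(1 for _, label in ranked if label == 1)
    n_neg = len(ranked) - n_pos

    per_positive = []
    rank_sum = 0
    j = 0  # index of the current positive sample (1-based)
    for r, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            j += 1
            rank_sum += r
            # r - j negatives are ranked below the j-th positive
            per_positive.append((r - j) / n_neg)

    mean_form = sum(per_positive) / n_pos
    closed_form = (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return per_positive, mean_form, closed_form


probs, mean_form, closed_form = auc_from_ranks(
    [1, 1, 0, 0, 1, 1], [0.2, 0.8, 0.3, 0.4, 0.5, 0.6]
)
print(probs)        # [0.0, 1.0, 1.0, 1.0]
print(mean_form)    # 0.75
print(closed_form)  # 0.75
```

    Here the positives sit at ranks 1, 4, 5 and 6, so the rank sum is 16 and the closed form gives (16 - 10) / 8 = 0.75, matching the average of the four per-positive probabilities.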
  • SQL implementation

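    select
        (ry - 0.5*n1*(n1+1))/n0/n1 as auc
    from(
        select
            sum(if(y=0, 1, 0)) as n0,  -- number of negative samples
            sum(if(y=1, 1, 0)) as n1,  -- number of positive samples
            sum(if(y=1, r, 0)) as ry   -- rank sum of the positive samples
        from(
            -- rank 1 is the lowest score
            select y, row_number() over(order by score asc) as r
            from(
                select y, score
                from some.table
            )A
        )B
    )C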
  • SQL implementation (per scene, with PCOC)

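    select
        scene,
        (ry - 0.5*n1*(n1+1))/n0/n1 as auc,
        n1/(n1+n0) as ctr,
        pctr,
        pctr/(n1/(n1+n0)) as pcoc  -- predicted CTR over observed CTR
    from(
        select
            scene,
            sum(if(y=0, 1, 0)) as n0,  -- number of negative samples
            sum(if(y=1, 1, 0)) as n1,  -- number of positive samples
            sum(if(y=1, r, 0)) as ry,  -- rank sum of the positive samples
            avg(score) as pctr         -- mean predicted score
        from(
            -- rank within each scene; rank 1 is the lowest score
            select scene, score, y, row_number() over(partition by scene order by score asc) as r
            from(
                select scene, y, score
                from some.table
            )A
        )B
        group by scene
    )C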
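
For checking the per-scene query on a small sample, the same aggregation can be sketched in plain Python. This is only an illustrative sketch: the function name `scene_metrics`, the `(scene, y, score)` row layout and the scene names `feed` and `search` are assumptions, not part of the original query. As with `row_number()`, tied scores are not given average ranks.

```python
from collections import defaultdict


def scene_metrics(rows):
    """rows: iterable of (scene, y, score) tuples. Returns {scene: metrics}."""
    by_scene = defaultdict(list)
    for scene, y, score in rows:
        by_scene[scene].append((score, y))

    result = {}
    for scene, samples in by_scene.items():
        samples.sort()  # ascending by score, as in "order by score asc"
        n1 = sum(1 for _, y in samples if y == 1)  # positives
        n0 = len(samples) - n1                     # negatives
        # Rank sum of the positives; ranks start at 1
        ry = sum(r for r, (_, y) in enumerate(samples, start=1) if y == 1)
        ctr = n1 / (n1 + n0)
        pctr = sum(score for score, _ in samples) / len(samples)
        result[scene] = {
            # AUC is undefined when a scene has only one class
            "auc": (ry - 0.5 * n1 * (n1 + 1)) / (n0 * n1) if n0 and n1 else None,
            "ctr": ctr,
            "pctr": pctr,
            "pcoc": pctr / ctr if ctr else None,
        }
    return result


# Hypothetical sample rows: (scene, y, score)
rows = [
    ("feed", 1, 0.8), ("feed", 0, 0.3), ("feed", 1, 0.2), ("feed", 0, 0.1),
    ("search", 1, 0.9), ("search", 0, 0.4),
]
for scene, metrics in scene_metrics(rows).items():
    print(scene, metrics)
```

A PCOC close to 1 means the average predicted score matches the observed click rate in that scene; values above 1 indicate over-prediction and values below 1 under-prediction.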