Pandas: One-Hot Encoding a Single DataFrame Column


Encoding with OneHotEncoder

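  • Basic approach:

    • Create a OneHotEncoder object
    • Extract the target column and reshape it into an N×1 array, then fit the OneHotEncoder on it and transform it into its encoded form
    • Build a new DataFrame from the encoded data
      • Give each encoded column a new column name
      • Set the row index to the index of the original DataFrame
    • Concatenate the two DataFrames column-wise
    • Drop the original column
  • Implementation: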

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder


    def one_hot_for_column(df, column):
        """
        Encode one column of df with one-hot encoding.
        :param df: a DataFrame
        :param column: the name of the column in df to encode
        :return: a new DataFrame and the fitted OneHotEncoder
        """
        ohe = OneHotEncoder()

        # Two-step equivalent of the fit_transform call below:
        # ohe.fit(df[column].values.reshape(-1, 1))
        # col_series = ohe.transform(df[column].values.reshape(-1, 1)).toarray()
        # reshape(-1, 1) produces the N x 1 array; toarray() densifies the sparse result
        col_series = ohe.fit_transform(df[column].values.reshape(-1, 1)).toarray()

        # Name the new columns <column>_1 ... <column>_k, in the order of ohe.categories_[0]
        columns = ["%s_%s" % (column, m) for m in range(1, col_series.shape[1] + 1)]
        sub_df = pd.DataFrame(col_series, columns=columns, dtype=int, index=df.index)
        # Append the encoded columns, then drop the original column
        new_df = pd.concat([df, sub_df], axis=1)
        new_df.drop(columns=column, inplace=True)
        return new_df, ohe
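
The numeric suffixes produced above follow the order of the encoder's fitted categories, so it can help to see which category each new column stands for. The following is a minimal, self-contained sketch of the same steps on made-up data; the DataFrame contents and the choice to name columns after the category values (rather than 1..k) are assumptions for illustration, not part of the original function.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data, used only for illustration
df = pd.DataFrame(
    {"color": ["red", "green", "red", "blue"], "price": [1, 2, 3, 4]},
    index=[10, 11, 12, 13],
)

ohe = OneHotEncoder()
# Same steps as the function: reshape to N x 1, fit and transform, densify
encoded = ohe.fit_transform(df["color"].values.reshape(-1, 1)).toarray()

# ohe.categories_[0] holds the fitted categories in encoded-column order
categories = list(ohe.categories_[0])
columns = ["color_%s" % c for c in categories]

# Keep the original index so the concatenation aligns row by row
sub_df = pd.DataFrame(encoded, columns=columns, index=df.index).astype(int)
new_df = pd.concat([df, sub_df], axis=1).drop(columns="color")

# The fitted encoder can be reused on values seen during fitting
green_row = ohe.transform(np.array(["green"]).reshape(-1, 1)).toarray()

print(new_df)
print(green_row)
```

Returning the fitted encoder, as the function does, is what makes this reuse possible: new rows can be encoded with the same column layout as the training data.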