Pandas: One-Hot Encoding a Single DataFrame Column

Encoding with OneHotEncoder

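Basic approach:
- Create a OneHotEncoder object
- Take the target column, reshape it into an N×1 array, fit the OneHotEncoder on it, and transform it into the encoded form
- Build a new DataFrame from the encoded data
  - Generate a new column name for each encoded column
  - Set the row index of the new DataFrame to the index of the original DataFrame
- Concatenate the two DataFrames column-wise
- Drop the original column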
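
The reshape and fit/transform steps are the least obvious part of the list above, so here is a minimal sketch of just those steps. The toy DataFrame and the column name "color" are hypothetical and used only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; "color" is an illustrative column name
toy = pd.DataFrame({"color": ["red", "green", "red", "blue"]},
                   index=[10, 11, 12, 13])

# A Series is 1-D, but OneHotEncoder expects 2-D input of shape
# (n_samples, n_features), hence the reshape into an N x 1 array
values = toy["color"].values.reshape(-1, 1)

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; toarray() makes it dense
encoded = encoder.fit_transform(values).toarray()

# One output column per distinct category, in sorted order
categories = list(encoder.categories_[0])
print(values.shape, encoded.shape, categories)
```

Each row of `encoded` contains a single 1 in the position of that row's category, and `encoder.categories_[0]` records which category each position stands for.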

Implementation:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def one_hot_for_column(df, column):
    """
    One-hot encode a single column of a DataFrame.
    :param df: the source DataFrame
    :param column: name of the column in df to encode
    :return: a new DataFrame in which the column is replaced by its
             one-hot columns, and the fitted OneHotEncoder
    """
    ohe = OneHotEncoder()

    # fit_transform is equivalent to the two-step form:
    #   ohe.fit(df[column].values.reshape(-1, 1))
    #   col_series = ohe.transform(df[column].values.reshape(-1, 1)).toarray()
    col_series = ohe.fit_transform(df[column].values.reshape(-1, 1)).toarray()

    # Name the new columns <column>_1 ... <column>_k; suffix i corresponds
    # to ohe.categories_[0][i - 1]
    columns = ["%s_%s" % (column, str(m)) for m in range(1, col_series.shape[1] + 1)]
    # Reuse the original index so that concat aligns the rows correctly
    sub_df = pd.DataFrame(col_series, columns=columns, dtype=int, index=df.index)
    new_df = pd.concat([df, sub_df], axis=1)
    new_df.drop(columns=column, inplace=True)
    return new_df, ohe
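
A short usage sketch may help show what the function returns. The sample data ("city", "sales") is hypothetical, and the function is repeated in compact form only so that the snippet runs on its own. Because the generated column names carry numeric suffixes, the returned encoder is used here to map each suffix back to its category label.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def one_hot_for_column(df, column):
    # Compact copy of the function above so this snippet is self-contained
    ohe = OneHotEncoder()
    col_series = ohe.fit_transform(df[column].values.reshape(-1, 1)).toarray()
    columns = ["%s_%s" % (column, m) for m in range(1, col_series.shape[1] + 1)]
    sub_df = pd.DataFrame(col_series, columns=columns, dtype=int, index=df.index)
    return pd.concat([df, sub_df], axis=1).drop(columns=column), ohe

# Hypothetical sample data with a non-default index
df = pd.DataFrame({"city": ["Beijing", "Shanghai", "Beijing"],
                   "sales": [3, 5, 2]},
                  index=["a", "b", "c"])

new_df, ohe = one_hot_for_column(df, "city")

# Suffix i corresponds to the i-th entry of ohe.categories_[0] (1-based)
suffix_to_label = {"city_%d" % (i + 1): label
                   for i, label in enumerate(ohe.categories_[0])}
print(new_df)
print(suffix_to_label)
```

Returning the fitted encoder alongside the DataFrame also means the same category-to-column mapping can be applied to other data later through `ohe.transform`.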