One-hot encoding a single column of a DataFrame
Encoding with OneHotEncoder
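Basic approach:
- Create a OneHotEncoder object
- Extract the target column and reshape it into an N×1 array, then fit the OneHotEncoder on it and transform it into the encoded form
- Build a new DataFrame from the encoded data
- Generate a column name for each column of the new encoding
- Set the row index of the new DataFrame to the index of the original DataFrame
- Concatenate the two DataFrames column-wise
- Drop the original column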
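
The reshape in the second step is needed because OneHotEncoder expects a two-dimensional input of shape (n_samples, n_features). A minimal sketch of just that step, using a made-up `color` column:

```python
import pandas as pd

# Hypothetical toy data, only for illustrating the shapes involved
df = pd.DataFrame({"color": ["red", "green", "red"]})

flat = df["color"].values        # 1-D array, shape (3,)
column_2d = flat.reshape(-1, 1)  # 2-D array, shape (3, 1): N rows, 1 feature

print(flat.shape, column_2d.shape)
```
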
Implementation:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder


    def one_hot_for_column(df, column):
        """
        Encode one column of df with one-hot encoding.

        :param df: a DataFrame
        :param column: the name of the column in df to encode
        :return: the new DataFrame and the fitted OneHotEncoder
        """
        ohe = OneHotEncoder()
        # fit_transform is equivalent to calling fit and then transform:
        # ohe.fit(df[column].values.reshape(-1, 1))
        # col_series = ohe.transform(df[column].values.reshape(-1, 1)).toarray()
        col_series = ohe.fit_transform(df[column].values.reshape(-1, 1)).toarray()
        # Name the new columns <column>_1 ... <column>_K, one per category
        columns = ["%s_%s" % (column, str(m)) for m in range(1, col_series.shape[1] + 1)]
        # Reuse the original index so the two frames align when concatenated
        sub_df = pd.DataFrame(col_series, columns=columns, dtype=int, index=df.index)
        new_df = pd.concat([df, sub_df], axis=1)
        # Drop the original, now-encoded column
        new_df.drop(columns=column, inplace=True)
        return new_df, ohe
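
The function above numbers the new columns by position (`<column>_1`, `<column>_2`, ...), which hides which category each column stands for. As a possible variant, the sketch below names them after the fitted encoder's `categories_` attribute instead. The function name `one_hot_with_category_names`, the toy data, and the `handle_unknown="ignore"` setting are assumptions made for illustration and are not part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def one_hot_with_category_names(df, column):
    """Hypothetical variant: name the new columns after the category values."""
    # handle_unknown="ignore" is an assumption: unseen values encode as all zeros
    ohe = OneHotEncoder(handle_unknown="ignore")
    encoded = ohe.fit_transform(df[[column]]).toarray()
    # categories_[0] holds the sorted unique values of the fitted column
    names = ["%s_%s" % (column, cat) for cat in ohe.categories_[0]]
    sub_df = pd.DataFrame(encoded, columns=names, index=df.index).astype(int)
    return pd.concat([df.drop(columns=column), sub_df], axis=1), ohe


# Toy data with a non-default index to show that row alignment is preserved
df = pd.DataFrame({"color": ["red", "green", "red"], "size": [1, 2, 3]},
                  index=[10, 11, 12])
new_df, ohe = one_hot_with_category_names(df, "color")
print(new_df)

# The returned encoder can encode later data with the same column layout
later = ohe.transform(pd.DataFrame({"color": ["green", "blue"]})).toarray()
print(later)
```

Returning the fitted encoder alongside the DataFrame, as the original function does, is what makes the last step possible: data that arrives later gets the same column layout as the data the encoder was fitted on.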