
Machine Learning Data Preprocessing: Handling Missing Values

Data preprocessing is a key step in machine learning. This post covers one part of it: handling missing values.


Prologue

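Samples in a dataset often have some features that are missing or incomplete, so the missing values need to be handled. The two common strategies are deletion and imputation.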

Drop (Deletion)

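If a sample is missing many of its features, or a feature has no value for many samples, we can handle this by deleting the corresponding sample rows or feature columns. Specifically, Pandas provides the dropna method for removing data that contains missing values; it treats NumPy's np.nan (as well as None and NaT) as missing.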

import pandas as pd
import numpy as np

data = {
    'age': [np.nan, 33, 44, 55, 89, np.nan],
    'sex': [np.nan, 'female', 'female', 'male', np.nan, "male"],
    'approve': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
}

df0 = pd.DataFrame(data)
print("-"*10, "Origin Data", "-"*10, "\n", df0, "\n")

# Drop a row only if every value in it is missing
df1 = df0.dropna(axis=0, how="all")
print("-"*10, "df1 Data", "-"*10, "\n", df1, "\n")

# Drop a column only if every value in it is missing
df2 = df1.dropna(axis=1, how="all")
print("-"*10, "df2 Data", "-"*10, "\n", df2, "\n")

# Drop a row if it contains any missing value (the default: axis=0, how="any")
df3 = df2.dropna()
print("-"*10, "df3 Data", "-"*10, "\n", df3, "\n")

Output:

---------- Origin Data ----------
    age     sex  approve
0   NaN     NaN      NaN
1  33.0  female      NaN
2  44.0  female      NaN
3  55.0    male      NaN
4  89.0     NaN      NaN
5   NaN    male      NaN

---------- df1 Data ----------
    age     sex  approve
1  33.0  female      NaN
2  44.0  female      NaN
3  55.0    male      NaN
4  89.0     NaN      NaN
5   NaN    male      NaN

---------- df2 Data ----------
    age     sex
1  33.0  female
2  44.0  female
3  55.0    male
4  89.0     NaN
5   NaN    male

---------- df3 Data ----------
    age     sex
1  33.0  female
2  44.0  female
3  55.0    male
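
The example above only covers the "all missing" and "any missing" cases. For the in-between case described earlier (a sample missing many features, or a feature missing for many samples), dropna also accepts a thresh argument: the minimum number of non-missing values a row or column must have to be kept. The following is a minimal sketch; the city column and the 50% cut-off are assumptions made up for illustration and are not part of the dataset used in this post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [np.nan, 33, 44, 55, 89, np.nan],
    "sex": [np.nan, "female", "female", "male", np.nan, "male"],
    # Hypothetical extra feature that is missing for most samples
    "city": [np.nan, "A", np.nan, "B", np.nan, np.nan],
})

# Keep only rows that have at least 2 non-missing values
rows_kept = df.dropna(axis=0, thresh=2)

# Keep only columns that are present in at least 50% of the rows
# (the 50% cut-off is an assumed value; choose it per dataset)
min_count = int(np.ceil(0.5 * len(df)))
cols_kept = df.dropna(axis=1, thresh=min_count)

print(rows_kept)
print(cols_kept)
```

Here rows 0, 4 and 5 have fewer than two values and are dropped, and the city column is dropped because only 2 of its 6 values are present.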

Impute (Filling)

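If a sample is missing the value of some feature, the gap can be filled using a statistic of that feature computed over the whole dataset. Typical strategies are:

  • Mean imputation: fill with the mean of the feature over the dataset; suitable for numerical data
  • Median imputation: fill with the median of the feature over the dataset; suitable for numerical data. Compared with mean imputation, it is much less sensitive to outliers
  • Mode imputation: fill with the most frequent value of the feature over the dataset; suitable for categorical data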
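
To make the difference between the three strategies concrete, here is a small sketch that uses only Pandas (Series.mean, Series.median, Series.mode and fillna). The values are chosen for illustration; note how the single outlier 99 pulls the mean up to 32.2 while the median stays at 18.

```python
import numpy as np
import pandas as pd

age = pd.Series([2, 99, np.nan, 13, 29, np.nan, 18], name="age")
sex = pd.Series([np.nan, np.nan, np.nan, "female", "female", "female", "male"], name="sex")

# Statistics are computed over the non-missing values only
mean_age = age.mean()      # (2 + 99 + 13 + 29 + 18) / 5 = 32.2
median_age = age.median()  # middle of [2, 13, 18, 29, 99] = 18.0
mode_sex = sex.mode()[0]   # most frequent category: 'female'

# Fill the gaps with the chosen statistic
age_mean_filled = age.fillna(mean_age)
age_median_filled = age.fillna(median_age)
sex_mode_filled = sex.fillna(mode_sex)

print(mean_age, median_age, mode_sex)
print(age_median_filled.tolist())
```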

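When imputing with Sklearn's SimpleImputer, np.nan is treated as the missing value by default. It is therefore convenient to first use Pandas to replace the various missing-value markers in the dataset with np.nan.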

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Dataset. Features: age, sex; target variable: approve (whether approved)
data = {
    'age': [2, 99, -1, 13, 29, np.nan, 18],
    'sex': [None, np.nan, '', 'female', 'female', 'female', 'male'],
    'approve': ["Y","Y","N","Y","N","Y","N"]
}
df = pd.DataFrame(data)
print("-"*10, "Origin Data", "-"*10, "\n", df)

# Data cleaning: replace -1, empty strings, and None with np.nan
df_clean = df.replace([-1, '', None], np.nan)
print("-"*10, "Clean Data", "-"*10, "\n", df_clean)

# Count the missing values in each feature
miss_count = df_clean.isnull().sum()
print("-"*10, "Missing-value count per feature", "-"*10)
print(miss_count)

# Compute the percentage of missing values in each feature
miss_percentage = (miss_count / len(df_clean)) * 100
print("-"*10, "Missing-value percentage per feature (%)", "-"*10)
print(miss_percentage)

# Separate the features from the data and convert them to a NumPy array
X = df_clean.drop("approve", axis=1)
X = X.to_numpy()
print("-"*10, "Origin X", "-"*10, "\n", X)

age = X[:,0:1]
print("-"*10, "Origin age", "-"*10, "\n", age)
# Create a SimpleImputer; np.nan is treated as missing by default. Strategy: median
imputer = SimpleImputer(strategy='median')
imputer.fit(age)
impute_age = imputer.transform(age)
print("-"*10, "Impute age", "-"*10, "\n", impute_age)

sex = X[:,1:2]
print("-"*10, "Origin sex", "-"*10, "\n", sex)
# Create a SimpleImputer; np.nan is treated as missing by default. Strategy: most frequent value (mode)
imputer = SimpleImputer(strategy='most_frequent')
imputer.fit(sex)
impute_sex = imputer.transform(sex)
print("-"*10, "Impute sex", "-"*10, "\n", impute_sex)

Output:

---------- Origin Data ----------
    age     sex  approve
0   2.0    None        Y
1  99.0     NaN        Y
2  -1.0                N
3  13.0  female        Y
4  29.0  female        N
5   NaN  female        Y
6  18.0    male        N
---------- Clean Data ----------
    age     sex  approve
0   2.0     NaN        Y
1  99.0     NaN        Y
2   NaN     NaN        N
3  13.0  female        Y
4  29.0  female        N
5   NaN  female        Y
6  18.0    male        N
---------- Missing-value count per feature ----------
age        2
sex        3
approve    0
dtype: int64
---------- Missing-value percentage per feature (%) ----------
age        28.571429
sex        42.857143
approve     0.000000
dtype: float64
---------- Origin X ----------
[[2.0 nan]
[99.0 nan]
[nan nan]
[13.0 'female']
[29.0 'female']
[nan 'female']
[18.0 'male']]
---------- Origin age ----------
[[2.0]
[99.0]
[nan]
[13.0]
[29.0]
[nan]
[18.0]]
---------- Impute age ----------
[[ 2.]
[99.]
[18.]
[13.]
[29.]
[18.]
[18.]]
---------- Origin sex ----------
[[nan]
[nan]
[nan]
['female']
['female']
['female']
['male']]
---------- Impute sex ----------
[['female']
['female']
['female']
['female']
['female']
['female']
['male']]
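
As a possible alternative to replacing markers with np.nan first, SimpleImputer accepts a missing_values argument that tells it which value to treat as missing. The sketch below assumes that -1 is the only missing marker in the column and that the column contains no np.nan; a column mixing several markers, like the one above, still needs the Pandas replace step. It also shows that the statistic learned by fit is stored in statistics_ and is reused when transform is applied to other data (the new_age array is made up for illustration).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age column where -1 is the only missing marker (assumption of this sketch)
age = np.array([[2.0], [99.0], [-1.0], [13.0], [29.0], [18.0]])

# Treat -1 as missing and fill it with the median of the remaining values
imputer = SimpleImputer(missing_values=-1, strategy="median")
filled = imputer.fit_transform(age)

# The median learned during fit: middle of [2, 13, 18, 29, 99] = 18
print(imputer.statistics_)
print(filled.ravel())

# Hypothetical new samples: the median learned above is reused, not recomputed
new_age = np.array([[-1.0], [40.0]])
new_filled = imputer.transform(new_age)
print(new_filled.ravel())
```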
