Data Preprocessing for Machine Learning Explained: Standardization, Filling Missing Values, and Encoding Discrete Features

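Data preprocessing is a critical step in building a machine learning model. This article walks through its key steps with a concrete example: standardizing numeric features, filling missing values, and encoding discrete features. A small training and test dataset illustrates each step.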

Example dataset

Training data (`train_data`)

Id	Feature1	Feature2	Feature3	Label
1	10	5.0	A	100
2	20	6.5	B	200
3	30	NaN	A	300

Test data (`test_data`)

Id	Feature1	Feature2	Feature3
4	25	5.5	B
5	35	7.0	NaN
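
To follow along in code, the two tables above can be built as pandas DataFrames. This is a minimal sketch: the article does not show how `train_data` and `test_data` were created, so constructing them from dictionaries is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical construction of the example tables; the values are copied
# from the article, the way they are loaded is an assumption.
train_data = pd.DataFrame({
    "Id": [1, 2, 3],
    "Feature1": [10, 20, 30],
    "Feature2": [5.0, 6.5, np.nan],
    "Feature3": ["A", "B", "A"],
    "Label": [100, 200, 300],
})

test_data = pd.DataFrame({
    "Id": [4, 5],
    "Feature1": [25, 35],
    "Feature2": [5.5, 7.0],
    "Feature3": ["B", np.nan],
})

print(train_data)
print(test_data)
```

The first column (Id) and the last training column (Label) are not features, which is why the steps below slice them away by position.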

Step-by-step walkthrough

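1. Combine all features for preprocessing

First, concatenate the features of the training and test sets, leaving out the Id column and the Label column, so that every feature goes through the same preprocessing.

all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))

Result after concatenation: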

Feature1	Feature2	Feature3
10	5.0	A
20	6.5	B
30	NaN	A
25	5.5	B
35	7.0	NaN

2. Standardize numeric features

Identify the numeric columns, then standardize them so that each numeric feature has mean 0 and standard deviation 1.

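# Select the columns whose dtype is not 'object', i.e. the numeric ones.
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
# Rescale each numeric column to zero mean and unit (sample) standard deviation.
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / x.std())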

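In this example, Feature1 and Feature2 are the numeric features. Their means and standard deviations are computed first. pandas' std() returns the sample standard deviation (it divides by n - 1) and, like mean(), skips NaN values, so Feature2 is summarized over its four observed values:

Feature1 mean = (10 + 20 + 30 + 25 + 35) / 5 = 24
Feature1 standard deviation = sqrt(370 / 4) ≈ 9.62
Feature2 mean = (5.0 + 6.5 + 5.5 + 7.0) / 4 = 6.0
Feature2 standard deviation = sqrt(2.5 / 3) ≈ 0.91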
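
The arithmetic can be checked with a short script. This is a sketch under the assumption that the combined columns hold exactly the five rows shown earlier; the `stats` dictionary is a name introduced here for illustration. It also computes the population standard deviation (`ddof=0`) for comparison, since the two conventions give noticeably different numbers on such a small sample.

```python
import numpy as np
import pandas as pd

# The combined columns from the example (training rows first, then test rows).
feature1 = pd.Series([10, 20, 30, 25, 35], dtype="float64")
feature2 = pd.Series([5.0, 6.5, np.nan, 5.5, 7.0])

stats = {}
for name, col in {"Feature1": feature1, "Feature2": feature2}.items():
    stats[name] = {
        "mean": col.mean(),                 # NaN entries are skipped
        "sample_std": col.std(),            # ddof=1, the pandas default
        "population_std": col.std(ddof=0),  # divides by n instead of n - 1
    }

for name, values in stats.items():
    print(name, {key: round(float(value), 2) for key, value in values.items()})
```

With the pandas default, the sample standard deviations come out at about 9.62 for Feature1 and 0.91 for Feature2, and those are the values the lambda in the standardization step divides by.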

Result after standardization:

Feature1	Feature2	Feature3
-1.46	-1.10	A
-0.42	0.55	B
0.62	NaN	A
0.10	-0.55	B
1.14	1.10	NaN

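3. Fill missing values with 0

Fill the missing values (NaN) in the numeric features with 0. Because each standardized feature has mean 0, this is the same as replacing a missing entry with that feature's mean.

all_features[numeric_features] = all_features[numeric_features].fillna(0)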

Result after filling missing values:

Feature1	Feature2	Feature3
-1.46	-1.10	A
-0.42	0.55	B
0.62	0.00	A
0.10	-0.55	B
1.14	1.10	NaN
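
Why 0 is a sensible fill value can be seen by undoing the scaling. The sketch below uses only the Feature2 column; `restored` is a name introduced here for illustration and is not part of the article's pipeline.

```python
import numpy as np
import pandas as pd

feature2 = pd.Series([5.0, 6.5, np.nan, 5.5, 7.0], name="Feature2")
mean, std = feature2.mean(), feature2.std()

# Standardize first (the NaN stays NaN), then fill the gap with 0.
standardized = (feature2 - mean) / std
filled = standardized.fillna(0)

# Map the values back to the original scale to see what the 0 stands for.
restored = filled * std + mean

print(filled.round(2).tolist())
print(restored.tolist())
```

On the original scale, the filled entry equals the column mean of the observed values, so filling with 0 after standardization amounts to mean imputation.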

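4. Encode discrete features

Apply one-hot encoding to the discrete (categorical) features. With dummy_na=True, a missing value is treated as a category of its own and gets a dedicated indicator column.

all_features = pd.get_dummies(all_features, dummy_na=True)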

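Result after encoding (shown as 1/0; pandas 2.0 and later return these indicator columns as booleans, True/False):

Feature1	Feature2	Feature3_A	Feature3_B	Feature3_nan
-1.46	-1.10	1	0	0
-0.42	0.55	0	1	0
0.62	0.00	1	0	0
0.10	-0.55	0	1	0
1.14	1.10	0	0	1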
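
The effect of `dummy_na=True` is easiest to see on the categorical column alone. This is a small sketch, assuming the column holds the five Feature3 values from the example; passing `dtype=int` is a choice made here to get 1/0 output on any pandas version, not something the article's code does.

```python
import numpy as np
import pandas as pd

feature3 = pd.DataFrame({"Feature3": ["A", "B", "A", "B", np.nan]})

# Default dtype: booleans on pandas 2.0 and later.
encoded = pd.get_dummies(feature3, dummy_na=True)

# Explicit integer dtype, so the indicators print as 1/0.
encoded_int = pd.get_dummies(feature3, dummy_na=True, dtype=int)

print(encoded_int)
```

Each row has exactly one indicator set, and the row with the missing value is marked in `Feature3_nan` rather than being left as all zeros.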

5. Make sure every feature is numeric

Convert every column to float32 so that all features share a single numeric type.

all_features = all_features.astype(np.float32)

The final result is a DataFrame made up entirely of numeric features, all standardized and with missing values handled, ready for model training and prediction:

Feature1	Feature2	Feature3_A	Feature3_B	Feature3_nan
-1.46	-1.10	1.0	0.0	0.0
-0.42	0.55	0.0	1.0	0.0
0.62	0.00	1.0	0.0	0.0
0.10	-0.55	0.0	1.0	0.0
1.14	1.10	0.0	0.0	1.0
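
Putting the five steps together, the whole pipeline might look like the sketch below. Several parts are assumptions rather than the article's code: the DataFrames are rebuilt from the example tables, `select_dtypes(include="number")` replaces the `dtypes != 'object'` test as a version-tolerant way to pick numeric columns, `ignore_index=True` is added to get a clean row index, and the final split into `train_features`, `test_features`, and `train_labels` is a hypothetical follow-up step that the article does not show.

```python
import numpy as np
import pandas as pd

train_data = pd.DataFrame({
    "Id": [1, 2, 3],
    "Feature1": [10, 20, 30],
    "Feature2": [5.0, 6.5, np.nan],
    "Feature3": ["A", "B", "A"],
    "Label": [100, 200, 300],
})
test_data = pd.DataFrame({
    "Id": [4, 5],
    "Feature1": [25, 35],
    "Feature2": [5.5, 7.0],
    "Feature3": ["B", np.nan],
})

# Step 1: combine the features, dropping Id (first) and Label (last in train).
all_features = pd.concat(
    (train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]), ignore_index=True)

# Step 2: standardize the numeric columns (sample std, NaN skipped).
numeric_features = all_features.select_dtypes(include="number").columns
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / x.std())

# Step 3: after standardization the mean is 0, so 0 stands in for the mean.
all_features[numeric_features] = all_features[numeric_features].fillna(0)

# Step 4: one-hot encode the categorical columns, with a column for NaN.
all_features = pd.get_dummies(all_features, dummy_na=True)

# Step 5: use a single numeric dtype for every column.
all_features = all_features.astype(np.float32)

# Hypothetical follow-up: split the rows back into training and test parts.
n_train = train_data.shape[0]
train_features = all_features.iloc[:n_train]
test_features = all_features.iloc[n_train:]
train_labels = train_data["Label"].astype(np.float32)

print(all_features.round(2))
```

Because the training rows were concatenated first, slicing by `n_train` recovers them in their original order.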