Flowchart:

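1. Read the data table

The Adult income dataset was extracted by Ronny Kohavi and Barry Becker from the 1994 US Census Bureau database. It contains the annual income of 32,561 adults together with 14 related attributes. The dataset can be used for income prediction: the task is to determine whether a person's annual income exceeds 50,000 US dollars. We first read the dataset and look at its first five rows.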

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K
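
The read-and-preview step can be sketched as follows, assuming pandas is used (the article does not show which tool produced the table). To stay self-contained, the sketch parses the five preview rows from an inline string; the file name `adult.csv` mentioned in the comment is hypothetical.

```python
import io
import pandas as pd

# Column names as listed in the table above.
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country",
    "salary",
]

# The five preview rows from the article, used in place of the real file.
SAMPLE = """\
39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
"""

# With the real data one would pass a path instead, e.g. "adult.csv" (hypothetical).
df = pd.read_csv(io.StringIO(SAMPLE), header=None, names=COLUMNS,
                 skipinitialspace=True)

print(df.head())  # first five rows
print(df.shape)   # (number of rows, number of columns)
```

On the full dataset described above, `df.shape` would be expected to report 32561 rows and 15 columns.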

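The dataset has 15 variables. Nine are categorical: work class (workclass), education level (education), marital status (marital-status), occupation (occupation), family relationship (relationship), race (race), sex (sex), native country (native-country), and income (salary). Six are continuous: age (age), sampling weight (fnlwgt), length of education (education-num), capital gain (capital-gain), capital loss (capital-loss), and hours worked per week (hours-per-week).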

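2. Missing-value detection

Next we look at the variables in more detail and check whether the data contains missing values.

Missing values per column:

Column Missing values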
age 0
workclass 0
fnlwgt 0
education 0
education-num 0
marital-status 0
occupation 0
relationship 0
race 0
sex 0

Rows filtered out because of missing values: 0

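The check finds no missing values. Inspecting the dataset, however, shows that three variables contain anomalous values, which need to be handled next. For the three categorical variables work class (workclass), occupation (occupation), and native country (native-country), the anomalous value "?" is replaced with unknown.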
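
A minimal sketch of why the missing-value check reports zero while placeholder values still need cleaning, assuming pandas and assuming the placeholder is the string "?" (as in the public Adult dataset). The three rows below are made up purely for illustration.

```python
import pandas as pd

# Tiny illustrative frame (made-up rows, not taken from the dataset).
df = pd.DataFrame({
    "workclass": ["Private", "?", "State-gov"],
    "occupation": ["Sales", "?", "Adm-clerical"],
    "native-country": ["United-States", "Cuba", "?"],
})

# Standard missing-value check: "?" is an ordinary string, so nothing is flagged.
missing_counts = df.isna().sum()
print(missing_counts)

# Replace the "?" placeholder with "unknown" in the three affected columns.
cols = ["workclass", "occupation", "native-country"]
cleaned = df.copy()
cleaned[cols] = cleaned[cols].replace("?", "unknown")
print(cleaned)
```

Keeping the unknown entries as their own category, rather than dropping the rows, preserves the sample size for later modeling.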

3. Replacing anomalous work-class values

Replace the anomalous values in work class (workclass).

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

The anomalous values in workclass have been replaced.

4. Replacing anomalous occupation values

Replace the anomalous values in occupation (occupation).

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

The anomalous values in occupation have been replaced.

5. Replacing anomalous native-country values

Replace the anomalous values in native country (native-country).

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

The anomalous values in native-country have been replaced.

6. Basic field statistics

View the basic statistics of the data in the dataset.

Numeric variables:

Variable        Count  Mean              Std. dev.          Min    Q1      Median  Q3      Max
age             32561  38.5816467553     13.6404325536      17     28      37      48      90
fnlwgt          32561  189778.366512085  105549.9776970222  12285  117827  178356  237051  1484705
education-num   32561  10.0806793403     2.5727203321       1      9       10      12      16
capital-gain    32561  1077.6488437087   7385.2920848403    0      0       0       0       99999
capital-loss    32561  87.303829735      402.960218649      0      0       0       0       4356
hours-per-week  32561  40.4374558521     12.3474286817      1      40      40      45      99

Categorical variables:

Variable        Count  Distinct values  Mode                Mode frequency
workclass       32561  9                Private             22696
education       32561  16               HS-grad             10501
marital-status  32561  7                Married-civ-spouse  14976
occupation      32561  15               Prof-specialty      4140
relationship    32561  6                Husband             13193
race            32561  5                White               27816
sex             32561  2                Male                21790
native-country  32561  42               United-States       29170
salary          32561  2                <=50K               24720

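The numeric variables are age (age), sampling weight (fnlwgt), length of education (education-num), capital gain (capital-gain), capital loss (capital-loss), and hours worked per week (hours-per-week); all other variables are categorical. Among the numeric variables, fnlwgt, capital-gain, and capital-loss are widely dispersed: the maximum of fnlwgt is about 8 times its mean, and the maxima of capital-gain and capital-loss are roughly 93 and 50 times their means.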
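
The dispersion of these three variables can be checked with a few lines of arithmetic. The sketch below uses only the means and maxima reported in the statistics table; the dictionary layout is just an illustrative choice.

```python
# Mean and maximum as reported in the statistics table above.
stats = {
    "fnlwgt": {"mean": 189778.366512085, "max": 1484705},
    "capital-gain": {"mean": 1077.6488437087, "max": 99999},
    "capital-loss": {"mean": 87.303829735, "max": 4356},
}

# Ratio of maximum to mean: a rough indicator of a long right tail.
ratios = {name: s["max"] / s["mean"] for name, s in stats.items()}

for name, ratio in ratios.items():
    print(f"{name}: max is {ratio:.1f}x the mean")
```

The ratio is far larger for capital-gain and capital-loss than for fnlwgt, which fits their quartiles all being 0: most people report no capital gain or loss, and a few report very large amounts.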

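First, we visualize the label column, individual annual income (salary), drawing a pie chart and a bar chart of its distribution to show how the data is distributed, which helps with later modeling.

Draw a bar chart of annual income (salary) and observe the frequency distribution.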

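Because individual income is closely related to type of work, we visualize work class (workclass): count the records in each work class, draw a bar chart, and compare the income shares across work classes.

Draw a bar chart of length of education (education-num) and observe its distribution.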

Draw a bar chart of income (salary) against sex (sex) to analyze the relationship between them.
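
The numbers behind such a grouped bar chart can be sketched with a cross-tabulation, assuming pandas. The six records below are made up for illustration only and do not reflect the real distribution.

```python
import pandas as pd

# Made-up illustrative records; the real analysis would use the full dataset.
df = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Female", "Male"],
    "salary": ["<=50K", ">50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# Counts per (sex, salary) pair: the bar heights of a grouped bar chart.
counts = pd.crosstab(df["sex"], df["salary"])

# Row-normalised shares: the proportion of each salary class within each sex.
shares = pd.crosstab(df["sex"], df["salary"], normalize="index")

print(counts)
print(shares.round(2))

# With matplotlib installed, counts.plot(kind="bar") would draw the chart.
```

Comparing row-normalised shares rather than raw counts avoids being misled by the different group sizes (the dataset contains many more men than women).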

 

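17. Logistic regression

A logistic regression model is trained on the training set. The coefficients of the features are shown in the table below:

Coefficients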

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0.564702 -0.196987 0.032008 0.050004 0.854733 -0.431347 0.012287 -0.09928 0.102477 0.467217 2.338441 0.265758 0.423848 -0.001297

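Capital gain has the largest coefficient, 2.338, followed by length of education at 0.855 and age at 0.565. This matches everyday intuition: people with capital gains and more education generally have higher incomes. Next we use the model for prediction.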
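
A minimal sketch of fitting a logistic regression and reading its coefficients, assuming scikit-learn; the article does not show its training code, so this is not the author's implementation. The data is synthetic, and the feature weights (2.0 and 0.5), the split ratio, and the random seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 3 numeric features, label driven mostly by feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
noise = rng.normal(scale=0.5, size=500)
y = (2.0 * X[:, 0] + 0.5 * X[:, 1] + noise > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Standardise with statistics from the training set only, then fit.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000)
model.fit(scaler.transform(X_train), y_train)

# One coefficient per feature; on standardised inputs, magnitudes are comparable.
coefficients = model.coef_[0]
print(coefficients)

# Predicted labels for the held-out rows.
y_pred = model.predict(scaler.transform(X_test))
print(y_pred[:5])
```

Standardising the features first is what makes it reasonable to rank them by coefficient size, as the paragraph above does for capital gain, education, and age.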

18. Model prediction

The trained logistic regression model is used to predict on the test set. The results are as follows:

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary salary_predict
-0.8490804496 -0.2379060115 -0.1199390202 1.2148687394 -0.0313600271 -1.7340583484 -1.4835817968 1.5893223617 0.3936675268 -1.4223307593 -0.1459204836 -0.216659527 0.2885296159 0.2513776468 0 0
-0.8490804496 -0.2379060115 0.2529895706 -0.335436928 1.1347387638 0.9216339465 0.5956350429 -0.2778050392 0.3936675268 -1.4223307593 -0.1459204836 -0.216659527 -0.035429447 0.2513776468 0 0
-0.9957056174 -0.2379060115 0.6298973802 -0.8522054837 0.7460391668 -0.4062122009 1.0576832296 -0.9001808395 0.3936675268 0.703071345 -0.1459204836 -0.216659527 -0.035429447 0.2513776468 0 0
0.5438586447 -0.2379060115 -0.3992328043 -1.6273583174 -2.7522572057 -0.4062122009 1.5197314162 -0.9001808395 -4.3189090683 0.703071345 -0.1459204836 4.503481865 -0.035429447 0.2513776468 0 0
0.4705460608 -0.2379060115 -0.1606502177 -2.4025111511 -1.1974588179 -1.7340583484 1.5197314162 -0.2778050392 0.3936675268 0.703071345 -0.1459204836 6.791584054 2.8802021189 0.2513776468 1 1

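19. Classification model evaluation

The predictions are compared with the true values to evaluate the logistic regression model. The resulting classification report and confusion matrix are as follows:

Classification report

Label Precision Recall F1-score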
0.0 0.91 0.76 0.83
1.0 0.51 0.77 0.62
accuracy 0.76 0.76 0.76
macro avg 0.71 0.77 0.72
weighted avg 0.81 0.76 0.78

Confusion matrix

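The classification report shows that precision for class 0 (salary <=50K) is as high as 0.91, while precision for class 1 (salary >50K) is only 0.51. A likely reason is that records with salary >50K are under-represented, making up less than 25% of the data, which makes predictions for this class less reliable. The model's AUC is 0.85, indicating reasonably good predictive performance.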
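
As a quick consistency check on the report, the F1 scores and macro averages can be recomputed from the per-class precision and recall. The sketch below uses only the rounded values printed in the report, so small differences in the last digit are to be expected.

```python
# Precision and recall per class, as reported in the classification report.
report = {
    0: {"precision": 0.91, "recall": 0.76},
    1: {"precision": 0.51, "recall": 0.77},
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


f1 = {label: f1_score(m["precision"], m["recall"]) for label, m in report.items()}

# Macro average: unweighted mean over the two classes.
macro_precision = sum(m["precision"] for m in report.values()) / len(report)
macro_recall = sum(m["recall"] for m in report.values()) / len(report)

print({label: round(value, 3) for label, value in f1.items()})
print(round(macro_precision, 3), round(macro_recall, 3))
```

The recomputed F1 for class 0 agrees with the reported 0.83. For class 1 it comes out near 0.61 rather than 0.62, a gap that is consistent with the report having been computed from unrounded precision and recall.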

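Summary

In this case study we first checked the data for missing values, found anomalous values by inspecting the raw data, and replaced them. We then explored how annual income relates to sex, work class, and other variables, using visualization to describe these relationships. Finally, after feature encoding, we used logistic regression to predict individual annual income, with reasonably good classification results.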
