数据透视表

import pandas as pd
example = pd.DataFrame({'Amount': [74., 235., 175., 100., 115., 245., 180., 90., 88., 129., 273., 300.],
              'Category': ['Transportation', 'Grocery', 'Household', 'Entertainment', 
                           'Transportation', 'Grocery', 'Household', 'Entertainment', 
                           'Transportation', 'Grocery', 'Household', 'Entertainment'],
              'Month': ['January', 'January', 'January', 'January', 
                        'February', 'February', 'February', 'February', 
                        'Marth', 'Marth', 'Marth', 'Marth']})
example
Amount Category Month
0 74.0 Transportation January
1 235.0 Grocery January
2 175.0 Household January
3 100.0 Entertainment January
4 115.0 Transportation February
5 245.0 Grocery February
6 180.0 Household February
7 90.0 Entertainment February
8 88.0 Transportation Marth
9 129.0 Grocery Marth
10 273.0 Household Marth
11 300.0 Entertainment Marth
上面这个数据表感觉非常的杂乱,数据透视的意思就是按照我们的意思重新组织这张数据表
new_example = example.pivot(index='Category', columns='Month', values='Amount')    
index 表示以原先数据的哪个属性的值为行坐标进行分类 columns 表示以原先数据的哪个属性的值为新的属性
values 表示以原先数据的哪个属性的值填充前面两个 index 和 columns 定义出的新表格
new_example
Month February January Marth
Category
Entertainment 90.0 100.0 300.0
Grocery 245.0 235.0 129.0
Household 180.0 175.0 273.0
Transportation 115.0 74.0 88.0
new_example.sum(axis=1)
Category
Entertainment     490.0
Grocery           609.0
Household         628.0
Transportation    277.0
dtype: float64
new_example.sum(axis=0)
Month
February    630.0
January     584.0
Marth       790.0
dtype: float64
经过重新组织后,数据表的呈现方式更有意义
df = pd.read_csv('../../datasets/titanic/test.csv')
df.head(5)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
需求: 统计男女在不同船舱等级的票价
df.pivot_table(index='Sex', columns='Pclass', values='Fare')   # 默认就是求平均值
Pclass 1 2 3
Sex
female 115.591168 26.438750 13.735129
male 75.586551 20.184654 11.826350
df.pivot_table(index='Sex', columns='Pclass', values='Fare', aggfunc='min')  # 求最小值
Pclass 1 2 3
Sex
female 25.7 10.5000 6.9500
male 0.0 9.6875 3.1708
需求: 统计不同船舱等级里男女的平均年龄
df.pivot_table(index='Pclass', columns='Sex', values='Age')
Sex female male
Pclass
1 41.333333 40.520000
2 24.376552 30.940678
3 23.073400 24.525104
Logo

DAMO开发者矩阵,由阿里巴巴达摩院和中国互联网协会联合发起,致力于探讨最前沿的技术趋势与应用成果,搭建高质量的交流与分享平台,推动技术创新与产业应用链接,围绕“人工智能与新型计算”构建开放共享的开发者生态。

更多推荐