pandas: 10 Exercise Sets

These exercises come from kesci.com.

These ten exercise sets show how to use pandas for data analysis.

import pandas as pd
import numpy as np

Chipotle Fast-Food Data (Basic Operations)

Take a look at the basic data:
chipo = pd.read_csv("chipotle.tsv", sep="\t")
chipo.head(10)
order_id quantity item_name choice_description item_price
0 1 1 Chips and Fresh Tomato Salsa NaN $2.39
1 1 1 Izze [Clementine] $3.39
2 1 1 Nantucket Nectar [Apple] $3.39
3 1 1 Chips and Tomatillo-Green Chili Salsa NaN $2.39
4 2 2 Chicken Bowl [Tomatillo-Red Chili Salsa (Hot), [Black Beans... $16.98
5 3 1 Chicken Bowl [Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou... $10.98
6 3 1 Side of Chips NaN $1.69
7 4 1 Steak Burrito [Tomatillo Red Chili Salsa, [Fajita Vegetables... $11.75
8 4 1 Steak Soft Tacos [Tomatillo Green Chili Salsa, [Pinto Beans, Ch... $9.25
9 5 1 Steak Burrito [Fresh Tomato Salsa, [Rice, Black Beans, Pinto... $9.25
chipo.shape
(4622, 5)
chipo.columns
Index(['order_id', 'quantity', 'item_name', 'choice_description',
       'item_price'],
      dtype='object')
chipo.index
RangeIndex(start=0, stop=4622, step=1)

Which item was ordered the most?

as_index=False keeps item_name as a regular column instead of moving it into the index.
The dict passed to agg lets you apply a different function to each column.
c = chipo.groupby(['item_name'],as_index=False).agg({'quantity':sum})
c.sort_values(['quantity'],ascending=False,inplace=True)
c.head()
item_name quantity
17 Chicken Bowl 761
18 Chicken Burrito 591
25 Chips and Guacamole 506
39 Steak Burrito 386
10 Canned Soft Drink 351
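A minimal sketch (on a hypothetical mini-table, not the Chipotle data) makes the as_index behaviour easy to check:

```python
import pandas as pd

# hypothetical mini order table, just to illustrate as_index
toy = pd.DataFrame({"item_name": ["a", "b", "a"], "quantity": [1, 2, 3]})

# default as_index=True moves item_name into the index
g_default = toy.groupby("item_name").agg({"quantity": "sum"})

# as_index=False keeps item_name as an ordinary column
g_flat = toy.groupby("item_name", as_index=False).agg({"quantity": "sum"})
```

With as_index=False, sort_values can then operate on the quantity column directly, as in the cell above.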

How many distinct items appear in item_name?
chipo['item_name'].nunique()
50

Which choice_description was ordered most often?
temp = chipo.groupby(['choice_description'],as_index=False).agg({'quantity':sum})
temp.sort_values(by='quantity',inplace=True,ascending=False)
temp
choice_description quantity
15 [Diet Coke] 159
14 [Coke] 143
583 [Sprite] 89
256 [Fresh Tomato Salsa, [Rice, Black Beans, Chees... 49
257 [Fresh Tomato Salsa, [Rice, Black Beans, Chees... 42
... ... ...
496 [Roasted Chili Corn Salsa, [Fajita Vegetables,... 1
499 [Roasted Chili Corn Salsa, [Fajita Vegetables,... 1
502 [Roasted Chili Corn Salsa, [Fajita Vegetables,... 1
504 [Roasted Chili Corn Salsa, [Guacamole, Sour Cr... 1
1042 [[Tomatillo-Red Chili Salsa (Hot), Tomatillo-G... 1

1043 rows × 2 columns

How many items were ordered in total?
chipo.loc[:, 'quantity'].sum()
4972

Convert item_price to float
chipo['item_price'] = chipo.loc[:, 'item_price'].apply(lambda x: x[1:]).astype(np.float64)
chipo['item_price']
0        2.39
1        3.39
2        3.39
3        2.39
4       16.98
        ...  
4617    11.75
4618    11.75
4619    11.25
4620     8.75
4621     8.75
Name: item_price, Length: 4622, dtype: float64
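The slice-off-the-first-character lambda works because every price here starts with "$". A slightly more defensive sketch (hypothetical sample values) uses the vectorized .str accessor and pd.to_numeric:

```python
import pandas as pd

# hypothetical price strings in the same "$x.yz" format as item_price
prices = pd.Series(["$2.39", "$16.98", "$1.69"])

# strip the dollar sign, then convert; errors="coerce" maps malformed values to NaN
clean = pd.to_numeric(prices.str.replace("$", "", regex=False), errors="coerce")
```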

What was the total revenue over the period covered by the dataset?
chipo['sum_earn'] = chipo['quantity']*chipo['item_price']
chipo['sum_earn'].sum()
39237.02

How many orders were placed over that period?
chipo['order_id'].nunique()
1834

What is the average total price per order?
chipo.groupby('order_id').agg({'sum_earn':sum}).mean()
sum_earn    21.394231
dtype: float64

How many different items were sold?
chipo['item_name'].nunique()
50

Euro 2012 Data (Filtering and Sorting)
euro12 = pd.read_csv('Euro2012_stats.csv')
euro12.head(5)
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 13 81.3% 41 62 2 9 0 9 9 16
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 9 60.1% 53 73 8 7 0 11 11 19
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 10 66.7% 25 38 8 4 0 7 7 15
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 22 88.1% 43 45 6 5 0 11 11 16
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 6 54.6% 36 51 5 6 0 11 11 19

5 rows × 35 columns

euro12['Goals']
0      4
1      4
2      4
3      5
4      3
5     10
6      5
7      6
8      2
9      2
10     6
11     1
12     5
13    12
14     5
15     2
Name: Goals, dtype: int64

How many teams took part in Euro 2012?
euro12['Team'].nunique()
16

How many columns does the dataset have?
euro12.shape[1]
35

Store the columns Team, Yellow Cards, and Red Cards in a new DataFrame named discipline
discipline = euro12.loc[:,['Team', 'Yellow Cards', 'Red Cards']]
discipline.head()
Team Yellow Cards Red Cards
0 Croatia 9 0
1 Czech Republic 7 0
2 Denmark 4 0
3 England 5 0
4 France 6 0

Sort discipline by Red Cards first, then by Yellow Cards
discipline.sort_values(by=['Red Cards', 'Yellow Cards'])
Team Yellow Cards Red Cards
2 Denmark 4 0
5 Germany 4 0
3 England 5 0
8 Netherlands 5 0
15 Ukraine 5 0
4 France 6 0
12 Russia 6 0
1 Czech Republic 7 0
14 Sweden 7 0
0 Croatia 9 0
13 Spain 11 0
10 Portugal 12 0
7 Italy 16 0
11 Republic of Ireland 6 1
9 Poland 7 1
6 Greece 9 1

Calculate the mean number of yellow cards per team
discipline.groupby(['Team']).agg({'Yellow Cards':np.mean})
Yellow Cards
Team
Croatia 9
Czech Republic 7
Denmark 4
England 5
France 6
Germany 4
Greece 9
Italy 16
Netherlands 5
Poland 7
Portugal 12
Republic of Ireland 6
Russia 6
Spain 11
Sweden 7
Ukraine 5

Find the teams that scored more than 6 goals
euro12.loc[euro12.Goals>6, :]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 10 62.6% 63 49 12 4 0 15 15 17
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 15 93.8% 102 83 19 11 0 17 17 18

2 rows × 35 columns

Select the teams whose names start with the letter G
%timeit euro12.loc[euro12.loc[:, "Team"].apply(lambda x: x.startswith("G"))]
631 µs ± 7.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit euro12[euro12.Team.str.startswith('G')]
524 µs ± 3.51 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
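Both cells select the same rows; the vectorized .str.startswith is the more idiomatic form, and the timings above show it is also slightly faster here. A toy check on hypothetical data:

```python
import pandas as pd

# hypothetical team column
df = pd.DataFrame({"Team": ["Germany", "Greece", "France"], "Goals": [10, 5, 3]})

via_str = df[df["Team"].str.startswith("G")]
via_apply = df[df["Team"].apply(lambda x: x.startswith("G"))]
```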

Select the first 7 columns
euro12.iloc[:, 0:7]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked)
0 Croatia 4 13 12 51.9% 16.0% 32
1 Czech Republic 4 13 18 41.9% 12.9% 39
2 Denmark 4 10 10 50.0% 20.0% 27
3 England 5 11 18 50.0% 17.2% 40
4 France 3 22 24 37.9% 6.5% 65
5 Germany 10 32 32 47.8% 15.6% 80
6 Greece 5 8 18 30.7% 19.2% 32
7 Italy 6 34 45 43.0% 7.5% 110
8 Netherlands 2 12 36 25.0% 4.1% 60
9 Poland 2 15 23 39.4% 5.2% 48
10 Portugal 6 22 42 34.3% 9.3% 82
11 Republic of Ireland 1 7 12 36.8% 5.2% 28
12 Russia 5 9 31 22.5% 12.5% 59
13 Spain 12 42 33 55.9% 16.0% 100
14 Sweden 5 17 19 47.2% 13.8% 39
15 Ukraine 2 7 26 21.2% 6.0% 38

Select every column except the last 3
euro12.iloc[:, :-3]
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Clean Sheets Blocks Goals conceded Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 0 10 3 13 81.3% 41 62 2 9 0
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 1 10 6 9 60.1% 53 73 8 7 0
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 1 10 5 10 66.7% 25 38 8 4 0
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 2 29 3 22 88.1% 43 45 6 5 0
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 1 7 5 6 54.6% 36 51 5 6 0
5 Germany 10 32 32 47.8% 15.6% 80 2 1 0 ... 1 11 6 10 62.6% 63 49 12 4 0
6 Greece 5 8 18 30.7% 19.2% 32 1 1 1 ... 1 23 7 13 65.1% 67 48 12 9 1
7 Italy 6 34 45 43.0% 7.5% 110 2 0 0 ... 2 18 7 20 74.1% 101 89 16 16 0
8 Netherlands 2 12 36 25.0% 4.1% 60 2 0 0 ... 0 9 5 12 70.6% 35 30 3 5 0
9 Poland 2 15 23 39.4% 5.2% 48 0 0 0 ... 0 8 3 6 66.7% 48 56 3 7 1
10 Portugal 6 22 42 34.3% 9.3% 82 6 0 0 ... 2 11 4 10 71.5% 73 90 10 12 0
11 Republic of Ireland 1 7 12 36.8% 5.2% 28 0 0 0 ... 0 23 9 17 65.4% 43 51 11 6 1
12 Russia 5 9 31 22.5% 12.5% 59 2 0 0 ... 0 8 3 10 77.0% 34 43 4 6 0
13 Spain 12 42 33 55.9% 16.0% 100 0 1 0 ... 5 8 1 15 93.8% 102 83 19 11 0
14 Sweden 5 17 19 47.2% 13.8% 39 3 0 0 ... 1 12 5 8 61.6% 35 51 7 7 0
15 Ukraine 2 7 26 21.2% 6.0% 38 0 0 0 ... 0 4 4 13 76.5% 48 31 4 5 0

16 rows × 32 columns

Find the Shooting Accuracy of England, Italy, and Russia
euro12[euro12.Team.isin(['England', 'Italy', 'Russia'])].loc[:,["Team", "Shooting Accuracy"]]
Team Shooting Accuracy
3 England 50.0%
7 Italy 43.0%
12 Russia 22.5%
euro12.head()
Team Goals Shots on target Shots off target Shooting Accuracy % Goals-to-shots Total shots (inc. Blocked) Hit Woodwork Penalty goals Penalties not scored ... Saves made Saves-to-shots ratio Fouls Won Fouls Conceded Offsides Yellow Cards Red Cards Subs on Subs off Players Used
0 Croatia 4 13 12 51.9% 16.0% 32 0 0 0 ... 13 81.3% 41 62 2 9 0 9 9 16
1 Czech Republic 4 13 18 41.9% 12.9% 39 0 0 0 ... 9 60.1% 53 73 8 7 0 11 11 19
2 Denmark 4 10 10 50.0% 20.0% 27 1 0 0 ... 10 66.7% 25 38 8 4 0 7 7 15
3 England 5 11 18 50.0% 17.2% 40 0 0 0 ... 22 88.1% 43 45 6 5 0 11 11 16
4 France 3 22 24 37.9% 6.5% 65 1 0 0 ... 6 54.6% 36 51 5 6 0 11 11 19

5 rows × 35 columns

discipline.head()
Team Yellow Cards Red Cards
0 Croatia 9 0
1 Czech Republic 7 0
2 Denmark 4 0
3 England 5 0
4 France 6 0

Alcohol Consumption Data (Grouping and Aggregation)
drinks = pd.read_csv('drinks.csv')
drinks.head(5)
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 AS
1 Albania 89 132 54 4.9 EU
2 Algeria 25 0 14 0.7 AF
3 Andorra 245 138 312 12.4 EU
4 Angola 217 57 45 5.9 AF
drinks.describe()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
count 193.000000 193.000000 193.000000 193.000000
mean 106.160622 80.994819 49.450777 4.717098
std 101.143103 88.284312 79.697598 3.773298
min 0.000000 0.000000 0.000000 0.000000
25% 20.000000 4.000000 1.000000 1.300000
50% 76.000000 56.000000 8.000000 4.200000
75% 188.000000 128.000000 59.000000 7.200000
max 376.000000 438.000000 370.000000 14.400000
drinks.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 193 entries, 0 to 192
Data columns (total 6 columns):
country                         193 non-null object
beer_servings                   193 non-null int64
spirit_servings                 193 non-null int64
wine_servings                   193 non-null int64
total_litres_of_pure_alcohol    193 non-null float64
continent                       170 non-null object
dtypes: float64(1), int64(3), object(2)
memory usage: 9.2+ KB

Which continent drinks the most beer on average?
drinks.head()
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 AS
1 Albania 89 132 54 4.9 EU
2 Algeria 25 0 14 0.7 AF
3 Andorra 245 138 312 12.4 EU
4 Angola 217 57 45 5.9 AF
drinks.groupby('continent').agg({'beer_servings':np.mean}).sort_values('beer_servings', ascending=False)
beer_servings
continent
EU 193.777778
SA 175.083333
OC 89.687500
AF 61.471698
AS 37.045455

Print the descriptive statistics of wine consumption (wine_servings) for each continent
drinks.loc[:, ['continent','wine_servings']].groupby('continent').describe()
wine_servings
count mean std min 25% 50% 75% max
continent
AF 53.0 16.264151 38.846419 0.0 1.0 2.0 13.00 233.0
AS 44.0 9.068182 21.667034 0.0 0.0 1.0 8.00 123.0
EU 45.0 142.222222 97.421738 0.0 59.0 128.0 195.00 370.0
OC 16.0 35.625000 64.555790 0.0 1.0 8.5 23.25 212.0
SA 12.0 62.416667 88.620189 1.0 3.0 12.0 98.50 221.0
%timeit drinks.loc[:, ['continent','wine_servings']].groupby('continent').describe()
21 ms ± 577 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
drinks.groupby('continent').wine_servings.describe()
count mean std min 25% 50% 75% max
continent
AF 53.0 16.264151 38.846419 0.0 1.0 2.0 13.00 233.0
AS 44.0 9.068182 21.667034 0.0 0.0 1.0 8.00 123.0
EU 45.0 142.222222 97.421738 0.0 59.0 128.0 195.00 370.0
OC 16.0 35.625000 64.555790 0.0 1.0 8.5 23.25 212.0
SA 12.0 62.416667 88.620189 1.0 3.0 12.0 98.50 221.0
%timeit drinks.groupby('continent').wine_servings.describe()
9.78 ms ± 292 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
drinks.groupby('continent').agg({'wine_servings': 'describe'})
wine_servings
count mean std min 25% 50% 75% max
continent
AF 53.0 16.264151 38.846419 0.0 1.0 2.0 13.00 233.0
AS 44.0 9.068182 21.667034 0.0 0.0 1.0 8.00 123.0
EU 45.0 142.222222 97.421738 0.0 59.0 128.0 195.00 370.0
OC 16.0 35.625000 64.555790 0.0 1.0 8.5 23.25 212.0
SA 12.0 62.416667 88.620189 1.0 3.0 12.0 98.50 221.0
%timeit drinks.groupby('continent').agg({'wine_servings': 'describe'})
10.6 ms ± 80.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Print the mean consumption of each drink type for each continent
drinks.head()
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 AS
1 Albania 89 132 54 4.9 EU
2 Algeria 25 0 14 0.7 AF
3 Andorra 245 138 312 12.4 EU
4 Angola 217 57 45 5.9 AF
drinks.groupby('continent').mean()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 61.471698 16.339623 16.264151 3.007547
AS 37.045455 60.840909 9.068182 2.170455
EU 193.777778 132.555556 142.222222 8.617778
OC 89.687500 58.437500 35.625000 3.381250
SA 175.083333 114.750000 62.416667 6.308333

Print the median consumption of each drink type for each continent
drinks.groupby('continent').median()
beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol
continent
AF 32.0 3.0 2.0 2.30
AS 17.5 16.0 1.0 1.20
EU 219.0 122.0 128.0 10.00
OC 52.5 37.0 8.5 1.75
SA 162.5 108.5 12.0 6.85

Print the mean, minimum, and maximum spirit consumption for each continent
drinks.head()
country beer_servings spirit_servings wine_servings total_litres_of_pure_alcohol continent
0 Afghanistan 0 0 0 0.0 AS
1 Albania 89 132 54 4.9 EU
2 Algeria 25 0 14 0.7 AF
3 Andorra 245 138 312 12.4 EU
4 Angola 217 57 45 5.9 AF
drinks.groupby('continent').agg({'spirit_servings':['mean','min','max']})
spirit_servings
mean min max
continent
AF 16.339623 0 152
AS 60.840909 0 326
EU 132.555556 0 373
OC 58.437500 0 254
SA 114.750000 25 302
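Passing a list of functions for one column, as above, yields a column MultiIndex. A toy sketch (hypothetical values) of how to read the result back:

```python
import pandas as pd

# hypothetical continent/servings pairs
toy = pd.DataFrame({"continent": ["AF", "AF", "EU"],
                    "spirit_servings": [0, 152, 373]})

stats = toy.groupby("continent").agg({"spirit_servings": ["mean", "min", "max"]})

# columns are (spirit_servings, mean), (spirit_servings, min), (spirit_servings, max)
af_max = stats.loc["AF", ("spirit_servings", "max")]
```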

US Crime Data, 1960-2014 (the apply Function)
crime = pd.read_csv('US_Crime_Rates_1960_2014.csv')
crime.head()
Year Population Total Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
0 1960 179323175 3384200 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1 1961 182992000 3488000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
2 1962 185771000 3752200 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
3 1963 188483000 4109500 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300
4 1964 191141000 4564600 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null int64
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: int64(12)
memory usage: 5.3 KB
crime.describe()
Year Population Total Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
count 55.00000 5.500000e+01 5.500000e+01 5.500000e+01 5.500000e+01 55.000000 55.000000 55.000000 5.500000e+01 5.500000e+01 5.500000e+01 5.500000e+01
mean 1987.00000 2.461556e+08 1.060315e+07 1.188806e+06 9.412499e+06 17317.236364 72714.800000 416449.890909 6.811818e+05 2.424017e+06 5.959947e+06 1.028614e+06
std 16.02082 4.166216e+07 3.175931e+06 4.597107e+05 2.738512e+06 4267.442664 28278.006893 150506.403782 2.960255e+05 7.259004e+05 1.846401e+06 3.455693e+05
min 1960.00000 1.793232e+08 3.384200e+06 2.884600e+05 3.095700e+06 8530.000000 17190.000000 106670.000000 1.543200e+05 9.121000e+05 1.855400e+06 3.282000e+05
25% 1973.50000 2.106215e+08 9.096958e+06 9.253150e+05 8.060014e+06 14819.000000 53400.000000 354911.500000 4.384300e+05 2.105336e+06 4.843350e+06 7.896260e+05
50% 1987.00000 2.422829e+08 1.140151e+07 1.322390e+06 1.018259e+07 17030.000000 84230.000000 423557.000000 7.412910e+05 2.329950e+06 6.591900e+06 1.032200e+06
75% 2000.50000 2.833697e+08 1.308449e+07 1.432762e+06 1.160509e+07 20561.500000 92930.000000 512137.000000 9.002150e+05 3.073600e+06 7.168450e+06 1.239302e+06
max 2014.00000 3.188571e+08 1.487290e+07 1.932270e+06 1.296110e+07 24700.000000 109060.000000 687730.000000 1.135610e+06 3.795200e+06 8.142200e+06 1.661700e+06

Convert Year to datetime64
crime.Year = pd.to_datetime(crime.Year, format='%Y')
crime.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55 entries, 0 to 54
Data columns (total 12 columns):
Year                  55 non-null datetime64[ns]
Population            55 non-null int64
Total                 55 non-null int64
Violent               55 non-null int64
Property              55 non-null int64
Murder                55 non-null int64
Forcible_Rape         55 non-null int64
Robbery               55 non-null int64
Aggravated_assault    55 non-null int64
Burglary              55 non-null int64
Larceny_Theft         55 non-null int64
Vehicle_Theft         55 non-null int64
dtypes: datetime64[ns](1), int64(11)
memory usage: 5.3 KB

Set the column Year as the index of the DataFrame
crime.set_index('Year', inplace=True)
crime.head(5)
Population Total Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960-01-01 179323175 3384200 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1961-01-01 182992000 3488000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
1962-01-01 185771000 3752200 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
1963-01-01 188483000 4109500 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300
1964-01-01 191141000 4564600 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800

Delete the column named Total
crime.drop(['Total'],axis=1,inplace=True)
crime.head()
Population Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960-01-01 179323175 288460 3095700 9110 17190 107840 154320 912100 1855400 328200
1961-01-01 182992000 289390 3198600 8740 17220 106670 156760 949600 1913000 336000
1962-01-01 185771000 301510 3450700 8530 17550 110860 164570 994300 2089600 366800
1963-01-01 188483000 316970 3792500 8640 17650 116470 174210 1086400 2297800 408300
1964-01-01 191141000 364220 4200400 9360 21420 130390 203050 1213200 2514400 472800

Resample the data by Year into 10-year bins and sum

crimes = crime.resample('10AS').sum()
# use resample to get the max of the "Population" column
population = crime['Population'].resample('10AS').max()

# update "Population": population is a running total, not a per-year count,
# so the decade maximum is more meaningful than a sum
crimes['Population'] = population

crimes
Population Violent Property Murder Forcible_Rape Robbery Aggravated_assault Burglary Larceny_Theft Vehicle_Theft
Year
1960-01-01 201385000 4134930 45160900 106180 236720 1633510 2158520 13321100 26547700 5292100
1970-01-01 220099000 9607930 91383800 192230 554570 4159020 4702120 28486000 53157800 9739900
1980-01-01 248239000 14074328 117048900 206439 865639 5383109 7619130 33073494 72040253 11935411
1990-01-01 272690813 17527048 119053499 211664 998827 5748930 10568963 26750015 77679366 14624418
2000-01-01 307006550 13968056 100944369 163068 922499 4230366 8652124 21565176 67970291 11412834
2010-01-01 318857056 6072017 44095950 72867 421059 1749809 3764142 10125170 30401698 3569080
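resample('10AS') bins the DatetimeIndex on decade starts. The same buckets can be built by hand with groupby on the year, which is a useful cross-check; a sketch on hypothetical daily data:

```python
import pandas as pd

# hypothetical daily records covering two decades
idx = pd.date_range("1960-01-01", "1979-12-31", freq="D")
toy = pd.DataFrame({"n": 1}, index=idx)

# bucket each date into its decade start and sum, mimicking resample("10AS").sum()
decade = (toy.index.year // 10) * 10
by_decade = toy.groupby(decade)["n"].sum()
```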

When was the most dangerous time to live in US history?
crime.idxmax()
Population           2014-01-01
Violent              1992-01-01
Property             1991-01-01
Murder               1991-01-01
Forcible_Rape        1992-01-01
Robbery              1991-01-01
Aggravated_assault   1993-01-01
Burglary             1980-01-01
Larceny_Theft        1991-01-01
Vehicle_Theft        1991-01-01
dtype: datetime64[ns]
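idxmax returns, for every column, the index label at which that column peaks. A toy sketch with hypothetical numbers:

```python
import pandas as pd

toy = pd.DataFrame({"Murder": [9110, 24700, 8530],
                    "Robbery": [107840, 106670, 687730]},
                   index=[1960, 1961, 1962])

peaks = toy.idxmax()  # one index label per column
```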

Fictitious Name Data (Combining Data)

# run the following code
raw_data_1 = {
'subject_id': ['1', '2', '3', '4', '5'],
'first_name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'last_name': ['Anderson', 'Ackerman', 'Ali', 'Aoni', 'Atiches']}

raw_data_2 = {
'subject_id': ['4', '5', '6', '7', '8'],
'first_name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'last_name': ['Bonder', 'Black', 'Balwner', 'Brice', 'Btisan']}

raw_data_3 = {
'subject_id': ['1', '2', '3', '4', '5', '7', '8', '9', '10', '11'],
'test_id': [51, 15, 15, 61, 16, 14, 15, 1, 61, 16]}
# run the following code
data1 = pd.DataFrame(raw_data_1, columns = ['subject_id', 'first_name', 'last_name'])
data2 = pd.DataFrame(raw_data_2, columns = ['subject_id', 'first_name', 'last_name'])
data3 = pd.DataFrame(raw_data_3, columns = ['subject_id','test_id'])
data1
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
data2
subject_id first_name last_name
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
data3
subject_id test_id
0 1 51
1 2 15
2 3 15
3 4 61
4 5 16
5 7 14
6 8 15
7 9 1
8 10 61
9 11 16

Concatenate data1 and data2 along the row axis and name the result all_data
all_data = pd.concat([data1, data2], axis=0)
all_data
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
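Note the duplicated 0-4 index labels above: concat keeps each frame's own index. Passing ignore_index=True builds a fresh RangeIndex instead; a sketch with hypothetical frames:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

stacked = pd.concat([a, b], ignore_index=True)  # index becomes 0, 1, 2, 3
```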

Concatenate data1 and data2 along the column axis and name the result all_data_col
all_data_col = pd.concat([data1, data2], axis=1)
all_data_col
subject_id first_name last_name subject_id first_name last_name
0 1 Alex Anderson 4 Billy Bonder
1 2 Amy Ackerman 5 Brian Black
2 3 Allen Ali 6 Bran Balwner
3 4 Alice Aoni 7 Bryce Brice
4 5 Ayoung Atiches 8 Betty Btisan

Merge all_data and data3 on the values of subject_id
%%time
pd.merge(all_data, data3, on='subject_id')
Wall time: 3.07 ms
subject_id first_name last_name test_id
0 1 Alex Anderson 51
1 2 Amy Ackerman 15
2 3 Allen Ali 15
3 4 Alice Aoni 61
4 4 Billy Bonder 61
5 5 Ayoung Atiches 16
6 5 Brian Black 16
7 7 Bryce Brice 14
8 8 Betty Btisan 15
%%time
all_data.join(data3.set_index('subject_id'),on='subject_id')
Wall time: 3.38 ms
subject_id first_name last_name test_id
0 1 Alex Anderson 51.0
1 2 Amy Ackerman 15.0
2 3 Allen Ali 15.0
3 4 Alice Aoni 61.0
4 5 Ayoung Atiches 16.0
0 4 Billy Bonder 61.0
1 5 Brian Black 16.0
2 6 Bran Balwner NaN
3 7 Bryce Brice 14.0
4 8 Betty Btisan 15.0

Join data1 and data2 on subject_id
pd.concat([data1, data2])
subject_id first_name last_name
0 1 Alex Anderson
1 2 Amy Ackerman
2 3 Allen Ali
3 4 Alice Aoni
4 5 Ayoung Atiches
0 4 Billy Bonder
1 5 Brian Black
2 6 Bran Balwner
3 7 Bryce Brice
4 8 Betty Btisan
pd.merge(data1, data2, on='subject_id', how='outer')
subject_id first_name_x last_name_x first_name_y last_name_y
0 1 Alex Anderson NaN NaN
1 2 Amy Ackerman NaN NaN
2 3 Allen Ali NaN NaN
3 4 Alice Aoni Billy Bonder
4 5 Ayoung Atiches Brian Black
5 6 NaN NaN Bran Balwner
6 7 NaN NaN Bryce Brice
7 8 NaN NaN Betty Btisan
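In the outer merge above, how= decides which keys survive. Adding indicator=True labels each row with its origin, which helps when debugging joins; a sketch on hypothetical frames:

```python
import pandas as pd

left = pd.DataFrame({"subject_id": ["1", "4"], "v": [10, 40]})
right = pd.DataFrame({"subject_id": ["4", "6"], "w": [400, 600]})

outer = pd.merge(left, right, on="subject_id", how="outer", indicator=True)
```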

Wind Speed Data (Descriptive Statistics)

import datetime

path6 = "wind.data"
data = pd.read_table(path6, sep="\s+", parse_dates=[[0, 1, 2]])
data.head()
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 2061-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 2061-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 2061-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 2061-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 2061-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83
data.describe()
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
count 6568.000000 6571.000000 6572.000000 6569.000000 6572.000000 6574.000000 6571.000000 6572.000000 6571.000000 6573.000000 6574.000000 6570.000000
mean 12.362987 10.644314 11.660526 6.306468 10.455834 7.092254 9.797343 8.495053 8.493590 8.707332 13.121007 15.599079
std 5.618413 5.267356 5.008450 3.605811 4.936125 3.968683 4.977555 4.499449 4.166872 4.503954 5.835037 6.699794
min 0.670000 0.210000 1.500000 0.000000 0.130000 0.000000 0.000000 0.000000 0.000000 0.040000 0.130000 0.670000
25% 8.120000 6.670000 8.000000 3.580000 6.750000 4.000000 6.000000 5.090000 5.370000 5.330000 8.710000 10.710000
50% 11.710000 10.170000 10.920000 5.750000 9.960000 6.830000 9.210000 8.080000 8.170000 8.290000 12.500000 15.000000
75% 15.920000 14.040000 14.670000 8.420000 13.540000 9.670000 12.960000 11.420000 11.190000 11.630000 16.880000 19.830000
max 35.800000 33.370000 33.840000 28.460000 37.540000 26.160000 30.370000 31.080000 25.880000 28.210000 42.380000 42.540000
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6574 entries, 0 to 6573
Data columns (total 13 columns):
Yr_Mo_Dy    6574 non-null datetime64[ns]
RPT         6568 non-null float64
VAL         6571 non-null float64
ROS         6572 non-null float64
KIL         6569 non-null float64
SHA         6572 non-null float64
BIR         6574 non-null float64
DUB         6571 non-null float64
CLA         6572 non-null float64
MUL         6571 non-null float64
CLO         6573 non-null float64
BEL         6574 non-null float64
MAL         6570 non-null float64
dtypes: datetime64[ns](1), float64(12)
memory usage: 667.8 KB

The year 2061? Do we really have data from that year? Create a function to fix this bug

def fix_year(date):
    new_year = date.year - 100 if date.year > 1989 else date.year
    return datetime.date(new_year, date.month, date.day)
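The same fix can also be written without a row-by-row apply, using pd.DateOffset on the whole column; a sketch on hypothetical timestamps:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2061-01-01", "1978-01-01"]))

# subtract 100 years only where the parsed year landed in the future
fixed = dates.where(dates.dt.year <= 1989, dates - pd.DateOffset(years=100))
```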
data['Yr_Mo_Dy'] = data['Yr_Mo_Dy'].apply(fix_year)
data.head()
Yr_Mo_Dy RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
0 1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1 1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
2 1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
3 1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
4 1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83

Set the dates as the index; note the data type, which should be datetime64[ns]
%%time
data['Yr_Mo_Dy']= pd.to_datetime(data["Yr_Mo_Dy"])
data = data.set_index('Yr_Mo_Dy')
data.head()
Wall time: 3.41 ms
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04
1961-01-02 14.71 NaN 10.83 6.50 12.62 7.67 11.50 10.04 9.79 9.67 17.54 13.83
1961-01-03 18.50 16.88 12.33 10.13 11.17 6.17 11.25 NaN 8.50 7.67 12.75 12.71
1961-01-04 10.58 6.63 11.75 4.58 4.54 2.88 8.63 1.79 5.83 5.88 5.46 10.88
1961-01-05 13.33 13.25 11.42 6.17 10.71 8.21 11.92 6.54 10.92 10.34 12.92 11.83

How many values are missing for each location?
data.isnull().sum()
RPT    6
VAL    3
ROS    2
KIL    5
SHA    2
BIR    0
DUB    3
CLA    2
MUL    3
CLO    1
BEL    0
MAL    4
dtype: int64

How many non-missing values are there for each location?
data.count()
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64
# official answer
data.shape[0] - data.isnull().sum()
RPT    6568
VAL    6571
ROS    6572
KIL    6569
SHA    6572
BIR    6574
DUB    6571
CLA    6572
MUL    6571
CLO    6573
BEL    6574
MAL    6570
dtype: int64

Calculate the mean wind speed over the entire dataset
data.mean().mean()
10.227982360836924
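Careful: data.mean().mean() is the mean of the per-column means, which differs slightly from the mean over all individual readings whenever the columns have different numbers of missing values. A toy sketch:

```python
import pandas as pd
import numpy as np

# hypothetical frame with one missing reading in column A
toy = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 5.0]})

mean_of_means = toy.mean().mean()   # (1.0 + 4.0) / 2
overall_mean = toy.stack().mean()   # (1.0 + 3.0 + 5.0) / 3
```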

Create a DataFrame named loc_stats to compute and store the min, max, mean, and standard deviation of the wind speed at each location

loc_stats = pd.DataFrame()
loc_stats['min'] = data.min()
loc_stats['max'] = data.max()
loc_stats['mean'] = data.mean()
loc_stats['std'] = data.std()
loc_stats
min max mean std
RPT 0.67 35.80 12.362987 5.618413
VAL 0.21 33.37 10.644314 5.267356
ROS 1.50 33.84 11.660526 5.008450
KIL 0.00 28.46 6.306468 3.605811
SHA 0.13 37.54 10.455834 4.936125
BIR 0.00 26.16 7.092254 3.968683
DUB 0.00 30.37 9.797343 4.977555
CLA 0.00 31.08 8.495053 4.499449
MUL 0.00 25.88 8.493590 4.166872
CLO 0.04 28.21 8.707332 4.503954
BEL 0.13 42.38 13.121007 5.835037
MAL 0.67 42.54 15.599079 6.699794

Create a DataFrame named day_stats to compute and store the min, max, mean, and standard deviation across all locations
%%time
day_stats = pd.Series()
data_stack = data.stack()
day_stats['min'] = data_stack.min() # min
day_stats['max'] = data_stack.max() # max
day_stats['mean'] = data_stack.mean() # mean
day_stats['std'] = data_stack.std(ddof=1) # standard deviations

day_stats.head()
Wall time: 11 ms
min      0.000000
max     42.540000
mean    10.227884
std      5.603872
dtype: float64
# official answer
# create the dataframe
day_stats = pd.DataFrame()

# this time axis=1, so each statistic is computed across a row
day_stats['min'] = data.min(axis = 1) # min
day_stats['max'] = data.max(axis = 1) # max
day_stats['mean'] = data.mean(axis = 1) # mean
day_stats['std'] = data.std(axis = 1) # standard deviations

day_stats.head()
min max mean std
Yr_Mo_Dy
1961-01-01 9.29 18.50 13.018182 2.808875
1961-01-02 6.50 17.54 11.336364 3.188994
1961-01-03 6.17 18.50 11.641818 3.681912
1961-01-04 1.79 11.75 6.619167 3.198126
1961-01-05 6.17 13.33 10.630000 2.445356
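The official answer hinges on axis: the default axis=0 aggregates down each column (one value per location), while axis=1 aggregates across each row (one value per day). A toy sketch:

```python
import pandas as pd

toy = pd.DataFrame({"A": [1.0, 3.0], "B": [5.0, 7.0]})

per_column_mean = toy.mean()       # axis=0: one value per column
per_row_mean = toy.mean(axis=1)    # axis=1: one value per row
```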

For each location, calculate the mean wind speed in January

Note: January 1961 and January 1962 should be treated separately.
data['date'] = data.index
# create a separate column for each date component
data['month'] = data['date'].apply(lambda date: date.month)
data['year'] = data['date'].apply(lambda date: date.year)
data['day'] = data['date'].apply(lambda date: date.day)
# get all rows where month == 1 and assign them to january_winds
january_winds = data[data['month'] == 1]
january_winds.loc[:,'RPT':"MAL"].mean()
RPT    14.847325
VAL    12.914560
ROS    13.299624
KIL     7.199498
SHA    11.667734
BIR     8.054839
DUB    11.819355
CLA     9.512047
MUL     9.543208
CLO    10.053566
BEL    14.550520
MAL    18.028763
dtype: float64
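Unlike positional slicing, the label slice 'RPT':'MAL' used above includes both endpoints; a toy sketch:

```python
import pandas as pd

toy = pd.DataFrame([[1, 2, 3, 4]], columns=["RPT", "VAL", "ROS", "MAL"])

sub = toy.loc[:, "RPT":"ROS"]  # label slice: "ROS" itself is included
```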

Downsample the records to a yearly frequency
%%time
data.query('month == 1 and day == 1')
Wall time: 3.38 ms
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL date month year day
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 1961-01-01 1 1961 1
1962-01-01 9.29 3.42 11.54 3.50 2.21 1.96 10.41 2.79 3.54 5.17 4.38 7.92 1962-01-01 1 1962 1
1963-01-01 15.59 13.62 19.79 8.38 12.25 10.00 23.45 15.71 13.59 14.37 17.58 34.13 1963-01-01 1 1963 1
1964-01-01 25.80 22.13 18.21 13.25 21.29 14.79 14.12 19.58 13.25 16.75 28.96 21.00 1964-01-01 1 1964 1
1965-01-01 9.54 11.92 9.00 4.38 6.08 5.21 10.25 6.08 5.71 8.63 12.04 17.41 1965-01-01 1 1965 1
1966-01-01 22.04 21.50 17.08 12.75 22.17 15.59 21.79 18.12 16.66 17.83 28.33 23.79 1966-01-01 1 1966 1
1967-01-01 6.46 4.46 6.50 3.21 6.67 3.79 11.38 3.83 7.71 9.08 10.67 20.91 1967-01-01 1 1967 1
1968-01-01 30.04 17.88 16.25 16.25 21.79 12.54 18.16 16.62 18.75 17.62 22.25 27.29 1968-01-01 1 1968 1
1969-01-01 6.13 1.63 5.41 1.08 2.54 1.00 8.50 2.42 4.58 6.34 9.17 16.71 1969-01-01 1 1969 1
1970-01-01 9.59 2.96 11.79 3.42 6.13 4.08 9.00 4.46 7.29 3.50 7.33 13.00 1970-01-01 1 1970 1
1971-01-01 3.71 0.79 4.71 0.17 1.42 1.04 4.63 0.75 1.54 1.08 4.21 9.54 1971-01-01 1 1971 1
1972-01-01 9.29 3.63 14.54 4.25 6.75 4.42 13.00 5.33 10.04 8.54 8.71 19.17 1972-01-01 1 1972 1
1973-01-01 16.50 15.92 14.62 7.41 8.29 11.21 13.54 7.79 10.46 10.79 13.37 9.71 1973-01-01 1 1973 1
1974-01-01 23.21 16.54 16.08 9.75 15.83 11.46 9.54 13.54 13.83 16.66 17.21 25.29 1974-01-01 1 1974 1
1975-01-01 14.04 13.54 11.29 5.46 12.58 5.58 8.12 8.96 9.29 5.17 7.71 11.63 1975-01-01 1 1975 1
1976-01-01 18.34 17.67 14.83 8.00 16.62 10.13 13.17 9.04 13.13 5.75 11.38 14.96 1976-01-01 1 1976 1
1977-01-01 20.04 11.92 20.25 9.13 9.29 8.04 10.75 5.88 9.00 9.00 14.88 25.70 1977-01-01 1 1977 1
1978-01-01 8.33 7.12 7.71 3.54 8.50 7.50 14.71 10.00 11.83 10.00 15.09 20.46 1978-01-01 1 1978 1
%%time
data[(data['month']==1) & (data['day']==1)]
Wall time: 997 µs
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL date month year day
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 1961-01-01 1 1961 1
1962-01-01 9.29 3.42 11.54 3.50 2.21 1.96 10.41 2.79 3.54 5.17 4.38 7.92 1962-01-01 1 1962 1
1963-01-01 15.59 13.62 19.79 8.38 12.25 10.00 23.45 15.71 13.59 14.37 17.58 34.13 1963-01-01 1 1963 1
1964-01-01 25.80 22.13 18.21 13.25 21.29 14.79 14.12 19.58 13.25 16.75 28.96 21.00 1964-01-01 1 1964 1
1965-01-01 9.54 11.92 9.00 4.38 6.08 5.21 10.25 6.08 5.71 8.63 12.04 17.41 1965-01-01 1 1965 1
1966-01-01 22.04 21.50 17.08 12.75 22.17 15.59 21.79 18.12 16.66 17.83 28.33 23.79 1966-01-01 1 1966 1
1967-01-01 6.46 4.46 6.50 3.21 6.67 3.79 11.38 3.83 7.71 9.08 10.67 20.91 1967-01-01 1 1967 1
1968-01-01 30.04 17.88 16.25 16.25 21.79 12.54 18.16 16.62 18.75 17.62 22.25 27.29 1968-01-01 1 1968 1
1969-01-01 6.13 1.63 5.41 1.08 2.54 1.00 8.50 2.42 4.58 6.34 9.17 16.71 1969-01-01 1 1969 1
1970-01-01 9.59 2.96 11.79 3.42 6.13 4.08 9.00 4.46 7.29 3.50 7.33 13.00 1970-01-01 1 1970 1
1971-01-01 3.71 0.79 4.71 0.17 1.42 1.04 4.63 0.75 1.54 1.08 4.21 9.54 1971-01-01 1 1971 1
1972-01-01 9.29 3.63 14.54 4.25 6.75 4.42 13.00 5.33 10.04 8.54 8.71 19.17 1972-01-01 1 1972 1
1973-01-01 16.50 15.92 14.62 7.41 8.29 11.21 13.54 7.79 10.46 10.79 13.37 9.71 1973-01-01 1 1973 1
1974-01-01 23.21 16.54 16.08 9.75 15.83 11.46 9.54 13.54 13.83 16.66 17.21 25.29 1974-01-01 1 1974 1
1975-01-01 14.04 13.54 11.29 5.46 12.58 5.58 8.12 8.96 9.29 5.17 7.71 11.63 1975-01-01 1 1975 1
1976-01-01 18.34 17.67 14.83 8.00 16.62 10.13 13.17 9.04 13.13 5.75 11.38 14.96 1976-01-01 1 1976 1
1977-01-01 20.04 11.92 20.25 9.13 9.29 8.04 10.75 5.88 9.00 9.00 14.88 25.70 1977-01-01 1 1977 1
1978-01-01 8.33 7.12 7.71 3.54 8.50 7.50 14.71 10.00 11.83 10.00 15.09 20.46 1978-01-01 1 1978 1

Sample the records at a monthly frequency by taking the reading from the first day of each month

python
data[data["day"]==1]
RPT VAL ROS KIL SHA BIR DUB CLA MUL CLO BEL MAL date month year day
Yr_Mo_Dy
1961-01-01 15.04 14.96 13.17 9.29 NaN 9.87 13.67 10.25 10.83 12.58 18.50 15.04 1961-01-01 1 1961 1
1961-02-01 14.25 15.12 9.04 5.88 12.08 7.17 10.17 3.63 6.50 5.50 9.17 8.00 1961-02-01 2 1961 1
1961-03-01 12.67 13.13 11.79 6.42 9.79 8.54 10.25 13.29 NaN 12.21 20.62 NaN 1961-03-01 3 1961 1
1961-04-01 8.38 6.34 8.33 6.75 9.33 9.54 11.67 8.21 11.21 6.46 11.96 7.17 1961-04-01 4 1961 1
1961-05-01 15.87 13.88 15.37 9.79 13.46 10.17 9.96 14.04 9.75 9.92 18.63 11.12 1961-05-01 5 1961 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1978-08-01 19.33 15.09 20.17 8.83 12.62 10.41 9.33 12.33 9.50 9.92 15.75 18.00 1978-08-01 8 1978 1
1978-09-01 8.42 6.13 9.87 5.25 3.21 5.71 7.25 3.50 7.33 6.50 7.62 15.96 1978-09-01 9 1978 1
1978-10-01 9.50 6.83 10.50 3.88 6.13 4.58 4.21 6.50 6.38 6.54 10.63 14.09 1978-10-01 10 1978 1
1978-11-01 13.59 16.75 11.25 7.08 11.04 8.33 8.17 11.29 10.75 11.25 23.13 25.00 1978-11-01 11 1978 1
1978-12-01 21.29 16.29 24.04 12.79 18.21 19.29 21.54 17.21 16.71 17.83 17.75 25.70 1978-12-01 12 1978 1

216 rows × 16 columns
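The day == 1 filter above keeps one reading per month; pandas can also aggregate at a monthly frequency directly with resample. A minimal sketch on toy data (the DataFrame and values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy daily series standing in for the wind data: 90 days, one column
idx = pd.date_range('1961-01-01', periods=90, freq='D')
df = pd.DataFrame({'RPT': np.arange(90.0)}, index=idx)

# 'MS' (month start) groups the rows by calendar month and labels each
# group with the first day of that month
monthly = df.resample('MS').mean()
print(monthly)
```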

Exploring the Titanic disaster data (visualization)

python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

%matplotlib inline
python
path7 = 'train.csv'
python
titanic = pd.read_csv(path7)
python
titanic.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Set PassengerId as the index

python
titanic.set_index('PassengerId').head()
Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
PassengerId
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
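Note that set_index returns a new DataFrame: the call above only previews the result, and titanic itself keeps its RangeIndex. A small sketch on toy data showing the difference:

```python
import pandas as pd

# Toy two-column frame standing in for the Titanic data
df = pd.DataFrame({'PassengerId': [1, 2, 3], 'Survived': [0, 1, 1]})

df.set_index('PassengerId')        # returns a new frame; df is unchanged
assert 'PassengerId' in df.columns

df = df.set_index('PassengerId')   # assign back (or pass inplace=True) to persist
assert df.index.name == 'PassengerId'
```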

Draw a pie chart showing the proportion of male and female passengers

python
# Run the following code
# sum the instances of males and females
males = (titanic['Sex'] == 'male').sum()
females = (titanic['Sex'] == 'female').sum()

# put them into a list called proportions
proportions = [males, females]

# Create a pie chart
plt.pie(
    # using proportions
    proportions,

    # with the labels 'Males' and 'Females'
    labels = ['Males', 'Females'],

    # with no shadow
    shadow = False,

    # with colors
    colors = ['blue', 'red'],

    # with the first slice exploded out
    explode = (0.15, 0),

    # with the start angle at 90 degrees
    startangle = 90,

    # with percentages shown to one decimal place
    autopct = '%1.1f%%'
)

# force the pie to be drawn as a circle
plt.axis('equal')

# Set the title
plt.title("Sex Proportion")

# View the plot
plt.tight_layout()
plt.show()

[figure: pie chart of passenger sex proportions]
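The two boolean sums above can also be computed in one step with value_counts; a sketch on a made-up Sex column:

```python
import pandas as pd

# Toy column standing in for titanic['Sex']
sex = pd.Series(['male', 'female', 'male', 'male', 'female'])

counts = sex.value_counts()                  # absolute counts per category
shares = sex.value_counts(normalize=True)    # fractions that sum to 1
print(counts)
print(shares)
```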

Draw a scatter plot of ticket Fare against passenger Age, colored by Sex

python
# Run the following code
# create the scatter plot with seaborn (no regression line)
lm = sns.lmplot(x = 'Age', y = 'Fare', data = titanic, hue = 'Sex', fit_reg=False)

# set title
lm.set(title = 'Fare x Age')

# get the axes object and tweak it
axes = lm.axes
axes[0,0].set_ylim(-5,)
axes[0,0].set_xlim(-5,85)
(-5, 85)

[figure: scatter plot of Fare vs Age, colored by Sex]

How many passengers survived?

python
titanic.Survived.sum()
342
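Because Survived is 0/1 coded, sum counts the survivors and mean gives the survival rate directly. A toy sketch:

```python
import pandas as pd

# Made-up 0/1 outcomes standing in for titanic.Survived
survived = pd.Series([0, 1, 1, 0, 1])

survivors = survived.sum()    # number of 1s, i.e. passengers who survived
rate = survived.mean()        # fraction of 1s, i.e. the survival rate
```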

Draw a histogram of ticket fares

python
# Run the following code
# sort the fares in descending order
df = titanic.Fare.sort_values(ascending = False)

# create bin edges using numpy
binsVal = np.arange(0, 600, 10)

# create the plot
plt.hist(df, bins = binsVal)

# Set the title and labels
plt.xlabel('Fare')
plt.ylabel('Frequency')
plt.title('Fare Paid Histogram')

# show the plot
plt.show()

[figure: histogram of fares]
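The same bin edges can be fed to np.histogram to inspect the counts behind the plot; a sketch with made-up fares:

```python
import numpy as np

fares = np.array([5.0, 7.25, 8.05, 12.0, 55.0, 512.33])  # made-up fares
bins = np.arange(0, 600, 10)                              # same edges as the plot

counts, edges = np.histogram(fares, bins=bins)
# counts[i] is the number of fares falling in [edges[i], edges[i+1])
```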

Exploring Pokemon data (creating a DataFrame)

python
raw_data = {"name": ['Bulbasaur', 'Charmander','Squirtle','Caterpie'],
"evolution": ['Ivysaur','Charmeleon','Wartortle','Metapod'],
"type": ['grass', 'fire', 'water', 'bug'],
"hp": [45, 39, 44, 45],
"pokedex": ['yes', 'no','yes','no']
}
python
pokemon = pd.DataFrame(raw_data)
pokemon.head()
name evolution type hp pokedex
0 Bulbasaur Ivysaur grass 45 yes
1 Charmander Charmeleon fire 39 no
2 Squirtle Wartortle water 44 yes
3 Caterpie Metapod bug 45 no

Reorder the DataFrame's columns to name, type, hp, evolution, pokedex

python
pokemon = pokemon[['name', 'type', 'hp', 'evolution','pokedex']]
pokemon
name type hp evolution pokedex
0 Bulbasaur grass 45 Ivysaur yes
1 Charmander fire 39 Charmeleon no
2 Squirtle water 44 Wartortle yes
3 Caterpie bug 45 Metapod no
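reindex(columns=...) achieves the same reordering; unlike plain column selection, which raises KeyError on an unknown name, reindex would return an all-NaN column for it. A sketch on a one-row copy of the data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Bulbasaur'], 'evolution': ['Ivysaur'],
                   'type': ['grass'], 'hp': [45], 'pokedex': ['yes']})

# reorder the columns by label
df = df.reindex(columns=['name', 'type', 'hp', 'evolution', 'pokedex'])
```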

Add a place column

python
pokemon['place'] = ['park','street','lake','forest']
pokemon
name type hp evolution pokedex place
0 Bulbasaur grass 45 Ivysaur yes park
1 Charmander fire 39 Charmeleon no street
2 Squirtle water 44 Wartortle yes lake
3 Caterpie bug 45 Metapod no forest

Check each column's data type

python
pokemon.dtypes
name         object
type         object
hp            int64
evolution    object
pokedex      object
place        object
dtype: object
python
pokemon.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 6 columns):
name         4 non-null object
type         4 non-null object
hp           4 non-null int64
evolution    4 non-null object
pokedex      4 non-null object
place        4 non-null object
dtypes: int64(1), object(5)
memory usage: 320.0+ bytes
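When a dtype is not what you want, a column can be converted explicitly. As an illustrative aside (not part of the original exercise), the yes/no strings in pokedex could be mapped to booleans:

```python
import pandas as pd

df = pd.DataFrame({'pokedex': ['yes', 'no', 'yes'], 'hp': [45, 39, 44]})

# map the strings to booleans; the column dtype becomes bool
df['pokedex'] = df['pokedex'].map({'yes': True, 'no': False})
```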

Exploring Apple stock price data (time series)

python
# Run the following code
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt

%matplotlib inline
python
path9 = 'Apple_stock.csv'
python
apple = pd.read_csv(path9)
apple.head()
Date Open High Low Close Volume Adj Close
0 2014-07-08 96.27 96.80 93.92 95.35 65130000 95.35
1 2014-07-07 94.14 95.99 94.10 95.97 56305400 95.97
2 2014-07-03 93.67 94.10 93.20 94.03 22891800 94.03
3 2014-07-02 93.87 94.06 93.09 93.48 28420900 93.48
4 2014-07-01 93.52 94.07 93.13 93.52 38170200 93.52
python
apple.dtypes
Date          object
Open         float64
High         float64
Low          float64
Close        float64
Volume         int64
Adj Close    float64
dtype: object
python
apple.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8465 entries, 0 to 8464
Data columns (total 7 columns):
Date         8465 non-null object
Open         8465 non-null float64
High         8465 non-null float64
Low          8465 non-null float64
Close        8465 non-null float64
Volume       8465 non-null int64
Adj Close    8465 non-null float64
dtypes: float64(5), int64(1), object(1)
memory usage: 463.1+ KB
python
apple.describe()
Open High Low Close Volume Adj Close
count 8465.000000 8465.000000 8465.000000 8465.000000 8.465000e+03 8465.000000
mean 98.593926 99.848683 97.226790 98.549199 9.391214e+07 11.095014
std 141.965350 143.155674 140.548326 141.877244 8.993202e+07 21.133016
min 11.120000 11.120000 11.000000 11.000000 2.504000e+05 0.170000
25% 25.750000 26.000000 25.120000 25.630000 3.704960e+07 0.900000
50% 40.750000 41.500000 40.000000 40.750000 6.636000e+07 1.410000
75% 79.500000 81.000000 78.740000 80.000000 1.177701e+08 8.660000
max 702.410000 705.070000 699.570000 702.100000 1.855410e+09 96.280000

Convert the Date column to datetime

python
apple['Date'] = pd.to_datetime(apple['Date'])
python
apple.dtypes
Date         datetime64[ns]
Open                float64
High                float64
Low                 float64
Close               float64
Volume                int64
Adj Close           float64
dtype: object
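Passing an explicit format to to_datetime avoids per-element format guessing and fails loudly on malformed rows; a sketch on toy date strings:

```python
import pandas as pd

raw = pd.Series(['2014-07-08', '2014-07-07'])

# the format string must match the input exactly
dates = pd.to_datetime(raw, format='%Y-%m-%d')
```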

Set Date as the index

python
apple = apple.set_index('Date')
python
apple.head()
Open High Low Close Volume Adj Close
Date
2014-07-08 96.27 96.80 93.92 95.35 65130000 95.35
2014-07-07 94.14 95.99 94.10 95.97 56305400 95.97
2014-07-03 93.67 94.10 93.20 94.03 22891800 94.03
2014-07-02 93.87 94.06 93.09 93.48 28420900 93.48
2014-07-01 93.52 94.07 93.13 93.52 38170200 93.52

Are there any duplicate dates?

python
len(apple.index) - len(apple.index.unique())
0

Official solution

python
apple.index.is_unique
True
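If is_unique were False, Index.duplicated would show which labels repeat; a sketch on a toy index with one duplicated date (values made up):

```python
import pandas as pd

idx = pd.to_datetime(['2014-07-01', '2014-07-02', '2014-07-02'])
df = pd.DataFrame({'Close': [93.52, 93.48, 93.48]}, index=idx)

assert not df.index.is_unique       # 2014-07-02 appears twice

# marks every occurrence after the first as a duplicate
dupes = df.index.duplicated()
```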

Sort the index in ascending order

python
apple.sort_index(ascending=True, inplace=True)
python
apple.head()
Open High Low Close Volume Adj Close
Date
1980-12-12 28.75 28.87 28.75 28.75 117258400 0.45
1980-12-15 27.38 27.38 27.25 27.25 43971200 0.42
1980-12-16 25.37 25.37 25.25 25.25 26432000 0.39
1980-12-17 25.87 26.00 25.87 25.87 21610400 0.40
1980-12-18 26.63 26.75 26.63 26.63 18362400 0.41

Find the last business day of each month

Since the index is sorted by date, a row is the last trading day of its month exactly when the next row's day-of-month is smaller. The final row has no successor, so fillna(100) gives it a large "next day", which keeps the (possibly incomplete) last month out of the result.

python
apple['year']=apple.index.year
apple['month']=apple.index.month
apple['day']=apple.index.day
python
apple.day.shift(-1).fillna(100)
Date
1980-12-12     15.0
1980-12-15     16.0
1980-12-16     17.0
1980-12-17     18.0
1980-12-18     19.0
              ...  
2014-07-01      2.0
2014-07-02      3.0
2014-07-03      7.0
2014-07-07      8.0
2014-07-08    100.0
Name: day, Length: 8465, dtype: float64
python
diff = apple.day - apple.day.shift(-1).fillna(100)
python
apple.loc[diff[diff>0].index]
Open High Low Close Volume Adj Close year month day
Date
1980-12-31 34.25 34.25 34.13 34.13 8937600 0.53 1980 12 31
1981-01-30 28.50 28.50 28.25 28.25 11547200 0.44 1981 1 30
1981-02-27 26.50 26.75 26.50 26.50 3690400 0.41 1981 2 27
1981-03-31 24.75 24.75 24.50 24.50 3998400 0.38 1981 3 31
1981-04-30 28.38 28.62 28.38 28.38 3152800 0.44 1981 4 30
... ... ... ... ... ... ... ... ... ...
2014-02-28 529.08 532.75 522.12 526.24 92992200 74.76 2014 2 28
2014-03-31 539.23 540.81 535.93 536.74 42167300 76.25 2014 3 31
2014-04-30 592.64 599.43 589.80 590.09 114160200 83.83 2014 4 30
2014-05-30 637.98 644.17 628.90 633.00 141005200 90.43 2014 5 30
2014-06-30 92.10 93.73 92.09 92.93 49482300 92.93 2014 6 30

403 rows × 9 columns
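An alternative to the day-difference trick is to group by year and month and keep each group's final row; a sketch on a toy four-day frame (values made up):

```python
import pandas as pd

idx = pd.to_datetime(['1980-12-30', '1980-12-31', '1981-01-29', '1981-01-30'])
df = pd.DataFrame({'Close': [34.00, 34.13, 28.30, 28.25]}, index=idx)

# rows within each (year, month) group keep their date order,
# so tail(1) is the last trading day of that month
last_days = df.groupby([df.index.year, df.index.month]).tail(1)
```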

How many days apart are the earliest and latest dates in the dataset?

python
apple.index.max()-apple.index.min()
Timedelta('12261 days 00:00:00')
python
(apple.index.max()-apple.index.min()).days
12261

How many months are there in the data?

python
apple_months = apple.resample('BM').mean()
len(apple_months.index)
404
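Counting distinct year-month periods gives the number of months that actually contain data (this may differ from resample's bin count if some months in the range are empty); a sketch on a toy index:

```python
import pandas as pd

idx = pd.to_datetime(['1980-12-12', '1980-12-15', '1981-01-05', '1981-02-02'])

# collapse each timestamp to its year-month period, then count distinct ones
n_months = idx.to_period('M').nunique()
```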

Plot the Adj Close values in chronological order

python
# Run the following code
# make the plot and assign it to a variable
appl_open = apple['Adj Close'].plot(title = "Apple Stock")

[figure: Apple Adj Close over time]

Exploring the Iris flower data (deleting data)

python
path10 = 'iris.csv'
python
iris = pd.read_csv(path10,header=None)
iris.head()
0 1 2 3 4
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Create column names for the DataFrame

python
iris.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
python
iris.head()
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Are there missing values in the DataFrame?

python
iris.isnull().any()
sepal_length    False
sepal_width     False
petal_length    False
petal_width     False
class           False
dtype: bool

Set rows 10 through 20 of the column petal_length to missing values (note that .loc slicing by label includes both endpoints, so 10:20 covers 11 rows)

python
iris.loc[10:20, 'petal_length'] = np.nan
python
iris.loc[10:20,:]
sepal_length sepal_width petal_length petal_width class
10 5.4 3.7 NaN 0.2 Iris-setosa
11 4.8 3.4 NaN 0.2 Iris-setosa
12 4.8 3.0 NaN 0.1 Iris-setosa
13 4.3 3.0 NaN 0.1 Iris-setosa
14 5.8 4.0 NaN 0.2 Iris-setosa
15 5.7 4.4 NaN 0.4 Iris-setosa
16 5.4 3.9 NaN 0.4 Iris-setosa
17 5.1 3.5 NaN 0.3 Iris-setosa
18 5.7 3.8 NaN 0.3 Iris-setosa
19 5.1 3.8 NaN 0.3 Iris-setosa
20 5.4 3.4 NaN 0.2 Iris-setosa
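The slice 10:20 selects 11 rows because .loc slices by label and includes both endpoints, unlike .iloc, which slices by position and excludes the stop. A toy sketch of the difference:

```python
import pandas as pd

s = pd.Series(range(30))

# label-based slicing: both endpoints included -> labels 10..20, 11 values
assert len(s.loc[10:20]) == 11

# position-based slicing: stop excluded -> positions 10..19, 10 values
assert len(s.iloc[10:20]) == 10
```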

Replace all missing values with 1.0

python
iris = iris.fillna(1.0)
python
iris.head(20)
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.0 0.2 Iris-setosa
11 4.8 3.4 1.0 0.2 Iris-setosa
12 4.8 3.0 1.0 0.1 Iris-setosa
13 4.3 3.0 1.0 0.1 Iris-setosa
14 5.8 4.0 1.0 0.2 Iris-setosa
15 5.7 4.4 1.0 0.4 Iris-setosa
16 5.4 3.9 1.0 0.4 Iris-setosa
17 5.1 3.5 1.0 0.3 Iris-setosa
18 5.7 3.8 1.0 0.3 Iris-setosa
19 5.1 3.8 1.0 0.3 Iris-setosa

Delete the column class

python
iris.drop('class',inplace=True,axis=1)
python
iris.head()
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2

Set the first three rows of the DataFrame to missing values

python
iris.iloc[0:3, :] = np.nan
iris
sepal_length sepal_width petal_length petal_width
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
... ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8

150 rows × 4 columns

Delete rows that contain missing values

python
iris.dropna(axis=0, inplace=True)
python
iris
sepal_length sepal_width petal_length petal_width
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
5 5.4 3.9 1.7 0.4
6 4.6 3.4 1.4 0.3
7 5.0 3.4 1.5 0.2
... ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8

147 rows × 4 columns
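dropna has knobs beyond the default: how='all' drops only rows that are entirely missing, and subset restricts which columns are examined. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1.0, 2.0],
                   'b': [np.nan, np.nan, 3.0]})

all_missing_dropped = df.dropna(how='all')   # drops only row 0 (entirely NaN)
any_missing_dropped = df.dropna(how='any')   # the default: drops rows 0 and 1
subset_dropped = df.dropna(subset=['a'])     # only column a decides: drops row 0
```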

Reset the index

python
iris.reset_index(inplace=True, drop=True)
iris
sepal_length sepal_width petal_length petal_width
0 4.6 3.1 1.5 0.2
1 5.0 3.6 1.4 0.2
2 5.4 3.9 1.7 0.4
3 4.6 3.4 1.4 0.3
4 5.0 3.4 1.5 0.2
... ... ... ... ...
142 6.7 3.0 5.2 2.3
143 6.3 2.5 5.0 1.9
144 6.5 3.0 5.2 2.0
145 6.2 3.4 5.4 2.3
146 5.9 3.0 5.1 1.8

147 rows × 4 columns

