Day 7. Machine Learning [ Decision Trees ] ( Weather Classification )

2019. 6. 16. 19:40

import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [6]:

data = pd.read_csv('./daily_weather.csv')

In [8]:

data.describe()

Out[8]:

	number	air_pressure_9am	air_temp_9am	avg_wind_direction_9am	avg_wind_speed_9am	max_wind_direction_9am	max_wind_speed_9am	rain_accumulation_9am	rain_duration_9am	relative_humidity_9am	relative_humidity_3pm
count	1095.000000	1092.000000	1090.000000	1091.000000	1092.000000	1092.000000	1091.000000	1089.000000	1092.000000	1095.000000	1095.000000
mean	547.000000	918.882551	64.933001	142.235511	5.508284	148.953518	7.019514	0.203079	294.108052	34.241402	35.344727
std	316.243577	3.184161	11.175514	69.137859	4.552813	67.238013	5.598209	1.593952	1598.078779	25.472067	22.524079
min	0.000000	907.990000	36.752000	15.500000	0.693451	28.900000	1.185578	0.000000	0.000000	6.090000	5.300000
25%	273.500000	916.550000	57.281000	65.972506	2.248768	76.553003	3.067477	0.000000	0.000000	15.092243	17.395000
50%	547.000000	918.921045	65.715479	166.000000	3.871333	177.300000	4.943637	0.000000	0.000000	23.179259	24.380000
75%	820.500000	921.160073	73.450974	191.000000	7.337163	201.233153	8.947760	0.000000	0.000000	45.400000	52.060000
max	1094.000000	929.320000	98.906000	343.400000	23.554978	312.200000	29.840780	24.020000	17704.000000	92.620000	92.250000

In [9]:

data.isnull().any()

Out[9]:

number                    False
air_pressure_9am           True
air_temp_9am               True
avg_wind_direction_9am     True
avg_wind_speed_9am         True
max_wind_direction_9am     True
max_wind_speed_9am         True
rain_accumulation_9am      True
rain_duration_9am          True
relative_humidity_9am     False
relative_humidity_3pm     False
dtype: bool

In [24]:

data[data.isnull().any(axis = 1)].count()

Out[24]:

number                    31
air_pressure_9am          28
air_temp_9am              26
avg_wind_direction_9am    27
avg_wind_speed_9am        28
max_wind_direction_9am    28
max_wind_speed_9am        27
rain_accumulation_9am     25
rain_duration_9am         28
relative_humidity_9am     31
relative_humidity_3pm     31
dtype: int64
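
Note: the counts above are taken over only those rows that contain at least one missing value, so the 31 in the fully populated columns is the number of incomplete rows. A minimal sketch of getting that number directly, using a small made-up frame (the toy values are assumptions, not taken from daily_weather.csv):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "air_temp_9am": [60.1, np.nan, 72.3, 55.0],
    "rain_duration_9am": [0.0, 10.0, np.nan, 0.0],
    "relative_humidity_9am": [30.0, 45.0, 20.0, 80.0],
})

# Boolean Series: True for every row with at least one NaN
has_missing = toy.isnull().any(axis=1)

# Summing booleans counts the True values
n_incomplete = int(has_missing.sum())
print(n_incomplete)

# dropna() removes exactly those rows
print(len(toy) - len(toy.dropna()))
```

Both printed numbers should agree, which is the same check the notebook does later with before_rows - data.shape[0].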

In [25]:

del data['number']

In [26]:

# Store the number of rows before dropping rows with missing values
before_rows = data.shape[0]
before_rows

Out[26]:

1095

In [28]:

data = data.dropna()

In [29]:

before_rows - data.shape[0]

Out[29]:

31

In [30]:

clean_data = data.copy()

In [31]:

# Turn the Boolean values into integers (0/1) by multiplying by 1
clean_data['high_humidity_label'] = (clean_data['relative_humidity_3pm'] > 24.99 ) * 1
clean_data['high_humidity_label']

Out[31]:

0       1
1       0
2       0
3       0
4       1
5       1
6       0
7       1
8       0
9       1
10      1
11      1
12      1
13      1
14      0
15      0
17      0
18      1
19      0
20      0
21      1
22      0
23      1
24      0
25      1
26      1
27      1
28      1
29      1
30      1
       ..
1064    1
1065    1
1067    1
1068    1
1069    1
1070    1
1071    1
1072    0
1073    1
1074    1
1075    0
1076    0
1077    1
1078    0
1079    1
1080    0
1081    0
1082    1
1083    1
1084    1
1085    1
1086    1
1087    1
1088    1
1089    1
1090    1
1091    1
1092    1
1093    1
1094    0
Name: high_humidity_label, Length: 1064, dtype: int64

In [36]:

# df['col'] returns a Series; df[['col']] returns a DataFrame
y = clean_data[['high_humidity_label']].copy()
type(y)

Out[36]:

pandas.core.frame.DataFrame
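
The single-bracket versus double-bracket difference can be checked on a tiny frame. This is only a sketch; the frame `df` and its values are made up for illustration:

```python
import pandas as pd

# Toy frame, invented for illustration only
df = pd.DataFrame({
    "high_humidity_label": [1, 0, 0],
    "air_temp_9am": [74.8, 71.4, 60.6],
})

as_series = df["high_humidity_label"]    # single brackets: one column as a Series
as_frame = df[["high_humidity_label"]]   # double brackets: a one-column DataFrame

print(type(as_series).__name__, as_series.shape)  # Series (3,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
```

The Series is one-dimensional, while the DataFrame keeps a column axis even with a single column.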

In [37]:

y.head()

Out[37]:

	high_humidity_label
0	1
1	0
2	0
3	0
4	1

In [38]:

clean_data['relative_humidity_3pm'].head()

Out[38]:

0    36.160000
1    19.426597
2    14.460000
3    12.742547
4    76.740000
Name: relative_humidity_3pm, dtype: float64
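
These five values line up with the labels in y.head(): only 36.16 and 76.74 exceed the 24.99 threshold. A small sketch reproducing that, and showing `astype(int)` as a more explicit alternative to multiplying by 1 (the variable names here are my own, not from the notebook):

```python
import pandas as pd

# First five relative_humidity_3pm values shown above
humidity_3pm = pd.Series([36.160000, 19.426597, 14.460000, 12.742547, 76.740000])

# Notebook style: boolean Series times 1
label_mul = (humidity_3pm > 24.99) * 1

# Alternative: cast the boolean Series to integers explicitly
label_cast = (humidity_3pm > 24.99).astype(int)

print(label_cast.tolist())  # [1, 0, 0, 0, 1]
```

Both forms give the same 0/1 labels; which one to use is a matter of taste.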

In [40]:

data.columns

Out[40]:

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm'],
      dtype='object')

In [41]:

morning_features = ['air_pressure_9am','air_temp_9am','avg_wind_direction_9am',
                   'avg_wind_speed_9am','max_wind_direction_9am','max_wind_speed_9am',
                   'rain_accumulation_9am','rain_duration_9am']

In [42]:

clean_data.columns

Out[42]:

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am', 'relative_humidity_9am',
       'relative_humidity_3pm', 'high_humidity_label'],
      dtype='object')

In [43]:

x = clean_data[morning_features].copy()

In [44]:

x.columns

Out[44]:

Index(['air_pressure_9am', 'air_temp_9am', 'avg_wind_direction_9am',
       'avg_wind_speed_9am', 'max_wind_direction_9am', 'max_wind_speed_9am',
       'rain_accumulation_9am', 'rain_duration_9am'],
      dtype='object')

In [45]:

y.columns

Out[45]:

Index(['high_humidity_label'], dtype='object')

In [46]:

# Split the two DataFrames (x, y) into four: train and test sets for each
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.33, random_state = 324 )

In [47]:

type(x_train)

Out[47]:

pandas.core.frame.DataFrame

In [48]:

type(x_test)

Out[48]:

pandas.core.frame.DataFrame

In [49]:

type(y_train)

Out[49]:

pandas.core.frame.DataFrame

In [50]:

type(y_test)

Out[50]:

pandas.core.frame.DataFrame

In [51]:

x_train.head()

Out[51]:

	air_pressure_9am	air_temp_9am	avg_wind_direction_9am	avg_wind_speed_9am	max_wind_direction_9am	max_wind_speed_9am
841	918.370000	72.932000	184.500000	2.013246	186.700000	2.773806
75	920.100000	53.492000	186.100000	13.444009	193.800000	15.367778
95	927.610000	54.896000	55.000000	4.988376	53.400000	7.202947
895	919.235153	65.951112	194.343333	2.942019	216.569792	3.658810
699	919.888128	68.687822	228.517730	3.960858	247.954028	5.185547

In [52]:

y_train.describe()

Out[52]:

	high_humidity_label
count	712.000000
mean	0.494382
std	0.500320
min	0.000000
25%	0.000000
50%	0.000000
75%	1.000000
max	1.000000
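
The count of 712 follows from the split sizes: with 1064 clean rows and test_size = 0.33, train_test_split keeps 712 rows for training and 352 for testing. A sketch that checks this on a synthetic frame of the same length (the frame contents and the names `x_toy`, `y_toy` are assumptions for illustration, not the weather data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 1064  # number of rows left after dropna() in the notebook

# Synthetic stand-ins for x and y
x_toy = pd.DataFrame({"feature": np.arange(n_rows)})
y_toy = pd.DataFrame({"label": np.arange(n_rows) % 2})

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.33, random_state=324
)

print(len(x_tr), len(x_te))

# Rows of x and y stay aligned: the indexes match after the split
print((x_tr.index == y_tr.index).all())
```

The mean of 0.494382 above also shows that the two classes are close to balanced in the training set.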

In [53]:

humidity_classifier = DecisionTreeClassifier(max_leaf_nodes = 10 , random_state = 0)
humidity_classifier.fit(x_train, y_train)

Out[53]:

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=10,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=0,
            splitter='best')
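
max_leaf_nodes = 10 caps the tree at ten leaves, which keeps it small and limits overfitting. A hedged sketch of checking the size of a fitted tree, using synthetic data from make_classification as a stand-in for the weather features (the data is an assumption; get_n_leaves() and get_depth() are available in recent scikit-learn releases, possibly newer than the version that printed the output above):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 8 numeric features and a binary label
x_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

clf = DecisionTreeClassifier(max_leaf_nodes=10, random_state=0)
clf.fit(x_toy, y_toy)

# The number of leaves never exceeds max_leaf_nodes
print(clf.get_n_leaves())
print(clf.get_depth())
```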

In [54]:

type(humidity_classifier)

Out[54]:

sklearn.tree.tree.DecisionTreeClassifier

In [55]:

predictions = humidity_classifier.predict(x_test)

In [56]:

predictions[ : 10]

Out[56]:

array([0, 0, 1, 1, 1, 1, 0, 0, 0, 1])

In [57]:

y_test[ : 10]

Out[57]:

	high_humidity_label
456	0
845	0
693	1
259	1
723	1
224	1
300	1
442	0
585	1
1057	1

In [58]:

accuracy_score( y_true = y_test, y_pred = predictions )

Out[58]:

0.8153409090909091
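
accuracy_score is the fraction of predictions that match the true labels. A small sketch on the first ten predictions and labels printed above (the array names are my own); eight of the ten agree:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# First ten values from predictions[:10] and y_test[:10] above
y_pred_first10 = np.array([0, 0, 1, 1, 1, 1, 0, 0, 0, 1])
y_true_first10 = np.array([0, 0, 1, 1, 1, 1, 1, 0, 1, 1])

# Manual accuracy: share of positions where prediction equals truth
manual = (y_pred_first10 == y_true_first10).mean()
print(manual)  # 0.8

# Same result from scikit-learn
print(accuracy_score(y_true=y_true_first10, y_pred=y_pred_first10))  # 0.8
```

On the full test set the score of 0.8153409090909091 is consistent with 287 correct predictions out of 352 rows, assuming the 352-row test set that test_size = 0.33 of 1064 rows implies.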
