# -*- coding: utf-8 -*-
"""Day 14_RandomForest_mushroom.ipynb

Automatically generated by Colaboratory.

Original file is located at
"""

from google.colab import drive
drive.mount('/gdrive')

PATH = "/gdrive/My Drive/Colab Notebooks/resources/"

# %matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
import seaborn as sns
import time
from scipy.stats import norm, multivariate_normal

# Packages for interactive graphs
import ipywidgets as widgets
from IPython.display import display
from ipywidgets import interact, interactive, fixed, interact_manual, IntSlider

##############################################
############### Helper Functions #############
##############################################

def my_df_dropNas(df, columns):
    """Drop every row that has a missing value in any of the given columns."""
    for col in columns:
        df = df[df[col].notna()]
    return df

def my_checkNas(x):
    """Count the missing values in each column."""
    y = x.apply(lambda col: sum(col.isna()))
    return y

@interact_manual(x=IntSlider(0, 0, 12))
def test_model(x):
    print(x)

## Download the data from the web
import urllib.request as req
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
req.urlretrieve(url, "mushdata.csv")

df = pd.read_csv("mushdata.csv", header=None)
df.shape

"""# Convert each category to a number"""

## Convert between characters and numbers via their ASCII codes
print(ord('a'))
print(chr(97))

label = []
data = []

for row_index, row in df.iterrows():
    ## Column 0 holds the class label; the remaining columns are features
    label.append(row.iloc[0])
    row_data = []
    for v in row.iloc[1:]:
        row_data.append(ord(v))
    data.append(row_data)

len(label)

from sklearn.model_selection import train_test_split
## Split test and train data
data_train, data_test, label_train, label_test = train_test_split(data, label)

len(data_train) - len(data_test)

from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

## Train the model
clf = RandomForestClassifier()
clf.fit(data_train, label_train)
predict = clf.predict(data_test)

## Measure accuracy
ac_score = metrics.accuracy_score(label_test, predict)
## Apply several other metrics
rep = metrics.classification_report(label_test, predict)

ac_score

print(rep)

"""An important point when converting data to numbers.

red = 1
blue = 2
green = 3
white = 4

These numbers have no particular relationship to one another.
Think carefully about whether a number is a category label or a true numeric value.

red => 1000
blue => 0100

Categorical variables should be one-hot encoded.
"""

label = []
data = []
attr_list = []

for row_index, row in df.iterrows():
    ## .ix was removed from pandas; .iloc selects by position
    label.append(row.iloc[0])
    exdata = []
    ## col = column index, v = the character in that column
    for col, v in enumerate(row.iloc[1:]):
        if row_index == 0:
            ## First row: initialize this column's dictionary
            ## "dic" maps each character to an index, "cnt" is the next free index
            attr = {"dic": {}, "cnt": 0}
            attr_list.append(attr)
        else:
            attr = attr_list[col]
        ## Assume each feature takes at most 12 distinct values
        d = [0] * 12
        if v in attr['dic']:
            idx = attr['dic'][v]
        else:
            idx = attr['cnt']
            attr['dic'][v] = idx
            attr['cnt'] += 1
        ## Only position idx is set to 1
        d[idx] = 1
        ## extend (not append) keeps each row a flat 0/1 vector that scikit-learn can consume
        exdata.extend(d)
    data.append(exdata)

data[0]

data_train, data_test, label_train, label_test = train_test_split(data, label)
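The note on one-hot encoding (red => 1000, blue => 0100) can be tried on its own with a small example. This is a minimal sketch, not the notebook's method: it swaps the hand-written dictionary loop for `pd.get_dummies`, and the `toy` frame and its `color` column are hypothetical data made up for illustration.

```python
import pandas as pd

# Hypothetical toy data (not taken from the mushroom dataset).
toy = pd.DataFrame({"color": ["red", "blue", "green", "white", "red"]})

# pd.get_dummies creates one 0/1 column per distinct category value.
onehot = pd.get_dummies(toy["color"], dtype=int)

# Fix the column order so that red => 1000, blue => 0100, and so on.
onehot = onehot[["red", "blue", "green", "white"]]

# Join each row's 0/1 flags into a string for easy reading.
codes = ["".join(str(v) for v in row) for row in onehot.to_numpy()]
print(codes)  # ['1000', '0100', '0010', '0001', '1000']
```

Each row contains exactly one 1, so no color is treated as "larger" than another. The same call could probably be applied to the mushroom feature columns, for example `pd.get_dummies(df.iloc[:, 1:])`, in place of the manual loop.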