Day 4. Grouping, sorting, visualizing

2019. 7. 6. 01:27

Grouping

In [0]:

# GroupBy object
df.groupby('year')

Out[0]:

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fa2ead06da0>
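The output above is only a GroupBy object: nothing has been aggregated yet. A minimal sketch of how to look inside such an object, assuming a tiny made-up frame (the values below are hypothetical, not the gapminder data used in this post):

```python
import pandas as pd

# Hypothetical miniature of a gapminder-style frame.
df = pd.DataFrame({
    "country":   ["A", "B", "A", "B"],
    "continent": ["Asia", "Europe", "Asia", "Europe"],
    "year":      [1952, 1952, 1957, 1957],
    "lifeExp":   [40.0, 60.0, 44.0, 64.0],
})

grouped = df.groupby("year")  # builds the grouping, computes no statistics

# Iterating yields (group key, sub-DataFrame) pairs.
for year, frame in grouped:
    print(year, frame.shape)

n_groups = grouped.ngroups            # number of distinct years
rows_1952 = grouped.get_group(1952)   # all rows whose year is 1952
```

An aggregation such as `mean()` is what finally turns the groups into a result.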

In [0]:

# Mean life expectancy for each year group.
df.groupby('year')['lifeExp'].mean()

Out[0]:

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64

In [0]:

dfg  = df.groupby('year')
dfgy = dfg['lifeExp']

dfgy.mean()

Out[0]:

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
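Once the column has been selected from the GroupBy object, several statistics can be requested at once with `agg`. A small sketch, again on hypothetical values rather than the real data:

```python
import pandas as pd

# Hypothetical values for illustration only.
df = pd.DataFrame({
    "year":    [1952, 1952, 1957, 1957],
    "lifeExp": [40.0, 60.0, 44.0, 64.0],
})

dfgy = df.groupby("year")["lifeExp"]

# One row per year, one column per requested statistic.
summary = dfgy.agg(["mean", "min", "max"])
print(summary)
```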

Grouping by year and continent to get mean life expectancy and GDP

In [0]:

# Group by year and continent, then print the mean life expectancy and GDP of one group.

df.groupby(['year', 'continent'])[['lifeExp','gdpPercap']].mean().loc[1952].loc["Africa"]

Out[0]:

lifeExp        39.135500
gdpPercap    1252.572466
Name: Africa, dtype: float64
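Grouping by two keys produces a two-level index, so the chained `.loc[...]` calls can also be written as one tuple lookup. A sketch under the assumption of a small made-up frame (all numbers here are hypothetical):

```python
import pandas as pd

# Hypothetical rows; the column names follow the post.
df = pd.DataFrame({
    "year":      [1952, 1952, 1952, 1957],
    "continent": ["Africa", "Africa", "Europe", "Africa"],
    "lifeExp":   [38.0, 40.0, 64.0, 41.0],
    "gdpPercap": [1000.0, 1400.0, 7000.0, 1300.0],
})

means = df.groupby(["year", "continent"])[["lifeExp", "gdpPercap"]].mean()

# A tuple selects one (year, continent) combination in a single step.
africa_1952 = means.loc[(1952, "Africa")]

# reset_index() turns the two index levels back into ordinary columns.
flat = means.reset_index()
```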

In [0]:

# Number of distinct countries per continent (nunique)

df.groupby(['continent'])['country'].nunique()

Out[0]:

continent
Africa      52
Americas    25
Asia        33
Europe      30
Oceania      2
Name: country, dtype: int64
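`nunique` counts distinct values, which matters here because each country appears once per survey year. A short sketch contrasting it with `count`, using hypothetical rows:

```python
import pandas as pd

# Hypothetical rows: countries A and C each appear twice.
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe"],
    "country":   ["A", "A", "B", "C", "C"],
})

by_continent = df.groupby("continent")["country"]

distinct = by_continent.nunique()  # distinct countries per continent
rows = by_continent.count()        # non-null rows per continent
```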

Visualization

In [0]:

gyle = df.groupby(['year'])['lifeExp'].mean()

In [0]:

# Call the plot function directly on a Series

gyle.plot(grid = True)

Out[0]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fa2eab5f9b0>

In [0]:

scientists = df_csv.copy()

In [0]:

scientists.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 5 columns):
Name          8 non-null object
Born          8 non-null object
Died          8 non-null object
Age           8 non-null int64
Occupation    8 non-null object
dtypes: int64(1), object(4)
memory usage: 400.0+ bytes

In [0]:

# Print only the ages that are greater than the mean age.

age = scientists['Age']
meanAge = age.mean()

In [0]:

meanAge

Out[0]:

59.125

In [0]:

age[ age > meanAge ]

Out[0]:

1    61
2    90
3    66
7    77
Name: Age, dtype: int64
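The filter works because the comparison itself returns a boolean Series, which is then used as a mask. A minimal sketch that rebuilds the `Age` column from the values shown in this post and makes the intermediate mask visible:

```python
import pandas as pd

# Ages copied from the output shown in this post.
age = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

mask = age > age.mean()   # boolean Series, True where the age exceeds the mean
older = age[mask]         # keeps only the rows where the mask is True

print(mask.tolist())
print(older)
```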

Sorting

In [0]:

## Sort by index in ascending order.
age.sort_index()

Out[0]:

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64

In [0]:

age.sort_index( ascending = False )

Out[0]:

7    77
6    41
5    45
4    56
3    66
2    90
1    61
0    37
Name: Age, dtype: int64

In [0]:

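## Sort by value.
age.sort_values()  # ascending by default; this result is not displayed

age.sort_values(ascending = False)  # only the last expression in a cell is shown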

Out[0]:

2    90
7    77
3    66
1    61
4    56
5    45
6    41
0    37
Name: Age, dtype: int64
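The same method sorts a whole DataFrame when given a column name through `by`. A sketch using only the `Name` and `Age` columns from the table in this post:

```python
import pandas as pd

# Names and ages copied from the scientists table in this post.
scientists = pd.DataFrame({
    "Name": ["Rosaline Franklin", "William Gosset", "Florence Nightingale",
             "Marie Curie", "Rachel Carson", "John Snow",
             "Alan Turing", "Johann Gauss"],
    "Age":  [37, 61, 90, 66, 56, 45, 41, 77],
})

# sort_values returns a new, sorted frame; the original order is untouched.
by_age = scientists.sort_values(by="Age", ascending=False)

oldest = by_age.iloc[0]["Name"]
youngest = by_age.iloc[-1]["Name"]

# Optional: renumber the rows 0..n-1 after sorting.
renumbered = by_age.reset_index(drop=True)
```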

In [0]:

# To check the data type of a single column only
scientists['Born'].dtypes  # result 'O' => object

Out[0]:

dtype('O')

In [0]:

## Dates stored as strings must be converted to a datetime type.

## format = "%Y-%m-%d" is the default (ISO) format

bd = pd.to_datetime( scientists['Born'] , format = "%Y-%m-%d" )
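A quick way to confirm the conversion is to compare the dtype before and after, and to try the `.dt` accessor that only datetime columns provide. A sketch on two of the birth dates from this post:

```python
import pandas as pd

# Two birth dates copied from the scientists table, stored as plain strings.
born = pd.Series(["1920-07-25", "1876-06-13"], name="Born")
print(born.dtype)   # object: pandas treats these as text

bd = pd.to_datetime(born, format="%Y-%m-%d")
print(bd.dtype)     # a datetime64 dtype

# Date parts become available through the .dt accessor.
birth_years = bd.dt.year
```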

In [0]:

# Derived variable: compute how many days each person lived

dd = pd.to_datetime( scientists['Died'] , format = "%Y-%m-%d" )

Out[0]:

	Name	Born	Died	Age	Occupation
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist
1	William Gosset	1876-06-13	1937-10-16	61	Statistician
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse
3	Marie Curie	1867-11-07	1934-07-04	66	Chemist
4	Rachel Carson	1907-05-27	1964-04-14	56	Biologist
5	John Snow	1813-03-15	1858-06-16	45	Physician
6	Alan Turing	1912-06-23	1954-06-07	41	Computer Scientist
7	Johann Gauss	1777-04-30	1855-02-23	77	Mathematician

In [0]:

scientists['age_days'] = dd - bd

In [0]:

scientists

Out[0]:

	Name	Born	Died	Age	Occupation	age_days
0	Rosaline Franklin	1920-07-25	1958-04-16	37	Chemist	13779 days
1	William Gosset	1876-06-13	1937-10-16	61	Statistician	22404 days
2	Florence Nightingale	1820-05-12	1910-08-13	90	Nurse	32964 days
3	Marie Curie	1867-11-07	1934-07-04	66	Chemist	24345 days
4	Rachel Carson	1907-05-27	1964-04-14	56	Biologist	20777 days
5	John Snow	1813-03-15	1858-06-16	45	Physician	16529 days
6	Alan Turing	1912-06-23	1954-06-07	41	Computer Scientist	15324 days
7	Johann Gauss	1777-04-30	1855-02-23	77	Mathematician	28422 days
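The `age_days` column holds timedelta values, not plain numbers. A sketch of turning them into integer days and an approximate age in years, using two rows from the table above (dividing by 365.25 is a rough assumption about average year length, not something the original post does):

```python
import pandas as pd

# Dates for Rosaline Franklin and Alan Turing, copied from the table.
born = pd.to_datetime(pd.Series(["1920-07-25", "1912-06-23"]))
died = pd.to_datetime(pd.Series(["1958-04-16", "1954-06-07"]))

age_days = died - born     # timedelta64 Series
days = age_days.dt.days    # plain integers

# Rough conversion to whole years (assumes 365.25 days per year).
approx_years = (days / 365.25).astype(int)
```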

Dropping a column

In [0]:

scientists.columns

Out[0]:

Index(['Name', 'Born', 'Died', 'Age', 'Occupation', 'age_days'], dtype='object')

In [0]:

scientists.drop(['Age'], axis = 1)

Out[0]:

	Name	Born	Died	Occupation	age_days
0	Rosaline Franklin	1920-07-25	1958-04-16	Chemist	13779 days
1	William Gosset	1876-06-13	1937-10-16	Statistician	22404 days
2	Florence Nightingale	1820-05-12	1910-08-13	Nurse	32964 days
3	Marie Curie	1867-11-07	1934-07-04	Chemist	24345 days
4	Rachel Carson	1907-05-27	1964-04-14	Biologist	20777 days
5	John Snow	1813-03-15	1858-06-16	Physician	16529 days
6	Alan Turing	1912-06-23	1954-06-07	Computer Scientist	15324 days
7	Johann Gauss	1777-04-30	1855-02-23	Mathematician	28422 days
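Note that `drop` returns a new DataFrame and leaves `scientists` itself unchanged, so the result has to be assigned if the column should really disappear. A minimal sketch with a one-row frame:

```python
import pandas as pd

# One row from the scientists table is enough to show the behavior.
scientists = pd.DataFrame({
    "Name": ["Marie Curie"],
    "Age": [66],
    "Occupation": ["Chemist"],
})

dropped = scientists.drop(["Age"], axis=1)
still_there = "Age" in scientists.columns   # the original frame keeps the column

# Assign the result back to make the change stick.
scientists = scientists.drop(columns=["Age"])
```

`drop(columns=[...])` is an equivalent spelling of `drop([...], axis=1)`.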

Seaborn and the Anscombe dataset

In [0]:

import seaborn as sns

In [0]:

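# Datasets I-IV have (almost exactly) the same mean, variance, correlation, and regression line.

# Even when different datasets share the same summary statistics, do not hastily conclude that the data are similar.
# This is why visualization matters.

# Let's use the dataset column as the grouping key, examine the mean and variance, and visualize the data.

anscombe = sns.load_dataset("anscombe")  # keep a reference; later cells use `anscombe`
anscombe[:10]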

Out[0]:

	dataset	x	y
0	I	10.0	8.04
1	I	8.0	6.95
2	I	13.0	7.58
3	I	9.0	8.81
4	I	11.0	8.33
5	I	14.0	9.96
6	I	6.0	7.24
7	I	4.0	4.26
8	I	12.0	10.84
9	I	7.0	4.82
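The comments above propose grouping by dataset and checking the mean and variance. A sketch of that check follows. To keep it runnable offline, the values of the well-known Anscombe quartet are typed in by hand here instead of being loaded through seaborn, which is an assumption of this example rather than what the post does:

```python
import pandas as pd

# Anscombe's quartet, typed in by hand so no download is needed.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
anscombe = pd.DataFrame({
    "dataset": ["I"] * 11 + ["II"] * 11 + ["III"] * 11 + ["IV"] * 11,
    "x": x123 * 3 + [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
    "y": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68,
          9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74,
          7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73,
          6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
})

# Mean and variance of x and y for each of the four datasets.
stats = anscombe.groupby("dataset")[["x", "y"]].agg(["mean", "var"])
print(stats.round(2))

# Pearson correlation between x and y within each dataset.
corr = {name: g["x"].corr(g["y"]) for name, g in anscombe.groupby("dataset")}
print(corr)
```

The four rows of `stats` come out nearly identical, which is exactly why the scatter plots are needed to tell the datasets apart.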

In [0]:

df1 = anscombe[anscombe['dataset']=='I']
df2 = anscombe[anscombe['dataset']=='II']
df3 = anscombe[anscombe['dataset']=='III']
df4 = anscombe[anscombe['dataset']=='IV']  # the fourth label is 'IV', not 'IIII'

In [0]:

%matplotlib inline
# Lets you see the result without calling show()
import matplotlib.pyplot as plt

# Draw as a line
plt.plot(df1['x'], df1['y'])

Out[0]:

[<matplotlib.lines.Line2D at 0x7fa2e6eaee80>]

In [0]:

## Draw as a scatter plot instead of a line
plt.plot(df1['x'], df1['y'], 'o')

Out[0]:

[<matplotlib.lines.Line2D at 0x7fa2e6e132b0>]

When visualizing, splitting the figure into several panels makes the plots easier to compare with one another.

In [0]:

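# Create a figure object => a canvas, or a blank sheet of paper

# add_subplot(2,2,1): split into 2 rows and 2 columns; positions are numbered 1, 2, 3, 4 from left to right, top to bottom.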

fig = plt.figure()
axes1 = fig.add_subplot(2,2,1)
axes2 = fig.add_subplot(2,2,2)
axes3 = fig.add_subplot(2,2,3)
axes4 = fig.add_subplot(2,2,4)

axis = singular, axes = plural (in matplotlib, each subplot is one Axes object, which contains an x-axis and a y-axis)

In [0]:

axes1.plot(df1['x'],df1['y'], 'o')
axes2.plot(df2['x'],df2['y'], 'o')
axes3.plot(df3['x'],df3['y'], 'o')
axes4.plot(df4['x'],df4['y'], 'o')

Out[0]:

[<matplotlib.lines.Line2D at 0x7fa2e6ba0d30>]
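The four `add_subplot` calls can also be written with `plt.subplots`, which creates the figure and the whole grid at once. A sketch with hypothetical points in each panel (the `Agg` backend line is an assumption so the snippet also runs outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed inside a notebook
import matplotlib.pyplot as plt

# Hypothetical points, one small series per panel.
panels = {
    "dataset_1": ([1, 2, 3], [1, 2, 3]),
    "dataset_2": ([1, 2, 3], [3, 2, 1]),
    "dataset_3": ([1, 2, 3], [2, 2, 2]),
    "dataset_4": ([1, 2, 3], [1, 3, 1]),
}

fig, axes = plt.subplots(2, 2)  # axes is a 2x2 NumPy array of Axes objects

# flatten() walks the grid row by row:
# top-left, top-right, bottom-left, bottom-right.
for ax, (title, (xs, ys)) in zip(axes.flatten(), panels.items()):
    ax.plot(xs, ys, "o")
    ax.set_title(title)

fig.suptitle("Four panels in one figure")
fig.tight_layout()
```

The loop order matches the position numbers 1 to 4 used by `add_subplot`.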

In [0]:

fig

Out[0]:

In [0]:

axes1.set_title("dataset_1")
axes2.set_title("dataset_2")
axes3.set_title("dataset_3")
axes4.set_title("dataset_4")

fig.suptitle("Anscombe Data")
fig.tight_layout()
fig

Out[0]:
