Day 6. Frequent operations with pandas - aggregation

2019. 6. 15. 23:05

In [26]:

from pandas import read_csv  # import assumed; it is not shown in the original post
tags = read_csv("./ml/tags.csv", sep = ",")
tags.head()

Out[26]:

	userId	movieId	tag	timestamp
0	2	60756	funny	1445714994
1	2	60756	Highly quotable	1445714996
2	2	60756	will ferrell	1445714992
3	2	89774	Boxing story	1445715207
4	2	89774	MMA	1445715200

In [27]:

tags.describe()

Out[27]:

	userId	movieId	timestamp
count	3683.000000	3683.000000	3.683000e+03
mean	431.149335	27252.013576	1.320032e+09
std	158.472553	43490.558803	1.721025e+08
min	2.000000	1.000000	1.137179e+09
25%	424.000000	1262.500000	1.137521e+09
50%	474.000000	4454.000000	1.269833e+09
75%	477.000000	39263.000000	1.498457e+09
max	610.000000	193565.000000	1.537099e+09

In [28]:

tags.shape

Out[28]:

(3683, 4)

In [29]:

movies = read_csv("./ml/movies.csv", sep = ",")
movies.head()

Out[29]:

	movieId	title	genres
0	1	Toy Story (1995)	Adventure|Animation|Children|Comedy|Fantasy
1	2	Jumanji (1995)	Adventure|Children|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama|Romance
4	5	Father of the Bride Part II (1995)	Comedy

In [30]:

movies.shape

Out[30]:

(9742, 3)

In [31]:

ratings = read_csv("./ml/ratings.csv")
ratings.head()

Out[31]:

	userId	movieId	rating	timestamp
0	1	1	4.0	964982703
1	1	3	4.0	964981247
2	1	6	4.0	964982224
3	1	47	5.0	964983815
4	1	50	5.0	964982931

In [32]:

ratings.shape

Out[32]:

(100836, 4)

In [38]:

# ratings[1000:1010]
ratings[-10:]  # last ten rows

Out[38]:

	userId	movieId	rating	timestamp
100826	610	162350	3.5	1493849971
100827	610	163937	3.5	1493848789
100828	610	163981	3.5	1493850155
100829	610	164179	5.0	1493845631
100830	610	166528	4.0	1493879365
100831	610	166534	4.0	1493848402
100832	610	168248	5.0	1493850091
100833	610	168250	5.0	1494273047
100834	610	168252	5.0	1493846352
100835	610	170875	3.0	1493846415

In [39]:

tag_counts = tags['tag'].value_counts()  # count how often each tag appears, most frequent first
tag_counts[:10]  # the ten most frequent tags

Out[39]:

In Netflix queue     131
atmospheric           36
superhero             24
thought-provoking     24
funny                 23
surreal               23
Disney                23
religion              22
psychology            21
quirky                21
Name: tag, dtype: int64

In [40]:

tag_counts[-10:]  # the ten least frequent tags

Out[40]:

brilliant              1
Insurance              1
parrots                1
President              1
Neil Patrick Harris    1
Renee Zellweger        1
Classic                1
crucifixion            1
Boston                 1
tricky                 1
Name: tag, dtype: int64
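
The slicing above works because value_counts() returns a Series sorted by count in descending order. A minimal sketch of the same idea on a made-up table (toy_tags and its values are hypothetical, not the MovieLens data), including a label lookup that does not fail when a tag is absent:

```python
import pandas as pd

# Hypothetical miniature of the tags table (not the real MovieLens data).
toy_tags = pd.DataFrame({
    "tag": ["funny", "sci-fi", "funny", "dark", "sci-fi", "funny"],
})

toy_counts = toy_tags["tag"].value_counts()  # sorted by count, descending

top_two = toy_counts.head(2)     # same rows as toy_counts[:2]
bottom_one = toy_counts.tail(1)  # same rows as toy_counts[-1:]

# The index holds the tag labels, so a count is looked up by label.
n_funny = toy_counts["funny"]
# .get returns a default instead of raising KeyError for an unknown tag.
n_missing = toy_counts.get("no-such-tag", 0)

print(top_two)
print(n_funny, n_missing)
```

head() and tail() say the same thing as the slices, but read a little more clearly when the intent is "top n" or "bottom n".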

In [42]:

tag_counts['sci-fi']  # tag_counts is a Series, so look up a count by its index label

Out[42]:

In [44]:

tag_counts[ : 10].plot(kind = 'bar', figsize = (15, 10))

Out[44]:

<matplotlib.axes._subplots.AxesSubplot at 0x7fbac652c320>

In [45]:

is_highly_rated = ratings['rating'] >= 4.0
is_highly_rated.head()

Out[45]:

0    True
1    True
2    True
3    True
4    True
Name: rating, dtype: bool

In [46]:

ratings[is_highly_rated][-5 : ]

Out[46]:

	userId	movieId	rating	timestamp
100830	610	166528	4.0	1493879365
100831	610	166534	4.0	1493848402
100832	610	168248	5.0	1493850091
100833	610	168250	5.0	1494273047
100834	610	168252	5.0	1493846352
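
The filter above is a boolean mask: a Series of True/False values that selects the rows where it is True. A small sketch, assuming a made-up toy_ratings table, of how masks can be combined with & and inverted with ~:

```python
import pandas as pd

# Hypothetical ratings table for illustration only.
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 2.5, 5.0, 3.0, 4.5],
})

is_high = toy_ratings["rating"] >= 4.0
is_movie_10 = toy_ratings["movieId"] == 10

high = toy_ratings[is_high]                       # rows rated 4.0 or higher
high_for_10 = toy_ratings[is_high & is_movie_10]  # both conditions must hold
not_high = toy_ratings[~is_high]                  # ~ inverts the mask

# True counts as 1, so the mean of a mask is the share of matching rows.
share_high = is_high.mean()

print(high_for_10)
print(share_high)
```

Use & and | rather than Python's and / or here, since the comparison has to happen element by element.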

In [49]:

is_animation = movies['genres'].str.contains('Animation')
is_animation.head()

Out[49]:

0     True
1    False
2    False
3    False
4    False
Name: genres, dtype: bool

In [50]:

movies[is_animation][5:15]

Out[50]:

	movieId	title	genres
322	364	Lion King, The (1994)	Adventure|Animation|Children|Drama|Musical|IMAX
483	551	Nightmare Before Christmas, The (1993)	Animation|Children|Fantasy|Musical
488	558	Pagemaster, The (1994)	Action|Adventure|Animation|Children|Fantasy
506	588	Aladdin (1992)	Adventure|Animation|Children|Comedy|Musical
511	594	Snow White and the Seven Dwarfs (1937)	Animation|Children|Drama|Fantasy|Musical
512	595	Beauty and the Beast (1991)	Animation|Children|Fantasy|Musical|Romance|IMAX
513	596	Pinocchio (1940)	Animation|Children|Fantasy|Musical
522	610	Heavy Metal (1981)	Action|Adventure|Animation|Horror|Sci-Fi
527	616	Aristocats, The (1970)	Animation|Children
534	631	All Dogs Go to Heaven 2 (1996)	Adventure|Animation|Children|Fantasy|Musical|R...
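
str.contains matches a substring (and treats the pattern as a regular expression by default), which is fine for 'Animation' but can surprise with other patterns. A hedged sketch on a made-up toy_movies table, showing a literal match and an exact per-genre test after splitting on the | separator:

```python
import pandas as pd

# Hypothetical movies table; titles and genres are invented.
toy_movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Adventure|Animation", "Comedy|Romance", "Animation|Children"],
})

# regex=False treats the pattern as a plain substring, not a regex.
is_animation = toy_movies["genres"].str.contains("Animation", regex=False)

# Split each genres string into a list and test exact membership.
genre_lists = toy_movies["genres"].str.split("|")
is_animation_exact = genre_lists.apply(lambda genres: "Animation" in genres)

# One row per (movie, genre) pair, then count movies per genre.
# Series.explode needs pandas 0.25 or newer.
genre_counts = genre_lists.explode().value_counts()

print(toy_movies[is_animation])
print(genre_counts)
```

The split version is the safer choice when one genre name could be a substring of another.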

In [51]:

ratings_count = ratings[['movieId', 'rating']].groupby('rating').count()
ratings_count.head()

Out[51]:

	movieId
rating
0.5	1370
1.0	2811
1.5	1791
2.0	7551
2.5	5550
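
Counting rows per rating value can be written in a few ways. A small sketch with a made-up toy_ratings table comparing the groupby/count form used above with size() and value_counts(); the three agree here because the toy data has no missing values (count() skips NaN, size() does not):

```python
import pandas as pd

# Hypothetical ratings table for illustration only.
toy_ratings = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "rating":  [4.0, 0.5, 4.0, 5.0, 0.5],
})

# Same pattern as the cell above: count non-null movieId values per rating.
by_group = toy_ratings[["movieId", "rating"]].groupby("rating").count()["movieId"]

# size() counts rows per group regardless of missing values.
by_size = toy_ratings.groupby("rating").size()

# value_counts() sorts by frequency; sort_index() restores rating order.
by_value_counts = toy_ratings["rating"].value_counts().sort_index()

print(by_group)
```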

In [54]:

average_rating = ratings[['movieId', 'rating']].groupby('movieId').mean()
average_rating.tail()

Out[54]:

	rating
movieId
193581	4.0
193583	3.5
193585	3.5
193587	3.5
193609	4.0

In [55]:

movie_count = ratings[['movieId', 'rating']].groupby('movieId').count()
movie_count.head()   # number of ratings each movie received (not the number of movies in movies.csv)

Out[55]:

	rating
movieId
1	215
2	110
3	52
4	7
5	49
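
The average rating and the rating count per movie can also be computed in one pass. A minimal sketch, assuming a made-up toy_ratings table and an arbitrary threshold of two ratings, that uses agg to get both and then ignores movies with too few ratings:

```python
import pandas as pd

# Hypothetical ratings table for illustration only.
toy_ratings = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 3],
    "rating":  [4.0, 5.0, 3.0, 2.0, 3.0, 5.0],
})

# One row per movie with two columns: "count" and "mean".
summary = toy_ratings.groupby("movieId")["rating"].agg(["count", "mean"])

# Keep movies with at least 2 ratings (threshold chosen only for this example).
popular = summary[summary["count"] >= 2]

# Highest average rating first.
best = popular.sort_values("mean", ascending=False)

print(summary)
print(best)
```

Filtering on the count first keeps a movie with a single 5.0 rating from topping the list.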
