MovieLens 1M数据集
图2-2:按Windows和非Windows用户统计的最常出现的时区
图2-3:按Windows和非Windows用户比例统计的最常出现的时区
MovieLens 1M数据集含有来自6000名用户对4000部电影的100万条评分数据。它分为三个表:评分、用户信息和电影信息。将该数据从zip文件中解压出来之后,可以通过pandas.read_table将各个表分别读到一个pandas DataFrame对象中:
- import pandas as pd
- unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
- users = pd.read_table('ml-1m/users.dat', sep='::', header=None, names=unames)
- rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
- ratings = pd.read_table('ml-1m/ratings.dat', sep='::', header=None, names=rnames)
- mnames = ['movie_id', 'title', 'genres']
- movies = pd.read_table('ml-1m/movies.dat', sep='::', header=None, names=mnames)
利用Python的切片语法,通过查看每个DataFrame的前几行即可验证数据加载工作是否一切顺利:
- In [334]: users[:5]
- Out[334]:
- user_id gender age occupation zip
- 0 1 F 1 10 48067
- 1 2 M 56 16 70072
- 2 3 M 25 15 55117
- 3 4 M 45 7 02460
- 4 5 M 25 20 55455
- In [335]: ratings[:5]
- Out[335]:
- user_id movie_id rating timestamp
- 0 1 1193 5 978300760
- 1 1 661 3 978302109
- 2 1 914 3 978301968
- 3 1 3408 4 978300275
- 4 1 2355 5 978824291
- In [336]: movies[:5]
- Out[336]:
- movie_id title genres
- 0 1 Toy Story (1995) Animation|Children's|Comedy
- 1 2 Jumanji (1995) Adventure|Children's|Fantasy
- 2 3 Grumpier Old Men (1995) Comedy|Romance
- 3 4 Waiting to Exhale (1995) Comedy|Drama
- 4 5 Father of the Bride Part II (1995) Comedy
- In [337]: ratings
- Out[337]:
- <class 'pandas.core.frame.DataFrame'>
- Int64Index: 1000209 entries, 0 to 1000208
- Data columns:
- user_id 1000209 non-null values
- movie_id 1000209 non-null values
- rating 1000209 non-null values
- timestamp 1000209 non-null values
- dtypes: int64(4)
注意,其中的年龄和职业是以编码形式给出的,它们的具体含义请参考该数据集的README文件。分析散布在三个表中的数据可不是一件轻松的事情。假设我们想要根据性别和年龄计算某部电影的平均得分,如果将所有数据都合并到一个表中的话问题就简单多了。我们先用pandas的merge函数将ratings跟users合并到一起,然后再将movies也合并进去。pandas会根据列名的重叠情况推断出哪些列是合并(或连接)键:
- In [338]: data = pd.merge(pd.merge(ratings, users), movies)
- In [339]: data
- Out[339]:
- <class 'pandas.core.frame.DataFrame'>
- Int64Index: 1000209 entries, 0 to 1000208
- Data columns:
- user_id 1000209 non-null values
- movie_id 1000209 non-null values
- rating 1000209 non-null values
- timestamp 1000209 non-null values
- gender 1000209 non-null values
- age 1000209 non-null values
- occupation 1000209 non-null values
- zip 1000209 non-null values
- title 1000209 non-null values
- genres 1000209 non-null values
- dtypes: int64(6), object(4)
- In [340]: data.ix[0]
- Out[340]:
- user_id 1
- movie_id 1
- rating 5
- timestamp 978824268
- gender F
- age 1
- occupation 10
- zip 48067
- title Toy Story (1995)
- genres Animation|Children's|Comedy
- Name: 0
现在,只要稍微熟悉一下pandas,就能轻松地根据任意个用户或电影属性对评分数据进行聚合操作了。为了按性别计算每部电影的平均得分,我们可以使用pivot_table方法:
- In [341]: mean_ratings = data.pivot_table('rating', rows='title',
- ....: cols='gender', aggfunc='mean')
- In [342]: mean_ratings[:5]
- Out[342]:
- gender F M
- title
- $1,000,000 Duck (1971) 3.375000 2.761905
- 'Night Mother (1986) 3.388889 3.352941
- 'Til There Was You (1997) 2.675676 2.733333
- 'burbs, The (1989) 2.793478 2.962085
- ...And Justice for All (1979) 3.828571 3.689024
该操作产生了另一个DataFrame,其内容为电影平均得分,行标为电影名称,列标为性别。现在,我打算过滤掉评分数据不够250条的电影(随便选的一个数字)。为了达到这个目的,我先对title进行分组,然后利用size()得到一个含有各电影分组大小的Series对象:
- In [343]: ratings_by_title = data.groupby('title').size()
- In [344]: ratings_by_title[:10]
- Out[344]:
- title
- $1,000,000 Duck (1971) 37
- 'Night Mother (1986) 70
- 'Til There Was You (1997) 52
- 'burbs, The (1989) 303
- ...And Justice for All (1979) 199
- 1-900 (1994) 2
- 10 Things I Hate About You (1999) 700
- 101 Dalmatians (1961) 565
- 101 Dalmatians (1996) 364
- 12 Angry Men (1957) 616
- In [345]: active_titles = ratings_by_title.index[ratings_by_title >= 250]
- In [346]: active_titles
- Out[346]:
- Index(['burbs, The (1989), 10 Things I Hate About You (1999),
- 101 Dalmatians (1961), ..., Young Sherlock Holmes (1985),
- Zero Effect (1998), eXistenZ (1999)], dtype=object)
该索引中含有评分数据大于250条的电影名称,然后我们就可以据此从前面的mean_ratings中选取所需的行了:
- In [347]: mean_ratings = mean_ratings.ix[active_titles]
- In [348]: mean_ratings
- Out[348]:
- <class 'pandas.core.frame.DataFrame'>
- Index: 1216 entries, 'burbs, The (1989) to eXistenZ (1999)
- Data columns:
- F 1216 non-null values
- M 1216 non-null values
- dtypes: float64(2)
为了了解女性观众最喜欢的电影,我们可以对F列降序排列:
- In [350]: top_female_ratings = mean_ratings.sort_index(by='F', ascending=False)
- In [351]: top_female_ratings[:10]
- Out[351]:
- gender F M
- title
- Close Shave, A (1995) 4.644444 4.473795
- Wrong Trousers, The (1993) 4.588235 4.478261
- Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) 4.572650 4.464589
- Wallace & Gromit: The Best of Aardman Animation (1996) 4.563107 4.385075
- Schindler's List (1993) 4.562602 4.491415
- Shawshank Redemption, The (1994) 4.539075 4.560625
- Grand Day Out, A (1992) 4.537879 4.293255
- To Kill a Mockingbird (1962) 4.536667 4.372611
- Creature Comforts (1990) 4.513889 4.272277
- Usual Suspects, The (1995) 4.513317 4.518248
计算评分分歧
假设我们想要找出男性和女性观众分歧最大的电影。一个办法是给mean_ratings加上一个用于存放平均得分之差的列,并对其进行排序:
- In [352]: mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
按"diff"排序即可得到分歧最大且女性观众更喜欢的电影:
- In [353]: sorted_by_diff = mean_ratings.sort_index(by='diff')
- In [354]: sorted_by_diff[:15]
- Out[354]:
- gender F M diff
- title
- Dirty Dancing (1987) 3.790378 2.959596 -0.830782
- Jumpin' Jack Flash (1986) 3.254717 2.578358 -0.676359
- Grease (1978) 3.975265 3.367041 -0.608224
- Little Women (1994) 3.870588 3.321739 -0.548849
- Steel Magnolias (1989) 3.901734 3.365957 -0.535777
- Anastasia (1997) 3.800000 3.281609 -0.518391
- Rocky Horror Picture Show, The (1975) 3.673016 3.160131 -0.512885
- Color Purple, The (1985) 4.158192 3.659341 -0.498851
- Age of Innocence, The (1993) 3.827068 3.339506 -0.487561
- Free Willy (1993) 2.921348 2.438776 -0.482573
- French Kiss (1995) 3.535714 3.056962 -0.478752
- Little Shop of Horrors, The (1960) 3.650000 3.179688 -0.470312
- Guys and Dolls (1955) 4.051724 3.583333 -0.468391
- Mary Poppins (1964) 4.197740 3.730594 -0.467147
- Patch Adams (1998) 3.473282 3.008746 -0.464536
对排序结果反序并取出前15行,得到的则是男性观众更喜欢的电影:
- # 对行反序,并取出前15行
- In [355]: sorted_by_diff[::-1][:15]
- Out[355]:
- gender F M diff
- title
- Good, The Bad and The Ugly, The (1966) 3.494949 4.221300 0.726351
- Kentucky Fried Movie, The (1977) 2.878788 3.555147 0.676359
- Dumb & Dumber (1994) 2.697987 3.336595 0.638608
- Longest Day, The (1962) 3.411765 4.031447 0.619682
- Cable Guy, The (1996) 2.250000 2.863787 0.613787
- Evil Dead II (Dead By Dawn) (1987) 3.297297 3.909283 0.611985
- Hidden, The (1987) 3.137931 3.745098 0.607167
- Rocky III (1982) 2.361702 2.943503 0.581801
- Caddyshack (1980) 3.396135 3.969737 0.573602
- For a Few Dollars More (1965) 3.409091 3.953795 0.544704
- Porky's (1981) 2.296875 2.836364 0.539489
- Animal House (1978) 3.628906 4.167192 0.538286
- Exorcist, The (1973) 3.537634 4.067239 0.529605
- Fright Night (1985) 2.973684 3.500000 0.526316
- Barb Wire (1996) 1.585366 2.100386 0.515020
如果只是想要找出分歧最大的电影(不考虑性别因素),则可以计算得分数据的方差或标准差:
- # 根据电影名称分组的得分数据的标准差
- In [356]: rating_std_by_title = data.groupby('title')['rating'].std()
- # 根据active_titles进行过滤
- In [357]: rating_std_by_title = rating_std_by_title.ix[active_titles]
- # 根据值对Series进行降序排列
- In [358]: rating_std_by_title.order(ascending=False)[:10]
- Out[358]:
- title
- Dumb & Dumber (1994) 1.321333
- Blair Witch Project, The (1999) 1.316368
- Natural Born Killers (1994) 1.307198
- Tank Girl (1995) 1.277695
- Rocky Horror Picture Show, The (1975) 1.260177
- Eyes Wide Shut (1999) 1.259624
- Evita (1996) 1.253631
- Billy Madison (1995) 1.249970
- Fear and Loathing in Las Vegas (1998) 1.246408
- Bicentennial Man (1999) 1.245533
- Name: rating
可能你已经注意到了,电影分类是以竖线(|)分隔的字符串形式给出的。如果想对电影分类进行分析的话,就需要先将其转换成更有用的形式才行。我将在本书后续章节中讲到这种转换处理,到时还会再用到这个数据。