示例:USDA食品数据库

美国农业部(USDA)制作了一份有关食物营养信息的数据库。Ashley Williams(一名来自英国的技术牛人)制作了该数据的JSON版(http://ashleyw.co.uk/project/food-nutrient-database)。其中的记录如下所示:

  1. {
  2. "id": 21441,
  3. "description": "KENTUCKY FRIED CHICKEN, Fried Chicken, EXTRA CRISPY, Wing, meat and skin with breading",
  4. "tags": ["KFC"],
  5. "manufacturer": "Kentucky Fried Chicken",
  6. "group": "Fast Foods",
  7. "portions": [
  8. {
  9. "amount": 1,
  10. "unit": "wing, with skin",
  11. "grams": 68.0
  12. },
  13. ...
  14. ],
  15. "nutrients": [
  16. {
  17. "value": 20.8,
  18. "units": "g",
  19. "description": "Protein",
  20. "group": "Composition"
  21. },
  22.  
  23. ...
  24. ]
  25. }

每种食物都带有若干标识性属性以及两个有关营养成分和分量的列表。这种形式的数据不是很适合分析工作,因此我们需要做一些规整化以使其具有更好用的形式。

从上面列举的那个网址下载并解压数据之后,你可以用任何喜欢的JSON库将其加载到Python中。我用的是Python内置的json模块:

  1. In [256]: import json
  2.  
  3. In [257]: db = json.load(open('ch07/foods-2011-10-03.json'))
  4.  
  5. In [258]: len(db)
  6. Out[258]: 6636

db中的每个条目都是一个含有某种食物全部数据的字典。nutrients字段是一个字典列表,其中的每个字典对应一种营养成分:

  1. In [259]: db[0].keys()
  2. Out[259]:
  3. [u'portions',
  4. u'description',
  5. u'tags',
  6. u'nutrients',
  7. u'group',
  8. u'id',
  9. u'manufacturer']
  10. In [260]: db[0]['nutrients'][0]
  11. Out[260]:
  12. {u'description': u'Protein',
  13. u'group': u'Composition',
  14. u'units': u'g',
  15. u'value': 25.18}
  16.  
  17. In [261]: nutrients = DataFrame(db[0]['nutrients'])
  18.  
  19. In [262]: nutrients[:7]
  20. Out[262]:
  21. description group units value
  22. 0 Protein Composition g 25.18
  23. 1 Total lipid (fat) Composition g 29.20
  24. 2 Carbohydrate, by difference Composition g 3.06
  25. 3 Ash Other g 3.28
  26. 4 Energy Energy kcal 376.00
  27. 5 Water Composition g 39.28
  28. 6 Energy Energy kJ 1573.00

在将字典列表转换为DataFrame时,可以只抽取其中的一部分字段。这里,我们将取出食物的名称、分类、编号以及制造商等信息:

  1. In [263]: info_keys = ['description', 'group', 'id', 'manufacturer']
  2.  
  3. In [264]: info = DataFrame(db, columns=info_keys)
  4.  
  5. In [265]: info[:5]
  6. Out[265]:
  7. description group id manufacturer
  8. 0 Cheese, caraway Dairy and Egg Products 1008
  9. 1 Cheese, cheddar Dairy and Egg Products 1009
  10. 2 Cheese, edam Dairy and Egg Products 1018
  11. 3 Cheese, feta Dairy and Egg Products 1019
  12. 4 Cheese, mozzarella, part skim milk Dairy and Egg Products 1028
  13.  
  14. In [266]: info
  15. Out[266]:
  16. <class 'pandas.core.frame.DataFrame'>
  17. Int64Index: 6636 entries, 0 to 6635
  18. Data columns:
  19. description 6636 non-null values
  20. group 6636 non-null values
  21. id 6636 non-null values
  22. manufacturer 5195 non-null values
  23. dtypes: int64(1), object(3)

通过value_counts,你可以查看食物类别的分布情况:

  1. In [267]: pd.value_counts(info.group)[:10]
  2. Out[267]:
  3. Vegetables and Vegetable Products 812
  4. Beef Products 618
  5. Baked Products 496
  6. Breakfast Cereals 403
  7. Legumes and Legume Products 365
  8. Fast Foods 365
  9. Lamb, Veal, and Game Products 345
  10. Sweets 341
  11. Pork Products 328
  12. Fruits and Fruit Juices 328

现在,为了对全部营养数据做一些分析,最简单的办法是将所有食物的营养成分整合到一个大表中。我们分几个步骤来实现该目的。首先,将各食物的营养成分列表转换为一个DataFrame,并添加一个表示编号的列,然后将该DataFrame添加到一个列表中。最后通过concat将这些东西连接起来就可以了:

  1. nutrients = []
  2.  
  3. for rec in db:
  4. fnuts = DataFrame(rec['nutrients'])
  5. fnuts['id'] = rec['id']
  6. nutrients.append(fnuts)
  7.  
  8. nutrients = pd.concat(nutrients, ignore_index=True)

如果一切顺利的话,nutrients应该是下面这样的:

  1. In [269]: nutrients
  2. Out[269]:
  3. <class 'pandas.core.frame.DataFrame'>
  4. Int64Index: 389355 entries, 0 to 389354
  5. Data columns:
  6. description 389355 non-null values
  7. group 389355 non-null values
  8. units 389355 non-null values
  9. value 389355 non-null values
  10. id 389355 non-null values
  11. dtypes: float64(1), int64(1), object(3)

我发现这个DataFrame中无论如何都会有一些重复项,所以直接丢弃就可以了:

  1. In [270]: nutrients.duplicated().sum()
  2. Out[270]: 14179
  3.  
  4. In [271]: nutrients = nutrients.drop_duplicates()

由于两个DataFrame对象中都有"group"和"description",所以为了明确到底谁是谁,我们需要对它们进行重命名:

  1. In [272]: col_mapping = {'description' : 'food',
  2. ...: 'group' : 'fgroup'}
  3.  
  4. In [273]: info = info.rename(columns=col_mapping, copy=False)
  5.  
  6. In [274]: info
  7. Out[274]:
  8. <class 'pandas.core.frame.DataFrame'>
  9. Int64Index: 6636 entries, 0 to 6635
  10. Data columns:
  11. food 6636 non-null values
  12. fgroup 6636 non-null values
  13. id 6636 non-null values
  14. manufacturer 5195 non-null values
  15. dtypes: int64(1), object(3)
  16.  
  17. In [275]: col_mapping = {'description' : 'nutrient',
  18. ...: 'group' : 'nutgroup'}
  19.  
  20. In [276]: nutrients = nutrients.rename(columns=col_mapping, copy=False)
  21.  
  22. In [277]: nutrients
  23. Out[277]:
  24. <class 'pandas.core.frame.DataFrame'>
  25. Int64Index: 375176 entries, 0 to 389354
  26. Data columns:
  27. nutrient 375176 non-null values
  28. nutgroup 375176 non-null values
  29. units 375176 non-null values
  30. value 375176 non-null values
  31. id 375176 non-null values
  32. dtypes: float64(1), int64(1), object(3)

做完这些事情之后,就可以将info跟nutrients合并起来:

  1. In [278]: ndata = pd.merge(nutrients, info, on='id', how='outer')
  2.  
  3. In [279]: ndata
  4. Out[279]:
  5. <class 'pandas.core.frame.DataFrame'>
  6. Int64Index: 375176 entries, 0 to 375175
  7. Data columns:
  8. nutrient 375176 non-null values
  9. nutgroup 375176 non-null values
  10. units 375176 non-null values
  11. value 375176 non-null values
  12. id 375176 non-null values
  13. food 375176 non-null values
  14. fgroup 375176 non-null values
  15. manufacturer 293054 non-null values
  16. dtypes: float64(1), int64(1), object(6)
  17.  
  18. In [280]: ndata.ix[30000]
  19. Out[280]:
  20. nutrient Folic acid
  21. nutgroup Vitamins
  22. units mcg
  23. value 0
  24. id 5658
  25. food Ostrich, top loin, cooked
  26. fgroup Poultry Products
  27. manufacturer
  28. Name: 30000

接下来的两章中将介绍切片和切块、聚合、图形化方面的工具,所以在你掌握了那些方法之后可以再用这个数据集来练练手。比如说,我们可以根据食物分类和营养类型画出一张中位值图(如图7-1所示):

示例:USDA食品数据库 - 图1

图7-1:根据营养分类得出的锌中位值

  1. In [281]: result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)
  2.  
  3. In [282]: result['Zinc, Zn'].order().plot(kind='barh')

只要稍微动一动脑子,就可以发现各营养成分最为丰富的食物是什么了:

  1. by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])
  2.  
  3. get_maximum = lambda x: x.xs(x.value.idxmax())
  4. get_minimum = lambda x: x.xs(x.value.idxmin())
  5.  
  6. max_foods = by_nutrient.apply(get_maximum)[['value', 'food']]
  7.  
  8. # 让food小一点
  9. max_foods.food = max_foods.food.str[:50]

由于得到的DataFrame很大,所以不方便在书里面全部打印出来。这里只给出"Amino Acids"营养分组:

  1. In [284]: max_foods.ix['Amino Acids']['food']
  2. Out[284]:
  3. nutrient
  4. Alanine Gelatins, dry powder, unsweetened
  5. Arginine Seeds, sesame flour, low-fat
  6. Aspartic acid Soy protein isolate
  7. Cystine Seeds, cottonseed flour, low fat (glandless)
  8. Glutamic acid Soy protein isolate
  9. Glycine Gelatins, dry powder, unsweetened
  10. Histidine Whale, beluga, meat, dried (Alaska Native)
  11. Hydroxyproline KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINAL R
  12. Isoleucine Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
  13. Leucine Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
  14. Lysine Seal, bearded (Oogruk), meat, dried (Alaska Nativ
  15. Methionine Fish, cod, Atlantic, dried and salted
  16. Phenylalanine Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
  17. Proline Gelatins, dry powder, unsweetened
  18. Serine Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
  19. Threonine Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
  20. Tryptophan Sea lion, Steller, meat with fat (Alaska Native)
  21. Tyrosine Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
  22. Valine Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
  23. Name: food