示例：USDA食品数据库

示例：USDA食品数据库

美国农业部（USDA）制作了一份有关食物营养信息的数据库。Ashley Williams（一名来自英国的技术牛人）制作了该数据的JSON版（http://ashleyw.co.uk/project/food-nutrient-database）。其中的记录如下所示：

{
  "id": 21441,
  "description": "KENTUCKY FRIED CHICKEN, Fried Chicken, EXTRA CRISPY, Wing, meat and skin with breading",
  "tags": ["KFC"],
  "manufacturer": "Kentucky Fried Chicken",
  "group": "Fast Foods",
  "portions": [
    {
      "amount": 1,
      "unit": "wing, with skin",
      "grams": 68.0
    },
    ...
  ],
  "nutrients": [
    {
      "value": 20.8,
      "units": "g",
      "description": "Protein",
      "group": "Composition"
    },
 
    ...
  ]
}

每种食物都带有若干标识性属性以及两个有关营养成分和分量的列表。这种形式的数据不是很适合分析工作，因此我们需要做一些规整化以使其具有更好用的形式。

从上面列举的那个网址下载并解压数据之后，你可以用任何喜欢的JSON库将其加载到Python中。我用的是Python内置的json模块：

In [256]: import json
 
In [257]: db = json.load(open('ch07/foods-2011-10-03.json'))
 
In [258]: len(db)
Out[258]: 6636

db中的每个条目都是一个含有某种食物全部数据的字典。nutrients字段是一个字典列表，其中的每个字典对应一种营养成分：

In [259]: db[0].keys()
Out[259]:
[u'portions',
 u'description',
 u'tags',
 u'nutrients',
 u'group',
 u'id',
 u'manufacturer']
In [260]: db[0]['nutrients'][0]
Out[260]:
{u'description': u'Protein',
 u'group': u'Composition',
 u'units': u'g',
 u'value': 25.18}
 
In [261]: nutrients = DataFrame(db[0]['nutrients'])
 
In [262]: nutrients[:7]
Out[262]:
                   description        group units    value
0                      Protein  Composition     g    25.18
1            Total lipid (fat)  Composition     g    29.20
2  Carbohydrate, by difference  Composition     g     3.06
3                          Ash        Other     g     3.28
4                       Energy       Energy  kcal   376.00
5                        Water  Composition     g    39.28
6                       Energy       Energy    kJ  1573.00

在将字典列表转换为DataFrame时，可以只抽取其中的一部分字段。这里，我们将取出食物的名称、分类、编号以及制造商等信息：

In [263]: info_keys = ['description', 'group', 'id', 'manufacturer']
 
In [264]: info = DataFrame(db, columns=info_keys)
 
In [265]: info[:5]
Out[265]:
                          description                   group    id manufacturer
0                     Cheese, caraway  Dairy and Egg Products  1008
1                     Cheese, cheddar  Dairy and Egg Products  1009
2                        Cheese, edam  Dairy and Egg Products  1018
3                        Cheese, feta  Dairy and Egg Products  1019
4  Cheese, mozzarella, part skim milk  Dairy and Egg Products  1028
 
In [266]: info
Out[266]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6636 entries, 0 to 6635
Data columns:
description     6636  non-null values
group           6636  non-null values
id              6636  non-null values
manufacturer    5195  non-null values
dtypes: int64(1), object(3)

通过value_counts，你可以查看食物类别的分布情况：

In [267]: pd.value_counts(info.group)[:10]
Out[267]:
Vegetables and Vegetable Products    812
Beef Products                        618
Baked Products                       496
Breakfast Cereals                    403
Legumes and Legume Products          365
Fast Foods                           365
Lamb, Veal, and Game Products        345
Sweets                               341
Pork Products                        328
Fruits and Fruit Juices              328

现在，为了对全部营养数据做一些分析，最简单的办法是将所有食物的营养成分整合到一个大表中。我们分几个步骤来实现该目的。首先，将各食物的营养成分列表转换为一个DataFrame，并添加一个表示编号的列，然后将该DataFrame添加到一个列表中。最后通过concat将这些东西连接起来就可以了：

nutrients = []
 
for rec in db:
    fnuts = DataFrame(rec['nutrients'])
    fnuts['id'] = rec['id']
    nutrients.append(fnuts)
 
nutrients = pd.concat(nutrients, ignore_index=True)

如果一切顺利的话，nutrients应该是下面这样的：

In [269]: nutrients
Out[269]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 389355 entries, 0 to 389354
Data columns:
description    389355  non-null values
group          389355  non-null values
units          389355  non-null values
value          389355  non-null values
id             389355  non-null values
dtypes: float64(1), int64(1), object(3)

我发现这个DataFrame中无论如何都会有一些重复项，所以直接丢弃就可以了：

In [270]: nutrients.duplicated().sum()
Out[270]: 14179
 
In [271]: nutrients = nutrients.drop_duplicates()

由于两个DataFrame对象中都有"group"和"description"，所以为了明确到底谁是谁，我们需要对它们进行重命名：

In [272]: col_mapping = {'description' : 'food',
     ...:                'group'       : 'fgroup'}
 
In [273]: info = info.rename(columns=col_mapping, copy=False)
 
In [274]: info
Out[274]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6636 entries, 0 to 6635
Data columns:
food            6636  non-null values
fgroup          6636  non-null values
id              6636  non-null values
manufacturer    5195  non-null values
dtypes: int64(1), object(3)
 
In [275]: col_mapping = {'description' : 'nutrient',
     ...:                'group' : 'nutgroup'}
 
In [276]: nutrients = nutrients.rename(columns=col_mapping, copy=False)
 
In [277]: nutrients
Out[277]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 375176 entries, 0 to 389354
Data columns:
nutrient    375176  non-null values
nutgroup    375176  non-null values
units       375176  non-null values
value       375176  non-null values
id          375176  non-null values
dtypes: float64(1), int64(1), object(3)

做完这些事情之后，就可以将info跟nutrients合并起来：

In [278]: ndata = pd.merge(nutrients, info, on='id', how='outer')
 
In [279]: ndata
Out[279]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 375176 entries, 0 to 375175
Data columns:
nutrient        375176  non-null values
nutgroup        375176  non-null values
units           375176  non-null values
value           375176  non-null values
id              375176  non-null values
food            375176  non-null values
fgroup          375176  non-null values
manufacturer    293054  non-null values
dtypes: float64(1), int64(1), object(6)
 
In [280]: ndata.ix[30000]
Out[280]:
nutrient                       Folic acid
nutgroup                         Vitamins
units                                 mcg
value                                   0
id                                   5658
food            Ostrich, top loin, cooked
fgroup                   Poultry Products
manufacturer
Name: 30000

接下来的两章中将介绍切片和切块、聚合、图形化方面的工具，所以在你掌握了那些方法之后可以再用这个数据集来练练手。比如说，我们可以根据食物分类和营养类型画出一张中位值图（如图7-1所示）：

示例：USDA食品数据库 - 图1

图7-1：根据营养分类得出的锌中位值

In [281]: result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)
 
In [282]: result['Zinc, Zn'].order().plot(kind='barh')

只要稍微动一动脑子，就可以发现各营养成分最为丰富的食物是什么了：

by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])
 
get_maximum = lambda x: x.xs(x.value.idxmax())
get_minimum = lambda x: x.xs(x.value.idxmin())
 
max_foods = by_nutrient.apply(get_maximum)[['value', 'food']]
 
# 让food小一点
max_foods.food = max_foods.food.str[:50]

由于得到的DataFrame很大，所以不方便在书里面全部打印出来。这里只给出"Amino Acids"营养分组：

In [284]: max_foods.ix['Amino Acids']['food']
Out[284]:
nutrient
Alanine                           Gelatins, dry powder, unsweetened
Arginine                               Seeds, sesame flour, low-fat
Aspartic acid                                   Soy protein isolate
Cystine                Seeds, cottonseed flour, low fat (glandless)
Glutamic acid                                   Soy protein isolate
Glycine                           Gelatins, dry powder, unsweetened
Histidine                Whale, beluga, meat, dried (Alaska Native)
Hydroxyproline    KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINAL R
Isoleucine        Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
Leucine           Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
Lysine            Seal, bearded (Oogruk), meat, dried (Alaska Nativ
Methionine                    Fish, cod, Atlantic, dried and salted
Phenylalanine     Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
Proline                           Gelatins, dry powder, unsweetened
Serine            Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
Threonine         Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
Tryptophan         Sea lion, Steller, meat with fat (Alaska Native)
Tyrosine          Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
Valine            Soy protein isolate, PROTEIN TECHNOLOGIES INTERNA
Name: food