Using HTML and Web APIs

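Many websites offer public APIs that provide data as JSON or in some other format. There are several ways to access these APIs from Python; one simple and recommended option is the requests package (http://docs.python-requests.org). To search Twitter for "python pandas", we can send an HTTP GET request as follows: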

In [944]: import requests
 
In [945]: url = 'http://search.twitter.com/search.json?q=python%20pandas'
 
In [946]: resp = requests.get(url)
 
In [947]: resp
Out[947]: <Response [200]>
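
The %20 in the URL above is the percent-encoded form of a space. As a minimal sketch of how such a URL can be assembled rather than typed by hand, the following uses only the standard library; the helper name build_search_url is hypothetical and does not appear in the original text.

```python
from urllib.parse import quote, urlencode

# Hypothetical helper (an assumption for illustration): builds a search URL
# by percent-encoding the query instead of writing %20 by hand.
def build_search_url(base_url, query):
    # quote_via=quote encodes a space as %20; the default (quote_plus) uses '+'
    return base_url + '?' + urlencode({'q': query}, quote_via=quote)

url = build_search_url('http://search.twitter.com/search.json', 'python pandas')
print(url)
```

When using requests, the same effect can usually be had by passing a dictionary through the params argument of requests.get, which performs the encoding itself.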

The text attribute of the Response object contains the body returned for the GET request. Many web APIs return a JSON string, which has to be loaded into a Python object:

In [948]: import json
 
In [949]: data = json.loads(resp.text)
 
In [950]: data.keys()
Out[950]:
[u'next_page',
u'completed_in',
u'max_id_str',
u'since_id_str',
u'refresh_url',
u'results',
u'since_id',
u'results_per_page',
u'query',
u'max_id',
u'page']
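
To make the loading step concrete without a network call, here is a small sketch that parses a made-up JSON string in place of resp.text. Only the key names results, query, and page are taken from the output above; the values are invented for illustration.

```python
import json

# Invented payload standing in for resp.text (values are assumptions)
sample_text = '{"query": "python+pandas", "page": 1, "results": [{"id": 1, "text": "hello"}]}'

data = json.loads(sample_text)      # JSON object -> Python dict

print(sorted(data.keys()))          # top-level field names
print(type(data['results']))        # a JSON array becomes a Python list
print(data['results'][0]['text'])   # nested objects become nested dicts
```

Recent versions of requests also provide a resp.json() method that performs the same parsing in one call.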

The results field of the response contains a list of tweets, each of which is represented as a Python dict like the following:

{u'created_at': u'Mon, 25 Jun 2012 17:50:33 +0000',
u'from_user': u'wesmckinn',
u'from_user_id': 115494880,
u'from_user_id_str': u'115494880',
u'from_user_name': u'Wes McKinney',
u'geo': None,
u'id': 217313849177686018,
u'id_str': u'217313849177686018',
u'iso_language_code': u'pt',
u'metadata': {u'result_type': u'recent'},
u'source': u'<a href="http://twitter.com/">web</a>',
u'text': u'Lunchtime pandas-fu http://t.co/SI70xZZQ #pydata',
u'to_user': None,
u'to_user_id': 0,
u'to_user_id_str': u'0',
u'to_user_name': None}

We define a list of the tweet fields we are interested in, then pass the results list to DataFrame (imported from pandas):

In [951]: tweet_fields = ['created_at', 'from_user', 'id', 'text']
 
In [952]: tweets = DataFrame(data['results'], columns=tweet_fields)
 
In [953]: tweets
Out[953]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns:
created_at    15  non-null values
from_user     15  non-null values
id            15  non-null values
text          15  non-null values
dtypes: int64(1), object(3)

Each row of the DataFrame now holds the data from a single tweet:

In [121]: tweets.ix[7]
Out[121]:
created_at                  Thu, 23 Jul 2012 09:54:00 +0000
from_user                                          deblike
id                                      227419585803059201
text          pandas: powerful Python data analysis toolkit
Name: 7

With a little more effort, you can build higher-level interfaces to common web APIs that return DataFrame objects ready for analysis.
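
One possible shape for such an interface is sketched below. The function name results_to_frame and its results_key parameter are hypothetical, and the sample records simply reuse field values shown earlier in this section. Row access uses .loc, since the .ix indexer seen in the session above has been removed from current pandas releases.

```python
import json

import pandas as pd

# Hypothetical helper (an assumption, not an API from the original text):
# converts the JSON text of a search response into a DataFrame that keeps
# only the requested fields.
def results_to_frame(json_text, fields, results_key='results'):
    payload = json.loads(json_text)
    records = payload.get(results_key, [])
    # columns= selects and orders the fields; absent fields become NaN
    return pd.DataFrame(records, columns=fields)

# Sample data in the shape of the response shown earlier
sample_text = json.dumps({
    'results': [
        {'created_at': 'Mon, 25 Jun 2012 17:50:33 +0000',
         'from_user': 'wesmckinn',
         'id': 217313849177686018,
         'text': 'Lunchtime pandas-fu http://t.co/SI70xZZQ #pydata',
         'geo': None},
        {'created_at': 'Thu, 23 Jul 2012 09:54:00 +0000',
         'from_user': 'deblike',
         'id': 227419585803059201,
         'text': 'pandas: powerful Python data analysis toolkit',
         'geo': None},
    ]
})

tweets = results_to_frame(sample_text, ['created_at', 'from_user', 'id', 'text'])
print(tweets.loc[0])
```

With a live API, the first argument would be the text attribute of the response returned by requests.get.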