3.1 Why Use MapReduce


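MapReduce is popular for good reason: it is simple, easy to implement, and highly scalable. With it you can easily write programs that run on many hosts at once. You can write the Map or Reduce program in languages other than Java, such as Ruby, Python, PHP, and C++. You can also run the same program on any cluster that has Hadoop installed, no matter how many hosts the cluster contains. MapReduce suits very large data sets because the data is processed by many hosts in parallel, which usually makes the job finish faster.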

Consider an example.

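Citation analysis is an important part of evaluating the quality of a paper. This example covers only its simplest piece: counting how many times each paper is cited. Suppose there are a great many papers (on the order of millions), and the reference list of each paper looks like this: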

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.

Samuel Brody and Noemie Elhadad. 2010. An unsupervised aspect-sentiment model for online reviews. In NAACL'10.

Jaime Carbonell and Jade Goldstein. 1998. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In SIGIR'98, pages 335-336.

Dennis Chong and James N. Druckman. 2010. Identifying frames in political news. In Erik P. Bucy and R. Lance Holbert, editors, Sourcebook for Political Communication Research: Methods, Measures, and Analytical Techniques. Routledge.

Cindy Chung and James W. Pennebaker. 2007. The psychological function of function words. Social Communication: Frontiers of Social Psychology, pages 343-359.

Günes Erkan and Dragomir R. Radev. 2004. Lexrank: graph-based lexical centrality as salience in text summarization. J. Artif. Int. Res., 22(1):457-479.

Stephan Greene and Philip Resnik. 2009. More than words: syntactic packaging and implicit sentiment. In NAACL'09, pages 503-511.

Aria Haghighi and Lucy Vanderwende. 2009. Exploring content models for multi-document summarization. In NAACL'09, pages 362-370.

Sanda Harabagiu, Andrew Hickl, and Finley Lacatusu. 2006. Negation, contrast and contradiction in text processing.

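To complete this count on a single machine, you would first extract the titles of all the papers and store them in a hash table, then traverse every paper, inspect its references, and increment the counts one by one. Because the number of papers is so large, the program has to swap data between memory and disk many times, which inevitably lengthens its running time. In MapReduce, however, this is a problem that a WordCount-style program can solve: each cited title plays the role of a word to be counted.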
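The WordCount idea can be sketched in a few lines of Python. This is only a minimal, single-process illustration, not a Hadoop job: the function names (`map_citations`, `reduce_counts`, `run_local`) are hypothetical, and the title-extraction rule assumes that every reference entry is one string of the form "Authors. Year. Title. Venue", as in the sample list above.

```python
import re
from collections import defaultdict

# Assumption: each entry looks like "Authors. YEAR. Title. Venue...",
# so the title is the text between the year and the next period.
TITLE_RE = re.compile(r"\b(?:19|20)\d{2}\.\s*([^.]+)\.")


def map_citations(reference_entry):
    """Map step: emit (cited title, 1) for one reference entry."""
    match = TITLE_RE.search(reference_entry)
    if match:
        # Lowercase the title so the same paper always maps to the same key.
        yield match.group(1).strip().lower(), 1


def reduce_counts(title, counts):
    """Reduce step: sum all the 1s emitted for a single title."""
    return title, sum(counts)


def run_local(reference_entries):
    """Simulate map -> shuffle -> reduce inside one process."""
    grouped = defaultdict(list)
    for entry in reference_entries:
        for title, one in map_citations(entry):
            grouped[title].append(one)  # shuffle: group values by key
    return dict(reduce_counts(t, c) for t, c in grouped.items())


if __name__ == "__main__":
    # Two hypothetical papers whose reference lists both cite the same work.
    entries = [
        "David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. "
        "Latent dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.",
        "Samuel Brody and Noemie Elhadad. 2010. "
        "An unsupervised aspect-sentiment model for online reviews. In NAACL'10.",
        "David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. "
        "Latent dirichlet allocation. Journal of Machine Learning Research, 3:993-1022.",
    ]
    for title, count in sorted(run_local(entries).items()):
        print(count, title)
```

On a real cluster the grouping done here by `run_local` is the framework's job: Hadoop shuffles the mapper output so that all values for one key reach the same reducer, which is why no single host needs a hash table of every title.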
