14.2.2 Downloading, Installing, and Configuring Pig

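At the time of writing, the latest Pig release is 0.10.0. Earlier releases such as 0.9.2 and 0.8.1 are also available, and you can download whichever version you need from the official Apache website. This book uses Pig 0.10.0, whose installation package can be downloaded from: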


http://www.apache.org/dyn/closer.cgi/pig


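After the installation package has been downloaded, extract it with the command tar -xvf pig-*.*.*.tar.gz. Pig can be placed anywhere on the system; it only needs the corresponding environment variables to be configured. We nevertheless recommend placing Pig under the Hadoop directory, which makes later operations more convenient.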
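
The extraction step can be sketched as a small shell function. This is only an illustration: the function name extract_pig and the paths in the example call are assumptions, not part of the Pig distribution. It uses tar -xzf with -C to unpack a gzip-compressed tarball into a chosen directory, which has the same effect as running tar -xvf inside that directory.

```shell
# Hypothetical helper: unpack a Pig tarball into a target directory.
# $1: path to the downloaded pig-*.tar.gz, $2: destination directory.
extract_pig() {
    tarball="$1"
    dest="$2"
    if [ ! -f "$tarball" ]; then
        echo "Tarball not found: $tarball" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # -x extract, -z gunzip, -f archive file, -C change to target directory
    tar -xzf "$tarball" -C "$dest"
}

# Example call (both paths are assumptions; adjust to your system):
# extract_pig "$HOME/pig-0.10.0.tar.gz" "$HOME/hadoop-1.0.1"
```

With the example call above, Pig would end up in a pig-0.10.0 directory under the Hadoop directory, as recommended.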

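Once Pig has been extracted, set its environment variables. There are several ways to do this, and you can choose whichever suits you; here we modify the profile file. Open “/etc/profile”, add the following two statements, then save and close the file. The settings take effect after a system restart (logging in again or running “source /etc/profile” in the current shell also loads them):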


export PIG_HOME=/<path-to-pigDir>

export PATH=$PIG_HOME/bin:$PIG_HOME/conf:$PATH
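
If you prefer to script this change rather than edit the file by hand, the following sketch appends the two statements to a profile file only when PIG_HOME is not already exported there. The function name add_pig_env and the example install path are assumptions; writing to /etc/profile itself requires root privileges.

```shell
# Hypothetical helper: append the Pig environment variables to a profile file, once.
# $1: profile file to modify, $2: directory where Pig was unpacked.
add_pig_env() {
    profile="$1"
    pig_home="$2"
    # Do nothing if PIG_HOME is already exported in this file.
    if grep -q '^export PIG_HOME=' "$profile" 2>/dev/null; then
        echo "PIG_HOME already set in $profile"
        return 0
    fi
    {
        echo "export PIG_HOME=$pig_home"
        # Single quotes keep the variables unexpanded until the profile is loaded.
        echo 'export PATH=$PIG_HOME/bin:$PIG_HOME/conf:$PATH'
    } >> "$profile"
}

# Example call (the install path is an assumption):
# add_pig_env /etc/profile "$HOME/hadoop-1.0.1/pig-0.10.0"
```

Running the function a second time leaves the file unchanged, so it is safe to call from a setup script more than once.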


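Once the environment variables are in effect, run the “pig -help” command to check whether Pig was installed successfully. A successful installation prints output like the following: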


hadoop@master:~/hadoop-1.0.1/pig-0.10.0$ pig -help
Apache Pig version 0.10.0 (r1328203)
compiled Apr 19 2012, 22:54:12

USAGE: Pig [options] [-] : Run interactively in grunt shell.
       Pig [options] -e[xecute] cmd [cmd ...] : Run cmd(s).
       Pig [options] [-f[ile]] file : Run cmds found in file.
  options include:
    -4, -log4jconf - Log4j configuration file, overrides log conf
    -b, -brief - Brief logging (no timestamps)
    -c, -check - Syntax check
    -d, -debug - Debug level, INFO is default
    -e, -execute - Commands to execute (within quotes)
    -f, -file - Path to the script to execute
    -g, -embedded - ScriptEngine classname or keyword for the ScriptEngine
    -h, -help - Display this message. You can specify topic to get help for that topic.
        properties is the only topic currently supported: -h properties.
    -i, -version - Display version information
    -l, -logfile - Path to client side log file; default is current working directory.
    -m, -param_file - Path to the parameter file
    -p, -param - Key value pair of the form param=val
    -r, -dryrun - Produces script with substituted parameters. Script is not executed.
    -t, -optimizer_off - Turn optimizations off. The following values are supported:
            SplitFilter - Split filter conditions
            PushUpFilter - Filter as early as possible
            MergeFilter - Merge filter conditions
            PushDownForeachFlatten - Join or explode as late as possible
            LimitOptimizer - Limit as early as possible
            ColumnMapKeyPrune - Remove unused data
            AddForEach - Add ForEach to remove unneeded columns
            MergeForEach - Merge adjacent ForEach
            GroupByConstParallelSetter - Force parallel 1 for "group all" statement
            All - Disable all optimizations
        All optimizations listed here are enabled by default. Optimization values
        are case insensitive.
    -v, -verbose - Print all error messages to screen
    -w, -warning - Turn warning logging on; also turns warning aggregation off
    -x, -exectype - Set execution mode: local|mapreduce, default is mapreduce.
    -F, -stop_on_failure - Aborts execution on the first failed job; default is off
    -M, -no_multiquery - Turn multiquery optimization off; default is on
    -P, -propertyFile - Path to property file
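
The installation check can also be scripted. The sketch below is one possible approach, not something shipped with Pig: the function name check_pig is an assumption. It first confirms that the pig command is reachable through PATH, then prints the version with the -version option listed in the help output above.

```shell
# Hypothetical helper: confirm that Pig is on PATH and print its version.
# $1: command name to look for (defaults to "pig").
check_pig() {
    pig_cmd="${1:-pig}"
    if ! command -v "$pig_cmd" >/dev/null 2>&1; then
        echo "$pig_cmd not found on PATH; check PIG_HOME and PATH" >&2
        return 1
    fi
    # -version appears in the option list as "-i, -version".
    "$pig_cmd" -version
}

# Example call:
# check_pig
```

If the function reports that pig cannot be found, the usual cause is that the profile changes have not yet been loaded into the current shell.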