附录Ｄ　练习答案

附录Ｄ　练习答案

练习1-1

如果你遇到困难，请咨询你的系统管理员，或者在R-help邮件列表上提问。

练习1-2

使用冒号操作符来创建一个向量，然后使用sd函数：

sd(0:100)

练习1-3

输入demo(plotmath)并按下回车键，或者点击图形看看它能提供什么。

练习2-1

简单的除法就可以得到倒数，然后使用atan计算反正切（arc）：

`atan( 1 / 1:1000 )

使用<-来给变量赋值：

 x <- ： - 1:1000 
 y <- atan(1/x)
 z <- 1 /tan(y)

练习2-2

对于比较两个应该包含相同数字的向量，通常all.equal就是你所需要的：

x == z 
identical(x, z) 
all.equal(x, z) 
all.equal(x, z, tolerance = 0)

练习2-3

包含在以下三个向量的精确值可能不同：

true_and_missing <- c(NA, TRUE, NA) 
false_and_missing <- c(FALSE, FALSE, NA) 
mixed <- c(TRUE, FALSE, NA) 
any(true_and_missing) 
any(false_and_missing) 
any(mixed) 
all(true_and_missing) 
all(false_and_missing) 
all(mixed)

练习3-1

class(Inf) 
class(NA) 
class(NaN) 
class("")

把class代替为typeof、mode和storage.mode重复以上例子。

练习3-2

pets <- factor(sample( 
   c("dog", "cat", "hamster", "goldfish"), 
   1000, 
   replace = TRUE 
)) 
head(pets) 
summary(pets)

建议转换为因子，但这并非强制。

练习3-3

carrot <- 1 
potato <- 2 
swede <- 3
ls(pattern = "a")

你定义的蔬菜可能会有所不同。

练习4-1

创建序列有几种方法，其中包括冒号运算符。此解决方案使用seq_len和seq_along：

n <- seq_len(20) 
triangular <- n * (n + 1) / 2 
names(triangular) <- letters[seq_along(n)] 
triangular[c("a", "e", "i", "o")]

练习4-2

同样地，创建从-11序列为0再到11的序列有许多种不同的方法。abs能计算一个数的绝对值：

diag(abs(seq.int(-11, 11)))

练习4-3

威尔金森（Wilkinson）矩阵有一个有趣的属性，它们的大部分特征值几乎相等。21x21矩阵是最经常使用的：

identity_20_by_21 <- diag(rep.int(1, 20), 20, 21) 
below_the_diagonal <- rbind(0, identity_20_by_21) 
identity_21_by_20 <- diag(rep.int(1, 20), 21, 20) 
above_the_diagonal <- cbind(0, identity_21_by_20) 
on_the_diagonal <- diag(abs(seq.int(-10, 10))) 
wilkinson_21 <- below_the_diagonal + above_the_diagonal + on_the_diagonal
 eigen(wilkinson_21)$values

练习5-1

为简单起见，这里我已经手动指定哪些数字是平方。你能想出一个方法来自动确定一个数是否为平方数吗？

list( 
  "0 to 9" = c(0, 1, 4, 9), 
  "10 to 19" = 16, 
  "20 to 29" = 25, 
  "30 to 39" = 36, 
  "40 to 49" = 49, 
  "50 to 59" = NULL, 
  "60 to 69" = 64, 
  "70 to 79" = NULL, 
  "80 to 89" = 81, 
  "90 to 99" = NULL
)

我们还可以自动计算平方数。以下的cut函数将创建不同的范围组：0〜9、10〜19，依此类推，然后返回一个向量，指出每个平方数在哪一组。然后split函数把平方数向量分开成一个列表，其中每个元素包含所对应组里的值：

x <- 0:99 
sqrt_x <- sqrt(x) 
is_square_number <- sqrt_x == floor(sqrt_x) 
square_numbers <- x[is_square_number] 
groups <- cut( 
   square_numbers, 
   seq.int(min(x), max(x), 10), 
   include.lowest = TRUE, 
   right = FALSE 
) 
split(square_numbers, groups)

练习5-2

有许多种不同的方法可以获得iris数据集的数值子集。实验一下吧！

iris_numeric <- iris[, 1:4] 
colMeans(iris_numeric)

练习5-3

这个解决方案是许多对数据框取索引和求子集的方法之一：

beaver1$id <- 1 
beaver2$id <- 2 
both_beavers <- rbind(beaver1, beaver2) 
subset(both_beavers, as.logical(activ))

练习6-1

我们可以使用new.env创建一个环境。之后的语法与列表一样：

multiples_of_pi <- new.env() 
multiples_of_pi[["two_pi"]] <- 2 * pi 
multiples_of_pi$three_pi <- 3 * pi 
assign("four_pi", 4 * pi, multiples_of_pi) 
ls(multiples_of_pi)
## [1] "four_pi" "three_pi" "two_pi"

练习6-2

这比你想象的要更容易。我们只需要在除以二时使用模运算符来获得余数，然后一切就能自动工作，也包括了非无限值：

is_even <- function(x) (x %% 2) == 0 
is_even(c(-5:5, Inf, -Inf, NA, NaN))
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE 
## [12] NA NA NA NA

练习6-3

同样，该函数只需要一行即可。formals和body函数能完成最难的那部分工作，而我们只需要返回其结果列表：

args_and_body <- function(fn) 
{ 
  list(arguments = formals(fn), body = body(fn)) 
} 
args_and_body(var)
## $arguments 
## $arguments$x 
## 
## 
## $arguments$y 
## NULL 
## 
## $arguments$na.rm 
## [1] FALSE 
## 
## $arguments$use 
## 
## 
## 
## $body 
## { 
##     if (missing(use)) 
##        use <- if (na.rm) 
##           "na.or.complete" 
##        else "everything" 
##     na.method <- pmatch(use, c("all.obs", "complete.obs", 
##        "pairwise.complete.obs", "everything", "na.or.complete")) 
##     if (is.na(na.method)) 
##        stop("invalid 'use' argument") 
##     if (is.data.frame(x)) 
##        x <- as.matrix(x) 
##     else stopifnot(is.atomic(x)) 
##     if (is.data.frame(y)) 
##        y <- as.matrix(y) 
##     else stopifnot(is.atomic(y)) 
##     .Call(C_cov, x, y, na.method, FALSE) 
## }
args_and_body(alarm)
## $arguments 
## NULL 
## 
## $body 
## { 
##    cat("\a") 
##    flush.console() 
## }

练习7-1

formatC(pi, digits = 16)
## [1] "3.141592653589793"

如果你要把formatC替换成format或prettyNum这也适用。

练习7-2

调用strsplit并使用正则表达式来匹配空格和标点符号。记得？能匹配一个字符：

# 使用一个可选的逗号，必须的空格，可以的连字符以及一个可选的空格分开strsplit(x, ",? -? ?")
## [[1]] 
## [1] "Swan" "swam" "over" "the" "pond" "Swim" "swan" "swim!" 
## 
## [[2]] 
## [1] "Swan" "swam" "back" "again" "Well" "swum" "swan!"
# 或者把最后的连字符的空格括起来
strsplit(x, ",? (- )?")
## [[1]] 
## [1] "Swan" "swam" "over" "the" "pond" "Swim" "swan" "swim!" 
## 
## [[2]] 
## [1] "Swan" "swam" "back" "again" "Well" "swum" "swan!"

练习7-3

默认情况下，cut所创建的间隔只包括上限而没有下限。为了获得最小值（3）的计数，我们需要在它下面（例如，在2）加入一个额外的断点。尝试使用plyr包中的count替换掉table：

scores <- three_d6(1000) 
bonuses <- cut( 
   scores, 
   c(2, 3, 5, 8, 12, 15, 17, 18), 
   labels = -3:3 
) 
table(bonuses)
## bonuses 
## -3 -2 -1 0 1 2 3 
## 4 39 186 486 233 47 5

练习8-1

使用if区分其条件。运算符%in％能使条件的检查更容易：

score <- two_d6(1) 
if(score %in% c(2, 3, 12)) 
{ 
   game_status <- FALSE 
   point <- NA 
} else if(score %in% c(7, 11)) 
{ 
   game_status <- TRUE 
   point <- NA 
} else 
{ 
   game_status <- NA 
   point <- score 
}

练习8-2

因为我们不明确代码所要执行的次数，所以使用repeat循环是最合适的：

if(is.na(game_status)) 
{ 
   repeat({ 
      score <- two_d6(1) 
      if(score == 7) 
      { 
         game_status <- FALSE 
         break 
      } else 
      if(score == point) 
     { 
         game_status <- TRUE 
         break 
      } 
   }) 
}

练习8-3

因为我们知道需要执行的循环（从最短单词的长度到最长单词的长度）的次数，for循环是最合适的：

nchar_sea_shells <- nchar(sea_shells) 
for(i in min(nchar_sea_shells):max(nchar_sea_shells)) 
{ 
   message("These words have ", i, " letters:")      　　
   print(toString(unique(sea_shells[nchar_sea_shells == i]))) 
}
## These words have 2 letters:
## [1] "by, So, if, on"
## These words have 3 letters:
## [1] "She, sea, the, The, she, are, I'm"
## These words have 4 letters:
## [1] "sure"
## These words have 5 letters:
## [1] "sells"
## These words have 6 letters:
## [1] "shells, surely"
## These words have 7 letters:
## [1] ""
## These words have 8 letters:
## [1] "seashore"
## These words have 9 letters:
## [1] "seashells"

练习9-1

由于输入的是一个列表，输出的始终相同而且长度已知，所以vapply是最好的选择：

vapply(wayans, length, integer(1))
## Dwayne Kim Keenen Ivory     Damon       Kim     Shawn 
##          0            5         4         0         3 
##     Marlon        Nadia    Elvira    Diedre    Vonnie 
##          2            0         2         5         0

练习9-2

1.我们可以通过使用str、head和class以及阅读帮助页面?state.x77大致感受一下此数据集：

## num [1:50, 1:8] 3615 365 2212 2110 21198 ... 
## - attr(*, "dimnames")=List of 2 
##  ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ... 
##  ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
 
##              Population Income Illiteracy Life Exp Murder HS Grad Frost 
## Alabama      3615   3624      2.1     69.05 15.1   41.3   20 
## Alaska       365    6315      1.5     69.31  11.3  66.7   152 
## Arizona      2212    4530     1.8     70.55  7.8   58.1   15 
## Arkansas     2110    3378     1.9     70.66  10.1  39.9   65 
## California   21198   5114     1.1     71.71  10.3  62.6   20 
## Colorado           2541   4884        0.7   72.06  6.8    63.9   166 
##               Area 
## Alabama      50708 
## Alaska      566432 
## Arizona     113417 
## Arkansas     51945 
## California  156361 
## Colorado    103766
 
 
## [1] "matrix"

2.现在我们知道此数据是一个矩阵，因此apply函数最合适。我们要循环每列，因此我们把2作为维度值传递给它：

## Population    Income  Illiteracy   Life Exp    Murder    HS Grad 
##   4246.420  4435.800       1.170    70.879     7.378     53.108 
##      Frost      Area 
##    104.460 70735.880
## Population    Income  Illiteracy   Life Exp    Murder    HS Grad 
##  4.464e+03 6.145e+02   6.095e-01  1.342e+00 3.692e+00  8.077e+00 
##      Frost      Area 
##  5.198e+01 8.533e+04

练习9-3

tapply是R基本包中的最佳选择。而plyr包中的ddply也能很好地工作：

with(commute_data, tapply(time, mode, quantile, prob = 0.75))
##  bike   bus   car train 
## 63.55 55.62 42.33 36.29
ddply(commute_data, .(mode), summarize, time_p75 = quantile(time, 0.75))
##   mode time_p75 
## 1 bike    63.55 
## 2  bus    55.62 
## 3  car    42.33 
## 4 train   36.29

练习10-1

在R的图形用户界面中，单击Package→“Install package(s)… ”，选择一个镜像，然后选择Hmisc。如果下载或安装失败，则请确保你有互联网接入和写入库目录的权限。你也可以尝试访问CRAN的网站来手动下载。问题中最常见原因是权限受到限制，在这种情况下请与你的系统管理员联系。

练习10-2

install.packages(“lubridate”)

你可能希望指定一个库进行安装。

练习10-3

installed.packages函数能在你的机器上获取软件包的详细信息及其位置（以下的结果可能跟你机器上的位置不同）：

pkgs <- installed.packages() 
table(pkgs[, "LibPath"])
## 
## C:/Program Files/R/R-devel/library D:/R/library 
##                                 29          169

练习11-1

此示例使用strptime解析和使用strtime来进行格式化，除此之外还有其他很多方法：

in_string <- c("1940-07-07", "1940-10-09", "1942-06-18", "1943-02-25") 
(parsed <- strptime(in_string, "%Y-%m-%d"))
## [1] "1940-07-07" "1940-10-09" "1942-06-18" "1943-02-25"
(out_string <- strftime(parsed, "%a %d %b %y"))
## [1] "Sun 07 Jul 40" "Wed 09 Oct 40" "Thu 18 Jun 42" "Thu 25 Feb 43"

练习11-2

?Sys.timezone帮助页面建议了使用以下代码导入时区文件：

tzfile <- file.path(R.home("share"), "zoneinfo", "zone.tab") 
tzones <- read.delim( 
   tzfile, 
   row.names = NULL, 
   header = FALSE, 
   col.names = c("country", "coords", "name", "comments"), 
   as.is = TRUE, 
   fill = TRUE, 
   comment.char = "#" 
)

找到你所在时区的最佳方法取决于你在哪里。从View(tzones)开始，如有需要，使用subset来缩小搜索范围。

练习11-3

你可以通过访问一个基础R包中的POSIXlt日期组件解决此问题，但是使用lubridate的month和day函数会更加清楚：

zodiac_sign <- function(x) 
{ 
   month_x <- month(x, label = TRUE) 
   day_x <- day(x) 
   switch( 
      month_x, 
      Jan = if(day_x < 20) "Capricorn" else "Aquarius", 
      Feb = if(day_x < 19) "Aquarius" else "Pisces", 
      Mar = if(day_x < 21) "Pisces" else "Aries", 
      Apr = if(day_x < 20) "Aries" else "Taurus", 
      May = if(day_x < 21) "Taurus" else "Gemini", 
      Jun = if(day_x < 21) "Gemini" else "Cancer", 
      Jul = if(day_x < 23) "Cancer" else "Leo", 
      Aug = if(day_x < 23) "Leo" else "Virgo", 
      Sep = if(day_x < 23) "Virgo" else "Libra", 
      Oct = if(day_x < 23) "Libra" else "Scorpio", 
      Nov = if(day_x < 22) "Scorpio" else "Sagittarius", 
      Dec = if(day_x < 22) "Sagittarius" else "Capricorn" 
   ) 
} 
#可如下使用 
nicolaus_copernicus_birth_date <- as.Date("1473-02-19")
zodiac_sign(nicolaus_copernicus_birth_date)
## [1] "Pisces"

请注意，switch语句的使用意味着这个函数的实现并非向量化。你可以通过大量使用ifelse语句来实现矢量化，或采取更容易地调用Vectorize(zodiac_sign)的方法。

练习12-1

使用system.file找到该文件并使用read.csv导入数据：

hafu_file <- system.file("extdata", "hafu.csv", package = "learningr") 
hafu_data <- read.csv(hafu_file)

练习12-2

使用xlsx包中的read.xlsx2函数导入数据。该数据在第一个（也是唯一的）工作表，而且最好指定每列的数据的类型：

library(xlsx) 
gonorrhoea_file <- system.file( 
  "extdata", 
   "multi-drug-resistant gonorrhoea infection.xls", 
   package = "learningr" 
) 
gonorrhoea_data <- read.xlsx2( 
   gonorrhoea_file, 
   sheetIndex = 1, 
   colClasses = c("integer", "character", "character", "numeric") 
)

练习12-3

使用RSQLite包的方法有好几种。一次一个步骤，我们可以像这样连接数据库：

library(RSQLite) 
driver <- dbDriver("SQLite") 
db_file <- system.file("extdata", "crabtag.sqlite", package = "learningr") 
conn <- dbConnect(driver, db_file) 
query <- "SELECT * FROM Daylog" 
head(daylog <- dbGetQuery(conn, query))
##   Tag ID Mission Day       Date Max Temp Min Temp Max Depth Min Depth
## 1 A03401           0 08/08/2008   27.734   25.203      0.06     -0.07
## 2 A03401           1 09/08/2008   25.203   23.859      0.06     -0.07
## 3 A03401           2 10/08/2008   24.016   23.500     -0.07     -0.10
## 4 A03401           3 11/08/2008   26.453   23.281     -0.04     -0.10
## 5 A03401           4 12/08/2008   27.047   23.609     -0.10     -0.26
## 6 A03401           5 13/08/2008   24.625   23.438     -0.04     -0.13
##   Batt Volts
## 1       3.06
## 2       3.09
## 3       3.09
## 4       3.09
## 5       3.09
## 6       3.09
dbDisconnect(conn)
## [1] TRUE
dbUnloadDriver(driver)
## [1] TRUE

或者放肆一些，使用在第十二章中定义的函数：

head(daylog <- query_crab_tag_db("SELECT * FROM Daylog"))

或者，我们可以使用dbReadTable函数再进一步简化：

get_table_from_crab_tag_db <- function(tbl) 
{ 
  driver <- dbDriver("SQLite") 
  db_file <- system.file("extdata", "crabtag.sqlite", package = "learningr") 
  conn <- dbConnect(driver, db_file) 
  on.exit( 
    { 
      dbDisconnect(conn) 
      dbUnloadDriver(driver) 
    } 
  ) 
  dbReadTable(conn, tbl) 
} 
head(daylog <- get_table_from_crab_tag_db("Daylog"))
##   Tag.ID Mission.Day       Date Max.Temp Min.Temp Max.Depth Min.Depth
## 1 A03401           0 08/08/2008   27.734   25.203      0.06     -0.07
## 2 A03401           1 09/08/2008   25.203   23.859      0.06     -0.07
## 3 A03401           2 10/08/2008   24.016   23.500     -0.07     -0.10
## 4 A03401           3 11/08/2008   26.453   23.281     -0.04     -0.10
## 5 A03401           4 12/08/2008   27.047   23.609     -0.10     -0.26
## 6 A03401           5 13/08/2008   24.625   23.438     -0.04     -0.13
##   Batt.Volts
## 1       3.06
## 2       3.09
## 3       3.09
## 4       3.09
## 5       3.09
## 6       3.09

练习13-1

1.为了检测问号，使用str_detect：

library(stringr) 
data(hafu, package = "learningr") 
hafu$FathersNationalityIsUncertain <- str_detect(hafu$Father, fixed("?")) 
hafu$MothersNationalityIsUncertain <- str_detect(hafu$Mother, fixed("?"))

2.要替换这些问号，使用str_replace函数：

hafu$Father <- str_replace(hafu$Father, fixed("?"), "") 
hafu$Mother <- str_replace(hafu$Mother, fixed("?"), "")

练习13-2

请确保你已经安装了reshape2包！关键是把测量变量命名为“Father”和“Mother”：

hafu_long <- melt(hafu, measure.vars = c("Father", "Mother"))

练习13-3

我们可以使用基础R包中的函数来完成，例如table函数能为我们计数，sort和head的组合可以使我们能够找到相同的值。把useNA = "always"传递给table意味着NA将始终包含在counts向量中：

top10 <- function(x) 
{ 
   counts <- table(x, useNA = "always") 
   head(sort(counts, decreasing = TRUE), 10) 
} 
top10(hafu$Mother)
## x 
## Japanese    <NA>    English American French German Russian 
##      120      50         29       23     20     12      10 
##  Fantasy Italian  Brazilian 
##        8       4          3

plyr包提供了另一种替代的解决方案，它能把答案作为一个数据框返回，这对进一步操纵结果可能更为有用：

top10_v2 <- function(x) 
{ 
   counts <- count(x) 
   head(arrange(counts, desc(freq)), 10) 
} 
top10_v2(hafu$Mother)
##            x freq 
## 1   Japanese  120 
## 2       <NA>   50 
## 3    English   29 
## 4   American   23 
## 5     French   20 
## 6     German   12 
## 7    Russian   10 
## 8    Fantasy    8 
## 9    Italian    4 
## 10 Brazilian    3

练习14-1

1.cor计算相关性，并且默认为Pearson相关性。对其他类型使用method参数实验一下：

with(obama_vs_mccain, cor(Unemployment, Obama))
## [1] 0.2897

2.以base/lattice/ggplot2的顺序，我们有：

with(obama_vs_mccain, plot(Unemployment, Obama)) 
xyplot(Obama ~ Unemployment, obama_vs_mccain) 
ggplot(obama_vs_mccain, aes(Unemployment, Obama)) + 　geom_point()

练习14-2

1.直方图：

plot_numbers <- 1:2 
layout(matrix(plot_numbers, ncol = 2, byrow = TRUE)) 
for(drug_use in c(FALSE, TRUE)) 
{ 
   group_data <- subset(alpe_d_huez2, DrugUse == drug_use) 
   with(group_data, hist(NumericTime)) 
} 
histogram(~ NumericTime | DrugUse, alpe_d_huez2) 
ggplot(alpe_d_huez2, aes(NumericTime)) + 
  geom_histogram(binwidth = 2) + 
  facet_wrap(~ DrugUse)

2.盒形图：

boxplot(NumericTime ~ DrugUse, alpe_d_huez2) 
bwplot(DrugUse ~ NumericTime, alpe_d_huez2) 
ggplot(alpe_d_huez2, aes(DrugUse, NumericTime, group = DrugUse)) + 
  geom_boxplot()

练习14-3

为简单起见，我们只使用ggplot2给出此答案。由于这是一个数据探索，所以没有“正确”的答案：如果图中显示了一些你感到有趣的事情，那么这是值得画出的。当你有几个“干扰因素”（在本例中，我们有年龄/种族/性别）存在时，对于不同的绘图会有很多不同的可能性。一般情况下，处理多变量有两种策略：先作一个总体概述再添加变量；或者一开始就把所有包含在内，然后再删除那些看起来不那么有趣的变量。我们将使用第二种策略。

要看到每个变量的影响，最简单的方法的是把所有的东西都放到面板上：

ggplot(gonorrhoea, aes(Age.Group, Rate)) + 
  geom_bar(stat = "identity") + 
  facet_wrap(~ Year + Ethnicity + Gender)

每个面板上的条形高度是不同的，特别是种族，但是我们很难在50个面板中看出是怎么回事。我们可以把每年的数据都画在相同的面板来进行简化。我们使用group来声明哪些值属于同一栏，使用fill为每个条形指定不同的填充颜色，还可使用position = "dodge"把条形彼此分开，而不是堆叠在一起：

ggplot(gonorrhoea, aes(Age.Group, Rate, group = Year, fill = Year)) + 
   geom_bar(stat = "identity", position = "dodge") + 
   facet_wrap(~ Ethnicity + Gender)

面板数量减少是好事，但我还是觉得很难从那些条形中获取很多信息。由于大部分年龄组宽度的（五年）相同，我们可以稍用点手段：画一条以年龄为x轴的线图。虽然因为年龄组更宽一些，该图在每个面板的右侧看上去有点不对劲，但它已经可以提供某些信息了：

ggplot(gonorrhoea, aes(Age.Group, Rate, group = Year, color = Year)) + 
   geom_line() + 
   facet_wrap(~ Ethnicity + Gender)

该线紧靠在一起，所以看上去似乎并没有显示出太大的时间趋势（虽然时间段正好是五年）。由于有两种切面的方式，我们可以通过使用facet_grip而不是facet_wrap稍微改善一下绘图：

ggplot(gonorrhoea, aes(Age.Group, Rate, group = Year, color = Year)) + 
   geom_line() + 
   facet_grid(Ethnicity ~ Gender)

这清楚地表明种族对淋病感染率的影响：曲线要比在“非西班牙裔黑人”和“美洲印第安人和阿拉斯加土著”的群体高得多。由于这些群体占据了大部分，所以很难看到性别的影响。通过给每一行分配一个不同的y轴，它可以中和种族的影响，并更加清楚地看到男性和女性之间的区别：

ggplot(gonorrhoea, aes(Age.Group, Rate, group = Year, color = Year)) + 
   geom_line() + 
   facet_grid(Ethnicity ~ Gender, scales = "free_y")

在这里你可以看到女性的感染率高于男性（在相同年龄组和民族中），并有一个更加持续的高峰期：从15到24岁；而男性为20到24岁。

练习15-1

1.在这种情况下，无论是错别字的数目x和错别字的平均数目lambda都是3：

dpois(3, 3)
## [1] 0.224

2.位数q为12（个月），size为1，因为我们只需要怀孕一次，每个月的probability为0.25：

pnbinom(12 ，1， 0.25 )
## [1] 0.9762

3.为了更加直截了当，我们只把probability设定为0.9：

Qbirthday(0.9)
## [1] 41

练习15-2

我们要么可以取数据集的一个子集（使用通常的索引方法或subset函数），要么把子集的详细信息传递给lm的subset参数：

model1 <- lm( 
   Rate ~ Year + Age.Group + Ethnicity + Gender, 
   gonorrhoea, 
   subset = Age.Group %in% c("15 to 19" ,"20 to 24" ,"25 to 29" ,"30 to 34") 
) 
summary(model1)
## 
## Call: 
## lm(formula = Rate ~ Year + Age.Group + Ethnicity + Gender, data = gonorrhoea, 
## subset = Age.Group %in% c("15 to 19", "20 to 24", "25 to 29", 
## "30 to 34")) 
## 
## Residuals: 
##    Min     1Q Median    3Q   Max 
## -774.0 -127.7  -10.3 106.2 857.7 
## 
## Coefficients: 
##                                      Estimate Std. Error t value Pr(>|t|) 
## (Intercept)                           9491.13   25191.00    0.38   0.7068 
## Year                                    -4.55      12.54   -0.36   0.7173 
## Age.Group20 to 24                      131.12      50.16    2.61   0.0097 
## Age.Group25 to 29                     -124.60      50.16   -2.48   0.0138 
## Age.Group30 to 34                     -259.83      50.16   -5.18  5.6e-07 
## EthnicityAsians & Pacific Islanders   -212.76      56.08   -3.79   0.0002 
## EthnicityHispanics                    -124.06      56.08   -2.21   0.0281 
## EthnicityNon-Hispanic Blacks          1014.35      56.08   18.09  < 2e-16 
## EthnicityNon-Hispanic Whites          -174.72      56.08   -3.12   0.0021 
## GenderMale                             -83.85      35.47   -2.36   0.0191 
## 
## (Intercept) 
## Year 
## Age.Group20 to 24                    ** 
## Age.Group25 to 29                    * 
## Age.Group30 to 34                    *** 
## EthnicityAsians & Pacific Islanders  *** 
## EthnicityHispanics                   * 
## EthnicityNon-Hispanic Blacks         *** 
## EthnicityNon-Hispanic Whites         ** 
## GenderMale                           * 
## --- 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 251 on 190 degrees of freedom 
## Multiple R-squared: 0.798, Adjusted R-squared: 0.789 
## F-statistic: 83.7 on 9 and 190 DF, p-value: <2e-16

Year不显著，因此我们将其删除：

model2 <- update(model1, ~ . - Year) 
summary(model2)
## 
## Call: 
## lm(formula = Rate ~ Age.Group + Ethnicity + Gender, data = gonorrhoea, 
## subset = Age.Group %in% c("15 to 19", "20 to 24", "25 to 29", 
## "30 to 34")) 
## 
## Residuals: 
##    Min     1Q Median    3Q   Max 
## -774.0 -129.3   -6.7 104.3 866.8 
## 
## Coefficients: 
##                                               Estimate Std. Error t value Pr(>|t|) 
## (Intercept)                                       358.2       53.1   6.75   1.7e-10 
## Age.Group20 to 24                              131.1       50.0   2.62   0.00949 
## Age.Group25 to 29                             -124.6       50.0  -2.49   0.01363 
## Age.Group30 to 34                             -259.8       50.0  -5.19   5.3e-07 
## EthnicityAsians & Pacific Islanders             -212.8       55.9  -3.80   0.00019 
## EthnicityHispanics                             -124.1       55.9  -2.22   0.02777 
## EthnicityNon-Hispanic Blacks                 1014.3       55.9  18.13   < 2e-16 
## EthnicityNon-Hispanic Whites                 -174.7       55.9  -3.12   0.00207 
## GenderMale                                      -83.8       35.4  -2.37   0.01881 
## 
## (Intercept)                             *** 
## Age.Group20 to 24                     ** 
## Age.Group25 to 29                     * 
## Age.Group30 to 34                     *** 
## EthnicityAsians & Pacific Islanders  *** 
## EthnicityHispanics                     * 
## EthnicityNon-Hispanic Blacks         *** 
## EthnicityNon-Hispanic Whites         ** 
## GenderMale                             * 
## --- 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 250 on 191 degrees of freedom 
## Multiple R-squared: 0.798, Adjusted R-squared: 0.79 
## F-statistic: 94.5 on 8 and 191 DF, p-value: <2e-16

这一次，Gender就变得显著了，因此我们也可以停止了。（儿童和老人的感染率在性别上都很接近，因为性对于它们来说并非传播的主要原因，但是对于成年女性来说，她们的感染率要高于男性。）

大多数可能的交互项都不是非常有趣。通过增加一个种族和性别的交互，我们可以得到了一个更好的拟合，虽然不是所有的交互项都是显著的（输出非常详细，所以未显示）。如果你仍然对此分析感兴趣，尝试为黑色/非黑人创建一个单独的民族指示器，然后使用它作为与性别的交互项：

## 
## Call: 
## lm(formula = Rate ~ Age.Group + Ethnicity + Gender + Ethnicity:Gender, 
##    data = gonorrhoea, subset = Age.Group %in% c("15 to 19", 
##       "20 to 24", "25 to 29", "30 to 34")) 
## 
## Residuals: 
##    Min     1Q  Median    3Q   Max 
## -806.0 -119.4    -9.4 105.3 834.8 
## 
## Coefficients: 
##                                                  Estimate Std. Error t value 
## (Intercept)                                         405.1       63.9    6.34 
## Age.Group20 to 24                                   131.1       50.1    2.62 
## Age.Group25 to 29                                  -124.6       50.1   -2.49 
## Age.Group30 to 34                                  -259.8       50.1   -5.19 
## EthnicityAsians & Pacific Islanders                -296.9       79.2   -3.75 
## EthnicityHispanics                                 -197.2       79.2   -2.49 
## EthnicityNon-Hispanic Blacks                        999.5       79.2   12.62 
## EthnicityNon-Hispanic Whites                       -237.0       79.2   -2.99 
## GenderMale                                         -177.6       79.2   -2.24 
## EthnicityAsians & Pacific Islanders:GenderMale      168.2      112.0    1.50 
## EthnicityHispanics:GenderMale                       146.2      112.0    1.30 
## EthnicityNon-Hispanic Blacks:GenderMale              29.7      112.0    0.27 
## EthnicityNon-Hispanic Whites:GenderMale             124.6      112.0    1.11 
##                                                  Pr(>|t|) 
## (Intercept)                                      1.7e-09 *** 
## Age.Group20 to 24                                0.00960 ** 
## Age.Group25 to 29                                0.01377 * 
## Age.Group30 to 34                                5.6e-07 *** 
## EthnicityAsians & Pacific Islanders              0.00024 *** 
## EthnicityHispanics                               0.01370 * 
## EthnicityNon-Hispanic Blacks                     < 2e-16 *** 
## EthnicityNon-Hispanic Whites                     0.00315 ** 
## GenderMale                                       0.02615 * 
## EthnicityAsians & Pacific Islanders:GenderMale   0.13487 
## EthnicityHispanics:GenderMale                    0.19361 
## EthnicityNon-Hispanic Blacks:GenderMale          0.79092 
## EthnicityNon-Hispanic Whites:GenderMale          0.26754 
## --- 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 251 on 187 degrees of freedom 
## Multiple R-squared: 0.802, Adjusted R-squared: 0.789 
## F-statistic: 63.2 on 12 and 187 DF, p-value: <2e-16

练习15-3

首先解决安装问题。像通常一样获取软件包并安装：

install.packages("betareg")

我们需要重新调整响应变量为0和1之间（而不是0和100），因为这是β分布的范围：

ovm <- within(obama_vs_mccain, Obama <- Obama / 100)

现在我们准备运行模型。它与运行一元线性回归相同，但我们称之为betareg而不是lm。我随机地关注种族群体Black和宗教Protestant：

library(betareg) 
beta_model1 <- betareg( 
   Obama ~ Turnout + Population + Unemployment + Urbanization + Black + Protestant, 
   ovm, 
   subset = State != "District of Columbia" 
) 
summary(beta_model1)
## Call:
## betareg(formula = Obama ~ Turnout + Population + Unemployment + Urbanization + 
##     Black + Protestant, data = ovm, subset = State != "District of Columbia")
## 
## Standardized weighted residuals 2:
##     Min      1Q  Median      3Q     Max 
## -2.8338 -0.4574 -0.0623  0.7706  2.0443 
## 
## Coefficients (mean model with logit link):
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.518e-01  4.991e-01  -1.106  0.26894    
## Turnout       1.693e-02  5.910e-03   2.864  0.00418 ** 
## Population    1.656e-09  5.341e-09   0.310  0.75659    
## Unemployment  5.736e-02  2.315e-02   2.478  0.01322 *  
## Urbanization -1.818e-05  1.538e-04  -0.118  0.90594    
## Black         8.177e-03  4.275e-03   1.913  0.05575 .  
## Protestant   -1.781e-02  3.169e-03  -5.620 1.91e-08 ***
## 
## Phi coefficients (precision model with identity link):
##       Estimate Std. Error z value Pr(>|z|)    
## (phi)   109.46      23.23   4.711 2.46e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Type of estimator: ML (maximum likelihood)
## Log-likelihood: 72.11 on 8 Df
## Pseudo R-squared: 0.7234
## Number of iterations: 36 (BFGS) + 3 (Fisher scoring)

Urbanization是不显著的，所以我们将其删除，使用与lm相同的技术：

beta_model2 <- update(beta_model1, ~ . - Urbanization) 
summary(beta_model2)
## Call:
## betareg(formula = Obama ~ Turnout + Population + Unemployment + Black + Protestant, data = ovm, 
##     subset = State != "District of Columbia")
## 
## Standardized weighted residuals 2:
##     Min      1Q  Median      3Q     Max 
## -2.8307 -0.4568 -0.0528  0.7736  2.0072 
## 
## Coefficients (mean model with logit link):
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.688e-01  4.766e-01  -1.193  0.23268    
## Turnout       1.702e-02  5.861e-03   2.904  0.00369 ** 
## Population    1.725e-09  5.309e-09   0.325  0.74521    
## Unemployment  5.697e-02  2.293e-02   2.484  0.01299 *  
## Black         7.931e-03  3.727e-03   2.128  0.03335 *  
## Protestant   -1.758e-02  2.479e-03  -7.091 1.33e-12 ***
## 
## Phi coefficients (precision model with identity link):
##       Estimate Std. Error z value Pr(>|z|)    
## (phi)   109.43      23.23   4.711 2.46e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Type of estimator: ML (maximum likelihood)
## Log-likelihood:  72.1 on 7 Df
## Pseudo R-squared: 0.7232
## Number of iterations: 31 (BFGS) + 3 (Fisher scoring)

同样，也要把Population去掉：

beta_model3 <- update(beta_model2, ~ . - Population) 
summary(beta_model3)
## Call:
## betareg(formula = Obama ~ Turnout + Unemployment + Black + Protestant, data = ovm, subset = State != 
##     "District of Columbia")
## 
## Standardized weighted residuals 2:
##     Min      1Q  Median      3Q     Max 
## -2.8277 -0.4580  0.0432  0.7420  1.9352 
## 
## Coefficients (mean model with logit link):
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.555770   0.475674  -1.168  0.24265    
## Turnout       0.016859   0.005850   2.882  0.00395 ** 
## Unemployment  0.059644   0.021445   2.781  0.00542 ** 
## Black         0.008202   0.003637   2.255  0.02412 *  
## Protestant   -0.017789   0.002399  -7.415 1.21e-13 ***
## 
## Phi coefficients (precision model with identity link):
##       Estimate Std. Error z value Pr(>|z|)    
## (phi)   109.16      23.17   4.711 2.46e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Type of estimator: ML (maximum likelihood)
## Log-likelihood: 72.05 on 6 Df
## Pseudo R-squared: 0.7226
## Number of iterations: 27 (BFGS) + 2 (Fisher scoring)

绘图可以使用与lm完全相同的代码来完成：

plot_numbers <- 1:6 
layout(matrix(plot_numbers, ncol = 2, byrow = TRUE)) 
plot(beta_model3, plot_numbers)

练习16-1

对于检查和处理用户输入，并没有须特别遵循的规则。在一般情况下，如果输入的形式是错误的，最好是设法将其转换成正确的，并在此过程中警告用户。因此，对于非数字输入，我们可以尝试将它们转换为数值，并抛出一个警告。

对于非正值，我们可以尝试替换一个范围内的值，或假装该值是缺失的，但这样会使答案出错。在这种情况下，最好是抛出一个错误：

harmonic_mean <- function(x, na.rm = FALSE) 
{ 
   if(!is.numeric(x)) 
   { 
      warning("Coercing 'x' to be numeric.") 
      x <- as.numeric(x) 
   } 
   if(any(!is.na(x) & x <= 0))
   { 
      stop("'x' contains non-positive values") 
   } 
   1 / mean(1 / x, na.rm = na.rm) 
}

练习16-2

这里是RUnit的版本。为了测试缺失值，我们需要考虑的情况有： na.rm是TRUE和na.rm为FALSE。为了测试须抛出警告的情况，我们略施小计，将warn选项改为2，意即将警告看作错误。为了测试非正值，使用边界数值零。

test.harmonic_mean.1_2_4.returns_12_over_7 <- function() 
{ 
   expected <- 12 / 7 
   actual <- harmonic_mean(c(1, 2, 4)) 
   checkEqualsNumeric(expected, actual) 
} 
test.harmonic_mean.no_inputs.throws_error <- function() 
{ 
   checkException(harmonic_mean()) 
} 
test.harmonic_mean.some_missing.returns_na <- function() 
{ 
   expected <- NA_real_ 
   actual <- harmonic_mean(c(1, 2, 4, NA)) 
   checkEqualsNumeric(expected, actual) 
} 
test.harmonic_mean.some_missing_with_nas_removed.returns_12_over_7 <- function() 
{ 
   expected <- 12 / 7 
   actual <- harmonic_mean(c(1, 2, 4, NA), na.rm = TRUE) 
   checkEqualsNumeric(expected, actual) 
} 
test.harmonic_mean.non_numeric_input.throws_warning <- function() 
{ 
   old_ops <- options(warn = 2) 
   on.exit(options(old_ops)) 
   checkException(harmonic_mean("1")) 
} 
test.harmonic_mean.zero_inputs.throws_error <- function() 
{ 
   checkException(harmonic_mean(0)) 
}

翻译成testthat的版本很简单。使用expect_warning函数，可使警告更容易测试：

expect_equal(12 /7, harmonic_mean(c(1, 2, 4))) 
expect_error(harmonic_mean()) 
expect_equal(NA_real_, harmonic_mean(c(1, 2, 4, NA))) 
expect_equal(12 /7, harmonic_mean(c(1, 2, 4, NA), na.rm = TRUE)) 
expect_warning(harmonic_mean("1")) 
expect_error(harmonic_mean(0))

练习16-3

下面是更新的harmonic_mean函数：

harmonic_mean <- function(x, na.rm = FALSE) 
{
   if(!is.numeric(x)) 
   { 
      warning("Coercing 'x' to be numeric.")
      x <- as.numeric(x) 
   } 
   if(any(!is.na(x) & x <= 0))
   { 
      stop("'x' contains non-positive values") 
   } 
   result <- 1 / mean(1 / x, na.rm = na.rm) 
   class(result) <- "harmonic" 
   result 
}

为了生成一个S3的方法，名称的格式必须符合function.class的形式，在本例中即为print.harmonic。其他内容可以与其他print函数生成，但在这里，我们使用较低级别的cat函数：

print.harmonic <- function(x, ...) 
{ 
   cat("The harmonic mean is", x, "\n", ...) 
}

练习17-1

创建函数和数据框都很简单：

sum_of_squares <- function(n) 
{ 
  n * (n + 1) * (2 * n + 1) / 6 
} 
x <- 1:10 
squares_data <- data.frame(x = x, y = sum_of_squares(x))

在你调用package.skeleton时，需要考虑在哪个目录下创建软件包：

package.skeleton("squares", c("sum_of_squares", "squares_data"))

练习17-2

此函数的文档应该置于R/sum_of_squares.R，或类似的：

#' Sum of Squares 
#' 
#' Calculates the sum of squares of the first \code{n} natural numbers. 
#' 
#' @param n A positive integer. 
#' @return An integer equal to the sum of the squares of the first \code{n} 
#' natural numbers. 
#' @author Richie Cotton. 
#' @seealso \code{\link[base]{sum}} 
#' @examples 
#' sum_of_squares(1:20) 
#' @keywords misc 
#' @export

该软件包的文档位于R/squares-package.R中：

#' squares: Sums of squares. 
#' 
#' A test package that contains data on the sum of squares of natural 
#' numbers, and a function to calculate more values. 
#' 
#' @author Richie Cotton 
#' @docType package 
#' @name squares 
#' @aliases squares squares-package 
#' @keywords package 
NULL

数据文档可以被放入同一个文件，或是在R/squares-data.R中：

#' Sum of squares dataset 
#' 
#' The sum of squares of natural numbers. 
#' \itemize{ 
#' \item{x}{Natural numbers.}
#' \item{y}{The sum of squares from 1 to \code{x}.} 
#' } 
#' 
#' @docType data
#' @keywords datasets 
#' @name squares_data 
#' @usage data(squares_data) 
#' @format A data frame with 10 rows and 2 variables. 
NULL

练习17-3

代码很容易，困难的部分在于修复任何问题：

check("squares") 
build("squares")

附录Ｄ 练习答案

附录Ｄ 练习答案

附录Ｄ　练习答案

附录Ｄ　练习答案