Solving the R for Data Science Exercises: Chapter 21 Iteration

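Here I record the exercises from the great Hadley's R for Data Science together with my answers.
This time it is Chapter 21, Iteration.

Previous posts

The chapter is called Iteration, but about two-thirds of it is really about functional programming techniques.
The dedicated package is purrr, which supports functional programming, and it is loaded along with the rest of the tidyverse.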

library(tidyverse)

21 Iteration

21.2 For loops

21.2.1 Exercises

1. Write for loops to:

  1. Compute the mean of every column in mtcars.
  2. Determine the type of each column in nycflights13::flights.
  3. Compute the number of unique values in each column of iris.
  4. Generate 10 random normals for each of $\mu = -10$, $0$, $10$, and $100$.

Think about the output, sequence, and body before you start writing the loop.

#1
output <- vector("double", length(mtcars))
for (i in seq_along(mtcars)){
  output[[i]] <- mean(mtcars[[i]])
}

#2
library(nycflights13)
output <- vector("character", length(flights))
for (i in seq_along(flights)){
  output[[i]] <- typeof(flights[[i]])
}

#3
output <- vector("integer", length(iris))
for (i in seq_along(iris)){
  output[[i]] <- n_distinct(iris[[i]])
}

#4
mu <- c(-10, 0, 10, 100)
output <- vector("list", length(mu))
for (i in seq_along(mu)){
  output[[i]] <- rnorm(10, mean = mu[[i]])
}

2. Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:

out <- ""
for (x in letters) {
  out <- stringr::str_c(out, x)
}

x <- sample(100)
sd <- 0
for (i in seq_along(x)) {
  sd <- sd + (x[i] - mean(x)) ^ 2
}
sd <- sqrt(sd / (length(x) - 1))

x <- runif(100)
out <- vector("numeric", length(x))
out[1] <- x[1]
for (i in 2:length(x)) {
  out[i] <- out[i - 1] + x[i]
}

# First example
out <- stringr::str_c(letters, collapse = "")

# Second example
sd(x)

# Third example
cumsum(x)
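As a quick sanity check (my own addition, not part of the book's answer), the loop versions and the vectorised replacements can be compared directly. This is a minimal sketch in base R only: paste0() and paste() stand in for stringr::str_c(), and the names out_loop, sd_loop, and cum_loop are made up for the example.

```r
set.seed(1)  # arbitrary seed so the check is reproducible

# 1: concatenating the letters
out_loop <- ""
for (ch in letters) {
  out_loop <- paste0(out_loop, ch)
}
same_string <- identical(out_loop, paste(letters, collapse = ""))

# 2: sample standard deviation
x <- sample(100)
sd_loop <- 0
for (i in seq_along(x)) {
  sd_loop <- sd_loop + (x[i] - mean(x))^2
}
sd_loop <- sqrt(sd_loop / (length(x) - 1))
same_sd <- isTRUE(all.equal(sd_loop, sd(x)))

# 3: cumulative sum
y <- runif(100)
cum_loop <- vector("numeric", length(y))
cum_loop[1] <- y[1]
for (i in 2:length(y)) {
  cum_loop[i] <- cum_loop[i - 1] + y[i]
}
same_cumsum <- isTRUE(all.equal(cum_loop, cumsum(y)))

# All three comparisons should be TRUE
c(same_string, same_sd, same_cumsum)
```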

3. Combine your function writing and for loop skills:

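  1. Write a for loop that prints() the lyrics to the children’s song “Alice the camel”
  2. Convert the nursery rhyme “ten in the bed” to a function. Generalise it to any number of people in any sleeping structure.
  3. Convert the song “99 bottles of beer on the wall” to a function. Generalise to any number of any vessel containing any liquid on any surface.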

Lyrics

#1
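phrase1 <- function(times){
  # Say "no humps" for zero and "1 hump" for one, as in the song
  n <- if (times == 0) "no" else times
  hump <- if (times == 1) "hump" else "humps"
  print(paste0("Alice the camel has ", n, " ", hump, "."))
}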
phrase2 <- function(zero){
  if(!zero){
    print("So go, Alice, go.")
  } else {
    print("Now Alice is a horse.")
  }
}
for (i in 5:0){
  for (j in 1:3){
    phrase1(i)
  }
  phrase2(i == 0)
}

#2
rhyme <- function(cnt){
  print(paste("There were", cnt, "in the bed."))
  if(cnt > 1){
    print("and the little one said,")
    print("Roll over, roll over.")
    print("So they all rolled over and one fell out.")
  } else {
    print("and the little one said,")
    print("I’m lonely...")
  }
}
rhyme_gen <- function(n){
  if (n > 0){
    for (i in n:1){
      rhyme(i)
    }
  } else {
    print("n must be a positive integer.")
  }
}

#3
rhyme <- function(cnt){
  print(paste(cnt, "bottles of beer on the wall,", cnt, "bottles of beer."))
  if (cnt > 1){
    print(paste("Take one down and pass it around,", cnt - 1, "bottles of beer on the wall."))
  } else {
    print("Take one down and pass it around, no more bottles of beer on the wall.")
  }
}
rhyme_gen <- function(n){
  if (n > 0){
    for (i in n:1){
      rhyme(i)
    }
    print("No more bottles of beer on the wall, no more bottles of beer.")
    print(paste("Go to the store and buy some more,", n, "bottles of beer on the wall."))
  } else {
    print("n must be a positive integer.")
  }
}

4. It’s common to see for loops that don’t preallocate the output and instead increase the length of a vector at each step:

output <- vector("integer", 0)
for (i in seq_along(x)){
  output <- c(output, lengths(x[[i]]))
}
output

How does this affect performance? Design and execute an experiment.

x <- rnorm(10000)
output <- vector("integer", 0)
system.time(
for (i in seq_along(x)){
  output <- c(output, lengths(x[[i]]))
}
)
#   user  system elapsed
#  0.188   0.008   0.195

output <- vector("integer", length(x))
system.time(
for (i in seq_along(x)){
  output[[i]] <- lengths(x[[i]])
}
)
#   user  system elapsed
#  0.024   0.000   0.023

The latter is about 8.5 times faster (0.195 s versus 0.023 s elapsed).
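A single timing at one size does not say much about how the gap grows, so here is a rough sketch of the same experiment repeated for several lengths. The helper names grow() and prealloc() and the sizes are my own choices, and the timings will differ from machine to machine.

```r
# Grow the vector on every iteration (the whole vector is copied each time)
grow <- function(x) {
  output <- vector("integer", 0)
  for (i in seq_along(x)) {
    output <- c(output, lengths(x[[i]]))
  }
  output
}

# Allocate the full length once, then fill in place
prealloc <- function(x) {
  output <- vector("integer", length(x))
  for (i in seq_along(x)) {
    output[[i]] <- lengths(x[[i]])
  }
  output
}

for (n in c(1000, 5000, 20000)) {
  x <- rnorm(n)
  t_grow <- system.time(res_grow <- grow(x))[["elapsed"]]
  t_pre <- system.time(res_pre <- prealloc(x))[["elapsed"]]
  # Both approaches must produce the same result
  stopifnot(identical(res_grow, res_pre))
  cat(sprintf("n = %5d  grow: %.3fs  preallocate: %.3fs\n", n, t_grow, t_pre))
}
```

Wrapping each approach in a function keeps the two loops comparable and makes it easy to add more sizes.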

21.3 For loop variations

21.3.5 Exercises

1. Imagine you have a directory full of CSV files that you want to read in. You have their paths in a vector, files <- dir("data/", pattern = "\\.csv$", full.names = TRUE), and now want to read each one with read_csv(). Write the for loop that will load them into a single data frame.

Each iteration of the loop yields a data frame, so I collect them in a list first and bind them together afterwards.

output <- vector("list", length(files))
for (i in seq_along(files)){
  output[[i]] <- read_csv(files[i])
}
output <- bind_rows(output)
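To try the loop without a real data/ directory, the sketch below first writes three tiny CSV files into a temporary folder. The folder name csv_demo, the file names, and the columns id and value are all invented for this example.

```r
library(readr)
library(dplyr)

# Hypothetical setup: three small CSV files in a temporary directory
tmp <- file.path(tempdir(), "csv_demo")
dir.create(tmp, showWarnings = FALSE)
for (k in 1:3) {
  write_csv(
    data.frame(id = k, value = k * 10),
    file.path(tmp, paste0("file", k, ".csv"))
  )
}

files <- dir(tmp, pattern = "\\.csv$", full.names = TRUE)

# Same pattern as above: preallocate a list, fill it, bind once at the end
output <- vector("list", length(files))
for (i in seq_along(files)) {
  # suppressMessages() only hides the column specification that read_csv() prints
  output[[i]] <- suppressMessages(read_csv(files[[i]]))
}
combined <- bind_rows(output)
combined
```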

2. What happens if you use for (nm in names(x)) and x has no names? What if only some of the elements are named? What if the names are not unique?

x <- c(a = "apple", b = "banana", "chocolate", "diamonds", a = "aho")
for (nm in names(x)) print(nm)
#[1] "a"
#[1] "b"
#[1] ""
#[1] ""
#[1] "a"

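If x has no names at all, names(x) is NULL and the loop body never runs. If only some elements are named, the loop still runs once per element, with "" as the name for each unnamed one.
Duplicated names make no difference to the count either: the loop runs once per element.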
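Printing the names hides what goes wrong once the name is used to look up the element. A small sketch of the three cases (x_unnamed, n_iter, first_a, and empty_lookup are names I made up here):

```r
# No names at all: names() returns NULL, so the loop body never runs
x_unnamed <- c("apple", "banana")
n_iter <- 0
for (nm in names(x_unnamed)) {
  n_iter <- n_iter + 1
}
n_iter

# Duplicated names: [[ returns the first match, so the second "a" is never reached
x <- c(a = "apple", b = "banana", "chocolate", "diamonds", a = "aho")
first_a <- x[["a"]]
first_a

# Empty names: "" never matches anything, so x[[""]] raises an error
empty_lookup <- tryCatch(x[[""]], error = function(e) "error")
empty_lookup
```

So looping over names is only safe when every element has a unique, non-empty name.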

3. Write a function that prints the mean of each numeric column in a data frame, along with its name. For example, show_mean(iris) would print:

show_mean(iris)
#> Sepal.Length: 5.84
#> Sepal.Width:  3.06
#> Petal.Length: 3.76
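#> Petal.Width:  1.20

show_mean <- function(df){
  # Pad every label to the longest "name:" so the means line up
  width <- max(nchar(names(df))) + 1
  for (nm in names(df)){
    if (is.numeric(df[[nm]])){
      label <- formatC(paste0(nm, ":"), width = width, flag = "-")
      # nsmall = 2 keeps trailing zeros such as 1.20
      cat(label, " ", format(mean(df[[nm]]), digits = 3, nsmall = 2), "\n", sep = "")
    }
  }
}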

4. What does this code do? How does it work?

trans <- list( 
  disp = function(x) x * 0.0163871,
  am = function(x) {
    factor(x, labels = c("auto", "manual"))
  }
)
for (var in names(trans)) {
  mtcars[[var]] <- trans[[var]](mtcars[[var]])
}

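The loop walks over the names in trans and, for each one, applies the function stored under that name to the mtcars column of the same name. So disp is multiplied by 0.0163871 (presumably a unit conversion, cubic inches to litres), and am, which was 0/1, becomes a factor with the labels "auto" and "manual".
The other columns of mtcars are left untouched.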
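To see the effect without overwriting mtcars, the same loop can be run on a copy. A minimal sketch; the name mt for the copy is my own.

```r
trans <- list(
  disp = function(x) x * 0.0163871,
  am = function(x) factor(x, labels = c("auto", "manual"))
)

mt <- mtcars  # work on a copy so the original data set stays intact
for (var in names(trans)) {
  # trans[[var]] is the function, mt[[var]] is the column with the same name
  mt[[var]] <- trans[[var]](mt[[var]])
}

head(mt[c("disp", "am")], 3)
```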

21.4 For loops vs. Functionals

21.4.1 Exercises

1. Read the documentation for apply(). In the 2d case, what two for loops does it generalise?

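In the 2d case apply() generalises two loops: a loop over the rows (MARGIN = 1) and a loop over the columns (MARGIN = 2).
With apply(X, c(1, 2), FUN) the function is applied to every cell, which corresponds roughly to the nested loop below.

for (i in seq_len(nrow(X))){
  for (j in seq_len(ncol(X))){
    X[i, j] <- FUN(X[i, j])
  }
}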
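The row-wise and column-wise cases can be written out as two separate loops and checked against apply(). A small sketch with an arbitrary 2 x 3 matrix and sum() as the function; row_out and col_out are names chosen for this example.

```r
X <- matrix(1:6, nrow = 2)  # small example matrix

# MARGIN = 1: loop over rows
row_out <- vector("double", nrow(X))
for (i in seq_len(nrow(X))) {
  row_out[[i]] <- sum(X[i, ])
}

# MARGIN = 2: loop over columns
col_out <- vector("double", ncol(X))
for (j in seq_len(ncol(X))) {
  col_out[[j]] <- sum(X[, j])
}

# Both comparisons should be TRUE
all.equal(row_out, apply(X, 1, sum))
all.equal(col_out, apply(X, 2, sum))
```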

2. Adapt col_summary() so that it only applies to numeric columns. You might want to start with an is_numeric() function that returns a logical vector that has a TRUE corresponding to each numeric column.

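col_summary <- function(df, fun) {
  # Flag which columns are numeric
  is_num <- vector("logical", length(df))
  for (i in seq_along(df)) {
    is_num[[i]] <- is.numeric(df[[i]])
  }
  # Summarise only the numeric columns, filling the output position by position
  idx <- which(is_num)
  out <- vector("double", length(idx))
  for (i in seq_along(idx)) {
    out[[i]] <- fun(df[[idx[[i]]]])
  }
  names(out) <- names(df)[idx]
  out
}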

21.5 The map functions

21.5.3 Exercises

1. Write code that uses one of the map functions to:

  1. Compute the mean of every column in mtcars.
  2. Determine the type of each column in nycflights13::flights.
  3. Compute the number of unique values in each column of iris.
  4. Generate 10 random normals for each of $\mu = -10$, $0$, $10$, and $100$.

#1
map_dbl(mtcars, mean)

#2
map_chr(nycflights13::flights, typeof)

#3
map_int(iris, n_distinct)

#4
map(c(-10, 0, 10, 100), rnorm, n = 10)

2. How can you create a single vector that for each column in a data.frame indicates whether or not it's a factor?

map_lgl(df, is.factor)

3. What happens when you use the map functions on vectors that aren't lists? What does map(1:5, runif) do? Why?

map(1:5, runif)
#[[1]]
#[1] 0.04953596

#[[2]]
#[1] 0.5390083 0.6190010

#[[3]]
#[1] 0.8798693 0.4998527 0.5550514

#[[4]]
#[1] 0.1745245 0.7734466 0.7938441 0.9021727

#[[5]]
#[1] 0.8783163 0.4091544 0.5472550 0.2817710 0.5079556

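Even when the input is not a list, the result of map() is a list (by definition).
In this example each value from 1 to 5 is passed to the first argument of runif (the number of samples), and the results come back as a list.
In other words, we get a list of length 5 whose elements are random vectors of length 1 to 5.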

4. What does map(-2:2, rnorm, n = 5) do? Why? What does map_dbl(-2:2, rnorm, n = 5) do? Why?

map(-2:2, rnorm, n = 5)
#[[1]]
#[1] -2.840793644 -0.005469319 -1.065458143 -0.974701590 -1.435614627

#[[2]]
#[1] -1.6694444  0.6454814 -1.5945935 -0.8147798 -1.7862531

#[[3]]
#[1]  1.0497548  0.0812402 -1.8419471  0.1736372 -0.1636357

#[[4]]
#[1]  1.05609725 -0.35278174 -0.72301882  1.83283869 -0.08094902

#[[5]]
#[1] 2.679921 2.271799 1.321095 2.277720 3.303472

The first argument of rnorm is n and the second is the mean. Since n = 5 is given by name, each element of -2:2, the first argument of map, is passed to rnorm as the mean.

map_dbl(-2:2, rnorm, n = 5)
#Error: Result 1 is not a length 1 atomic vector

map_dbl requires every return value to be a numeric vector of length 1, and here each call returns five numbers.
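Two ways to make the call fit map_dbl(), sketched under the assumption that either a single draw per mean or a single summary per mean is what is wanted. The names one_each, means, and failed are my own.

```r
library(purrr)

# One draw per mean: each call returns a length-1 double, so map_dbl() accepts it
one_each <- map_dbl(-2:2, rnorm, n = 1)

# Five draws per mean, then reduce each result to a single number
means <- map_dbl(-2:2, function(m) mean(rnorm(5, mean = m)))

# The original call still fails because each result has length 5
failed <- tryCatch(
  map_dbl(-2:2, rnorm, n = 5),
  error = function(e) TRUE
)
failed
```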

5. Rewrite map(x, function(df) lm(mpg ~ wt, data = df)) to eliminate the anonymous function.

map(x, ~lm(mpg ~ wt, data = .))
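The exercise never says what x is, so the sketch below assumes a list of data frames made by splitting mtcars by cyl, and checks that the formula shortcut gives the same fits as the anonymous function. The names fits_anon and fits_formula are made up for this example.

```r
library(purrr)

# Hypothetical input: one data frame per cylinder count
x <- split(mtcars, mtcars$cyl)

# Original version with an explicit anonymous function
fits_anon <- map(x, function(df) lm(mpg ~ wt, data = df))

# Formula shortcut: `.` stands for the current element of x
fits_formula <- map(x, ~ lm(mpg ~ wt, data = .))

# Slope for wt in each group, from the shortcut version
map_dbl(fits_formula, ~ coef(.)[["wt"]])

# The two versions should agree on every coefficient
all.equal(map(fits_anon, coef), map(fits_formula, coef))
```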

21.9 Other patterns of for loops

21.9.3 Exercises

1. Implement your own version of every() using a for loop. Compare it with purrr::every(). What does purrr's version do that your version doesn't?

my_every <- function(df, f, ...){
  check <- vector("logical", length(df))
  for (i in seq_along(df)) check[i] <- f(df[[i]], ...)
  all(check)
}
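My version always evaluates the predicate for every element before calling all(). As far as I can tell, purrr::every() returns as soon as one element fails, and it also accepts the formula shortcut (for example ~ . > 0) as the predicate. The early exit can be sketched with a for loop like this; my_every_early(), is_small(), and calls are hypothetical names, and treating NA as a failure is a simplification of mine rather than purrr's behaviour.

```r
# Like my_every(), but returns as soon as one element fails the predicate
my_every_early <- function(x, f, ...) {
  for (i in seq_along(x)) {
    if (!isTRUE(f(x[[i]], ...))) {
      return(FALSE)  # stop early; later elements are never evaluated
    }
  }
  TRUE
}

# Count how many times the predicate is actually called
calls <- 0
is_small <- function(v) {
  calls <<- calls + 1
  v < 2
}

result <- my_every_early(1:100, is_small)
result
calls
```

The counter shows that the predicate stops being called once the first failing element is found, instead of running over all 100 elements.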

2. Create an enhanced col_summary() that applies a summary function to every numeric column in a data frame.

col_summary <- function(df, fun){
  df %>%
    keep(is.numeric) %>%
    map_dbl(fun)
}

3. A possible base R equivalent of col_summary() is:

col_sum3 <- function(df, f) {
  is_num <- sapply(df, is.numeric)
  df_num <- df[, is_num]

  sapply(df_num, f)
}

But it has a number of bugs as illustrated with the following inputs:

df <- tibble(
  x = 1:3,
  y = 3:1,
  z = c("a", "b", "c")
)
# OK
col_sum3(df, mean)
# Has problems: don't always return numeric vector
col_sum3(df[1:2], mean)
col_sum3(df[1], mean)
col_sum3(df[0], mean)

What causes the bugs?

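With this tibble, every call except the last returns an ordinary numeric vector for me, because df[, is_num] on a tibble never drops to a bare vector.
The last call fails: sapply() on a zero-column data frame returns an empty list rather than a logical vector, so it cannot be used to index df.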
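One possible base R fix, sketched here with a plain data.frame instead of a tibble (where df[, is_num] with a single numeric column would drop to a vector). The name col_sum4 is my own, and the sketch assumes the summary function returns exactly one number per column.

```r
col_sum4 <- function(df, f) {
  # vapply() always returns a logical vector, even for zero columns
  is_num <- vapply(df, is.numeric, logical(1))
  # Single-bracket list indexing keeps a data frame even for one column
  df_num <- df[is_num]
  # Each summary must be a single number; zero columns gives a length-0 numeric
  vapply(df_num, f, numeric(1))
}

df <- data.frame(x = 1:3, y = 3:1, z = c("a", "b", "c"))
col_sum4(df, mean)
col_sum4(df[1:2], mean)
col_sum4(df[1], mean)
col_sum4(df[0], mean)
```

Unlike sapply(), vapply() fixes the type and length of each result up front, so the return type no longer depends on the input.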
