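These are my notes on the exercises in the great Hadley's R for Data Science, together with my answers.
This time it is Chapter 21, Iteration.
Previous posts
- Chapter 3 Data visualisation
- Chapter 5 Data transformation
- Chapter 7 Exploratory data analysis
- Chapter 10 tibble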
- Chapter 11 Data import
- Chapter 12 Tidy data
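- Chapter 13 Relational data
- Chapter 14 Strings
- Chapter 15 Factors
- Chapter 16 Dates and times
- Chapter 19 Functions
- Chapter 20 Vectors
Although the chapter is called Iteration, roughly two thirds of it is about functional programming techniques.
The dedicated package is purrr, which exists to support functional programming, but it is loaded together with everything else by tidyverse.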
library(tidyverse)
21 Iteration
21.2 For loops
21.2.1 Exercises
1. Write for loops to:
- Compute the mean of every column in mtcars.
- Determine the type of each column in nycflights13::flights.
- Compute the number of unique values in each column of iris.
- Generate 10 random normals for each of [latex]\mu = -10, 0, 10[/latex] and 100.
Think about the output, sequence, and body before you start writing the loop.
#1
output <- vector("double", length(mtcars))
for (i in seq_along(mtcars)) {
  output[[i]] <- mean(mtcars[[i]])
}
#2
library(nycflights13)
output <- vector("character", length(flights))
for (i in seq_along(flights)) {
  output[[i]] <- typeof(flights[[i]])
}
#3
output <- vector("integer", length(iris))
for (i in seq_along(iris)) {
  output[[i]] <- n_distinct(iris[[i]])
}
#4
output <- vector("list", 4)
mu <- c(-10, 0, 10, 100)
for (i in seq_along(mu)) {
  output[[i]] <- rnorm(10, mean = mu[i])
}
2. Eliminate the for loop in each of the following examples by taking advantage of an existing function that works with vectors:
out <- ""
for (x in letters) {
  out <- stringr::str_c(out, x)
}

x <- sample(100)
sd <- 0
for (i in seq_along(x)) {
  sd <- sd + (x[i] - mean(x)) ^ 2
}
sd <- sqrt(sd / (length(x) - 1))

x <- runif(100)
out <- vector("numeric", length(x))
out[1] <- x[1]
for (i in 2:length(x)) {
  out[i] <- out[i - 1] + x[i]
}
# First example
out <- stringr::str_c(letters, collapse = "")
# Second example
sd(x)
# Third example
cumsum(x)
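As a quick sanity check, the three vectorised replacements can be compared against the original loops. This is only a sketch: it uses base paste0() and paste() in place of stringr::str_c() so that it runs without extra packages, and the seed and the variable names (out_loop, sd_loop, cum_loop and so on) are my own choices for illustration.

```r
# Check that each vectorised call reproduces the result of its loop.
set.seed(1)  # arbitrary seed, only for reproducibility

# 1. String concatenation
out_loop <- ""
for (ch in letters) {
  out_loop <- paste0(out_loop, ch)
}
out_vec <- paste(letters, collapse = "")

# 2. Sample standard deviation
x <- sample(100)
ss <- 0
for (i in seq_along(x)) {
  ss <- ss + (x[i] - mean(x))^2
}
sd_loop <- sqrt(ss / (length(x) - 1))

# 3. Cumulative sum
y <- runif(100)
cum_loop <- vector("numeric", length(y))
cum_loop[1] <- y[1]
for (i in 2:length(y)) {
  cum_loop[i] <- cum_loop[i - 1] + y[i]
}

identical(out_loop, out_vec)
all.equal(sd_loop, sd(x))
all.equal(cum_loop, cumsum(y))
```

all.equal() is used for the numeric comparisons because the loop and the built-in function may differ by floating-point rounding.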
3. Combine your function writing and for loop skills:
- Write a for loop that prints() the lyrics to the children’s song “Alice the camel”.
- Convert the nursery rhyme “ten in the bed” to a function. Generalise it to any number of people in any sleeping structure.
#1
phrase1 <- function(times){
  print(paste("Alice the camel has", times, "humps."))
}
phrase2 <- function(zero){
  if(!zero){
    print("So go, Alice, go.")
  } else {
    print("Now Alice is a horse.")
  }
}
for (i in 5:0){
  for (j in 1:3){
    phrase1(i)
  }
  phrase2(i == 0)
}
#2
rhyme <- function(cnt){
  print(cnt)
  print(paste("There were", cnt, "in the bed."))
  if(cnt > 1){
    print("and the little one said,")
    print("Roll over, roll over.")
    print("So they all rolled over and one fell out.")
  } else {
    print("and the little one said,")
    print("I’m lonely...")
  }
}
rhyme_gen <- function(n){
  if (n > 0){
    for (i in n:1){
      rhyme(i)
    }
  } else {
    print("n must be a positive integer.")
  }
}
#3
rhyme <- function(cnt){
  print(paste(cnt, "bottles of beer on the wall,", cnt, "bottles of beer."))
  if (cnt > 1){
    print(paste("Take one down and pass it around,", cnt - 1, "bottles of beer on the wall."))
  } else {
    print("Take one down and pass it around, no more bottles of beer on the wall.")
  }
}
rhyme_gen <- function(n){
  if (n > 0){
    for (i in n:1){
      rhyme(i)
    }
    print("No more bottles of beer on the wall, no more bottles of beer.")
    print(paste("Go to the store and buy some more,", n, "bottles of beer on the wall."))
  } else {
    print("n must be a positive integer.")
  }
}
4. It’s common to see for loops that don’t preallocate the output and instead increase the length of a vector at each step:
output <- vector("integer", 0)
for (i in seq_along(x)){
  output <- c(output, lengths(x[[i]]))
}
output
How does this affect performance? Design and execute an experiment.
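x <- rnorm(10000)
# Version 1: grow the vector at every step
output <- vector("integer", 0)
system.time(
  for (i in seq_along(x)){
    output <- c(output, lengths(x[[i]]))
  }
)
# user system elapsed
# 0.188 0.008 0.195
# Version 2: preallocate the output to the length of x
output <- vector("integer", length(x))
system.time(
  for (i in seq_along(x)){
    output[[i]] <- lengths(x[[i]])
  }
)
# user system elapsed
# 0.024 0.000 0.023
The preallocated version is about 8.5 times faster (0.195 s versus 0.023 s elapsed).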
21.3 For loop variations
21.3.5 Exercises
1. Imagine you have a directory full of CSV files that you want to read in. You have their paths in a vector, files <- dir("data/", pattern = "\\.csv$", full.names = TRUE), and now want to read each one with read_csv(). Write the for loop that will load them into a single data frame.
Each pass through the loop yields a data frame, so I collect them in a list and then bind them together.
output <- vector("list", length(files))
for (i in seq_along(files)){
  output[[i]] <- read_csv(files[i])
}
output <- bind_rows(output)
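To try the pattern without a real data/ directory, here is a small self-contained sketch. It is not the answer itself: the temporary directory, the file names a.csv and b.csv, and their contents are made up for illustration, and base read.csv() with do.call(rbind, ...) stands in for read_csv() and bind_rows() so that no packages are needed.

```r
# Hypothetical setup: write two small CSV files into a temporary directory.
tmp <- file.path(tempdir(), "csv_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(1.5, 2.5)),
          file.path(tmp, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:5, value = c(3.5, 4.5, 5.5)),
          file.path(tmp, "b.csv"), row.names = FALSE)

files <- dir(tmp, pattern = "\\.csv$", full.names = TRUE)

# Preallocate a list with one slot per file, fill it, then bind the rows.
output <- vector("list", length(files))
for (i in seq_along(files)) {
  output[[i]] <- read.csv(files[[i]])
}
combined <- do.call(rbind, output)

nrow(combined)
```

Collecting into a list first avoids growing a data frame inside the loop, which would copy all the accumulated rows on every iteration.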
2. What happens if you use for (nm in names(x)) and x has no names? What if only some of the elements are named? What if the names are not unique?
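x <- c(a = "apple", b = "banana", "chocolate", "diamonds", a = "aho")
for(nm in names(x)) print(nm)
#[1] "a"
#[1] "b"
#[1] ""
#[1] ""
#[1] "a"
When some elements of the vector have no name, the loop still runs once for each of them, with the name set to "".
Duplicated names make no difference either: the loop runs once per element.
If x has no names at all, names(x) is NULL and the loop body never runs.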
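The loop itself runs without complaint in these cases, but problems appear once the name is used to look up an element. A minimal sketch of the three situations follows; the vectors x_unnamed, x_dup and x_partial are made-up examples.

```r
# No names at all: names() returns NULL, so the loop body never runs.
x_unnamed <- c("apple", "banana")
n_iter <- 0
for (nm in names(x_unnamed)) {
  n_iter <- n_iter + 1
}
n_iter

# Duplicated names: [[ ]] returns the first match each time,
# so the third element ("aho") is never reached by name.
x_dup <- c(a = "apple", b = "banana", a = "aho")
nms <- names(x_dup)
looked_up <- vector("character", length(nms))
for (i in seq_along(nms)) {
  looked_up[[i]] <- x_dup[[nms[[i]]]]
}
looked_up

# Missing names: "" matches nothing, so the lookup raises an error.
x_partial <- c(a = "apple", "chocolate")
result <- tryCatch(x_partial[[""]], error = function(e) "error")
result
```

This is one reason seq_along(x) is usually the safer thing to loop over, with names(x)[[i]] fetched inside the body when the name is needed.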
3. Write a function that prints the mean of each numeric column in a data frame, along with its name. For example, show_mean(iris) would print:
show_mean(iris)
#> Sepal.Length: 5.84
#> Sepal.Width: 3.06
#> Petal.Length: 3.76
#> Petal.Width: 1.20
show_mean <- function(df){
  for (nm in names(df)){
    if (is.numeric(df[[nm]])) cat(paste0(nm, ": ", format(mean(df[[nm]]), digits = 3), "\n"))
  }
}
4. What does this code do? How does it work?
trans <- list(
  disp = function(x) x * 0.0163871,
  am = function(x) {
    factor(x, labels = c("auto", "manual"))
  }
)
for (var in names(trans)) {
  mtcars[[var]] <- trans[[var]](mtcars[[var]])
}
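In the data frame mtcars, disp is multiplied by 0.0163871 throughout (presumably a unit conversion, likely cubic inches to litres), and am, which was 0 or 1, is converted to a factor labelled "auto" and "manual" respectively.
The loop walks over the names in trans, so each function is applied to the column of mtcars that has the same name.
No other columns of mtcars are touched.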
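The behaviour can be confirmed on a copy of the data. In this sketch the copy is called cars (my own name, chosen so the built-in mtcars stays unmodified); everything else follows the exercise code.

```r
cars <- mtcars  # work on a copy so the built-in data set is left alone

trans <- list(
  disp = function(x) x * 0.0163871,
  am = function(x) factor(x, labels = c("auto", "manual"))
)

for (var in names(trans)) {
  # trans[[var]] is a function; it is called on the column with the same name
  cars[[var]] <- trans[[var]](cars[[var]])
}

# Only disp and am changed; every other column is identical to mtcars
head(cars[c("disp", "am")], 3)
identical(cars$mpg, mtcars$mpg)
```

Storing the transformations in a named list keeps the loop body generic: adding another column transformation only means adding another entry to trans.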
21.4 For loops vs. Functionals
21.4.1 Exercises
1. Read the documentation for apply(). In the 2d case, what two for loops does it generalise?
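Calling apply(X, c(1, 2), FUN) corresponds to the following nested loop, assuming FUN returns a single value for each element.
for (i in seq_len(nrow(X))){
  for (j in seq_len(ncol(X))){
    X[i, j] <- FUN(X[i, j])
  }
}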
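The question is usually read as asking about the two one-dimensional cases: MARGIN = 1 generalises a loop over rows, and MARGIN = 2 a loop over columns. The following is a small sketch of that reading, using a made-up 2 x 3 matrix and sum() as the summary function.

```r
X <- matrix(1:6, nrow = 2)

# MARGIN = 1: loop over the rows
row_out <- vector("double", nrow(X))
for (i in seq_len(nrow(X))) {
  row_out[[i]] <- sum(X[i, ])
}

# MARGIN = 2: loop over the columns
col_out <- vector("double", ncol(X))
for (j in seq_len(ncol(X))) {
  col_out[[j]] <- sum(X[, j])
}

all.equal(row_out, apply(X, 1, sum))
all.equal(col_out, apply(X, 2, sum))
```

Note the use of seq_len(nrow(X)) rather than nrow(X): iterating over nrow(X) alone would visit only the single value nrow(X), not every row index.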
2. Adapt col_summary() so that it only applies to numeric columns. You might want to start with an is_numeric() function that returns a logical vector that has a TRUE corresponding to each numeric column.
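col_summary <- function(df, fun) {
  # Count the numeric columns so the output can be preallocated
  l <- 0
  for (i in seq_along(df)) {
    if (is.numeric(df[[i]])) l <- l + 1
  }
  out <- vector("double", l)
  # k is the next free slot in out; i indexes the columns of df
  k <- 0
  for (i in seq_along(df)) {
    if (is.numeric(df[[i]])) {
      k <- k + 1
      out[k] <- fun(df[[i]])
    }
  }
  out
}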
21.5 The map functions
21.5.3 Exercises
1. Write code that uses one of the map functions to:
- Compute the mean of every column in mtcars.
- Determine the type of each column in nycflights13::flights.
- Compute the number of unique values in each column of iris.
- Generate 10 random normals for each of [latex]\mu = -10, 0, 10[/latex] and 100.
#1
map_dbl(mtcars, mean)
#2
map_chr(nycflights13::flights, typeof)
#3
map_int(iris, n_distinct)
#4
map(c(-10, 0, 10, 100), rnorm, n = 10)
2. How can you create a single vector that for each column in a data.frame indicates whether or not it's a factor?
map_lgl(df, is.factor)
3. What happens when you use the map functions on vectors that aren't lists? What does map(1:5, runif) do? Why?
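map(1:5, runif)
#[[1]]
#[1] 0.04953596
#[[2]]
#[1] 0.5390083 0.6190010
#[[3]]
#[1] 0.8798693 0.4998527 0.5550514
#[[4]]
#[1] 0.1745245 0.7734466 0.7938441 0.9021727
#[[5]]
#[1] 0.8783163 0.4091544 0.5472550 0.2817710 0.5079556
Even when the input is not a list, the result of map is a list (by definition).
In this example each value from 1 to 5 is passed as the first argument of runif (the number of samples), and the results are returned as a list.
In other words, the result is a list of length 5 whose elements are random vectors of length 1 to 5.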
4. What does map(-2:2, rnorm, n = 5) do? Why? What does map_dbl(-2:2, rnorm, n = 5) do? Why?
map(-2:2, rnorm, n = 5)
#[[1]]
#[1] -2.840793644 -0.005469319 -1.065458143 -0.974701590 -1.435614627
#[[2]]
#[1] -1.6694444 0.6454814 -1.5945935 -0.8147798 -1.7862531
#[[3]]
#[1] 1.0497548 0.0812402 -1.8419471 0.1736372 -0.1636357
#[[4]]
#[1] 1.05609725 -0.35278174 -0.72301882 1.83283869 -0.08094902
#[[5]]
#[1] 2.679921 2.271799 1.321095 2.277720 3.303472
The first argument of rnorm is n and the second is mean. Because n = 5 is supplied by name, each element of -2:2, the first argument of map, is passed to rnorm as the mean.
The result is a list of five vectors, each holding 5 draws, with means from -2 to 2.
map_dbl(-2:2, rnorm, n = 5)
#Error: Result 1 is not a length 1 atomic vector
With map_dbl, every return value must be a numeric vector of length 1. Here each call returns 5 values, so it fails.
5. Rewrite map(x, function(df) lm(mpg ~ wt, data = df)) to eliminate the anonymous function.
map(x, ~lm(mpg ~ wt, data = .))
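Strictly speaking, the formula shorthand is still an anonymous function in disguise. Another way to drop it, sketched below, is to pass lm itself and supply formula by name, so that each element of x falls through to the next free argument, data. The input x here is a hypothetical one (mtcars split by cyl), since the exercise does not define it.

```r
library(purrr)

# Hypothetical input: mtcars split into one data frame per cylinder count
x <- split(mtcars, mtcars$cyl)

# Formula shorthand, as in the answer above
fits_formula <- map(x, ~ lm(mpg ~ wt, data = .))

# No anonymous function: formula is named, so each data frame matches `data`
fits_named <- map(x, lm, formula = mpg ~ wt)

# Both approaches should give the same slope for wt in every group
slopes_formula <- map_dbl(fits_formula, ~ coef(.)[["wt"]])
slopes_named <- map_dbl(fits_named, ~ coef(.)[["wt"]])
all.equal(slopes_formula, slopes_named)
```

This works for the same reason map(-2:2, rnorm, n = 5) sends its values to mean: named arguments are matched first, and the mapped element takes the first position that is left.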
21.9 Other patterns of for loops
21.9.3 Exercises
1. Implement your own version of every() using a for loop. Compare it with purrr::every(). What does purrr's version do that your version doesn't?
my_every <- function(df, f, ...){
  check <- vector("logical", length(df))
  for (i in seq_along(df)) check[i] <- f(df[[i]], ...)
  all(check)
}
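As far as I can tell, two things purrr::every() does beyond this version are accepting the formula shorthand for the predicate and stopping as soon as one element fails. The second point can be sketched with a plain for loop. The names my_every2 and counting_is_numeric are hypothetical helpers for illustration, and this simplified version treats NA as a failure, which is not necessarily how purrr handles it.

```r
# A short-circuiting variant: return FALSE at the first failing element.
my_every2 <- function(x, f, ...) {
  for (i in seq_along(x)) {
    if (!isTRUE(f(x[[i]], ...))) {
      return(FALSE)
    }
  }
  TRUE
}

# Count how many times the predicate is actually called
calls <- 0
counting_is_numeric <- function(v) {
  calls <<- calls + 1
  is.numeric(v)
}

my_every2(list("a", 1, 2), counting_is_numeric)
calls  # the predicate ran once; the remaining elements were never checked
```

For a cheap predicate the difference is negligible, but with an expensive one on a long list, stopping early can save most of the work.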
2. Create an enhanced col_summary() that applies a summary function to every numeric column in a data frame.
col_summary <- function(df, fun){
  df %>%
    keep(is.numeric) %>%
    map_dbl(fun)
}
3. A possible base R equivalent of col_summary() is:
col_sum3 <- function(df, f) {
  is_num <- sapply(df, is.numeric)
  df_num <- df[, is_num]
  sapply(df_num, f)
}
But it has a number of bugs as illustrated with the following inputs:
df <- tibble(
  x = 1:3,
  y = 3:1,
  z = c("a", "b", "c")
)
# OK
col_sum3(df, mean)
# Has problems: don't always return numeric vector
col_sum3(df[1:2], mean)
col_sum3(df[1], mean)
col_sum3(df[0], mean)
What causes the bugs?
When I ran these, though, every call except the last one returned an ordinary numeric vector.
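A likely explanation is that sapply() does not guarantee its return type, and that `[` on a plain data.frame drops a single selected column to a bare vector, whereas a tibble never drops, which may be why the first three calls look fine. The sketch below uses a plain data.frame rather than the tibble from the exercise (an assumption on my part) to make both effects visible.

```r
# Plain data.frame used here; the exercise itself uses a tibble
df <- data.frame(x = 1:3, y = 3:1, z = c("a", "b", "c"))

# 1. sapply() over zero columns returns an empty list, not a logical vector,
#    so it cannot be used as a column index afterwards
is_num_empty <- sapply(df[0], is.numeric)
class(is_num_empty)

# 2. Selecting a single column with [, ] drops to a bare vector
one_col <- df[1]
dropped <- one_col[, sapply(one_col, is.numeric)]
is.data.frame(dropped)

# ...so sapply() then runs over the elements instead of over the columns
per_element <- sapply(dropped, mean)
per_element
```

Using vapply(df, is.numeric, logical(1)) together with df[is_num] (list-style indexing, which never drops) avoids both surprises; map_lgl() and keep() in purrr give the same guarantees.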