Solving the R for Data Science exercises – Chapter 3: Data visualization

As part of my own studying, I'm recording the exercises from the legendary Hadley's R for Data Science together with my answers.
I'll start with Chapter 3.


3. Data visualization

3.2. First Steps

3.2.4. Exercises

1. Run ggplot(data = mpg). What do you see?


2. How many rows are in mpg? How many columns?

library(ggplot2)  # mpg ships with ggplot2
nrow(mpg); ncol(mpg)
#[1] 234
#[1] 11

3. What does the drv variable describe? Read the help for ?mpg to find out.

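drv is the type of drive train. It takes three values, f, r, and 4, which mean front-wheel drive, rear-wheel drive, and four-wheel drive respectively.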
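As a quick check of the values described above, the following sketch lists the distinct values of drv and how often each one occurs. It assumes ggplot2 is installed, since mpg ships with that package; the names drv_values and drv_counts are only for illustration.

```r
library(ggplot2)  # provides the mpg dataset

# Distinct drive-train codes: "f" = front-wheel, "r" = rear-wheel, "4" = four-wheel
drv_values <- sort(unique(mpg$drv))
print(drv_values)

# Number of cars for each drive-train type
drv_counts <- table(mpg$drv)
print(drv_counts)
```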

4. Make a scatterplot of hwy vs cyl.

ggplot(mpg, aes(x = cyl, y = hwy)) + geom_point()

5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

ggplot(mpg, aes(class, drv)) + geom_point()
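A sketch of why this scatterplot is hard to read: both variables are categorical, so many cars land on exactly the same point and the plot cannot show how many there are. Counting the rows per combination, or using geom_count() in place of geom_point(), makes the hidden overlap visible. The names combos and p_count are only for illustration.

```r
library(ggplot2)

# Count how many cars share each (class, drv) combination
combos <- as.data.frame(table(class = mpg$class, drv = mpg$drv))
combos <- combos[combos$Freq > 0, ]
print(combos)

# Same axes as the scatterplot, but point size now encodes the count
p_count <- ggplot(mpg, aes(class, drv)) + geom_count()
print(p_count)
```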


3.3. Aesthetic mappings

3.3.1. Exercises

1. What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))


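Inside aes(), color = "blue" is treated as a mapping to a constant variable whose only value is the string "blue", so ggplot2 picks the color from its default scale. To set the color by hand, put it outside aes():

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")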

2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

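The categorical variables are manufacturer, model, trans, drv, fl, and class (all character columns). year and cyl are stored as integers but take only a few distinct values, so they can also be treated as categorical.
The continuous variables are displ, cty, and hwy.
When you print mpg, the type of each column (<chr>, <int>, <dbl>) is shown under the column name.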
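One possible way to check the column types without reading the printed tibble is sketched below. It only uses base R on top of the mpg data from ggplot2; col_types and categorical are illustrative names.

```r
library(ggplot2)

# Storage type of every column in mpg
col_types <- sapply(mpg, class)
print(col_types)

# Character columns are the clearly categorical ones
categorical <- names(col_types)[col_types == "character"]
print(categorical)
```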

3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?


4. What happens if you map the same variable to multiple aesthetics?


5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

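stroke sets the width of a point's border. The help for geom_point has an example that uses shape = 21. With the default shape the point is filled solid, so differences in border width cannot be seen.
It also works with shape = 0, 1, 2, and so on.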
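A minimal sketch of the stroke aesthetic, modeled on the shape = 21 example in the geom_point help but using mpg; the specific size and stroke values are arbitrary choices for illustration.

```r
library(ggplot2)

# shape = 21 is a circle with a separate border (colour) and interior (fill);
# stroke sets the width of that border
p_stroke <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(shape = 21, colour = "black", fill = "white", size = 3, stroke = 2)
print(p_stroke)
```

Changing stroke while leaving size fixed should make the black ring thicker or thinner without changing the white interior.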

6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?

In this example the expression displ < 5 is evaluated to a logical vector, and the points are colored in two groups, TRUE and FALSE.
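The behaviour can be tried with a short sketch like the following; p_lgl is just an illustrative name.

```r
library(ggplot2)

# displ < 5 is evaluated per row, giving TRUE/FALSE,
# and that logical result is what gets mapped to colour
p_lgl <- ggplot(mpg, aes(x = displ, y = hwy, colour = displ < 5)) +
  geom_point()
print(p_lgl)

# The same split, computed directly
print(table(mpg$displ < 5))
```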

3.5. Facets

3.5.1. Exercise

1. What happens if you facet on a continuous variable?


2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl))

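The aes(x = drv, y = cyl) plot shows whether any data exists for each combination of drv and cyl. The empty cells in facet_grid(drv ~ cyl) are exactly the combinations that have no point in this plot.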
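The same information can be read off a contingency table. This is a sketch using base R's table(); drv_cyl is an illustrative name, and the displ/hwy axes in the faceted plot are taken from the other examples in this section.

```r
library(ggplot2)

# Rows: drv, columns: cyl; a zero marks a combination with no cars,
# which corresponds to an empty panel in facet_grid(drv ~ cyl)
drv_cyl <- table(drv = mpg$drv, cyl = mpg$cyl)
print(drv_cyl)

# The faceted plot for comparison
p_grid <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)
print(p_grid)
```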

3. What plots does the following code make? What does . do?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

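facet_grid(drv ~ .) stacks the panels vertically, one row per value of drv.
facet_grid(. ~ cyl) lays the panels out horizontally, one column per value of cyl.
The period means that no variable is specified for that dimension. The second call, facet_grid(. ~ cyl), also works without the period, as facet_grid(~ cyl).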

4. Take the first faceted plot in this section:

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

With the color aesthetic, patterns are hard to pick out when points from different classes overlap. When the data is dense, splitting it into facets makes a big difference.

5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol argument?

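nrow sets the number of rows in which facet_wrap arranges the panels, and ncol sets the number of columns. For example, nrow = 1 puts all panels in a single horizontal row.
The dir option sets the order in which the panels are laid out. With dir = "v", the panels are filled in vertically.
In facet_grid, the numbers of rows and columns are determined by the number of distinct values of the faceting variables, so they cannot be controlled through options.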
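The layout options can be compared side by side with a sketch like this one. The base object is an illustrative helper; nrow, ncol, and dir are the facet_wrap arguments discussed above.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# nrow = 1: all panels in a single horizontal row
print(base + facet_wrap(~ class, nrow = 1))

# ncol = 2: panels wrapped into two columns
print(base + facet_wrap(~ class, ncol = 2))

# dir = "v": panels are filled column by column rather than row by row
print(base + facet_wrap(~ class, nrow = 2, dir = "v"))
```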

6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?


3.6. Geometric objects

3.6.1. Exercise

1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

A line chart uses geom_line, a boxplot uses geom_boxplot, a histogram uses geom_histogram, and an area chart uses geom_area.

2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

3. What does show.legend = FALSE do? What happens if you remove it?
Why do you think I used it earlier in the chapter?

show.legend = FALSE removes the legend on the right.
It was used earlier so that the plot is the same size as the plots that have no legend by default.

4. What does the se argument to geom_smooth() do?

se stands for standard error; it specifies whether the confidence interval around the smooth is displayed.
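A small sketch contrasting the two settings; base, p_band, and p_line are illustrative names, and the smoothing method is left at whatever geom_smooth() picks by default.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Default (se = TRUE): a shaded confidence band is drawn around the curve
p_band <- base + geom_smooth(se = TRUE)

# se = FALSE: only the fitted curve is drawn
p_line <- base + geom_smooth(se = FALSE)

print(p_band)
print(p_line)
```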

5. Will these two graphs look different? Why/why not?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() +
  geom_smooth()

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))


6. Recreate the R code necessary to generate the following graphs.

gp <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
gp + geom_point() + geom_smooth(se = FALSE)
gp + geom_point() + geom_smooth(aes(group = drv), se = FALSE)
gp + geom_point(aes(color = drv)) + geom_smooth(aes(color = drv), se = FALSE)
gp + geom_point(aes(color = drv)) + geom_smooth(se = FALSE)
gp + geom_point(aes(color = drv)) + geom_smooth(aes(color = drv, linetype = drv), se = FALSE)
gp + geom_point(aes(fill = drv), shape = 21, color = "white", stroke = 1)

3.7. Statistical transformations

3.7.1. Exercises

1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?


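The default geom of stat_summary() is geom_pointrange.

# fun, fun.min, fun.max are the argument names in ggplot2 3.3.0 and later
# (formerly fun.y, fun.ymin, fun.ymax)
ggplot(diamonds) +
  geom_pointrange(aes(cut, depth), stat = "summary", fun = median, fun.min = min, fun.max = max)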

2. What does geom_col() do? How is it different to geom_bar()?


3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

Aesthetics such as color, size, and fill are common to all plots.

geom            stat           shared settings
geom_bar        stat_count     width, position
geom_point      stat_identity  position
geom_histogram  stat_bin       position, binwidth, bins
geom_density    stat_density   position
geom_boxplot    stat_boxplot   position
geom_count      stat_sum       position


4. What variables does stat_smooth() compute? What parameters control its behaviour?


5. In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))

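Setting group = 1 makes the proportions be computed with the whole dataset as a single group. Without it, each value of cut is its own group, so every bar gets a proportion of 1.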
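The difference can be checked numerically with a sketch like the one below. It assumes ggplot2 3.3.0 or later, where after_stat(prop) is the current spelling of ..prop..; p_wrong and p_fixed are illustrative names, and layer_data() is used only to inspect the computed values.

```r
library(ggplot2)

# Without group = 1, every cut is its own group, so each proportion is 1
p_wrong <- ggplot(diamonds) +
  geom_bar(aes(x = cut, y = after_stat(prop)))

# With group = 1, proportions are taken over the whole dataset
p_fixed <- ggplot(diamonds) +
  geom_bar(aes(x = cut, y = after_stat(prop), group = 1))

# Inspect the proportions that stat_count computed for each bar
print(layer_data(p_wrong)$prop)
print(layer_data(p_fixed)$prop)
```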

3.8. Position Adjustment

3.8.1. Exercises

1. What is the problem with this plot? How could you improve it?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()

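Both cty and hwy take only a small number of distinct values, so the points in the scatterplot overlap and you cannot tell how the data is actually spread.
Adding the position = "jitter" option to geom_point makes the points visible.
Alternatively, specifying stat = "sum" instead of position = "jitter" shows the number of observations through the size of each point.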
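Both fixes can be sketched as follows; base, p_jitter, and p_sum are illustrative names.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = cty, y = hwy))

# Option 1: add a little random noise so overlapping points separate
p_jitter <- base + geom_point(position = "jitter")

# Option 2: keep exact positions and encode the number of overlapping rows as size
p_sum <- base + geom_point(stat = "sum")

print(p_jitter)
print(p_sum)
```

Jittering keeps one point per car but moves each one slightly, while stat = "sum" keeps the exact coordinates and collapses duplicates into a single sized point.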

2. What parameters to geom_jitter() control the amount of jittering?


3. Compare and contrast geom_jitter() with geom_count().


4. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

The default position adjustment is "dodge2".

ggplot(mpg, aes(drv, hwy)) + geom_boxplot(aes(color = class))

3.9. Coordinate systems

3.9.1. Exercises

1. Turn a stacked bar chart into a pie chart using coord_polar().

ggplot(diamonds, aes(cut)) + geom_bar(aes(fill = color)) + coord_polar()
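With the default theta = "x", the code above produces a coxcomb-style chart with one wedge per cut. For a conventional pie chart, one possible variant is a single stacked bar with theta = "y", sketched below; using an empty string as a constant x value is a common idiom rather than anything specific to this exercise, and p_pie is an illustrative name.

```r
library(ggplot2)

# One stacked bar (constant x) whose segments are the color grades;
# theta = "y" maps bar height to angle, turning the bar into a pie
p_pie <- ggplot(diamonds, aes(x = "", fill = color)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y")
print(p_pie)
```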

2. What does labs() do? Read the documentation.

labs(x = "x labs", y = "y labs") sets the labels of the x-axis and y-axis.
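A short sketch of labs() in use. The label strings are made up for illustration; title is another argument that labs() accepts alongside x and y.

```r
library(ggplot2)

p_labs <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  labs(
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    title = "Fuel economy versus engine size"
  )
print(p_labs)
```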

3. What’s the difference between coord_quickmap() and coord_map()?


4. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()



R for Data Science Chapter 3: Data visualization