神Hadley R for Data Science の例題たちとその解答を書き残します。
今回はChapter 13 Relational dataです。
過去の記事
- Chapter 3 データ可視化
- Chapter 5 データ変換
- Chapter 7 探索的データ分析
- Chapter 10 tibble
- Chapter 11 Data import
- Chapter 12 Tidy data
この章ではdplyr
ライブラリを使います。
またデモ用データとしてChapter 5 データ変換でも使ったnycflights13
を使います。
このため
library(tidyverse) library(nycflights13)
を本章で準備する環境にします。
またER図を書くためにgraphvizも入れます。
library(DiagrammeR)
Chapter 13 Relational data
13.2 nycflights13
13.2.1 Exercises
1. Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?
flights %>% select(tailnum, origin, dest) %>% unique
2. I forgot to draw the relationship between
weather
andairports
. What is the relationship and how should it appear in the diagram?
grViz("digraph dot{ graph[rankdir = LR] node[shape = box] faa[xlabel = airports] origin[xlabel = weather] origin->faa[arrowhead = odot] }")

3.
weather
only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define withflights
?
grViz('digraph structs { graph[rankdir = LR]; node [shape=record]; struct1 [xlabel = flights, label="<f0> year|<f1> month|<f2> day|<f3>hour|<f4>flight|<f5>origin|<f6>dest|..."]; con1 [shape = point]; struct3 [xlabel = weather, label="<f0> year|<f1> month|<f2> day|<f3>hour|<f4>origin|..."]; con1 -> struct1:f0; con1 -> struct1:f1; con1 -> struct1:f2; con1 -> struct1:f3; con1 -> struct1:f6; con1 -> struct3:f0[arrowhead = odot]; con1 -> struct3:f1[arrowhead = odot]; con1 -> struct3:f2[arrowhead = odot]; con1 -> struct3:f3[arrowhead = odot]; con1 -> struct3:f4[arrowhead = odot]; {rank = min; struct1}; {rank = max; struct3}; } ')

4. We know that some days of the year are “special”, and fewer people than usual fly on them. How might you represent that data as a data frame? What would be the primary keys of that table? How would it connect to the existing tables?
grViz('digraph structs{ graph[rankdir = LR]; node[shape = record]; struct3 [xlabel = weather, label="<f0> year|<f1> month|<f2> day|<f3>hour|<f4>origin|..."]; con2 [shape = point]; struct1 [xlabel = special, label="<f0> year|<f1> month|<f2> day"]; con1 [shape = point]; struct2 [xlabel = flights, label="<f0> year|<f1> month|<f2> day|<f3>hour|<f4>flight|..."]; struct1:f0 -> con1[arrowhead = none, arrowtail = odot, dir=both]; struct1:f1 -> con1[arrowhead = none, arrowtail = odot, dir=both]; struct1:f2 -> con1[arrowhead = none, arrowtail = odot, dir=both]; con2 -> struct1:f0[arrowhead = odot]; con2 -> struct1:f1[arrowhead = odot]; con2 -> struct1:f2[arrowhead = odot]; struct3:f0 -> con2[arrowhead = none, arrowtail = normal, dir=both]; struct3:f1 -> con2[arrowhead = none, arrowtail = normal, dir=both]; struct3:f2 -> con2[arrowhead = none, arrowtail = normal, dir=both]; con1 -> struct2:f0; con1 -> struct2:f1; con1 -> struct2:f2; {rank = min; struct3}; {rank = max; struct2}; }')
month
, day
のみにしてyear
を含まないという選択もありえるけれども、毎年変化する祝日(体育の日)などもあるので、年月日で持つ。
たとえばyear
がNA
の場合は任意の年に適応される、といった制御も可能性としてはありえる。
13.3 Keys
13.3.1 Exercises
1. Add a surrogate key to flights
mutate(flights, sk = row_number())
2. Identify the keys in the fllowing datasets
Lahman::Batting
babynames::babynames
nasaweather::atoms
fueleconomy::vehicles
ggplot2::diamonds
playerID, yearID, stint
year, name, sex
lat, long, year, month
id
- キーなし
3. Draw a diagram illustarting the connections between the
Batting
,Master
, andSalaries
tables in the Lahman package. Draw another diagram that shows the relationship betweenMaster
,Managers
,AwardsManagers
. How would you characterise the relationship between theBatting
,Pitching
, andFielding
tables?
grViz('digraph structs{ graph[rankdir = LR]; node[shape = record]; struct1 [xlabel = Master, label="<f0> playerID|..."]; struct2 [xlabel = Batting, label="<f0> playerID|<f1> yearID|<f2> teamID|..."]; struct3 [xlabel = Salaries, label="<f0> playerID|<f1> yearID|<f2> teamID|..."]; con1 [shape = point]; struct1:f0 -> struct2:f0[arrowtail = odot, arrowhead = normal, dir = both]; struct1:f0 -> struct3:f0[arrowtail = odot, arrowhead = normal, dir = both]; struct3:f0 -> con1[arrowtail = odot, arrowhead = none, dir = both]; struct3:f1 -> con1[arrowtail = odot, arrowhead = none, dir = both]; struct3:f2 -> con1[arrowtail = odot, arrowhead = none, dir = both]; con1 -> struct2:f0; con1 -> struct2:f1; con1 -> struct2:f2; {rank = min; struct1}; {rank = same; struct2, struct3}; {rank = max; con1}; }') grViz('digraph structs{ graph[rankdir = LR]; node[shape = record]; struct1 [xlabel = Master, label="<f0> playerID|..."]; struct2 [xlabel = Managers, label="<f0> yearID|<f1> teamID|<f2> inseason|<f3> playerID|..."]; struct3 [xlabel = AwardsManagers, label="<f0> playerID|<f1> awardID|<f2> yearID|..."]; struct1:f0 -> struct2:f3[arrowtail = odot, arrowhead = normal, dir = both]; struct1:f0 -> struct3:f0[arrowtail = odot, arrowhead = normal, dir = both]; }')
Batting, Pitching, Fielding
それぞれ各選手の年ごとの成績を表している。
Fielding
のみポジション(POS)がキーとして追加される。
13.4 Mutating Joins
13.4.6 Exercises
1. Compute the average delay by destination, then join on the
airports
data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:airports %>% semi_join(flights, c("faa" = "dest")) %>% ggplot(aes(lon, lat)) + borders("state") + geom_point() + coord_quickmap()
borders()
を使うために前処理としてmaps
パッケージをインストールする必要がある。
install.packages("maps") airports %>% semi_join(flights, c("faa" = "dest")) %>% left_join( flights %>% group_by(dest) %>% summarise(ave_delay = mean(arr_delay, na.rm=TRUE)), by = c("faa" = "dest") ) %>% ggplot(aes(lon, lat)) + borders("state") + geom_point(aes(color = ave_delay)) + coord_quickmap()
2. Add the location of the origin and destination (i.e. the
lat
andlon
) toflights
.
flights %>% left_join(airports, by = c("origin" = "faa")) %>% left_join(airports, by = c("dest" = "faa"), suffix = c("", "_dest"))
3. Is there a relationship between the age of a plane and its delays?
まず飛行機の年齢の分布
flights %>% left_join(planes, by = "tailnum", suffix = c("", "_made")) %>% mutate(plane_age = year - year_made) %>% ggplot(aes(plane_age)) + geom_bar()
12年目が一番多く、18年目が少ない。
20歳周辺にももう一度ピークがある。
flights %>% left_join(planes, by = "tailnum", suffix = c("", "_made")) %>% mutate(plane_age = year - year_made) %>% ggplot(aes(plane_age, arr_delay)) + stat_summary()
4. What weather conditions make it more likely to see a delay?
flights %>% left_join(weather) %>% mutate(humid_cut = cut_number(humid, 30)) %>% ggplot(aes(humid_cut, arr_delay)) + stat_summary()
湿度が70あたりから高いほど遅れる。
湿度が高いのは雨が振っているためではないか。
flights %>% left_join(weather) %>% ggplot(aes(wind_dir, arr_delay)) + stat_summary() flights %>% left_join(weather) %>% ggplot(aes(wind_speed, arr_delay)) + stat_summary() flights %>% left_join(weather) %>% filter(wind_speed < 100, wind_speed > 0) %>% ggplot(aes(wind_speed, wind_dir)) + geom_count()
風が強いと遅れる。
風向きは100度あたりが遅れる。
特に100度あたりが強いというわけではないので、やはり向きも重要そう。
5. What happened on June 13 2013? Display the spatial pattern of delays, and then use Google to cross-reference with the weather.
flights %>% filter(month == 6, day %in% 10:16) %>% gather(key, delay, dep_delay, arr_delay) %>% ggplot(aes(factor(day), delay)) + stat_summary(aes(color = key))
- 13日は出発の遅れは突出して遅い
- 13日は出発よりも到着の方がより遅れている
遅れは出発時点で発生し、また出発が遅れると巻き返そうとする傾向があるため、出発よりも到着の遅れの方が短くなると考えられる。
にもかかわらず13日だけは出発の遅れよりも到着の遅れの方が大きくなっているのは、何が起こっているのだろうか?
flights %>% filter(month == 6, day %in% 12:13) %>% mutate( day = factor(day), dest = reorder(dest, arr_delay - dep_delay, mean, na.rm=TRUE) ) %>% ggplot(aes(dest, arr_delay - dep_delay)) + stat_summary(aes(color = day), alpha = 2/3) + coord_flip()
13日の到着の遅れはどこかの地域に偏っているわけではなくアメリカ全体で発生している。
ググってみるとどうやら当日は東海岸側で大きな嵐が発生していたみたい。
https://www.usatoday.com/story/todayinthesky/2013/06/13/severe-storms-snarl-flights-across-the-east/2418761/
13.5 Filtering Join
13.5.1 Exercises
1. What does it mean for a flight to have a missing
tailnum
? What do the tail numbers that don’t have a matching record inplanes
have in common?
tailnum
がNA
のレコードを見るとすべてdep_time
もNA
になっている。
すなわちフライトがキャンセルになったと考えられる。
flights %>% mutate(inplanes = tailnum %in% planes$tailnum) %>% count(inplanes, carrier) %>% spread(inplanes, n) ## A tibble: 16 x 3 # carrier `FALSE` `TRUE` # <chr> <int> <int> # 9E 1044 17416 # AA 22558 10171 # AS NA 714 # B6 830 53805 # DL 110 48000 # EV NA 54173 # F9 50 635 # FL 187 3073 # HA NA 342 # MQ 25397 1000 # OO NA 32 # UA 1693 56972 # US 699 19837 # VX NA 5162 # WN 38 12237 # YV NA 601
とりあえずcarrier
MQのフライトのtailnum
はplanes
に入っていないようだ。
ほかのcarrier
はほとんどすべてがplanes
に入っている。
carrier
AAは1/3だけがplanes
に入っていてひとつだけ中途半端になっている。
2. Filter flights to only show flights with planes that have flown at least 100 flights.
flights %>% semi_join( count(flights, tailnum) %>% filter(n > 100), by = "tailnum")
joinの練習だからこういう書き方をしたけど、普通にgroup byでやったほうがシンプルなような・・・
flights %>% group_by(tailnum) %>% filter(n() > 100)
3.Combine
fueleconomy::vehicles
andfueleconomy::common
to find only the records for the most common models.
fueleconomy::vehicles %>% semi_join(fueleconomy::common, by = c("make", "model"))
4. Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?
flights %>% group_by(date = as.Date(time_hour) %>% as.numeric %/% 2) %>% summarise(delay = sum(arr_delay, na.rm=TRUE)) %>% ungroup %>% mutate(date = as.Date(date * 2, origin = "1970-01-01")) arrange(desc(delay)) ## A tibble: 183 x 2 # date delay # <date> <dbl> # 2013-07-22 94135 # 2013-06-30 87095 # 2013-06-24 84013 # 2013-12-09 71210 # 2013-05-23 69749 # 2013-07-10 65845 # 2013-06-26 65411 # 2013-03-08 63236 # 2013-03-18 62724 # 2013-12-05 60890 ## ... with 173 more rows
48時間で一番遅れているのは2013-07-22から始まる48時間。
この日の天候データを見てみると、temp, dewp, humidが高い。
意外にwind_speedやprecipは普通。
5. What does
anti_join(flights, airports, by = c("dest" = "faa"))
tell you? What doesanti_join(airports, flights, by = c("faa" = "dest"))
tell you?
1つめ)目的地の空港についてairports
に情報が入っていないフライトだけを抽出する。
2つめ)期間中に一度もニューヨークからのフライトが無かった空港
6. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve leaned above.
flights %>% semi_join( flights %>% group_by(tailnum) %>% summarise(n = n_distinct(carrier)) %>% filter(n > 1, !is.na(tailnum)) , by = "tailnum") %>% count(carrier, tailnum) %>% spread(carrier, n) # A tibble: 17 x 5 # tailnum `9E` DL EV FL # <chr> <int> <int> <int> <int> # N146PQ 8 NA 36 NA # N153PQ 5 NA 26 NA # N176PQ 7 NA 21 NA # N181PQ 4 NA 35 NA # N197PQ 2 NA 31 NA # N200PQ 7 NA 28 NA # N228PQ 8 NA 20 NA # N232PQ 7 NA 35 NA # N933AT NA 25 NA 10 # N935AT NA 77 NA 11 # N977AT NA 62 NA 24 # N978AT NA 43 NA 22 # N979AT NA 39 NA 19 # N981AT NA 22 NA 22 # N989AT NA 42 NA 24 # N990AT NA 57 NA 14 # N994AT NA 9 NA 22
上記の17台の航空機を除けば1つのキャリアのみでフライトしている。