R for Data Scienceの例題を解く- Chapter 13 関係データ

神Hadley R for Data Science の例題たちとその解答を書き残します。
今回はChapter 13 Relational dataです。

過去の記事

この章ではdplyrライブラリを使います。
またデモ用データとしてChapter 5 データ変換でも使ったnycflights13を使います。
このため

library(tidyverse)
library(nycflights13)

を本章で準備する環境にします。

またER図を書くためにgraphvizも入れます。

library(DiagrammeR)

Chapter 13 Relational data

13.2 nycflights13

13.2.1 Exercises

1. Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What tables would you need to combine?

flights %>% 
  select(tailnum, origin, dest) %>% 
  unique

2. I forgot to draw the relationship between weather and airports. What is the relationship and how should it appear in the diagram?

grViz("digraph dot{
  graph[rankdir = LR]
  node[shape = box]
  faa[xlabel = airports]
  origin[xlabel = weather]
  origin->faa[arrowhead = odot]
}")
weatherとairports

3. weather only contains information for the origin (NYC) airports. If it contained weather records for all airports in the USA, what additional relation would it define with flights ?

grViz('digraph structs {
  graph[rankdir = LR];
  node [shape=record];
  struct1 [xlabel = flights, label="<f0> year|<f1> month|<f2> day|<f3>hour|<f4>flight|<f5>origin|<f6>dest|..."];
  con1 [shape = point];
  struct3 [xlabel = weather, label="<f0> year|<f1> month|<f2> day|<f3>hour|<f4>origin|..."];
  con1 -> struct1:f0;
  con1 -> struct1:f1;
  con1 -> struct1:f2;
  con1 -> struct1:f3;
  con1 -> struct1:f6;
  con1 -> struct3:f0[arrowhead = odot];
  con1 -> struct3:f1[arrowhead = odot];
  con1 -> struct3:f2[arrowhead = odot];
  con1 -> struct3:f3[arrowhead = odot];
  con1 -> struct3:f4[arrowhead = odot];
  {rank = min; struct1};
  {rank = max; struct3};
}
')  
flightsとweather

4. We know that some days of the year are “special”, and fewer people than usual fly on them. How might you represent that data as a data frame? What would be the primary keys of that table? How would it connect to the existing tables?

grViz('digraph structs{
  graph[rankdir = LR];
  node[shape = record];
  struct3 [xlabel = weather, label="<f0> year|<f1> month|<f2> day|<f3>hour|<f4>origin|..."];
  con2 [shape = point];
  struct1 [xlabel = special, label="<f0> year|<f1> month|<f2> day"];
  con1 [shape = point];
  struct2 [xlabel = flights, label="<f0> year|<f1> month|<f2> day|<f3>hour|<f4>flight|..."];
  struct1:f0 -> con1[arrowhead = none, arrowtail = odot, dir=both];
  struct1:f1 -> con1[arrowhead = none, arrowtail = odot, dir=both];
  struct1:f2 -> con1[arrowhead = none, arrowtail = odot, dir=both];
  con2 -> struct1:f0[arrowhead = odot];
  con2 -> struct1:f1[arrowhead = odot];
  con2 -> struct1:f2[arrowhead = odot];
  struct3:f0 -> con2[arrowhead = none, arrowtail = normal, dir=both];
  struct3:f1 -> con2[arrowhead = none, arrowtail = normal, dir=both];
  struct3:f2 -> con2[arrowhead = none, arrowtail = normal, dir=both];
  con1 -> struct2:f0;
  con1 -> struct2:f1;
  con1 -> struct2:f2;
  {rank = min; struct3};
  {rank = max; struct2};
}')
flightsとweatherとspecial

13.3 Keys

13.3.1 Exercises

1. Add a surrogate key to flights

mutate(flights, sk = row_number())

2. Identify the keys in the fllowing datasets

  1. Lahman::Batting
  2. babynames::babynames
  3. nasaweather::atoms
  4. fueleconomy::vehicles
  5. ggplot2::diamonds
  1. playerID, yearID, stint
  2. year, name, sex
  3. lat, long, year, month
  4. id
  5. キーなし

3. Draw a diagram illustarting the connections between the Batting, Master, and Salaries tables in the Lahman package. Draw another diagram that shows the relationship between Master, Managers, AwardsManagers. How would you characterise the relationship between the Batting, Pitching, and Fielding tables?

grViz('digraph structs{
  graph[rankdir = LR];

  node[shape = record]; 
  struct1 [xlabel = Master, label="<f0> playerID|..."];
  struct2 [xlabel = Batting, label="<f0> playerID|<f1> yearID|<f2> teamID|..."];
  struct3 [xlabel = Salaries, label="<f0> playerID|<f1> yearID|<f2> teamID|..."]; 
  con1 [shape = point];

  struct1:f0 -> struct2:f0[arrowtail = odot, arrowhead = normal, dir = both];
  struct1:f0 -> struct3:f0[arrowtail = odot, arrowhead = normal, dir = both];
  struct3:f0 -> con1[arrowtail = odot, arrowhead = none, dir = both];
  struct3:f1 -> con1[arrowtail = odot, arrowhead = none, dir = both];
  struct3:f2 -> con1[arrowtail = odot, arrowhead = none, dir = both];
  con1 -> struct2:f0;
  con1 -> struct2:f1;
  con1 -> struct2:f2;
      
  {rank = min; struct1};
  {rank = same; struct2, struct3};
  {rank = max; con1};
}')

grViz('digraph structs{
  graph[rankdir = LR];
  
  node[shape = record];
  struct1 [xlabel = Master, label="<f0> playerID|..."];
  struct2 [xlabel = Managers, label="<f0> yearID|<f1> teamID|<f2> inseason|<f3> playerID|..."];
  struct3 [xlabel = AwardsManagers, label="<f0> playerID|<f1> awardID|<f2> yearID|..."]; 

  struct1:f0 -> struct2:f3[arrowtail = odot, arrowhead = normal, dir = both];
  struct1:f0 -> struct3:f0[arrowtail = odot, arrowhead = normal, dir = both];

}')

Batting, Pitching, Fieldingそれぞれ各選手の年ごとの成績を表している。
Fieldingのみポジション(POS)がキーとして追加される。

13.4 Mutating Joins

13.4.6 Exercises

1. Compute the average delay by destination, then join on the airports data frame so you can show the spatial distribution of delays. Here’s an easy way to draw a map of the United States:

airports %>%
  semi_join(flights, c("faa" = "dest")) %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point() +
    coord_quickmap()

borders()を使うために前処理としてmapsパッケージをインストールする必要がある。

install.packages("maps")

airports %>%
  semi_join(flights, c("faa" = "dest")) %>%
  left_join(
    flights %>%
      group_by(dest) %>%
      summarise(ave_delay = mean(arr_delay, na.rm=TRUE)),
    by = c("faa" = "dest")
  ) %>%
  ggplot(aes(lon, lat)) +
    borders("state") +
    geom_point(aes(color = ave_delay)) +
    coord_quickmap()

2. Add the location of the origin and destination (i.e. the lat and lon ) to flights.

flights %>%
  left_join(airports, by = c("origin" = "faa")) %>%
  left_join(airports, by = c("dest" = "faa"), suffix = c("", "_dest"))

3. Is there a relationship between the age of a plane and its delays?

まず飛行機の年齢の分布

flights %>% 
  left_join(planes, by = "tailnum", suffix = c("", "_made")) %>% 
  mutate(plane_age = year - year_made) %>% 
  ggplot(aes(plane_age)) +
    geom_bar()


12年目が一番多く、18年目が少ない。
20歳周辺にももう一度ピークがある。

flights %>% 
  left_join(planes, by = "tailnum", suffix = c("", "_made")) %>% 
  mutate(plane_age = year - year_made) %>% 
  ggplot(aes(plane_age, arr_delay)) +
    stat_summary()


平均の遅れでは7年目〜10年目が多い。

4. What weather conditions make it more likely to see a delay?

flights %>% 
  left_join(weather) %>% 
  mutate(humid_cut = cut_number(humid, 30)) %>% 
  ggplot(aes(humid_cut, arr_delay)) +
    stat_summary()

湿度と遅れ

湿度が70あたりから高いほど遅れる。
湿度が高いのは雨が振っているためではないか。

flights %>% 
  left_join(weather) %>% 
  ggplot(aes(wind_dir, arr_delay)) +
    stat_summary() 
flights %>% 
  left_join(weather) %>% 
  ggplot(aes(wind_speed, arr_delay)) +
    stat_summary()
flights %>%
  left_join(weather) %>%
  filter(wind_speed < 100, wind_speed > 0) %>%
  ggplot(aes(wind_speed, wind_dir)) +
    geom_count()

風の強さと遅れ

風向きと遅れ


風が強いと遅れる。
風向きは100度あたりが遅れる。
特に100度あたりが強いというわけではないので、やはり向きも重要そう。

5. What happened on June 13 2013? Display the spatial pattern of delays, and then use Google to cross-reference with the weather.

よくわからない

13.5 Filtering Join

13.5.1 Exercises

1. What does it mean for a flight to have a missing tailnum ? What do the tail numbers that don’t have a matching record in planes have in common?

tailnumNAのレコードを見るとすべてdep_timeNAになっている。
すなわちフライトがキャンセルになったと考えられる。

flights %>%
  mutate(inplanes = tailnum %in% planes$tailnum) %>%
  count(inplanes, carrier) %>%
  spread(inplanes, n)
## A tibble: 16 x 3
# carrier `FALSE` `TRUE`
# <chr>     <int>  <int>
# 9E         1044  17416
# AA        22558  10171
# AS           NA    714
# B6          830  53805
# DL          110  48000
# EV           NA  54173
# F9           50    635
# FL          187   3073
# HA           NA    342
# MQ        25397   1000
# OO           NA     32
# UA         1693  56972
# US          699  19837
# VX           NA   5162
# WN           38  12237
# YV           NA    601

とりあえずcarrierMQのフライトのtailnumplanesに入っていないようだ。
ほかのcarrierはほとんどすべてがplanesに入っている。
carrierAAは1/3だけがplanesに入っていてひとつだけ中途半端になっている。

2. Filter flights to only show flights with planes that have flown at least 100 flights.

flights %>%
  semi_join(
    count(flights, tailnum) %>% filter(n > 100),
    by = "tailnum")

joinの練習だからこういう書き方をしたけど、普通にgroup byでやったほうがシンプルなような・・・

flights %>% group_by(tailnum) %>% filter(n() > 100)

3.Combine fueleconomy::vehicles and fueleconomy::common to find only the records for the most common models.

fueleconomy::vehicles %>% semi_join(fueleconomy::common, by = c("make", "model"))

4. Find the 48 hours (over the course of the whole year) that have the worst delays. Cross-reference it with the weather data. Can you see any patterns?

flights %>%
  group_by(date = as.Date(time_hour) %>% as.numeric %/% 2)  %>%
  summarise(delay = sum(arr_delay, na.rm=TRUE)) %>%
  ungroup %>%
  mutate(date = as.Date(date * 2, origin = "1970-01-01"))
  arrange(desc(delay))
## A tibble: 183 x 2
#  date       delay
#   <date>     <dbl>
#  2013-07-22 94135
#  2013-06-30 87095
#  2013-06-24 84013
#  2013-12-09 71210
#  2013-05-23 69749
#  2013-07-10 65845
#  2013-06-26 65411
#  2013-03-08 63236
#  2013-03-18 62724
#  2013-12-05 60890
## ... with 173 more rows

48時間で一番遅れているのは2013-07-22から始まる48時間。
この日の天候データを見てみると、temp, dewp, humidが高い。
意外にwind_speedやprecipは普通。

5. What does anti_join(flights, airports, by = c("dest" = "faa")) tell you? What does anti_join(airports, flights, by = c("faa" = "dest")) tell you?

1つめ)目的地の空港についてairportsに情報が入っていないフライトだけを抽出する。
2つめ)期間中に一度もニューヨークからのフライトが無かった空港

6. You might expect that there’s an implicit relationship between plane and airline, because each plane is flown by a single airline. Confirm or reject this hypothesis using the tools you’ve leaned above.

flights %>%
  semi_join(
    flights %>%
      group_by(tailnum) %>%
      summarise(n = n_distinct(carrier)) %>%
      filter(n > 1, !is.na(tailnum))
    , by = "tailnum") %>%
  count(carrier, tailnum) %>%
  spread(carrier, n)
# A tibble: 17 x 5
#  tailnum  `9E`    DL    EV    FL
#  <chr>   <int> <int> <int> <int>
# N146PQ      8    NA    36    NA
# N153PQ      5    NA    26    NA
# N176PQ      7    NA    21    NA
# N181PQ      4    NA    35    NA
# N197PQ      2    NA    31    NA
# N200PQ      7    NA    28    NA
# N228PQ      8    NA    20    NA
# N232PQ      7    NA    35    NA
# N933AT     NA    25    NA    10
# N935AT     NA    77    NA    11
# N977AT     NA    62    NA    24
# N978AT     NA    43    NA    22
# N979AT     NA    39    NA    19
# N981AT     NA    22    NA    22
# N989AT     NA    42    NA    24
# N990AT     NA    57    NA    14
# N994AT     NA     9    NA    22

上記の17台の航空機を除けば1つのキャリアのみでフライトしている。

カテゴリー: R4DS

コメントを残す