神Hadley R for Data Science の例題たちとその解答を書き残します。
今回はChapter 5 Data transformationです。
Chapter 3はこちら
この章では下記のライブラリを使う。
library(dplyr) library(nycflights13)
Chapter 5
5.2 Filter rows with filter()
5.2.4. Exercises
1. Find all flights that
- Had an arrival delay of two or more hours
- Flew to Houston (
IAH
orHOU
)- Were operated by United, American, or Delta
- Departed in summer (July, August, and September)
- Arrived more than two hours late, but didn’t leave late
- Were delayed by at least an hour, but made up over 30 minutes in flight
- Departed between midnight and 6am (inclusive)
#1. filter(flights, arr_delay > 120) #2. filter(flights, dest %in% c("IAH", "HOU")) #3. filter(flights, carrier %in% c("UA", "DL")) #4. filter(flights, month %in% 7:9) #5. filter(flights, arr_delay > 120, dep_delay < 1) #6. filter(flights, arr_delay > 60, dep_delay > 30) #7. filter(flights, dep_time <= 600)
2. Another useful dplyr filtering helper is
between()
. What does it do? Can you use it to simplify the code needed to answer the previous challenges?
between(x, left, right)
でベクトルx
の元それぞれに対してleft
以上でかつright
以下であればTRUE
, そうでなければFALSE
を返す。
これを使って上の例題の4番を次のように書き換えることができる。
filter(flights, between(month, 7, 9))
3. How many flights have a missing
dep_time
? What other variables are missing? What might these rows represent?
8255行。dep_time
だけでなく、dep_delay, arr_time, arr_delay, air_time
もNA
になっている。
air_time
がNA
であることから、便が欠航になったと推測できる。
4. Why is
NA ^ 0
not missing? Why isNA | TRUE
not missing? Why isFALSE & NA
not missing? Can you figure out the general rule? (NA * 0
is a tricky counterexample!)
NA
にどのような値が入っても演算の結果が一定に定まるとき、結果はNAでなく値を返す。
ではNA * 0
は0
ではないかと思ってしまうのだが、NA
がInf
であったとき、Inf * 0 = NaN
となり0
でなくなる。
そのため値は一意に決定しないので計算結果はNA
となる。
5.3. Arrange rows with arrange()
?
5.3.1. Exercises
1. How could you use
arrange()
to sort all missing values to the start? (Hint: useis.na()
).
arrange(data, !is.na(variable1))
2. Sort
flights
to find the most delayed flights. Find the flights that left earliest.
arrange(flights, desc(arr_delay)) arrange(flights, dep_delay)
3. Sort
flights
to find the fastest flights.
arrange(flights, distance/air_time)
4. Which flights travelled the longest? Which travelled the shortest?
arrange(flights, desc(air_time)) arrange(flights, air_time)
5.4 Select column with select()
5.4.1. Exercises
1. Brainstorm as many ways as possible to select
dep_time
,dep_delay
,arr_time
, andarr_delay
fromflights
.
select(flights, starts_with("dep"), starts_with("arr")) select(flights, matches("^(dep|arr)")) select(flights, num_range("dep_", c("time", "delay")), num_range("arr_", c("time", "delay")))
など
2. What happens if you include the name of a variable multiple times in a
select()
call?
その変数が一度だけ現れるテーブルが返される。
つまり変数が二回selectされたからといって二回現れるわけではない。
3. What does the
one_of()
function do? Why might it be helpful in conjunction with this vector?vars <- c("year", "month", "day", "dep_delay", "arr_delay")
select(flights, one_of(vars))
で文字列から変数を指定できる。
ということだと思うが、次でも普通にselectできた。
select(flights, vars)
となるとone_of
関数の使いみちがよくわからない。はて。
4. Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?
select(flights, contains("TIME"))
文字列time
を含んだ列たち(dep_time
, sched_dep_time
,…)がselectされる。
次の式だとエラーを返す。
select(flights, TIME) select(flights, "TIME")
select helper関数たちは引数としてignore.case
を持ち、デフォルトはTRUE
になっている。
これをFALSE
にすれば大文字小文字を区別して変数名を指定できる。
select(flights, contains("TIME", ignore.case = FALSE)) ## A tibble: 336,776 x 0
5.5. Add new variables with mutate()
5.5.1. Exercises
1. Currently
dep_time
andsched_dep_time
are convenient to look at, but hard to compute with because they’re not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
flights2 <- mutate(flights, dep_time2 = dep_time %/% 100 * 60 + dep_time %% 100, sched_dep_time2 = sched_dep_time %/% 100 * 60 + sched_dep_time %% 100 )
2. Compare
air_time
witharr_time - dep_time
. What do you expect to see? What do you see? What do you need to do to fix it?
## A tibble: 6 x 5 # year month day air_time diff # <int> <int> <int> <dbl> <dbl> #1 2013 1 1 227 193 #2 2013 1 1 227 197 #3 2013 1 1 160 221 #4 2013 1 1 183 260 #5 2013 1 1 116 138 #6 2013 1 1 150 106
diff
はarr_time - dep_time
(in minutes)で計算した変数だが、air_time
と微妙にずれている。
dep_time, arr_time
は共に現地時刻なので飛行時間との差は時差が影響していると考えられる。
その場合arr_time
を現地時間ではなく東海岸時刻に調整すればよい。
3. Compare
dep_time
,sched_dep_time
, anddep_delay
. How would you expect those three numbers to be related?
flights3 <- mutate(flights2, cal_depdel = dep_time2 - sched_dep_tim2, dif_depdel = cal_depdel - dep_delay ) filter(flights3, dif_depdel != 0) ## A tibble: 1,207 x 5 # dep_time sched_dep_time dep_delay cal_depdel dif_depdel # <dbl> <dbl> <dbl> <dbl> <dbl> # 1 528 1115 853 -587 -1440 # 2 42 1439 43 -1397 -1440 # 3 86 1370 156 -1284 -1440 # 4 32 1439 33 -1407 -1440 # 5 50 1305 185 -1255 -1440 # 6 155 1439 156 -1284 -1440 # 7 25 1439 26 -1414 -1440 # 8 66 1365 141 -1299 -1440 # 9 14 1439 15 -1425 -1440 #10 37 1350 127 -1313 -1440
1440 = 60 * 24ということで、日付をまたいだ場合に計算した遅延時間と一日分のズレが発生して一致しなくなる。
4. Find the 10 most delayed flights using a ranking function. How do you want to handle ties? Carefully read the documentation for
min_rank()
.
flights %>% mutate(delay_rank = min_rank(desc(dep_delay))) %>% filter(delay_rank <= 10)
上記の例ではmin_rank
関数の仕様により10位タイが複数あった場合にはすべてが抽出される。
10位タイが複数あった場合に10位をすべて除くためにはmin_rank
を単にrank
とする。
5. What does
1:3 + 1:10
return? Why?
Rにおいて長さの異なるベクトルの四則演算は、長さが短いベクトルを繰り返して補完することになっている。
この例題の場合は短いのは1:3
の方なのでこれが繰り返されて長さ10のベクトルとなる。
その結果は下記に示す要素を持つベクトルになる。
- [1] 2 = 1 + 1
- [2] 4 = 2 + 2
- [3] 6 = 3 + 3
- [4] 5 = 1 + 4
- [5] 7 = 2 + 5
- [6] 9 = 3 + 6
- [7] 8 = 1 + 7
- [8] 10 = 2 + 8
- [9] 12 = 3 + 9
- [10] 11 = 1 + 10
6. What trigonometric functions does R provide?
- サイン
sin(x)
- コサイン
cos(x)
- タンジェント
tan(x)
- アークサイン
asin(x)
- アークコサイン
acos(x)
- アークタンジェント
atan(x)
- サイン(単位π)
sinpi(x)
- コサイン(単位π)
cospi(x)
- タンジェント(単位π)
tanpi(x)
sinpi, cospi, tanpi
はそれぞれsin(pi*x), cos(pi*x), tan(pi*x)
と等価な関数。
その他atan2(x, y)
という関数があるが使いみちは良くわからない。
5.6 Grouped summaries with summarise()
5.6.7. Exercises
1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights. Consider the following scenarios:
- A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.
- A flight is always 10 minutes late.
- A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.
- 99% of the time a flight is on time. 1% of the time it’s 2 hours late.
Which is more important: arrival delay or departure delay?
# フライトのグループ化 gf <- flights %>% filter(!is.na(dep_delay), !is.na(arr_delay)) %>% group_by(flight) # 1つめ gf %>% summarise(l15 = sum(arr_delay >= 15), e15 = sum(arr_delay <= -15), n = n()) %>% filter(l15/n >= 0.5, e15/n >= 0.5) gf %>% summarise(l15ratio = mean(arr_delay >= 15), e15ratio = mean(arr_delay <= -15)) %>% filter(l15ratio >= 0.5, e15ratio >= 0.5) gf %>% summarise(l15 = sum(arr_delay >= 15), e15 = sum(arr_delay <= -15), abs15 = sum(abs(arr_delay) >= 15), n = n()) %>% filter(abs15 == n, l15 == e15) gf %>% filter(all(abs(arr_delay)>=15)) %>% summarise(l15 = sum(arr_delay > 15), n = n()) %>% filter(l15 == n/2) gf %>% filter(all(abs(arr_delay)>=15)) %>% count(e15 = arr_delay > 15) %>% filter(n == sum(n)/2) %>% count() # 2つめ gf %>% summarise(nl10 = sum(arr_delay < 10)) %>% filter(nl10 == 0) gf %>% count(wt = arr_delay < 10) %>% filter(n == 0) gf %>% filter(all(arr_delay >= 10)) %>% summarise() gf %>% filter(!any(arr_delay < 10)) %>% count() gf %>% count(e10 = arr_delay >= 10) %>% filter(all(e10 == TRUE)) %>% count # 3つめ gf %>% filter(mean(arr_delay < -30) >= 0.5) %>% filter(mean(arr_delay > 30) >= 0.5) %>% summarise() gf %>% summarise(l30 = sum(arr_delay >= 30), e15 = sum(arr_delay <= -30), n = n()) %>% filter(l30/n >= 0.5, e30/n >= 0.5) gf %>% filter(all(abs(arr_delay) >= 30)) %>% count(delay = arr_delay < 0) %>% filter(n == sum(n)/2) %>% summarise gf %>% mutate(l30 = mean(arr_delay >= 30), e30 = mean(arr_delay <= -30)) %>% filter(l30 >= 0.5, e30 >= 0.5) %>% summarise gf %>% summarise(l30l = mean(arr_delay >= 30) >= 0.5, e30l = mean(arr_delay <= -30) >= 0.5) %>% filter(l30l, e30l) # 4つめ # 10分以上遅れないことを"on time"の定義とする。 # 下記の条件に当てはまるflightは存在しないような気がするが・・・ # 99%を95%などと条件を緩めれば見つかる。 gf %>% summarise(on_time_ratio = mean(arr_delay <= 10), max_delay = max(arr_delay)) %>% filter(on_time_ratio >= 0.99, max_delay >= 120) gf %>% filter(mean(arr_delay <= 10) >= 0.99) %>% filter(max(arr_delay) >= 120) %>% summarise() gf %>% filter(any(arr_delay >= 120)) %>% summarise(on_time_ratio = mean(arr_delay <= 10)) %>% filter(on_time_ratio >= 0.99) gf %>% filter(mean(arr_delay <= 10) >= 0.99) %>% count(wt = arr_delay >= 120) %>% filter(n > 0) gf %>% filter(!all(arr_delay < 120)) %>% group_by(flight, ontime = arr_delay <= 10) %>% summarise(n = n()) %>% filter(ontime == TRUE, n >= sum(n)*0.99)
データ分析とは関係無いが、出発時刻が遅れたとしても到着時刻に影響がなければ問題ないように思う。
2. Come up with another approach that will give you the same output as
not_cancelled %>% count(dest)
andnot_cancelled %>% count(tailnum, wt = distance)
(without usingcount()
).
not_cancelled %>% group_by(dest) %>% summarise(n = n()) not_cancelled %>% group_by(tailnum) %>% summarise(dist = sum(distance))
3. Our definition of cancelled flights (
is.na(dep_delay) | is.na(arr_delay)
) is slightly suboptimal. Why? Which is the most important column?
flights %>% count(is.na(dep_delay), is.na(arr_delay)) ## A tibble: 3 x 3 # `is.na(dep_delay)` `is.na(arr_delay)` n # <lgl> <lgl> <int> #1 FALSE FALSE 327346 #2 FALSE TRUE 1175 #3 TRUE TRUE 8255
dep_delay
はNA
ではないが、arr_delay
がNA
のフライトが存在する。
出発したはよいが引き返したのかもしれない。
何にせよ出発しなければ到着もしないので、フライトキャンセルの条件はia.na(arr_delay)
だけで良さそう。
4. Look at the number of cancelled flights per day. Is there a pattern? Is the proportion of cancelled flights related to the average delay?
flights %>% group_by(day) %>% summarise(cancelled.num = sum(is.na(arr_delay))) %>% ggplot(aes(factor(day), cancelled.num)) + geom_col() flights %>% group_by(day) %>% summarise(cancelled.ratio = mean(is.na(arr_delay))) %>% ggplot(aes(factor(day), cancelled.ratio)) + geom_col()
8日がダントツにdelayが多い。
4日、16日、20日が少ない、など4の倍数に何かヒミツがあるような気がしたり。
5. Which carrier has the worst delays? Challenge: can you disentangle the effects of bad airports vs. bad carriers? Why/why not? (Hint: think about
flights %>% group_by(carrier, dest) %>% summarise(n())
)
flights %>% filter(arr_delay == max(arr_delay, na.rm=TRUE)) ## A tibble: 1 x 19 # year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time # <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> #1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384HA JFK HNL 640 ## ... with 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
ホノルル行きは長距離なので遅れが大きくなるのは当然、つまりハワイアン航空が遅れやすい航空会社だと結論するのは時期尚早です。
遅延時間は航路に大きく左右されるので、航空会社が良いか悪いかを判断するためには航路の影響を除かなければなりません。
flights %>% group_by(carrier, dest) %>% summarise(delay = mean(arr_delay[arr_delay > 0], na.rm=TRUE), n = n()) %>% ggplot(aes(carrier, dest)) + geom_count(aes(size = n, color = delay), alpha = 0.9) + scale_size_continuous(range = c(3,10)) + scale_color_gradient(low = "#56B1F7", high = "red")

横軸が航空会社、縦軸が行き先で、色は赤が遅延(平均値)が大きいことを意味しています。灰色は遅延が無いです。
横向きに色を比較すれば同じ行き先について優秀な航空会社、悪い航空会社を見つけることができます。
こう見るとOO(SkyWest Airline)はORD, DTW行きで遅れがひどいことがわかります。
データ数は少ないですが少ないデータのうちに二つもひどい遅れを出している以上、『遅れる航空会社』のレッテルを貼られるのは避けられないでしょう。
データ数が少ないカテゴリを除くと景色が変わります。
flights %>% group_by(carrier, dest) %>% summarise(delay = mean(arr_delay[arr_delay > 0], na.rm=TRUE), n = n()) %>% filter(n > 2) %>% ggplot(aes(carrier, dest)) + geom_count(aes(size = n, color = delay), alpha = 0.9) + scale_size_continuous(range = c(3,10)) + scale_color_gradient(low = "#56B1F7", high = "red")

これを見ればEV(エクスプレスジェット航空)は大手航空会社ですが相対的に赤色が多く、遅延が大きい航空会社であることが見て取れます。
6. What does the
sort
argument tocount()
do. When might you use it?
sort = TRUE
で大きい順に並び替えをする。
5.7 Grouped mutate (and filters)
5.7.1 Exercises
1. Refer back to the lists of useful mutate and filtering functions. Describe how each operation changes when you combine it with grouping.
関数arrange
を使えばグループ化された内部のみで整列することができる。
2. Which plane (
tailnum
) has the worst on-time record?
flights %>% group_by(tailnum) %>% filter(arr_delay == max(arr_delay, na.rm=TRUE)) %>% arrange(desc(arr_delay)) ## A tibble: 2,210 x 19 ## Groups: tailnum [2,189] # year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time # <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl> # 1 2013 1 9 641 900 1301 1242 1530 1272 HA 51 N384HA JFK HNL 640 # 2 2013 7 22 845 1600 1005 1044 1815 989 MQ 3075 N665MQ JFK CVG 96 # 3 2013 4 10 1100 1900 960 1342 2211 931 DL 2391 N959DL JFK TPA 139 # 4 2013 3 17 2321 810 911 135 1020 915 DL 2119 N927DA LGA MSP 167 # 5 2013 7 22 2257 759 898 121 1026 895 DL 2047 N6716C LGA ATL 109 # 6 2013 12 5 756 1700 896 1058 2020 878 AA 172 N5DMAA EWR MIA 149 # 7 2013 12 19 734 1725 849 1046 2039 847 DL 1223 N375NC EWR SLC 290 # 8 2013 2 10 2243 830 853 100 1106 834 F9 835 N203FR LGA DEN 233 # 9 2013 4 19 617 1700 797 858 1955 783 AA 257 N3GJAA JFK LAS 313 #10 2013 2 19 2324 1016 788 114 1227 767 DL 2319 N324US LGA MSP 136 ## ... with 2,200 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
3. What time of day should you fly if you want to avoid delays as much as possible?
flights %>% group_by(day) %>% summarise(ave.delay = mean(arr_delay, na.rm=TRUE)) %>% arrange(desc(ave.delay)) ## A tibble: 31 x 2 # day ave.delay # <int> <dbl> # 1 8 19.07218 # 2 22 17.40492 # 3 23 16.74046 # 4 10 14.73601 # 5 12 11.13897 # 6 11 10.51807 # 7 24 10.47845 # 8 19 10.19267 # 9 18 10.10772 #10 17 10.07892 ## ... with 21 more rows
質問の意図がよく分からなかった。この節のテクニックをどう使えばよいのか分からない。
4. For each destination, compute the total minutes of delay. For each, flight, compute the proportion of the total delay for its destination.
flights %>% filter(arr_delay > 0) %>% group_by(dest) %>% mutate(total_delay = sum(arr_delay, na.rm = TRUE)) flights %>% filter(arr_delay > 0) %>% group_by(flight) %>% mutate(total_delay = sum(arr_delay, na.rm = TRUE)) %>% group_by(flight, dest) %>% summarise(sum(arr_delay)/mean(total_delay))
5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using
lag()
explore how the delay of a flight is related to the delay of the immediately preceding flight.
flights %>% arrange(time_hour) %>% group_by(origin) %>% mutate(lag_delay = lag(dep_delay)) %>% ggplot(aes(dep_delay, lag_delay)) + geom_point(aes(color = origin), alpha = 0.1) + geom_abline(intercept = 0, slope = 1)

6. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?
flights %>% group_by(dest) %>% mutate(min_air = min(air_time, na.rm = TRUE)) %>% mutate(diff = air_time - min_air) %>% arrange(desc(diff)) %>% select(flight, dep_time, arr_time, air_time, min_air, diff) #Adding missing grouping variables: `dest` ## A tibble: 336,776 x 7 ## Groups: dest [105] # dest flight dep_time arr_time air_time min_air diff # <chr> <int> <int> <int> <dbl> <dbl> <dbl> # 1 SFO 841 1727 2242 490 295 195 # 2 LAX 426 1812 2302 440 275 165 # 3 EGE 575 1806 2253 382 219 163 # 4 DEN 745 1513 1914 331 182 149 # 5 LAX 17 1814 2240 422 275 147 # 6 LAS 587 2142 143 399 256 143 # 7 SFO 434 1727 2206 438 295 143 # 8 SAN 89 1646 2107 413 279 134 # 9 HNL 15 1337 1937 695 562 133 #10 SFO 177 1746 2225 426 295 131 ## ... with 336,766 more rows
SFO行きは最速295分で到達している実績があるにもかかわらず、flight 841はair_time
が490分もかかっています。
最速との差分を見るとSFO(San Francisco International Airport)など長距離のフライトばかりが候補に上がってしまいますので、データの入力エラーを見つけるのには適していないです。
データの入力エラーを見つけるのであれば我々に最も馴染みのある統計値の1つである偏差値を使うのが良いと思います。
flights %>% group_by(dest) %>% mutate(min_air = mean_air = mean(air_time, na.rm = TRUE), sd_air = sd(air_time, na.rm = TRUE)) %>% mutate(hensachi = (air_time - mean_air) * 10 / sd_air + 50) %>% arrange(desc(hensachi)) %>% select(dest, flight, air_time, mean_air, sd_air, hensachi) ## A tibble: 336,776 x 6 ## Groups: dest [105] # dest flight air_time mean_air sd_air hensachi # <chr> <int> <dbl> <dbl> <dbl> <dbl> # 1 BOS 1703 112 38.95300 4.948552 197.6129 # 2 BOS 2136 107 38.95300 4.948552 187.5089 # 3 DCA 2175 131 45.85655 6.460277 181.7954 # 4 ACK 1491 141 42.06818 8.127495 171.7249 # 5 BOS 2480 99 38.95300 4.948552 171.3426 # 6 SYR 1516 97 43.03984 4.530559 169.1027 # 7 BOS 1750 96 38.95300 4.948552 165.2802 # 8 DTW 1131 170 84.83202 7.853928 158.4400 # 9 DTW 1026 170 84.83202 7.853928 158.4400 #10 RDU 1185 138 70.88533 6.201977 158.2150
BOS(Boston Logan International Airport)行き1703便は通常のBOS行きが平均38.9でいけているところを112分もかかっています。
偏差値にして197(!)ということで、私はこれが一番ひどい遅れであるように思います。
7. Find all destinations that are flown by at least two carriers. Use that information to rank the carriers.
flights %>% group_by(dest) %>% filter(n_distinct(carrier) > 1) %>% summarise(car_num = n_distinct(carrier)) ## A tibble: 76 x 2 # dest car_num # <chr> <int> # 1 ATL 7 # 2 AUS 6 # 3 AVL 2 # 4 BDL 2 # 5 BGR 2 # 6 BNA 5 # 7 BOS 7 # 8 BQN 2 # 9 BTV 3 #10 BUF 4
上記の条件を使って航空会社をランク付けせよという質問の意図がよく理解できなかったが、2つ以上が競合している空港にたくさん便を持っている航空会社が高順位だとしてみました。
flights %>% group_by(dest) %>% filter(n_distinct(carrier) > 1) %>% ungroup() %>% group_by(carrier) %>% summarise(dest_num = n_distinct(dest)) %>% arrange(desc(dest_num)) ## A tibble: 16 x 2 # carrier dest_num # <chr> <int> # 1 EV 51 # 2 9E 48 # 3 UA 42 # 4 DL 39 # 5 B6 35 # 6 AA 19 # 7 MQ 19 # 8 WN 10 # 9 OO 5 #10 US 5 #11 VX 4 #12 YV 3 #13 FL 2 #14 AS 1 #15 F9 1 #16 HA 1
8. For each plane, count the number of flights before the first delay of greater than 1 hour.
flights %>% mutate(delay_1hour = arr_delay > 60) %>% group_by(tailnum) %>% filter(any(delay_1hour)) %>% group_by(tailnum, delay_1hour) %>% mutate(first_delay = (time_hour == min(time_hour)) & delay_1hour) %>% group_by(tailnum) %>% mutate(first_delay_timehour = min(time_hour[first_delay], na.rm=TRUE)) %>% summarise(answer = sum(time_hour < first_delay_timehour)) ## A tibble: 3,371 x 2 # tailnum answer # <chr> <int> # 1 D942DN 0 # 2 N0EGMQ 0 # 3 N10156 9 # 4 N102UW 25 # 5 N104UW 3 # 6 N10575 0 # 7 N105UW 22 # 8 N107US 20 # 9 N108UW 36 #10 N109UW 28 ## ... with 3,361 more rows