Working through the R for Data Science exercises: Chapter 14 Strings

I'm writing down the exercises from the great Hadley's R for Data Science, along with my answers.
This time it's Chapter 14, Strings.

This chapter uses the stringr library in addition to the core tidyverse.

library(tidyverse)
library(stringr)

Chapter 14 Strings

14.2 String basics

14.2.5 Exercises

1. In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?

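By default paste() puts a single space between the strings, while paste0() puts nothing.
They correspond to str_c(sep = " ") and str_c(), respectively.
For NA, paste() treats it as the string "NA" and returns the pasted result, whereas str_c() returns NA.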
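A small sketch of the NA behaviour described above. The vector x and the result names are just mine for illustration.

```r
library(stringr)

x <- c("a", NA)

# paste() and paste0() turn NA into the literal string "NA"
paste_res  <- paste(x, "z")    # "a z" "NA z"
paste0_res <- paste0(x, "z")   # "az"  "NAz"

# str_c() propagates the missing value instead
str_c_res <- str_c(x, "z")     # "az"  NA

# str_replace_na() opts in to the paste-like behaviour
str_c_na <- str_c(str_replace_na(x), "z")  # "az" "NAz"
```

If you want the paste() behaviour from stringr, wrapping the input in str_replace_na() seems to be the way to go.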

2. In your own words, describe the difference between the sep and collapse arguments to str_c().

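Both specify something to put between the strings being joined.
sep is the string inserted between the arguments.
By default str_c() returns a vector, but if collapse is given, the elements of that vector are joined into one single string with the collapse string between them.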
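To make the difference concrete, here is a tiny example. The vectors x and y are made up for illustration.

```r
library(stringr)

x <- c("a", "b", "c")
y <- c("1", "2", "3")

# sep goes between the arguments, element by element: still a length-3 vector
by_sep <- str_c(x, y, sep = "-")                        # "a-1" "b-2" "c-3"

# collapse then joins the elements of that vector into one string
by_collapse <- str_c(x, y, sep = "-", collapse = ", ")  # "a-1, b-2, c-3"
```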

3. Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

words %>% str_sub(., (str_length(.) + 1) %/% 2, (str_length(.) + 1) %/% 2)

For an even number of characters, this returns the earlier of the two middle characters.

4. What does str_wrap() do? When might you want to use it?

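It inserts line breaks into English text using the Knuth-Plass algorithm.
The number of characters per line is set with the width option.
It might be useful when you want to fit text into a small text box.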
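A quick way to see it in action. The sample sentence and the width of 20 are arbitrary choices of mine.

```r
library(stringr)

txt <- "The quick brown fox jumps over the lazy dog and keeps running through the field."

# str_wrap() returns a single string with newline characters inserted
wrapped <- str_wrap(txt, width = 20)

# cat() shows the wrapped lines as they would appear in a narrow box
cat(wrapped, "\n")
```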

14.3 Matching patterns with regular expressions

14.3.1 Basic Matches

14.3.1.1 Exercises

1. Explain why each of these strings don’t match a \: “\”, “\\”, “\\\”.

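“\”: The backslash is R's escape character, so it escapes the closing quote and this is not a complete string.
“\\”: An escaped backslash, so the regular expression is a single backslash. In a regular expression the backslash is also an escape character, so on its own it is not a valid regular expression.
“\\\”: The first two backslashes form one literal backslash, and the third is left over as a lone escape character, so this is not a valid string.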
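The two layers of escaping are easier to see by running them. This is only a sketch, and the names x and pattern are mine.

```r
library(stringr)

x <- "a\\b"        # the three characters a, \, b
writeLines(x)      # prints a\b

# Four backslashes in R source = two in the string = regex \\ = one literal backslash
pattern <- "\\\\"
pattern_length <- nchar(pattern)      # 2
has_backslash  <- str_detect(x, pattern)  # TRUE
```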

2. How would you match the sequence "'\?

"\"\'\\\\"

3. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

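A literal ".", any single character, a literal ".", any single character, a literal ".", any single character (for example ".a.b.c"). As a string it is written "\\..\\..\\..".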
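A quick check of that pattern against a few test strings I made up:

```r
library(stringr)

# Regex \..\..\.. written as an R string
pattern <- "\\..\\..\\.."

candidates <- c(".a.b.c", "a.b.c.", "...", "......")
matches <- str_detect(candidates, pattern)  # TRUE FALSE FALSE TRUE
```

The last one matches because the unescaped . also accepts a literal dot.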

14.3.2 Anchors

14.3.2.1 Exercises

1. How would you match the literal string "$^$"?

str_view("$^$", "^\\$\\^\\$$")

2. Given the corpus of common words in stringr::words, create regular expressions that find all words that:

  1. Start with "y"
  2. End with "x"
  3. Are exactly three letters long.
  4. Have seven letters or more
#1
str_view(words, "^y", match = TRUE)

#2
str_view(words, "x$", match = TRUE)

#3
str_view(words, "^...$", match = TRUE)

#4
str_view(words, "^.......", match = TRUE)

14.3.3 Character classes and alternatives

14.3.3.1 Exercises

1. Create regular expressions to find all words that:

  1. Start with a vowel.
  2. That only contain consonants.
  3. End with ed, but not with eed.
  4. End with ing or ise.
#1
str_view(words, "^[aiueo]", match = TRUE)

#2
str_view(words, "^[^aiueo]*$", match = TRUE)
# * is only covered in the next section, so strictly speaking I shouldn't use it here, but I couldn't think of another way.

#3
str_view(words, "[^e]ed$", match = TRUE)

#4
str_view(words, "(ing|ise)$", match = TRUE)

2. Empirically verify the rule “i before e except after c”.

str_view(words, "[^c]ei", match = TRUE)

The word weigh turned up.

3. Is “q” always followed by a “u”?

str_view(words, "q[^u]", match = TRUE)

No counterexamples! In this word list, q is always followed by u.

4. Write a regular expression that matches a word if it’s probably written in British English, not American English.

str_view(words, "our", match = TRUE)

Something like this, maybe? Though four is a false positive.

5. Create a regular expression that will match telephone numbers as commonly written in your country.

\d\d\d-\d\d\d\d-\d\d\d\d

14.3.4 Repetition

14.3.4.1 Exercises

1. Describe the equivalents of ?, +, * in {m,n} form.

{0,1}, {1,}, and {0,}, respectively.
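One way to convince yourself is to compare the extracted matches for each pair. The test vector x is arbitrary.

```r
library(stringr)

x <- c("", "a", "aa", "aaa", "baa")

# Each shorthand should extract exactly what its {m,n} form extracts
same_optional <- identical(str_extract(x, "a?"), str_extract(x, "a{0,1}"))
same_plus     <- identical(str_extract(x, "a+"), str_extract(x, "a{1,}"))
same_star     <- identical(str_extract(x, "a*"), str_extract(x, "a{0,}"))
```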

2. Describe in words what these regular expressions match:

  1. ^.*$
  2. "\\{.+\\}"
  3. \d{4}-\d{2}-\d{2}
  4. "\\\\{4}"
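  1. Any string: zero or more of any character, from start to end.
  2. A pair of curly braces with at least one character inside, such as {a}.
  3. Four digits, a hyphen, two digits, a hyphen, two digits (nnnn-nn-nn).
  4. Four backslashes.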

3. Create regular expressions to find all words that:

  1. Start with three consonants.
  2. Have three or more vowels in a row.
  3. Have two or more vowel-consonant pairs in a row.
#1
"^[^aiueo]{3}"

#2
"[aiueo]{3,}"

#3
"([aiueo][^aiueo]){2,}"

4. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.

Feels hacker-ish and fun.

14.3.5 Grouping and backreferences

14.3.5.1 Exercises

1. Describe, in words, what these expressions will match:

  1. (.)\1\1
  2. "(.)(.)\\2\\1"
  3. (..)\1
  4. "(.).\\1.\\1"
  5. "(.)(.)(.).*\\3\\2\\1"
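  1. The same character three times in a row, like aaa.
  2. Two characters followed by the same two in reverse order, like abba.
  3. A pair of characters repeated, like abab.
  4. A character that appears three times with any one character in between, like abaca.
  5. Three characters, then anything, then the same three characters in reverse order, like abc...cba. paragraph is one example.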
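These readings can be spot-checked with str_detect(). The test strings are my own picks.

```r
library(stringr)

triple   <- str_detect(c("aaa", "aab"),   "(.)\\1\\1")      # TRUE FALSE
mirrored <- str_detect(c("abba", "abab"), "(.)(.)\\2\\1")   # TRUE FALSE
doubled  <- str_detect(c("abab", "abba"), "(..)\\1")        # TRUE FALSE
spaced   <- str_detect(c("abaca", "abacb"), "(.).\\1.\\1")  # TRUE FALSE

# p-a-r ... r-a-p inside "paragraph"
reversed <- str_detect("paragraph", "(.)(.)(.).*\\3\\2\\1") # TRUE
```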

2. Construct regular expressions to match words that:

  1. Start and end with the same character
  2. Contain a repeated pair of letters
  3. Contain one letter repeated in at least three places
#1
str_view(words, "^(.)(.*\\1)?$", match = TRUE)  # the optional group also lets one-letter words such as "a" match
#2
str_view(words, "(..).*\\1", match = TRUE)
#3
str_view(words, "(.).*\\1.*\\1", match = TRUE)

14.4 Tools

14.4.1 Detect matches

14.4.1.1 Exercises

1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect calls.

  1. Find all words that start or end with x.
  2. Find all words that start with a vowel and end with a consonant.
  3. Are there any words that contain at least one of each different vowel?
#1
words[str_detect(words, "^x") | str_detect(words, "x$")]
words[str_detect(words, "^x|x$")]
#2
words[str_detect(words, "^[aiueo]") & str_detect(words, "[^aiueo]$")]
words[str_detect(words, "^[aiueo].*[^aiueo]$")]
#3
# Multiple str_detect() calls: the word must contain every one of the five vowels
words[
  str_detect(words, "a") & str_detect(words, "i") & str_detect(words, "u") &
  str_detect(words, "e") & str_detect(words, "o")
  ]
# Single regular expression: one lookahead per vowel
words[str_detect(words, "^(?=.*a)(?=.*i)(?=.*u)(?=.*e)(?=.*o)")]
# Both return character(0): no word in the list contains all five vowels.

2. What word has the highest number of vowels? What word has the highest proportion of vowels?

words[str_count(words, "[aiueo]") == max(str_count(words, "[aiueo]"))]
#[1] "appropriate" "associate"   "available"   "colleague"   "encourage"   "experience"  "individual"  "television"

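Strictly speaking, the word a wins with 100%, but that's boring, so I'll look for the winner among the other words.

prop <- str_count(words, "[aiueo]") / str_length(words)
rest <- words[-1]  # drop "a" (the first word) so the index lines up with prop[-1]
rest[prop[-1] == max(prop[-1])]
#[1] "area" "idea"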

14.4.2 Extract matches

14.4.2.1 Exercises

1. In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.

To make sure the match is a whole word, put a space immediately before and after it.
The character after it may also be a comma or a period.

colour_match2 <- str_c(" (", colour_match, ")[ ,.]")
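One limitation of the space-based pattern is that it misses a colour at the very start of a sentence. As an alternative (not what I used above), the word-boundary anchor \b can do the same job. This is only a sketch: colours is assumed to be the same vector as in the book's example, and the sample sentences are made up.

```r
library(stringr)

# Assumed to match the colour vector used in the book's example
colours <- c("red", "orange", "yellow", "green", "blue", "purple")

# \b matches at a word boundary, so no surrounding characters are consumed
colour_match_b <- str_c("\\b(", str_c(colours, collapse = "|"), ")\\b")

x <- c("The lamp flickered.", "Red is a colour.", "It was painted red, then blue.")
found <- str_extract_all(x, regex(colour_match_b, ignore_case = TRUE))
# found[[1]] is empty, found[[2]] is "Red", found[[3]] is "red" "blue"
```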

2. From the Harvard sentence data, extract:

  1. The first word from each sentence.
  2. All words ending in ing.
  3. All plurals.
#1
str_extract(sentences, "^[^ ]+")
#2
str_extract_all(sentences, "[^ ]*ing[ ,.]", simplify=TRUE)
#3
str_extract_all(sentences, "[^ ]*[^'aisu ]s[ ,.]", simplify=TRUE) 

I really don't think it's possible to exclude third-person singular -s verbs or words like its, though!

14.4.3 Grouped matches

14.4.3.1 Exercises

1. Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.

numbers <- str_c(" ", c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"), " ", collapse="|")
str_match(sentences, str_c("(", numbers, ")([^ .]+)"))

2. Find all contractions. Separate out the pieces before and after the apostrophe.

str_match(sentences, "([^ ]+)(')([^ ]*)")

14.4.4 Replacing matches

14.4.4.1 Exercises

1. Replace all forward slashes in a string with backslashes.

str_replace_all(sentences, "/", "\\\\")

2. Implement a simple version of str_to_lower() using replace_all().

l <- letters
names(l) <- LETTERS
str_replace_all(sentences, l)

3. Switch the first and last letters in words. Which of those strings are still words?

inv <- str_replace(words, "(^.)(.*)(.$)", "\\3\\2\\1")
words[words %in% inv]

14.4.5 Splitting

14.4.5.1 Exercises

1. Split up a string like "apples, pears, and bananas" into individual components.

s <- "apples, pears, and bananas"
str_split(s, ", (and )?")  # the optional "and " keeps it out of the last piece

2. Why is it better to split up by boundary("word") than " "?

Splitting on " " leaves commas, periods, and other punctuation attached to the words.
boundary("word") does not.
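Side by side on a made-up sentence:

```r
library(stringr)

x <- "Hello, world. This is R!"

# Splitting on a space keeps the punctuation glued to the words
by_space <- str_split(x, " ")[[1]]
# "Hello," "world." "This" "is" "R!"

# boundary("word") splits at word boundaries and drops the punctuation
by_word <- str_split(x, boundary("word"))[[1]]
# "Hello" "world" "This" "is" "R"
```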

3. What does splitting with an empty string ("") do? Experiment, and then read the documentation.

str_split(sentences[1], "")
# [1] "T" "h" "e" " " "b" "i" "r" "c" "h" " " "c" "a" "n" "o" "e" " " "s" "l" "i" "d" " " "o" "n" " " "t" "h" "e" " " "s" "m" "o" "o" "t" "h" " " "p" "l" "a" "n" "k" "s" "."

The string is split into individual characters.

14.5 Other types of pattern

14.5.1 Exercises

1. How would you find all strings containing \ with regex() vs. with fixed()?

regex("\\\\")
fixed("\\")
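Plugged into str_detect(), both forms find the same strings. The test vector is mine.

```r
library(stringr)

x <- c("a\\b", "ab")   # only the first element contains a backslash

# regex(): the backslash must be escaped once more for the regex engine
with_regex <- str_detect(x, regex("\\\\"))  # TRUE FALSE

# fixed(): the pattern is compared literally, so one (R-escaped) backslash is enough
with_fixed <- str_detect(x, fixed("\\"))    # TRUE FALSE
```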

2. What are the five most common words in sentences?

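# unlist() instead of simplify = TRUE, so the empty padding cells of the matrix are not counted as a "word"
str_extract_all(sentences, boundary("word")) %>% unlist() %>% str_to_lower() %>% table() %>% sort() %>% tail(5)
# and   to   of    a  the
# 118  123  132  202  751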

14.7 stringi

14.7.1 Exercises

1. Find the stringi function that:

  1. Count the number of words.
  2. Find duplicated strings.
  3. Generate random text.
#1
stri_count_words()
#2
stri_duplicated()
#3
stri_rand_strings()

2. How do you control the language that stri_sort() uses for sorting?

Set it with the locale option.
The value has the form language_country.
For example, en_US.
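As a sketch of how the locale changes the result, here is the Polish versus Slovak comparison, where Slovak treats "ch" as a single letter that sorts after "h". The word pair is the one I remember from the stringi documentation, and the locale argument is passed through to the collator options.

```r
library(stringi)

x <- c("hladny", "chladny")

# Polish: plain letter-by-letter order, so "c" comes before "h"
polish <- stri_sort(x, locale = "pl_PL")  # "chladny" "hladny"

# Slovak: the digraph "ch" sorts after "h"
slovak <- stri_sort(x, locale = "sk_SK")  # "hladny" "chladny"
```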
