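This post records the exercises from the great Hadley's R for Data Science, together with my answers.
This time it is Chapter 14, Strings.
Previous posts
- Chapter 3 Data visualisation
- Chapter 5 Data transformation
- Chapter 7 Exploratory data analysis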
- Chapter 10 tibble
- Chapter 11 Data import
- Chapter 12 Tidy data
- Chapter 13 Relational data
In this chapter we use the stringr library in addition to the usual tidyverse.
library(tidyverse)
library(stringr)
Chapter 14 Strings
14.2 String basics
14.2.5 Exercises
1. In code that doesn’t use stringr, you’ll often see paste() and paste0(). What’s the difference between the two functions? What stringr function are they equivalent to? How do the functions differ in their handling of NA?
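paste() inserts a single space between the strings by default, whereas paste0() inserts nothing.
They correspond to str_c(sep = " ") and str_c() respectively.
With NA, paste() treats it as the string "NA" and returns the pasted result, whereas str_c() returns NA.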
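As a quick check of this answer, the following sketch compares the base R functions with str_c(). The input strings are made up purely for illustration.

```r
library(stringr)

# paste() separates with a space by default; paste0() does not.
paste("a", "b")             # "a b"
paste0("a", "b")            # "ab"

# The stringr counterparts.
str_c("a", "b", sep = " ")  # "a b"
str_c("a", "b")             # "ab"

# NA handling: paste() turns NA into the string "NA",
# while str_c() propagates the missing value.
paste("a", NA)              # "a NA"
str_c("a", NA)              # NA
```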
2. In your own words, describe the difference between the sep and collapse arguments to str_c().
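Both specify what goes between the strings being joined.
sep is the string placed between the arguments.
By default str_c() returns a vector, but when collapse is given, the elements of that vector are joined into a single string with the collapse string between them.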
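A minimal sketch of the difference, using two made-up vectors (x and y are names chosen only for this example):

```r
library(stringr)

x <- c("a", "b", "c")
y <- c("1", "2", "3")

# sep goes between the arguments, element by element: the result is a vector.
by_sep <- str_c(x, y, sep = "-")
by_sep    # "a-1" "b-2" "c-3"

# collapse then joins the elements of that vector into one string.
by_both <- str_c(x, y, sep = "-", collapse = "+")
by_both   # "a-1+b-2+c-3"
```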
3. Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?
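words %>% str_sub(., (str_length(.) + 1) %/% 2, (str_length(.) + 1) %/% 2)
When the number of characters is even, this returns the first of the two middle characters. Integer division (%/%) makes the rounding down explicit.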
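The same idea can be wrapped in a small helper. middle_char() is a hypothetical name introduced here, and the test strings are made up.

```r
library(stringr)

# Return the middle character; for even lengths, the first of the two
# middle characters (the position is rounded down by integer division).
middle_char <- function(x) {
  mid <- (str_length(x) + 1) %/% 2
  str_sub(x, mid, mid)
}

middle_char(c("abc", "abcd", "a"))  # "b" "b" "a"
```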
4. What does str_wrap() do? When might you want to use it?
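It inserts line breaks into English text using a method called the Knuth-Plass algorithm.
The number of characters per line is set with the width option.
It could be useful when you want to fit text into a small text box, for example.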
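A short sketch of str_wrap() in use. The sentence and the width of 30 are arbitrary choices for illustration.

```r
library(stringr)

text <- "The quick brown fox jumps over the lazy dog and keeps running through the field."

# Break the text into lines of roughly 30 characters.
wrapped <- str_wrap(text, width = 30)

# str_wrap() returns a single string with embedded newlines,
# so writeLines() shows the wrapped paragraph.
writeLines(wrapped)
```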
14.3 Matching patterns with regular expressions
14.3.1 Basic Matches
14.3.1.1 Exercises
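1. Explain why each of these strings don’t match a \: "\", "\\", "\\\".
"\": the backslash is the escape character in R strings, so it escapes the closing quote and this is not a valid string.
"\\": an escaped backslash, so the string is a single backslash, which is then used as the regular expression. In a regular expression the backslash is also an escape character, so on its own it is not a meaningful regular expression.
"\\\": after the escaped backslash, one escape character is left on its own, so this is not a valid string.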
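For contrast, here is a sketch of the form that does work: four backslashes in the R string, which become two in the regular expression, which match one literal backslash. The test string x is made up.

```r
library(stringr)

x <- "a\\b"             # the three-character string a\b
writeLines(x)           # prints a\b

# The R string "\\\\" holds the two characters \\, which as a regular
# expression means one literal backslash.
writeLines("\\\\")      # prints \\
has_backslash <- str_detect(x, "\\\\")
has_backslash           # TRUE
```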
2. How would you match the sequence "'\?
"\"\'\\\\"
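3. What patterns will the regular expression \..\..\.. match? How would you represent it as a string?
A literal ".", any single character, a literal ".", any single character, a literal ".", any single character (for example .a.b.c).
As a string it is written "\\..\\..\\..".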
14.3.2 Anchors
14.3.2.1 Exercises
1. How would you match the literal string "$^$"?
str_view("$^$", "^\\$\\^\\$$")
2. Given the corpus of common words in stringr::words, create regular expressions that find all words that:
- Start with "y"
- End with "x"
- Are exactly three letters long.
- Have seven letters or more
#1
str_view(words, "^y", match = TRUE)
#2
str_view(words, "x$", match = TRUE)
#3
str_view(words, "^...$", match = TRUE)
#4
str_view(words, "^.......", match = TRUE)
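The same four patterns can be tried on a small made-up vector with str_subset(), which returns the matching elements instead of rendering them. The vector x is only for illustration.

```r
library(stringr)

x <- c("yes", "box", "cat", "another", "yellowish")

starts_y   <- str_subset(x, "^y")        # "yes" "yellowish"
ends_x     <- str_subset(x, "x$")        # "box"
three_long <- str_subset(x, "^...$")     # "yes" "box" "cat"
seven_plus <- str_subset(x, "^.......")  # "another" "yellowish"
```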
14.3.3 Character classes and alternatives
14.3.3.1 Exercises
1. Create regular expressions to find all words that:
- Start with a vowel.
- That only contain consonants.
- End with ed, but not with eed.
- End with ing or ise.
#1
str_view(words, "^[aiueo]", match = TRUE)
#2
# * is only introduced in the next section, so strictly speaking it should not
# be used here, but I could not think of another way.
str_view(words, "^[^aiueo]*$", match = TRUE)
#3
str_view(words, "[^e]ed$", match = TRUE)
#4
str_view(words, "(ing|ise)$", match = TRUE)
2. Empirically verify the rule “i before e except after c”.
str_view(words, "[^c]ei", match = TRUE)
The word weigh turned up.
3. Is “q” always followed by a “u”?
str_view(words, "q[^u]", match = TRUE)
None!
4. Write a regular expression that matches a word if it’s probably written in British English, not American English.
str_view(words, "our", match = TRUE)
Something like this? Although four is not a British spelling.
5. Create a regular expression that will match telephone numbers as commonly written in your country.
\d\d\d-\d\d\d\d-\d\d\d\d
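Written as an R string, the backslashes have to be doubled. The sketch below also uses {n} repetition and the ^ and $ anchors, which are my additions to the pattern above, and the phone numbers are made up.

```r
library(stringr)

# Three digits, four digits, four digits, separated by hyphens.
phone_pattern <- "^\\d{3}-\\d{4}-\\d{4}$"

candidates <- c("090-1234-5678", "03-1234-5678", "09012345678")
is_phone <- str_detect(candidates, phone_pattern)
is_phone   # TRUE FALSE FALSE
```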
14.3.4 Repetition
14.3.4.1 Exercises
1. Describe the equivalents of ?, +, * in {m,n} form.
Respectively {0,1}, {1,}, {0,}.
2. Describe in words what these regular expressions match:
^.*$
"\\{.+\\}"
\d{4}-\d{2}-\d{2}
"\\\\{4}"
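- Any string (zero or more of any character)
- One or more characters enclosed in { }
- Digits in the form nnnn-nn-nn (four digits, two digits, two digits)
- Four backslashes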
3. Create regular expressions to find all words that:
- Start with three consonants.
- Have three or more vowels in a row.
- Have two or more vowel-consonant pairs in a row.
#1
"^[^aiueo]{3}"
#2
"[aiueo]{3,}"
#3
"([aiueo][^aiueo]){2,}"
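"In a row" means the vowels (or the vowel-consonant pairs) must be adjacent, which rules out putting .* between them. A sketch on a made-up vector, using {n,} repetition for the adjacency requirement:

```r
library(stringr)

x <- c("beautiful", "banana", "street", "apple")

# Starts with three consonants.
three_cons <- str_subset(x, "^[^aiueo]{3}")
three_cons    # "street"

# Three or more vowels directly next to each other.
three_vowels <- str_subset(x, "[aiueo]{3,}")
three_vowels  # "beautiful"

# Two or more vowel-consonant pairs directly next to each other.
two_pairs <- str_subset(x, "([aiueo][^aiueo]){2,}")
two_pairs     # "beautiful" "banana"
```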
4. Solve the beginner regexp crosswords at https://regexcrossword.com/challenges/beginner.
It feels hacker-ish and fun.
14.3.5 Grouping and backreferences
14.3.5.1 Exercises
1. Describe, in words, what these expressions will match:
(.)\1\1
"(.)(.)\\2\\1"
(..)\1
"(.).\\1.\\1"
"(.)(.)(.).*\\3\\2\\1"
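- The same character three times in a row, e.g. aaa
- Two characters followed by the same two in reverse order, e.g. abba
- A pair of characters repeated, e.g. abab
- A character, any character, the first character again, any character, and the first character again, e.g. abaca
- Three characters, then anything, then the same three characters in reverse order, e.g. abc...cba, as in paragraph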
2. Construct regular expressions to match words that:
- Start and end with the same character
- Contain a repeated pair of letters
- Contain one letter repeated in at least three places
#1
str_view(words, "^(.).*\\1$", match = TRUE)
#2
str_view(words, "(..).*\\1", match = TRUE)
#3
str_view(words, "(.).*\\1.*\\1", match = TRUE)
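To see what each backreference pattern picks out, here is a sketch on a made-up vector in which each word satisfies exactly one of the three conditions.

```r
library(stringr)

x <- c("area", "church", "eleven", "dog")

# Starts and ends with the same character.
same_ends <- str_subset(x, "^(.).*\\1$")
same_ends      # "area"

# Contains a repeated pair of letters ("ch" ... "ch").
repeated_pair <- str_subset(x, "(..).*\\1")
repeated_pair  # "church"

# Contains one letter in at least three places ("e" in eleven).
three_times <- str_subset(x, "(.).*\\1.*\\1")
three_times    # "eleven"
```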
14.4 Tools
14.4.1 Detect matches
14.4.1.1 Exercises
1. For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect calls.
- Find all words that start or end with x.
- Find all words that start with a vowel and end with a consonant.
- Are there any words that contain at least one of each different vowel?
#1
words[str_detect(words, "^x") | str_detect(words, "x$")]
words[str_detect(words, "^x|x$")]
#2
words[str_detect(words, "^[aiueo]") & str_detect(words, "[^aiueo]$")]
words[str_detect(words, "^[aiueo].*[^aiueo]$")]
#3
# "At least one of each different vowel" means all five vowels must appear.
words[str_detect(words, "a") & str_detect(words, "e") & str_detect(words, "i") & str_detect(words, "o") & str_detect(words, "u")]
# As a single regular expression, using lookaheads (not covered in this chapter).
words[str_detect(words, "^(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)")]
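The last challenge asks for words containing every vowel, so the conditions have to be combined with "and". A sketch on a made-up vector, checking it both ways. The lookahead syntax (?=...) is not covered in this chapter; it is used here on the understanding that the ICU engine behind stringr supports it.

```r
library(stringr)

x <- c("education", "sequoia", "banana", "sky")

# Several str_detect() calls combined with &.
by_detect <- x[str_detect(x, "a") & str_detect(x, "e") & str_detect(x, "i") &
                 str_detect(x, "o") & str_detect(x, "u")]

# One regular expression: each lookahead requires one vowel somewhere.
by_regex <- str_subset(x, "^(?=.*a)(?=.*e)(?=.*i)(?=.*o)(?=.*u)")

by_detect  # "education" "sequoia"
by_regex   # "education" "sequoia"
```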
2. What word has the highest number of vowels? What word has the highest proportion of vowels?
words[str_count(words, "[aiueo]") == max(str_count(words, "[aiueo]"))]
#[1] "appropriate" "associate" "available" "colleague" "encourage" "experience" "individual" "television"
Strictly speaking, the word a comes first at 100%, but that is boring, so I look for the top word among the rest.
prop <- str_count(words, "[aiueo]")/str_count(words, "[^ ]")
# Drop the first word ("a") from both words and prop so the indices stay aligned.
words[-1][prop[-1] == max(prop[-1])]
#[1] "area" "idea"
14.4.2 Extract matches
14.4.2.1 Exercises
1. In the previous example, you might have noticed that the regular expression matched “flickered”, which is not a colour. Modify the regex to fix the problem.
To make sure the match is a whole word, require a space immediately before and after it.
A comma or a full stop is also allowed after it.
colour_match2 <- str_c(" (", colour_match, ")[ ,.]")
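An alternative sketch uses the word boundary \b instead of literal spaces, which also covers a colour at the very start or end of a sentence. The colour list below is re-created here so the block runs on its own, and the test sentences are made up.

```r
library(stringr)

colours <- c("red", "orange", "yellow", "green", "blue", "purple")
colour_match <- str_c(colours, collapse = "|")

# Wrap the alternatives in word boundaries so "red" inside "flickered"
# or "covered" no longer counts.
colour_match_b <- str_c("\\b(", colour_match, ")\\b")

test_sentences <- c("The lamp flickered twice.",
                    "Paint covered the blue door.")

plain    <- str_detect(test_sentences, colour_match)    # TRUE TRUE
bounded  <- str_detect(test_sentences, colour_match_b)  # FALSE TRUE
found    <- str_extract(test_sentences[2], colour_match_b)  # "blue"
```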
2. From the Harvard sentence data, extract:
- The first word from each sentence.
- All words ending in ing.
- All plurals.
#1
str_extract(sentences, "^[^ ]+")
#2
str_extract_all(sentences, "[^ ]*ing[ ,.]", simplify=TRUE)
#3
str_extract_all(sentences, "[^ ]*[^'aisu ]s[ ,.]", simplify=TRUE)
I do not think it is possible to rule out the third-person singular -s, or words like its, though!
14.4.3 Grouped matches
14.4.3.1 Exercises
1. Find all words that come after a “number” like “one”, “two”, “three” etc. Pull out both the number and the word.
numbers <- str_c(" ", c("one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"), " ", collapse="|")
str_match(sentences, str_c("(", numbers, ")([^ .]+)"))
2. Find all contractions. Separate out the pieces before and after the apostrophe.
str_match(sentences, "([^ ]+)(')([^ ]*)")
14.4.4 Replacing matches
14.4.4.1 Exercises
1. Replace all forward slashes in a string with backslashes.
str_replace_all(sentences, "/", "\\\\")
2. Implement a simple version of str_to_lower() using replace_all().
l <- letters
names(l) <- LETTERS
str_replace_all(sentences, l)
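This works because str_replace_all() accepts a named vector, where the names are the patterns and the values are the replacements. A minimal sketch on a made-up string (to_lower_map is a name chosen for this example):

```r
library(stringr)

# Names are the patterns (A..Z), values are the replacements (a..z).
to_lower_map <- setNames(letters, LETTERS)

lowered <- str_replace_all("Hello World", to_lower_map)
lowered   # "hello world"
```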
3. Switch the first and last letters in words. Which of those strings are still words?
inv <- str_replace(words, "(^.)(.*)(.$)", "\\3\\2\\1")
words[words %in% inv]
14.4.5 Splitting
14.4.5.1 Exercises
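1. Split up a string like "apples, pears, and bananas" into individual components.
s <- "apples, pears, and bananas"
# The longer alternative must come first; otherwise ", " matches and "and " is left attached to "bananas".
str_split(s, "(, and |, )")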
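The order of the alternatives matters because the regex engine tries them from left to right and takes the first one that matches. The following sketch shows both orders on the same string.

```r
library(stringr)

s <- "apples, pears, and bananas"

# ", " is tried first and matches, so "and " stays attached to the last piece.
short_first <- str_split(s, ", |, and ")[[1]]
short_first   # "apples" "pears" "and bananas"

# Trying ", and " first removes the conjunction as part of the separator.
long_first <- str_split(s, ", and |, ")[[1]]
long_first    # "apples" "pears" "bananas"
```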
2. Why is it better to split up by boundary("word") than " "?
Splitting on " " leaves commas, full stops and other punctuation attached to the pieces.
With boundary("word") that does not happen.
3. What does splitting with an empty string ("") do? Experiment, and then read the documentation.
str_split(sentences[1], "")
# [1] "T" "h" "e" " " "b" "i" "r" "c" "h" " " "c" "a" "n" "o" "e" " " "s" "l" "i" "d" " " "o" "n" " " "t" "h" "e" " " "s" "m" "o" "o" "t" "h" " " "p" "l" "a" "n" "k" "s" "."
The string is split into individual characters.
14.5 Other types of pattern
14.5.1 Exercises
1. How would you find all strings containing \ with regex() vs. with fixed()?
regex("\\\\")
fixed("\\")
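2. What are the five most common words in sentences?
str_extract_all(sentences, boundary("word"), simplify=TRUE) %>% str_to_lower %>% table %>% sort %>% tail
#  and   to   of    a  the
#  118  123  132  202  751 2892
# The unnamed last count (2892) is the empty strings that simplify=TRUE pads the matrix with,
# so the five most common words are the, a, of, to and and.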
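A variant sketch that avoids empty padding strings by flattening the list result with unlist() instead of using simplify = TRUE. It runs on two made-up sentences rather than the full sentences data, and all_words and counts are names chosen for this example.

```r
library(stringr)

x <- c("The cat saw the dog.", "The dog saw a cat.")

# Extract words from every sentence, flatten into one vector, lower-case them.
all_words <- str_to_lower(unlist(str_extract_all(x, boundary("word"))))

# Count occurrences and sort from most to least frequent.
counts <- sort(table(all_words), decreasing = TRUE)
head(counts, 5)
```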
14.7 stringi
14.7.1 Exercises
1. Find the stringi function that:
- Count the number of words.
- Find duplicated strings.
- Generate random text.
#1
stri_count_words()
#2
stri_duplicated()
#3
stri_rand_strings()
2. How do you control the language that stri_sort() uses for sorting?
Set it with the locale option.
The value is written as language_country, for example en_US.
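A brief sketch of the stringi functions named in this section, on made-up inputs. The locale "en_US" is just an example value.

```r
library(stringi)

# Count the words in a sentence.
n_words <- stri_count_words("The birch canoe slid on the smooth planks.")
n_words   # 8

# Flag elements that repeat an earlier element.
dups <- stri_duplicated(c("a", "b", "a"))
dups      # FALSE FALSE TRUE

# Sort using the collation rules of a given locale.
sorted <- stri_sort(c("banana", "apple", "cherry"), locale = "en_US")
sorted    # "apple" "banana" "cherry"
```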