M08: Advanced R Data Types and Structures

Author

Jae Jung

Published

March 27, 2025

1 Topics

  1. tibble
  2. Strings with stringr
  3. Factors with forcats
  4. Dates and Times with lubridate

2 Packages

```{r}
library(tidyverse)
library(lubridate)
library(nycflights13) # date-time data
```

3 Tibbles

  • Tibbles are “opinionated data frames that make working in the tidyverse a little easier.”

  • If you’re already familiar with data.frame(), note that tibble() does much less:

    • it never changes the type of the inputs (e.g. it never converts strings to factors!),
    • it never changes the names of variables, and
    • it never creates row names.
  • See vignette("tibble") for details.
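A quick contrast of the two constructors (a minimal sketch; note that since R 4.0, data.frame() no longer converts strings to factors by default, so name repair is the easiest difference to see):

```{r}
# data.frame() repairs non-syntactic names; tibble() leaves them alone
names(data.frame(`my col` = 1)) # "my.col" -- dots substituted by check.names
names(tibble(`my col` = 1))     # "my col" -- kept as-is
```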

```{r}
library(tidyverse)
```

3.1 How to create a tibble?

3.1.1 Using tibble() function

```{r}
# tibble() is from tibble package, which is part of tidyverse
tibble(
  x = 1:5, 
  y = 1, 
  z = x ^ 2 + y
)
```
# A tibble: 5 × 3
      x     y     z
  <int> <dbl> <dbl>
1     1     1     2
2     2     1     5
3     3     1    10
4     4     1    17
5     5     1    26
Important
  • It’s possible for a tibble to have column names that are not valid R variable names, aka non-syntactic names.
  • To refer to these variables, you need to surround them with backticks (`)
```{r}
tb <- tibble(
  `:)` = "smile", 
  ` ` = "space",
  `2000` = "number"
)
tb
```
# A tibble: 1 × 3
  `:)`  ` `   `2000`
  <chr> <chr> <chr> 
1 smile space number
  • You’ll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.

3.1.2 Using tribble(), short for transposed tibble.

  • tribble() is customised for data entry in code:
    • column headings are defined by formulas (i.e. they start with ~), and
    • entries are separated by commas.
  • This makes it possible to lay out small amounts of data in an easy-to-read form.
```{r}
tribble(
  ~x, ~y, ~z,
  #--|--|----   # comment to show where the header is
  "a", 2, 3.6,
  "b", 1, 8.5
)
```
# A tibble: 2 × 3
  x         y     z
  <chr> <dbl> <dbl>
1 a         2   3.6
2 b         1   8.5

3.2 Tibbles vs. data.frame

  • There are two main differences in the usage of a tibble vs. a classic data.frame: (1) printing and (2) subsetting.

3.2.1 Printing

  • A tibble shows only the first 10 rows, and all the columns that fit on screen.
  • In addition to its name, each column reports its type, a nice feature borrowed from str():
```{r}
tibble(
  a = lubridate::now() + runif(1e3) * 86400,
  b = lubridate::today() + runif(1e3) * 30,
  c = 1:1e3,
  d = runif(1e3),
  e = sample(letters, 1e3, replace = TRUE)
)
```
# A tibble: 1,000 × 5
   a                   b              c      d e    
   <dttm>              <date>     <int>  <dbl> <chr>
 1 2025-03-28 12:06:30 2025-04-01     1 0.779  g    
 2 2025-03-27 14:13:23 2025-04-04     2 0.256  e    
 3 2025-03-27 23:33:12 2025-04-23     3 0.167  s    
 4 2025-03-28 08:26:31 2025-04-08     4 0.0975 y    
 5 2025-03-28 04:46:30 2025-03-29     5 0.548  k    
 6 2025-03-27 22:06:59 2025-04-24     6 0.479  e    
 7 2025-03-27 14:55:36 2025-04-23     7 0.642  q    
 8 2025-03-27 15:26:08 2025-04-17     8 0.718  k    
 9 2025-03-28 06:49:33 2025-04-12     9 0.947  x    
10 2025-03-28 04:43:40 2025-04-15    10 0.353  y    
# ℹ 990 more rows
Tip

runif(n, min = 0, max = 1) generates n random deviates from the uniform distribution on [min, max]

  • Printing customization

    • First, you can explicitly print() the data frame and control the number of rows (n) and the width of the display.
    • width = Inf will display all columns:
```{r}
nycflights13::flights %>% 
  print(n = 10, width = Inf)

# using view()
nycflights13::flights %>% 
  View()
```
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
   arr_delay carrier flight tailnum origin dest  air_time distance  hour minute
       <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>    <dbl> <dbl>  <dbl>
 1        11 UA        1545 N14228  EWR    IAH        227     1400     5     15
 2        20 UA        1714 N24211  LGA    IAH        227     1416     5     29
 3        33 AA        1141 N619AA  JFK    MIA        160     1089     5     40
 4       -18 B6         725 N804JB  JFK    BQN        183     1576     5     45
 5       -25 DL         461 N668DN  LGA    ATL        116      762     6      0
 6        12 UA        1696 N39463  EWR    ORD        150      719     5     58
 7        19 B6         507 N516JB  EWR    FLL        158     1065     6      0
 8       -14 EV        5708 N829AS  LGA    IAD         53      229     6      0
 9        -8 B6          79 N593JB  JFK    MCO        140      944     6      0
10         8 AA         301 N3ALAA  LGA    ORD        138      733     6      0
   time_hour          
   <dttm>             
 1 2013-01-01 05:00:00
 2 2013-01-01 05:00:00
 3 2013-01-01 05:00:00
 4 2013-01-01 05:00:00
 5 2013-01-01 06:00:00
 6 2013-01-01 05:00:00
 7 2013-01-01 06:00:00
 8 2013-01-01 06:00:00
 9 2013-01-01 06:00:00
10 2013-01-01 06:00:00
# ℹ 336,766 more rows
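The defaults can also be changed for a whole session. A minimal sketch, assuming the tibble.print_max, tibble.print_min, and tibble.width options documented by the tibble package (newer versions route these through the pillar package, e.g. pillar.print_max):

```{r}
# Show up to 15 rows by default when printing a tibble
options(tibble.print_max = 15, tibble.print_min = 15)

# Always print all columns, regardless of screen width
options(tibble.width = Inf)
```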

3.2.2 Subsetting

```{r}
df <- tibble(
  x = runif(5),
  y = rnorm(5)
)
df

# Extract by name
df$x

df[["x"]]

# Extract by position
df[[1]]

# To use these in a pipe, you'll need to use the special placeholder .:
df %>% .$x

df %>% .[["x"]]
```
# A tibble: 5 × 2
      x        y
  <dbl>    <dbl>
1 0.429 -1.26   
2 0.471 -0.320  
3 0.384  0.00297
4 0.828 -0.0364 
5 0.437 -0.162  
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
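Another subsetting difference worth knowing: single-bracket subsetting on a tibble always returns a tibble, whereas a data.frame drops a single column to a bare vector (df2 and tb2 below are illustrative objects, not from the code above):

```{r}
df2 <- data.frame(x = runif(5))
class(df2[, "x"]) # "numeric" -- the data.frame drops to a vector

tb2 <- tibble(x = runif(5))
class(tb2[, "x"]) # still a tibble
```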

3.2.2.1 Native operator

Caution

The native pipe |> has no . placeholder for the left-hand side, so df |> .$x does not work. Pipe into an anonymous function instead:

```{r}
df |> (\(x) x$x)()
df |> (\(x) x[["x"]])()
```
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
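Recent versions of R also provide a placeholder for the native pipe (version-dependent: _ arrived in R 4.2, and extraction forms like _$x in R 4.3):

```{r}
# Requires R >= 4.3 for $ and [[ extraction with the _ placeholder
df |> _$x
df |> _[["x"]]
```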

cf. Subsetting with dplyr

```{r}
df |> pull(x)
```
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805

3.3 Interacting with older code

  • Some older functions don’t work with tibbles.
  • If you encounter one of these functions, use as.data.frame() to turn a tibble back to a data.frame:
```{r}
class(as.data.frame(tb))
```
[1] "data.frame"

4 Strings (with stringr package) and regex

4.1 Introduction

  • The focus of this chapter will be on regular expressions, or regex for short.
  • Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regex are a concise language for describing patterns in strings.
```{r}
library(tidyverse)
```

4.2 String basics

  • You can create strings with either single quotes or double quotes.
  • Unlike some other languages, there is no difference in behavior between the two in R.
  • Use ", unless you want to create a string that contains multiple ".
```{r}
string1 <- "This is a string"
string2 <- 'If I want to include a "quote" inside a string, I use single quotes'

# If you forget to close a quote, you'll see +, the continuation character: 
# If this happens to you, press Escape and try again!
```

4.2.1 Use \ to “escape”

  • To include a literal single or double quote in a string you can use \ to “escape” it:
```{r}
double_quote <- "\"" # or '"'
double_quote # the printed representation shows the escapes.
cat(double_quote) # cat() prints the string's contents, without the escapes.

"\""
cat("\"")

'"'
cat('"')

single_quote <- '\'' # or "'"
single_quote # the printed representation shows the escapes.
cat(single_quote)

"'"
cat("'")

'\''
cat('\'') # the same as above.

"\\"
cat("\\")


x <- c("\"", "\\")
x # the printed representation shows the escapes as they are.

cat(x) 

# cf. 
writeLines(x) # To see the raw contents of the string in different lines
```
[1] "\""
"
[1] "\""
"
[1] "\""
"
[1] "'"
'
[1] "'"
'
[1] "'"
'
[1] "\\"
\
[1] "\"" "\\"
" \
"
\

4.2.2 Special characters

```{r}
# ?"'"
```

Quotes

  • Three types of quotes are part of the syntax of R: single and double quotation marks and the backtick (or back quote, `).
  • In addition, backslash is used to escape the following character inside character constants.
  • Single and double quotes delimit character constants. They can be used interchangeably but double quotes are preferred (and character constants are printed using double quotes),
  • so single quotes are normally only used to delimit character constants containing double quotes.
  • Backslash is used to start an escape sequence inside character constants. Escaping a character not in the following table is an error.
  • Single quotes need to be escaped by backslash in single-quoted strings, and double quotes in double-quoted strings.
`\n`    newline (aka 'line feed')
`\r`    carriage return
`\t`    tab
`\b`  backspace
`\a`  alert (bell)
`\f`  form feed
`\v`  vertical tab
`\\`  backslash '\'
`\'`  ASCII apostrophe "'"
`\"`  ASCII quotation mark '"'
`\``  ASCII grave accent (backtick) '`'
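A quick demonstration of the most common escape sequences (s below is an illustrative string): the printed representation shows them escaped, while cat() renders them.

```{r}
s <- "col1\tcol2\nrow \"quoted\""
s      # printed representation shows \t, \n, \" escaped
cat(s) # \t becomes a tab, \n a newline, \" a literal quote
```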

4.2.3 String length

  • Base R contains many functions to work with strings but we’ll avoid them because they can be inconsistent, which makes them hard to remember.
  • Instead, we’ll use functions from stringr. These have more intuitive names, and all start with str_.
```{r}
str_length(c("a", "R for data science", NA)) # number of characters in a string
```
[1]  1 18 NA

4.2.4 Combining strings

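Use str_c() to combine two or more strings; the sep argument controls what separates the pieces. A minimal sketch of its basic use, before the special cases below:

```{r}
str_c("x", "y")             # "xy"
str_c("x", "y", sep = ", ") # "x, y"

# str_c() is vectorised: shorter vectors are recycled
str_c("prefix-", c("a", "b", "c"))
```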
4.2.4.1 str_replace_na()

As in most other R functions, missing values are contagious in str_c(). If you want them to print as “NA”, use str_replace_na():

```{r}
x <- c("abc", NA)
x
str_c("|-", x, "-|")
str_c("|-", str_replace_na(x), "-|")
```
[1] "abc" NA   
[1] "|-abc-|" NA       
[1] "|-abc-|" "|-NA-|" 

4.2.4.2 Objects of length 0

Objects of length 0 are silently dropped. This is particularly useful in conjunction with if:

```{r}
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE

str_length(name)
str_length(birthday)

str_c(
  "Good ", time_of_day, " ", name,
  if (birthday) " and HAPPY BIRTHDAY",
  "."
)
```
[1] 6
[1] 5
[1] "Good morning Hadley."

4.2.4.3 collapse

To collapse a vector of strings into a single string, use collapse

```{r}
str_c(c("x", "y", "z"))
str_c(c("x", "y", "z"), collapse = ", ")
str_c(c("x", "y", "z"), collapse = ",")
str_c(c("x", "y", "z"), collapse = "")
str_c(c("x", "y", "z"), collapse = " ")
```
[1] "x" "y" "z"
[1] "x, y, z"
[1] "x,y,z"
[1] "xyz"
[1] "x y z"

4.2.5 Subsetting strings

4.2.5.1 str_sub()

str_sub(string, start = 1L, end = -1L)

```{r}
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3) # argument: start, end

# negative numbers count backwards from end
str_sub(x, end = -1)
str_sub(x, 1, -1)
str_sub(x, -3, -1)
```
[1] "App" "Ban" "Pea"
[1] "Apple"  "Banana" "Pear"  
[1] "Apple"  "Banana" "Pear"  
[1] "ple" "ana" "ear"
Tip

str_sub() won’t fail if the string is too short: it will just return as much as possible:

```{r}
str_sub("a", 1, 5)
```
[1] "a"

You can also use the assignment form of str_sub() to modify strings

```{r}
str_sub(x, 1, 1)
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
```
[1] "A" "B" "P"
[1] "apple"  "banana" "pear"  

4.2.6 Locales

  • str_to_upper()
  • str_to_title()
  • str_to_lower()
```{r}
# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "i"))

str_to_upper(c("i", "i"), locale = "tr") 
```
[1] "I" "I"
[1] "İ" "İ"
Note
  • The base R order() and sort() functions sort strings using the current locale.
  • If you want robust behaviour across different computers, you may want to use str_sort() and str_order() which take an additional locale argument:
```{r}
x <- c("apple", "eggplant", "banana")

str_sort(x, locale = "en")  # English
str_sort(x, locale = "haw") # Hawaiian
```
[1] "apple"    "banana"   "eggplant"
[1] "apple"    "eggplant" "banana"  

4.2.7 str_wrap()

str_wrap(string, width = 80, indent = 0, exdent = 0)

```{r}
thanks_path <- file.path(R.home("doc"), "THANKS")
thanks <- str_c(readLines(thanks_path), collapse = "\n")
thanks <- word(thanks, 1, 3, fixed("\n\n"))
cat(str_wrap(thanks), "\n")
cat(str_wrap(thanks, width = 40), "\n")
cat(str_wrap(thanks, width = 60, indent = 2), "\n")
cat(str_wrap(thanks, width = 60, exdent = 2), "\n")
cat(str_wrap(thanks, width = 0, exdent = 2), "\n")
```
R would not be what it is today without the invaluable help of these people
outside of the (former and current) R Core team, who contributed by donating
code, bug fixes and documentation: Valerio Aimale, Suharto Anggono, Thomas
Baier, Gabe Becker, Henrik Bengtsson, Roger Bivand, Ben Bolker, David Brahm,
G"oran Brostr"om, Patrick Burns, Vince Carey, Saikat DebRoy, Matt Dowle, Brian
D'Urso, Lyndon Drake, Dirk Eddelbuettel, Claus Ekstrom, Sebastian Fischmeister,
John Fox, Paul Gilbert, Yu Gong, Gabor Grothendieck, Frank E Harrell Jr, Peter
M. Haverty, Torsten Hothorn, Robert King, Kjetil Kjernsmo, Roger Koenker,
Philippe Lambert, Jan de Leeuw, Jim Lindsey, Patrick Lindsey, Catherine Loader,
Gordon Maclean, Arni Magnusson, John Maindonald, David Meyer, Ei-ji Nakama,
Jens Oehlschl"agel, Steve Oncley, Richard O'Keefe, Hubert Palme, Roger D. Peng,
Jose' C. Pinheiro, Tony Plate, Anthony Rossini, Jonathan Rougier, Petr Savicky,
Guenther Sawitzki, Marc Schwartz, Arun Srinivasan, Detlef Steuer, Bill Simpson,
Gordon Smyth, Adrian Trapletti, Terry Therneau, Rolf Turner, Bill Venables,
Gregory R. Warnes, Andreas Weingessel, Morten Welinder, James Wettenhall, Simon
Wood, and Achim Zeileis. Others have written code that has been adopted by R and
is acknowledged in the code files, including 
R would not be what it is today without
the invaluable help of these people
outside of the (former and current) R
Core team, who contributed by donating
code, bug fixes and documentation:
Valerio Aimale, Suharto Anggono, Thomas
Baier, Gabe Becker, Henrik Bengtsson,
Roger Bivand, Ben Bolker, David Brahm,
G"oran Brostr"om, Patrick Burns, Vince
Carey, Saikat DebRoy, Matt Dowle,
Brian D'Urso, Lyndon Drake, Dirk
Eddelbuettel, Claus Ekstrom, Sebastian
Fischmeister, John Fox, Paul Gilbert,
Yu Gong, Gabor Grothendieck, Frank E
Harrell Jr, Peter M. Haverty, Torsten
Hothorn, Robert King, Kjetil Kjernsmo,
Roger Koenker, Philippe Lambert, Jan
de Leeuw, Jim Lindsey, Patrick Lindsey,
Catherine Loader, Gordon Maclean,
Arni Magnusson, John Maindonald,
David Meyer, Ei-ji Nakama, Jens
Oehlschl"agel, Steve Oncley, Richard
O'Keefe, Hubert Palme, Roger D. Peng,
Jose' C. Pinheiro, Tony Plate, Anthony
Rossini, Jonathan Rougier, Petr Savicky,
Guenther Sawitzki, Marc Schwartz, Arun
Srinivasan, Detlef Steuer, Bill Simpson,
Gordon Smyth, Adrian Trapletti, Terry
Therneau, Rolf Turner, Bill Venables,
Gregory R. Warnes, Andreas Weingessel,
Morten Welinder, James Wettenhall, Simon
Wood, and Achim Zeileis. Others have
written code that has been adopted by R
and is acknowledged in the code files,
including 
  R would not be what it is today without the invaluable
help of these people outside of the (former and current)
R Core team, who contributed by donating code, bug fixes
and documentation: Valerio Aimale, Suharto Anggono, Thomas
Baier, Gabe Becker, Henrik Bengtsson, Roger Bivand, Ben
Bolker, David Brahm, G"oran Brostr"om, Patrick Burns,
Vince Carey, Saikat DebRoy, Matt Dowle, Brian D'Urso,
Lyndon Drake, Dirk Eddelbuettel, Claus Ekstrom, Sebastian
Fischmeister, John Fox, Paul Gilbert, Yu Gong, Gabor
Grothendieck, Frank E Harrell Jr, Peter M. Haverty,
Torsten Hothorn, Robert King, Kjetil Kjernsmo, Roger
Koenker, Philippe Lambert, Jan de Leeuw, Jim Lindsey,
Patrick Lindsey, Catherine Loader, Gordon Maclean, Arni
Magnusson, John Maindonald, David Meyer, Ei-ji Nakama,
Jens Oehlschl"agel, Steve Oncley, Richard O'Keefe, Hubert
Palme, Roger D. Peng, Jose' C. Pinheiro, Tony Plate, Anthony
Rossini, Jonathan Rougier, Petr Savicky, Guenther Sawitzki,
Marc Schwartz, Arun Srinivasan, Detlef Steuer, Bill Simpson,
Gordon Smyth, Adrian Trapletti, Terry Therneau, Rolf Turner,
Bill Venables, Gregory R. Warnes, Andreas Weingessel, Morten
Welinder, James Wettenhall, Simon Wood, and Achim Zeileis.
Others have written code that has been adopted by R and is
acknowledged in the code files, including 
R would not be what it is today without the invaluable help
  of these people outside of the (former and current) R
  Core team, who contributed by donating code, bug fixes
  and documentation: Valerio Aimale, Suharto Anggono, Thomas
  Baier, Gabe Becker, Henrik Bengtsson, Roger Bivand, Ben
  Bolker, David Brahm, G"oran Brostr"om, Patrick Burns,
  Vince Carey, Saikat DebRoy, Matt Dowle, Brian D'Urso,
  Lyndon Drake, Dirk Eddelbuettel, Claus Ekstrom, Sebastian
  Fischmeister, John Fox, Paul Gilbert, Yu Gong, Gabor
  Grothendieck, Frank E Harrell Jr, Peter M. Haverty,
  Torsten Hothorn, Robert King, Kjetil Kjernsmo, Roger
  Koenker, Philippe Lambert, Jan de Leeuw, Jim Lindsey,
  Patrick Lindsey, Catherine Loader, Gordon Maclean, Arni
  Magnusson, John Maindonald, David Meyer, Ei-ji Nakama,
  Jens Oehlschl"agel, Steve Oncley, Richard O'Keefe, Hubert
  Palme, Roger D. Peng, Jose' C. Pinheiro, Tony Plate,
  Anthony Rossini, Jonathan Rougier, Petr Savicky, Guenther
  Sawitzki, Marc Schwartz, Arun Srinivasan, Detlef Steuer,
  Bill Simpson, Gordon Smyth, Adrian Trapletti, Terry
  Therneau, Rolf Turner, Bill Venables, Gregory R. Warnes,
  Andreas Weingessel, Morten Welinder, James Wettenhall,
  Simon Wood, and Achim Zeileis. Others have written code
  that has been adopted by R and is acknowledged in the code
  files, including 
R
  would
  not
  be
  what
  it
  is
  today
  without
  the
  invaluable
  help
  of
  these
  people
[output truncated: with width = 0, every word is wrapped onto its own line]

4.2.8 str_trim()

str_trim(string, side = c("both", "left", "right"))

```{r}
cat("\nString with trailing and leading white space\n")
cat("\n\nString with trailing and leading white space\n\n")

str_trim("  String with trailing and leading white space\t")
str_trim("\n\nString with trailing and leading white space\n\n")
```

String with trailing and leading white space


String with trailing and leading white space

[1] "String with trailing and leading white space"
[1] "String with trailing and leading white space"

4.2.9 str_squish()

str_squish(string)

```{r}
str_squish("  String with trailing,  middle, and leading white space\t")
str_squish("\n\nString with excess,  trailing and leading white   space\n\n")
```
[1] "String with trailing, middle, and leading white space"
[1] "String with excess, trailing and leading white space"

4.3 Matching patterns with regular expressions

  • Regex are a very terse language that allow you to describe patterns in strings.
  • To learn regular expressions, we’ll use str_view().
  • They take a character vector and a regular expression, and show you how they match.
  • Once you’ve mastered pattern matching, you’ll learn how to apply those ideas with various stringr functions.

4.3.1 Basic matches

  • str_view() highlights every match; by default, only the elements that match are shown
  • str_view(string, pattern, match = NA)

4.3.1.1 (1) exact strings

```{r}
x <- c("apple", "banana", "pear")
str_view(x, "an")
```
[2] │ b<an><an>a

4.3.1.2 (2) .

  • . matches any character (except a newline):
```{r}
str_view(x, ".a.")
```
[2] │ <ban>ana
[3] │ p<ear>
```{r}
# To create the regular expression, we need \\
dot <- "\\."

# But the expression itself only contains one:
writeLines(dot)
cat(dot)

# And this tells R to look for an explicit `.`
str_view(c("abc", "a.c", "bef"), "a\\.c")
```
\.
\.[2] │ <a.c>
```{r}
x <- "a\\b"
writeLines(x)

str_view(x, "\\\\")
```
a\b
[1] │ a<\>b

4.3.2 Anchors

  • By default, regular expressions will match any part of a string.
  • It’s often useful to anchor the regular expression so that it matches from the start or end of the string.

4.3.2.1 ^ and $

  • You can use:
  • ^ to match the start of the string.
  • $ to match the end of the string.
```{r}
x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
```
[1] │ <a>pple
[2] │ banan<a>
  • To force a regular expression to only match a complete string, anchor it with both ^ and $:
```{r}
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")
str_view(x, "^apple$")
```
[1] │ <apple> pie
[2] │ <apple>
[3] │ <apple> cake
[2] │ <apple>

4.3.2.2 boundary with \b

  • You can also match the boundary between words with \b.
  • For example, search for \bsum\b to avoid matching summarise, summary, rowsum and so on.
```{r}
y <- c("summarize", "summary", "sum", "rowsum")
str_view(y, "\\bsum\\b")
```
[3] │ <sum>

4.3.3 Character classes and alternatives

  • There are a number of special patterns that match more than one character. You’ve already seen ., which matches any character apart from a newline. There are four other useful tools:
  • \d: matches any digit.
  • \s: matches any whitespace (e.g. space, tab, newline).
  • [abc]: matches a, b, or c.
  • [^abc]: matches anything except a, b, or c.
  • Remember, to create a regular expression containing \d or \s, you’ll need to escape the \ for the string, so you’ll type \\d or \\s.
  • A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex.
```{r}
# Look for a literal character that normally has special meaning in a regex
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c") # same as below.
str_view(c("abc", "a.c", "a*c", "a c"), "a\\.c")

str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c") # same as below
str_view(c("abc", "a.c", "a*c", "a c"), ".\\*c")

str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]") # same as below
str_view(c("abc", "a.c", "a*c", "a c"), "a\\ ")
```
[2] │ <a.c>
[2] │ <a.c>
[3] │ <a*c>
[3] │ <a*c>
[4] │ <a >c
[4] │ <a >c

4.3.3.1 Alternate

```{r}
str_view(c("grey", "gray"), "gr(e|a)y") # same as below
str_view(c("grey", "gray"), "grey|gray")
```
[1] │ <grey>
[2] │ <gray>
[1] │ <grey>
[2] │ <gray>

4.3.4 Repetition (= Quantifiers)

The next step up in power involves controlling how many times a pattern matches:

  • ?: 0 or 1
  • +: 1 or more
  • *: 0 or more
```{r}
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, "CC*") #greedy

str_view(x, 'C[LX]+') #[LX]: matches either L or X

str_extract(x, "CC?") # greedy
str_extract(x, "CC+") # greedy
str_extract(x, 'C[LX]+') # greedy
```
[1] │ 1888 is the longest year in Roman numerals: MD<CC><C>LXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MDCC<CLXXX>VIII
[1] "CC"
[1] "CCC"
[1] "CLXXX"

You can also specify the number of matches precisely:

  • {n}: exactly n
  • {n,}: n or more
  • {,m}: at most m
  • {n,m}: between n and m
```{r}
str_view(x, "C{2}") # 
str_view(x, "C{2,}") # greedy
str_view(x, "C{2,3}") # greedy

str_extract(x, "C{2}")
str_extract(x, "C{2,}")
str_extract(x, "C{2,3}")
```
[1] │ 1888 is the longest year in Roman numerals: MD<CC>CLXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
[1] "CC"
[1] "CCC"
[1] "CCC"
Tip
  • By default these matches are “greedy”: they will match the longest string possible.
  • You can make them “lazy”, matching the shortest string possible by putting a ? after them.
  • This is an advanced feature of regular expressions, but it’s useful to know that it exists:
    • ??: 0 or 1, prefer 0.
    • +?: 1 or more, match as few times as possible.
    • *?: 0 or more, match as few times as possible.
    • {n,}?: n or more, match as few times as possible.
    • {n,m}?: between n and m, match as few times as possible, but at least n.
```{r}
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
# cf
str_view(x, "CC??")

str_view(x, 'C{2,3}?') # lazy

str_view(x, 'C[LX]+?') # lazy
str_view(x, 'C[LX]+') # greedy

str_extract(x, c("C{2,3}", "C{2,3}?"))
str_extract(x, c("C[LX]+", "C[LX]+?"))
```
[1] │ 1888 is the longest year in Roman numerals: MD<CC><C>LXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MD<C><C><C>LXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MD<CC>CLXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MDCC<CL>XXXVIII
[1] │ 1888 is the longest year in Roman numerals: MDCC<CLXXX>VIII
[1] "CCC" "CC" 
[1] "CLXXX" "CL"   

4.3.5 Grouping and backreferences

  • Parentheses also create a numbered capturing group (number 1, 2 etc.).
  • A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses.
  • You can refer to the same text as previously matched by a capturing group with backreferences, like \1, \2 etc
```{r}
str_view(fruit, "(..)\\1", match = TRUE) # "\\1" means group 1

str_view(fruit, "(.)\\1", match = TRUE) # "(.)" captures one character
str_view(fruit, "(..)\1", match = TRUE) # wrong. needs \\

str_view(fruit, "(.)(.)\\2\\1", match = TRUE) 
```
 [4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry
 [1] │ a<pp>le
 [5] │ be<ll> pe<pp>er
 [6] │ bilbe<rr>y
 [7] │ blackbe<rr>y
 [8] │ blackcu<rr>ant
 [9] │ bl<oo>d orange
[10] │ bluebe<rr>y
[11] │ boysenbe<rr>y
[16] │ che<rr>y
[17] │ chili pe<pp>er
[19] │ cloudbe<rr>y
[21] │ cranbe<rr>y
[23] │ cu<rr>ant
[28] │ e<gg>plant
[29] │ elderbe<rr>y
[32] │ goji be<rr>y
[33] │ g<oo>sebe<rr>y
[38] │ hucklebe<rr>y
[47] │ lych<ee>
[50] │ mulbe<rr>y
... and 9 more
 [5] │ bell p<eppe>r
[17] │ chili p<eppe>r

4.3.6 Look arounds

These assertions look ahead or behind the current match without “consuming” any characters (i.e. changing the input position).

  • (?=...): positive look-ahead assertion. Matches if … matches at the current input.
    • followed by
  • (?!...): negative look-ahead assertion. Matches if … does not match at the current input.
    • not followed by
  • (?<=...): positive look-behind assertion. Matches if … matches text preceding the current position, with the last character of the match being the character just before the current position. Length must be bounded (i.e. no * or +).
    • preceded by
  • (?<!...): negative look-behind assertion. Matches if … does not match text preceding the current position. Length must be bounded (i.e. no * or +).
    • not preceded by

These are useful when you want to check that a pattern exists, but you don’t want to include it in the result:

```{r}
x <- c("1 piece", "2 pieces", "3")
str_extract(x, "\\d+(?= pieces?)") # positive look-ahead assertion: followed by

x1 <- c("piece 1", "pieces 2", "3")
str_extract(x1, "(?<=pieces?) \\d") # positive look-behind assertion: preceded by

y <- c("100", "$400")
str_extract(y, "(?<=\\$)\\d+") # positive look-behind assertion: preceded by
```
[1] "1" "2" NA 
[1] " 1" " 2" NA  
[1] NA    "400"

4.4 Tools

4.4.1 str_detect()

4.4.1.1 (1) Overview

  • Detect matches.
  • It returns a logical vector (TRUE = 1; FALSE = 0) the same length as the input.
```{r}
x <- c("apple", "banana", "pear")
str_detect(x, "e")
```
[1]  TRUE FALSE  TRUE

4.4.1.2 (2) counting and proportion

  • That makes sum() and mean() useful if you want to answer questions about matches across a larger vector:
```{r}
# How many common words start with t in the words dataset?
class(words)
as_tibble(words)
sum(str_detect(words, "^t"))

# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
```
[1] "character"
# A tibble: 980 × 1
   value   
   <chr>   
 1 a       
 2 able    
 3 about   
 4 absolute
 5 accept  
 6 account 
 7 achieve 
 8 across  
 9 act     
10 active  
# ℹ 970 more rows
[1] 65
[1] 0.2765306
Note
  • When you have complex logical conditions (e.g. match a or b but not c unless d) it’s often easier to combine multiple str_detect() calls with logical operators, rather than trying to create a single regular expression.

  • For example, here are two ways to find all words that don’t contain any vowels

```{r}
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")

# Find all words consisting only of consonants (non-vowels)
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")

identical(no_vowels_1, no_vowels_2) # check if they are the same.
```
[1] TRUE

4.4.1.3 (3) str_subset()

  • A common use of str_detect() is to select the elements that match a pattern.
  • You can do this with logical subsetting, or the convenient str_subset() wrapper:
```{r}
str_detect(words, "x$")
words[str_detect(words, "x$")] # logical subsetting
str_subset(words, "x$") # wrapper of str_detect()
```
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 ... (980 values in all; output truncated — TRUE only at positions 108, 747, 772, and 841)
[1] "box" "sex" "six" "tax"
[1] "box" "sex" "six" "tax"

4.4.1.4 (4) str_detect() with filter() in a data frame

```{r}
df <- tibble(
  word = words, 
  i = seq_along(word)
)
df

df %>% 
  filter(str_detect(word, "x$"))
```
# A tibble: 980 × 2
   word         i
   <chr>    <int>
 1 a            1
 2 able         2
 3 about        3
 4 absolute     4
 5 accept       5
 6 account      6
 7 achieve      7
 8 across       8
 9 act          9
10 active      10
# ℹ 970 more rows
# A tibble: 4 × 2
  word      i
  <chr> <int>
1 box     108
2 sex     747
3 six     772
4 tax     841

4.4.1.5 (5) str_count()

  • str_count(): rather than a simple yes or no, it tells you how many matches there are in a string:
```{r}
x <- c("apple", "banana", "pear")
str_count(x, "a")

# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))

str_count(words, "[aeiou]") |> 
  mean()
```
[1] 1 3 1
[1] 1.991837
[1] 1.991837

4.4.1.6 (6) str_count() with mutate()

```{r}
df %>% 
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]") # matches anything except a, e, i, o, u.
  )
```
# A tibble: 980 × 4
   word         i vowels consonants
   <chr>    <int>  <int>      <int>
 1 a            1      1          0
 2 able         2      2          2
 3 about        3      3          2
 4 absolute     4      4          4
 5 accept       5      2          4
 6 account      6      3          4
 7 achieve      7      4          3
 8 across       8      2          4
 9 act          9      1          2
10 active      10      3          3
# ℹ 970 more rows

4.4.1.7 (7) No overlapping matches

  • Note that matches never overlap. For example, in “abababa”, how many times will the pattern “aba” match?
  • Regular expressions say two, not three:
```{r}
str_count("abababa", "aba")

str_view("abababa", "aba")
```
[1] 2
[1] │ <aba>b<aba>
Tip
  • Many stringr functions come in pairs: one function works with a single match, and the other works with all matches. The second function has the suffix _all.
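
As a quick sketch of such a pair (reusing the "abababa" string from above): str_extract() returns only the first match, while str_extract_all() returns every non-overlapping match.

```{r}
library(stringr)

x <- "abababa"
str_extract(x, "aba")           # first match only: "aba"
str_extract_all(x, "aba")[[1]]  # all non-overlapping matches: "aba" "aba"
```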

4.4.2 str_extract()

4.4.2.1 (1) Overview

  • Extract matches
  • Note that str_extract() only extracts the first match.
```{r}
# sentences dataset from stringr: Sample character vectors for practicing string manipulations.
class(sentences)
length(sentences)
head(sentences)
```
[1] "character"
[1] 720
[1] "The birch canoe slid on the smooth planks." 
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."     
[4] "These days a chicken leg is a rare dish."   
[5] "Rice is often served in round bowls."       
[6] "The juice of lemons makes fine punch."      

Goal: Find all sentences that contain a colour.

```{r}
colors <- c("red", "orange", "yellow", "green", "blue", "purple")
color_match <- str_c(colors, collapse = "|")
color_match

has_color <- str_subset(sentences, color_match)
has_color

matches <- str_extract(has_color, color_match)
head(matches)
```
[1] "red|orange|yellow|green|blue|purple"
 [1] "Glue the sheet to the dark blue background."       
 [2] "Two blue fish swam in the tank."                   
 [3] "The colt reared and threw the tall rider."         
 [4] "The wide road shimmered in the hot sun."           
 [5] "See the cat glaring at the scared mouse."          
 [6] "A wisp of cloud hung in the blue air."             
 [7] "Leaves turn brown and yellow in the fall."         
 [8] "He ordered peach pie with ice cream."              
 [9] "Pure bred poodles have curls."                     
[10] "The spot on the blotter was made by green ink."    
[11] "Mud was spattered on the front of his white shirt."
[12] "The sofa cushion is red and of light weight."      
[13] "The sky that morning was clear and bright blue."   
[14] "Torn scraps littered the stone floor."             
[15] "The doctor cured him with these pills."            
[16] "The new girl was fired today at noon."             
[17] "The third act was dull and tired the players."     
[18] "A blue crane is a tall wading bird."               
[19] "Live wires should be kept covered."                
[20] "It is hard to erase blue or red ink."              
[21] "The wreck occurred by the bank on Main Street."    
[22] "The lamp shone with a steady green flame."         
[23] "The box is held by a bright red snapper."          
[24] "The prince ordered his head chopped off."          
[25] "The houses are built of red clay bricks."          
[26] "The red tape bound the smuggled food."             
[27] "Nine men were hired to dig the ruins."             
[28] "The flint sputtered and lit a pine torch."         
[29] "Hedge apples may stain your hands green."          
[30] "The old pan was covered with hard fudge."          
[31] "The plant grew large and green in the window."     
[32] "The store walls were lined with colored frocks."   
[33] "The purple tie was ten years old."                 
[34] "Bathe and relax in the cool green grass."          
[35] "The clan gathered on each dull night."             
[36] "The lake sparkled in the red hot sun."             
[37] "Mark the spot with a sign painted red."            
[38] "Smoke poured out of every crack."                  
[39] "Serve the hot rum to the tired heroes."            
[40] "The couch cover and hall drapes were blue."        
[41] "He offered proof in the form of a large chart."    
[42] "A man in a blue sweater sat at the desk."          
[43] "A sip of tea revives his tired friend."            
[44] "The door was barred, locked, and bolted as well."  
[45] "A thick coat of black paint covered all."          
[46] "The small red neon lamp went out."                 
[47] "Paint the sockets in the wall dull green."         
[48] "Wake and rise, and step into the green outdoors."  
[49] "The green light in the brown box flickered."       
[50] "He put his last cartridge into the gun and fired." 
[51] "The ram scared the school children off."           
[52] "Tear a thin sheet from the yellow pad."            
[53] "Dimes showered down from all sides."               
[54] "The sky in the west is tinged with orange red."    
[55] "The red paper brightened the dim stage."           
[56] "The hail pattered on the burnt brown grass."       
[57] "The big red apple fell to the ground."             
[1] "blue" "blue" "red"  "red"  "red"  "blue"

4.4.2.2 (2) str_extract_all()

```{r}
more <- sentences[str_count(sentences, color_match) > 1]
str_view(more, color_match)

# a single match allows you to use a simpler data structure
str_extract(more, color_match)

# To get all matches, use str_extract_all(). It returns a list:
str_extract_all(more, color_match)

# If you use simplify = TRUE, str_extract_all() will return a matrix with short matches expanded to the same length as the longest:
str_extract_all(more, color_match, simplify = TRUE)

x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
```
[1] │ It is hard to erase <blue> or <red> ink.
[2] │ The <green> light in the brown box flicke<red>.
[3] │ The sky in the west is tinged with <orange> <red>.
[1] "blue"   "green"  "orange"
[[1]]
[1] "blue" "red" 

[[2]]
[1] "green" "red"  

[[3]]
[1] "orange" "red"   

     [,1]     [,2] 
[1,] "blue"   "red"
[2,] "green"  "red"
[3,] "orange" "red"
     [,1] [,2] [,3]
[1,] "a"  ""   ""  
[2,] "a"  "b"  ""  
[3,] "a"  "b"  "c" 

4.4.3 Grouped matches

  • You can also use parentheses to extract parts of a complex match.
  • For example, imagine we want to extract nouns from the sentences.
  • As a heuristic, we’ll look for any word that comes after “a” or “the”.
  • Defining a “word” in a regular expression is a little tricky, so here I use a simple approximation:
  • a sequence of at least one character that isn’t a space.

4.4.3.1 (1) str_extract()

It gives us the complete match

```{r}
noun <- "(a|the) ([^ ]+)"

has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>% 
  str_extract(noun)
```
 [1] "the smooth" "the sheet"  "the depth"  "a chicken"  "the parked"
 [6] "the sun"    "the huge"   "the ball"   "the woman"  "a helps"   

4.4.3.2 (2) str_match()

  • gives each individual component
  • it returns a matrix, with one column for the complete match followed by one column for each group
```{r}
has_noun %>% 
  str_match(noun)

has_noun %>% 
  str_match_all(noun)
```
      [,1]         [,2]  [,3]     
 [1,] "the smooth" "the" "smooth" 
 [2,] "the sheet"  "the" "sheet"  
 [3,] "the depth"  "the" "depth"  
 [4,] "a chicken"  "a"   "chicken"
 [5,] "the parked" "the" "parked" 
 [6,] "the sun"    "the" "sun"    
 [7,] "the huge"   "the" "huge"   
 [8,] "the ball"   "the" "ball"   
 [9,] "the woman"  "the" "woman"  
[10,] "a helps"    "a"   "helps"  
[[1]]
     [,1]         [,2]  [,3]    
[1,] "the smooth" "the" "smooth"

[[2]]
     [,1]        [,2]  [,3]   
[1,] "the sheet" "the" "sheet"
[2,] "the dark"  "the" "dark" 

[[3]]
     [,1]        [,2]  [,3]   
[1,] "the depth" "the" "depth"
[2,] "a well."   "a"   "well."

[[4]]
     [,1]        [,2] [,3]     
[1,] "a chicken" "a"  "chicken"
[2,] "a rare"    "a"  "rare"   

[[5]]
     [,1]         [,2]  [,3]    
[1,] "the parked" "the" "parked"

[[6]]
     [,1]      [,2]  [,3] 
[1,] "the sun" "the" "sun"

[[7]]
     [,1]        [,2]  [,3]   
[1,] "the huge"  "the" "huge" 
[2,] "the clear" "the" "clear"

[[8]]
     [,1]       [,2]  [,3]  
[1,] "the ball" "the" "ball"

[[9]]
     [,1]        [,2]  [,3]   
[1,] "the woman" "the" "woman"

[[10]]
     [,1]           [,2]  [,3]      
[1,] "a helps"      "a"   "helps"   
[2,] "the evening." "the" "evening."

4.4.3.3 (3) tidyr::extract()

If your data is in a tibble, it’s often easier to use tidyr::extract().

  • It works like str_match() but requires you to name the matches, which are then placed in new columns:
  • Usage
extract(
  data,
  col,
  into,
  regex = "([[:alnum:]]+)",
  remove = TRUE,
  convert = FALSE,
  ...
)
```{r}
tibble(sentence = sentences) %>% 
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)", 
    remove = FALSE # If TRUE, remove input column from output data frame.
  )
```
# A tibble: 720 × 3
   sentence                                    article noun   
   <chr>                                       <chr>   <chr>  
 1 The birch canoe slid on the smooth planks.  the     smooth 
 2 Glue the sheet to the dark blue background. the     sheet  
 3 It's easy to tell the depth of a well.      the     depth  
 4 These days a chicken leg is a rare dish.    a       chicken
 5 Rice is often served in round bowls.        <NA>    <NA>   
 6 The juice of lemons makes fine punch.       <NA>    <NA>   
 7 The box was thrown beside the parked truck. the     parked 
 8 The hogs were fed chopped corn and garbage. <NA>    <NA>   
 9 Four hours of steady work faced us.         <NA>    <NA>   
10 A large size in stockings is hard to sell.  <NA>    <NA>   
# ℹ 710 more rows

4.4.4 Replacing matches: str_replace()

4.4.4.1 (1) Simple

  • str_replace() and str_replace_all() allow you to replace matches with new strings.
  • The simplest use is to replace a pattern with a fixed string:
```{r}
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")

str_replace_all(x, "[aeiou]", "-")
```
[1] "-pple"  "p-ar"   "b-nana"
[1] "-ppl-"  "p--r"   "b-n-n-"

4.4.4.2 (2) Named vector

  • With str_replace_all() you can perform multiple replacements by supplying a named vector.
```{r}
x <- c("1 house", "2 cars", "3 people")
x
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
```
[1] "1 house"  "2 cars"   "3 people"
[1] "one house"    "two cars"     "three people"

4.4.4.3 (3) Insert components of match

  • Instead of replacing with a fixed string, you can use backreferences to insert components of the match.
  • Ex) flip the order of the second and third words.
```{r}
sentences %>% 
  head(5)

sentences %>% 
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>% 
  head(5)
```
[1] "The birch canoe slid on the smooth planks." 
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."     
[4] "These days a chicken leg is a rare dish."   
[5] "Rice is often served in round bowls."       
[1] "The canoe birch slid on the smooth planks." 
[2] "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well."     
[4] "These a days chicken leg is a rare dish."   
[5] "Rice often is served in round bowls."       

4.4.5 Splitting: str_split()

4.4.5.1 (1) Sentences to words

str_split() splits a string up into pieces. For example, we could split sentences into words:

```{r}
# returns a list
sentences %>%
  head(5) %>% 
  str_split("") # one letter at a time

sentences %>%
  head(5) %>% 
  str_split(" ") # one word at a time

sentences %>%
  head(5) %>% 
  str_split("  ") # split on a double space: these sentences have none, so each is returned whole
```
[[1]]
 [1] "T" "h" "e" " " "b" "i" "r" "c" "h" " " "c" "a" "n" "o" "e" " " "s" "l" "i"
[20] "d" " " "o" "n" " " "t" "h" "e" " " "s" "m" "o" "o" "t" "h" " " "p" "l" "a"
[39] "n" "k" "s" "."

[[2]]
 [1] "G" "l" "u" "e" " " "t" "h" "e" " " "s" "h" "e" "e" "t" " " "t" "o" " " "t"
[20] "h" "e" " " "d" "a" "r" "k" " " "b" "l" "u" "e" " " "b" "a" "c" "k" "g" "r"
[39] "o" "u" "n" "d" "."

[[3]]
 [1] "I" "t" "'" "s" " " "e" "a" "s" "y" " " "t" "o" " " "t" "e" "l" "l" " " "t"
[20] "h" "e" " " "d" "e" "p" "t" "h" " " "o" "f" " " "a" " " "w" "e" "l" "l" "."

[[4]]
 [1] "T" "h" "e" "s" "e" " " "d" "a" "y" "s" " " "a" " " "c" "h" "i" "c" "k" "e"
[20] "n" " " "l" "e" "g" " " "i" "s" " " "a" " " "r" "a" "r" "e" " " "d" "i" "s"
[39] "h" "."

[[5]]
 [1] "R" "i" "c" "e" " " "i" "s" " " "o" "f" "t" "e" "n" " " "s" "e" "r" "v" "e"
[20] "d" " " "i" "n" " " "r" "o" "u" "n" "d" " " "b" "o" "w" "l" "s" "."

[[1]]
[1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
[8] "planks."

[[2]]
[1] "Glue"        "the"         "sheet"       "to"          "the"        
[6] "dark"        "blue"        "background."

[[3]]
[1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."

[[4]]
[1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
[8] "rare"    "dish."  

[[5]]
[1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."

[[1]]
[1] "The birch canoe slid on the smooth planks."

[[2]]
[1] "Glue the sheet to the dark blue background."

[[3]]
[1] "It's easy to tell the depth of a well."

[[4]]
[1] "These days a chicken leg is a rare dish."

[[5]]
[1] "Rice is often served in round bowls."

4.4.5.2 (2) Splitting as an element

```{r}
# extract an element
"a|b|c|d" %>% 
  str_split("\\|") %>% 
  .[[1]] # extract first element of the list
```
[1] "a" "b" "c" "d"

4.4.5.3 (3) Splitting as a matrix

Return a matrix with “simplify = TRUE”

```{r}
sentences %>%
  head(5) %>% 
  str_split(" ", simplify = TRUE)

fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields
str(fields)

fields %>% str_split(": ", n = 2, simplify = TRUE) %>%  # n = 2: maximum number of pieces
  tibble()
```
     [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]         
[1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."    
[2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background."
[3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"          
[4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"       
[5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""           
     [,9]   
[1,] ""     
[2,] ""     
[3,] "well."
[4,] "dish."
[5,] ""     
[1] "Name: Hadley" "Country: NZ"  "Age: 35"     
 chr [1:3] "Name: Hadley" "Country: NZ" "Age: 35"
# A tibble: 3 × 1
  .[,1]   [,2]  
  <chr>   <chr> 
1 Name    Hadley
2 Country NZ    
3 Age     35    

4.4.5.4 (4) boundary("word")

Instead of splitting up strings by patterns, you can also split up by character, line, sentence and word boundary()s:

boundary(
  type = c("character", "line_break", "sentence", "word"),
  skip_word_none = NA,
  ...
)
```{r}
x <- "This is a sentence.  This is another sentence."
str_view(x, boundary("word"))
str_split(x, boundary("word")) # better outcome than the one below
str_split(x, " ") # not as good as above

words <- c("These are   some words.") # note: this masks the stringr words dataset
str_count(words, boundary("word"))
str_split(words, " ")[[1]]
str_split(words, boundary("word"))[[1]]
```
[1] │ <This> <is> <a> <sentence>.  <This> <is> <another> <sentence>.
[[1]]
[1] "This"     "is"       "a"        "sentence" "This"     "is"       "another" 
[8] "sentence"

[[1]]
[1] "This"      "is"        "a"         "sentence." ""          "This"     
[7] "is"        "another"   "sentence."

[1] 4
[1] "These"  "are"    ""       ""       "some"   "words."
[1] "These" "are"   "some"  "words"

4.4.6 Find matches with str_locate()

  • str_locate() and str_locate_all() give you the starting and ending positions of each match.
  • These are particularly useful when none of the other functions does exactly what you want.
  • You can use str_locate() to find the matching pattern, str_sub() to extract and/or modify them.
```{r}
fruit <- c("apple", "banana", "pear", "pineapple")
str_locate(fruit, "$")
str_locate(fruit, "a")
str_locate(fruit, "e")
str_locate(fruit, c("a", "b", "p", "p"))

str_locate_all(fruit, "a")
str_locate_all(fruit, "e")
str_locate_all(fruit, c("a", "b", "p", "p"))

# Find location of every character
str_locate_all(fruit, "")
```
     start end
[1,]     6   5
[2,]     7   6
[3,]     5   4
[4,]    10   9
     start end
[1,]     1   1
[2,]     2   2
[3,]     3   3
[4,]     5   5
     start end
[1,]     5   5
[2,]    NA  NA
[3,]     2   2
[4,]     4   4
     start end
[1,]     1   1
[2,]     1   1
[3,]     1   1
[4,]     1   1
[[1]]
     start end
[1,]     1   1

[[2]]
     start end
[1,]     2   2
[2,]     4   4
[3,]     6   6

[[3]]
     start end
[1,]     3   3

[[4]]
     start end
[1,]     5   5

[[1]]
     start end
[1,]     5   5

[[2]]
     start end

[[3]]
     start end
[1,]     2   2

[[4]]
     start end
[1,]     4   4
[2,]     9   9

[[1]]
     start end
[1,]     1   1

[[2]]
     start end
[1,]     1   1

[[3]]
     start end
[1,]     1   1

[[4]]
     start end
[1,]     1   1
[2,]     6   6
[3,]     7   7

[[1]]
     start end
[1,]     1   1
[2,]     2   2
[3,]     3   3
[4,]     4   4
[5,]     5   5

[[2]]
     start end
[1,]     1   1
[2,]     2   2
[3,]     3   3
[4,]     4   4
[5,]     5   5
[6,]     6   6

[[3]]
     start end
[1,]     1   1
[2,]     2   2
[3,]     3   3
[4,]     4   4

[[4]]
      start end
 [1,]     1   1
 [2,]     2   2
 [3,]     3   3
 [4,]     4   4
 [5,]     5   5
 [6,]     6   6
 [7,]     7   7
 [8,]     8   8
 [9,]     9   9
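
The str_locate() + str_sub() combination mentioned above can be sketched as follows (a hypothetical example, not from the original text): locate the first vowel in each fruit name, extract it, then replace it in place.

```{r}
library(stringr)

fruit <- c("apple", "banana", "pear", "pineapple")
loc <- str_locate(fruit, "[aeiou]")                  # first vowel in each string
str_sub(fruit, loc[, "start"], loc[, "end"])         # extract: "a" "a" "e" "i"
str_sub(fruit, loc[, "start"], loc[, "end"]) <- "_"  # modify the matched range
fruit                                                # "_pple" "b_nana" "p_ar" "p_neapple"
```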

4.5 Other types of pattern

4.5.1 regex()

When you use a pattern that’s a string, it’s automatically wrapped into a call to regex()

```{r}
# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))
```
[2] │ ba<nana>
[2] │ ba<nana>

You can use the other arguments of regex() to control details of the match:

4.5.1.1 (1) ignore_case = TRUE

allows characters to match either their uppercase or lowercase forms. This always uses the current locale.

```{r}
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")

str_view(bananas, regex("banana", ignore_case = TRUE))
```
[1] │ <banana>
[1] │ <banana>
[2] │ <Banana>
[3] │ <BANANA>

4.5.1.2 (2) multiline = TRUE

allows ^ and $ to match the start and end of each line rather than the start and end of the complete string.

```{r}
x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]

str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
```
[1] "Line"
[1] "Line" "Line" "Line"

4.5.1.3 (3) comments = TRUE

allows you to use comments and white space to make complex regular expressions more understandable.

```{r}
phone <- regex("
  \\(?     # optional opening parens
  (\\d{3}) # area code
  [) -]?   # optional closing parens, space, or dash
  (\\d{3}) # another three numbers
  [ -]?    # optional space or dash
  (\\d{3}) # three more numbers
  ", comments = TRUE)

str_match("514-791-8141", phone)
```
     [,1]          [,2]  [,3]  [,4] 
[1,] "514-791-814" "514" "791" "814"

4.6 Other uses of regular expressions

There are two useful functions in base R that also use regular expressions:

4.6.1 apropos()

  • searches all objects available from the global environment.
  • This is useful if you can’t quite remember the name of the function.
```{r}
apropos("replace")
apropos("max")
```
[1] "%+replace%"       "replace"          "replace_na"       "setReplaceMethod"
[5] "str_replace"      "str_replace_all"  "str_replace_na"   "theme_replace"   
 [1] "cummax"       "max"          "max.col"      "max_height"   "max_width"   
 [6] "mem.maxNSize" "mem.maxVSize" "pmax"         "pmax.int"     "promax"      
[11] "slice_max"    "varimax"      "which.max"   

4.6.2 dir()

  • lists all the files in a directory.
  • The pattern argument takes a regular expression and only returns file names that match the pattern.
  • For example, you can find all the Quarto (.qmd) files in the current directory with the following:
```{r}
dir(pattern = "\\.qmd$")
```
[1] "M08-Advanced DataTypes.qmd"           
[2] "M08-Customizable Tables-gtsummary.qmd"

5 Factors with forcats

Factors with forcats cheat sheet: http://www.flutterbys.com.au/stats/downloads/slides/figure/factors.pdf

5.1 Introduction

Historically, factors were much easier to work with than characters. As a result, many of the functions in base R automatically convert characters to factors. This means that factors often crop up in places where they’re not actually helpful. Fortunately, you don’t need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.
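
For example, base R's data.frame() historically converted character columns to factors by default; since R 4.0 the default is stringsAsFactors = FALSE, so this sketch shows both behaviors:

```{r}
# old default (pre-R 4.0 behavior, requested explicitly here):
df_old <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
class(df_old$x)  # "factor"

# current default (R >= 4.0): characters stay characters
df_new <- data.frame(x = c("a", "b"))
class(df_new$x)  # "character"
```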

5.2 Creating factors

5.2.1 (1) factor()

```{r}
x1 <- c("Dec", "Apr", "Jan", "Mar")
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

y1 <- factor(x1, levels = month_levels)
y1

sort(y1)

# any values not in the set will be silently converted to NA:
x2 <- c("Dec", "Apr", "Jam", "Mar")
y2 <- factor(x2, levels = month_levels)
y2

# If you omit the levels, they'll be taken from the data in alphabetical order:
factor(x1)
```
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
[1] Dec  Apr  <NA> Mar 
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
[1] Dec Apr Jan Mar
Levels: Apr Dec Jan Mar

5.2.2 (2) readr::parse_factor()

If you want a warning, you can use readr::parse_factor().

```{r}
y2 <- readr::parse_factor(x2, levels = month_levels)
y2
```
[1] Dec  Apr  <NA> Mar 
attr(,"problems")
# A tibble: 1 × 4
    row   col expected           actual
  <int> <int> <chr>              <chr> 
1     3    NA value in level set Jam   
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec

5.2.3 (3) unique() or fct_inorder()

Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data. You can do that when creating the factor by setting levels to unique(x), or after the fact, with fct_inorder():

```{r}
f1 <- factor(x1, levels = unique(x1))
f1

f2 <- x1 %>% factor() %>% fct_inorder()
f2

levels(f2) # function from base R
```
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar
[1] Dec Apr Jan Mar
Levels: Dec Apr Jan Mar
[1] "Dec" "Apr" "Jan" "Mar"

5.3 Survey data

```{r}
gss_cat
# ?gss_cat
```
# A tibble: 21,483 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
 3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
 7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
 9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
# ℹ 21,473 more rows
  • year: year of survey, 2000-2014
  • age: age. Maximum age truncated to 89.
  • marital: marital status
  • race: race
  • rincome: reported income
  • partyid: party affiliation
  • relig: religion
  • denom: denomination
  • tvhours: hours per day watching tv
```{r}
skimr::skim(gss_cat)
levels(gss_cat$race)

# When factors are stored in a tibble, you can't see their levels so easily. 
# One way to see them is with count():
gss_cat %>%
  count(race)

# Also with a barplot
ggplot(gss_cat, aes(race)) +
  geom_bar()

# By default, ggplot2 will drop levels that don't have any values. 
# You can force them to display with:
ggplot(gss_cat, aes(race)) +
  geom_bar() +
  scale_x_discrete(drop = FALSE)
```
Data summary
Name gss_cat
Number of rows 21483
Number of columns 9
_______________________
Column type frequency:
factor 6
numeric 3
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
marital 0 1 FALSE 6 Mar: 10117, Nev: 5416, Div: 3383, Wid: 1807
race 0 1 FALSE 3 Whi: 16395, Bla: 3129, Oth: 1959, Not: 0
rincome 0 1 FALSE 16 $25: 7363, Not: 7043, $20: 1283, $10: 1168
partyid 0 1 FALSE 10 Ind: 4119, Not: 3690, Str: 3490, Not: 3032
relig 0 1 FALSE 15 Pro: 10846, Cat: 5124, Non: 3523, Chr: 689
denom 0 1 FALSE 30 Not: 10072, Oth: 2534, No : 1683, Sou: 1536

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2006.50 4.45 2000 2002 2006 2010 2014 ▇▃▇▂▆
age 76 1.00 47.18 17.29 18 33 46 59 89 ▇▇▇▅▂
tvhours 10146 0.53 2.98 2.59 0 1 2 4 24 ▇▂▁▁▁
[1] "Other"          "Black"          "White"          "Not applicable"
# A tibble: 3 × 2
  race      n
  <fct> <int>
1 Other  1959
2 Black  3129
3 White 16395

  • These levels represent valid values that simply did not occur in this dataset. Unfortunately, dplyr doesn’t yet have a drop option, but it will in the future.

5.3.1 Visualizing survey data: Exercise

5.3.1.1 (1) rincome

  • Explore the distribution of rincome (reported income).
  • What makes the default bar chart hard to understand?
  • How could you improve the plot?
```{r}
gss_cat

gss_cat %>% 
  ggplot(aes(rincome)) +
  geom_bar() +
  coord_flip() +
  scale_x_discrete(drop = FALSE)
```
# A tibble: 21,483 × 9
    year marital         age race  rincome        partyid    relig denom tvhours
   <int> <fct>         <int> <fct> <fct>          <fct>      <fct> <fct>   <int>
 1  2000 Never married    26 White $8000 to 9999  Ind,near … Prot… Sout…      12
 2  2000 Divorced         48 White $8000 to 9999  Not str r… Prot… Bapt…      NA
 3  2000 Widowed          67 White Not applicable Independe… Prot… No d…       2
 4  2000 Never married    39 White Not applicable Ind,near … Orth… Not …       4
 5  2000 Divorced         25 White Not applicable Not str d… None  Not …       1
 6  2000 Married          25 White $20000 - 24999 Strong de… Prot… Sout…      NA
 7  2000 Never married    36 White $25000 or more Not str r… Chri… Not …       3
 8  2000 Divorced         44 White $7000 to 7999  Ind,near … Prot… Luth…      NA
 9  2000 Married          44 White $25000 or more Not str d… Prot… Other       0
10  2000 Married          47 White $25000 or more Strong re… Prot… Sout…       3
# ℹ 21,473 more rows

5.3.1.2 (2) relig & partyid

What is the most common relig in this survey? What’s the most common partyid?

```{r}
# most common religion
gss_cat %>% 
  ggplot(aes(relig)) +
  geom_bar() +
  coord_flip() +
  scale_x_discrete(drop = FALSE)

# most common party
gss_cat %>% 
  ggplot(aes(partyid)) +
  geom_bar() +
  coord_flip() +
  scale_x_discrete(drop = FALSE)
```

5.3.1.3 (3) religion and denomination

Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualisation?

```{r}
gss_cat %>% 
  ggplot(aes(denom)) +
  geom_bar() +
  coord_flip() +
  scale_x_discrete(drop = FALSE)

# create a table and visualize it
gss_cat %>% 
  count(relig, denom) %>% 
  #view() %>% 
  ggplot(aes(relig, n, fill = denom)) +
  geom_col() +
  coord_flip() +
  scale_x_discrete(drop = FALSE) #+
  #facet_wrap(~ denom)
```

Answer

Only Protestants and Christians responded to the denomination question.

5.4 Modifying factor order:

5.4.1 fct_reorder()

  • reorders a factor based on the values of another (typically continuous) variable

5.4.1.1 (1) Reordering nominal (arbitrary) factors

  • explore the average number of hours spent watching TV per day across religions:
```{r}
# hard to interpret the pattern of the relationship
relig_summary <- gss_cat %>%
  group_by(relig) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n() ) 

relig_summary

ggplot(relig_summary, aes(tvhours, relig)) + 
  geom_point()

# reorder religion by average TV hours
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) + 
  geom_point()

# It is better to move fct_reorder out of aes() and into a separate mutate() step.
relig_summary %>%
  mutate(relig = fct_reorder(relig, tvhours)) %>%
  ggplot(aes(tvhours, relig)) +
    geom_point()
```
# A tibble: 15 × 4
   relig                     age tvhours     n
   <fct>                   <dbl>   <dbl> <int>
 1 No answer                49.5    2.72    93
 2 Don't know               35.9    4.62    15
 3 Inter-nondenominational  40.0    2.87   109
 4 Native american          38.9    3.46    23
 5 Christian                40.1    2.79   689
 6 Orthodox-christian       50.4    2.42    95
 7 Moslem/islam             37.6    2.44   104
 8 Other eastern            45.9    1.67    32
 9 Hinduism                 37.7    1.89    71
10 Buddhism                 44.7    2.38   147
11 Other                    41.0    2.73   224
12 None                     41.2    2.71  3523
13 Jewish                   52.4    2.52   388
14 Catholic                 46.9    2.96  5124
15 Protestant               49.9    3.15 10846

5.4.1.2 (2) Reordering ordinal (principled) factors

How does average age vary across reported income levels?

```{r}
rincome_summary <- gss_cat %>%
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n() ) 

rincome_summary

ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
```
# A tibble: 16 × 4
   rincome          age tvhours     n
   <fct>          <dbl>   <dbl> <int>
 1 No answer       45.5    2.90   183
 2 Don't know      45.6    3.41   267
 3 Refused         47.6    2.48   975
 4 $25000 or more  44.2    2.23  7363
 5 $20000 - 24999  41.5    2.78  1283
 6 $15000 - 19999  40.0    2.91  1048
 7 $10000 - 14999  41.1    3.02  1168
 8 $8000 to 9999   41.1    3.15   340
 9 $7000 to 7999   38.2    2.65   188
10 $6000 to 6999   40.3    3.17   215
11 $5000 to 5999   37.8    3.16   227
12 $4000 to 4999   38.9    3.15   226
13 $3000 to 3999   37.8    3.31   276
14 $1000 to 2999   34.5    3.00   395
15 Lt $1000        40.5    3.36   286
16 Not applicable  56.1    3.79  7043

Warning

Here, arbitrarily reordering the levels isn’t a good idea! That’s because rincome already has a principled order that we shouldn’t mess with. Reserve fct_reorder() for factors whose levels are arbitrarily ordered.

5.4.2 fct_relevel()

  • takes a factor, f, and then any number of levels that you want to move to the front of the line.
```{r}
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
  geom_point()

# releveling in wrangling part
gss_cat %>%
  mutate(rincome = fct_relevel(rincome, "Not applicable")) %>% 
  group_by(rincome) %>%
  summarise(
    age = mean(age, na.rm = TRUE),
    tvhours = mean(tvhours, na.rm = TRUE),
    n = n() ) %>% 
  ggplot(aes(age, rincome)) +
  geom_point()
```

5.4.3 fct_reorder2()

  • Reorders the factor by the y values associated with the largest x values.
  • This makes the plot easier to read because the line colours line up with the legend.
```{r}
by_age <- gss_cat %>%
  filter(!is.na(age)) %>%
  count(age, marital) %>%
  group_by(age) %>%
  mutate(prop = n / sum(n))

ggplot(by_age, aes(age, prop, colour = marital)) +
  geom_line(na.rm = TRUE)

# the line colors line up with the color legend.
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
  geom_line() +
  labs(colour = "Marital Status")
```

5.4.4 fct_infreq() + fct_rev()

  • fct_infreq() orders levels by the number of observations at each level (largest first).

  • fct_rev() reverses the order of levels.

  • Used together, they order levels in increasing frequency.

  • This is the simplest type of reordering because it doesn’t need any extra variables.

```{r}
gss_cat %>%
  mutate(marital = marital %>% fct_infreq()) %>%
  ggplot(aes(marital)) +
    geom_bar()
         
# barplot 1
marital_order1 <- gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev())

gss_cat %>%
  mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
  ggplot(aes(marital)) +
    geom_bar()

# barplot 2
marital_order2 <- gss_cat %>% 
  mutate(marital = fct_infreq(marital),
         marital = fct_rev(marital))

gss_cat %>% 
  mutate(marital = fct_infreq(marital),
         marital = fct_rev(marital)) %>% 
  ggplot(aes(marital)) +
    geom_bar()

# comparison
levels(marital_order1$marital)
levels(marital_order2$marital)
```
[1] "No answer"     "Separated"     "Widowed"       "Divorced"     
[5] "Never married" "Married"      
[1] "No answer"     "Separated"     "Widowed"       "Divorced"     
[5] "Never married" "Married"      

5.5 Modifying factor levels

  • More powerful than changing the orders of the levels is changing their values.
  • This allows you to clarify labels for publication, and collapse levels for high-level displays.

5.5.1 fct_recode()

  • The most general and powerful tool is fct_recode().
  • It allows you to recode, or change, the value of each level.

5.5.1.1 (1) recode it

```{r}
# The levels are terse and inconsistent. 
gss_cat %>% count(partyid)

# Let's tweak them to be longer and use a parallel construction.
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat"
  )) %>%
  count(partyid)
```
# A tibble: 10 × 2
   partyid                n
   <fct>              <int>
 1 No answer            154
 2 Don't know             1
 3 Other party          393
 4 Strong republican   2314
 5 Not str republican  3032
 6 Ind,near rep        1791
 7 Independent         4119
 8 Ind,near dem        2499
 9 Not str democrat    3690
10 Strong democrat     3490
# A tibble: 10 × 2
   partyid                   n
   <fct>                 <int>
 1 No answer               154
 2 Don't know                1
 3 Other party             393
 4 Republican, strong     2314
 5 Republican, weak       3032
 6 Independent, near rep  1791
 7 Independent            4119
 8 Independent, near dem  2499
 9 Democrat, weak         3690
10 Democrat, strong       3490

fct_recode() will leave levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.
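
Both behaviours can be seen with a toy factor (a sketch; the fruit names are made up for illustration):

```{r}
library(forcats)

f <- factor(c("apple", "bannana", "apple"))

# Levels that aren't mentioned are left as is:
fct_recode(f, fruit = "apple")

# Referring to a level that doesn't exist ("banana" vs the
# misspelled level "bannana") triggers a warning:
fct_recode(f, fixed = "banana")
```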

5.5.1.2 (2) Combine multiple levels into a new level

  • To combine groups, you can assign multiple old levels to the same new level:
```{r}
gss_cat %>%
  mutate(partyid = fct_recode(partyid,
    "Republican, strong"    = "Strong republican",
    "Republican, weak"      = "Not str republican",
    "Independent, near rep" = "Ind,near rep",
    "Independent, near dem" = "Ind,near dem",
    "Democrat, weak"        = "Not str democrat",
    "Democrat, strong"      = "Strong democrat",
    "Other"                 = "No answer",
    "Other"                 = "Don't know",
    "Other"                 = "Other party"
  )) %>%
  count(partyid)
```
# A tibble: 8 × 2
  partyid                   n
  <fct>                 <int>
1 Other                   548
2 Republican, strong     2314
3 Republican, weak       3032
4 Independent, near rep  1791
5 Independent            4119
6 Independent, near dem  2499
7 Democrat, weak         3690
8 Democrat, strong       3490
Warning

Note: You must use this technique with care: if you group together categories that are truly different you will end up with misleading results.

5.5.2 fct_collapse()

  • If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode().
  • For each new variable, you can provide a vector of old levels:
```{r}
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(partyid)
```
# A tibble: 4 × 2
  partyid     n
  <fct>   <int>
1 other     548
2 rep      5346
3 ind      8409
4 dem      7180

5.5.3 fct_lump()

  • Sometimes you just want to lump together all the small groups to make a plot or table simpler. That’s the job of fct_lump():
  • The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group.
```{r}
# In this case it's not very helpful: it is true that the majority of Americans 
# in this survey are Protestant, but we've probably over collapsed.
gss_cat %>%
  mutate(relig = fct_lump(relig)) %>%
  count(relig)

# Instead, we can use the n parameter to specify how many groups (excluding other) we want to keep:
gss_cat %>%
  mutate(relig = fct_lump(relig, n = 10)) %>%
  count(relig, sort = TRUE) %>%
  print(n = Inf)
```
# A tibble: 2 × 2
  relig          n
  <fct>      <int>
1 Protestant 10846
2 Other      10637
# A tibble: 10 × 2
   relig                       n
   <fct>                   <int>
 1 Protestant              10846
 2 Catholic                 5124
 3 None                     3523
 4 Christian                 689
 5 Other                     458
 6 Jewish                    388
 7 Buddhism                  147
 8 Inter-nondenominational   109
 9 Moslem/islam              104
10 Orthodox-christian         95

5.5.4 Exercise

  1. How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
  2. How could you collapse rincome into a small set of categories?
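
For the first exercise, one possible approach (a sketch, not the only answer) is to collapse partyid into broad groups with fct_collapse() and plot the yearly proportions; the three-way grouping below is an assumption about which levels belong together:

```{r}
library(tidyverse)

gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    other = c("No answer", "Don't know", "Other party"),
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
  )) %>%
  count(year, partyid) %>%
  group_by(year) %>%
  mutate(prop = n / sum(n)) %>%       # proportion within each year
  ggplot(aes(year, prop, colour = fct_reorder2(partyid, year, prop))) +
  geom_line() +
  labs(colour = "partyid")
```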

6 Dates and Times with lubridate package

6.1 Introduction

Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST.

6.1.1 Prep

```{r}
library(tidyverse)

library(lubridate)
library(nycflights13)
```

6.2 Creating date/times

There are three types of date/time data that refer to an instant in time:

  • A date. Tibbles print this as <date>.
  • A time within a day. Tibbles print this as <time>.
  • A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as <dttm>. Elsewhere in R these are called POSIXct.
  • We will focus only on dates and date-times, as R doesn’t have a native class for storing times. If you need one, you can use the hms package.
```{r}
# current date
today()

# current date-time
now()
```
[1] "2025-03-27"
[1] "2025-03-27 12:12:27 PDT"

Three ways to create a date/time:

  • From a string.
  • From individual date-time components.
  • From an existing date/time object.

6.2.1 From Strings

  • Date/time data often comes as strings.
  • One approach is to parse strings into date-times with readr’s parsing functions.
  • Another approach is to use the helpers provided by lubridate.
  • They automatically work out the format once you specify the order of the components.
  • To use them, identify the order in which year, month, and day appear in your dates, then arrange “y”, “m”, and “d” in the same order.
  • That gives you the name of the lubridate function that will parse your date. For example:

6.2.1.1 (1) dates only

```{r}
ymd("2017-01-31")
mdy("January 31st, 2017")
dmy("31-Jan-2017")

# These functions also take unquoted numbers.
ymd(20170131)
```
[1] "2017-01-31"
[1] "2017-01-31"
[1] "2017-01-31"
[1] "2017-01-31"

6.2.1.2 (2) date-time

  • To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function.
```{r}
ymd_hms("2017-01-31 20:11:59")

mdy_hm("01/31/2017 08:01")
```
[1] "2017-01-31 20:11:59 UTC"
[1] "2017-01-31 08:01:00 UTC"

6.2.1.3 (3) time zone

You can also force the creation of a date-time from a date by supplying a timezone:

```{r}
ymd(20170131, tz = "UTC") # Coordinated Universal Time: PST is UTC-8
```
[1] "2017-01-31 UTC"

6.2.2 From Individual Components

  • Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns.
```{r}
flights %>% 
  select(year, month, day, hour, minute)
```
# A tibble: 336,776 × 5
    year month   day  hour minute
   <int> <int> <int> <dbl>  <dbl>
 1  2013     1     1     5     15
 2  2013     1     1     5     29
 3  2013     1     1     5     40
 4  2013     1     1     5     45
 5  2013     1     1     6      0
 6  2013     1     1     5     58
 7  2013     1     1     6      0
 8  2013     1     1     6      0
 9  2013     1     1     6      0
10  2013     1     1     6      0
# ℹ 336,766 more rows

6.2.2.1 (1) make_date() / make_datetime()

  • To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:
```{r}
flights %>% 
  select(year, month, day, hour, minute, sched_dep_time) %>% 
  mutate(departure = make_datetime(year, month, day, hour, minute))
```
# A tibble: 336,776 × 7
    year month   day  hour minute sched_dep_time departure          
   <int> <int> <int> <dbl>  <dbl>          <int> <dttm>             
 1  2013     1     1     5     15            515 2013-01-01 05:15:00
 2  2013     1     1     5     29            529 2013-01-01 05:29:00
 3  2013     1     1     5     40            540 2013-01-01 05:40:00
 4  2013     1     1     5     45            545 2013-01-01 05:45:00
 5  2013     1     1     6      0            600 2013-01-01 06:00:00
 6  2013     1     1     5     58            558 2013-01-01 05:58:00
 7  2013     1     1     6      0            600 2013-01-01 06:00:00
 8  2013     1     1     6      0            600 2013-01-01 06:00:00
 9  2013     1     1     6      0            600 2013-01-01 06:00:00
10  2013     1     1     6      0            600 2013-01-01 06:00:00
# ℹ 336,766 more rows

6.2.2.2 (2) Create flights_dt

```{r}
make_datetime_100 <- function(year, month, day, time) {
  make_datetime(year, month, day, time %/% 100, time %% 100)
}

flights_dt <- flights %>% 
  filter(!is.na(dep_time), !is.na(arr_time)) %>% 
  mutate(
    dep_time = make_datetime_100(year, month, day, dep_time),
    arr_time = make_datetime_100(year, month, day, arr_time),
    sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
    sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
  ) %>% 
  select(origin, dest, ends_with("delay"), ends_with("time"))

flights_dt

# visualize the distribution of departure times across the year
flights_dt %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day

60*60*24 # number of seconds in a day

# distribution of departure time within a single day
flights_dt %>% 
  filter(dep_time < ymd(20130102)) %>% 
  ggplot(aes(dep_time)) + 
  geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
```
# A tibble: 328,063 × 9
   origin dest  dep_delay arr_delay dep_time            sched_dep_time     
   <chr>  <chr>     <dbl>     <dbl> <dttm>              <dttm>             
 1 EWR    IAH           2        11 2013-01-01 05:17:00 2013-01-01 05:15:00
 2 LGA    IAH           4        20 2013-01-01 05:33:00 2013-01-01 05:29:00
 3 JFK    MIA           2        33 2013-01-01 05:42:00 2013-01-01 05:40:00
 4 JFK    BQN          -1       -18 2013-01-01 05:44:00 2013-01-01 05:45:00
 5 LGA    ATL          -6       -25 2013-01-01 05:54:00 2013-01-01 06:00:00
 6 EWR    ORD          -4        12 2013-01-01 05:54:00 2013-01-01 05:58:00
 7 EWR    FLL          -5        19 2013-01-01 05:55:00 2013-01-01 06:00:00
 8 LGA    IAD          -3       -14 2013-01-01 05:57:00 2013-01-01 06:00:00
 9 JFK    MCO          -3        -8 2013-01-01 05:57:00 2013-01-01 06:00:00
10 LGA    ORD          -2         8 2013-01-01 05:58:00 2013-01-01 06:00:00
# ℹ 328,053 more rows
# ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>
[1] 86400

Tip

Note that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. For dates, 1 means 1 day.
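
A minimal check of those units (assuming the default UTC time zone for the date-time):

```{r}
library(lubridate)

# Dates count days since 1970-01-01;
# date-times count seconds since 1970-01-01 00:00:00 UTC.
as.numeric(ymd("1970-01-02"))              # one day after the epoch: 1
as.numeric(ymd_hms("1970-01-01 00:00:01")) # one second after the epoch: 1
```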

6.2.3 From Other Types

6.2.3.1 (1) as_datetime() / as_date()

You may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date()

```{r}
as_datetime(today())

as_date(now())
```
[1] "2025-03-27 UTC"
[1] "2025-03-27"
  • Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().
```{r}
as_datetime(60 * 60 * 10) # "Unix Epoch", 1970-01-01.

as_date(365 * 10 + 2)
```
[1] "1970-01-01 10:00:00 UTC"
[1] "1980-01-01"

6.3 Date-time components

Now that you know how to get date-time data into R’s date-time data structures, let’s explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with date-times.

6.3.1 Getting components

```{r}
datetime <- ymd_hms("2016-07-08 12:34:56")

year(datetime)

month(datetime)

mday(datetime) # day of the month

yday(datetime) # day of the year

wday(datetime) # day of the week starting with Sunday

wday(now()) 
```
[1] 2016
[1] 7
[1] 8
[1] 190
[1] 6
[1] 5
  • For month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week.
  • Set abbr = FALSE to return the full name.
```{r}
month(datetime, label = TRUE)

wday(datetime, label = TRUE, abbr = FALSE)
```
[1] Jul
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
[1] Friday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
  • We can use wday() to see that more flights depart during the week than on the weekend:
```{r}
flights_dt %>% 
  mutate(wday = wday(dep_time, label = TRUE)) %>% 
  ggplot(aes(x = wday)) +
    geom_bar()
```

  • There’s an interesting pattern if we look at the average departure delay by minute within the hour.
  • It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!
```{r}
flights_dt %>% 
  mutate(minute = minute(dep_time)) %>% 
  group_by(minute) %>% 
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n()) %>% 
  ggplot(aes(minute, avg_delay)) +
    geom_line()
```

  • Interestingly, if we look at the scheduled departure time we don’t see such a strong pattern:
```{r}
sched_dep <- flights_dt %>% 
  mutate(minute = minute(sched_dep_time)) %>% 
  group_by(minute) %>% 
  summarise(
    avg_delay = mean(arr_delay, na.rm = TRUE),
    n = n())

ggplot(sched_dep, aes(minute, avg_delay)) +
  geom_line()
```

6.3.2 Rounding:

6.3.2.1 (1) Basic

  • floor_date(), round_date(), and ceiling_date()
  • Syntax
floor_date(
  x,
  unit = "seconds",
  week_start = getOption("lubridate.week.start", 7)
)

# unit: second, minute, hour, day, week, month, bimonth, quarter, season, halfyear, and year
  • the number of flights per week:
```{r}
flights_dt %>% 
  count(week = floor_date(dep_time, "week")) %>% 
  ggplot(aes(week, n)) +
    geom_line()

flights_dt %>% 
  count(week = round_date(dep_time, "week")) %>% 
  ggplot(aes(week, n)) +
    geom_line()
```

6.3.2.2 (2) Exercise

```{r}
## print fractional seconds
options(digits.secs = 6)

x <- ymd_hms("2009-08-03 12:01:59.23")
round_date(x, "second")
round_date(x, "minute")
round_date(x, "5 mins")
round_date(x, "hour")
round_date(x, "2 hours")
round_date(x, "day")
round_date(x, "week")
round_date(x, "month")
round_date(x, "bimonth")
round_date(x, "quarter") == round_date(x, "3 months")
round_date(x, "halfyear")
round_date(x, "year")

x <- ymd_hms("2009-08-03 12:01:59.23")
floor_date(x, "second")
floor_date(x, "minute")
floor_date(x, "hour")
floor_date(x, "day")
floor_date(x, "week")
floor_date(x, "month")
floor_date(x, "bimonth")
floor_date(x, "quarter")
floor_date(x, "season")
floor_date(x, "halfyear")
floor_date(x, "year")

x <- ymd_hms("2009-08-03 12:01:59.23")
ceiling_date(x, "second")
ceiling_date(x, "minute")
ceiling_date(x, "5 mins")
ceiling_date(x, "hour")
ceiling_date(x, "day")
ceiling_date(x, "week")
ceiling_date(x, "month")
ceiling_date(x, "bimonth") == ceiling_date(x, "2 months")
ceiling_date(x, "quarter")
ceiling_date(x, "season")
ceiling_date(x, "halfyear")
ceiling_date(x, "year")
```
[1] "2009-08-03 12:01:59 UTC"
[1] "2009-08-03 12:02:00 UTC"
[1] "2009-08-03 12:00:00 UTC"
[1] "2009-08-03 12:00:00 UTC"
[1] "2009-08-03 12:00:00 UTC"
[1] "2009-08-04 UTC"
[1] "2009-08-02 UTC"
[1] "2009-08-01 UTC"
[1] "2009-09-01 UTC"
[1] TRUE
[1] "2009-07-01 UTC"
[1] "2010-01-01 UTC"
[1] "2009-08-03 12:01:59 UTC"
[1] "2009-08-03 12:01:00 UTC"
[1] "2009-08-03 12:00:00 UTC"
[1] "2009-08-03 UTC"
[1] "2009-08-02 UTC"
[1] "2009-08-01 UTC"
[1] "2009-07-01 UTC"
[1] "2009-07-01 UTC"
[1] "2009-06-01 UTC"
[1] "2009-07-01 UTC"
[1] "2009-01-01 UTC"
[1] "2009-08-03 12:02:00 UTC"
[1] "2009-08-03 12:02:00 UTC"
[1] "2009-08-03 12:05:00 UTC"
[1] "2009-08-03 13:00:00 UTC"
[1] "2009-08-04 UTC"
[1] "2009-08-09 UTC"
[1] "2009-09-01 UTC"
[1] TRUE
[1] "2009-10-01 UTC"
[1] "2009-09-01 UTC"
[1] "2010-01-01 UTC"
[1] "2010-01-01 UTC"

6.3.3 Setting components

6.3.3.1 (1) Modifying date-time in place

  • Assigning with an accessor function makes a permanent change to the object.
  • Use each accessor function to set the components of a date/time:
```{r}
(datetime <- ymd_hms("2016-07-08 12:34:56"))

year(datetime) <- 2023 # permanent change
datetime

month(datetime) <- 3 
datetime

hour(datetime) <- hour(datetime) + 1
datetime
```
[1] "2016-07-08 12:34:56 UTC"
[1] "2023-07-08 12:34:56 UTC"
[1] "2023-03-08 12:34:56 UTC"
[1] "2023-03-08 13:34:56 UTC"

6.3.3.2 (2) Create a new date-time with update()

```{r}
update(datetime, year = 2024, month = 2, mday = 2, hour = 2) # not a permanent change
datetime

#If values are too big, they will roll-over:
ymd("2023-02-01") %>% 
  update(mday = 30)

ymd("2023-02-01") %>% 
  update(hour = 48)

# show the distribution of flights across the course of the day for every day of the year:
flights_dt %>% 
  #arrange(desc(dep_time))
  mutate(dep_hour = update(dep_time, yday = 1)) %>% #yday: the first day of the year
  #arrange(desc(dep_hour))
  ggplot(aes(dep_hour)) +
  geom_freqpoly(binwidth = 300)
```
[1] "2024-02-02 02:34:56 UTC"
[1] "2023-03-08 13:34:56 UTC"
[1] "2023-03-02"
[1] "2023-02-03 UTC"

6.3.4 Exercise

6.3.4.1 (1) Distribution of flight times within a day

How does the distribution of flight times within a day change over the course of the year?

```{r}
# flights per hour for the entire year
flights_dt %>% 
  mutate(hour = hour(dep_time)) %>%
  group_by(hour)%>%
  summarize(numflights_per_hour = n())%>%
  ggplot(aes(x = hour, y = numflights_per_hour)) +
    geom_line()

flights %>% 
  filter(!is.na(dep_time)) %>% 
  mutate(hour = dep_time %/% 100) %>%
  group_by(hour)%>%
  summarize(numflights_per_hour = n())%>%
  ggplot(aes(x = hour, y = numflights_per_hour)) +
    geom_line()
```

6.4 Time spans and arithmetic

Three important classes represent time spans:

  • durations, which represent an exact number of seconds.
  • periods, which represent human units like weeks and months.
  • intervals, which represent a starting and ending point.
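
A quick sketch of what each class looks like when printed (the example dates are arbitrary):

```{r}
library(lubridate)

ddays(1)                                        # duration: exactly 86400 seconds
days(1)                                         # period: one calendar day
interval(ymd("2023-03-01"), ymd("2023-04-01"))  # interval: a concrete start and end
```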

6.4.1 Durations

6.4.1.1 (1) Base R

  • In R, when you subtract two dates, you get a difftime object:
```{r}
# How old are you?
h_age <- today() - ymd(19971217)
h_age
```
Time difference of 9962 days

6.4.1.2 (2) Duration and convenient constructors

```{r}
as.duration(h_age) # convert to a duration, which is always measured in seconds

# Convenient constructors
dseconds(15)
dminutes(10)
dhours(c(12, 24))
ddays(0:5)
```
[1] "860716800s (~27.27 years)"
[1] "15s"
[1] "600s (~10 minutes)"
[1] "43200s (~12 hours)" "86400s (~1 days)"  
[1] "0s"                "86400s (~1 days)"  "172800s (~2 days)"
[4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"

6.4.1.3 (3) Arithmetic with durations

```{r}
# You can add and multiply durations:
2 * dyears(1)

dyears(1) + dweeks(12) + dhours(15)

# add and subtract durations to and from days:
tomorrow <- today() + ddays(1)
tomorrow
last_year <- today() - dyears(1)
last_year

# Unexpected result due to DST starting in March
one_pm <- ymd_hms("2022-03-12 13:00:00", tz = "America/New_York")
one_pm
one_pm + ddays(1)
```
[1] "63115200s (~2 years)"
[1] "38869200s (~1.23 years)"
[1] "2025-03-28"
[1] "2024-03-26 18:00:00 UTC"
[1] "2022-03-12 13:00:00 EST"
[1] "2022-03-13 14:00:00 EDT"

6.4.2 Periods and constructors

  • Periods solve the problems that durations have because durations always count exact seconds.
  • Periods are time spans that don’t have a fixed length in seconds;
  • instead they work with “human” times, like days and months.
  • That allows them to behave in a more intuitive way:
```{r}
one_pm + days(1)

seconds(15)
minutes(10)
hours(c(12, 24))
days(7)
months(1:6)
weeks(3)
years(1)
```
[1] "2022-03-13 13:00:00 EDT"
[1] "15S"
[1] "10M 0S"
[1] "12H 0M 0S" "24H 0M 0S"
[1] "7d 0H 0M 0S"
[1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
[5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
[1] "21d 0H 0M 0S"
[1] "1y 0m 0d 0H 0M 0S"

6.4.2.1 (1) Arithmetic

```{r}
# addition and multiplication
10 * (months(6) + days(1))

days(50) + hours(25) + minutes(2)

#  Compared to durations, periods are more likely to do what you expect

## A leap year
ymd("2020-01-01") + dyears(1)
ymd("2020-01-01") + years(1)

## Daylight savings time  
one_pm + ddays(1)
one_pm + days(1)
```
[1] "60m 10d 0H 0M 0S"
[1] "50d 25H 2M 0S"
[1] "2020-12-31 06:00:00 UTC"
[1] "2021-01-01"
[1] "2022-03-13 14:00:00 EDT"
[1] "2022-03-13 13:00:00 EDT"

6.4.2.2 (2) Application to flights data

  • fix an oddity related to our flight dates. Some planes appear to have arrived at their destination before they departed from New York City.
```{r}
# These are overnight flights
flights %>% 
  filter(arr_time < dep_time) # same rows as the date-time version below, since both compare times on the same date

flights_dt %>% 
  filter(arr_time < dep_time) 

#  We used the same date information for both the departure and the arrival times, 
#  but these flights arrived on the following day. We can fix this by adding days(1) 
#  to the arrival time of each overnight flight.

flights_dt <- flights_dt %>% 
  mutate(overnight = arr_time < dep_time,
         arr_time = arr_time + days(overnight * 1),
         sched_arr_time = sched_arr_time + days(overnight * 1)
         )

flights_dt %>% 
  filter(overnight, arr_time < dep_time)
```
# A tibble: 10,633 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1     1929           1920         9        3              7
 2  2013     1     1     1939           1840        59       29           2151
 3  2013     1     1     2058           2100        -2        8           2359
 4  2013     1     1     2102           2108        -6      146            158
 5  2013     1     1     2108           2057        11       25             39
 6  2013     1     1     2120           2130       -10       16             18
 7  2013     1     1     2121           2040        41        6           2323
 8  2013     1     1     2128           2135        -7       26             50
 9  2013     1     1     2134           2045        49       20           2352
10  2013     1     1     2136           2145        -9       25             39
# ℹ 10,623 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 10,633 × 9
   origin dest  dep_delay arr_delay dep_time                  
   <chr>  <chr>     <dbl>     <dbl> <dttm>                    
 1 EWR    BQN           9        -4 2013-01-01 19:29:00.000000
 2 JFK    DFW          59        NA 2013-01-01 19:39:00.000000
 3 EWR    TPA          -2         9 2013-01-01 20:58:00.000000
 4 EWR    SJU          -6       -12 2013-01-01 21:02:00.000000
 5 EWR    SFO          11       -14 2013-01-01 21:08:00.000000
 6 LGA    FLL         -10        -2 2013-01-01 21:20:00.000000
 7 EWR    MCO          41        43 2013-01-01 21:21:00.000000
 8 JFK    LAX          -7       -24 2013-01-01 21:28:00.000000
 9 EWR    FLL          49        28 2013-01-01 21:34:00.000000
10 EWR    FLL          -9       -14 2013-01-01 21:36:00.000000
# ℹ 10,623 more rows
# ℹ 4 more variables: sched_dep_time <dttm>, arr_time <dttm>,
#   sched_arr_time <dttm>, air_time <dbl>
# A tibble: 0 × 10
# ℹ 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>, arr_delay <dbl>,
#   dep_time <dttm>, sched_dep_time <dttm>, arr_time <dttm>,
#   sched_arr_time <dttm>, air_time <dbl>, overnight <lgl>

6.4.3 Intervals

  • An interval is a duration with a starting point: that makes it precise, so you can determine exactly how long it is.
```{r}
dyears(1) / ddays(365) # obvious as duration uses 365 days worth of seconds

years(1)/days(1) # not specific; periods give an estimate since leap years have 366 days

# interval gives you an accurate measurement
today() + years(1)
next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
(today() %--% (today() + years(1))) / days(1)

# Durations are standardized lengths of time: dmonths() is defined as
# 1/12 of an average year, so dividing by it rarely gives an integer.
(today() %--% (today() + years(1))) / dmonths(1) # not integer
(today() %--% next_year) / months(1) # integer

# how many periods fall into an interval? Do the integer division
(today() %--% next_year) %/% days(1)
```
[1] 1.000685
[1] 365.25
[1] "2026-03-27"
[1] 365
[1] 365
[1] 11.99179
[1] 12
[1] 365

6.4.4 Summary

How do you pick between duration, periods, and intervals?

  • As always, pick the simplest data structure that solves your problem.
  • If you only care about physical time, use a duration;
  • if you need to add human times, use a period;
  • if you need to figure out how long a span is in human units, use an interval.
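The distinction is easiest to see across a daylight-saving boundary (a minimal sketch; the date is chosen because US DST began on 2016-03-13):

```{r}
library(lubridate)
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")

one_pm + ddays(1)  # duration: exactly 86,400 s later; DST makes the clock read 14:00
one_pm + days(1)   # period: one "human" day later; the clock still reads 13:00
(one_pm %--% (one_pm + days(1))) / dhours(1)  # interval: the span is 23 physical hours
```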

6.5 Time zones

6.5.1 (1) Basics

  • R uses the international standard IANA time zones, whose names take the form “{area}/{location}”, typically “{continent}/{city}”
  • Ex) “America/New_York”, “Europe/Paris”, and “Pacific/Auckland”
  • It’s worth reading the raw time zone database (available at http://www.iana.org/time-zones)
```{r}
Sys.timezone()
Sys.time()
Sys.Date()

# the complete list of all time zone names 
length(OlsonNames())
head(OlsonNames())
```
[1] "America/Los_Angeles"
[1] "2025-03-27 12:12:34.682331 PDT"
[1] "2025-03-27"
[1] 596
[1] "Africa/Abidjan"     "Africa/Accra"       "Africa/Addis_Ababa"
[4] "Africa/Algiers"     "Africa/Asmara"      "Africa/Asmera"     
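Because a tz argument must match one of these names exactly, a quick membership check (a small sketch) helps catch typos:

```{r}
"America/New_York" %in% OlsonNames()  # valid IANA name
"America/new_york" %in% OlsonNames()  # FALSE: names are case-sensitive
```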
  • In R, the time zone is an attribute of the date-time that only controls printing.
  • For example, these three objects represent the same instant in time:
```{r}
(x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York"))

(x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen"))

(x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland"))

x1 - x2
x1 - x3
```
[1] "2015-06-01 12:00:00 EDT"
[1] "2015-06-01 18:00:00 CEST"
[1] "2015-06-02 04:00:00 NZST"
Time difference of 0 secs
Time difference of 0 secs
  • Unless the time zone is specified, lubridate always uses UTC.
  • UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and is roughly equivalent to its predecessor GMT (Greenwich Mean Time).
```{r}
# Operations that combine date-times, like c(), will often drop the time zone. 
# In that case, the date-times will display in your local time zone:
x4 <- c(x1, x2, x3)
x4
```
[1] "2015-06-01 12:00:00 EDT" "2015-06-01 12:00:00 EDT"
[3] "2015-06-01 12:00:00 EDT"

6.5.2 (2) Change time zone

  • Keep the instant in time the same, and change how it’s displayed.
  • Use this when the instant is correct, but you want a more natural display.
```{r}
x4a <- with_tz(x4, tzone = "Australia/Lord_Howe")
x4a

x4a - x4
```
[1] "2015-06-02 02:30:00 +1030" "2015-06-02 02:30:00 +1030"
[3] "2015-06-02 02:30:00 +1030"
Time differences in secs
[1] 0 0 0
  • Change the underlying instant in time.
  • Use this when you have an instant that has been labelled with the incorrect time zone, and you need to fix it.
```{r}
x4b <- force_tz(x4, tzone = "Australia/Lord_Howe")
x4b

x4b - x4
```
[1] "2015-06-01 12:00:00 +1030" "2015-06-01 12:00:00 +1030"
[3] "2015-06-01 12:00:00 +1030"
Time differences in hours
[1] -14.5 -14.5 -14.5
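A common pattern (a minimal sketch with made-up timestamps) is to repair a mislabelled import with force_tz() and then use with_tz() only for display:

```{r}
library(lubridate)

# Suppose clock times recorded in London were parsed as UTC by mistake
imported <- ymd_hms("2024-07-01 09:00:00")    # tz defaults to UTC
fixed <- force_tz(imported, "Europe/London")  # same clock time, corrected instant

fixed - imported       # the instants now differ by 1 hour (BST = UTC+1)
with_tz(fixed, "UTC")  # display the corrected instant in UTC: 08:00
```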

7 Appendix

7.1 Grouping

  • You can use parentheses to override the default precedence rules:
```{r}
str_extract(c("grey", "gray"), "gre|ay")

str_extract(c("grey", "gray"), "gr(e|a)y")
```
[1] "gre" "ay" 
[1] "grey" "gray"

  • Parentheses also define “groups” that you can refer to with backreferences, like \1 and \2, and the matched groups can be extracted with str_match().
```{r}
fruit %>% 
  str_subset("(..)\\1")

fruit %>% 
  str_subset("(..)\\1") %>% 
  str_match("(..)\\1")
```
[1] "banana"
     [,1]   [,2]
[1,] "anan" "an"
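Backreferences also work in replacement strings; a small sketch (with a made-up phrase) swaps the first two words:

```{r}
library(stringr)

# \\1 and \\2 in the replacement refer back to the captured groups
str_replace("the quick brown fox", "(\\w+) (\\w+)", "\\2 \\1")
```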
  • You can use (?:…), the non-grouping parentheses, to control precedence but not capture the match in a group. This is slightly more efficient than capturing parentheses.
```{r}
str_match(c("grey", "gray"), "gr(e|a)y")

str_match(c("grey", "gray"), "gr(?:e|a)y")
```
     [,1]   [,2]
[1,] "grey" "e" 
[2,] "gray" "a" 
     [,1]  
[1,] "grey"
[2,] "gray"

8 References