M08: Advanced R Data Types and Structures
1 Topics
- Tibbles with tibble
- Strings with stringr
- Factors with forcats
- Dates and Times with lubridate
2 Packages
3 Tibbles
A tibble is an “opinionated data frame that makes working in the tidyverse a little easier.”
If you’re already familiar with data.frame(), note that tibble() does much less:
- it never changes the type of the inputs (e.g. it never converts strings to factors!),
- it never changes the names of variables, and
- it never creates row names.
3.1 How to create a tibble?
3.1.1 Using tibble() function
```{r}
# tibble() is from tibble package, which is part of tidyverse
tibble(
x = 1:5,
y = 1,
z = x ^ 2 + y
)
```
# A tibble: 5 × 3
x y z
<int> <dbl> <dbl>
1 1 1 2
2 2 1 5
3 3 1 10
4 4 1 17
5 5 1 26
- It’s possible for a tibble to have column names that are not valid R variable names, aka non-syntactic names.
- To refer to these variables, you need to surround them with backticks (`).
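The tibble printed below can be created like this (a sketch reconstructed from the output):

```{r}
tb <- tibble(
  `:)` = "smile",
  ` ` = "space",
  `2000` = "number"
)
tb
```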
# A tibble: 1 × 3
`:)` ` ` `2000`
<chr> <chr> <chr>
1 smile space number
- You’ll also need the backticks when working with these variables in other packages, like ggplot2, dplyr, and tidyr.
3.1.2 Using tribble()
tribble(), short for transposed tibble, is customised for data entry in code:
- column headings are defined by formulas (i.e. they start with ~), and
- entries are separated by commas.
- This makes it possible to lay out small amounts of data in an easy-to-read form.
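For example (a minimal sketch with illustrative values):

```{r}
tribble(
  ~x, ~y, ~z,
  "a", 2, 3.6,
  "b", 1, 8.5
)
```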
3.2 Tibbles vs. data.frame
- There are two main differences in the usage of a tibble vs. a classic data.frame: (1) printing and (2) subsetting.
3.2.1 Printing
- Printing a tibble shows only the first 10 rows, and all the columns that fit on screen.
- In addition to its name, each column reports its type, a nice feature borrowed from str():
```{r}
tibble(
a = lubridate::now() + runif(1e3) * 86400,
b = lubridate::today() + runif(1e3) * 30,
c = 1:1e3,
d = runif(1e3),
e = sample(letters, 1e3, replace = TRUE)
)
```
# A tibble: 1,000 × 5
a b c d e
<dttm> <date> <int> <dbl> <chr>
1 2025-03-28 12:06:30 2025-04-01 1 0.779 g
2 2025-03-27 14:13:23 2025-04-04 2 0.256 e
3 2025-03-27 23:33:12 2025-04-23 3 0.167 s
4 2025-03-28 08:26:31 2025-04-08 4 0.0975 y
5 2025-03-28 04:46:30 2025-03-29 5 0.548 k
6 2025-03-27 22:06:59 2025-04-24 6 0.479 e
7 2025-03-27 14:55:36 2025-04-23 7 0.642 q
8 2025-03-27 15:26:08 2025-04-17 8 0.718 k
9 2025-03-28 06:49:33 2025-04-12 9 0.947 x
10 2025-03-28 04:43:40 2025-04-15 10 0.353 y
# ℹ 990 more rows
runif(n, min = 0, max = 1) generates n random deviates from the uniform distribution on [min, max].
Printing customization
- First, you can explicitly print() the data frame and control the number of rows (n) and the width of the display; width = Inf will display all columns:
```{r}
nycflights13::flights %>%
print(n = 10, width = Inf)
# using View() to open the data in a spreadsheet-style viewer
nycflights13::flights %>%
View()
```
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
arr_delay carrier flight tailnum origin dest air_time distance hour minute
<dbl> <chr> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 11 UA 1545 N14228 EWR IAH 227 1400 5 15
2 20 UA 1714 N24211 LGA IAH 227 1416 5 29
3 33 AA 1141 N619AA JFK MIA 160 1089 5 40
4 -18 B6 725 N804JB JFK BQN 183 1576 5 45
5 -25 DL 461 N668DN LGA ATL 116 762 6 0
6 12 UA 1696 N39463 EWR ORD 150 719 5 58
7 19 B6 507 N516JB EWR FLL 158 1065 6 0
8 -14 EV 5708 N829AS LGA IAD 53 229 6 0
9 -8 B6 79 N593JB JFK MCO 140 944 6 0
10 8 AA 301 N3ALAA LGA ORD 138 733 6 0
time_hour
<dttm>
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
3 2013-01-01 05:00:00
4 2013-01-01 05:00:00
5 2013-01-01 06:00:00
6 2013-01-01 05:00:00
7 2013-01-01 06:00:00
8 2013-01-01 06:00:00
9 2013-01-01 06:00:00
10 2013-01-01 06:00:00
# ℹ 336,766 more rows
3.2.2 Subsetting
```{r}
df <- tibble(
x = runif(5),
y = rnorm(5)
)
df
# Extract by name
df$x
df[["x"]]
# Extract by position
df[[1]]
# To use these in a pipe, you'll need to use the special placeholder .:
df %>% .$x
df %>% .[["x"]]
```
# A tibble: 5 × 2
x y
<dbl> <dbl>
1 0.429 -1.26
2 0.471 -0.320
3 0.384 0.00297
4 0.828 -0.0364
5 0.437 -0.162
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
3.2.2.1 Native pipe operator
- The native pipe |> doesn’t wrap its LHS in a function, so to extract a column you need an anonymous function:
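The two outputs below can be produced with a chunk like this (a sketch reusing df from above; requires R >= 4.2 for the \(d) lambda syntax):

```{r}
df |> (\(d) d$x)()
df |> (\(d) d[["x"]])()
```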
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
[1] 0.4292230 0.4714769 0.3836119 0.8279346 0.4367805
cf. subsetting with dplyr
3.3 Interacting with older code
- Some older functions don’t work with tibbles.
- If you encounter one of these functions, use as.data.frame() to turn a tibble back into a data.frame:
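A minimal sketch of the round trip:

```{r}
tb <- tibble(x = 1:3)
df <- as.data.frame(tb)
class(tb)  # "tbl_df" "tbl" "data.frame"
class(df)  # "data.frame"
```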
4 Strings (with stringr package) and regex
- stringr::cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/strings.pdf
- regular expression cheat sheet: https://github.com/rstudio/cheatsheets/blob/main/regex.pdf
4.1 Introduction
- The focus of this chapter will be on regular expressions, or regex for short.
- Regular expressions are useful because strings usually contain unstructured or semi-structured data, and regex are a concise language for describing patterns in strings.
4.2 String basics
- You can create strings with either single quotes or double quotes. Unlike other languages, there is no difference in behavior.
- Use ", unless you want to create a string that contains multiple ".
4.2.2 Use \ to “escape”
- To include a literal single or double quote in a string you can use \ to “escape” it:
```{r}
double_quote <- "\"" # or '"'
double_quote # the printed representation shows the escapes.
cat(double_quote) # cat() prints the string's contents, without quotes or escapes.
"\""
cat("\"")
'"'
cat('"')
single_quote <- '\'' # or "'"
single_quote # the printed representation shows the escapes.
cat(single_quote)
"'"
cat("'")
'\''
cat('\'') # the same as above.
"\\"
cat("\\")
x <- c("\"", "\\")
x # the printed representation shows the escapes as they are.
cat(x)
# cf.
writeLines(x) # To see the raw contents of the string in different lines
```
[1] "\""
"
[1] "\""
"
[1] "\""
"
[1] "'"
'
[1] "'"
'
[1] "'"
'
[1] "\\"
\
[1] "\"" "\\"
" \
"
\
4.2.2 Special characters
Quotes
- Three types of quotes are part of the syntax of R: single and double quotation marks and the backtick (or back quote, `).
- In addition, backslash \ is used to escape the following character inside character constants.
- Single and double quotes delimit character constants. They can be used interchangeably, but double quotes are preferred (and character constants are printed using double quotes), so single quotes are normally only used to delimit character constants containing double quotes.
- Backslash is used to start an escape sequence inside character constants. Escaping a character not in the following table is an error.
- Single quotes need to be escaped by backslash in single-quoted strings, and double quotes in double-quoted strings.
`\n` newline (aka 'line feed')
`\r` carriage return
`\t` tab
`\b` backspace
`\a` alert (bell)
`\f` form feed
`\v` vertical tab
`\\` backslash '\'
`\'` ASCII apostrophe '''
`\"` ASCII quotation mark '"'
`\`` ASCII grave accent (backtick) '`'
4.2.3 String length
- Base R contains many functions to work with strings, but we’ll avoid them because they can be inconsistent, which makes them hard to remember.
- Instead, we’ll use functions from stringr. These have more intuitive names, and all start with str_.
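For example, str_length() returns the number of characters in each string (and NA stays NA):

```{r}
str_length(c("a", "R for data science", NA))
# [1]  1 18 NA
```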
4.2.4 Combining strings
4.2.4.1 str_replace_na()
Like most other functions in R, missing values are contagious. If you want them to print as “NA”, use str_replace_na():
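A sketch of the contrast:

```{r}
x <- c("abc", NA)
str_c("|-", x, "-|")                  # [1] "|-abc-|" NA
str_c("|-", str_replace_na(x), "-|")  # [1] "|-abc-|" "|-NA-|"
```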
4.2.4.2 Objects of length 0
Objects of length 0 are silently dropped. This is particularly useful in conjunction with if:
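A sketch (the classic str_c() greeting example):

```{r}
name <- "Hadley"
time_of_day <- "morning"
birthday <- FALSE
# when birthday is FALSE, the if returns a zero-length value that str_c() drops
str_c("Good ", time_of_day, " ", name,
      if (birthday) " and HAPPY BIRTHDAY", ".")
# [1] "Good morning Hadley."
```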
4.2.4.3 collapse
To collapse a vector of strings into a single string, use collapse
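For example:

```{r}
str_c(c("x", "y", "z"), collapse = ", ")
# [1] "x, y, z"
```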
4.2.5 Subsetting strings
4.2.5.1 str_sub()
str_sub(string, start = 1L, end = -1L)
```{r}
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3) # argument: start, end
# negative numbers count backwards from end
str_sub(x, end = -1)
str_sub(x, 1, -1)
str_sub(x, -3, -1)
```
[1] "App" "Ban" "Pea"
[1] "Apple" "Banana" "Pear"
[1] "Apple" "Banana" "Pear"
[1] "ple" "ana" "ear"
- str_sub() won’t fail if the string is too short: it will just return as much as possible.
- You can also use the assignment form of str_sub() to modify strings:
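For example, lower-casing the first letter of each string in place:

```{r}
x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
# [1] "apple"  "banana" "pear"
```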
4.2.6 Locales
str_to_upper()
str_to_title()
str_to_lower()
```{r}
# Turkish has two i's: with and without a dot, and it
# has a different rule for capitalising them:
str_to_upper(c("i", "ı"))
str_to_upper(c("i", "ı"), locale = "tr")
```
[1] "I" "I"
[1] "İ" "I"
- The base R order() and sort() functions sort strings using the current locale.
- If you want robust behaviour across different computers, you may want to use str_sort() and str_order(), which take an additional locale argument:
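For example (the Hawaiian-locale example from R for Data Science):

```{r}
x <- c("apple", "eggplant", "banana")
str_sort(x, locale = "en")  # English
str_sort(x, locale = "haw") # Hawaiian sorts "eggplant" before "banana"
```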
4.2.7 str_wrap()
str_wrap(string, width = 80, indent = 0, exdent = 0)
```{r}
thanks_path <- file.path(R.home("doc"), "THANKS")
thanks <- str_c(readLines(thanks_path), collapse = "\n")
thanks <- word(thanks, 1, 3, fixed("\n\n"))
cat(str_wrap(thanks), "\n")
cat(str_wrap(thanks, width = 40), "\n")
cat(str_wrap(thanks, width = 60, indent = 2), "\n")
cat(str_wrap(thanks, width = 60, exdent = 2), "\n")
cat(str_wrap(thanks, width = 0, exdent = 2), "\n")
```
R would not be what it is today without the invaluable help of these people
outside of the (former and current) R Core team, who contributed by donating
code, bug fixes and documentation: Valerio Aimale, Suharto Anggono, Thomas
Baier, Gabe Becker, Henrik Bengtsson, Roger Bivand, Ben Bolker, David Brahm,
G"oran Brostr"om, Patrick Burns, Vince Carey, Saikat DebRoy, Matt Dowle, Brian
D'Urso, Lyndon Drake, Dirk Eddelbuettel, Claus Ekstrom, Sebastian Fischmeister,
John Fox, Paul Gilbert, Yu Gong, Gabor Grothendieck, Frank E Harrell Jr, Peter
M. Haverty, Torsten Hothorn, Robert King, Kjetil Kjernsmo, Roger Koenker,
Philippe Lambert, Jan de Leeuw, Jim Lindsey, Patrick Lindsey, Catherine Loader,
Gordon Maclean, Arni Magnusson, John Maindonald, David Meyer, Ei-ji Nakama,
Jens Oehlschl"agel, Steve Oncley, Richard O'Keefe, Hubert Palme, Roger D. Peng,
Jose' C. Pinheiro, Tony Plate, Anthony Rossini, Jonathan Rougier, Petr Savicky,
Guenther Sawitzki, Marc Schwartz, Arun Srinivasan, Detlef Steuer, Bill Simpson,
Gordon Smyth, Adrian Trapletti, Terry Therneau, Rolf Turner, Bill Venables,
Gregory R. Warnes, Andreas Weingessel, Morten Welinder, James Wettenhall, Simon
Wood, and Achim Zeileis. Others have written code that has been adopted by R and
is acknowledged in the code files, including
R would not be what it is today without
the invaluable help of these people
outside of the (former and current) R
Core team, who contributed by donating
code, bug fixes and documentation:
Valerio Aimale, Suharto Anggono, Thomas
Baier, Gabe Becker, Henrik Bengtsson,
Roger Bivand, Ben Bolker, David Brahm,
G"oran Brostr"om, Patrick Burns, Vince
Carey, Saikat DebRoy, Matt Dowle,
Brian D'Urso, Lyndon Drake, Dirk
Eddelbuettel, Claus Ekstrom, Sebastian
Fischmeister, John Fox, Paul Gilbert,
Yu Gong, Gabor Grothendieck, Frank E
Harrell Jr, Peter M. Haverty, Torsten
Hothorn, Robert King, Kjetil Kjernsmo,
Roger Koenker, Philippe Lambert, Jan
de Leeuw, Jim Lindsey, Patrick Lindsey,
Catherine Loader, Gordon Maclean,
Arni Magnusson, John Maindonald,
David Meyer, Ei-ji Nakama, Jens
Oehlschl"agel, Steve Oncley, Richard
O'Keefe, Hubert Palme, Roger D. Peng,
Jose' C. Pinheiro, Tony Plate, Anthony
Rossini, Jonathan Rougier, Petr Savicky,
Guenther Sawitzki, Marc Schwartz, Arun
Srinivasan, Detlef Steuer, Bill Simpson,
Gordon Smyth, Adrian Trapletti, Terry
Therneau, Rolf Turner, Bill Venables,
Gregory R. Warnes, Andreas Weingessel,
Morten Welinder, James Wettenhall, Simon
Wood, and Achim Zeileis. Others have
written code that has been adopted by R
and is acknowledged in the code files,
including
R would not be what it is today without the invaluable
help of these people outside of the (former and current)
R Core team, who contributed by donating code, bug fixes
and documentation: Valerio Aimale, Suharto Anggono, Thomas
Baier, Gabe Becker, Henrik Bengtsson, Roger Bivand, Ben
Bolker, David Brahm, G"oran Brostr"om, Patrick Burns,
Vince Carey, Saikat DebRoy, Matt Dowle, Brian D'Urso,
Lyndon Drake, Dirk Eddelbuettel, Claus Ekstrom, Sebastian
Fischmeister, John Fox, Paul Gilbert, Yu Gong, Gabor
Grothendieck, Frank E Harrell Jr, Peter M. Haverty,
Torsten Hothorn, Robert King, Kjetil Kjernsmo, Roger
Koenker, Philippe Lambert, Jan de Leeuw, Jim Lindsey,
Patrick Lindsey, Catherine Loader, Gordon Maclean, Arni
Magnusson, John Maindonald, David Meyer, Ei-ji Nakama,
Jens Oehlschl"agel, Steve Oncley, Richard O'Keefe, Hubert
Palme, Roger D. Peng, Jose' C. Pinheiro, Tony Plate, Anthony
Rossini, Jonathan Rougier, Petr Savicky, Guenther Sawitzki,
Marc Schwartz, Arun Srinivasan, Detlef Steuer, Bill Simpson,
Gordon Smyth, Adrian Trapletti, Terry Therneau, Rolf Turner,
Bill Venables, Gregory R. Warnes, Andreas Weingessel, Morten
Welinder, James Wettenhall, Simon Wood, and Achim Zeileis.
Others have written code that has been adopted by R and is
acknowledged in the code files, including
R would not be what it is today without the invaluable help
of these people outside of the (former and current) R
Core team, who contributed by donating code, bug fixes
and documentation: Valerio Aimale, Suharto Anggono, Thomas
Baier, Gabe Becker, Henrik Bengtsson, Roger Bivand, Ben
Bolker, David Brahm, G"oran Brostr"om, Patrick Burns,
Vince Carey, Saikat DebRoy, Matt Dowle, Brian D'Urso,
Lyndon Drake, Dirk Eddelbuettel, Claus Ekstrom, Sebastian
Fischmeister, John Fox, Paul Gilbert, Yu Gong, Gabor
Grothendieck, Frank E Harrell Jr, Peter M. Haverty,
Torsten Hothorn, Robert King, Kjetil Kjernsmo, Roger
Koenker, Philippe Lambert, Jan de Leeuw, Jim Lindsey,
Patrick Lindsey, Catherine Loader, Gordon Maclean, Arni
Magnusson, John Maindonald, David Meyer, Ei-ji Nakama,
Jens Oehlschl"agel, Steve Oncley, Richard O'Keefe, Hubert
Palme, Roger D. Peng, Jose' C. Pinheiro, Tony Plate,
Anthony Rossini, Jonathan Rougier, Petr Savicky, Guenther
Sawitzki, Marc Schwartz, Arun Srinivasan, Detlef Steuer,
Bill Simpson, Gordon Smyth, Adrian Trapletti, Terry
Therneau, Rolf Turner, Bill Venables, Gregory R. Warnes,
Andreas Weingessel, Morten Welinder, James Wettenhall,
Simon Wood, and Achim Zeileis. Others have written code
that has been adopted by R and is acknowledged in the code
files, including
R
would
not
be
what
it
is
today
... (with width = 0, str_wrap() places each word on its own line; the remaining one-word lines are omitted)
4.2.8 str_trim()
str_trim(string, side = c("both", "left", "right"))
```{r}
cat("\nString with trailing and leading white space\n")
cat("\n\nString with trailing and leading white space\n\n")
str_trim(" String with trailing and leading white space\t")
str_trim("\n\nString with trailing and leading white space\n\n")
```
String with trailing and leading white space
String with trailing and leading white space
[1] "String with trailing and leading white space"
[1] "String with trailing and leading white space"
4.2.9 str_squish()
str_squish(string)
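str_squish() trims leading and trailing whitespace and also collapses repeated internal whitespace to a single space:

```{r}
str_squish("  a   b  c  ")
# [1] "a b c"
```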
4.3 Matching patterns with regular expressions
- Regex are a very terse language that allows you to describe patterns in strings.
- To learn regular expressions, we’ll use str_view(). It takes a character vector and a regular expression, and shows you how they match.
- Once you’ve mastered pattern matching, you’ll learn how to apply those ideas with various stringr functions.
4.3.1 Basic matches
- str_view shows the first match
- str_view(string, pattern, match = NA)
4.3.1.1 (1) exact strings
4.3.1.2 (2) .
- . matches any character (except a newline):
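A minimal sketch of both kinds of match:

```{r}
x <- c("apple", "banana", "pear")
str_view(x, "an")   # exact string
str_view(x, ".a.")  # any character, then a, then any character
```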
4.3.2 Anchors
- By default, regular expressions will match any part of a string.
- It’s often useful to anchor the regular expression so that it matches from the start or end of the string.
4.3.2.1 ^ and $
- You can use:
- ^ to match the start of the string.
- $ to match the end of the string.
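The matches shown below come from a chunk along these lines (reconstructed from the output):

```{r}
x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
```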
[1] │ <a>pple
[2] │ banan<a>
- To force a regular expression to only match a complete string, anchor it with both ^ and $:
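A sketch of the difference:

```{r}
x <- c("apple pie", "apple", "apple cake")
str_view(x, "apple")    # matches in all three strings
str_view(x, "^apple$")  # matches only the complete string "apple"
```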
4.3.2.2 boundary with \b
- You can also match the boundary between words with \b.
- e.g. search for \bsum\b to avoid matching summarise, summary, rowsum and so on.
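A sketch (remember the extra backslash needed in the R string):

```{r}
str_view(c("summary(x)", "sum(x)", "rowsum(x)"), "\\bsum\\b")
```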
4.3.3 Character classes and alternatives
- There are a number of special patterns that match more than one character. You’ve already seen ., which matches any character apart from a newline. There are four other useful tools:
- \d: matches any digit.
- \s: matches any whitespace (e.g. space, tab, newline).
- [abc]: matches a, b, or c.
- [^abc]: matches anything except a, b, or c.
- Remember, to create a regular expression containing \d or \s, you’ll need to escape the \ for the string, so you’ll type \\d or \\s.
- A character class containing a single character is a nice alternative to backslash escapes when you want to include a single metacharacter in a regex.
```{r}
# Look for a literal character that normally has special meaning in a regex
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c") # same as below.
str_view(c("abc", "a.c", "a*c", "a c"), "a\\.c")
str_view(c("abc", "a.c", "a*c", "a c"), ".[*]c") # same as below
str_view(c("abc", "a.c", "a*c", "a c"), ".\\*c")
str_view(c("abc", "a.c", "a*c", "a c"), "a[ ]") # same as below
str_view(c("abc", "a.c", "a*c", "a c"), "a\\ ")
```
[2] │ <a.c>
[2] │ <a.c>
[3] │ <a*c>
[3] │ <a*c>
[4] │ <a >c
[4] │ <a >c
4.3.3.1 Alternation
- You can use alternation, |, to pick between one or more alternative patterns.
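For example:

```{r}
str_view(c("grey", "gray"), "gr(e|a)y")
```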
4.3.4 Repetition (= Quantifiers)
The next step up in power involves controlling how many times a pattern matches:
- ?: 0 or 1
- +: 1 or more
- *: 0 or more
```{r}
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
str_view(x, "CC+")
str_view(x, "CC*") #greedy
str_view(x, 'C[LX]+') #[LX]: matches either L or X
str_extract(x, "CC?") # greedy
str_extract(x, "CC+") # greedy
str_extract(x, 'C[LX]+') # greedy
```
[1] │ 1888 is the longest year in Roman numerals: MD<CC><C>LXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MDCC<CLXXX>VIII
[1] "CC"
[1] "CCC"
[1] "CLXXX"
You can also specify the number of matches precisely:
- {n}: exactly n
- {n,}: n or more
- {,m}: at most m
- {n,m}: between n and m
```{r}
str_view(x, "C{2}") #
str_view(x, "C{2,}") # greedy
str_view(x, "C{2,3}") # greedy
str_extract(x, "C{2}")
str_extract(x, "C{2,}")
str_extract(x, "C{2,3}")
```
[1] │ 1888 is the longest year in Roman numerals: MD<CC>CLXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MD<CCC>LXXXVIII
[1] "CC"
[1] "CCC"
[1] "CCC"
- By default these matches are “greedy”: they will match the longest string possible.
- You can make them “lazy”, matching the shortest string possible, by putting a ? after them.
- This is an advanced feature of regular expressions, but it’s useful to know that it exists:
- ??: 0 or 1, prefer 0.
- +?: 1 or more, match as few times as possible.
- *?: 0 or more, match as few times as possible.
- {n,}?: n or more, match as few times as possible.
- {n,m}?: between n and m, match as few times as possible, but at least n.
```{r}
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "CC?")
# cf
str_view(x, "CC??")
str_view(x, 'C{2,3}?') # lazy
str_view(x, 'C[LX]+?') # lazy
str_view(x, 'C[LX]+') # greedy
str_extract(x, c("C{2,3}", "C{2,3}?"))
str_extract(x, c("C[LX]+", "C[LX]+?"))
```
[1] │ 1888 is the longest year in Roman numerals: MD<CC><C>LXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MD<C><C><C>LXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MD<CC>CLXXXVIII
[1] │ 1888 is the longest year in Roman numerals: MDCC<CL>XXXVIII
[1] │ 1888 is the longest year in Roman numerals: MDCC<CLXXX>VIII
[1] "CCC" "CC"
[1] "CLXXX" "CL"
4.3.5 Grouping and backreferences
- Parentheses also create a numbered capturing group (number 1, 2 etc.).
- A capturing group stores the part of the string matched by the part of the regular expression inside the parentheses.
- You can refer to the same text as previously matched by a capturing group with backreferences, like \1, \2 etc.
```{r}
str_view(fruit, "(..)\\1", match = TRUE) # "\\1" means group 1
str_view(fruit, "(.)\\1", match = TRUE) # "(.)" means one character
str_view(fruit, "(..)\1", match = TRUE) # wrong. needs \\
str_view(fruit, "(.)(.)\\2\\1", match = TRUE)
```
[4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry
[1] │ a<pp>le
[5] │ be<ll> pe<pp>er
[6] │ bilbe<rr>y
[7] │ blackbe<rr>y
[8] │ blackcu<rr>ant
[9] │ bl<oo>d orange
[10] │ bluebe<rr>y
[11] │ boysenbe<rr>y
[16] │ che<rr>y
[17] │ chili pe<pp>er
[19] │ cloudbe<rr>y
[21] │ cranbe<rr>y
[23] │ cu<rr>ant
[28] │ e<gg>plant
[29] │ elderbe<rr>y
[32] │ goji be<rr>y
[33] │ g<oo>sebe<rr>y
[38] │ hucklebe<rr>y
[47] │ lych<ee>
[50] │ mulbe<rr>y
... and 9 more
[5] │ bell p<eppe>r
[17] │ chili p<eppe>r
4.3.6 Look arounds
These assertions look ahead or behind the current match without “consuming” any characters (i.e. changing the input position).
- (?=...): positive look-ahead assertion ("followed by"). Matches if ... matches at the current input.
- (?!...): negative look-ahead assertion ("not followed by"). Matches if ... does not match at the current input.
- (?<=...): positive look-behind assertion ("preceded by"). Matches if ... matches text preceding the current position, with the last character of the match being the character just before the current position. Length must be bounded (i.e. no * or +).
- (?<!...): negative look-behind assertion ("not preceded by"). Matches if ... does not match text preceding the current position. Length must be bounded (i.e. no * or +).
These are useful when you want to check that a pattern exists, but you don’t want to include it in the result:
```{r}
x <- c("1 piece", "2 pieces", "3")
str_extract(x, "\\d+(?= pieces?)") # positive look-ahead assertion: followed by
x1 <- c("piece 1", "pieces 2", "3")
str_extract(x1, "(?<=pieces?) \\d") # positive look-behind assertion: preceded by
y <- c("100", "$400")
str_extract(y, "(?<=\\$)\\d+") # positive look-behind assertion: preceded by
```
[1] "1" "2" NA
[1] " 1" " 2" NA
[1] NA "400"
4.4 Tools
4.4.1 str_detect()
4.4.1.1 (1) Overview
- Detect matches.
- It returns a logical vector (TRUE = 1; FALSE = 0) the same length as the input.
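A minimal sketch:

```{r}
x <- c("apple", "banana", "pear")
str_detect(x, "e")
# [1]  TRUE FALSE  TRUE
```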
4.4.1.2 (2) counting and proportion
- That makes sum() and mean() useful if you want to answer questions about matches across a larger vector:
```{r}
# How many common words start with t in words dataset.
class(words)
as_tibble(words)
sum(str_detect(words, "^t"))
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
```
[1] "character"
# A tibble: 980 × 1
value
<chr>
1 a
2 able
3 about
4 absolute
5 accept
6 account
7 achieve
8 across
9 act
10 active
# ℹ 970 more rows
[1] 65
[1] 0.2765306
- When you have complex logical conditions (e.g. match a or b but not c unless d), it’s often easier to combine multiple str_detect() calls with logical operators, rather than trying to create a single regular expression.
- For example, here are two ways to find all words that don’t contain any vowels:
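A sketch of the two equivalent approaches:

```{r}
no_vowels_1 <- !str_detect(words, "[aeiou]")
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)  # TRUE
```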
4.4.1.3 (3) str_subset()
- A common use of str_detect() is to select the elements that match a pattern.
- You can do this with logical subsetting, or the convenient str_subset() wrapper:
```{r}
str_detect(words, "x$")
words[str_detect(words, "x$")] # logical subsetting
str_subset(words, "x$") # wrapper of str_detect()
```
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
... (980 logical values in all; TRUE only at positions 108, 747, 772, and 841)
[1] "box" "sex" "six" "tax"
[1] "box" "sex" "six" "tax"
4.4.1.4 (4) with filter()
in data frame
```{r}
df <- tibble(
word = words,
i = seq_along(word)
)
df
df %>%
filter(str_detect(word, "x$"))
```
# A tibble: 980 × 2
word i
<chr> <int>
1 a 1
2 able 2
3 about 3
4 absolute 4
5 accept 5
6 account 6
7 achieve 7
8 across 8
9 act 9
10 active 10
# ℹ 970 more rows
# A tibble: 4 × 2
word i
<chr> <int>
1 box 108
2 sex 747
3 six 772
4 tax 841
4.4.1.5 (5) str_count()
- str_count(): rather than a simple yes or no, it tells you how many matches there are in a string:
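For example:

```{r}
x <- c("apple", "banana", "pear")
str_count(x, "a")
# [1] 1 3 1
```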
4.4.1.6 (6) str_count() with mutate()
```{r}
df %>%
mutate(
vowels = str_count(word, "[aeiou]"),
consonants = str_count(word, "[^aeiou]") # matches anything except a, e, i, o, u.
)
```
# A tibble: 980 × 4
word i vowels consonants
<chr> <int> <int> <int>
1 a 1 1 0
2 able 2 2 2
3 about 3 3 2
4 absolute 4 4 4
5 accept 5 2 4
6 account 6 3 4
7 achieve 7 4 3
8 across 8 2 4
9 act 9 1 2
10 active 10 3 3
# ℹ 970 more rows
4.4.1.7 (7) No pattern overlapping
- Note that matches never overlap. For example, in “abababa”, how many times will the pattern “aba” match?
- Regular expressions say two, not three:
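A quick check (assuming stringr is loaded):

```{r}
str_count("abababa", "aba") # returns 2, not 3: matches never overlap
```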
- Many stringr functions come in pairs: one function works with a single match, and the other works with all matches. The second function has the suffix _all.
4.4.2 str_extract()
4.4.2.1 (1) Overview
- str_extract() extracts the text of the match.
- Note that str_extract() only extracts the first match.
```{r}
# sentences dataset from stringr: Sample character vectors for practicing string manipulations.
class(sentences)
length(sentences)
head(sentences)
```
[1] "character"
[1] 720
[1] "The birch canoe slid on the smooth planks."
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."
[4] "These days a chicken leg is a rare dish."
[5] "Rice is often served in round bowls."
[6] "The juice of lemons makes fine punch."
Goal: Find all sentences that contain a colour.
```{r}
colors <- c("red", "orange", "yellow", "green", "blue", "purple")
color_match <- str_c(colors, collapse = "|")
color_match
has_color <- str_subset(sentences, color_match)
has_color
matches <- str_extract(has_color, color_match)
head(matches)
```
[1] "red|orange|yellow|green|blue|purple"
[1] "Glue the sheet to the dark blue background."
[2] "Two blue fish swam in the tank."
[3] "The colt reared and threw the tall rider."
[4] "The wide road shimmered in the hot sun."
[5] "See the cat glaring at the scared mouse."
[6] "A wisp of cloud hung in the blue air."
[7] "Leaves turn brown and yellow in the fall."
[8] "He ordered peach pie with ice cream."
[9] "Pure bred poodles have curls."
[10] "The spot on the blotter was made by green ink."
[11] "Mud was spattered on the front of his white shirt."
[12] "The sofa cushion is red and of light weight."
[13] "The sky that morning was clear and bright blue."
[14] "Torn scraps littered the stone floor."
[15] "The doctor cured him with these pills."
[16] "The new girl was fired today at noon."
[17] "The third act was dull and tired the players."
[18] "A blue crane is a tall wading bird."
[19] "Live wires should be kept covered."
[20] "It is hard to erase blue or red ink."
[21] "The wreck occurred by the bank on Main Street."
[22] "The lamp shone with a steady green flame."
[23] "The box is held by a bright red snapper."
[24] "The prince ordered his head chopped off."
[25] "The houses are built of red clay bricks."
[26] "The red tape bound the smuggled food."
[27] "Nine men were hired to dig the ruins."
[28] "The flint sputtered and lit a pine torch."
[29] "Hedge apples may stain your hands green."
[30] "The old pan was covered with hard fudge."
[31] "The plant grew large and green in the window."
[32] "The store walls were lined with colored frocks."
[33] "The purple tie was ten years old."
[34] "Bathe and relax in the cool green grass."
[35] "The clan gathered on each dull night."
[36] "The lake sparkled in the red hot sun."
[37] "Mark the spot with a sign painted red."
[38] "Smoke poured out of every crack."
[39] "Serve the hot rum to the tired heroes."
[40] "The couch cover and hall drapes were blue."
[41] "He offered proof in the form of a large chart."
[42] "A man in a blue sweater sat at the desk."
[43] "A sip of tea revives his tired friend."
[44] "The door was barred, locked, and bolted as well."
[45] "A thick coat of black paint covered all."
[46] "The small red neon lamp went out."
[47] "Paint the sockets in the wall dull green."
[48] "Wake and rise, and step into the green outdoors."
[49] "The green light in the brown box flickered."
[50] "He put his last cartridge into the gun and fired."
[51] "The ram scared the school children off."
[52] "Tear a thin sheet from the yellow pad."
[53] "Dimes showered down from all sides."
[54] "The sky in the west is tinged with orange red."
[55] "The red paper brightened the dim stage."
[56] "The hail pattered on the burnt brown grass."
[57] "The big red apple fell to the ground."
[1] "blue" "blue" "red" "red" "red" "blue"
4.4.2.2 (2) str_extract_all()
```{r}
more <- sentences[str_count(sentences, color_match) > 1]
str_view(more, color_match)
# a single match lets you use a simpler data structure (a character vector)
str_extract(more, color_match)
#To get all matches, use str_extract_all(). It returns a list:
str_extract_all(more, color_match)
#If you use simplify = TRUE, str_extract_all() will return a matrix with short matches expanded to the same length as the longest:
str_extract_all(more, color_match, simplify = TRUE)
x <- c("a", "a b", "a b c")
str_extract_all(x, "[a-z]", simplify = TRUE)
```
[1] │ It is hard to erase <blue> or <red> ink.
[2] │ The <green> light in the brown box flicke<red>.
[3] │ The sky in the west is tinged with <orange> <red>.
[1] "blue" "green" "orange"
[[1]]
[1] "blue" "red"
[[2]]
[1] "green" "red"
[[3]]
[1] "orange" "red"
[,1] [,2]
[1,] "blue" "red"
[2,] "green" "red"
[3,] "orange" "red"
[,1] [,2] [,3]
[1,] "a" "" ""
[2,] "a" "b" ""
[3,] "a" "b" "c"
4.4.3 Grouped matches
- You can also use parentheses to extract parts of a complex match.
- For example, imagine we want to extract nouns from the sentences.
- As a heuristic, we’ll look for any word that comes after “a” or “the”.
- Defining a “word” in a regular expression is a little tricky, so here I use a simple approximation:
- a sequence of at least one character that isn’t a space.
4.4.3.1 (1) str_extract()
It gives us the complete match:
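The chunk for this subsection is not shown; a likely reconstruction (following R for Data Science):

```{r}
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>%
  str_extract(noun) # complete match only, e.g. "the smooth", "the sheet", ...
```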
4.4.3.2 (2) str_match()
- gives each individual component
- it returns a matrix, with one column for the complete match followed by one column for each group
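The output below was likely produced by something like this (a sketch; the pattern matches the same noun heuristic as above):

```{r}
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>%
  str_match(noun)     # matrix: full match, then one column per group
has_noun %>%
  str_match_all(noun) # list with one matrix per string, covering all matches
```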
[,1] [,2] [,3]
[1,] "the smooth" "the" "smooth"
[2,] "the sheet" "the" "sheet"
[3,] "the depth" "the" "depth"
[4,] "a chicken" "a" "chicken"
[5,] "the parked" "the" "parked"
[6,] "the sun" "the" "sun"
[7,] "the huge" "the" "huge"
[8,] "the ball" "the" "ball"
[9,] "the woman" "the" "woman"
[10,] "a helps" "a" "helps"
[[1]]
[,1] [,2] [,3]
[1,] "the smooth" "the" "smooth"
[[2]]
[,1] [,2] [,3]
[1,] "the sheet" "the" "sheet"
[2,] "the dark" "the" "dark"
[[3]]
[,1] [,2] [,3]
[1,] "the depth" "the" "depth"
[2,] "a well." "a" "well."
[[4]]
[,1] [,2] [,3]
[1,] "a chicken" "a" "chicken"
[2,] "a rare" "a" "rare"
[[5]]
[,1] [,2] [,3]
[1,] "the parked" "the" "parked"
[[6]]
[,1] [,2] [,3]
[1,] "the sun" "the" "sun"
[[7]]
[,1] [,2] [,3]
[1,] "the huge" "the" "huge"
[2,] "the clear" "the" "clear"
[[8]]
[,1] [,2] [,3]
[1,] "the ball" "the" "ball"
[[9]]
[,1] [,2] [,3]
[1,] "the woman" "the" "woman"
[[10]]
[,1] [,2] [,3]
[1,] "a helps" "a" "helps"
[2,] "the evening." "the" "evening."
4.4.3.3 (3) tidyr::extract()
If your data is in a tibble, it’s often easier to use tidyr::extract().
- It works like str_match() but requires you to name the matches, which are then placed in new columns.
- Usage:
extract(
data,
col,
into,
regex = "([[:alnum:]]+)",
remove = TRUE,
convert = FALSE,
...
)
```{r}
tibble(sentence = sentences) %>%
tidyr::extract(
sentence, c("article", "noun"), "(a|the) ([^ ]+)",
remove = FALSE # If TRUE, remove input column from output data frame.
)
```
# A tibble: 720 × 3
sentence article noun
<chr> <chr> <chr>
1 The birch canoe slid on the smooth planks. the smooth
2 Glue the sheet to the dark blue background. the sheet
3 It's easy to tell the depth of a well. the depth
4 These days a chicken leg is a rare dish. a chicken
5 Rice is often served in round bowls. <NA> <NA>
6 The juice of lemons makes fine punch. <NA> <NA>
7 The box was thrown beside the parked truck. the parked
8 The hogs were fed chopped corn and garbage. <NA> <NA>
9 Four hours of steady work faced us. <NA> <NA>
10 A large size in stockings is hard to sell. <NA> <NA>
# ℹ 710 more rows
4.4.4 Replacing matches: str_replace()
4.4.4.1 (1) Simple
- str_replace() and str_replace_all() allow you to replace matches with new strings.
- The simplest use is to replace a pattern with a fixed string:
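For example:

```{r}
x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")     # first vowel only: "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-") # every vowel: "-ppl-" "p--r" "b-n-n-"
```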
4.4.4.2 (2) Named vector
- With str_replace_all() you can perform multiple replacements by supplying a named vector.
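For example:

```{r}
x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
# "one house" "two cars" "three people"
```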
4.4.4.3 (3) Insert components of match
- Instead of replacing with a fixed string you can use backreferences to insert components of the match.
- Ex) flip the order of the second and third words.
```{r}
sentences %>%
head(5)
sentences %>%
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
head(5)
```
[1] "The birch canoe slid on the smooth planks."
[2] "Glue the sheet to the dark blue background."
[3] "It's easy to tell the depth of a well."
[4] "These days a chicken leg is a rare dish."
[5] "Rice is often served in round bowls."
[1] "The canoe birch slid on the smooth planks."
[2] "Glue sheet the to the dark blue background."
[3] "It's to easy tell the depth of a well."
[4] "These a days chicken leg is a rare dish."
[5] "Rice often is served in round bowls."
4.4.5 Splitting: str_split()
4.4.5.1 (1) Sentences to words
str_split() splits a string up into pieces. For example, we could split sentences into words:
```{r}
# returns a list
sentences %>%
head(5) %>%
str_split("") # one character at a time
sentences %>%
head(5) %>%
str_split(" ") # one word at a time
sentences %>%
head(5) %>%
str_split(boundary("sentence")) # one sentence at a time
```
[[1]]
[1] "T" "h" "e" " " "b" "i" "r" "c" "h" " " "c" "a" "n" "o" "e" " " "s" "l" "i"
[20] "d" " " "o" "n" " " "t" "h" "e" " " "s" "m" "o" "o" "t" "h" " " "p" "l" "a"
[39] "n" "k" "s" "."
[[2]]
[1] "G" "l" "u" "e" " " "t" "h" "e" " " "s" "h" "e" "e" "t" " " "t" "o" " " "t"
[20] "h" "e" " " "d" "a" "r" "k" " " "b" "l" "u" "e" " " "b" "a" "c" "k" "g" "r"
[39] "o" "u" "n" "d" "."
[[3]]
[1] "I" "t" "'" "s" " " "e" "a" "s" "y" " " "t" "o" " " "t" "e" "l" "l" " " "t"
[20] "h" "e" " " "d" "e" "p" "t" "h" " " "o" "f" " " "a" " " "w" "e" "l" "l" "."
[[4]]
[1] "T" "h" "e" "s" "e" " " "d" "a" "y" "s" " " "a" " " "c" "h" "i" "c" "k" "e"
[20] "n" " " "l" "e" "g" " " "i" "s" " " "a" " " "r" "a" "r" "e" " " "d" "i" "s"
[39] "h" "."
[[5]]
[1] "R" "i" "c" "e" " " "i" "s" " " "o" "f" "t" "e" "n" " " "s" "e" "r" "v" "e"
[20] "d" " " "i" "n" " " "r" "o" "u" "n" "d" " " "b" "o" "w" "l" "s" "."
[[1]]
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth"
[8] "planks."
[[2]]
[1] "Glue" "the" "sheet" "to" "the"
[6] "dark" "blue" "background."
[[3]]
[1] "It's" "easy" "to" "tell" "the" "depth" "of" "a" "well."
[[4]]
[1] "These" "days" "a" "chicken" "leg" "is" "a"
[8] "rare" "dish."
[[5]]
[1] "Rice" "is" "often" "served" "in" "round" "bowls."
[[1]]
[1] "The birch canoe slid on the smooth planks."
[[2]]
[1] "Glue the sheet to the dark blue background."
[[3]]
[1] "It's easy to tell the depth of a well."
[[4]]
[1] "These days a chicken leg is a rare dish."
[[5]]
[1] "Rice is often served in round bowls."
4.4.5.2 (2) Splitting as an element
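The chunk for this subsection appears to be missing; a sketch of extracting a single element from the list str_split() returns:

```{r}
"a|b|c|d" %>%
  str_split("\\|") %>%
  .[[1]]
# "a" "b" "c" "d"
```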
4.4.5.3 (3) Splitting as a matrix
Return a matrix with simplify = TRUE:
```{r}
sentences %>%
head(5) %>%
str_split(" ", simplify = TRUE)
fields <- c("Name: Hadley", "Country: NZ", "Age: 35")
fields
str(fields)
fields %>% str_split(": ", n = 2, simplify = TRUE) %>% # n = 2: maximum number of pieces to return
tibble()
```
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "The" "birch" "canoe" "slid" "on" "the" "smooth" "planks."
[2,] "Glue" "the" "sheet" "to" "the" "dark" "blue" "background."
[3,] "It's" "easy" "to" "tell" "the" "depth" "of" "a"
[4,] "These" "days" "a" "chicken" "leg" "is" "a" "rare"
[5,] "Rice" "is" "often" "served" "in" "round" "bowls." ""
[,9]
[1,] ""
[2,] ""
[3,] "well."
[4,] "dish."
[5,] ""
[1] "Name: Hadley" "Country: NZ" "Age: 35"
chr [1:3] "Name: Hadley" "Country: NZ" "Age: 35"
# A tibble: 3 × 1
.[,1] [,2]
<chr> <chr>
1 Name Hadley
2 Country NZ
3 Age 35
4.4.5.4 (4) boundary("word")
Instead of splitting up strings by patterns, you can also split up by character, line, sentence, and word with boundary():
boundary(
type = c("character", "line_break", "sentence", "word"),
skip_word_none = NA,
...
)
```{r}
x <- "This is a sentence. This is another sentence."
str_view(x, boundary("word"))
str_split(x, boundary("word")) # better outcome than the one below
str_split(x, " ") # not as good as above
words <- c("These are   some words.") # note the repeated spaces between "are" and "some"
str_count(words, boundary("word"))
str_split(words, " ")[[1]]
str_split(words, boundary("word"))[[1]]
```
[1] │ <This> <is> <a> <sentence>. <This> <is> <another> <sentence>.
[[1]]
[1] "This" "is" "a" "sentence" "This" "is" "another"
[8] "sentence"
[[1]]
[1] "This" "is" "a" "sentence." "" "This"
[7] "is" "another" "sentence."
[1] 4
[1] "These" "are" "" "" "some" "words."
[1] "These" "are" "some" "words"
4.4.6 Find matches with str_locate()
- str_locate() and str_locate_all() give you the starting and ending positions of each match.
- These are particularly useful when none of the other functions does exactly what you want.
- You can use str_locate() to find the matching pattern, str_sub() to extract and/or modify them.
```{r}
fruit <- c("apple", "banana", "pear", "pineapple")
str_locate(fruit, "$")
str_locate(fruit, "a")
str_locate(fruit, "e")
str_locate(fruit, c("a", "b", "p", "p"))
str_locate_all(fruit, "a")
str_locate_all(fruit, "e")
str_locate_all(fruit, c("a", "b", "p", "p"))
# Find location of every character
str_locate_all(fruit, "")
```
start end
[1,] 6 5
[2,] 7 6
[3,] 5 4
[4,] 10 9
start end
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 5 5
start end
[1,] 5 5
[2,] NA NA
[3,] 2 2
[4,] 4 4
start end
[1,] 1 1
[2,] 1 1
[3,] 1 1
[4,] 1 1
[[1]]
start end
[1,] 1 1
[[2]]
start end
[1,] 2 2
[2,] 4 4
[3,] 6 6
[[3]]
start end
[1,] 3 3
[[4]]
start end
[1,] 5 5
[[1]]
start end
[1,] 5 5
[[2]]
start end
[[3]]
start end
[1,] 2 2
[[4]]
start end
[1,] 4 4
[2,] 9 9
[[1]]
start end
[1,] 1 1
[[2]]
start end
[1,] 1 1
[[3]]
start end
[1,] 1 1
[[4]]
start end
[1,] 1 1
[2,] 6 6
[3,] 7 7
[[1]]
start end
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 5 5
[[2]]
start end
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 5 5
[6,] 6 6
[[3]]
start end
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[[4]]
start end
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 4 4
[5,] 5 5
[6,] 6 6
[7,] 7 7
[8,] 8 8
[9,] 9 9
4.5 Other types of pattern
4.5.1 regex()
When you use a pattern that’s a string, it’s automatically wrapped into a call to regex()
```{r}
# The regular call:
str_view(fruit, "nana")
# Is shorthand for
str_view(fruit, regex("nana"))
```
[2] │ ba<nana>
[2] │ ba<nana>
You can use the other arguments of regex() to control details of the match
4.5.1.1 (1) ignore_case = TRUE
allows characters to match either their uppercase or lowercase forms. This always uses the current locale.
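For example:

```{r}
bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")                          # matches only the lowercase form
str_view(bananas, regex("banana", ignore_case = TRUE)) # matches all three
```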
4.5.1.2 (2) multiline = TRUE
allows ^ and $ to match the start and end of each line rather than the start and end of the complete string.
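For example:

```{r}
x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]                          # "Line"
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]] # "Line" "Line" "Line"
```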
4.6 Other uses of regular expressions
There are two useful functions in base R that also use regular expressions:
4.6.1 apropos()
- searches all objects available from the global environment.
- This is useful if you can’t quite remember the name of the function.
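The output below was likely produced by calls such as (results depend on which packages are attached):

```{r}
apropos("replace")
apropos("max")
```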
[1] "%+replace%" "replace" "replace_na" "setReplaceMethod"
[5] "str_replace" "str_replace_all" "str_replace_na" "theme_replace"
[1] "cummax" "max" "max.col" "max_height" "max_width"
[6] "mem.maxNSize" "mem.maxVSize" "pmax" "pmax.int" "promax"
[11] "slice_max" "varimax" "which.max"
4.6.2 dir()
- lists all the files in a directory.
- The pattern argument takes a regular expression and only returns file names that match the pattern.
- For example, you can find all the quarto Markdown files in the current directory with the following.
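The chunk itself is not shown; a sketch, assuming Quarto files use the .qmd extension:

```{r}
dir(pattern = "\\.qmd$")
```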
5 Factors with forcats
Factors with forcats cheat sheet: http://www.flutterbys.com.au/stats/downloads/slides/figure/factors.pdf
5.1 Introduction
Historically, factors were much easier to work with than characters. As a result, many of the functions in base R automatically convert characters to factors. This means that factors often crop up in places where they’re not actually helpful. Fortunately, you don’t need to worry about that in the tidyverse, and can focus on situations where factors are genuinely useful.
5.2 Creating factors
5.2.1 (1) factor()
```{r}
x1 <- c("Dec", "Apr", "Jan", "Mar")
month_levels <- c(
"Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
y1 <- factor(x1, levels = month_levels)
y1
sort(y1)
# any values not in the set will be silently converted to NA:
x2 <- c("Dec", "Apr", "Jam", "Mar")
y2 <- factor(x2, levels = month_levels)
y2
# If you omit the levels, they'll be taken from the data in alphabetical order:
factor(x1)
```
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
[1] Dec Apr <NA> Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
[1] Dec Apr Jan Mar
Levels: Apr Dec Jan Mar
5.2.2 (2) readr::parse_factor()
If you want a warning, you can use readr::parse_factor().
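For example (self-contained, repeating the vectors from the chunk above):

```{r}
x2 <- c("Dec", "Apr", "Jam", "Mar")
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)
readr::parse_factor(x2, levels = month_levels) # warns about "Jam", which is not a valid level
```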
5.2.3 (3) unique() or fct_inorder()
Sometimes you’d prefer that the order of the levels match the order of the first appearance in the data. You can do that when creating the factor by setting levels to unique(x), or after the fact, with fct_inorder():
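For example:

```{r}
x1 <- c("Dec", "Apr", "Jan", "Mar")
f1 <- factor(x1, levels = unique(x1)) # levels in order of first appearance
f1
f2 <- x1 %>% factor() %>% fct_inorder() # same result, after the fact
levels(f2) # "Dec" "Apr" "Jan" "Mar"
```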
5.3 Survey data
# A tibble: 21,483 × 9
year marital age race rincome partyid relig denom tvhours
<int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
1 2000 Never married 26 White $8000 to 9999 Ind,near … Prot… Sout… 12
2 2000 Divorced 48 White $8000 to 9999 Not str r… Prot… Bapt… NA
3 2000 Widowed 67 White Not applicable Independe… Prot… No d… 2
4 2000 Never married 39 White Not applicable Ind,near … Orth… Not … 4
5 2000 Divorced 25 White Not applicable Not str d… None Not … 1
6 2000 Married 25 White $20000 - 24999 Strong de… Prot… Sout… NA
7 2000 Never married 36 White $25000 or more Not str r… Chri… Not … 3
8 2000 Divorced 44 White $7000 to 7999 Ind,near … Prot… Luth… NA
9 2000 Married 44 White $25000 or more Not str d… Prot… Other 0
10 2000 Married 47 White $25000 or more Strong re… Prot… Sout… 3
# ℹ 21,473 more rows
year: year of survey, 2000-2014
age: age. Maximum age truncated to 89.
marital: marital status
race: race
rincome: reported income
partyid: party affiliation
relig: religion
denom: denomination
tvhours: hours per day watching tv
```{r}
skimr::skim(gss_cat)
levels(gss_cat$race)
# When factors are stored in a tibble, you can't see their levels so easily.
# One way to see them is with count():
gss_cat %>%
count(race)
# Also with a barplot
ggplot(gss_cat, aes(race)) +
geom_bar()
# By default, ggplot2 will drop levels that don't have any values.
# You can force them to display with:
ggplot(gss_cat, aes(race)) +
geom_bar() +
scale_x_discrete(drop = FALSE)
```
Name | gss_cat |
Number of rows | 21483 |
Number of columns | 9 |
_______________________ | |
Column type frequency: | |
factor | 6 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
marital | 0 | 1 | FALSE | 6 | Mar: 10117, Nev: 5416, Div: 3383, Wid: 1807 |
race | 0 | 1 | FALSE | 3 | Whi: 16395, Bla: 3129, Oth: 1959, Not: 0 |
rincome | 0 | 1 | FALSE | 16 | $25: 7363, Not: 7043, $20: 1283, $10: 1168 |
partyid | 0 | 1 | FALSE | 10 | Ind: 4119, Not: 3690, Str: 3490, Not: 3032 |
relig | 0 | 1 | FALSE | 15 | Pro: 10846, Cat: 5124, Non: 3523, Chr: 689 |
denom | 0 | 1 | FALSE | 30 | Not: 10072, Oth: 2534, No : 1683, Sou: 1536 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 2006.50 | 4.45 | 2000 | 2002 | 2006 | 2010 | 2014 | ▇▃▇▂▆ |
age | 76 | 1.00 | 47.18 | 17.29 | 18 | 33 | 46 | 59 | 89 | ▇▇▇▅▂ |
tvhours | 10146 | 0.53 | 2.98 | 2.59 | 0 | 1 | 2 | 4 | 24 | ▇▂▁▁▁ |
[1] "Other" "Black" "White" "Not applicable"
# A tibble: 3 × 2
race n
<fct> <int>
1 Other 1959
2 Black 3129
3 White 16395
- These levels represent valid values that simply did not occur in this dataset. Unfortunately, dplyr doesn’t yet have a drop option, but it will in the future.
5.3.1 Visualizing survey data: Exercise
5.3.1.1 (1) rincome
- Explore the distribution of rincome (reported income).
- What makes the default bar chart hard to understand?
- How could you improve the plot?
```{r}
gss_cat
gss_cat %>%
ggplot(aes(rincome)) +
geom_bar() +
coord_flip() +
scale_x_discrete(drop = FALSE)
```
# A tibble: 21,483 × 9
year marital age race rincome partyid relig denom tvhours
<int> <fct> <int> <fct> <fct> <fct> <fct> <fct> <int>
1 2000 Never married 26 White $8000 to 9999 Ind,near … Prot… Sout… 12
2 2000 Divorced 48 White $8000 to 9999 Not str r… Prot… Bapt… NA
3 2000 Widowed 67 White Not applicable Independe… Prot… No d… 2
4 2000 Never married 39 White Not applicable Ind,near … Orth… Not … 4
5 2000 Divorced 25 White Not applicable Not str d… None Not … 1
6 2000 Married 25 White $20000 - 24999 Strong de… Prot… Sout… NA
7 2000 Never married 36 White $25000 or more Not str r… Chri… Not … 3
8 2000 Divorced 44 White $7000 to 7999 Ind,near … Prot… Luth… NA
9 2000 Married 44 White $25000 or more Not str d… Prot… Other 0
10 2000 Married 47 White $25000 or more Strong re… Prot… Sout… 3
# ℹ 21,473 more rows
5.3.1.2 (2) relig & partyid
What is the most common relig in this survey? What’s the most common partyid?
5.3.1.3 (3) religion and denomination
Which relig does denom (denomination) apply to? How can you find out with a table? How can you find out with a visualisation?
```{r}
gss_cat %>%
ggplot(aes(denom)) +
geom_bar() +
coord_flip() +
scale_x_discrete(drop = FALSE)
# create a table and visualize it
gss_cat %>%
count(relig, denom) %>%
#view() %>%
ggplot(aes(relig, n, fill = denom)) +
geom_col() +
coord_flip() +
scale_x_discrete(drop = FALSE) #+
#facet_wrap(~ denom)
```
Only Protestants or Christians responded to the denomination question.
5.4 Modifying factor order
5.4.1 fct_reorder()
- reorder factor based on a continuous variable
5.4.1.1 (1) Reordering nominal (arbitrary) factors
- explore the average number of hours spent watching TV per day across religions:
```{r}
# hard to interpret the pattern of the relationship
relig_summary <- gss_cat %>%
group_by(relig) %>%
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n() )
relig_summary
ggplot(relig_summary, aes(tvhours, relig)) +
geom_point()
# reorder religion by mean tvhours
ggplot(relig_summary, aes(tvhours, fct_reorder(relig, tvhours))) +
geom_point()
# It is better to move fct_reorder out of aes() and into a separate mutate() step.
relig_summary %>%
mutate(relig = fct_reorder(relig, tvhours)) %>%
ggplot(aes(tvhours, relig)) +
geom_point()
```
# A tibble: 15 × 4
relig age tvhours n
<fct> <dbl> <dbl> <int>
1 No answer 49.5 2.72 93
2 Don't know 35.9 4.62 15
3 Inter-nondenominational 40.0 2.87 109
4 Native american 38.9 3.46 23
5 Christian 40.1 2.79 689
6 Orthodox-christian 50.4 2.42 95
7 Moslem/islam 37.6 2.44 104
8 Other eastern 45.9 1.67 32
9 Hinduism 37.7 1.89 71
10 Buddhism 44.7 2.38 147
11 Other 41.0 2.73 224
12 None 41.2 2.71 3523
13 Jewish 52.4 2.52 388
14 Catholic 46.9 2.96 5124
15 Protestant 49.9 3.15 10846
5.4.1.2 (2) Reordering ordinal (principled) factors?
How does average age vary across reported income levels?
```{r}
rincome_summary <- gss_cat %>%
group_by(rincome) %>%
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n() )
rincome_summary
ggplot(rincome_summary, aes(age, fct_reorder(rincome, age))) + geom_point()
```
# A tibble: 16 × 4
rincome age tvhours n
<fct> <dbl> <dbl> <int>
1 No answer 45.5 2.90 183
2 Don't know 45.6 3.41 267
3 Refused 47.6 2.48 975
4 $25000 or more 44.2 2.23 7363
5 $20000 - 24999 41.5 2.78 1283
6 $15000 - 19999 40.0 2.91 1048
7 $10000 - 14999 41.1 3.02 1168
8 $8000 to 9999 41.1 3.15 340
9 $7000 to 7999 38.2 2.65 188
10 $6000 to 6999 40.3 3.17 215
11 $5000 to 5999 37.8 3.16 227
12 $4000 to 4999 38.9 3.15 226
13 $3000 to 3999 37.8 3.31 276
14 $1000 to 2999 34.5 3.00 395
15 Lt $1000 40.5 3.36 286
16 Not applicable 56.1 3.79 7043
Here, arbitrarily reordering the levels isn’t a good idea! That’s because rincome already has a principled order that we shouldn’t mess with. Reserve fct_reorder() for factors whose levels are arbitrarily ordered.
5.4.2 fct_relevel()
- takes a factor, f, and then any number of levels that you want to move to the front of the line.
```{r}
ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) +
geom_point()
# releveling in wrangling part
gss_cat %>%
mutate(rincome = fct_relevel(rincome, "Not applicable")) %>%
group_by(rincome) %>%
summarise(
age = mean(age, na.rm = TRUE),
tvhours = mean(tvhours, na.rm = TRUE),
n = n() ) %>%
ggplot(aes(age, rincome)) +
geom_point()
```
5.4.3 fct_reorder2()
- Reorders the factor by the y values associated with the largest x values.
- This makes the plot easier to read because the line colours line up with the legend.
```{r}
by_age <- gss_cat %>%
filter(!is.na(age)) %>%
count(age, marital) %>%
group_by(age) %>%
mutate(prop = n / sum(n))
ggplot(by_age, aes(age, prop, colour = marital)) +
geom_line(na.rm = TRUE)
# the line colors line up with the color legend.
ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) +
geom_line() +
labs(colour = "Marital Status")
```
5.4.4 fct_infreq() + fct_rev()
- fct_infreq() changes the order of levels by the number of observations at each level (largest first).
- fct_rev() reverses the order of levels.
- Used together, they order the levels in increasing frequency.
- This is the simplest type of reordering because it doesn’t need any extra variables.
```{r}
gss_cat %>%
mutate(marital = marital %>% fct_infreq()) %>%
ggplot(aes(marital)) +
geom_bar()
# barplot 1
marital_order1 <- gss_cat %>%
mutate(marital = marital %>% fct_infreq() %>% fct_rev())
gss_cat %>%
mutate(marital = marital %>% fct_infreq() %>% fct_rev()) %>%
ggplot(aes(marital)) +
geom_bar()
# barplot 2
marital_order2 <- gss_cat %>%
mutate(marital = fct_infreq(marital),
marital = fct_rev(marital))
gss_cat %>%
mutate(marital = fct_infreq(marital),
marital = fct_rev(marital)) %>%
ggplot(aes(marital)) +
geom_bar()
# comparison
levels(marital_order1$marital)
levels(marital_order2$marital)
```
[1] "No answer" "Separated" "Widowed" "Divorced"
[5] "Never married" "Married"
[1] "No answer" "Separated" "Widowed" "Divorced"
[5] "Never married" "Married"
5.5 Modifying factor levels
- More powerful than changing the orders of the levels is changing their values.
- This allows you to clarify labels for publication, and collapse levels for high-level displays.
5.5.1 fct_recode()
- The most general and powerful tool is fct_recode().
- It allows you to recode, or change, the value of each level.
5.5.1.1 (1) recode it
```{r}
# The levels are terse and inconsistent.
gss_cat %>% count(partyid)
# Let's tweak them to be longer and use a parallel construction.
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat"
)) %>%
count(partyid)
```
# A tibble: 10 × 2
partyid n
<fct> <int>
1 No answer 154
2 Don't know 1
3 Other party 393
4 Strong republican 2314
5 Not str republican 3032
6 Ind,near rep 1791
7 Independent 4119
8 Ind,near dem 2499
9 Not str democrat 3690
10 Strong democrat 3490
# A tibble: 10 × 2
partyid n
<fct> <int>
1 No answer 154
2 Don't know 1
3 Other party 393
4 Republican, strong 2314
5 Republican, weak 3032
6 Independent, near rep 1791
7 Independent 4119
8 Independent, near dem 2499
9 Democrat, weak 3690
10 Democrat, strong 3490
fct_recode() will leave levels that aren’t explicitly mentioned as is, and will warn you if you accidentally refer to a level that doesn’t exist.
5.5.1.2 (2) Combine multiple levels into a new level
- To combine groups, you can assign multiple old levels to the same new level:
```{r}
gss_cat %>%
mutate(partyid = fct_recode(partyid,
"Republican, strong" = "Strong republican",
"Republican, weak" = "Not str republican",
"Independent, near rep" = "Ind,near rep",
"Independent, near dem" = "Ind,near dem",
"Democrat, weak" = "Not str democrat",
"Democrat, strong" = "Strong democrat",
"Other" = "No answer",
"Other" = "Don't know",
"Other" = "Other party"
)) %>%
count(partyid)
```
# A tibble: 8 × 2
partyid n
<fct> <int>
1 Other 548
2 Republican, strong 2314
3 Republican, weak 3032
4 Independent, near rep 1791
5 Independent 4119
6 Independent, near dem 2499
7 Democrat, weak 3690
8 Democrat, strong 3490
Note: You must use this technique with care: if you group together categories that are truly different you will end up with misleading results.
5.5.2 fct_collapse()
- If you want to collapse a lot of levels, fct_collapse() is a useful variant of fct_recode().
- For each new variable, you can provide a vector of old levels:
```{r}
gss_cat %>%
mutate(partyid = fct_collapse(partyid,
other = c("No answer", "Don't know", "Other party"),
rep = c("Strong republican", "Not str republican"),
ind = c("Ind,near rep", "Independent", "Ind,near dem"),
dem = c("Not str democrat", "Strong democrat")
)) %>%
count(partyid)
```
# A tibble: 4 × 2
partyid n
<fct> <int>
1 other 548
2 rep 5346
3 ind 8409
4 dem 7180
5.5.3 fct_lump()
- Sometimes you just want to lump together all the small groups to make a plot or table simpler. That’s the job of fct_lump():
- The default behaviour is to progressively lump together the smallest groups, ensuring that the aggregate is still the smallest group.
```{r}
# In this case it's not very helpful: it is true that the majority of Americans
# in this survey are Protestant, but we've probably over collapsed.
gss_cat %>%
mutate(relig = fct_lump(relig)) %>%
count(relig)
# Instead, we can use the n parameter to specify how many groups (excluding other) we want to keep:
gss_cat %>%
mutate(relig = fct_lump(relig, n = 10)) %>%
count(relig, sort = TRUE) %>%
print(n = Inf)
```
# A tibble: 2 × 2
relig n
<fct> <int>
1 Protestant 10846
2 Other 10637
# A tibble: 10 × 2
relig n
<fct> <int>
1 Protestant 10846
2 Catholic 5124
3 None 3523
4 Christian 689
5 Other 458
6 Jewish 388
7 Buddhism 147
8 Inter-nondenominational 109
9 Moslem/islam 104
10 Orthodox-christian 95
5.5.4 Exercise
- How have the proportions of people identifying as Democrat, Republican, and Independent changed over time?
- How could you collapse rincome into a small set of categories?
6 Dates and Times with lubridate package
6.1 Introduction
Dates and times are hard because they have to reconcile two physical phenomena (the rotation of the Earth and its orbit around the sun) with a whole raft of geopolitical phenomena including months, time zones, and DST.
6.1.1 Prep
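A likely setup chunk for this section (the original is not shown; nycflights13 supplies the flights data used below):

```{r}
library(tidyverse)
library(lubridate)
library(nycflights13)
```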
6.2 Creating date/times
There are three types of date/time data that refer to an instant in time:
- A date: tibbles print this as `<date>`.
- A time within a day: tibbles print this as `<time>`.
- A date-time is a date plus a time: it uniquely identifies an instant in time (typically to the nearest second). Tibbles print this as `<dttm>`. Elsewhere in R these are called POSIXct.
- We will focus only on dates and date-times, as R doesn’t have a native class for storing times. If you need one, you can use the hms package.
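The output below comes from the current date and date-time:

```{r}
today() # a date
now()   # a date-time
```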
[1] "2025-03-27"
[1] "2025-03-27 12:12:27 PDT"
Three ways to create a date/time:
- From a string.
- From individual date-time components.
- From an existing date/time object.
6.2.1 From Strings
- Date/time data often comes as strings.
- One approach is to parse the strings into date-times with the date-time parsers from readr.
- Another approach is to use the helpers provided by lubridate.
- They automatically work out the format once you specify the order of the component.
- To use them, identify the order in which year, month, and day appear in your dates,
- then arrange “y”, “m”, and “d” in the same order.
- That gives you the name of the lubridate function that will parse your date.
- For example, see next.
6.2.1.1 (1) dates only
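A minimal sketch of the date-only helpers, using the standard r4ds example values:

```{r}
library(lubridate)

# The function name spells out the order of the components
ymd("2017-01-31")
mdy("January 31st, 2017")
dmy("31-Jan-2017")
ymd(20170131)   # also works on unquoted numbers
```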
6.2.1.2 (2) date-time
- To create a date-time, add an underscore and one or more of “h”, “m”, and “s” to the name of the parsing function.
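For example (values from the standard r4ds examples):

```{r}
library(lubridate)

# underscore + h/m/s components makes the parser expect a time too
ymd_hms("2017-01-31 20:11:59")
mdy_hm("01/31/2017 08:01")
```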
6.2.1.3 (3) time zone
You can also force the creation of a date-time from a date by supplying a timezone:
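For example:

```{r}
library(lubridate)

ymd("2017-01-31")             # a date
ymd("2017-01-31", tz = "UTC") # supplying tz forces a date-time
```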
6.2.2 From Individual Components
- Instead of a single string, sometimes you’ll have the individual components of the date-time spread across multiple columns.
# A tibble: 336,776 × 5
year month day hour minute
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 5 15
2 2013 1 1 5 29
3 2013 1 1 5 40
4 2013 1 1 5 45
5 2013 1 1 6 0
6 2013 1 1 5 58
7 2013 1 1 6 0
8 2013 1 1 6 0
9 2013 1 1 6 0
10 2013 1 1 6 0
# ℹ 336,766 more rows
6.2.2.1 (1) make_date() / make_datetime()
- To create a date/time from this sort of input, use make_date() for dates, or make_datetime() for date-times:
```{r}
flights %>%
select(year, month, day, hour, minute, sched_dep_time) %>%
mutate(departure = make_datetime(year, month, day, hour, minute))
```
# A tibble: 336,776 × 7
year month day hour minute sched_dep_time departure
<int> <int> <int> <dbl> <dbl> <int> <dttm>
1 2013 1 1 5 15 515 2013-01-01 05:15:00
2 2013 1 1 5 29 529 2013-01-01 05:29:00
3 2013 1 1 5 40 540 2013-01-01 05:40:00
4 2013 1 1 5 45 545 2013-01-01 05:45:00
5 2013 1 1 6 0 600 2013-01-01 06:00:00
6 2013 1 1 5 58 558 2013-01-01 05:58:00
7 2013 1 1 6 0 600 2013-01-01 06:00:00
8 2013 1 1 6 0 600 2013-01-01 06:00:00
9 2013 1 1 6 0 600 2013-01-01 06:00:00
10 2013 1 1 6 0 600 2013-01-01 06:00:00
# ℹ 336,766 more rows
6.2.2.2 (2) Create flights_dt
```{r}
make_datetime_100 <- function(year, month, day, time) {
make_datetime(year, month, day, time %/% 100, time %% 100)
}
flights_dt <- flights %>%
filter(!is.na(dep_time), !is.na(arr_time)) %>%
mutate(
dep_time = make_datetime_100(year, month, day, dep_time),
arr_time = make_datetime_100(year, month, day, arr_time),
sched_dep_time = make_datetime_100(year, month, day, sched_dep_time),
sched_arr_time = make_datetime_100(year, month, day, sched_arr_time)
) %>%
select(origin, dest, ends_with("delay"), ends_with("time"))
flights_dt
# visualize the distribution of departure times across the year
flights_dt %>%
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 86400) # 86400 seconds = 1 day
60*60*24 # number of seconds in a day
# distribution of departure time within a single day
flights_dt %>%
filter(dep_time < ymd(20130102)) %>%
ggplot(aes(dep_time)) +
geom_freqpoly(binwidth = 600) # 600 s = 10 minutes
```
# A tibble: 328,063 × 9
origin dest dep_delay arr_delay dep_time sched_dep_time
<chr> <chr> <dbl> <dbl> <dttm> <dttm>
1 EWR IAH 2 11 2013-01-01 05:17:00 2013-01-01 05:15:00
2 LGA IAH 4 20 2013-01-01 05:33:00 2013-01-01 05:29:00
3 JFK MIA 2 33 2013-01-01 05:42:00 2013-01-01 05:40:00
4 JFK BQN -1 -18 2013-01-01 05:44:00 2013-01-01 05:45:00
5 LGA ATL -6 -25 2013-01-01 05:54:00 2013-01-01 06:00:00
6 EWR ORD -4 12 2013-01-01 05:54:00 2013-01-01 05:58:00
7 EWR FLL -5 19 2013-01-01 05:55:00 2013-01-01 06:00:00
8 LGA IAD -3 -14 2013-01-01 05:57:00 2013-01-01 06:00:00
9 JFK MCO -3 -8 2013-01-01 05:57:00 2013-01-01 06:00:00
10 LGA ORD -2 8 2013-01-01 05:58:00 2013-01-01 06:00:00
# ℹ 328,053 more rows
# ℹ 3 more variables: arr_time <dttm>, sched_arr_time <dttm>, air_time <dbl>
[1] 86400
Note that when you use date-times in a numeric context (like in a histogram), 1 means 1 second, so a binwidth of 86400 means one day. For dates, 1 means 1 day.
6.2.3 From Other Types
6.2.3.1 (1) as_datetime()
/ as_date()
You may want to switch between a date-time and a date. That’s the job of as_datetime() and as_date()
- Sometimes you’ll get date/times as numeric offsets from the “Unix Epoch”, 1970-01-01. If the offset is in seconds, use as_datetime(); if it’s in days, use as_date().
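For example, using the epoch offsets from r4ds:

```{r}
library(lubridate)

as_datetime(60 * 60 * 10)   # 10 hours (in seconds) after the Unix epoch
as_date(365 * 10 + 2)       # 10 years (plus 2 leap days) after the epoch
```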
6.3 Date-time components
Now that you know how to get date-time data into R’s date-time data structures, let’s explore what you can do with them. This section will focus on the accessor functions that let you get and set individual components. The next section will look at how arithmetic works with date-times.
6.3.1 Getting components
```{r}
datetime <- ymd_hms("2016-07-08 12:34:56")
year(datetime)
month(datetime)
mday(datetime) # day of the month
yday(datetime) # day of the year
wday(datetime) # day of the week starting with Sunday
wday(now())
```
[1] 2016
[1] 7
[1] 8
[1] 190
[1] 6
[1] 5
- For month() and wday() you can set label = TRUE to return the abbreviated name of the month or day of the week.
- Set abbr = FALSE to return the full name.
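A sketch of a chunk that would produce output like the lines below (datetime is redefined here so the chunk is self-contained):

```{r}
library(lubridate)

datetime <- ymd_hms("2016-07-08 12:34:56")
month(datetime, label = TRUE)                 # abbreviated month name
wday(datetime, label = TRUE, abbr = FALSE)    # full day-of-week name
```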
[1] Jul
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
[1] Friday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
- We can use wday() to see that more flights depart during the week than on the weekend:
```{r}
flights_dt %>%
mutate(wday = wday(dep_time, label = TRUE)) %>%
ggplot(aes(x = wday)) +
geom_bar()
```
- There’s an interesting pattern if we look at the average departure delay by minute within the hour.
- It looks like flights leaving in minutes 20-30 and 50-60 have much lower delays than the rest of the hour!
```{r}
flights_dt %>%
mutate(minute = minute(dep_time)) %>%
group_by(minute) %>%
summarise(
avg_delay = mean(arr_delay, na.rm = TRUE),
n = n()) %>%
ggplot(aes(minute, avg_delay)) +
geom_line()
```
- Interestingly, if we look at the scheduled departure time we don’t see such a strong pattern:
6.3.2 Rounding:
6.3.2.1 (1) Basic
- floor_date(), round_date(), and ceiling_date()
- Syntax
floor_date(
x,
unit = "seconds",
week_start = getOption("lubridate.week.start", 7)
)
# unit: second, minute, hour, day, week, month, bimonth, quarter, season, halfyear and year
- the number of flights per week:
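The idea, sketched on a small synthetic vector of timestamps (with the flights data you would floor dep_time from flights_dt instead, as indicated in the comment):

```{r}
library(lubridate)

# Synthetic departure times: 21 consecutive days
dep <- ymd_hms("2013-01-01 08:00:00") + days(0:20)

# floor_date() maps each timestamp to the start of its week
table(floor_date(dep, "week"))

# With the flights data, the same idea would be:
# flights_dt %>%
#   count(week = floor_date(dep_time, "week")) %>%
#   ggplot(aes(week, n)) + geom_line()
```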
6.3.2.2 (2) Exercise
```{r}
## print fractional seconds
options(digits.secs = 6)
x <- ymd_hms("2009-08-03 12:01:59.23")
round_date(x, "second")
round_date(x, "minute")
round_date(x, "5 mins")
round_date(x, "hour")
round_date(x, "2 hours")
round_date(x, "day")
round_date(x, "week")
round_date(x, "month")
round_date(x, "bimonth")
round_date(x, "quarter") == round_date(x, "3 months")
round_date(x, "halfyear")
round_date(x, "year")
x <- ymd_hms("2009-08-03 12:01:59.23")
floor_date(x, "second")
floor_date(x, "minute")
floor_date(x, "hour")
floor_date(x, "day")
floor_date(x, "week")
floor_date(x, "month")
floor_date(x, "bimonth")
floor_date(x, "quarter")
floor_date(x, "season")
floor_date(x, "halfyear")
floor_date(x, "year")
x <- ymd_hms("2009-08-03 12:01:59.23")
ceiling_date(x, "second")
ceiling_date(x, "minute")
ceiling_date(x, "5 mins")
ceiling_date(x, "hour")
ceiling_date(x, "day")
ceiling_date(x, "week")
ceiling_date(x, "month")
ceiling_date(x, "bimonth") == ceiling_date(x, "2 months")
ceiling_date(x, "quarter")
ceiling_date(x, "season")
ceiling_date(x, "halfyear")
ceiling_date(x, "year")
```
[1] "2009-08-03 12:01:59 UTC"
[1] "2009-08-03 12:02:00 UTC"
[1] "2009-08-03 12:00:00 UTC"
[1] "2009-08-03 12:00:00 UTC"
[1] "2009-08-03 12:00:00 UTC"
[1] "2009-08-04 UTC"
[1] "2009-08-02 UTC"
[1] "2009-08-01 UTC"
[1] "2009-09-01 UTC"
[1] TRUE
[1] "2009-07-01 UTC"
[1] "2010-01-01 UTC"
[1] "2009-08-03 12:01:59 UTC"
[1] "2009-08-03 12:01:00 UTC"
[1] "2009-08-03 12:00:00 UTC"
[1] "2009-08-03 UTC"
[1] "2009-08-02 UTC"
[1] "2009-08-01 UTC"
[1] "2009-07-01 UTC"
[1] "2009-07-01 UTC"
[1] "2009-06-01 UTC"
[1] "2009-07-01 UTC"
[1] "2009-01-01 UTC"
[1] "2009-08-03 12:02:00 UTC"
[1] "2009-08-03 12:02:00 UTC"
[1] "2009-08-03 12:05:00 UTC"
[1] "2009-08-03 13:00:00 UTC"
[1] "2009-08-04 UTC"
[1] "2009-08-09 UTC"
[1] "2009-09-01 UTC"
[1] TRUE
[1] "2009-10-01 UTC"
[1] "2009-09-01 UTC"
[1] "2010-01-01 UTC"
[1] "2010-01-01 UTC"
6.3.3 Setting components
6.3.3.1 (1) Modifying date-time in place
- make a permanent change
- use each accessor function to set the components of a date/time:
```{r}
(datetime <- ymd_hms("2016-07-08 12:34:56"))
year(datetime) <- 2023 # permanent change
datetime
month(datetime) <- 3
datetime
hour(datetime) <- hour(datetime) + 1
datetime
```
[1] "2016-07-08 12:34:56 UTC"
[1] "2023-07-08 12:34:56 UTC"
[1] "2023-03-08 12:34:56 UTC"
[1] "2023-03-08 13:34:56 UTC"
6.3.3.2 (2) Create a new date-time with update()
```{r}
update(datetime, year = 2024, month = 2, mday = 2, hour =2) # not a permanent change
datetime
#If values are too big, they will roll-over:
ymd("2023-02-01") %>%
update(mday = 30)
ymd("2023-02-01") %>%
update(hour = 48)
# show the distribution of flights across the course of the day for every day of the year:
flights_dt %>%
#arrange(desc(dep_time))
mutate(dep_hour = update(dep_time, yday = 1)) %>% # yday = 1 maps every flight onto the first day of the year
#arrange(desc(dep_hour))
ggplot(aes(dep_hour)) +
geom_freqpoly(binwidth = 300)
```
[1] "2024-02-02 02:34:56 UTC"
[1] "2023-03-08 13:34:56 UTC"
[1] "2023-03-02"
[1] "2023-02-03 UTC"
6.3.4 Exercise
6.3.4.1 (1) Distribution of flight times within a day
How does the distribution of flight times within a day change over the course of the year?
```{r}
# flights per hour for the entire year
flights_dt %>%
mutate(hour = hour(dep_time)) %>%
group_by(hour)%>%
summarize(numflights_per_hour = n())%>%
ggplot(aes(x = hour, y = numflights_per_hour)) +
geom_line()
flights %>%
filter(!is.na(dep_time)) %>%
mutate(hour = dep_time %/% 100) %>%
group_by(hour)%>%
summarize(numflights_per_hour = n())%>%
ggplot(aes(x = hour, y = numflights_per_hour)) +
geom_line()
```
6.4 Time spans and arithmetics
- durations, which represent an exact number of seconds.
- periods, which represent human units like weeks and months.
- intervals, which represent a starting and ending point.
6.4.1 Durations
6.4.1.1 (1) Base R
- In R, when you subtract two dates, you get a difftime object:
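For example, h_age (which the next chunk converts to a duration) was presumably created earlier by subtracting a birth date from today(); the date below is made up, since the original chunk is not shown:

```{r}
library(lubridate)

# Subtracting two dates yields a base-R difftime (hypothetical birth date)
h_age <- today() - ymd("1998-01-01")
h_age
class(h_age)
```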
6.4.1.2 (2) Duration and convenient constructors
```{r}
as.duration(h_age) # convert to a duration, which always stores the span in seconds
# Convenient constructors
dseconds(15)
dminutes(10)
dhours(c(12, 24))
ddays(0:5)
```
[1] "860716800s (~27.27 years)"
[1] "15s"
[1] "600s (~10 minutes)"
[1] "43200s (~12 hours)" "86400s (~1 days)"
[1] "0s" "86400s (~1 days)" "172800s (~2 days)"
[4] "259200s (~3 days)" "345600s (~4 days)" "432000s (~5 days)"
6.4.1.3 (3) Arithmetic with durations
```{r}
# You can add and multiply durations:
2 * dyears(1)
dyears(1) + dweeks(12) + dhours(15)
# add and subtract durations to and from days:
tomorrow <- today() + ddays(1)
tomorrow
last_year <- today() - dyears(1)
last_year
# Unexpected result due to DST starting in March
one_pm <- ymd_hms("2022-03-12 13:00:00", tz = "America/New_York")
one_pm
one_pm + ddays(1)
```
[1] "63115200s (~2 years)"
[1] "38869200s (~1.23 years)"
[1] "2025-03-28"
[1] "2024-03-26 18:00:00 UTC"
[1] "2022-03-12 13:00:00 EST"
[1] "2022-03-13 14:00:00 EDT"
6.4.2 Periods and constructors
- Periods solve the problem that arises because durations always count exact seconds.
- Periods are time spans that don’t have a fixed length in seconds;
- instead they work with “human” times, like days and months.
- That allows them to work in a more intuitive way:
```{r}
one_pm + days(1)
seconds(15)
minutes(10)
hours(c(12, 24))
days(7)
months(1:6)
weeks(3)
years(1)
```
[1] "2022-03-13 13:00:00 EDT"
[1] "15S"
[1] "10M 0S"
[1] "12H 0M 0S" "24H 0M 0S"
[1] "7d 0H 0M 0S"
[1] "1m 0d 0H 0M 0S" "2m 0d 0H 0M 0S" "3m 0d 0H 0M 0S" "4m 0d 0H 0M 0S"
[5] "5m 0d 0H 0M 0S" "6m 0d 0H 0M 0S"
[1] "21d 0H 0M 0S"
[1] "1y 0m 0d 0H 0M 0S"
6.4.2.1 (1) Arithmetics
```{r}
# addition and multiplication
10 * (months(6) + days(1))
days(50) + hours(25) + minutes(2)
# Compared to durations, periods are more likely to do what you expect
## A leap year
ymd("2020-01-01") + dyears(1)
ymd("2020-01-01") + years(1)
## Daylight savings time
one_pm + ddays(1)
one_pm + days(1)
```
[1] "60m 10d 0H 0M 0S"
[1] "50d 25H 2M 0S"
[1] "2020-12-31 06:00:00 UTC"
[1] "2021-01-01"
[1] "2022-03-13 14:00:00 EDT"
[1] "2022-03-13 13:00:00 EDT"
6.4.2.2 (2) Application to flights data
- fix an oddity related to our flight dates. Some planes appear to have arrived at their destination before they departed from New York City.
```{r}
# These are overnight flights
flights %>%
filter(arr_time < dep_time) # not the same number of rows as below due to the different time representations used
flights_dt %>%
filter(arr_time < dep_time)
# We used the same date information for both the departure and the arrival times,
# but these flights arrived on the following day. We can fix this by adding days(1)
# to the arrival time of each overnight flight.
flights_dt <- flights_dt %>%
mutate(overnight = arr_time < dep_time,
arr_time = arr_time + days(overnight * 1),
sched_arr_time = sched_arr_time + days(overnight * 1)
)
flights_dt %>%
filter(overnight, arr_time < dep_time)
```
# A tibble: 10,633 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 1929 1920 9 3 7
2 2013 1 1 1939 1840 59 29 2151
3 2013 1 1 2058 2100 -2 8 2359
4 2013 1 1 2102 2108 -6 146 158
5 2013 1 1 2108 2057 11 25 39
6 2013 1 1 2120 2130 -10 16 18
7 2013 1 1 2121 2040 41 6 2323
8 2013 1 1 2128 2135 -7 26 50
9 2013 1 1 2134 2045 49 20 2352
10 2013 1 1 2136 2145 -9 25 39
# ℹ 10,623 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 10,633 × 9
origin dest dep_delay arr_delay dep_time
<chr> <chr> <dbl> <dbl> <dttm>
1 EWR BQN 9 -4 2013-01-01 19:29:00.000000
2 JFK DFW 59 NA 2013-01-01 19:39:00.000000
3 EWR TPA -2 9 2013-01-01 20:58:00.000000
4 EWR SJU -6 -12 2013-01-01 21:02:00.000000
5 EWR SFO 11 -14 2013-01-01 21:08:00.000000
6 LGA FLL -10 -2 2013-01-01 21:20:00.000000
7 EWR MCO 41 43 2013-01-01 21:21:00.000000
8 JFK LAX -7 -24 2013-01-01 21:28:00.000000
9 EWR FLL 49 28 2013-01-01 21:34:00.000000
10 EWR FLL -9 -14 2013-01-01 21:36:00.000000
# ℹ 10,623 more rows
# ℹ 4 more variables: sched_dep_time <dttm>, arr_time <dttm>,
# sched_arr_time <dttm>, air_time <dbl>
# A tibble: 0 × 10
# ℹ 10 variables: origin <chr>, dest <chr>, dep_delay <dbl>, arr_delay <dbl>,
# dep_time <dttm>, sched_dep_time <dttm>, arr_time <dttm>,
# sched_arr_time <dttm>, air_time <dbl>, overnight <lgl>
6.4.3 Intervals
- An interval is a duration with a starting point: that makes it precise, so you can determine exactly how long it is.
```{r}
dyears(1) / ddays(365) # obvious as duration uses 365 days worth of seconds
years(1)/days(1) # not specific; periods give an estimate since leap years have 366 days
# interval gives you an accurate measurement
today() + years(1)
next_year <- today() + years(1)
(today() %--% next_year) / ddays(1)
(today() %--% (today() + years(1))) / days(1)
# Durations must be standardized lengths of time.
# There is no dmonths() since months do not have a standard number of days.
(today() %--% (today() + years(1))) / dmonths(1) # not integer
(today() %--% next_year) / months(1) # integer
# how many periods fall into an interval? Do the integer division
(today() %--% next_year) %/% days(1)
```
[1] 1.000685
[1] 365.25
[1] "2026-03-27"
[1] 365
[1] 365
[1] 11.99179
[1] 12
[1] 365
6.4.4 Summary
How do you pick between duration, periods, and intervals?
- As always, pick the simplest data structure that solves your problem.
- If you only care about physical time, use a duration;
- if you need to add human times, use a period;
- if you need to figure out how long a span is in human units, use an interval.
6.5 Time zones
6.5.1 (1) Basics
- R uses the international standard IANA time zones, whose names typically take the form “<continent>/<city>”.
- Ex) “America/New_York”, “Europe/Paris”, and “Pacific/Auckland”
- It’s worth reading the raw time zone database (available at http://www.iana.org/time-zones)
```{r}
Sys.timezone()
Sys.time()
Sys.Date()
# the complete list of all time zone names
length(OlsonNames())
head(OlsonNames())
```
[1] "America/Los_Angeles"
[1] "2025-03-27 12:12:34.682331 PDT"
[1] "2025-03-27"
[1] 596
[1] "Africa/Abidjan" "Africa/Accra" "Africa/Addis_Ababa"
[4] "Africa/Algiers" "Africa/Asmara" "Africa/Asmera"
- In R, the time zone is an attribute of the date-time that only controls printing.
- For example, these three objects represent the same instant in time:
```{r}
(x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York"))
(x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen"))
(x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland"))
x1 - x2
x1 - x3
```
[1] "2015-06-01 12:00:00 EDT"
[1] "2015-06-01 18:00:00 CEST"
[1] "2015-06-02 04:00:00 NZST"
Time difference of 0 secs
Time difference of 0 secs
- Unless you specify a time zone, lubridate always uses UTC.
- UTC (Coordinated Universal Time) is the standard time zone used by the scientific community and is roughly equivalent to its predecessor GMT (Greenwich Mean Time).
6.5.2 (2) Change time zone
- Keep the instant in time the same, and change how it’s displayed.
- Use this when the instant is correct, but you want a more natural display.
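A sketch of a with_tz() chunk consistent with the output below (the +1030 offset suggests Australia/Lord_Howe, the zone used in r4ds; x1, x2, x3 are redefined from the earlier chunk so this is self-contained):

```{r}
library(lubridate)

x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York")
x2 <- ymd_hms("2015-06-01 18:00:00", tz = "Europe/Copenhagen")
x3 <- ymd_hms("2015-06-02 04:00:00", tz = "Pacific/Auckland")
x4 <- c(x1, x2, x3)

# with_tz() changes only how the instants are displayed
x4a <- with_tz(x4, tzone = "Australia/Lord_Howe")
x4a
x4a - x4   # the instants themselves are unchanged
```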
[1] "2015-06-02 02:30:00 +1030" "2015-06-02 02:30:00 +1030"
[3] "2015-06-02 02:30:00 +1030"
Time differences in secs
[1] 0 0 0
- Change the underlying instant in time.
- Use this when you have an instant that has been labelled with the incorrect time zone, and you need to fix it.
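A minimal sketch of force_tz(), reusing x1 from the earlier chunk (the target zone is chosen for illustration):

```{r}
library(lubridate)

x1 <- ymd_hms("2015-06-01 12:00:00", tz = "America/New_York")

# force_tz() keeps the clock time but changes the underlying instant
x1b <- force_tz(x1, tzone = "Australia/Lord_Howe")
x1b
x1b - x1   # no longer the same instant
```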
7 Appendix
7.1 Grouping
- You can use parentheses to override the default precedence rules:
[1] "gre" "ay"
[1] "grey" "gray"
- Parenthesis also define “groups” that you can refer to with backreferences, like \1, \2 etc, and can be extracted with str_match().
[1] "banana"
[,1] [,2]
[1,] "anan" "an"
- You can use (?:…), the non-grouping parentheses, to control precedence but not capture the match in a group. This is slightly more efficient than capturing parentheses.
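A small base-R sketch of the difference (stringr’s pattern syntax behaves analogously):

```{r}
# Capturing group: controls precedence AND stores the match for backreferences
regmatches("grey gray", gregexpr("gr(e|a)y", "grey gray"))[[1]]

# Non-capturing group (?:...): same precedence control, nothing captured
regmatches("grey gray", gregexpr("gr(?:e|a)y", "grey gray", perl = TRUE))[[1]]
```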
8 References
- R for Data Science: https://r4ds.had.co.nz/dates-and-times.html#date-time-components
- R Regex Cheatsheet: https://github.com/rstudio/cheatsheets/blob/main/regex.pdf
- Regular expressions: https://stringr.tidyverse.org/articles/regular-expressions.html
- stringr overview: https://stringr.tidyverse.org/
- glue overview: https://glue.tidyverse.org/index.html
- Lubridate overview: https://lubridate.tidyverse.org/