M06: Advanced Wrangling with dplyr
1 Topics
- joins relational databases
- Joining data frames when there are no unique key columns
- Operations across columns with across()
- Replace complex if else statements with case_when()
- Add row names to a column with rownames_to_columns()
- Subset distinct/unique rows with distinct()
- row-wise calculation with rowwise()
2 Packages
3 Rankings
3.1 Integer ranking functions
-
Description: Three ranking functions inspired by SQL2003. They differ primarily in how they handle ties:
row_number()
gives every input a unique rank, so that c(10, 20, 20, 30) would get ranks c(1, 2, 3, 4). It’s equivalent to rank(ties.method = “first”).min_rank()
gives every tie the same (smallest) value so that c(10, 20, 20, 30) gets ranks c(1, 2, 2, 4). It’s the way that ranks are usually computed in sports and is equivalent to rank(ties.method = “min”).dense_rank()
works like min_rank(), but doesn’t leave any gaps, so that c(10, 20, 20, 30) gets ranks c(1, 2, 2, 3).
```{r}
x <- c(5, 1, 3, 2, 2, NA)
row_number(x)
min_rank(x)
dense_rank(x)
# Example
df <- data.frame(
grp = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
x = c(3, 2, 1, 1, 2, 2, 1, 1, 1),
y = c(1, 3, 2, 3, 2, 2, 4, 1, 2),
id = 1:9
)
df |>
group_by(grp) |>
mutate(grp_id = row_number())
```
[1] 5 1 4 2 3 NA
[1] 5 1 4 2 2 NA
[1] 4 1 3 2 2 NA
# A tibble: 9 × 5
# Groups: grp [3]
grp x y id grp_id
<dbl> <dbl> <dbl> <int> <int>
1 1 3 1 1 1
2 1 2 3 2 2
3 1 1 2 3 3
4 2 1 3 4 1
5 2 2 2 5 2
6 2 2 2 6 3
7 3 1 4 7 1
8 3 1 1 8 2
9 3 1 2 9 3
3.2 Proportional ranking functions
-
Description: These two ranking functions implement two slightly different ways to compute a percentile. For each x_i in x:
cume_dist(x)
counts the total number of values less than or equal to x_i, and divides it by the number of observations.percent_rank(x)
counts the total number of values less than x_i, and divides it by the number of observations minus 1.In both cases, missing values are ignored when counting the number of observations.
```{r}
# Example
x <- c(5, 1, 3, 2, 2)
cume_dist(x) # is the same as below
sapply(x, function(xi) sum(x <= xi) / length(x))
percent_rank(x) # is the same as below
sapply(x, function(xi) sum(x < xi) / (length(x) - 1))
```
[1] 1.0 0.2 0.8 0.6 0.6
[1] 1.0 0.2 0.8 0.6 0.6
[1] 1.00 0.00 0.75 0.25 0.25
[1] 1.00 0.00 0.75 0.25 0.25
3.3 Bucket a numeric vector into n groups
- Description:
ntile()
is a sort of very rough rank, which breaks the input vector into n buckets.If length(x) is not an integer multiple of n, the size of the buckets will differ by up to one, with larger buckets coming first.
Unlike other ranking functions, ntile() ignores ties: it will create evenly sized buckets even if the same value of x ends up in different buckets.
Usage: ntile(x = row_number(), n)
3.4 row_number() / between()
```{r}
# row_number can be used with single table verbs without specifying x
# (for data frames and databases that support windowing)
mtcars <- mtcars |>
as_tibble()
mutate(mtcars, row_number() == 1L)
mutate(mtcars, row_number() == 2L)
mutate(mtcars, row_number() == 3)
mtcars |> row_number()
#
between(1:12, 7, 9) # inclusive
x <- rnorm(1e2) # rnorm(n, mean = 0, sd = 1)
x[between(x, -1, 1)]
## Or on a tibble using filter
filter(starwars, between(height, 100, 150))
mtcars %>% head()
mtcars %>% filter(between(row_number(), 3, 6))
mtcars %>% filter(between(mpg, 20, 25))
mtcars %>% filter(mpg >= 20, mpg <= 25)
mutate(mtcars, rank_hp = row_number(hp)) %>%
arrange(hp)
mtcars %>% select(mpg) %>% head() %>%
row_number()
```
# A tibble: 32 × 12
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
# ℹ 1 more variable: `row_number() == 1L` <lgl>
# A tibble: 32 × 12
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
# ℹ 1 more variable: `row_number() == 2L` <lgl>
# A tibble: 32 × 12
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
# ℹ 1 more variable: `row_number() == 3` <lgl>
[1] 19 20 24 22 15 14 4 26 25 16 13 11 12 7 2 1 5 31 29 32 23 9 8 3 17
[26] 28 27 30 10 18 6 21
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
[1] -0.048653395 0.884879882 -0.141319916 -0.967106045 0.315316490
[6] -0.705965651 -0.738639638 -0.462475899 0.579030499 0.112718486
[11] 0.130504674 0.297763793 0.594838996 0.724857853 0.054272230
[16] -0.101223600 0.845028750 -0.666466510 -0.079490219 0.305178391
[21] -0.149424746 0.004828794 -0.263878257 -0.100063144 0.804041285
[26] -0.899508645 -0.001009739 0.760263085 0.748396156 -0.256134958
[31] -0.560925902 0.611697256 -0.352491415 -0.764351651 -0.206370904
[36] -0.674367653 0.286305575 0.344326412 0.486405335 -0.578689798
[41] -0.354978498 0.085198582 -0.347749444 -0.373557294 -0.040152943
[46] -0.898820781 -0.564838799 0.432060270 -0.066068734 -0.575944089
[51] -0.368287162 -0.229399787 0.532786143 0.160260523 0.234379606
[56] 0.030328456 -0.655526257 0.869484763 -0.540088350 0.276284189
[61] -0.580589202
# A tibble: 5 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Leia Org… 150 49 brown light brown 19 fema… femin…
2 Mon Moth… 150 NA auburn fair blue 48 fema… femin…
3 Watto 137 NA black blue, grey yellow NA male mascu…
4 Sebulba 112 40 none grey, red orange NA male mascu…
5 Gasgano 122 NA none white, bl… black NA male mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# A tibble: 6 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# A tibble: 4 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
2 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
3 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
4 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
# A tibble: 8 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
6 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
8 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# A tibble: 8 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
6 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
8 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# A tibble: 32 × 12
mpg cyl disp hp drat wt qsec vs am gear carb rank_hp
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2 1
2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 2
3 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1 3
4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1 4
5 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1 5
6 26 4 120. 91 4.43 2.14 16.7 0 1 5 2 6
7 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 7
8 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 8
9 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1 9
10 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 10
# ℹ 22 more rows
[1] 3 4 6 5 2 1
4 Relational Data and Joins
To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:
- Mutating joins
- add new variables to one data frame from matching observations in another.
- Types
- inner_join(),
- left_join(),
- right_join(),
- full_join(),
- Filtering joins
- filter observations from one data frame based on whether or not they match an observation in the other table.
- semi_join(),
- anti_join().
- Set operations
- treat observations as if they were set elements.
4.1 Data
```{r}
library(tidyverse)
library(nycflights13) #nycflights13 contains four tibbles that are related to the flights table that you used in data transformation:
flights
airlines #the full carrier name from its abbreviated code:
airports #information about each airport, identified by the faa airport code:
planes #information about each plane, identified by its tailnum:
weather #the weather at each NYC airport for each hour:
```
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 16 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
7 F9 Frontier Airlines Inc.
8 FL AirTran Airways Corporation
9 HA Hawaiian Airlines Inc.
10 MQ Envoy Air
11 OO SkyWest Airlines Inc.
12 UA United Air Lines Inc.
13 US US Airways Inc.
14 VX Virgin America
15 WN Southwest Airlines Co.
16 YV Mesa Airlines Inc.
# A tibble: 1,458 × 8
faa name lat lon alt tz dst tzone
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
# ℹ 1,448 more rows
# A tibble: 3,322 × 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
# ℹ 3,312 more rows
# A tibble: 26,115 × 15
origin year month day hour temp dewp humid wind_dir wind_speed
<chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
# ℹ 26,105 more rows
# ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
# visib <dbl>, time_hour <dttm>
4.2 Keys (primary vs. foreign)
- The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.
- In simple cases, a single variable is sufficient to identify an observation. - For example, each plane is uniquely identified by its tailnum.
- In other cases, multiple variables may be needed.
- For example, to identify an observation in weather you need five variables:
year, month, day, hour, and origin
.
- For example, to identify an observation in weather you need five variables:
A primary key uniquely identifies an observation in its own table. For example,
planes$tailnum
is a primary key because it uniquely identifies each plane in the planes table.A foreign key uniquely identifies an observation in another table. For example,
flights$tailnum
is a foreign key because it appears in the flights table where it matches each flight to a unique plane.
```{r}
# plane$tailnum is primary key
planes %>%
count(tailnum) %>%
filter(n > 1) # confirm there is no duplicates; the same can be accomplished with the following operation.
planes %>%
select(tailnum) %>%
duplicated() %>%
max()
# flights$tailnum is a foreign key
flights %>%
count(tailnum) %>%
filter(n > 1) |>
arrange(-n)
weather %>%
count(year, month, day, hour, origin) %>%
filter(n > 1) # the output confirms that there are some duplicates. Thus, it is not a primary key.
# airports$faa can be a primary key
airports %>%
count(faa) %>% # faa: airport code = origin
filter(n > 1) # confirms that there is no duplicates
# Sometimes a table doesn't have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it.
# For example, what's the primary key in the flights table? You might think it would be the date plus the flight or tail number, but neither of those are unique:
flights %>%
count(year, month, day, flight) %>%
filter(n > 1)
flights %>%
count(year, month, day, tailnum) %>%
filter(n > 1)
```
# A tibble: 0 × 2
# ℹ 2 variables: tailnum <chr>, n <int>
[1] 0
# A tibble: 3,873 × 2
tailnum n
<chr> <int>
1 <NA> 2512
2 N725MQ 575
3 N722MQ 513
4 N723MQ 507
5 N711MQ 486
6 N713MQ 483
7 N258JB 427
8 N298JB 407
9 N353JB 404
10 N351JB 402
# ℹ 3,863 more rows
# A tibble: 3 × 6
year month day hour origin n
<int> <int> <int> <int> <chr> <int>
1 2013 11 3 1 EWR 2
2 2013 11 3 1 JFK 2
3 2013 11 3 1 LGA 2
# A tibble: 0 × 2
# ℹ 2 variables: faa <chr>, n <int>
# A tibble: 29,768 × 5
year month day flight n
<int> <int> <int> <int> <int>
1 2013 1 1 1 2
2 2013 1 1 3 2
3 2013 1 1 4 2
4 2013 1 1 11 3
5 2013 1 1 15 2
6 2013 1 1 21 2
7 2013 1 1 27 4
8 2013 1 1 31 2
9 2013 1 1 32 2
10 2013 1 1 35 2
# ℹ 29,758 more rows
# A tibble: 64,928 × 5
year month day tailnum n
<int> <int> <int> <chr> <int>
1 2013 1 1 N0EGMQ 2
2 2013 1 1 N11189 2
3 2013 1 1 N11536 2
4 2013 1 1 N11544 3
5 2013 1 1 N11551 2
6 2013 1 1 N12540 2
7 2013 1 1 N12567 2
8 2013 1 1 N13123 2
9 2013 1 1 N13538 3
10 2013 1 1 N13566 3
# ℹ 64,918 more rows
If a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number(). That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a *surrogate key*
.
A primary
key and the corresponding foreign
key in another table form a relation. Relations are typically one-to-many. For example, each flight
has one plane, but each plane
has many flights. In other data, you’ll occasionally see a 1-to-1 relationship. You can think of this as a special case of 1-to-many. You can model many-to-many relations with a many-to-1 relation plus a 1-to-many relation. For example, in this data there’s a many-to-many relationship between airlines and airports: each airline flies to many airports; each airport hosts many airlines.
4.3 Mutating Joins
A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.
```{r}
flights2 <- flights %>%
select(year:day, hour, origin, dest, tailnum, carrier)
flights2
flights2 %>%
select(-origin, -dest) %>%
left_join(airlines, by = "carrier") # this is the same as below.
flights2 %>%
select(-origin, -dest) %>%
mutate(name = airlines$name[match(carrier, airlines$carrier)])
```
# A tibble: 336,776 × 8
year month day hour origin dest tailnum carrier
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr>
1 2013 1 1 5 EWR IAH N14228 UA
2 2013 1 1 5 LGA IAH N24211 UA
3 2013 1 1 5 JFK MIA N619AA AA
4 2013 1 1 5 JFK BQN N804JB B6
5 2013 1 1 6 LGA ATL N668DN DL
6 2013 1 1 5 EWR ORD N39463 UA
7 2013 1 1 6 EWR FLL N516JB B6
8 2013 1 1 6 LGA IAD N829AS EV
9 2013 1 1 6 JFK MCO N593JB B6
10 2013 1 1 6 LGA ORD N3ALAA AA
# ℹ 336,766 more rows
# A tibble: 336,776 × 7
year month day hour tailnum carrier name
<int> <int> <int> <dbl> <chr> <chr> <chr>
1 2013 1 1 5 N14228 UA United Air Lines Inc.
2 2013 1 1 5 N24211 UA United Air Lines Inc.
3 2013 1 1 5 N619AA AA American Airlines Inc.
4 2013 1 1 5 N804JB B6 JetBlue Airways
5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
6 2013 1 1 5 N39463 UA United Air Lines Inc.
7 2013 1 1 6 N516JB B6 JetBlue Airways
8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
9 2013 1 1 6 N593JB B6 JetBlue Airways
10 2013 1 1 6 N3ALAA AA American Airlines Inc.
# ℹ 336,766 more rows
# A tibble: 336,776 × 7
year month day hour tailnum carrier name
<int> <int> <int> <dbl> <chr> <chr> <chr>
1 2013 1 1 5 N14228 UA United Air Lines Inc.
2 2013 1 1 5 N24211 UA United Air Lines Inc.
3 2013 1 1 5 N619AA AA American Airlines Inc.
4 2013 1 1 5 N804JB B6 JetBlue Airways
5 2013 1 1 6 N668DN DL Delta Air Lines Inc.
6 2013 1 1 5 N39463 UA United Air Lines Inc.
7 2013 1 1 6 N516JB B6 JetBlue Airways
8 2013 1 1 6 N829AS EV ExpressJet Airlines Inc.
9 2013 1 1 6 N593JB B6 JetBlue Airways
10 2013 1 1 6 N3ALAA AA American Airlines Inc.
# ℹ 336,766 more rows
- Data set for illustration
4.3.1 Inner Joins
- An inner join keeps observations that appear in both tables.
- The most important property of an inner join is that unmatched rows are not included in the result.
- Inner joins are usually not appropriate for use in analysis because it’s too easy to lose observations.
4.3.2 Outer joins
An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:
4.3.2.1 Left joins
- A left join keeps all observations in x.
- The most commonly used join is the left join:
- you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match.
- The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
4.3.2.2 Right joins
- A right join keeps all observations in y.
4.3.2.3 Full joins
- A full join keeps all observations in x and y.
4.3.3 Duplicate keys
What will happen when keys are not unique?
4.3.3.1 One table has duplicate keys.
This is useful when you want to add in additional information as there is typically a one-to-many relationship
4.3.3.2 Both tables have duplicate keys.
This is usually an error because in neither table do the keys uniquely identify an observation. When you join duplicated keys, you get all possible combinations, the Cartesian product:
```{r}
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
3, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
2, "y3",
3, "y4"
)
left_join(x, y, by = "key", relationship = "many-to-many")
```
# A tibble: 6 × 3
key val_x val_y
<dbl> <chr> <chr>
1 1 x1 y1
2 2 x2 y2
3 2 x2 y3
4 2 x3 y2
5 2 x3 y3
6 3 x4 y4
4.3.4 Defining the key columns
4.3.4.1 Default mode
The default, by = NULL, uses all variables that appear in both tables, the so called natural join.
# A tibble: 336,776 × 18
year month day hour origin dest tailnum carrier temp dewp humid
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4
2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8
3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6
4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6
5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8
6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4
7 2013 1 1 6 EWR FLL N516JB B6 37.9 28.0 67.2
8 2013 1 1 6 LGA IAD N829AS EV 39.9 25.0 54.8
9 2013 1 1 6 JFK MCO N593JB B6 37.9 27.0 64.3
10 2013 1 1 6 LGA ORD N3ALAA AA 39.9 25.0 54.8
# ℹ 336,766 more rows
# ℹ 7 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
# precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
4.3.4.2 A character vector, by = “x”
```{r}
flights2 %>%
left_join(planes, by = "tailnum") # specify the variable that is common to use.
```
# A tibble: 336,776 × 16
year.x month day hour origin dest tailnum carrier year.y type
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr>
1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixed wing mult…
2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixed wing mult…
3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixed wing mult…
4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixed wing mult…
5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixed wing mult…
6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixed wing mult…
7 2013 1 1 6 EWR FLL N516JB B6 2000 Fixed wing mult…
8 2013 1 1 6 LGA IAD N829AS EV 1998 Fixed wing mult…
9 2013 1 1 6 JFK MCO N593JB B6 2004 Fixed wing mult…
10 2013 1 1 6 LGA ORD N3ALAA AA NA <NA>
# ℹ 336,766 more rows
# ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
# seats <int>, speed <int>, engine <chr>
4.3.4.3 A named character vector: by = c(“a” = “b”)
This will match variable a in table x to variable b in table y. The variables from x will be used in the output.
```{r}
flights2 %>%
left_join(airports, c("dest" = "faa"))
flights2 %>%
left_join(airports, c("origin" = "faa"))
```
# A tibble: 336,776 × 15
year month day hour origin dest tailnum carrier name lat lon alt
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 1 5 EWR IAH N14228 UA Georg… 30.0 -95.3 97
2 2013 1 1 5 LGA IAH N24211 UA Georg… 30.0 -95.3 97
3 2013 1 1 5 JFK MIA N619AA AA Miami… 25.8 -80.3 8
4 2013 1 1 5 JFK BQN N804JB B6 <NA> NA NA NA
5 2013 1 1 6 LGA ATL N668DN DL Harts… 33.6 -84.4 1026
6 2013 1 1 5 EWR ORD N39463 UA Chica… 42.0 -87.9 668
7 2013 1 1 6 EWR FLL N516JB B6 Fort … 26.1 -80.2 9
8 2013 1 1 6 LGA IAD N829AS EV Washi… 38.9 -77.5 313
9 2013 1 1 6 JFK MCO N593JB B6 Orlan… 28.4 -81.3 96
10 2013 1 1 6 LGA ORD N3ALAA AA Chica… 42.0 -87.9 668
# ℹ 336,766 more rows
# ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>
# A tibble: 336,776 × 15
year month day hour origin dest tailnum carrier name lat lon alt
<int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 2013 1 1 5 EWR IAH N14228 UA Newar… 40.7 -74.2 18
2 2013 1 1 5 LGA IAH N24211 UA La Gu… 40.8 -73.9 22
3 2013 1 1 5 JFK MIA N619AA AA John … 40.6 -73.8 13
4 2013 1 1 5 JFK BQN N804JB B6 John … 40.6 -73.8 13
5 2013 1 1 6 LGA ATL N668DN DL La Gu… 40.8 -73.9 22
6 2013 1 1 5 EWR ORD N39463 UA Newar… 40.7 -74.2 18
7 2013 1 1 6 EWR FLL N516JB B6 Newar… 40.7 -74.2 18
8 2013 1 1 6 LGA IAD N829AS EV La Gu… 40.8 -73.9 22
9 2013 1 1 6 JFK MCO N593JB B6 John … 40.6 -73.8 13
10 2013 1 1 6 LGA ORD N3ALAA AA La Gu… 40.8 -73.9 22
# ℹ 336,766 more rows
# ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>
4.3.5 base::merge()
- can perform all four types of mutating join:
merge(x, y, ...)
## Default S3 method:
merge(x, y, ...)
merge(x, y, by = intersect(names(x), names(y)),
by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
incomparables = NULL, ...)
4.4 Filtering joins
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
- semi_join(x, y) keeps all observations in x that have a match in y.
- anti_join(x, y) drops all observations in x that have a match in y.
- Only the existence of a match is important; it doesn’t matter which observation is matched. This means that filtering joins never duplicate rows like mutating joins do:
4.4.1 Semi joins
Semi-joins are useful for matching filtered summary tables back to the original rows.
- only keeps the rows in x that have a match in y:
```{r}
#For example, imagine you've found the top ten most popular destinations:
top_dest <- flights %>%
count(dest, sort = TRUE) %>%
head(10)
top_dest
# Now you want to find each flight that went to one of those destinations. You could construct a filter yourself:
flights %>%
filter(dest %in% top_dest$dest) |>
select(dest)
# Using semi join, we can achieve the above filtering.
flights %>%
semi_join(top_dest)
```
# A tibble: 10 × 2
dest n
<chr> <int>
1 ORD 17283
2 ATL 17215
3 LAX 16174
4 BOS 15508
5 MCO 14082
6 CLT 14064
7 SFO 13331
8 FLL 12055
9 MIA 11728
10 DCA 9705
# A tibble: 141,145 × 1
dest
<chr>
1 MIA
2 ATL
3 ORD
4 FLL
5 MCO
6 ORD
7 LAX
8 SFO
9 BOS
10 FLL
# ℹ 141,135 more rows
# A tibble: 141,145 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 542 540 2 923 850
2 2013 1 1 554 600 -6 812 837
3 2013 1 1 554 558 -4 740 728
4 2013 1 1 555 600 -5 913 854
5 2013 1 1 557 600 -3 838 846
6 2013 1 1 558 600 -2 753 745
7 2013 1 1 558 600 -2 924 917
8 2013 1 1 558 600 -2 923 937
9 2013 1 1 559 559 0 702 706
10 2013 1 1 600 600 0 851 858
# ℹ 141,135 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
4.4.2 Anti joins
- The inverse of a semi-join is an anti-join.
- An anti-join keeps the rows that don’t have a match:
- Anti-joins are useful for diagnosing join mismatches.
4.5 Join problems
- Expect to have some problems when gathering data in your own.
- Take some measures to minimize the problems
4.5.1 (1) Identify primary key
- Start by identifying the variables that form the primary key in each table. You should usually do this based on your understanding of the data, not empirically by looking for a combination of variables that give a unique identifier.
- If you just look for variables without thinking about what they mean, you might get (un)lucky and find a combination that’s unique in your current data but the relationship might not be true in general.
4.5.2 (2) Check that none of the variables in the primary key are missing.
If a value is missing then it can’t identify an observation!
4.5.3 (3) Check that your foreign keys match primary keys in another table.
The best way to do this is with an anti_join(). It’s common for keys not to match because of data entry errors. Fixing these is often a lot of work.
If you do have missing keys, you’ll need to be thoughtful about your use of inner vs. outer joins, carefully considering whether or not you want to drop rows that don’t have a match.
Be aware that simply checking the number of rows before and after the join is not sufficient to ensure that your join has gone smoothly.
4.6 Set operations
- Useful when you want to break a single complex filter into simpler pieces.
- All these operations work with a complete row, comparing the values of every variable.
- These expect the x and y inputs to have the same variables, and treat the observations like sets:
4.6.1 intersect()
intersect(x, y): return only observations in both x and y.
4.6.2 union()
union(x, y): return unique observations in x and y.
4.6.3 setdiff()
setdiff(x, y): return observations in x, but not in y.
5 Joining data frame when there are no unique key columns
- Situation: Neither data frame has a unique key column. The closest equivalent of the key column is the dates variable of monthly data. Each df has multiple entries per month, so the dates column has lots of duplicates.
5.1 Data
```{r}
df_1 <- tibble(
dates = c(as.Date("2018-07-01"), as.Date("2018-06-01"), as.Date("2018-06-01"), as.Date("2018-05-01")),
x1 = c(10L, 11L, 12L, 13L),
text1 = c("text a", "text b", "text c", "text d")
)
df_2 <- tibble(
dates = c(as.Date("2018-07-01"), as.Date("2018-06-01"), as.Date("2018-05-01"), as.Date("2018-04-01")),
x2 = c(20L, 21L, 22L, 23L),
text2 = c("text aa", "text bb", "text cc", "text dd")
)
df_1
df_2
```
# A tibble: 4 × 3
dates x1 text1
<date> <int> <chr>
1 2018-07-01 10 text a
2 2018-06-01 11 text b
3 2018-06-01 12 text c
4 2018-05-01 13 text d
# A tibble: 4 × 3
dates x2 text2
<date> <int> <chr>
1 2018-07-01 20 text aa
2 2018-06-01 21 text bb
3 2018-05-01 22 text cc
4 2018-04-01 23 text dd
5.2 left_join()
```{r}
df_1 %>%
left_join(df_2) # duplicate problem
# solution
left_join(df_1 %>% group_by(dates) %>% mutate(id = row_number()),
df_2 %>% group_by(dates) %>% mutate(id = row_number()),
by = c("dates", "id"))
df_1 %>%
group_by(dates) %>%
mutate(id = row_number()) %>%
left_join(df_2 %>% group_by(dates) %>% mutate(id = row_number()),
by = c("dates", "id"))
# even though I didn't give "by" argument, I get the identical outcome.
left_join(df_1 %>% group_by(dates) %>% mutate(id = row_number()),
df_2 %>% group_by(dates) %>% mutate(id = row_number()))
```
# A tibble: 4 × 5
dates x1 text1 x2 text2
<date> <int> <chr> <int> <chr>
1 2018-07-01 10 text a 20 text aa
2 2018-06-01 11 text b 21 text bb
3 2018-06-01 12 text c 21 text bb
4 2018-05-01 13 text d 22 text cc
# A tibble: 4 × 6
# Groups: dates [3]
dates x1 text1 id x2 text2
<date> <int> <chr> <int> <int> <chr>
1 2018-07-01 10 text a 1 20 text aa
2 2018-06-01 11 text b 1 21 text bb
3 2018-06-01 12 text c 2 NA <NA>
4 2018-05-01 13 text d 1 22 text cc
# A tibble: 4 × 6
# Groups: dates [3]
dates x1 text1 id x2 text2
<date> <int> <chr> <int> <int> <chr>
1 2018-07-01 10 text a 1 20 text aa
2 2018-06-01 11 text b 1 21 text bb
3 2018-06-01 12 text c 2 NA <NA>
4 2018-05-01 13 text d 1 22 text cc
# A tibble: 4 × 6
# Groups: dates [3]
dates x1 text1 id x2 text2
<date> <int> <chr> <int> <int> <chr>
1 2018-07-01 10 text a 1 20 text aa
2 2018-06-01 11 text b 1 21 text bb
3 2018-06-01 12 text c 2 NA <NA>
4 2018-05-01 13 text d 1 22 text cc
5.3 full_join()
```{r}
df_1 %>%
full_join(df_2) # duplicate problem
full_join(df_1 %>% group_by(dates) %>% mutate(id = row_number()),
df_2 %>% group_by(dates) %>% mutate(id = row_number()),
by = c("dates", "id"))
full_join(df_1 %>% group_by(dates) %>% mutate(id = row_number()),
df_2 %>% group_by(dates) %>% mutate(id = row_number()))
```
# A tibble: 5 × 5
dates x1 text1 x2 text2
<date> <int> <chr> <int> <chr>
1 2018-07-01 10 text a 20 text aa
2 2018-06-01 11 text b 21 text bb
3 2018-06-01 12 text c 21 text bb
4 2018-05-01 13 text d 22 text cc
5 2018-04-01 NA <NA> 23 text dd
# A tibble: 5 × 6
# Groups: dates [4]
dates x1 text1 id x2 text2
<date> <int> <chr> <int> <int> <chr>
1 2018-07-01 10 text a 1 20 text aa
2 2018-06-01 11 text b 1 21 text bb
3 2018-06-01 12 text c 2 NA <NA>
4 2018-05-01 13 text d 1 22 text cc
5 2018-04-01 NA <NA> 1 23 text dd
# A tibble: 5 × 6
# Groups: dates [4]
dates x1 text1 id x2 text2
<date> <int> <chr> <int> <int> <chr>
1 2018-07-01 10 text a 1 20 text aa
2 2018-06-01 11 text b 1 21 text bb
3 2018-06-01 12 text c 2 NA <NA>
4 2018-05-01 13 text d 1 22 text cc
5 2018-04-01 NA <NA> 1 23 text dd
6 Column-wise operations with across()
6.1 Data
```{r}
#data() # list all data available
data(starwars) # Loads specified data sets
starwars # print the data set
?starwars # review starwars characters dataset from dplyr
glimpse(starwars)
```
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
6.2 Introduction
Column-wise operations with across()
- dplyr has had more capability for functions for operating on columns than on rows but it was not always convenient.
- If you want to perform the same operation on multiple columns, copying and pasting could be tedious and error prone:
```{r}
df %>%
group_by(g1, g2) %>%
summarise(a = mean(a), b = mean(b), c = mean(c), d = mean(d)) # the same as below
```
- We can now use the across() function to write this kind of operation more succinctly and transparently:
across syntax
- across() provides new functionality while replacing older functions such as
mutate_if()
ormutate_at()
6.3 Basic Usage
across()
is only used inside other functions, e.g.,summarize()
ormutate()
across() doesn’t work with select() or rename() because they already use tidy select syntax;
Like group_by() and rowwise(), it does not change the data itself but changes how other functions operate on the data.
across() has two primary arguments:
.cols
= selects the columns you want to manipulate (notice the period at the beginning).It uses tidy selection (like select()) so you can pick variables by position, name, and type.
The default is .cols = everything() so all columns are selected
-
.fns
= , is a function (or a list of functions) you want to apply to each column (.again, note the period at the beginning of the argument name)- Examples:
.fns = mean
or.fns = max
- This can also be a purrr style formula like
~ .x / 2.
- The .x is the a pronoun for the columns that get passed to the function
- This argument is optional so you can omit it to leave the data untransformed
- Examples:
6.3.1 Using across() with summarize()
- Here are a couple of examples of across() in conjunction with its favorite verb, summarize().
- We use
where()
inside the across() to select only those columns of the desired type -
.fns=
can take a single function - You can add additional arguments to be passed to the function, e.g.,
na.rm = TRUE
.
```{r}
skimr::skim(starwars)
starwars %>%
summarize(height_mean = mean(height, na.rm = TRUE),
mass_mean = mean(mass, na.rm = TRUE)) # too much typing
starwars %>%
summarize(across(.cols = height:mass, .fns = mean, na.rm = TRUE)) # this is better
```
Name | starwars |
Number of rows | 87 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 8 |
list | 3 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 3 | 21 | 0 | 87 | 0 |
hair_color | 5 | 0.94 | 4 | 13 | 0 | 11 | 0 |
skin_color | 0 | 1.00 | 3 | 19 | 0 | 31 | 0 |
eye_color | 0 | 1.00 | 3 | 13 | 0 | 15 | 0 |
sex | 4 | 0.95 | 4 | 14 | 0 | 4 | 0 |
gender | 4 | 0.95 | 8 | 9 | 0 | 2 | 0 |
homeworld | 10 | 0.89 | 4 | 14 | 0 | 48 | 0 |
species | 4 | 0.95 | 3 | 14 | 0 | 37 | 0 |
Variable type: list
skim_variable | n_missing | complete_rate | n_unique | min_length | max_length |
---|---|---|---|---|---|
films | 0 | 1 | 24 | 1 | 7 |
vehicles | 0 | 1 | 11 | 0 | 2 |
starships | 0 | 1 | 16 | 0 | 5 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
height | 6 | 0.93 | 174.60 | 34.77 | 66 | 167.0 | 180 | 191.0 | 264 | ▂▁▇▅▁ |
mass | 28 | 0.68 | 97.31 | 169.46 | 15 | 55.6 | 79 | 84.5 | 1358 | ▇▁▁▁▁ |
birth_year | 44 | 0.49 | 87.57 | 154.69 | 8 | 35.0 | 52 | 72.0 | 896 | ▇▁▁▁▁ |
# A tibble: 1 × 2
height_mean mass_mean
<dbl> <dbl>
1 175. 97.3
# A tibble: 1 × 2
height mass
<dbl> <dbl>
1 175. 97.3
6.3.2 Using summarize()/where(is.numeric) or where(is.character)
-
where()
is from {tidyselect}
- This selection helper selects the variables for which a function returns TRUE.
- is.numeric() is from base R
-
is.numeric
is a more general test of an object being interpretable as numbers.
6.3.2.1 Numeric vectors
```{r}
starwars %>%
summarise(across(where(is.numeric),
.fns = median)) # NA's because of missing values
starwars %>%
summarise(across(.cols = where(is.numeric), #.cols can be omitted
.fns = median, # .fns can be omitted
na.rm = TRUE))
starwars %>%
summarise(across(where(is.numeric),
median,
na.rm = TRUE))
```
# A tibble: 1 × 3
height mass birth_year
<int> <dbl> <dbl>
1 NA NA NA
# A tibble: 1 × 3
height mass birth_year
<int> <dbl> <dbl>
1 180 79 52
# A tibble: 1 × 3
height mass birth_year
<int> <dbl> <dbl>
1 180 79 52
6.3.2.2 Formula approach
- The formula approach gives us the ability to combine functions with arguments.
- Using this, you start with the
~
to signify “as a function of” and then put wherever your column name would normally as a .x.
```{r}
starwars %>%
summarise(across(where(is.numeric), min, na.rm = TRUE)) # the same as below
starwars %>%
summarise(across(where(is.numeric),
~ min(.x, na.rm = TRUE)))
starwars %>%
summarise(across(where(is.numeric),
~ median(.x, na.rm = TRUE)))
starwars %>%
group_by(homeworld) %>%
filter(n() > 1) %>%
summarise(across(where(is.numeric),
~ max(.x, na.rm = TRUE))) # same as below
starwars %>%
group_by(homeworld) %>%
filter(n() > 1) %>%
summarise(across(where(is.numeric),
max, na.rm = TRUE))
```
# A tibble: 1 × 3
height mass birth_year
<int> <dbl> <dbl>
1 66 15 8
# A tibble: 1 × 3
height mass birth_year
<int> <dbl> <dbl>
1 66 15 8
# A tibble: 1 × 3
height mass birth_year
<int> <dbl> <dbl>
1 180 79 52
# A tibble: 10 × 4
homeworld height mass birth_year
<chr> <int> <dbl> <dbl>
1 Alderaan 191 79 67
2 Corellia 180 80 29
3 Coruscant 184 50 91
4 Kamino 229 88 31.5
5 Kashyyyk 234 136 200
6 Mirial 170 56.2 58
7 Naboo 224 85 82
8 Ryloth 180 55 48
9 Tatooine 202 136 112
10 <NA> 200 140 896
# A tibble: 10 × 4
homeworld height mass birth_year
<chr> <int> <dbl> <dbl>
1 Alderaan 191 79 67
2 Corellia 180 80 29
3 Coruscant 184 50 91
4 Kamino 229 88 31.5
5 Kashyyyk 234 136 200
6 Mirial 170 56.2 58
7 Naboo 224 85 82
8 Ryloth 180 55 48
9 Tatooine 202 136 112
10 <NA> 200 140 896
6.3.3 Using character vectors: where
, is.character
, length()
Task: Count unique values for character variables using formula style
- is.character() is from base R
- Create or test for objects of type “character”.
```{r}
# find the number of unique values in character variables
summary(starwars) # ugly output due to the fact that there are lists.
class(starwars)
skimr::skim(starwars) # can tell you the answer: 12 unique value, not counting NA
unique(starwars$hair_color) # unique() is a base R function: the same as below
starwars %>%
pull(hair_color) %>%
unique()
# testing with one variable first
starwars %>%
summarize(unique(hair_color))
starwars %>%
summarize(length(unique(hair_color))
)
# Applying to all variables
starwars %>%
summarize(across(where(is.character),
~ length(unique(.x))
)
)
```
name height mass hair_color
Length:87 Min. : 66.0 Min. : 15.00 Length:87
Class :character 1st Qu.:167.0 1st Qu.: 55.60 Class :character
Mode :character Median :180.0 Median : 79.00 Mode :character
Mean :174.6 Mean : 97.31
3rd Qu.:191.0 3rd Qu.: 84.50
Max. :264.0 Max. :1358.00
NA's :6 NA's :28
skin_color eye_color birth_year sex
Length:87 Length:87 Min. : 8.00 Length:87
Class :character Class :character 1st Qu.: 35.00 Class :character
Mode :character Mode :character Median : 52.00 Mode :character
Mean : 87.57
3rd Qu.: 72.00
Max. :896.00
NA's :44
gender homeworld species
Length:87 Length:87 Length:87
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
films.Length films.Class films.Mode
5 -none- character
6 -none- character
7 -none- character
4 -none- character
5 -none- character
3 -none- character
3 -none- character
1 -none- character
1 -none- character
6 -none- character
3 -none- character
2 -none- character
5 -none- character
4 -none- character
1 -none- character
3 -none- character
3 -none- character
1 -none- character
5 -none- character
5 -none- character
3 -none- character
1 -none- character
1 -none- character
2 -none- character
1 -none- character
2 -none- character
1 -none- character
1 -none- character
1 -none- character
1 -none- character
1 -none- character
3 -none- character
1 -none- character
3 -none- character
2 -none- character
1 -none- character
1 -none- character
1 -none- character
2 -none- character
1 -none- character
1 -none- character
2 -none- character
1 -none- character
1 -none- character
3 -none- character
1 -none- character
1 -none- character
1 -none- character
1 -none- character
3 -none- character
3 -none- character
3 -none- character
2 -none- character
2 -none- character
2 -none- character
1 -none- character
3 -none- character
2 -none- character
1 -none- character
1 -none- character
1 -none- character
2 -none- character
2 -none- character
1 -none- character
1 -none- character
2 -none- character
2 -none- character
1 -none- character
1 -none- character
1 -none- character
1 -none- character
1 -none- character
1 -none- character
2 -none- character
1 -none- character
1 -none- character
2 -none- character
1 -none- character
1 -none- character
2 -none- character
2 -none- character
1 -none- character
1 -none- character
1 -none- character
1 -none- character
1 -none- character
1 -none- character
vehicles.Length vehicles.Class vehicles.Mode
2 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
2 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
starships.Length starships.Class starships.Mode
2 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
5 -none- character
3 -none- character
0 -none- character
2 -none- character
2 -none- character
0 -none- character
0 -none- character
1 -none- character
1 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
3 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
0 -none- character
1 -none- character
0 -none- character
0 -none- character
[1] "tbl_df" "tbl" "data.frame"
Name | starwars |
Number of rows | 87 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 8 |
list | 3 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 3 | 21 | 0 | 87 | 0 |
hair_color | 5 | 0.94 | 4 | 13 | 0 | 11 | 0 |
skin_color | 0 | 1.00 | 3 | 19 | 0 | 31 | 0 |
eye_color | 0 | 1.00 | 3 | 13 | 0 | 15 | 0 |
sex | 4 | 0.95 | 4 | 14 | 0 | 4 | 0 |
gender | 4 | 0.95 | 8 | 9 | 0 | 2 | 0 |
homeworld | 10 | 0.89 | 4 | 14 | 0 | 48 | 0 |
species | 4 | 0.95 | 3 | 14 | 0 | 37 | 0 |
Variable type: list
skim_variable | n_missing | complete_rate | n_unique | min_length | max_length |
---|---|---|---|---|---|
films | 0 | 1 | 24 | 1 | 7 |
vehicles | 0 | 1 | 11 | 0 | 2 |
starships | 0 | 1 | 16 | 0 | 5 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
height | 6 | 0.93 | 174.60 | 34.77 | 66 | 167.0 | 180 | 191.0 | 264 | ▂▁▇▅▁ |
mass | 28 | 0.68 | 97.31 | 169.46 | 15 | 55.6 | 79 | 84.5 | 1358 | ▇▁▁▁▁ |
birth_year | 44 | 0.49 | 87.57 | 154.69 | 8 | 35.0 | 52 | 72.0 | 896 | ▇▁▁▁▁ |
[1] "blond" NA "none" "brown"
[5] "brown, grey" "black" "auburn, white" "auburn, grey"
[9] "white" "grey" "auburn" "blonde"
[1] "blond" NA "none" "brown"
[5] "brown, grey" "black" "auburn, white" "auburn, grey"
[9] "white" "grey" "auburn" "blonde"
# A tibble: 12 × 1
`unique(hair_color)`
<chr>
1 blond
2 <NA>
3 none
4 brown
5 brown, grey
6 black
7 auburn, white
8 auburn, grey
9 white
10 grey
11 auburn
12 blonde
# A tibble: 1 × 1
`length(unique(hair_color))`
<int>
1 12
# A tibble: 1 × 8
name hair_color skin_color eye_color sex gender homeworld species
<int> <int> <int> <int> <int> <int> <int> <int>
1 87 12 31 15 5 3 49 38
6.3.4 Example: multiple summaries
Example with group by species and filter for groups with >1 row and adding a summary that is not inside across to count the rows in each group
```{r}
starwars %>%
group_by(species) %>%
filter(n() > 1) %>%
summarise(across(c(sex, gender, homeworld),
~ length(unique(.x))),
n = n())
# what happens if we add n = n() first.
# The only difference: n comes after species and before other variables.
starwars %>%
group_by(species) %>%
filter(n() > 1) %>%
summarise(n = n(),
across(c(sex, gender, homeworld),
~ length(unique(.x))))
# cf.
starwars %>%
filter(n() > 1) %>%
count(species, sex, gender, homeworld)
```
# A tibble: 9 × 5
species sex gender homeworld n
<chr> <int> <int> <int> <int>
1 Droid 1 2 3 6
2 Gungan 1 1 1 3
3 Human 2 2 15 35
4 Kaminoan 2 2 1 2
5 Mirialan 1 1 1 2
6 Twi'lek 2 2 1 2
7 Wookiee 1 1 1 2
8 Zabrak 1 1 2 2
9 <NA> 1 1 3 4
# A tibble: 9 × 5
species n sex gender homeworld
<chr> <int> <int> <int> <int>
1 Droid 6 1 2 3
2 Gungan 3 1 1 1
3 Human 35 2 2 15
4 Kaminoan 2 2 2 1
5 Mirialan 2 1 1 1
6 Twi'lek 2 2 2 1
7 Wookiee 2 1 1 1
8 Zabrak 2 1 1 2
9 <NA> 4 1 1 3
# A tibble: 65 × 5
species sex gender homeworld n
<chr> <chr> <chr> <chr> <int>
1 Aleena male masculine Aleen Minor 1
2 Besalisk male masculine Ojom 1
3 Cerean male masculine Cerea 1
4 Chagrian male masculine Champala 1
5 Clawdite female feminine Zolan 1
6 Droid none feminine <NA> 1
7 Droid none masculine Naboo 1
8 Droid none masculine Tatooine 2
9 Droid none masculine <NA> 2
10 Dug male masculine Malastare 1
# ℹ 55 more rows
6.3.5 across(), not selecting grouping variables
Because across() is usually used in combination with summarize() and mutate(), it doesn’t select grouping variables to avoid accidentally modifying them:
Example where the grouping variable g is not selected, even though it is numeric, so is not summed.
```{r}
across <- data.frame(g = c(1, 1, 2), x = c(-1, 1, 3), y = c(-1, -4, -9))
across
across %>%
group_by(g) %>%
summarize(across(where(is.numeric), sum)
) # the same as below
across %>%
group_by(g) %>%
summarize(across(where(is.numeric), ~ sum(.x))
)
```
g x y
1 1 -1 -1
2 1 1 -4
3 2 3 -9
# A tibble: 2 × 3
g x y
<dbl> <dbl> <dbl>
1 1 0 -5
2 2 3 -9
# A tibble: 2 × 3
g x y
<dbl> <dbl> <dbl>
1 1 0 -5
2 2 3 -9
6.4 Using across() with Multiple Functions
- You can transform each variable with more than one function
- Supply a named list of functions in the second argument:
```{r}
# Example with no list but argument
starwars %>%
summarise(across(where(is.numeric),
.fns = list(median, mean), na.rm = TRUE)
)
# Example with list and argument
starwars %>%
summarize(across(where(is.numeric),
list(min = min, max = max, mean = mean), # column = function
na.rm = TRUE))
starwars %>%
summarise(across(where(is.numeric),
.fns = list(Median = median, Mean = mean), na.rm = TRUE) # Median is the name, while mean is the function. names always precede functions.
)
```
# A tibble: 1 × 6
height_1 height_2 mass_1 mass_2 birth_year_1 birth_year_2
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 180 175. 79 97.3 52 87.6
# A tibble: 1 × 9
height_min height_max height_mean mass_min mass_max mass_mean birth_year_min
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 66 264 175. 15 1358 97.3 8
# ℹ 2 more variables: birth_year_max <dbl>, birth_year_mean <dbl>
# A tibble: 1 × 6
height_Median height_Mean mass_Median mass_Mean birth_year_Median
<int> <dbl> <dbl> <dbl> <dbl>
1 180 175. 79 97.3 52
# ℹ 1 more variable: birth_year_Mean <dbl>
6.5 Controlling names with glue()
In the last case where we wanted to create a median and mean across all numeric variables, you’ll notice that the new variable names were always {name_of_variable}_{mean/median}. This is the default behavior of across - the name, an underscore, and the name of the function from the named list provided.
To customize how things are named, you can use glue syntax (we will cover what glue is in a few weeks) and the .names argument.
{.fn} will refer to the name of the function you used
{.col} will refer to the name of the column
```{r}
#change default, {.col}_{.fn} to {.fn}_{.col}
starwars %>%
summarise(across(.cols = where(is.numeric),
.fns = list(Median = median, Mean = mean),
na.rm = TRUE,
.names = "{.fn}_{.col}") # default: "{.col}_{.fn}" when multiple function is used
)
#change to greater customization
starwars %>%
summarise(across(where(is.numeric),
.fns = list(Median = median, Mean = mean),
na.rm = TRUE,
.names = "The {.fn} of {.col}"))
```
# A tibble: 1 × 6
Median_height Mean_height Median_mass Mean_mass Median_birth_year
<int> <dbl> <dbl> <dbl> <dbl>
1 180 175. 79 97.3 52
# ℹ 1 more variable: Mean_birth_year <dbl>
# A tibble: 1 × 6
`The Median of height` `The Mean of height` `The Median of mass`
<int> <dbl> <dbl>
1 180 175. 79
# ℹ 3 more variables: `The Mean of mass` <dbl>,
# `The Median of birth_year` <dbl>, `The Mean of birth_year` <dbl>
6.6 Order matters
- Be careful when combining numeric summaries with
is.numeric()
:
```{r}
number <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))
number
# Here n became NA because n is numeric, so the across() computes its standard deviation, and
# the standard deviation of 3 (a constant) is NA.
number %>%
summarise(n = n(),
across(where(is.numeric), sd))
# You probably want to compute n() last to avoid this problem:
number %>%
summarise(across(where(is.numeric), sd),
n = n())
```
x y
1 1 1
2 2 4
3 3 9
n x y
1 NA 1 4.041452
x y n
1 1 4.041452 3
6.7 Examples
6.7.1 Example 1
- Find all rows where no variable has missing values:
```{r}
starwars %>%
filter(across(everything(), ~ !is.na(.x))) # we cannot do without "~" since we have "!"
```
# A tibble: 29 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 Darth V… 202 136 none white yellow 41.9 male mascu…
3 Leia Or… 150 49 brown light brown 19 fema… femin…
4 Owen La… 178 120 brown, gr… light blue 52 male mascu…
5 Beru Wh… 165 75 brown light blue 47 fema… femin…
6 Biggs D… 183 84 black light brown 24 male mascu…
7 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
8 Anakin … 188 84 blond fair blue 41.9 male mascu…
9 Chewbac… 228 112 brown unknown blue 200 male mascu…
10 Han Solo 180 80 brown fair brown 29 male mascu…
# ℹ 19 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
6.7.2 Example 2
- Find all distinct rows for variables with the word color in the name
- For some verbs, like group_by(), count() and distinct(), you can omit the summary functions:
# A tibble: 67 × 3
hair_color skin_color eye_color
<chr> <chr> <chr>
1 auburn fair blue
2 auburn, grey fair blue
3 auburn, white fair blue-gray
4 black blue, grey yellow
5 black brown brown
6 black dark brown
7 black dark dark
8 black fair brown
9 black light brown
10 black tan brown
# ℹ 57 more rows
6.7.3 Example 3
- Count all combinations of variables with a given pattern:
```{r}
starwars %>%
count(across(contains("color")), sort = TRUE)
starwars %>%
count(across(contains("color")), sort = TRUE) %>%
arrange(hair_color, skin_color)
starwars %>%
count(across(contains("color"))) %>%
arrange(hair_color, skin_color)
```
# A tibble: 67 × 4
hair_color skin_color eye_color n
<chr> <chr> <chr> <int>
1 brown light brown 6
2 brown fair blue 4
3 none grey black 4
4 black dark brown 3
5 blond fair blue 3
6 black fair brown 2
7 black tan brown 2
8 black yellow blue 2
9 brown fair brown 2
10 none white yellow 2
# ℹ 57 more rows
# A tibble: 67 × 4
hair_color skin_color eye_color n
<chr> <chr> <chr> <int>
1 auburn fair blue 1
2 auburn, grey fair blue 1
3 auburn, white fair blue-gray 1
4 black blue, grey yellow 1
5 black brown brown 1
6 black dark brown 3
7 black dark dark 1
8 black fair brown 2
9 black light brown 1
10 black tan brown 2
# ℹ 57 more rows
# A tibble: 67 × 4
hair_color skin_color eye_color n
<chr> <chr> <chr> <int>
1 auburn fair blue 1
2 auburn, grey fair blue 1
3 auburn, white fair blue-gray 1
4 black blue, grey yellow 1
5 black brown brown 1
6 black dark brown 3
7 black dark dark 1
8 black fair brown 2
9 black light brown 1
10 black tan brown 2
# ℹ 57 more rows
6.8 Exercise
6.8.1 median
- Calculate the median for each numeric variable for each species and gender
```{r}
glimpse(starwars)
starwars %>%
group_by(species, gender) %>%
summarise(across(where(is.numeric), ~ median(.x, na.rm = TRUE))
)
```
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
# A tibble: 42 × 5
# Groups: species [38]
species gender height mass birth_year
<chr> <chr> <dbl> <dbl> <dbl>
1 Aleena masculine 79 15 NA
2 Besalisk masculine 198 102 NA
3 Cerean masculine 198 82 92
4 Chagrian masculine 196 NA NA
5 Clawdite feminine 168 55 NA
6 Droid feminine 96 NA NA
7 Droid masculine 132 53.5 33
8 Dug masculine 112 40 NA
9 Ewok masculine 88 20 8
10 Geonosian masculine 183 80 NA
# ℹ 32 more rows
6.8.2 min & max, count, sort
Calculate the min and max for each numeric variable other than birth year for each species and gender and count how many are in each group and sort from largest to smallest count.
```{r}
starwars %>%
group_by(species, gender) %>%
summarise(across(where(is.numeric) & !birth_year,
.fns = list(min = min, max = max),
na.rm = TRUE,
.names = "{.fn}_{.col}"),
n = n()) %>%
arrange(desc(n))
# same as above
starwars %>%
group_by(species, gender) %>%
summarise(across(where(is.numeric) & !birth_year,
.fns = list(min = ~ min(.x, na.rm = TRUE),
max = ~ max(.x, na.rm = TRUE))),
n = n()) %>%
arrange(desc(n)) %>%
ungroup()
```
# A tibble: 42 × 7
# Groups: species [38]
species gender min_height max_height min_mass max_mass n
<chr> <chr> <int> <int> <dbl> <dbl> <int>
1 Human masculine 170 202 75 136 26
2 Human feminine 150 185 45 75 9
3 Droid masculine 96 200 32 140 5
4 <NA> <NA> 157 185 48 110 4
5 Gungan masculine 196 224 66 82 3
6 Mirialan feminine 166 170 50 56.2 2
7 Wookiee masculine 228 234 112 136 2
8 Zabrak masculine 171 175 80 80 2
9 Aleena masculine 79 79 15 15 1
10 Besalisk masculine 198 198 102 102 1
# ℹ 32 more rows
# A tibble: 42 × 7
species gender height_min height_max mass_min mass_max n
<chr> <chr> <int> <int> <dbl> <dbl> <int>
1 Human masculine 170 202 75 136 26
2 Human feminine 150 185 45 75 9
3 Droid masculine 96 200 32 140 5
4 <NA> <NA> 157 185 48 110 4
5 Gungan masculine 196 224 66 82 3
6 Mirialan feminine 166 170 50 56.2 2
7 Wookiee masculine 228 234 112 136 2
8 Zabrak masculine 171 175 80 80 2
9 Aleena masculine 79 79 15 15 1
10 Besalisk masculine 198 198 102 102 1
# ℹ 32 more rows
7 case_when()
This function allows you to vectorize (and replace) multiple if_else() statements in a succinct and clear manner.
The syntax is case_when(…)
-
The dot dot dots are a placeholder for a series of two-side formulas
- The Left hand side (LHS) determines which variables match a given case - this must return a logical vector
- The Right hand side (RHS) provides the new or replacement value and all have to be of the same type of vector
- Both LHS and RHS may be of length either 1 or n
- you always end with a case of TRUE for when all of the other cases are FALSE
Example of a vectorized if - else
```{r}
x <- 1:50
case_when(
x < 5 ~ "less than 5", # start with specific cases first
x < 10 ~ "less than 10",
TRUE ~ as.character(x)
)
```
[1] "less than 5" "less than 5" "less than 5" "less than 5" "less than 10"
[6] "less than 10" "less than 10" "less than 10" "less than 10" "10"
[11] "11" "12" "13" "14" "15"
[16] "16" "17" "18" "19" "20"
[21] "21" "22" "23" "24" "25"
[26] "26" "27" "28" "29" "30"
[31] "31" "32" "33" "34" "35"
[36] "36" "37" "38" "39" "40"
[41] "41" "42" "43" "44" "45"
[46] "46" "47" "48" "49" "50"
7.1 fizzbuz problem
```{r}
case_when(
x %% 15 == 0 ~ "fizz buzz",
x %% 3 == 0 ~ "fizz",
x %% 5 == 0 ~ "buzz",
TRUE ~ as.character(x)
)
```
[1] "1" "2" "fizz" "4" "buzz" "fizz"
[7] "7" "8" "fizz" "buzz" "11" "fizz"
[13] "13" "14" "fizz buzz" "16" "17" "fizz"
[19] "19" "buzz" "fizz" "22" "23" "fizz"
[25] "buzz" "26" "fizz" "28" "29" "fizz buzz"
[31] "31" "32" "fizz" "34" "buzz" "fizz"
[37] "37" "38" "fizz" "buzz" "41" "fizz"
[43] "43" "44" "fizz buzz" "46" "47" "fizz"
[49] "49" "buzz"
7.2 Caveat
- Like an if statement, the arguments are evaluated in order, so you must proceed from the most specific to the most general.
- This won’t work:
```{r}
case_when(
x %% 5 == 0 ~ "fizz",
x %% 7 == 0 ~ "buzz",
x %% 35 == 0 ~ "fizz buzz", # notice that for 35, fizz was printed instead of fizz buzz.
TRUE ~ as.character(x),
)
```
[1] "1" "2" "3" "4" "fizz" "6" "buzz" "8" "9" "fizz"
[11] "11" "12" "13" "buzz" "fizz" "16" "17" "18" "19" "fizz"
[21] "buzz" "22" "23" "24" "fizz" "26" "27" "buzz" "29" "fizz"
[31] "31" "32" "33" "34" "fizz" "36" "37" "38" "39" "fizz"
[41] "41" "buzz" "43" "44" "fizz" "46" "47" "48" "buzz" "fizz"
7.3 with mutate()
` - case_when() is particularly useful inside mutate() when you want to create a new variable that relies on a complex combination of existing variables
```{r}
starwars %>%
select(name:mass, gender, species) %>%
mutate(height_cat = case_when(height > 191 ~ "tall",
height < 167 ~ "short",
TRUE ~ "average") )
# using quantile()
starwars %>%
select(name:mass, gender, species) %>%
mutate(height_cat = case_when(height > quantile(height, 3/4, na.rm = TRUE) ~ "tall",
height < quantile(height, 1/4, na.rm = TRUE) ~ "short",
TRUE ~ "average") )
# But this can be even more complicated...
starwars %>%
select(name:mass, gender, species) %>%
mutate(type = case_when(height >= 200 | mass >= 200 ~ "large",
species == "Droid" ~ "robot",
TRUE ~ "other" ) )
```
# A tibble: 87 × 6
name height mass gender species height_cat
<chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 masculine Human average
2 C-3PO 167 75 masculine Droid average
3 R2-D2 96 32 masculine Droid short
4 Darth Vader 202 136 masculine Human tall
5 Leia Organa 150 49 feminine Human short
6 Owen Lars 178 120 masculine Human average
7 Beru Whitesun Lars 165 75 feminine Human short
8 R5-D4 97 32 masculine Droid short
9 Biggs Darklighter 183 84 masculine Human average
10 Obi-Wan Kenobi 182 77 masculine Human average
# ℹ 77 more rows
# A tibble: 87 × 6
name height mass gender species height_cat
<chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 masculine Human average
2 C-3PO 167 75 masculine Droid average
3 R2-D2 96 32 masculine Droid short
4 Darth Vader 202 136 masculine Human tall
5 Leia Organa 150 49 feminine Human short
6 Owen Lars 178 120 masculine Human average
7 Beru Whitesun Lars 165 75 feminine Human short
8 R5-D4 97 32 masculine Droid short
9 Biggs Darklighter 183 84 masculine Human average
10 Obi-Wan Kenobi 182 77 masculine Human average
# ℹ 77 more rows
# A tibble: 87 × 6
name height mass gender species type
<chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 masculine Human other
2 C-3PO 167 75 masculine Droid robot
3 R2-D2 96 32 masculine Droid robot
4 Darth Vader 202 136 masculine Human large
5 Leia Organa 150 49 feminine Human other
6 Owen Lars 178 120 masculine Human other
7 Beru Whitesun Lars 165 75 feminine Human other
8 R5-D4 97 32 masculine Droid robot
9 Biggs Darklighter 183 84 masculine Human other
10 Obi-Wan Kenobi 182 77 masculine Human other
# ℹ 77 more rows
8 rownames_to_column()
tibble::rownames_to_columns() - You many occasionally see data sets where there are row names. - This is allowed but not common with data frames as row names are removed when using […] - Tidy data (a tibble) does not use row_names so they are stripped when coerced to a tibble - Generally, it is best to avoid row names, because they are basically a character column with different semantics than every other column. - To detect if a data frame has row_names use has_rownames()
```{r}
head(state.x77)
#view(state.x77)
class(state.x77)
str(state.x77)
has_rownames(state.x77) # doesn't recognize rownames since the data is matrix.
has_rownames(as_tibble(state.x77)) # doesn't recognize it either since tibble removes the row names.
has_rownames(as.data.frame(state.x77)) # yes it does.
state.x77 %>%
as.data.frame() %>%
rownames_to_column("state") %>%
#str() %>%
as_tibble()
```
Population Income Illiteracy Life Exp Murder HS Grad Frost Area
Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
California 21198 5114 1.1 71.71 10.3 62.6 20 156361
Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
[1] "matrix" "array"
num [1:50, 1:8] 3615 365 2212 2110 21198 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
[1] FALSE
[1] FALSE
[1] TRUE
# A tibble: 50 × 9
state Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alabama 3615 3624 2.1 69.0 15.1 41.3 20 50708
2 Alaska 365 6315 1.5 69.3 11.3 66.7 152 566432
3 Arizona 2212 4530 1.8 70.6 7.8 58.1 15 113417
4 Arkans… 2110 3378 1.9 70.7 10.1 39.9 65 51945
5 Califo… 21198 5114 1.1 71.7 10.3 62.6 20 156361
6 Colora… 2541 4884 0.7 72.1 6.8 63.9 166 103766
7 Connec… 3100 5348 1.1 72.5 3.1 56 139 4862
8 Delawa… 579 4809 0.9 70.1 6.2 54.6 103 1982
9 Florida 8277 4815 1.3 70.7 10.7 52.6 11 54090
10 Georgia 4931 4091 2 68.5 13.9 40.6 60 58073
# ℹ 40 more rows
8.1 Exercise
Check if the mtcars data set has row names. If so, convert the rownames to a column named car and convert to a tibble ::: {.cell}
```{r}
head(mtcars)
class(mtcars)
has_rownames(mtcars)
mtcars %>%
rownames_to_column("car") %>%
as_tibble()
```
# A tibble: 6 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
[1] "tbl_df" "tbl" "data.frame"
[1] FALSE
# A tibble: 32 × 12
car mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
:::
9 distinct()
-
distinct()
subsets only unique/distinct rows from a data frame. - Rows are a subset of the input but appear in the same order.
- Columns are not modified by default
- Groups are not modified.
- Data frame attributes are preserved.
```{r}
starwars %>%
distinct(homeworld)
starwars %>%
distinct(homeworld, species) # this is the same as below
starwars %>%
count(homeworld, species) # it has n additionally, so it can be more useful
library(gapminder)
data("gapminder")
# how many countries are in each continent?
gapminder %>%
count(continent) # incorrect as there are 12 years of data for each country
gapminder %>%
count(country, continent) # still incorrect
gapminder %>%
distinct(country, continent) %>%
count(continent) # correct
```
# A tibble: 49 × 1
homeworld
<chr>
1 Tatooine
2 Naboo
3 Alderaan
4 Stewjon
5 Eriadu
6 Kashyyyk
7 Corellia
8 Rodia
9 Nal Hutta
10 Bestine IV
# ℹ 39 more rows
# A tibble: 57 × 2
homeworld species
<chr> <chr>
1 Tatooine Human
2 Tatooine Droid
3 Naboo Droid
4 Alderaan Human
5 Stewjon Human
6 Eriadu Human
7 Kashyyyk Wookiee
8 Corellia Human
9 Rodia Rodian
10 Nal Hutta Hutt
# ℹ 47 more rows
# A tibble: 57 × 3
homeworld species n
<chr> <chr> <int>
1 Alderaan Human 3
2 Aleen Minor Aleena 1
3 Bespin Human 1
4 Bestine IV <NA> 1
5 Cato Neimoidia Neimodian 1
6 Cerea Cerean 1
7 Champala Chagrian 1
8 Chandrila Human 1
9 Concord Dawn Human 1
10 Corellia Human 2
# ℹ 47 more rows
# A tibble: 5 × 2
continent n
<fct> <int>
1 Africa 624
2 Americas 300
3 Asia 396
4 Europe 360
5 Oceania 24
# A tibble: 142 × 3
country continent n
<fct> <fct> <int>
1 Afghanistan Asia 12
2 Albania Europe 12
3 Algeria Africa 12
4 Angola Africa 12
5 Argentina Americas 12
6 Australia Oceania 12
7 Austria Europe 12
8 Bahrain Asia 12
9 Bangladesh Asia 12
10 Belgium Europe 12
# ℹ 132 more rows
# A tibble: 5 × 2
continent n
<fct> <int>
1 Africa 52
2 Americas 25
3 Asia 33
4 Europe 30
5 Oceania 2
10 row-wise operation with rowwise()
Before version 1.0, dplyr did not have special capabilities for operating on subsets of rows. You had to use for-loops for operating across rows or subsets of rows.
dplyr 1.0 added the new verb rowwise() to create multiple one-row data frames out of an existing data frame
These row-wise data frames are “virtual” subsets of the original data frame - - You can operate on each subset data frame as if it were its own data frame.
We will discuss a common use case: computing aggregates across multiple columns within a row
10.1 Creating row-wise data frames
- Row-wise operations require a special type of grouping where each group consists of a single row.
- You create this grouping using rowwise()
- Like group_by(), rowwise() doesn’t really do anything itself; it just changes how the other dplyr verbs work.
- For example, compare the results of mutate() in the following code:
```{r}
fruits <- tribble(
~"fruit", ~"height_1", ~"height_2", ~"height_3", ~"width", ~"weight",
"Banana", 4, 4.2, 3.5, 1, 0.5,
"Strawberry", 1, .9, 1.2, 1, .25,
"Pineapple", 18, 17.7, 19.2, 6, 3)
fruits
# mean across all values in all rows
fruits %>%
mutate(height_mean = mean(c(height_1, height_2, height_3)))
# mean across all values in each row
fruits %>%
rowwise(fruit) %>%
mutate(height_mean = mean(c(height_1, height_2, height_3))) # must use c()
fruits %>%
rowwise(fruit) %>%
mutate(height_mean = mean(height_1:height_3)) # Why is the mean different from above?
# Using c_across: See the next section
fruits %>%
rowwise(fruit) %>%
mutate(height_mean = mean(c_across(height_1:height_3))) # correct
fruits %>%
rowwise(fruit) %>%
mutate(height_mean = mean(c_across(contains("height")))) # correct
```
# A tibble: 3 × 6
fruit height_1 height_2 height_3 width weight
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Banana 4 4.2 3.5 1 0.5
2 Strawberry 1 0.9 1.2 1 0.25
3 Pineapple 18 17.7 19.2 6 3
# A tibble: 3 × 7
fruit height_1 height_2 height_3 width weight height_mean
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Banana 4 4.2 3.5 1 0.5 7.74
2 Strawberry 1 0.9 1.2 1 0.25 7.74
3 Pineapple 18 17.7 19.2 6 3 7.74
# A tibble: 3 × 7
# Rowwise: fruit
fruit height_1 height_2 height_3 width weight height_mean
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Banana 4 4.2 3.5 1 0.5 3.9
2 Strawberry 1 0.9 1.2 1 0.25 1.03
3 Pineapple 18 17.7 19.2 6 3 18.3
# A tibble: 3 × 7
# Rowwise: fruit
fruit height_1 height_2 height_3 width weight height_mean
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Banana 4 4.2 3.5 1 0.5 4
2 Strawberry 1 0.9 1.2 1 0.25 1
3 Pineapple 18 17.7 19.2 6 3 18.5
# A tibble: 3 × 7
# Rowwise: fruit
fruit height_1 height_2 height_3 width weight height_mean
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Banana 4 4.2 3.5 1 0.5 3.9
2 Strawberry 1 0.9 1.2 1 0.25 1.03
3 Pineapple 18 17.7 19.2 6 3 18.3
# A tibble: 3 × 7
# Rowwise: fruit
fruit height_1 height_2 height_3 width weight height_mean
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Banana 4 4.2 3.5 1 0.5 3.9
2 Strawberry 1 0.9 1.2 1 0.25 1.03
3 Pineapple 18 17.7 19.2 6 3 18.3
10.2 Per-row Summary Statistics
- dplyr::summarize() makes it really easy to summarize values across the rows within one column.
- We can combine rowwise() and summarize() to make it easy to summarize values across columns within one row.
- Let’s say we want to compute the sum of w, x, y, and z for each row.
- We start by making a row-wise data frame:
- We then use mutate() to add a new column to each row, or.
- Just use summarise() to return the summary:
```{r}
df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)
df
rf <- df %>% rowwise(id)
# mutate to add new column for each row
rf %>% mutate(total = sum(c(w, x, y, z)))
# summarize without mutate
rf %>% summarise(total = sum(c(w, x, y, z)),
.groups= "drop") # the same as ungroup()
```
# A tibble: 6 × 5
id w x y z
<int> <int> <int> <int> <int>
1 1 10 20 30 40
2 2 11 21 31 41
3 3 12 22 32 42
4 4 13 23 33 43
5 5 14 24 34 44
6 6 15 25 35 45
# A tibble: 6 × 6
# Rowwise: id
id w x y z total
<int> <int> <int> <int> <int> <int>
1 1 10 20 30 40 100
2 2 11 21 31 41 104
3 3 12 22 32 42 108
4 4 13 23 33 43 112
5 5 14 24 34 44 116
6 6 15 25 35 45 120
# A tibble: 6 × 2
id total
<int> <int>
1 1 100
2 2 104
3 3 108
4 4 112
5 5 116
6 6 120
10.3 Using c_across
- If you have a lot of variables, you can use c_across() to succinctly select many variables (
c_across()
uses tidy select helpers) - The where(is.numeric()) ensures we only select numeric columns
```{r}
rf %>% summarize(total = sum(across(w:z)), .groups = "drop")
rf %>% summarize(total = sum(c_across(w:z)), .groups = "drop")
df %>%
rowwise(id) %>%
summarize(total = sum(c_across(where(is.numeric))),
.groups = "drop")
rf %>% mutate(total = sum(across(w:z)))
rf %>% mutate(mean = mean(c_across(w:z)))
rf %>% mutate(total = sum(c_across(w:z)))
df %>%
rowwise(id) %>%
mutate(total = sum(c_across(where(is.numeric)))
)
# If we want to use our fruits example...
fruits %>%
rowwise(fruit) %>%
mutate(height_mean = mean(c_across(contains("height"))))
```
# A tibble: 6 × 2
id total
<int> <int>
1 1 100
2 2 104
3 3 108
4 4 112
5 5 116
6 6 120
# A tibble: 6 × 2
id total
<int> <int>
1 1 100
2 2 104
3 3 108
4 4 112
5 5 116
6 6 120
# A tibble: 6 × 2
id total
<int> <int>
1 1 100
2 2 104
3 3 108
4 4 112
5 5 116
6 6 120
# A tibble: 6 × 6
# Rowwise: id
id w x y z total
<int> <int> <int> <int> <int> <int>
1 1 10 20 30 40 100
2 2 11 21 31 41 104
3 3 12 22 32 42 108
4 4 13 23 33 43 112
5 5 14 24 34 44 116
6 6 15 25 35 45 120
# A tibble: 6 × 6
# Rowwise: id
id w x y z mean
<int> <int> <int> <int> <int> <dbl>
1 1 10 20 30 40 25
2 2 11 21 31 41 26
3 3 12 22 32 42 27
4 4 13 23 33 43 28
5 5 14 24 34 44 29
6 6 15 25 35 45 30
# A tibble: 6 × 6
# Rowwise: id
id w x y z total
<int> <int> <int> <int> <int> <int>
1 1 10 20 30 40 100
2 2 11 21 31 41 104
3 3 12 22 32 42 108
4 4 13 23 33 43 112
5 5 14 24 34 44 116
6 6 15 25 35 45 120
# A tibble: 6 × 6
# Rowwise: id
id w x y z total
<int> <int> <int> <int> <int> <int>
1 1 10 20 30 40 100
2 2 11 21 31 41 104
3 3 12 22 32 42 108
4 4 13 23 33 43 112
5 5 14 24 34 44 116
6 6 15 25 35 45 120
# A tibble: 3 × 7
# Rowwise: fruit
fruit height_1 height_2 height_3 width weight height_mean
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Banana 4 4.2 3.5 1 0.5 3.9
2 Strawberry 1 0.9 1.2 1 0.25 1.03
3 Pineapple 18 17.7 19.2 6 3 18.3
10.4 Using both c_across and across
so c_across is a rowwise version of the function we learned earlier, across.
You could combine c_across with column-wise across to compute the proportion of the total for each column:
```{r}
rf %>% #our row-wise data frame
mutate(total = sum(c_across(w:z))) %>% #total each row
ungroup() %>% # ungroup the rows
mutate(across(w:z, ~ .x / total)) # the .x represents each column
```
# A tibble: 6 × 6
id w x y z total
<int> <dbl> <dbl> <dbl> <dbl> <int>
1 1 0.1 0.2 0.3 0.4 100
2 2 0.106 0.202 0.298 0.394 104
3 3 0.111 0.204 0.296 0.389 108
4 4 0.116 0.205 0.295 0.384 112
5 5 0.121 0.207 0.293 0.379 116
6 6 0.125 0.208 0.292 0.375 120
10.5 Exercise
- Let’s create a new variable for the starwars data frame with the maximum of the height, mass, or birth year for each starwars character. Who has the maximum of all the characters?
```{r}
starwars %>%
filter(!is.na(height), !is.na(mass), !is.na(birth_year)) %>%
rowwise() %>%
mutate(max_a = max(height, mass, birth_year, na.rm = TRUE)) %>%
relocate(max_a) %>% # move the variable to the first column
ungroup() %>%
select(name, where(is.numeric)) %>%
slice_max(max_a) # slice() lets you index rows by their (integer) locations.
# alt.
starwars %>%
filter(!is.na(height), !is.na(mass)) %>%
rowwise() %>%
mutate(max_a = max(c(height, mass, birth_year), na.rm = TRUE)) %>%
relocate(max_a) %>%
ungroup() %>%
select(name, where(is.numeric)) %>%
filter(max_a == max(max_a))
```
# A tibble: 1 × 5
name max_a height mass birth_year
<chr> <dbl> <int> <dbl> <dbl>
1 Jabba Desilijic Tiure 1358 175 1358 600
# A tibble: 1 × 5
name max_a height mass birth_year
<chr> <dbl> <int> <dbl> <dbl>
1 Jabba Desilijic Tiure 1358 175 1358 600
11 References
Statistical programming in R: https://american-stat-412612.netlify.app/
Column-wise operations: https://dplyr.tidyverse.org/articles/colwise.html
Row-wise operations: https://dplyr.tidyverse.org/articles/rowwise.html