M06: Advanced Wrangling with dplyr

Author

Jae Jung

Published

March 11, 2025

1 Topics

  • joins relational databases
  • Joining data frames when there are no unique key columns
  • Operations across columns with across()
  • Replace complex if else statements with case_when()
  • Add row names to a column with rownames_to_columns()
  • Subset distinct/unique rows with distinct()
  • row-wise calculation with rowwise()

2 Packages

```{r}
library(tidyverse)
```

3 Rankings

3.1 Integer ranking functions

Note
  • Description: Three ranking functions inspired by SQL2003. They differ primarily in how they handle ties:

    • row_number() gives every input a unique rank, so that c(10, 20, 20, 30) would get ranks c(1, 2, 3, 4). It’s equivalent to rank(ties.method = “first”).

    • min_rank() gives every tie the same (smallest) value so that c(10, 20, 20, 30) gets ranks c(1, 2, 2, 4). It’s the way that ranks are usually computed in sports and is equivalent to rank(ties.method = “min”).

    • dense_rank() works like min_rank(), but doesn’t leave any gaps, so that c(10, 20, 20, 30) gets ranks c(1, 2, 2, 3).

```{r}
x <- c(5, 1, 3, 2, 2, NA)
row_number(x)
min_rank(x)
dense_rank(x)

# Example
df <- data.frame(
  grp = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  x = c(3, 2, 1, 1, 2, 2, 1, 1, 1),
  y = c(1, 3, 2, 3, 2, 2, 4, 1, 2),
  id = 1:9
)

df |> 
  group_by(grp) |> 
  mutate(grp_id = row_number())
```
[1]  5  1  4  2  3 NA
[1]  5  1  4  2  2 NA
[1]  4  1  3  2  2 NA
# A tibble: 9 × 5
# Groups:   grp [3]
    grp     x     y    id grp_id
  <dbl> <dbl> <dbl> <int>  <int>
1     1     3     1     1      1
2     1     2     3     2      2
3     1     1     2     3      3
4     2     1     3     4      1
5     2     2     2     5      2
6     2     2     2     6      3
7     3     1     4     7      1
8     3     1     1     8      2
9     3     1     2     9      3

3.2 Proportional ranking functions

Note
  • Description: These two ranking functions implement two slightly different ways to compute a percentile. For each x_i in x:

    • cume_dist(x) counts the total number of values less than or equal to x_i, and divides it by the number of observations.

    • percent_rank(x) counts the total number of values less than x_i, and divides it by the number of observations minus 1.

    • In both cases, missing values are ignored when counting the number of observations.

```{r}
# Example
x <- c(5, 1, 3, 2, 2)

cume_dist(x) # is the same as below
sapply(x, function(xi) sum(x <= xi) / length(x))

percent_rank(x) # is the same as below
sapply(x, function(xi) sum(x < xi)  / (length(x) - 1))
```
[1] 1.0 0.2 0.8 0.6 0.6
[1] 1.0 0.2 0.8 0.6 0.6
[1] 1.00 0.00 0.75 0.25 0.25
[1] 1.00 0.00 0.75 0.25 0.25

3.3 Bucket a numeric vector into n groups

Note
  • Description: ntile() is a sort of very rough rank, which breaks the input vector into n buckets.
    • If length(x) is not an integer multiple of n, the size of the buckets will differ by up to one, with larger buckets coming first.

    • Unlike other ranking functions, ntile() ignores ties: it will create evenly sized buckets even if the same value of x ends up in different buckets.

    • Usage: ntile(x = row_number(), n)

```{r}
x <- c(5, 1, 3, 2, 2, NA)
ntile(x, 2)
ntile(x, 4)

# If the bucket sizes are uneven, the larger buckets come first
ntile(1:8, 3)

# Ties are ignored
ntile(rep(1, 8), 3)
```
[1]  2  1  2  1  1 NA
[1]  4  1  3  1  2 NA
[1] 1 1 1 2 2 2 3 3
[1] 1 1 1 2 2 2 3 3

3.4 row_number() / between()

```{r}
# row_number can be used with single table verbs without specifying x
# (for data frames and databases that support windowing)

mtcars <- mtcars |> 
  as_tibble()

mutate(mtcars, row_number() == 1L)
mutate(mtcars, row_number() == 2L)
mutate(mtcars, row_number() == 3)

mtcars |> row_number()

# 
between(1:12, 7, 9) # inclusive

x <- rnorm(1e2) # rnorm(n, mean = 0, sd = 1)
x[between(x, -1, 1)]

## Or on a tibble using filter
filter(starwars, between(height, 100, 150))

mtcars %>% head()
mtcars %>% filter(between(row_number(), 3, 6))

mtcars %>% filter(between(mpg, 20, 25))
mtcars %>% filter(mpg >= 20, mpg <= 25)

mutate(mtcars, rank_hp = row_number(hp)) %>% 
  arrange(hp)

mtcars %>% select(mpg) %>% head() %>% 
  row_number()
```
# A tibble: 32 × 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows
# ℹ 1 more variable: `row_number() == 1L` <lgl>
# A tibble: 32 × 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows
# ℹ 1 more variable: `row_number() == 2L` <lgl>
# A tibble: 32 × 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows
# ℹ 1 more variable: `row_number() == 3` <lgl>
 [1] 19 20 24 22 15 14  4 26 25 16 13 11 12  7  2  1  5 31 29 32 23  9  8  3 17
[26] 28 27 30 10 18  6 21
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
 [1] -0.048653395  0.884879882 -0.141319916 -0.967106045  0.315316490
 [6] -0.705965651 -0.738639638 -0.462475899  0.579030499  0.112718486
[11]  0.130504674  0.297763793  0.594838996  0.724857853  0.054272230
[16] -0.101223600  0.845028750 -0.666466510 -0.079490219  0.305178391
[21] -0.149424746  0.004828794 -0.263878257 -0.100063144  0.804041285
[26] -0.899508645 -0.001009739  0.760263085  0.748396156 -0.256134958
[31] -0.560925902  0.611697256 -0.352491415 -0.764351651 -0.206370904
[36] -0.674367653  0.286305575  0.344326412  0.486405335 -0.578689798
[41] -0.354978498  0.085198582 -0.347749444 -0.373557294 -0.040152943
[46] -0.898820781 -0.564838799  0.432060270 -0.066068734 -0.575944089
[51] -0.368287162 -0.229399787  0.532786143  0.160260523  0.234379606
[56]  0.030328456 -0.655526257  0.869484763 -0.540088350  0.276284189
[61] -0.580589202
# A tibble: 5 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
2 Mon Moth…    150    NA auburn     fair       blue              48 fema… femin…
3 Watto        137    NA black      blue, grey yellow            NA male  mascu…
4 Sebulba      112    40 none       grey, red  orange            NA male  mascu…
5 Gasgano      122    NA none       white, bl… black             NA male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
# A tibble: 6 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
# A tibble: 4 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
2  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
3  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
4  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
# A tibble: 8 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
5  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
6  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
7  21.5     4  120.    97  3.7   2.46  20.0     1     0     3     1
8  21.4     4  121    109  4.11  2.78  18.6     1     1     4     2
# A tibble: 8 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
5  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
6  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
7  21.5     4  120.    97  3.7   2.46  20.0     1     0     3     1
8  21.4     4  121    109  4.11  2.78  18.6     1     1     4     2
# A tibble: 32 × 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb rank_hp
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <int>
 1  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2       1
 2  24.4     4 147.     62  3.69  3.19  20       1     0     4     2       2
 3  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1       3
 4  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1       4
 5  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1       5
 6  26       4 120.     91  4.43  2.14  16.7     0     1     5     2       6
 7  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1       7
 8  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2       8
 9  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1       9
10  18.1     6 225     105  2.76  3.46  20.2     1     0     3     1      10
# ℹ 22 more rows
[1] 3 4 6 5 2 1

4 Relational Data and Joins

To work with relational data you need verbs that work with pairs of tables. There are three families of verbs designed to work with relational data:

  • Mutating joins
    • add new variables to one data frame from matching observations in another.
    • Types
      • inner_join(),
      • left_join(),
      • right_join(),
      • full_join(),
  • Filtering joins
    • filter observations from one data frame based on whether or not they match an observation in the other table.
    • semi_join(),
    • anti_join().
  • Set operations
    • treat observations as if they were set elements.

4.1 Data

```{r}
library(tidyverse)
library(nycflights13) #nycflights13 contains four tibbles that are related to the flights table that you used in data transformation:
flights
airlines #the full carrier name from its abbreviated code:
airports #information about each airport, identified by the faa airport code:
planes #information about each plane, identified by its tailnum:
weather #the weather at each NYC airport for each hour:
```
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 16 × 2
   carrier name                       
   <chr>   <chr>                      
 1 9E      Endeavor Air Inc.          
 2 AA      American Airlines Inc.     
 3 AS      Alaska Airlines Inc.       
 4 B6      JetBlue Airways            
 5 DL      Delta Air Lines Inc.       
 6 EV      ExpressJet Airlines Inc.   
 7 F9      Frontier Airlines Inc.     
 8 FL      AirTran Airways Corporation
 9 HA      Hawaiian Airlines Inc.     
10 MQ      Envoy Air                  
11 OO      SkyWest Airlines Inc.      
12 UA      United Air Lines Inc.      
13 US      US Airways Inc.            
14 VX      Virgin America             
15 WN      Southwest Airlines Co.     
16 YV      Mesa Airlines Inc.         
# A tibble: 1,458 × 8
   faa   name                             lat    lon   alt    tz dst   tzone    
   <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
 1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
 2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
 3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
 4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
 5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
 6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
 7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
 8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
 9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
# ℹ 1,448 more rows
# A tibble: 3,322 × 9
   tailnum  year type              manufacturer model engines seats speed engine
   <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
 1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
# ℹ 3,312 more rows
# A tibble: 26,115 × 15
   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
 1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
 2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
 3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
 4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
 5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
 6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
 7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
 8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
 9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
# ℹ 26,105 more rows
# ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#   visib <dbl>, time_hour <dttm>

4.2 Keys (primary vs. foreign)

Note
  • The variables used to connect each pair of tables are called keys. A key is a variable (or set of variables) that uniquely identifies an observation.
  • In simple cases, a single variable is sufficient to identify an observation. - For example, each plane is uniquely identified by its tailnum.
  • In other cases, multiple variables may be needed.
    • For example, to identify an observation in weather you need five variables: year, month, day, hour, and origin.
  • A primary key uniquely identifies an observation in its own table. For example, planes$tailnum is a primary key because it uniquely identifies each plane in the planes table.

  • A foreign key uniquely identifies an observation in another table. For example, flights$tailnum is a foreign key because it appears in the flights table where it matches each flight to a unique plane.

```{r}
# plane$tailnum is primary key
planes %>% 
  count(tailnum) %>% 
  filter(n > 1) # confirm there is no duplicates; the same can be accomplished with the following operation.

planes %>% 
  select(tailnum) %>% 
  duplicated() %>% 
  max()

# flights$tailnum is a foreign key
flights %>% 
  count(tailnum) %>% 
  filter(n > 1) |> 
  arrange(-n)

weather %>% 
  count(year, month, day, hour, origin) %>% 
  filter(n > 1) # the output confirms that there are some duplicates. Thus, it is not a primary key.

# airports$faa can be a primary key
airports %>% 
  count(faa) %>% # faa: airport code = origin
  filter(n > 1) # confirms that there is no duplicates

# Sometimes a table doesn't have an explicit primary key: each row is an observation, but no combination of variables reliably identifies it. 
# For example, what's the primary key in the flights table? You might think it would be the date plus the flight or tail number, but neither of those are unique:
flights %>% 
  count(year, month, day, flight) %>% 
  filter(n > 1)

flights %>% 
  count(year, month, day, tailnum) %>% 
  filter(n > 1)
```
# A tibble: 0 × 2
# ℹ 2 variables: tailnum <chr>, n <int>
[1] 0
# A tibble: 3,873 × 2
   tailnum     n
   <chr>   <int>
 1 <NA>     2512
 2 N725MQ    575
 3 N722MQ    513
 4 N723MQ    507
 5 N711MQ    486
 6 N713MQ    483
 7 N258JB    427
 8 N298JB    407
 9 N353JB    404
10 N351JB    402
# ℹ 3,863 more rows
# A tibble: 3 × 6
   year month   day  hour origin     n
  <int> <int> <int> <int> <chr>  <int>
1  2013    11     3     1 EWR        2
2  2013    11     3     1 JFK        2
3  2013    11     3     1 LGA        2
# A tibble: 0 × 2
# ℹ 2 variables: faa <chr>, n <int>
# A tibble: 29,768 × 5
    year month   day flight     n
   <int> <int> <int>  <int> <int>
 1  2013     1     1      1     2
 2  2013     1     1      3     2
 3  2013     1     1      4     2
 4  2013     1     1     11     3
 5  2013     1     1     15     2
 6  2013     1     1     21     2
 7  2013     1     1     27     4
 8  2013     1     1     31     2
 9  2013     1     1     32     2
10  2013     1     1     35     2
# ℹ 29,758 more rows
# A tibble: 64,928 × 5
    year month   day tailnum     n
   <int> <int> <int> <chr>   <int>
 1  2013     1     1 N0EGMQ      2
 2  2013     1     1 N11189      2
 3  2013     1     1 N11536      2
 4  2013     1     1 N11544      3
 5  2013     1     1 N11551      2
 6  2013     1     1 N12540      2
 7  2013     1     1 N12567      2
 8  2013     1     1 N13123      2
 9  2013     1     1 N13538      3
10  2013     1     1 N13566      3
# ℹ 64,918 more rows
Caution

If a table lacks a primary key, it’s sometimes useful to add one with mutate() and row_number(). That makes it easier to match observations if you’ve done some filtering and want to check back in with the original data. This is called a *surrogate key*.

Tip

A primary key and the corresponding foreign key in another table form a relation. Relations are typically one-to-many. For example, each flight has one plane, but each plane has many flights. In other data, you’ll occasionally see a 1-to-1 relationship. You can think of this as a special case of 1-to-many. You can model many-to-many relations with a many-to-1 relation plus a 1-to-many relation. For example, in this data there’s a many-to-many relationship between airlines and airports: each airline flies to many airports; each airport hosts many airlines.

4.3 Mutating Joins

A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other.

```{r}
flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)
flights2

flights2 %>%
  select(-origin, -dest) %>% 
  left_join(airlines, by = "carrier") # this is the same as below.

flights2 %>%
  select(-origin, -dest) %>% 
  mutate(name = airlines$name[match(carrier, airlines$carrier)])
```
# A tibble: 336,776 × 8
    year month   day  hour origin dest  tailnum carrier
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>  
 1  2013     1     1     5 EWR    IAH   N14228  UA     
 2  2013     1     1     5 LGA    IAH   N24211  UA     
 3  2013     1     1     5 JFK    MIA   N619AA  AA     
 4  2013     1     1     5 JFK    BQN   N804JB  B6     
 5  2013     1     1     6 LGA    ATL   N668DN  DL     
 6  2013     1     1     5 EWR    ORD   N39463  UA     
 7  2013     1     1     6 EWR    FLL   N516JB  B6     
 8  2013     1     1     6 LGA    IAD   N829AS  EV     
 9  2013     1     1     6 JFK    MCO   N593JB  B6     
10  2013     1     1     6 LGA    ORD   N3ALAA  AA     
# ℹ 336,766 more rows
# A tibble: 336,776 × 7
    year month   day  hour tailnum carrier name                    
   <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
 1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
 2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
 3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
 4  2013     1     1     5 N804JB  B6      JetBlue Airways         
 5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
 6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
 7  2013     1     1     6 N516JB  B6      JetBlue Airways         
 8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
 9  2013     1     1     6 N593JB  B6      JetBlue Airways         
10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
# ℹ 336,766 more rows
# A tibble: 336,776 × 7
    year month   day  hour tailnum carrier name                    
   <int> <int> <int> <dbl> <chr>   <chr>   <chr>                   
 1  2013     1     1     5 N14228  UA      United Air Lines Inc.   
 2  2013     1     1     5 N24211  UA      United Air Lines Inc.   
 3  2013     1     1     5 N619AA  AA      American Airlines Inc.  
 4  2013     1     1     5 N804JB  B6      JetBlue Airways         
 5  2013     1     1     6 N668DN  DL      Delta Air Lines Inc.    
 6  2013     1     1     5 N39463  UA      United Air Lines Inc.   
 7  2013     1     1     6 N516JB  B6      JetBlue Airways         
 8  2013     1     1     6 N829AS  EV      ExpressJet Airlines Inc.
 9  2013     1     1     6 N593JB  B6      JetBlue Airways         
10  2013     1     1     6 N3ALAA  AA      American Airlines Inc.  
# ℹ 336,766 more rows
  • Data set for illustration
```{r}
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)
```

4.3.1 Inner Joins

  • An inner join keeps observations that appear in both tables.
  • The most important property of an inner join is that unmatched rows are not included in the result.
  • Inner joins are usually not appropriate for use in analysis because it’s too easy to lose observations.
```{r}
x %>% 
  inner_join(y, by = "key")
```
# A tibble: 2 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   

4.3.2 Outer joins

An outer join keeps observations that appear in at least one of the tables. There are three types of outer joins:

4.3.2.1 Left joins

  • A left join keeps all observations in x.
  • The most commonly used join is the left join:
  • you use this whenever you look up additional data from another table, because it preserves the original observations even when there isn’t a match.
  • The left join should be your default join: use it unless you have a strong reason to prefer one of the others.
```{r}
x
y

x %>% 
  left_join(y, by = "key" )
```
# A tibble: 3 × 2
    key val_x
  <dbl> <chr>
1     1 x1   
2     2 x2   
3     3 x3   
# A tibble: 3 × 2
    key val_y
  <dbl> <chr>
1     1 y1   
2     2 y2   
3     4 y3   
# A tibble: 3 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 

4.3.2.2 Right joins

  • A right join keeps all observations in y.
```{r}
x %>% 
  right_join(y, by = "key" )
```
# A tibble: 3 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     4 <NA>  y3   

4.3.2.3 Full joins

  • A full join keeps all observations in x and y.
```{r}
x %>% 
  full_join(y, by = "key" )
```
# A tibble: 4 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     3 x3    <NA> 
4     4 <NA>  y3   

4.3.3 Duplicate keys

What will happen when keys are not unique?

4.3.3.1 One table has duplicate keys.

This is useful when you want to add in additional information as there is typically a one-to-many relationship

```{r}
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     1, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2"
)
left_join(x, y, by = "key")
```
# A tibble: 4 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     2 x3    y2   
4     1 x4    y1   

4.3.3.2 Both tables have duplicate keys.

This is usually an error because in neither table do the keys uniquely identify an observation. When you join duplicated keys, you get all possible combinations, the Cartesian product:

```{r}
x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     2, "x3",
     3, "x4"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     2, "y3",
     3, "y4"
)
left_join(x, y, by = "key", relationship = "many-to-many")
```
# A tibble: 6 × 3
    key val_x val_y
  <dbl> <chr> <chr>
1     1 x1    y1   
2     2 x2    y2   
3     2 x2    y3   
4     2 x3    y2   
5     2 x3    y3   
6     3 x4    y4   

4.3.4 Defining the key columns

4.3.4.1 Default mode

The default, by = NULL, uses all variables that appear in both tables, the so called natural join.

```{r}
flights2 %>% 
  left_join(weather)
```
# A tibble: 336,776 × 18
    year month   day  hour origin dest  tailnum carrier  temp  dewp humid
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>
 1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4
 2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8
 3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6
 4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6
 5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8
 6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4
 7  2013     1     1     6 EWR    FLL   N516JB  B6       37.9  28.0  67.2
 8  2013     1     1     6 LGA    IAD   N829AS  EV       39.9  25.0  54.8
 9  2013     1     1     6 JFK    MCO   N593JB  B6       37.9  27.0  64.3
10  2013     1     1     6 LGA    ORD   N3ALAA  AA       39.9  25.0  54.8
# ℹ 336,766 more rows
# ℹ 7 more variables: wind_dir <dbl>, wind_speed <dbl>, wind_gust <dbl>,
#   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

4.3.4.2 A character vector, by = “x”

```{r}
flights2 %>% 
  left_join(planes, by = "tailnum") # specify the variable that is common to use.
```
# A tibble: 336,776 × 16
   year.x month   day  hour origin dest  tailnum carrier year.y type            
    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>           
 1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wing mult…
 2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wing mult…
 3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wing mult…
 4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wing mult…
 5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wing mult…
 6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wing mult…
 7   2013     1     1     6 EWR    FLL   N516JB  B6        2000 Fixed wing mult…
 8   2013     1     1     6 LGA    IAD   N829AS  EV        1998 Fixed wing mult…
 9   2013     1     1     6 JFK    MCO   N593JB  B6        2004 Fixed wing mult…
10   2013     1     1     6 LGA    ORD   N3ALAA  AA          NA <NA>            
# ℹ 336,766 more rows
# ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
#   seats <int>, speed <int>, engine <chr>

4.3.4.3 A named character vector: by = c(“a” = “b”)

This will match variable a in table x to variable b in table y. The variables from x will be used in the output.

```{r}
flights2 %>% 
  left_join(airports, c("dest" = "faa"))

flights2 %>% 
  left_join(airports, c("origin" = "faa"))
```
# A tibble: 336,776 × 15
    year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
 1  2013     1     1     5 EWR    IAH   N14228  UA      Georg…  30.0 -95.3    97
 2  2013     1     1     5 LGA    IAH   N24211  UA      Georg…  30.0 -95.3    97
 3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami…  25.8 -80.3     8
 4  2013     1     1     5 JFK    BQN   N804JB  B6      <NA>    NA    NA      NA
 5  2013     1     1     6 LGA    ATL   N668DN  DL      Harts…  33.6 -84.4  1026
 6  2013     1     1     5 EWR    ORD   N39463  UA      Chica…  42.0 -87.9   668
 7  2013     1     1     6 EWR    FLL   N516JB  B6      Fort …  26.1 -80.2     9
 8  2013     1     1     6 LGA    IAD   N829AS  EV      Washi…  38.9 -77.5   313
 9  2013     1     1     6 JFK    MCO   N593JB  B6      Orlan…  28.4 -81.3    96
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      Chica…  42.0 -87.9   668
# ℹ 336,766 more rows
# ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>
# A tibble: 336,776 × 15
    year month   day  hour origin dest  tailnum carrier name     lat   lon   alt
   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>  <dbl> <dbl> <dbl>
 1  2013     1     1     5 EWR    IAH   N14228  UA      Newar…  40.7 -74.2    18
 2  2013     1     1     5 LGA    IAH   N24211  UA      La Gu…  40.8 -73.9    22
 3  2013     1     1     5 JFK    MIA   N619AA  AA      John …  40.6 -73.8    13
 4  2013     1     1     5 JFK    BQN   N804JB  B6      John …  40.6 -73.8    13
 5  2013     1     1     6 LGA    ATL   N668DN  DL      La Gu…  40.8 -73.9    22
 6  2013     1     1     5 EWR    ORD   N39463  UA      Newar…  40.7 -74.2    18
 7  2013     1     1     6 EWR    FLL   N516JB  B6      Newar…  40.7 -74.2    18
 8  2013     1     1     6 LGA    IAD   N829AS  EV      La Gu…  40.8 -73.9    22
 9  2013     1     1     6 JFK    MCO   N593JB  B6      John …  40.6 -73.8    13
10  2013     1     1     6 LGA    ORD   N3ALAA  AA      La Gu…  40.8 -73.9    22
# ℹ 336,766 more rows
# ℹ 3 more variables: tz <dbl>, dst <chr>, tzone <chr>

4.3.5 base::merge()

  • can perform all four types of mutating join:
merge(x, y, ...)

## Default S3 method:
merge(x, y, ...)

merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
      incomparables = NULL, ...)

4.4 Filtering joins

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:

  • semi_join(x, y) keeps all observations in x that have a match in y.
  • anti_join(x, y) drops all observations in x that have a match in y.
Important
  • Only the existence of a match is important; it doesn’t matter which observation is matched. This means that filtering joins never duplicate rows like mutating joins do:

4.4.1 Semi joins

Semi-joins are useful for matching filtered summary tables back to the original rows.

  • only keeps the rows in x that have a match in y:
```{r}
#For example, imagine you've found the top ten most popular destinations:

top_dest <- flights %>%
  count(dest, sort = TRUE) %>%
  head(10)
top_dest

# Now you want to find each flight that went to one of those destinations. You could construct a filter yourself:

flights %>% 
  filter(dest %in% top_dest$dest) |> 
  select(dest)

# Using semi join, we can achieve the above filtering.
flights %>% 
  semi_join(top_dest)
```
# A tibble: 10 × 2
   dest      n
   <chr> <int>
 1 ORD   17283
 2 ATL   17215
 3 LAX   16174
 4 BOS   15508
 5 MCO   14082
 6 CLT   14064
 7 SFO   13331
 8 FLL   12055
 9 MIA   11728
10 DCA    9705
# A tibble: 141,145 × 1
   dest 
   <chr>
 1 MIA  
 2 ATL  
 3 ORD  
 4 FLL  
 5 MCO  
 6 ORD  
 7 LAX  
 8 SFO  
 9 BOS  
10 FLL  
# ℹ 141,135 more rows
# A tibble: 141,145 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      542            540         2      923            850
 2  2013     1     1      554            600        -6      812            837
 3  2013     1     1      554            558        -4      740            728
 4  2013     1     1      555            600        -5      913            854
 5  2013     1     1      557            600        -3      838            846
 6  2013     1     1      558            600        -2      753            745
 7  2013     1     1      558            600        -2      924            917
 8  2013     1     1      558            600        -2      923            937
 9  2013     1     1      559            559         0      702            706
10  2013     1     1      600            600         0      851            858
# ℹ 141,135 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

4.4.2 Anti joins

  • The inverse of a semi-join is an anti-join.
  • An anti-join keeps the rows that don’t have a match:
  • Anti-joins are useful for diagnosing join mismatches.
```{r}
flights %>%
  anti_join(planes, by = "tailnum") %>%
  count(tailnum, sort = TRUE)
```
# A tibble: 722 × 2
   tailnum     n
   <chr>   <int>
 1 <NA>     2512
 2 N725MQ    575
 3 N722MQ    513
 4 N723MQ    507
 5 N713MQ    483
 6 N735MQ    396
 7 N0EGMQ    371
 8 N534MQ    364
 9 N542MQ    363
10 N531MQ    349
# ℹ 712 more rows

4.5 Join problems

Tip
  • Expect to have some problems when gathering data in your own.
  • Take some measures to minimize the problems

4.5.1 (1) Identify primary key

  • Start by identifying the variables that form the primary key in each table. You should usually do this based on your understanding of the data, not empirically by looking for a combination of variables that give a unique identifier.
  • If you just look for variables without thinking about what they mean, you might get (un)lucky and find a combination that’s unique in your current data but the relationship might not be true in general.
```{r}
# For example, the altitude and longitude uniquely identify each airport, but they are not good identifiers!
airports %>% count(alt, lon) %>% filter(n > 1)
```
# A tibble: 0 × 3
# ℹ 3 variables: alt <dbl>, lon <dbl>, n <int>

4.5.2 (2) Check that none of the variables in the primary key are missing.

If a value is missing then it can’t identify an observation!

4.5.3 (3) Check that your foreign keys match primary keys in another table.

  • The best way to do this is with an anti_join(). It’s common for keys not to match because of data entry errors. Fixing these is often a lot of work.

  • If you do have missing keys, you’ll need to be thoughtful about your use of inner vs. outer joins, carefully considering whether or not you want to drop rows that don’t have a match.

  • Be aware that simply checking the number of rows before and after the join is not sufficient to ensure that your join has gone smoothly.

4.6 Set operations

  • Useful when you want to break a single complex filter into simpler pieces.
  • All these operations work with a complete row, comparing the values of every variable.
  • These expect the x and y inputs to have the same variables, and treat the observations like sets:
```{r}
df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)
```

4.6.1 intersect()

intersect(x, y): return only observations in both x and y.

```{r}
df1
df2

intersect(df1, df2)
```
# A tibble: 2 × 2
      x     y
  <dbl> <dbl>
1     1     1
2     2     1
# A tibble: 2 × 2
      x     y
  <dbl> <dbl>
1     1     1
2     1     2
# A tibble: 1 × 2
      x     y
  <dbl> <dbl>
1     1     1

4.6.2 union()

union(x, y): return unique observations in x and y.

```{r}
# Note that we get 3 rows, not 4
union(df1, df2)
```
# A tibble: 3 × 2
      x     y
  <dbl> <dbl>
1     1     1
2     2     1
3     1     2

4.6.3 setdiff()

setdiff(x, y): return observations in x, but not in y.

```{r}
df1
df2

setdiff(df1, df2)
setdiff(df2, df1)
```
# A tibble: 2 × 2
      x     y
  <dbl> <dbl>
1     1     1
2     2     1
# A tibble: 2 × 2
      x     y
  <dbl> <dbl>
1     1     1
2     1     2
# A tibble: 1 × 2
      x     y
  <dbl> <dbl>
1     2     1
# A tibble: 1 × 2
      x     y
  <dbl> <dbl>
1     1     2

5 Joining data frame when there are no unique key columns

  • Situation: Neither data frame has a unique key column. The closest equivalent of the key column is the dates variable of monthly data. Each df has multiple entries per month, so the dates column has lots of duplicates.

5.1 Data

```{r}
df_1 <- tibble(
  dates = c(as.Date("2018-07-01"), as.Date("2018-06-01"), as.Date("2018-06-01"), as.Date("2018-05-01")),
  x1 = c(10L, 11L, 12L, 13L),
  text1 = c("text a", "text b", "text c", "text d")
)

df_2 <- tibble(
  dates = c(as.Date("2018-07-01"), as.Date("2018-06-01"), as.Date("2018-05-01"), as.Date("2018-04-01")),
  x2 = c(20L, 21L, 22L, 23L),
  text2 = c("text aa", "text bb", "text cc", "text dd")
)
  
df_1
df_2
```
# A tibble: 4 × 3
  dates         x1 text1 
  <date>     <int> <chr> 
1 2018-07-01    10 text a
2 2018-06-01    11 text b
3 2018-06-01    12 text c
4 2018-05-01    13 text d
# A tibble: 4 × 3
  dates         x2 text2  
  <date>     <int> <chr>  
1 2018-07-01    20 text aa
2 2018-06-01    21 text bb
3 2018-05-01    22 text cc
4 2018-04-01    23 text dd

5.2 left_join()

```{r}
df_1 %>% 
  left_join(df_2) # duplicate problem

# solution
left_join(df_1 %>% group_by(dates) %>% mutate(id = row_number()),
          df_2 %>% group_by(dates) %>% mutate(id = row_number()), 
          by = c("dates", "id"))

df_1 %>% 
  group_by(dates) %>% 
  mutate(id = row_number()) %>% 
  left_join(df_2 %>% group_by(dates) %>% mutate(id = row_number()),
            by = c("dates", "id"))

# even though I didn't give "by" argument, I get the identical outcome.
left_join(df_1 %>% group_by(dates) %>% mutate(id = row_number()),
          df_2 %>% group_by(dates) %>% mutate(id = row_number()))
```
# A tibble: 4 × 5
  dates         x1 text1     x2 text2  
  <date>     <int> <chr>  <int> <chr>  
1 2018-07-01    10 text a    20 text aa
2 2018-06-01    11 text b    21 text bb
3 2018-06-01    12 text c    21 text bb
4 2018-05-01    13 text d    22 text cc
# A tibble: 4 × 6
# Groups:   dates [3]
  dates         x1 text1     id    x2 text2  
  <date>     <int> <chr>  <int> <int> <chr>  
1 2018-07-01    10 text a     1    20 text aa
2 2018-06-01    11 text b     1    21 text bb
3 2018-06-01    12 text c     2    NA <NA>   
4 2018-05-01    13 text d     1    22 text cc
# A tibble: 4 × 6
# Groups:   dates [3]
  dates         x1 text1     id    x2 text2  
  <date>     <int> <chr>  <int> <int> <chr>  
1 2018-07-01    10 text a     1    20 text aa
2 2018-06-01    11 text b     1    21 text bb
3 2018-06-01    12 text c     2    NA <NA>   
4 2018-05-01    13 text d     1    22 text cc
# A tibble: 4 × 6
# Groups:   dates [3]
  dates         x1 text1     id    x2 text2  
  <date>     <int> <chr>  <int> <int> <chr>  
1 2018-07-01    10 text a     1    20 text aa
2 2018-06-01    11 text b     1    21 text bb
3 2018-06-01    12 text c     2    NA <NA>   
4 2018-05-01    13 text d     1    22 text cc

5.3 full_join()

```{r}
df_1 %>% 
  full_join(df_2) # duplicate problem

full_join(df_1 %>% group_by(dates) %>% mutate(id = row_number()),
          df_2 %>% group_by(dates) %>% mutate(id = row_number()), 
          by = c("dates", "id"))

full_join(df_1 %>% group_by(dates) %>% mutate(id = row_number()),
          df_2 %>% group_by(dates) %>% mutate(id = row_number()))
```
# A tibble: 5 × 5
  dates         x1 text1     x2 text2  
  <date>     <int> <chr>  <int> <chr>  
1 2018-07-01    10 text a    20 text aa
2 2018-06-01    11 text b    21 text bb
3 2018-06-01    12 text c    21 text bb
4 2018-05-01    13 text d    22 text cc
5 2018-04-01    NA <NA>      23 text dd
# A tibble: 5 × 6
# Groups:   dates [4]
  dates         x1 text1     id    x2 text2  
  <date>     <int> <chr>  <int> <int> <chr>  
1 2018-07-01    10 text a     1    20 text aa
2 2018-06-01    11 text b     1    21 text bb
3 2018-06-01    12 text c     2    NA <NA>   
4 2018-05-01    13 text d     1    22 text cc
5 2018-04-01    NA <NA>       1    23 text dd
# A tibble: 5 × 6
# Groups:   dates [4]
  dates         x1 text1     id    x2 text2  
  <date>     <int> <chr>  <int> <int> <chr>  
1 2018-07-01    10 text a     1    20 text aa
2 2018-06-01    11 text b     1    21 text bb
3 2018-06-01    12 text c     2    NA <NA>   
4 2018-05-01    13 text d     1    22 text cc
5 2018-04-01    NA <NA>       1    23 text dd

6 Column-wise operations with across()

6.1 Data

```{r}
#data() # list all data available
data(starwars) # Loads specified data sets
starwars # print the data set
?starwars # review starwars characters dataset from dplyr
glimpse(starwars)
```
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…

6.2 Introduction

Column-wise operations with across()

  • dplyr has had more capability for functions for operating on columns than on rows but it was not always convenient.
  • If you want to perform the same operation on multiple columns, copying and pasting could be tedious and error prone:
```{r}
df %>% 
      group_by(g1, g2) %>% 
      summarise(a = mean(a), b = mean(b), c = mean(c), d = mean(d)) # the same as below
```
  • We can now use the across() function to write this kind of operation more succinctly and transparently:

across syntax

```{r}
df %>% 
      group_by(g1, g2) %>% 
      summarise(across(a:d, mean))
```
  • across() provides new functionality while replacing older functions such as mutate_if() or mutate_at()

6.3 Basic Usage

  • across() is only used inside other functions, e.g., summarize() or mutate()

  • across() doesn’t work with select() or rename() because they already use tidy select syntax;

  • Like group_by() and rowwise(), it does not change the data itself but changes how other functions operate on the data.

  • across() has two primary arguments:

  • .cols = selects the columns you want to manipulate (notice the period at the beginning).

  • It uses tidy selection (like select()) so you can pick variables by position, name, and type.

  • The default is .cols = everything() so all columns are selected

  • .fns = , is a function (or a list of functions) you want to apply to each column (.again, note the period at the beginning of the argument name)

    • Examples: .fns = mean or .fns = max
    • This can also be a purrr style formula like ~ .x / 2.
    • The .x is the a pronoun for the columns that get passed to the function
    • This argument is optional so you can omit it to leave the data untransformed

6.3.1 Using across() with summarize()

  • Here are a couple of examples of across() in conjunction with its favorite verb, summarize().
  • We use where()inside the across() to select only those columns of the desired type
  • .fns= can take a single function
  • You can add additional arguments to be passed to the function, e.g., na.rm = TRUE.
```{r}
skimr::skim(starwars)

starwars %>% 
  summarize(height_mean = mean(height, na.rm = TRUE),
            mass_mean = mean(mass, na.rm = TRUE)) # too much typing

starwars %>% 
  summarize(across(.cols = height:mass, .fns = mean, na.rm = TRUE)) # this is better
```
Data summary
Name starwars
Number of rows 87
Number of columns 14
_______________________
Column type frequency:
character 8
list 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1.00 3 21 0 87 0
hair_color 5 0.94 4 13 0 11 0
skin_color 0 1.00 3 19 0 31 0
eye_color 0 1.00 3 13 0 15 0
sex 4 0.95 4 14 0 4 0
gender 4 0.95 8 9 0 2 0
homeworld 10 0.89 4 14 0 48 0
species 4 0.95 3 14 0 37 0

Variable type: list

skim_variable n_missing complete_rate n_unique min_length max_length
films 0 1 24 1 7
vehicles 0 1 11 0 2
starships 0 1 16 0 5

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
height 6 0.93 174.60 34.77 66 167.0 180 191.0 264 ▂▁▇▅▁
mass 28 0.68 97.31 169.46 15 55.6 79 84.5 1358 ▇▁▁▁▁
birth_year 44 0.49 87.57 154.69 8 35.0 52 72.0 896 ▇▁▁▁▁
# A tibble: 1 × 2
  height_mean mass_mean
        <dbl>     <dbl>
1        175.      97.3
# A tibble: 1 × 2
  height  mass
   <dbl> <dbl>
1   175.  97.3

6.3.2 Using summarize()/where(is.numeric) or where(is.character)

  • where() is from {tidyselect}
  • This selection helper selects the variables for which a function returns TRUE.
  • is.numeric() is from base R
  • is.numeric is a more general test of an object being interpretable as numbers.

6.3.2.1 Numeric vectors

```{r}
starwars %>% 
  summarise(across(where(is.numeric), 
                   .fns = median)) # NA's because of missing values
starwars %>% 
  summarise(across(.cols = where(is.numeric), #.cols can be omitted
                   .fns = median, # .fns can be omitted
                   na.rm = TRUE))

starwars %>% 
  summarise(across(where(is.numeric), 
                   median,
                   na.rm = TRUE))
```
# A tibble: 1 × 3
  height  mass birth_year
   <int> <dbl>      <dbl>
1     NA    NA         NA
# A tibble: 1 × 3
  height  mass birth_year
   <int> <dbl>      <dbl>
1    180    79         52
# A tibble: 1 × 3
  height  mass birth_year
   <int> <dbl>      <dbl>
1    180    79         52

6.3.2.2 Formula approach

  • The formula approach gives us the ability to combine functions with arguments.
  • Using this, you start with the ~ to signify “as a function of” and then put wherever your column name would normally as a .x.
```{r}
starwars %>% 
          summarise(across(where(is.numeric), min, na.rm = TRUE)) # the same as below

starwars %>% 
          summarise(across(where(is.numeric), 
                           ~ min(.x, na.rm = TRUE)))

starwars %>% 
          summarise(across(where(is.numeric), 
                           ~ median(.x, na.rm = TRUE)))

starwars %>% 
  group_by(homeworld) %>% 
  filter(n() > 1) %>% 
  summarise(across(where(is.numeric),
                   ~ max(.x, na.rm = TRUE))) # same as below

starwars %>% 
  group_by(homeworld) %>% 
  filter(n() > 1) %>% 
  summarise(across(where(is.numeric),
                   max, na.rm = TRUE))
```
# A tibble: 1 × 3
  height  mass birth_year
   <int> <dbl>      <dbl>
1     66    15          8
# A tibble: 1 × 3
  height  mass birth_year
   <int> <dbl>      <dbl>
1     66    15          8
# A tibble: 1 × 3
  height  mass birth_year
   <int> <dbl>      <dbl>
1    180    79         52
# A tibble: 10 × 4
   homeworld height  mass birth_year
   <chr>      <int> <dbl>      <dbl>
 1 Alderaan     191  79         67  
 2 Corellia     180  80         29  
 3 Coruscant    184  50         91  
 4 Kamino       229  88         31.5
 5 Kashyyyk     234 136        200  
 6 Mirial       170  56.2       58  
 7 Naboo        224  85         82  
 8 Ryloth       180  55         48  
 9 Tatooine     202 136        112  
10 <NA>         200 140        896  
# A tibble: 10 × 4
   homeworld height  mass birth_year
   <chr>      <int> <dbl>      <dbl>
 1 Alderaan     191  79         67  
 2 Corellia     180  80         29  
 3 Coruscant    184  50         91  
 4 Kamino       229  88         31.5
 5 Kashyyyk     234 136        200  
 6 Mirial       170  56.2       58  
 7 Naboo        224  85         82  
 8 Ryloth       180  55         48  
 9 Tatooine     202 136        112  
10 <NA>         200 140        896  

6.3.3 Using character vectors: where, is.character, length()

Task: Count unique values for character variables using formula style

  • is.character() is from base R
  • Create or test for objects of type “character”.
```{r}
# find the number of unique values in character variables
summary(starwars) # ugly output due to the fact that there are lists.
class(starwars)
skimr::skim(starwars) # can tell you the answer: 12 unique value, not counting NA

unique(starwars$hair_color) # unique() is a base R function: the same as below

starwars %>% 
  pull(hair_color) %>% 
  unique()


# testing with one variable first
starwars %>% 
  summarize(unique(hair_color)) 

starwars %>% 
  summarize(length(unique(hair_color)) 
            )

# Applying to all variables
starwars %>% 
  summarize(across(where(is.character),
                   ~ length(unique(.x)) 
                   )
            )
```
     name               height           mass          hair_color       
 Length:87          Min.   : 66.0   Min.   :  15.00   Length:87         
 Class :character   1st Qu.:167.0   1st Qu.:  55.60   Class :character  
 Mode  :character   Median :180.0   Median :  79.00   Mode  :character  
                    Mean   :174.6   Mean   :  97.31                     
                    3rd Qu.:191.0   3rd Qu.:  84.50                     
                    Max.   :264.0   Max.   :1358.00                     
                    NA's   :6       NA's   :28                          
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
                                                                        
  skin_color         eye_color           birth_year         sex           
 Length:87          Length:87          Min.   :  8.00   Length:87         
 Class :character   Class :character   1st Qu.: 35.00   Class :character  
 Mode  :character   Mode  :character   Median : 52.00   Mode  :character  
                                       Mean   : 87.57                     
                                       3rd Qu.: 72.00                     
                                       Max.   :896.00                     
                                       NA's   :44                         
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
                                                                          
    gender           homeworld           species         
 Length:87          Length:87          Length:87         
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
                                                         
 films.Length  films.Class  films.Mode
 5          -none-     character      
 6          -none-     character      
 7          -none-     character      
 4          -none-     character      
 5          -none-     character      
 3          -none-     character      
 3          -none-     character      
 1          -none-     character      
 1          -none-     character      
 6          -none-     character      
 3          -none-     character      
 2          -none-     character      
 5          -none-     character      
 4          -none-     character      
 1          -none-     character      
 3          -none-     character      
 3          -none-     character      
 1          -none-     character      
 5          -none-     character      
 5          -none-     character      
 3          -none-     character      
 1          -none-     character      
 1          -none-     character      
 2          -none-     character      
 1          -none-     character      
 2          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 3          -none-     character      
 1          -none-     character      
 3          -none-     character      
 2          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 2          -none-     character      
 1          -none-     character      
 1          -none-     character      
 2          -none-     character      
 1          -none-     character      
 1          -none-     character      
 3          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 3          -none-     character      
 3          -none-     character      
 3          -none-     character      
 2          -none-     character      
 2          -none-     character      
 2          -none-     character      
 1          -none-     character      
 3          -none-     character      
 2          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 2          -none-     character      
 2          -none-     character      
 1          -none-     character      
 1          -none-     character      
 2          -none-     character      
 2          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 2          -none-     character      
 1          -none-     character      
 1          -none-     character      
 2          -none-     character      
 1          -none-     character      
 1          -none-     character      
 2          -none-     character      
 2          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 1          -none-     character      
 vehicles.Length  vehicles.Class  vehicles.Mode
 2          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 1          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 1          -none-     character               
 2          -none-     character               
 0          -none-     character               
 1          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 1          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 1          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 1          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 1          -none-     character               
 0          -none-     character               
 0          -none-     character               
 1          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 1          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 0          -none-     character               
 starships.Length  starships.Class  starships.Mode
 2          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 5          -none-     character                  
 3          -none-     character                  
 0          -none-     character                  
 2          -none-     character                  
 2          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 1          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 3          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
 1          -none-     character                  
 0          -none-     character                  
 0          -none-     character                  
[1] "tbl_df"     "tbl"        "data.frame"
Data summary
Name starwars
Number of rows 87
Number of columns 14
_______________________
Column type frequency:
character 8
list 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1.00 3 21 0 87 0
hair_color 5 0.94 4 13 0 11 0
skin_color 0 1.00 3 19 0 31 0
eye_color 0 1.00 3 13 0 15 0
sex 4 0.95 4 14 0 4 0
gender 4 0.95 8 9 0 2 0
homeworld 10 0.89 4 14 0 48 0
species 4 0.95 3 14 0 37 0

Variable type: list

skim_variable n_missing complete_rate n_unique min_length max_length
films 0 1 24 1 7
vehicles 0 1 11 0 2
starships 0 1 16 0 5

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
height 6 0.93 174.60 34.77 66 167.0 180 191.0 264 ▂▁▇▅▁
mass 28 0.68 97.31 169.46 15 55.6 79 84.5 1358 ▇▁▁▁▁
birth_year 44 0.49 87.57 154.69 8 35.0 52 72.0 896 ▇▁▁▁▁
 [1] "blond"         NA              "none"          "brown"        
 [5] "brown, grey"   "black"         "auburn, white" "auburn, grey" 
 [9] "white"         "grey"          "auburn"        "blonde"       
 [1] "blond"         NA              "none"          "brown"        
 [5] "brown, grey"   "black"         "auburn, white" "auburn, grey" 
 [9] "white"         "grey"          "auburn"        "blonde"       
# A tibble: 12 × 1
   `unique(hair_color)`
   <chr>               
 1 blond               
 2 <NA>                
 3 none                
 4 brown               
 5 brown, grey         
 6 black               
 7 auburn, white       
 8 auburn, grey        
 9 white               
10 grey                
11 auburn              
12 blonde              
# A tibble: 1 × 1
  `length(unique(hair_color))`
                         <int>
1                           12
# A tibble: 1 × 8
   name hair_color skin_color eye_color   sex gender homeworld species
  <int>      <int>      <int>     <int> <int>  <int>     <int>   <int>
1    87         12         31        15     5      3        49      38

6.3.4 Example: multiple summaries

Example with group by species and filter for groups with >1 row and adding a summary that is not inside across to count the rows in each group

```{r}
starwars %>% 
  group_by(species) %>% 
  filter(n() > 1) %>%                   
  summarise(across(c(sex, gender, homeworld), 
                   ~ length(unique(.x))),
            n = n())                   
            

# what happens if we add n = n() first.
# The only difference: n comes after species and before other variables.
starwars %>% 
  group_by(species) %>% 
  filter(n() > 1) %>% 
  summarise(n = n(),
            across(c(sex, gender, homeworld), 
                   ~ length(unique(.x))))

# cf.
starwars %>% 
  filter(n() > 1) %>% 
  count(species, sex, gender, homeworld)
```
# A tibble: 9 × 5
  species    sex gender homeworld     n
  <chr>    <int>  <int>     <int> <int>
1 Droid        1      2         3     6
2 Gungan       1      1         1     3
3 Human        2      2        15    35
4 Kaminoan     2      2         1     2
5 Mirialan     1      1         1     2
6 Twi'lek      2      2         1     2
7 Wookiee      1      1         1     2
8 Zabrak       1      1         2     2
9 <NA>         1      1         3     4
# A tibble: 9 × 5
  species      n   sex gender homeworld
  <chr>    <int> <int>  <int>     <int>
1 Droid        6     1      2         3
2 Gungan       3     1      1         1
3 Human       35     2      2        15
4 Kaminoan     2     2      2         1
5 Mirialan     2     1      1         1
6 Twi'lek      2     2      2         1
7 Wookiee      2     1      1         1
8 Zabrak       2     1      1         2
9 <NA>         4     1      1         3
# A tibble: 65 × 5
   species  sex    gender    homeworld       n
   <chr>    <chr>  <chr>     <chr>       <int>
 1 Aleena   male   masculine Aleen Minor     1
 2 Besalisk male   masculine Ojom            1
 3 Cerean   male   masculine Cerea           1
 4 Chagrian male   masculine Champala        1
 5 Clawdite female feminine  Zolan           1
 6 Droid    none   feminine  <NA>            1
 7 Droid    none   masculine Naboo           1
 8 Droid    none   masculine Tatooine        2
 9 Droid    none   masculine <NA>            2
10 Dug      male   masculine Malastare       1
# ℹ 55 more rows

6.3.5 across(), not selecting grouping variables

  • Because across() is usually used in combination with summarize() and mutate(), it doesn’t select grouping variables to avoid accidentally modifying them:

  • Example where the grouping variable g is not selected, even though it is numeric, so is not summed.

```{r}
across <- data.frame(g = c(1, 1, 2), x = c(-1, 1, 3), y = c(-1, -4, -9))
across

across %>% 
  group_by(g) %>% 
  summarize(across(where(is.numeric), sum)
            ) # the same as below

across %>% 
  group_by(g) %>% 
  summarize(across(where(is.numeric), ~ sum(.x))
            )
```
  g  x  y
1 1 -1 -1
2 1  1 -4
3 2  3 -9
# A tibble: 2 × 3
      g     x     y
  <dbl> <dbl> <dbl>
1     1     0    -5
2     2     3    -9
# A tibble: 2 × 3
      g     x     y
  <dbl> <dbl> <dbl>
1     1     0    -5
2     2     3    -9

6.4 Using across() with Multiple Functions

  • You can transform each variable with more than one function
  • Supply a named list of functions in the second argument:
```{r}
# Example with no list but argument
starwars %>% 
  summarise(across(where(is.numeric),
                   .fns = list(median, mean), na.rm = TRUE)
            )  

# Example with list and argument

starwars %>% 
  summarize(across(where(is.numeric), 
                   list(min = min, max = max, mean = mean), # column = function
                   na.rm = TRUE))
starwars %>% 
  summarise(across(where(is.numeric),
                   .fns = list(Median = median, Mean = mean), na.rm = TRUE) # Median is the name, while mean is the function. names always precede functions.
            )  
```
# A tibble: 1 × 6
  height_1 height_2 mass_1 mass_2 birth_year_1 birth_year_2
     <int>    <dbl>  <dbl>  <dbl>        <dbl>        <dbl>
1      180     175.     79   97.3           52         87.6
# A tibble: 1 × 9
  height_min height_max height_mean mass_min mass_max mass_mean birth_year_min
       <int>      <int>       <dbl>    <dbl>    <dbl>     <dbl>          <dbl>
1         66        264        175.       15     1358      97.3              8
# ℹ 2 more variables: birth_year_max <dbl>, birth_year_mean <dbl>
# A tibble: 1 × 6
  height_Median height_Mean mass_Median mass_Mean birth_year_Median
          <int>       <dbl>       <dbl>     <dbl>             <dbl>
1           180        175.          79      97.3                52
# ℹ 1 more variable: birth_year_Mean <dbl>

6.5 Controlling names with glue()

  • In the last case where we wanted to create a median and mean across all numeric variables, you’ll notice that the new variable names were always {name_of_variable}_{mean/median}. This is the default behavior of across - the name, an underscore, and the name of the function from the named list provided.

  • To customize how things are named, you can use glue syntax (we will cover what glue is in a few weeks) and the .names argument.

  • {.fn} will refer to the name of the function you used

  • {.col} will refer to the name of the column

```{r}
#change default, {.col}_{.fn} to {.fn}_{.col}
starwars %>% 
  summarise(across(.cols = where(is.numeric),
                   .fns = list(Median = median, Mean = mean), 
                   na.rm = TRUE, 
                   .names = "{.fn}_{.col}") # default: "{.col}_{.fn}" when multiple function is used
            )  

#change to greater customization
starwars %>% 
  summarise(across(where(is.numeric),
                   .fns = list(Median = median, Mean = mean), 
                    na.rm = TRUE,
                   .names = "The {.fn} of {.col}"))  
```
# A tibble: 1 × 6
  Median_height Mean_height Median_mass Mean_mass Median_birth_year
          <int>       <dbl>       <dbl>     <dbl>             <dbl>
1           180        175.          79      97.3                52
# ℹ 1 more variable: Mean_birth_year <dbl>
# A tibble: 1 × 6
  `The Median of height` `The Mean of height` `The Median of mass`
                   <int>                <dbl>                <dbl>
1                    180                 175.                   79
# ℹ 3 more variables: `The Mean of mass` <dbl>,
#   `The Median of birth_year` <dbl>, `The Mean of birth_year` <dbl>

6.6 Order matters

  • Be careful when combining numeric summaries with is.numeric():
```{r}
number <- data.frame(x = c(1, 2, 3), y = c(1, 4, 9))
number

# Here n became NA because n is numeric, so the across() computes its standard deviation, and 
# the standard deviation of 3 (a constant) is NA.
number %>% 
  summarise(n = n(),
            across(where(is.numeric), sd))

# You probably want to compute n() last to avoid this problem:
number %>% 
  summarise(across(where(is.numeric), sd), 
            n = n())
```
  x y
1 1 1
2 2 4
3 3 9
   n x        y
1 NA 1 4.041452
  x        y n
1 1 4.041452 3

6.7 Examples

6.7.1 Example 1

  • Find all rows where no variable has missing values:
```{r}
starwars %>% 
  filter(across(everything(), ~ !is.na(.x))) # we cannot do without "~" since we have "!"
```
# A tibble: 29 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 3 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 6 Biggs D…    183    84 black      light      brown           24   male  mascu…
 7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 9 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
10 Han Solo    180    80 brown      fair       brown           29   male  mascu…
# ℹ 19 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

6.7.2 Example 2

  • Find all distinct rows for variables with the word color in the name
  • For some verbs, like group_by(), count() and distinct(), you can omit the summary functions:
```{r}
starwars %>% 
  distinct(across(contains("color"))) %>% 
  arrange(hair_color, skin_color)
```
# A tibble: 67 × 3
   hair_color    skin_color eye_color
   <chr>         <chr>      <chr>    
 1 auburn        fair       blue     
 2 auburn, grey  fair       blue     
 3 auburn, white fair       blue-gray
 4 black         blue, grey yellow   
 5 black         brown      brown    
 6 black         dark       brown    
 7 black         dark       dark     
 8 black         fair       brown    
 9 black         light      brown    
10 black         tan        brown    
# ℹ 57 more rows

6.7.3 Example 3

  • Count all combinations of variables with a given pattern:
```{r}
starwars %>% 
  count(across(contains("color")), sort = TRUE)

starwars %>% 
  count(across(contains("color")), sort = TRUE) %>% 
  arrange(hair_color, skin_color)

starwars %>% 
  count(across(contains("color"))) %>% 
  arrange(hair_color, skin_color)
```
# A tibble: 67 × 4
   hair_color skin_color eye_color     n
   <chr>      <chr>      <chr>     <int>
 1 brown      light      brown         6
 2 brown      fair       blue          4
 3 none       grey       black         4
 4 black      dark       brown         3
 5 blond      fair       blue          3
 6 black      fair       brown         2
 7 black      tan        brown         2
 8 black      yellow     blue          2
 9 brown      fair       brown         2
10 none       white      yellow        2
# ℹ 57 more rows
# A tibble: 67 × 4
   hair_color    skin_color eye_color     n
   <chr>         <chr>      <chr>     <int>
 1 auburn        fair       blue          1
 2 auburn, grey  fair       blue          1
 3 auburn, white fair       blue-gray     1
 4 black         blue, grey yellow        1
 5 black         brown      brown         1
 6 black         dark       brown         3
 7 black         dark       dark          1
 8 black         fair       brown         2
 9 black         light      brown         1
10 black         tan        brown         2
# ℹ 57 more rows
# A tibble: 67 × 4
   hair_color    skin_color eye_color     n
   <chr>         <chr>      <chr>     <int>
 1 auburn        fair       blue          1
 2 auburn, grey  fair       blue          1
 3 auburn, white fair       blue-gray     1
 4 black         blue, grey yellow        1
 5 black         brown      brown         1
 6 black         dark       brown         3
 7 black         dark       dark          1
 8 black         fair       brown         2
 9 black         light      brown         1
10 black         tan        brown         2
# ℹ 57 more rows

6.8 Exercise

6.8.1 median

  • Calculate the median for each numeric variable for each species and gender
```{r}
glimpse(starwars)

starwars %>% 
  group_by(species, gender) %>% 
  summarise(across(where(is.numeric), ~ median(.x, na.rm = TRUE))
            )
```
Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex        <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films      <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
# A tibble: 42 × 5
# Groups:   species [38]
   species   gender    height  mass birth_year
   <chr>     <chr>      <dbl> <dbl>      <dbl>
 1 Aleena    masculine     79  15           NA
 2 Besalisk  masculine    198 102           NA
 3 Cerean    masculine    198  82           92
 4 Chagrian  masculine    196  NA           NA
 5 Clawdite  feminine     168  55           NA
 6 Droid     feminine      96  NA           NA
 7 Droid     masculine    132  53.5         33
 8 Dug       masculine    112  40           NA
 9 Ewok      masculine     88  20            8
10 Geonosian masculine    183  80           NA
# ℹ 32 more rows

6.8.2 min & max, count, sort

Calculate the min and max for each numeric variable other than birth year for each species and gender and count how many are in each group and sort from largest to smallest count.

```{r}
starwars %>% 
  group_by(species, gender) %>% 
  summarise(across(where(is.numeric) & !birth_year,
                   .fns = list(min = min, max = max),
                   na.rm = TRUE,
                   .names = "{.fn}_{.col}"),
            n = n()) %>% 
  arrange(desc(n))

# same as above
starwars %>% 
  group_by(species, gender) %>% 
  summarise(across(where(is.numeric) & !birth_year,
                   .fns = list(min = ~ min(.x, na.rm = TRUE), 
                               max = ~ max(.x, na.rm = TRUE))),
            n = n()) %>% 
  arrange(desc(n)) %>% 
  ungroup()
```
# A tibble: 42 × 7
# Groups:   species [38]
   species  gender    min_height max_height min_mass max_mass     n
   <chr>    <chr>          <int>      <int>    <dbl>    <dbl> <int>
 1 Human    masculine        170        202       75    136      26
 2 Human    feminine         150        185       45     75       9
 3 Droid    masculine         96        200       32    140       5
 4 <NA>     <NA>             157        185       48    110       4
 5 Gungan   masculine        196        224       66     82       3
 6 Mirialan feminine         166        170       50     56.2     2
 7 Wookiee  masculine        228        234      112    136       2
 8 Zabrak   masculine        171        175       80     80       2
 9 Aleena   masculine         79         79       15     15       1
10 Besalisk masculine        198        198      102    102       1
# ℹ 32 more rows
# A tibble: 42 × 7
   species  gender    height_min height_max mass_min mass_max     n
   <chr>    <chr>          <int>      <int>    <dbl>    <dbl> <int>
 1 Human    masculine        170        202       75    136      26
 2 Human    feminine         150        185       45     75       9
 3 Droid    masculine         96        200       32    140       5
 4 <NA>     <NA>             157        185       48    110       4
 5 Gungan   masculine        196        224       66     82       3
 6 Mirialan feminine         166        170       50     56.2     2
 7 Wookiee  masculine        228        234      112    136       2
 8 Zabrak   masculine        171        175       80     80       2
 9 Aleena   masculine         79         79       15     15       1
10 Besalisk masculine        198        198      102    102       1
# ℹ 32 more rows

7 case_when()

  • This function allows you to vectorize (and replace) multiple if_else() statements in a succinct and clear manner.

  • The syntax is case_when(…)

  • The dot dot dots are a placeholder for a series of two-side formulas

    • The Left hand side (LHS) determines which variables match a given case - this must return a logical vector
    • The Right hand side (RHS) provides the new or replacement value and all have to be of the same type of vector
    • Both LHS and RHS may be of length either 1 or n
    • you always end with a case of TRUE for when all of the other cases are FALSE
  • Example of a vectorized if - else

```{r}
x <- 1:50

case_when(               
  x < 5 ~ "less than 5",   # start with specific cases first
  x < 10 ~ "less than 10",
  TRUE ~ as.character(x)
)
```
 [1] "less than 5"  "less than 5"  "less than 5"  "less than 5"  "less than 10"
 [6] "less than 10" "less than 10" "less than 10" "less than 10" "10"          
[11] "11"           "12"           "13"           "14"           "15"          
[16] "16"           "17"           "18"           "19"           "20"          
[21] "21"           "22"           "23"           "24"           "25"          
[26] "26"           "27"           "28"           "29"           "30"          
[31] "31"           "32"           "33"           "34"           "35"          
[36] "36"           "37"           "38"           "39"           "40"          
[41] "41"           "42"           "43"           "44"           "45"          
[46] "46"           "47"           "48"           "49"           "50"          

7.1 fizzbuz problem

```{r}
case_when(
  x %% 15 == 0 ~ "fizz buzz",
  x %% 3 == 0 ~ "fizz",
  x %% 5 == 0 ~ "buzz",
  TRUE ~ as.character(x)
)
```
 [1] "1"         "2"         "fizz"      "4"         "buzz"      "fizz"     
 [7] "7"         "8"         "fizz"      "buzz"      "11"        "fizz"     
[13] "13"        "14"        "fizz buzz" "16"        "17"        "fizz"     
[19] "19"        "buzz"      "fizz"      "22"        "23"        "fizz"     
[25] "buzz"      "26"        "fizz"      "28"        "29"        "fizz buzz"
[31] "31"        "32"        "fizz"      "34"        "buzz"      "fizz"     
[37] "37"        "38"        "fizz"      "buzz"      "41"        "fizz"     
[43] "43"        "44"        "fizz buzz" "46"        "47"        "fizz"     
[49] "49"        "buzz"     

7.2 Caveat

  • Like an if statement, the arguments are evaluated in order, so you must proceed from the most specific to the most general.
  • This won’t work:
```{r}
case_when(
  x %%  5 == 0 ~ "fizz",
  x %%  7 == 0 ~ "buzz",
  x %% 35 == 0 ~ "fizz buzz", # notice that for 35, fizz was printed instead of fizz buzz.
  TRUE ~ as.character(x),
)
```
 [1] "1"    "2"    "3"    "4"    "fizz" "6"    "buzz" "8"    "9"    "fizz"
[11] "11"   "12"   "13"   "buzz" "fizz" "16"   "17"   "18"   "19"   "fizz"
[21] "buzz" "22"   "23"   "24"   "fizz" "26"   "27"   "buzz" "29"   "fizz"
[31] "31"   "32"   "33"   "34"   "fizz" "36"   "37"   "38"   "39"   "fizz"
[41] "41"   "buzz" "43"   "44"   "fizz" "46"   "47"   "48"   "buzz" "fizz"

7.3 with mutate()

` - case_when() is particularly useful inside mutate() when you want to create a new variable that relies on a complex combination of existing variables

```{r}
starwars %>%
  select(name:mass, gender, species) %>%
  mutate(height_cat = case_when(height > 191 ~ "tall",
                                height < 167 ~ "short",
                                TRUE ~ "average") )

# using quantile()
starwars %>%
  select(name:mass, gender, species) %>%
  mutate(height_cat = case_when(height > quantile(height, 3/4, na.rm = TRUE) ~ "tall",
                                height < quantile(height, 1/4, na.rm = TRUE) ~ "short",
                                TRUE  ~ "average") )   

# But this can be even more complicated... 
starwars %>%
  select(name:mass, gender, species) %>%
  mutate(type = case_when(height >= 200 | mass >= 200 ~ "large",
                          species == "Droid"        ~ "robot",
                          TRUE                      ~ "other" ) )
```
# A tibble: 87 × 6
   name               height  mass gender    species height_cat
   <chr>               <int> <dbl> <chr>     <chr>   <chr>     
 1 Luke Skywalker        172    77 masculine Human   average   
 2 C-3PO                 167    75 masculine Droid   average   
 3 R2-D2                  96    32 masculine Droid   short     
 4 Darth Vader           202   136 masculine Human   tall      
 5 Leia Organa           150    49 feminine  Human   short     
 6 Owen Lars             178   120 masculine Human   average   
 7 Beru Whitesun Lars    165    75 feminine  Human   short     
 8 R5-D4                  97    32 masculine Droid   short     
 9 Biggs Darklighter     183    84 masculine Human   average   
10 Obi-Wan Kenobi        182    77 masculine Human   average   
# ℹ 77 more rows
# A tibble: 87 × 6
   name               height  mass gender    species height_cat
   <chr>               <int> <dbl> <chr>     <chr>   <chr>     
 1 Luke Skywalker        172    77 masculine Human   average   
 2 C-3PO                 167    75 masculine Droid   average   
 3 R2-D2                  96    32 masculine Droid   short     
 4 Darth Vader           202   136 masculine Human   tall      
 5 Leia Organa           150    49 feminine  Human   short     
 6 Owen Lars             178   120 masculine Human   average   
 7 Beru Whitesun Lars    165    75 feminine  Human   short     
 8 R5-D4                  97    32 masculine Droid   short     
 9 Biggs Darklighter     183    84 masculine Human   average   
10 Obi-Wan Kenobi        182    77 masculine Human   average   
# ℹ 77 more rows
# A tibble: 87 × 6
   name               height  mass gender    species type 
   <chr>               <int> <dbl> <chr>     <chr>   <chr>
 1 Luke Skywalker        172    77 masculine Human   other
 2 C-3PO                 167    75 masculine Droid   robot
 3 R2-D2                  96    32 masculine Droid   robot
 4 Darth Vader           202   136 masculine Human   large
 5 Leia Organa           150    49 feminine  Human   other
 6 Owen Lars             178   120 masculine Human   other
 7 Beru Whitesun Lars    165    75 feminine  Human   other
 8 R5-D4                  97    32 masculine Droid   robot
 9 Biggs Darklighter     183    84 masculine Human   other
10 Obi-Wan Kenobi        182    77 masculine Human   other
# ℹ 77 more rows

8 rownames_to_column()

tibble::rownames_to_columns() - You many occasionally see data sets where there are row names. - This is allowed but not common with data frames as row names are removed when using […] - Tidy data (a tibble) does not use row_names so they are stripped when coerced to a tibble - Generally, it is best to avoid row names, because they are basically a character column with different semantics than every other column. - To detect if a data frame has row_names use has_rownames()

```{r}
head(state.x77)
#view(state.x77)
class(state.x77)
str(state.x77)
has_rownames(state.x77) # doesn't recognize rownames since the data is matrix.
has_rownames(as_tibble(state.x77)) # doesn't recognize it either since tibble removes the row names.
has_rownames(as.data.frame(state.x77)) # yes it does.

state.x77 %>% 
  as.data.frame() %>% 
  rownames_to_column("state") %>% 
  #str() %>% 
  as_tibble()
```
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
[1] "matrix" "array" 
 num [1:50, 1:8] 3615 365 2212 2110 21198 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
  ..$ : chr [1:8] "Population" "Income" "Illiteracy" "Life Exp" ...
[1] FALSE
[1] FALSE
[1] TRUE
# A tibble: 50 × 9
   state   Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost   Area
   <chr>        <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl>  <dbl>
 1 Alabama       3615   3624        2.1       69.0   15.1      41.3    20  50708
 2 Alaska         365   6315        1.5       69.3   11.3      66.7   152 566432
 3 Arizona       2212   4530        1.8       70.6    7.8      58.1    15 113417
 4 Arkans…       2110   3378        1.9       70.7   10.1      39.9    65  51945
 5 Califo…      21198   5114        1.1       71.7   10.3      62.6    20 156361
 6 Colora…       2541   4884        0.7       72.1    6.8      63.9   166 103766
 7 Connec…       3100   5348        1.1       72.5    3.1      56     139   4862
 8 Delawa…        579   4809        0.9       70.1    6.2      54.6   103   1982
 9 Florida       8277   4815        1.3       70.7   10.7      52.6    11  54090
10 Georgia       4931   4091        2         68.5   13.9      40.6    60  58073
# ℹ 40 more rows

8.1 Exercise

Check if the mtcars data set has row names. If so, convert the rownames to a column named car and convert to a tibble ::: {.cell}

```{r}
head(mtcars)
class(mtcars)
has_rownames(mtcars)

mtcars %>% 
  rownames_to_column("car") %>% 
  as_tibble()
```
# A tibble: 6 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
4  21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
5  18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
6  18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
[1] "tbl_df"     "tbl"        "data.frame"
[1] FALSE
# A tibble: 32 × 12
   car     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 1      21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 2      21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 3      22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 4      21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 5      18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 6      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 7      14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 8      24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 9      22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 10     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

:::

9 distinct()

  • distinct() subsets only unique/distinct rows from a data frame.
  • Rows are a subset of the input but appear in the same order.
  • Columns are not modified by default
  • Groups are not modified.
  • Data frame attributes are preserved.
```{r}
starwars %>% 
  distinct(homeworld) 

starwars %>% 
  distinct(homeworld, species) # this is the same as below

starwars %>% 
  count(homeworld, species) # it has n additionally, so it can be more useful

library(gapminder)
data("gapminder")

# how many countries are in each continent?

gapminder %>% 
  count(continent) # incorrect as there are 12 years of data for each country

gapminder %>% 
  count(country, continent) # still incorrect

gapminder %>% 
  distinct(country, continent) %>% 
  count(continent) # correct
```
# A tibble: 49 × 1
   homeworld 
   <chr>     
 1 Tatooine  
 2 Naboo     
 3 Alderaan  
 4 Stewjon   
 5 Eriadu    
 6 Kashyyyk  
 7 Corellia  
 8 Rodia     
 9 Nal Hutta 
10 Bestine IV
# ℹ 39 more rows
# A tibble: 57 × 2
   homeworld species
   <chr>     <chr>  
 1 Tatooine  Human  
 2 Tatooine  Droid  
 3 Naboo     Droid  
 4 Alderaan  Human  
 5 Stewjon   Human  
 6 Eriadu    Human  
 7 Kashyyyk  Wookiee
 8 Corellia  Human  
 9 Rodia     Rodian 
10 Nal Hutta Hutt   
# ℹ 47 more rows
# A tibble: 57 × 3
   homeworld      species       n
   <chr>          <chr>     <int>
 1 Alderaan       Human         3
 2 Aleen Minor    Aleena        1
 3 Bespin         Human         1
 4 Bestine IV     <NA>          1
 5 Cato Neimoidia Neimodian     1
 6 Cerea          Cerean        1
 7 Champala       Chagrian      1
 8 Chandrila      Human         1
 9 Concord Dawn   Human         1
10 Corellia       Human         2
# ℹ 47 more rows
# A tibble: 5 × 2
  continent     n
  <fct>     <int>
1 Africa      624
2 Americas    300
3 Asia        396
4 Europe      360
5 Oceania      24
# A tibble: 142 × 3
   country     continent     n
   <fct>       <fct>     <int>
 1 Afghanistan Asia         12
 2 Albania     Europe       12
 3 Algeria     Africa       12
 4 Angola      Africa       12
 5 Argentina   Americas     12
 6 Australia   Oceania      12
 7 Austria     Europe       12
 8 Bahrain     Asia         12
 9 Bangladesh  Asia         12
10 Belgium     Europe       12
# ℹ 132 more rows
# A tibble: 5 × 2
  continent     n
  <fct>     <int>
1 Africa       52
2 Americas     25
3 Asia         33
4 Europe       30
5 Oceania       2

10 row-wise operation with rowwise()

  • Before version 1.0, dplyr did not have special capabilities for operating on subsets of rows. You had to use for-loops for operating across rows or subsets of rows.

  • dplyr 1.0 added the new verb rowwise() to create multiple one-row data frames out of an existing data frame

  • These row-wise data frames are “virtual” subsets of the original data frame - - You can operate on each subset data frame as if it were its own data frame.

  • We will discuss a common use case: computing aggregates across multiple columns within a row

10.1 Creating row-wise data frames

  • Row-wise operations require a special type of grouping where each group consists of a single row.
  • You create this grouping using rowwise()
  • Like group_by(), rowwise() doesn’t really do anything itself; it just changes how the other dplyr verbs work.
  • For example, compare the results of mutate() in the following code:
```{r}
fruits <- tribble(
  ~"fruit", ~"height_1", ~"height_2", ~"height_3", ~"width", ~"weight",
  "Banana", 4, 4.2, 3.5, 1, 0.5,
  "Strawberry", 1, .9, 1.2, 1, .25,
  "Pineapple", 18, 17.7, 19.2, 6, 3)
fruits

# mean across all values in all rows
fruits %>% 
  mutate(height_mean = mean(c(height_1, height_2, height_3))) 

# mean across all values in each row
fruits %>% 
  rowwise(fruit) %>% 
  mutate(height_mean = mean(c(height_1, height_2, height_3))) # must use c()

fruits %>% 
  rowwise(fruit) %>% 
  mutate(height_mean = mean(height_1:height_3)) # Why is the mean different from above?

# Using c_across: See the next section
fruits %>% 
  rowwise(fruit) %>% 
  mutate(height_mean = mean(c_across(height_1:height_3))) # correct

fruits %>% 
  rowwise(fruit) %>% 
  mutate(height_mean = mean(c_across(contains("height")))) # correct
```
# A tibble: 3 × 6
  fruit      height_1 height_2 height_3 width weight
  <chr>         <dbl>    <dbl>    <dbl> <dbl>  <dbl>
1 Banana            4      4.2      3.5     1   0.5 
2 Strawberry        1      0.9      1.2     1   0.25
3 Pineapple        18     17.7     19.2     6   3   
# A tibble: 3 × 7
  fruit      height_1 height_2 height_3 width weight height_mean
  <chr>         <dbl>    <dbl>    <dbl> <dbl>  <dbl>       <dbl>
1 Banana            4      4.2      3.5     1   0.5         7.74
2 Strawberry        1      0.9      1.2     1   0.25        7.74
3 Pineapple        18     17.7     19.2     6   3           7.74
# A tibble: 3 × 7
# Rowwise:  fruit
  fruit      height_1 height_2 height_3 width weight height_mean
  <chr>         <dbl>    <dbl>    <dbl> <dbl>  <dbl>       <dbl>
1 Banana            4      4.2      3.5     1   0.5         3.9 
2 Strawberry        1      0.9      1.2     1   0.25        1.03
3 Pineapple        18     17.7     19.2     6   3          18.3 
# A tibble: 3 × 7
# Rowwise:  fruit
  fruit      height_1 height_2 height_3 width weight height_mean
  <chr>         <dbl>    <dbl>    <dbl> <dbl>  <dbl>       <dbl>
1 Banana            4      4.2      3.5     1   0.5          4  
2 Strawberry        1      0.9      1.2     1   0.25         1  
3 Pineapple        18     17.7     19.2     6   3           18.5
# A tibble: 3 × 7
# Rowwise:  fruit
  fruit      height_1 height_2 height_3 width weight height_mean
  <chr>         <dbl>    <dbl>    <dbl> <dbl>  <dbl>       <dbl>
1 Banana            4      4.2      3.5     1   0.5         3.9 
2 Strawberry        1      0.9      1.2     1   0.25        1.03
3 Pineapple        18     17.7     19.2     6   3          18.3 
# A tibble: 3 × 7
# Rowwise:  fruit
  fruit      height_1 height_2 height_3 width weight height_mean
  <chr>         <dbl>    <dbl>    <dbl> <dbl>  <dbl>       <dbl>
1 Banana            4      4.2      3.5     1   0.5         3.9 
2 Strawberry        1      0.9      1.2     1   0.25        1.03
3 Pineapple        18     17.7     19.2     6   3          18.3 

10.2 Per-row Summary Statistics

  • dplyr::summarize() makes it really easy to summarize values across the rows within one column.
  • We can combine rowwise() and summarize() to make it easy to summarize values across columns within one row.
  • Let’s say we want to compute the sum of w, x, y, and z for each row.
  • We start by making a row-wise data frame:
  • We then use mutate() to add a new column to each row, or.
  • Just use summarise() to return the summary:
```{r}
df <- tibble(id = 1:6, w = 10:15, x = 20:25, y = 30:35, z = 40:45)
df

rf <- df %>% rowwise(id)

# mutate to add new column for each row
rf %>% mutate(total = sum(c(w, x, y, z)))

# summarize without mutate
rf %>% summarise(total = sum(c(w, x, y, z)), 
                 .groups= "drop") # the same as ungroup()
```
# A tibble: 6 × 5
     id     w     x     y     z
  <int> <int> <int> <int> <int>
1     1    10    20    30    40
2     2    11    21    31    41
3     3    12    22    32    42
4     4    13    23    33    43
5     5    14    24    34    44
6     6    15    25    35    45
# A tibble: 6 × 6
# Rowwise:  id
     id     w     x     y     z total
  <int> <int> <int> <int> <int> <int>
1     1    10    20    30    40   100
2     2    11    21    31    41   104
3     3    12    22    32    42   108
4     4    13    23    33    43   112
5     5    14    24    34    44   116
6     6    15    25    35    45   120
# A tibble: 6 × 2
     id total
  <int> <int>
1     1   100
2     2   104
3     3   108
4     4   112
5     5   116
6     6   120

10.3 Using c_across

  • If you have a lot of variables, you can use c_across() to succinctly select many variables (c_across() uses tidy select helpers)
  • The where(is.numeric()) ensures we only select numeric columns
```{r}
rf %>% summarize(total = sum(across(w:z)), .groups = "drop")

rf %>% summarize(total = sum(c_across(w:z)), .groups = "drop")

df %>% 
  rowwise(id) %>% 
  summarize(total = sum(c_across(where(is.numeric))), 
            .groups = "drop")

rf %>% mutate(total = sum(across(w:z)))
rf %>% mutate(mean = mean(c_across(w:z)))

rf %>% mutate(total = sum(c_across(w:z)))

df %>% 
  rowwise(id) %>% 
  mutate(total = sum(c_across(where(is.numeric)))
         )

# If we want to use our fruits example... 
fruits %>% 
  rowwise(fruit) %>% 
  mutate(height_mean = mean(c_across(contains("height"))))
```
# A tibble: 6 × 2
     id total
  <int> <int>
1     1   100
2     2   104
3     3   108
4     4   112
5     5   116
6     6   120
# A tibble: 6 × 2
     id total
  <int> <int>
1     1   100
2     2   104
3     3   108
4     4   112
5     5   116
6     6   120
# A tibble: 6 × 2
     id total
  <int> <int>
1     1   100
2     2   104
3     3   108
4     4   112
5     5   116
6     6   120
# A tibble: 6 × 6
# Rowwise:  id
     id     w     x     y     z total
  <int> <int> <int> <int> <int> <int>
1     1    10    20    30    40   100
2     2    11    21    31    41   104
3     3    12    22    32    42   108
4     4    13    23    33    43   112
5     5    14    24    34    44   116
6     6    15    25    35    45   120
# A tibble: 6 × 6
# Rowwise:  id
     id     w     x     y     z  mean
  <int> <int> <int> <int> <int> <dbl>
1     1    10    20    30    40    25
2     2    11    21    31    41    26
3     3    12    22    32    42    27
4     4    13    23    33    43    28
5     5    14    24    34    44    29
6     6    15    25    35    45    30
# A tibble: 6 × 6
# Rowwise:  id
     id     w     x     y     z total
  <int> <int> <int> <int> <int> <int>
1     1    10    20    30    40   100
2     2    11    21    31    41   104
3     3    12    22    32    42   108
4     4    13    23    33    43   112
5     5    14    24    34    44   116
6     6    15    25    35    45   120
# A tibble: 6 × 6
# Rowwise:  id
     id     w     x     y     z total
  <int> <int> <int> <int> <int> <int>
1     1    10    20    30    40   100
2     2    11    21    31    41   104
3     3    12    22    32    42   108
4     4    13    23    33    43   112
5     5    14    24    34    44   116
6     6    15    25    35    45   120
# A tibble: 3 × 7
# Rowwise:  fruit
  fruit      height_1 height_2 height_3 width weight height_mean
  <chr>         <dbl>    <dbl>    <dbl> <dbl>  <dbl>       <dbl>
1 Banana            4      4.2      3.5     1   0.5         3.9 
2 Strawberry        1      0.9      1.2     1   0.25        1.03
3 Pineapple        18     17.7     19.2     6   3          18.3 

10.4 Using both c_across and across

  • so c_across is a rowwise version of the function we learned earlier, across.

  • You could combine c_across with column-wise across to compute the proportion of the total for each column:

```{r}
rf %>% #our row-wise data frame
  mutate(total = sum(c_across(w:z))) %>% #total each row
  ungroup() %>% # ungroup the rows
  mutate(across(w:z, ~ .x / total)) # the .x represents each column
```
# A tibble: 6 × 6
     id     w     x     y     z total
  <int> <dbl> <dbl> <dbl> <dbl> <int>
1     1 0.1   0.2   0.3   0.4     100
2     2 0.106 0.202 0.298 0.394   104
3     3 0.111 0.204 0.296 0.389   108
4     4 0.116 0.205 0.295 0.384   112
5     5 0.121 0.207 0.293 0.379   116
6     6 0.125 0.208 0.292 0.375   120

10.5 Exercise

  • Let’s create a new variable for the starwars data frame with the maximum of the height, mass, or birth year for each starwars character. Who has the maximum of all the characters?
```{r}
starwars %>% 
  filter(!is.na(height), !is.na(mass), !is.na(birth_year)) %>%
  rowwise() %>% 
  mutate(max_a = max(height, mass, birth_year, na.rm = TRUE)) %>% 
  relocate(max_a) %>% # move the variable to the first column
  ungroup() %>% 
  select(name, where(is.numeric)) %>% 
  slice_max(max_a) # slice() lets you index rows by their (integer) locations. 

# alt.
starwars %>% 
  filter(!is.na(height), !is.na(mass)) %>%
  rowwise() %>% 
  mutate(max_a = max(c(height, mass, birth_year), na.rm = TRUE)) %>% 
  relocate(max_a) %>% 
  ungroup() %>% 
  select(name, where(is.numeric)) %>% 
  filter(max_a == max(max_a))
```
# A tibble: 1 × 5
  name                  max_a height  mass birth_year
  <chr>                 <dbl>  <int> <dbl>      <dbl>
1 Jabba Desilijic Tiure  1358    175  1358        600
# A tibble: 1 × 5
  name                  max_a height  mass birth_year
  <chr>                 <dbl>  <int> <dbl>      <dbl>
1 Jabba Desilijic Tiure  1358    175  1358        600

11 References