M05-1-Principles-Data Wrangling with Tidyverse in R

Author

Jae Jung

Published

March 4, 2025

1 Overview

1.1 Learning Outcomes

After taking this workshop, participants should be able to do the following:

  • Describe the concept of data wrangling.
  • Describe how tibbles differ from data frames.
  • Explain how to convert wide or long data to “Tidy” data.
  • Explain how to merge relational data sets using join functions (next module).
  • Explain how to use grouped mutates and grouped filters together.
  • Be familiar with the major dplyr functions for transforming data.
  • Create a new variable with mutate() and case_when().
  • Use the pipe operator to shape data for analysis and visualization.

1.2 The textbook chapters to cover:

  • Ch3: Data Transformation
  • Ch5: Data Tidying
  • Ch13: Numbers

2 Introduction

2.1 Library and Packages

```{r}
#install.packages("tidyverse")
library(tidyverse)
```

2.2 Tibbles vs. Data Frames

  • Tibbles
    • A modern reimagining of the data frame
    • Columns can be lists (list-columns)
    • Column names can be non-syntactic (e.g., contain spaces or symbols)
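Both of those points can be seen in a toy tibble (the column names and values below are invented purely for illustration):

```{r}
library(tibble)

tb <- tibble(
  `first name` = c("ada", "bo"),   # non-syntactic column name, written with backticks
  scores = list(1:2, 1:5)          # a list-column: each cell holds a whole vector
)
tb
lengths(tb$scores)                 # length of the vector in each cell: 2 5
```

A plain data.frame() would mangle the backticked name (to first.name) and reject the list-column unless it were wrapped in I().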
```{r}
class(mtcars)
cars.tib <- as_tibble(mtcars)
class(cars.tib)
typeof(cars.tib)

#convert to data frame for legacy code
mtcars.df <- as.data.frame(cars.tib)
class(mtcars.df)
typeof(mtcars.df)
```
[1] "data.frame"
[1] "tbl_df"     "tbl"        "data.frame"
[1] "list"
[1] "data.frame"
[1] "list"

2.3 Two main differences between tibbles and data frames

  • printing and subsetting

2.3.1 printing

```{r}
print(cars.tib) # prints the first 10 rows and only the columns that fit on screen

# to see more rows or all columns, adjust n and width
data() # list the data sets built into R
print(starwars)
print(starwars, n = 15, width = Inf)
```
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
# A tibble: 87 × 14
   name               height  mass hair_color    skin_color  eye_color
   <chr>               <int> <dbl> <chr>         <chr>       <chr>    
 1 Luke Skywalker        172    77 blond         fair        blue     
 2 C-3PO                 167    75 <NA>          gold        yellow   
 3 R2-D2                  96    32 <NA>          white, blue red      
 4 Darth Vader           202   136 none          white       yellow   
 5 Leia Organa           150    49 brown         light       brown    
 6 Owen Lars             178   120 brown, grey   light       blue     
 7 Beru Whitesun Lars    165    75 brown         light       blue     
 8 R5-D4                  97    32 <NA>          white, red  red      
 9 Biggs Darklighter     183    84 black         light       brown    
10 Obi-Wan Kenobi        182    77 auburn, white fair        blue-gray
11 Anakin Skywalker      188    84 blond         fair        blue     
12 Wilhuff Tarkin        180    NA auburn, grey  fair        blue     
13 Chewbacca             228   112 brown         unknown     blue     
14 Han Solo              180    80 brown         fair        brown    
15 Greedo                173    74 <NA>          green       black    
   birth_year sex    gender    homeworld species films     vehicles  starships
        <dbl> <chr>  <chr>     <chr>     <chr>   <list>    <list>    <list>   
 1       19   male   masculine Tatooine  Human   <chr [5]> <chr [2]> <chr [2]>
 2      112   none   masculine Tatooine  Droid   <chr [6]> <chr [0]> <chr [0]>
 3       33   none   masculine Naboo     Droid   <chr [7]> <chr [0]> <chr [0]>
 4       41.9 male   masculine Tatooine  Human   <chr [4]> <chr [0]> <chr [1]>
 5       19   female feminine  Alderaan  Human   <chr [5]> <chr [1]> <chr [0]>
 6       52   male   masculine Tatooine  Human   <chr [3]> <chr [0]> <chr [0]>
 7       47   female feminine  Tatooine  Human   <chr [3]> <chr [0]> <chr [0]>
 8       NA   none   masculine Tatooine  Droid   <chr [1]> <chr [0]> <chr [0]>
 9       24   male   masculine Tatooine  Human   <chr [1]> <chr [0]> <chr [1]>
10       57   male   masculine Stewjon   Human   <chr [6]> <chr [1]> <chr [5]>
11       41.9 male   masculine Tatooine  Human   <chr [3]> <chr [2]> <chr [3]>
12       64   male   masculine Eriadu    Human   <chr [2]> <chr [0]> <chr [0]>
13      200   male   masculine Kashyyyk  Wookiee <chr [5]> <chr [1]> <chr [2]>
14       29   male   masculine Corellia  Human   <chr [4]> <chr [0]> <chr [2]>
15       44   male   masculine Rodia     Rodian  <chr [1]> <chr [0]> <chr [0]>
# ℹ 72 more rows

2.3.2 subsetting

```{r}
#column
cars.tib[1] #returns a tibble
cars.tib[[1]] #returns a vector

#row
cars.tib[1,] #returns a tibble
```
# A tibble: 32 × 1
     mpg
   <dbl>
 1  21  
 2  21  
 3  22.8
 4  21.4
 5  18.7
 6  18.1
 7  14.3
 8  24.4
 9  22.8
10  19.2
# ℹ 22 more rows
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
# A tibble: 1 × 11
    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    21     6   160   110   3.9  2.62  16.5     0     1     4     4
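Tibbles are also stricter than data frames when subsetting, which prevents silent surprises. A brief sketch of the two main differences:

```{r}
library(tibble)
cars.tib <- as_tibble(mtcars)

# $ on a data frame partial-matches column names; on a tibble it does not
identical(mtcars$m, mtcars$mpg)  # TRUE: $ partial-matched "mpg"
cars.tib$m                       # NULL, with a warning about an unknown column

# single-column [ drops to a vector for data frames, but never for tibbles
class(mtcars[, "mpg"])           # "numeric" (dropped to a vector)
class(cars.tib[, "mpg"])         # "tbl_df" "tbl" "data.frame" (still a tibble)
```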

3 What is Tidyverse?

  • It is a meta-package.
  • It bundles many packages essential to data wrangling and visualization.
  • Created by Hadley Wickham, author of R for Data Science. Core packages include:
    • ggplot2 for visualization
    • dplyr for data transformation
    • tidyr for reshaping data to make it tidy

3.1 Initial Data Preparation and Exploration

```{r}
#install.packages("tidyverse")
library(tidyverse)
library(nycflights13)
```

3.2 Pipe operator: %>% or |>

```{r}
cars.tib %>%
  filter(mpg > 25) %>%
  select(drat)

# what percent of the cars in the data are manual as opposed to automatic?
cars.tib %>%
  pull(am) %>% # same as cars.tib[['am']]
  mean(na.rm = TRUE) %>%
  scales::percent(accuracy = .1)
## pull() is a dplyr function that says "give me back this one variable instead of a data set"

round(mean(cars.tib[['am']], na.rm = TRUE)*100, 1) # base R equivalent
```
# A tibble: 6 × 1
   drat
  <dbl>
1  4.08
2  4.93
3  4.22
4  4.08
5  4.43
6  3.77
[1] "40.6%"
[1] 40.6
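Since R 4.1, base R also ships a native pipe, |>, which behaves the same as %>% for simple chains like the ones above. A quick sketch comparing the two:

```{r}
library(dplyr)
cars.tib <- tibble::as_tibble(mtcars)

# both pipes rewrite x |> f(y) as f(x, y); |> needs R >= 4.1
out_magrittr <- cars.tib %>% filter(mpg > 25) %>% select(drat)
out_native   <- cars.tib |>  filter(mpg > 25) |>  select(drat)
identical(out_magrittr, out_native)  # TRUE
```

For everyday chains the two are interchangeable; %>% comes from the magrittr package (loaded with the tidyverse), while |> needs no packages at all.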

4 Data Wrangling with dplyr

Data Transformation with dplyr :: CHEAT SHEET

4.1 Core of Data Wrangling

```{r}
#install.packages('vtable')
library(vtable)
vtable(cars.tib) # variable table: name, class, and value range of each column
t(cars.tib)      # transpose (coerces the tibble to a matrix)
t(t(cars.tib))   # transposing twice still yields a matrix, not the original tibble
```
cars.tib
Name  Class    Values
mpg   numeric  Num: 10.4 to 33.9
cyl   numeric  Num: 4 to 8
disp  numeric  Num: 71.1 to 472
hp    numeric  Num: 52 to 335
drat  numeric  Num: 2.76 to 4.93
wt    numeric  Num: 1.513 to 5.424
qsec  numeric  Num: 14.5 to 22.9
vs    numeric  Num: 0 to 1
am    numeric  Num: 0 to 1
gear  numeric  Num: 3 to 5
carb  numeric  Num: 1 to 8
       [,1]    [,2]   [,3]    [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
mpg   21.00  21.000  22.80  21.400  18.70  18.10  14.30  24.40  22.80  19.20
cyl    6.00   6.000   4.00   6.000   8.00   6.00   8.00   4.00   4.00   6.00
disp 160.00 160.000 108.00 258.000 360.00 225.00 360.00 146.70 140.80 167.60
hp   110.00 110.000  93.00 110.000 175.00 105.00 245.00  62.00  95.00 123.00
drat   3.90   3.900   3.85   3.080   3.15   2.76   3.21   3.69   3.92   3.92
wt     2.62   2.875   2.32   3.215   3.44   3.46   3.57   3.19   3.15   3.44
qsec  16.46  17.020  18.61  19.440  17.02  20.22  15.84  20.00  22.90  18.30
vs     0.00   0.000   1.00   1.000   0.00   1.00   0.00   1.00   1.00   1.00
am     1.00   1.000   1.00   0.000   0.00   0.00   0.00   0.00   0.00   0.00
gear   4.00   4.000   4.00   3.000   3.00   3.00   3.00   4.00   4.00   4.00
carb   4.00   4.000   1.00   1.000   2.00   1.00   4.00   2.00   2.00   4.00
      [,11]  [,12]  [,13]  [,14]  [,15]   [,16]   [,17] [,18]  [,19]  [,20]
mpg   17.80  16.40  17.30  15.20  10.40  10.400  14.700 32.40 30.400 33.900
cyl    6.00   8.00   8.00   8.00   8.00   8.000   8.000  4.00  4.000  4.000
disp 167.60 275.80 275.80 275.80 472.00 460.000 440.000 78.70 75.700 71.100
hp   123.00 180.00 180.00 180.00 205.00 215.000 230.000 66.00 52.000 65.000
drat   3.92   3.07   3.07   3.07   2.93   3.000   3.230  4.08  4.930  4.220
wt     3.44   4.07   3.73   3.78   5.25   5.424   5.345  2.20  1.615  1.835
qsec  18.90  17.40  17.60  18.00  17.98  17.820  17.420 19.47 18.520 19.900
vs     1.00   0.00   0.00   0.00   0.00   0.000   0.000  1.00  1.000  1.000
am     0.00   0.00   0.00   0.00   0.00   0.000   0.000  1.00  1.000  1.000
gear   4.00   3.00   3.00   3.00   3.00   3.000   3.000  4.00  4.000  4.000
carb   4.00   3.00   3.00   3.00   4.00   4.000   4.000  1.00  2.000  1.000
       [,21]  [,22]   [,23]  [,24]   [,25]  [,26]  [,27]   [,28]  [,29]  [,30]
mpg   21.500  15.50  15.200  13.30  19.200 27.300  26.00  30.400  15.80  19.70
cyl    4.000   8.00   8.000   8.00   8.000  4.000   4.00   4.000   8.00   6.00
disp 120.100 318.00 304.000 350.00 400.000 79.000 120.30  95.100 351.00 145.00
hp    97.000 150.00 150.000 245.00 175.000 66.000  91.00 113.000 264.00 175.00
drat   3.700   2.76   3.150   3.73   3.080  4.080   4.43   3.770   4.22   3.62
wt     2.465   3.52   3.435   3.84   3.845  1.935   2.14   1.513   3.17   2.77
qsec  20.010  16.87  17.300  15.41  17.050 18.900  16.70  16.900  14.50  15.50
vs     1.000   0.00   0.000   0.00   0.000  1.000   0.00   1.000   0.00   0.00
am     0.000   0.00   0.000   0.00   0.000  1.000   1.00   1.000   1.00   1.00
gear   3.000   3.00   3.000   3.00   3.000  4.000   5.00   5.000   5.00   5.00
carb   1.000   2.00   2.000   4.00   2.000  1.000   2.00   2.000   4.00   6.00
      [,31]  [,32]
mpg   15.00  21.40
cyl    8.00   4.00
disp 301.00 121.00
hp   335.00 109.00
drat   3.54   4.11
wt     3.57   2.78
qsec  14.60  18.60
vs     0.00   1.00
am     1.00   1.00
gear   5.00   4.00
carb   8.00   2.00
       mpg cyl  disp  hp drat    wt  qsec vs am gear carb
 [1,] 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
 [2,] 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
 [3,] 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
 [4,] 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
 [5,] 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
 [6,] 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
 [7,] 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
 [8,] 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
 [9,] 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
[10,] 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
[11,] 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
[12,] 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
[13,] 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
[14,] 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
[15,] 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
[16,] 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
[17,] 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
[18,] 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
[19,] 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
[20,] 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
[21,] 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
[22,] 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
[23,] 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
[24,] 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
[25,] 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
[26,] 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
[27,] 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
[28,] 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
[29,] 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
[30,] 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
[31,] 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
[32,] 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

4.2 Stages of Data Wrangling

  1. From records to data.
  2. From data to tidy data.
  3. From tidy data to data for analysis.

5 From Data to Tidy data

5.1 Three conditions for Tidy Data

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

5.2 tidyr package

Note

The tidyr package is the main package for tidying data.

  • pivot_longer()
  • pivot_wider()
  • separate()
  • unite()

5.3 Wide data to long data:

```{r}
# examples
# data() # to see all data available in R
relig_income # wide data
billboard  # wide data
```
# A tibble: 18 × 11
   religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
   <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
 1 Agnostic      27        34        60        81        76       137        122
 2 Atheist       12        27        37        52        35        70         73
 3 Buddhist      27        21        30        34        33        58         62
 4 Catholic     418       617       732       670       638      1116        949
 5 Don’t k…      15        14        15        11        10        35         21
 6 Evangel…     575       869      1064       982       881      1486        949
 7 Hindu          1         9         7         9        11        34         47
 8 Histori…     228       244       236       238       197       223        131
 9 Jehovah…      20        27        24        24        21        30         15
10 Jewish        19        19        25        25        30        95         69
11 Mainlin…     289       495       619       655       651      1107        939
12 Mormon        29        40        48        51        56       112         85
13 Muslim         6         7         9        10         9        23         16
14 Orthodox      13        17        23        32        32        47         38
15 Other C…       9         7        11        13        13        14         18
16 Other F…      20        33        40        46        49        63         46
17 Other W…       5         2         3         4         2         7          3
18 Unaffil…     217       299       374       365       341       528        407
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>
# A tibble: 317 × 79
   artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
   <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
 2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
 3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
 4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
 5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
 6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
 7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
 8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
 9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
# ℹ 307 more rows
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
#   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
#   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
#   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
#   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
#   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

5.3.1 syntax:

pivot_longer(
  data,
  cols,
  ...,
  cols_vary = "fastest",
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = NULL,
  names_transform = NULL,
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = NULL,
  values_transform = NULL
)

5.3.2 pivot_longer() example 1

```{r pivot_longer example}
wide_pop <- read_csv("https://raw.githubusercontent.com/jsgriffin96/r_workshop_4/master/data/wide-population.csv")
wide_pop

## convert to a tidy data set
tidy_pop <- pivot_longer(wide_pop, cols = c('2018','2019','2020'), names_to = 'year', values_to = 'population')
tidy_pop
skimr::skim(tidy_pop)

# Following the syntax
pivot_longer(wide_pop, '2018':'2020', names_to = 'year', values_to = 'population')

# Using the piping operator
wide_pop %>% 
  pivot_longer('2018':'2020', names_to = "year", values_to = "pop")
```
# A tibble: 51 × 4
   State                  `2018`   `2019`   `2020`
   <chr>                   <dbl>    <dbl>    <dbl>
 1 Alabama               4874486  4887681  4903185
 2 Alaska                 739700   735139   731545
 3 Arizona               7044008  7158024  7278717
 4 Arkansas              3001345  3009733  3017804
 5 California           39358497 39461588 39512223
 6 Colorado              5611885  5691287  5758736
 7 Connecticut           3573297  3571520  3565287
 8 Delaware               956823   965479   973764
 9 District of Columbia   694906   701547   705749
10 Florida              20963613 21244317 21477737
# ℹ 41 more rows
# A tibble: 153 × 3
   State    year  population
   <chr>    <chr>      <dbl>
 1 Alabama  2018     4874486
 2 Alabama  2019     4887681
 3 Alabama  2020     4903185
 4 Alaska   2018      739700
 5 Alaska   2019      735139
 6 Alaska   2020      731545
 7 Arizona  2018     7044008
 8 Arizona  2019     7158024
 9 Arizona  2020     7278717
10 Arkansas 2018     3001345
# ℹ 143 more rows
Data summary
Name tidy_pop
Number of rows 153
Number of columns 3
_______________________
Column type frequency:
character 2
numeric 1
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
State                  0              1    4   20      0        51           0
year                   0              1    4    4      0         3           0

Variable type: numeric

skim_variable  n_missing  complete_rate     mean       sd      p0      p25      p50      p75      p100  hist
population             0              1  6404657  7277281  577601  1787065  4461153  7423362  39512223  ▇▂▁▁▁
# A tibble: 153 × 3
   State    year  population
   <chr>    <chr>      <dbl>
 1 Alabama  2018     4874486
 2 Alabama  2019     4887681
 3 Alabama  2020     4903185
 4 Alaska   2018      739700
 5 Alaska   2019      735139
 6 Alaska   2020      731545
 7 Arizona  2018     7044008
 8 Arizona  2019     7158024
 9 Arizona  2020     7278717
10 Arkansas 2018     3001345
# ℹ 143 more rows
# A tibble: 153 × 3
   State    year      pop
   <chr>    <chr>   <dbl>
 1 Alabama  2018  4874486
 2 Alabama  2019  4887681
 3 Alabama  2020  4903185
 4 Alaska   2018   739700
 5 Alaska   2019   735139
 6 Alaska   2020   731545
 7 Arizona  2018  7044008
 8 Arizona  2019  7158024
 9 Arizona  2020  7278717
10 Arkansas 2018  3001345
# ℹ 143 more rows

5.3.3 pivot_longer() example 2

```{r}
# Song ranking
billboard %>%
  pivot_longer(cols = starts_with('wk'), 
              names_to = 'week',
              names_prefix = 'wk', 
              values_to = 'chart_position',
              values_drop_na = TRUE 
             ) 
```
# A tibble: 5,307 × 5
   artist  track                   date.entered week  chart_position
   <chr>   <chr>                   <date>       <chr>          <dbl>
 1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   1                 87
 2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   2                 82
 3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   3                 72
 4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   4                 77
 5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   5                 87
 6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   6                 94
 7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   7                 99
 8 2Ge+her The Hardest Part Of ... 2000-09-02   1                 91
 9 2Ge+her The Hardest Part Of ... 2000-09-02   2                 87
10 2Ge+her The Hardest Part Of ... 2000-09-02   3                 92
# ℹ 5,297 more rows
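Note that week comes back as <chr> above, so it would sort as text ("10" before "2"). A sketch using the names_transform argument to coerce it during the pivot:

```{r}
library(dplyr)
library(tidyr)

# names_transform applies a function to the new names column as it is created
long_billboard <- billboard %>%
  pivot_longer(cols = starts_with('wk'),
               names_to = 'week',
               names_prefix = 'wk',
               names_transform = list(week = as.integer),
               values_to = 'chart_position',
               values_drop_na = TRUE)
class(long_billboard$week)  # "integer", so weeks sort and plot numerically
```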

5.4 Long data to wide data

5.4.1 Syntax

pivot_wider(
  data,
  ...,
  id_cols = NULL,
  id_expand = FALSE,
  names_from = name,
  names_prefix = "",
  names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_vary = "fastest",
  names_expand = FALSE,
  names_repair = "check_unique",
  values_from = value,
  values_fill = NULL,
  values_fn = NULL,
  unused_fn = NULL
)

5.4.2 pivot_wider() Example

```{r pivot_wider Example}
long_states <- read_csv("https://raw.githubusercontent.com/jsgriffin96/r_workshop_4/master/data/long-population.csv")
long_states # why is this structure not tidy? 

## convert to a tidy data
tidy_states <- pivot_wider(long_states, names_from = condition, values_from = measure)
tidy_states
```
# A tibble: 153 × 3
   State    condition           measure
   <chr>    <chr>                 <dbl>
 1 Alabama  population      29782365   
 2 Alabama  employment_rate        0.61
 3 Alabama  number_colleges     1898   
 4 Alaska   population       7843489   
 5 Alaska   employment_rate        0.89
 6 Alaska   number_colleges     1375   
 7 Arizona  population      13881560   
 8 Arizona  employment_rate        0.29
 9 Arizona  number_colleges      258   
10 Arkansas population       6550014   
# ℹ 143 more rows
# A tibble: 51 × 4
   State                population employment_rate number_colleges
   <chr>                     <dbl>           <dbl>           <dbl>
 1 Alabama                29782365            0.61            1898
 2 Alaska                  7843489            0.89            1375
 3 Arizona                13881560            0.29             258
 4 Arkansas                6550014            0.41            1580
 5 California             26297974            0.89             130
 6 Colorado               26857590            0.45            1205
 7 Connecticut             2153453            0.62            1783
 8 Delaware               12239097            0.27             964
 9 District of Columbia   11872901            0.75            1856
10 Florida                 8139044            0.33             625
# ℹ 41 more rows
```{r pivot_wider Alternatives}
# following syntax closely
pivot_wider(long_states, id_cols = State, names_from = condition, values_from = measure)

# using the piping operator
long_states %>% 
  pivot_wider(id_cols = State, names_from = condition, values_from = measure)
```
# A tibble: 51 × 4
   State                population employment_rate number_colleges
   <chr>                     <dbl>           <dbl>           <dbl>
 1 Alabama                29782365            0.61            1898
 2 Alaska                  7843489            0.89            1375
 3 Arizona                13881560            0.29             258
 4 Arkansas                6550014            0.41            1580
 5 California             26297974            0.89             130
 6 Colorado               26857590            0.45            1205
 7 Connecticut             2153453            0.62            1783
 8 Delaware               12239097            0.27             964
 9 District of Columbia   11872901            0.75            1856
10 Florida                 8139044            0.33             625
# ℹ 41 more rows
# A tibble: 51 × 4
   State                population employment_rate number_colleges
   <chr>                     <dbl>           <dbl>           <dbl>
 1 Alabama                29782365            0.61            1898
 2 Alaska                  7843489            0.89            1375
 3 Arizona                13881560            0.29             258
 4 Arkansas                6550014            0.41            1580
 5 California             26297974            0.89             130
 6 Colorado               26857590            0.45            1205
 7 Connecticut             2153453            0.62            1783
 8 Delaware               12239097            0.27             964
 9 District of Columbia   11872901            0.75            1856
10 Florida                 8139044            0.33             625
# ℹ 41 more rows

5.5 Separating data

5.5.1 Syntax

separate(
  data,
  col,
  into,
  sep = "[^[:alnum:]]+",
  remove = TRUE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
  ...
)

5.5.2 separate() example

```{r separate() example}
state_rate <- read_csv("https://raw.githubusercontent.com/jsgriffin96/r_workshop_4/master/data/state_rate.csv")
state_rate
# rate column has both deaths and population in one cell, which makes it untidy.

tidy_rate <- separate(state_rate, col = rate, into = c('deaths', 'population'))
tidy_rate
```
# A tibble: 51 × 2
   state                rate          
   <chr>                <chr>         
 1 Alabama              21782/4903185 
 2 Alaska               31460/731545  
 3 Arizona              29814/7278717 
 4 Arkansas             10602/3017804 
 5 California           45850/39512223
 6 Colorado             23366/5758736 
 7 Connecticut          34784/3565287 
 8 Delaware             16342/973764  
 9 District of Columbia 9276/705749   
10 Florida              9680/21477737 
# ℹ 41 more rows
# A tibble: 51 × 3
   state                deaths population
   <chr>                <chr>  <chr>     
 1 Alabama              21782  4903185   
 2 Alaska               31460  731545    
 3 Arizona              29814  7278717   
 4 Arkansas             10602  3017804   
 5 California           45850  39512223  
 6 Colorado             23366  5758736   
 7 Connecticut          34784  3565287   
 8 Delaware             16342  973764    
 9 District of Columbia 9276   705749    
10 Florida              9680   21477737  
# ℹ 41 more rows
```{r separate() Alternatives}
separate(data = state_rate, rate, into = c('deaths', 'population'), sep = "/")

state_rate %>% 
  separate(rate, into = c("deaths", "population"))
```
# A tibble: 51 × 3
   state                deaths population
   <chr>                <chr>  <chr>     
 1 Alabama              21782  4903185   
 2 Alaska               31460  731545    
 3 Arizona              29814  7278717   
 4 Arkansas             10602  3017804   
 5 California           45850  39512223  
 6 Colorado             23366  5758736   
 7 Connecticut          34784  3565287   
 8 Delaware             16342  973764    
 9 District of Columbia 9276   705749    
10 Florida              9680   21477737  
# ℹ 41 more rows
# A tibble: 51 × 3
   state                deaths population
   <chr>                <chr>  <chr>     
 1 Alabama              21782  4903185   
 2 Alaska               31460  731545    
 3 Arizona              29814  7278717   
 4 Arkansas             10602  3017804   
 5 California           45850  39512223  
 6 Colorado             23366  5758736   
 7 Connecticut          34784  3565287   
 8 Delaware             16342  973764    
 9 District of Columbia 9276   705749    
10 Florida              9680   21477737  
# ℹ 41 more rows
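In all the outputs above, deaths and population remain <chr>. separate() can re-parse the new columns with convert = TRUE; a small sketch on toy data (the numbers below are invented and are not from state_rate):

```{r}
library(tidyr)

# toy data in the same state/rate shape as state_rate
toy_rate <- tibble::tibble(state = c("A", "B"),
                           rate  = c("10/100", "20/400"))

# convert = TRUE runs type conversion, so the pieces come back as integers
tidy_toy <- separate(toy_rate, rate,
                     into = c("deaths", "population"),
                     sep = "/", convert = TRUE)
tidy_toy$deaths / tidy_toy$population  # now numeric arithmetic works: 0.10 0.05
```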

5.6 Uniting data

5.6.1 Syntax

  • from tidyr
unite(data, col, ..., sep = "_", remove = TRUE, na.rm = FALSE)

5.6.2 unite() example

```{r unite() example}
untidy_rate <- unite(tidy_rate, col = rate, c(deaths, population), sep='/')
untidy_rate # successful but untidy data though
```
# A tibble: 51 × 2
   state                rate          
   <chr>                <chr>         
 1 Alabama              21782/4903185 
 2 Alaska               31460/731545  
 3 Arizona              29814/7278717 
 4 Arkansas             10602/3017804 
 5 California           45850/39512223
 6 Colorado             23366/5758736 
 7 Connecticut          34784/3565287 
 8 Delaware             16342/973764  
 9 District of Columbia 9276/705749   
10 Florida              9680/21477737 
# ℹ 41 more rows
```{r unite() Alternatives}
unite(data = tidy_rate, rate, deaths, population, sep='/')

tidy_rate %>% 
  unite(col = "rate", deaths, population, sep = "/")
```
# A tibble: 51 × 2
   state                rate          
   <chr>                <chr>         
 1 Alabama              21782/4903185 
 2 Alaska               31460/731545  
 3 Arizona              29814/7278717 
 4 Arkansas             10602/3017804 
 5 California           45850/39512223
 6 Colorado             23366/5758736 
 7 Connecticut          34784/3565287 
 8 Delaware             16342/973764  
 9 District of Columbia 9276/705749   
10 Florida              9680/21477737 
# ℹ 41 more rows
# A tibble: 51 × 2
   state                rate          
   <chr>                <chr>         
 1 Alabama              21782/4903185 
 2 Alaska               31460/731545  
 3 Arizona              29814/7278717 
 4 Arkansas             10602/3017804 
 5 California           45850/39512223
 6 Colorado             23366/5758736 
 7 Connecticut          34784/3565287 
 8 Delaware             16342/973764  
 9 District of Columbia 9276/705749   
10 Florida              9680/21477737 
# ℹ 41 more rows
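One unite() option worth knowing: by default, missing values are pasted in as the literal string "NA", while na.rm = TRUE drops them before uniting. A toy sketch (data invented for illustration):

```{r}
library(tidyr)

df <- tibble::tibble(first = c("Ada", "Grace"), last = c("Lovelace", NA))

unite(df, col = "full", first, last, sep = " ")               # row 2: "Grace NA"
unite(df, col = "full", first, last, sep = " ", na.rm = TRUE) # row 2: "Grace"
```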

6 Data transformation with dplyr (From Tidy Data to Your Analysis)

Data Transformation: Wickham’s book (R for Data Science): Ch5

  • dplyr basics
  • Filter rows with filter()
  • Arrange rows with arrange()
  • Select columns with select()
  • Add new variables with mutate()
  • Grouped summaries with summarise()
  • Grouped mutates and filters
  • pull()
  • case_when(): part of dplyr’s programming toolkit

dplyr cheat sheet: https://nyu-cdsc.github.io/learningr/assets/data-transformation.pdf
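As a preview of mutate() with case_when(), here is a small sketch that labels cars by fuel efficiency; the 20 and 27 mpg cut-offs are arbitrary, chosen only for illustration:

```{r}
library(dplyr)

# case_when() evaluates conditions top to bottom; the first match wins
cars.lab <- mtcars %>%
  tibble::as_tibble() %>%
  mutate(efficiency = case_when(
    mpg >= 27 ~ "high",
    mpg >= 20 ~ "medium",
    TRUE      ~ "low"     # fall-through default for everything else
  ))
count(cars.lab, efficiency)
```

Because conditions are checked in order, a car with mpg of 30 is labeled "high", never "medium", even though it satisfies both conditions.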

6.1 flights data

```{r}
#install.packages("nycflights13")
library(nycflights13)

# nycflights13
nycflights13::flights
#?flights
```
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

6.1.1 Variable Description

  • Data frame with columns
  • year, month, day: Date of departure.
  • dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local tz.
  • sched_dep_time, sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local tz.
  • dep_delay, arr_delay:
    • Departure and arrival delays, in minutes.
    • Negative times represent early departures/arrivals.
  • carrier: Two letter carrier abbreviation. See airlines to get name.
  • flight: Flight number.
  • tailnum: Plane tail number. See planes for additional metadata.
  • origin, dest: Origin and destination. See airports for additional metadata.
  • air_time: Amount of time spent in the air, in minutes.
  • distance: Distance between airports, in miles.
  • hour, minute: Time of scheduled departure broken into hour and minutes.
  • time_hour:
    • Scheduled date and hour of the flight as a POSIXct date.
    • Along with origin, can be used to join flights data to weather data.
Source: RITA, Bureau of Transportation Statistics, https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
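For intuition, the HHMM/HMM encoding above can be unpacked with integer division and modulo. A minimal sketch (hhmm_to_min is a hypothetical helper for illustration, not part of nycflights13):

```{r}
# Convert an HHMM/HMM integer (e.g. 517 = 5:17 am) to minutes since
# midnight. %/% is integer division, %% is the modulo remainder.
hhmm_to_min <- function(t) (t %/% 100) * 60 + t %% 100
hhmm_to_min(517)    # 5*60 + 17 = 317
hhmm_to_min(2359)   # 23*60 + 59 = 1439
```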

6.2 filter(): filter rows

```{r filter}
cars <- mtcars
summary(cars)
#?mtcars

# choose all the cases with cylinder size of 8 and automatic gear.
# [, 2] cyl Number of cylinders
# [, 9] am  Transmission (0 = automatic, 1 = manual)
v8_automatic <- filter(cars, cyl == 8, am == 0)
v8_automatic

# R either prints out the results, or saves them to a variable. If you want to do both, you can wrap the assignment in parentheses:
(v8_automatic <- filter(cars, cyl == 8, am == 0))
```
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

6.2.1 Comparisons

  • To use filtering effectively, you have to know how to select the observations that you want using the comparison operators.
  • R provides the standard suite:
    • >, >=, <, <=, != (not equal), and == (equal).
```{r}
filter(flights, month == 1) # use ==, not =; filter(flights, month = 1) is an error

flights |> 
  count(year, month == 1)
```
# A tibble: 27,004 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 26,994 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 2 × 3
   year `month == 1`      n
  <int> <lgl>         <int>
1  2013 FALSE        309772
2  2013 TRUE          27004

6.2.1.1 Floating point numbers

Caution

Computers use finite precision arithmetic (they obviously can’t store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on ==, use near():

```{r floating point numbers}
# common problem you might encounter when using ==: floating point numbers.

sqrt(2) ^ 2 == 2
near(sqrt(2)^2, 2) # near() (from dplyr) compares floating point numbers with a built-in tolerance, so it is safer than ==

1/49 * 49 == 1
near(1/49 * 49, 1)
```
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE

6.2.2 Logical operators

filter() is used often with logical conditions

  • A lot of programming in general is based on writing logical conditions that check whether something is true
  • In R, if the condition is true, it returns TRUE, which turns into 1 if you do a calculation with it.
  • If false, it returns FALSE, which turns into 0.
  • tip: ifelse(condition, TRUE, FALSE) is redundant — the condition already evaluates to TRUE or FALSE on its own.
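The TRUE-becomes-1 behavior is handy for counting. A quick base-R illustration:

```{r}
# TRUE behaves as 1 and FALSE as 0 in arithmetic, so sum() counts
# matches and mean() gives the proportion of TRUEs.
x <- c(2, 5, 8, 11)
x > 5        # FALSE FALSE TRUE TRUE
sum(x > 5)   # 2 values exceed 5
mean(x > 5)  # proportion exceeding 5: 0.5
```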
```{r}
filter(flights, month == 11 | month == 12) # the same as the %in% version below
filter(flights, month %in% c(11, 12)) 
```
# A tibble: 55,403 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    11     1        5           2359         6      352            345
 2  2013    11     1       35           2250       105      123           2356
 3  2013    11     1      455            500        -5      641            651
 4  2013    11     1      539            545        -6      856            827
 5  2013    11     1      542            545        -3      831            855
 6  2013    11     1      549            600       -11      912            923
 7  2013    11     1      550            600       -10      705            659
 8  2013    11     1      554            600        -6      659            701
 9  2013    11     1      554            600        -6      826            827
10  2013    11     1      554            600        -6      749            751
# ℹ 55,393 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 55,403 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    11     1        5           2359         6      352            345
 2  2013    11     1       35           2250       105      123           2356
 3  2013    11     1      455            500        -5      641            651
 4  2013    11     1      539            545        -6      856            827
 5  2013    11     1      542            545        -3      831            855
 6  2013    11     1      549            600       -11      912            923
 7  2013    11     1      550            600       -10      705            659
 8  2013    11     1      554            600        -6      659            701
 9  2013    11     1      554            600        -6      826            827
10  2013    11     1      554            600        -6      749            751
# ℹ 55,393 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
Tip

De Morgan’s law:

  • !(x & y) is the same as !x | !y, and
  • !(x | y) is the same as !x & !y.
  • For example, if you wanted to find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
```{r}
df <- flights
filter(df, !(dep_delay > 120 | arr_delay > 120)) # the same as the two below

filter(df, dep_delay <= 120 & arr_delay <= 120) # the same as above
filter(df, dep_delay <= 120, arr_delay <= 120) # commas between conditions act like &
```
# A tibble: 316,050 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 316,040 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 316,050 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 316,040 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 316,050 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 316,040 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
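De Morgan’s law itself can be checked directly on all four TRUE/FALSE combinations, a minimal base-R sketch:

```{r}
# Verify De Morgan's law across every pair of logical values.
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)
all(!(x & y) == (!x | !y))   # TRUE
all(!(x | y) == (!x & !y))   # TRUE
```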

6.2.3 Missing values

Important
  • filter() only includes rows where the condition is TRUE;
  • it excludes both FALSE and NA values.
  • If you want to preserve missing values, ask for them explicitly:
```{r}
nrow(df)
head(is.na(df))

df1 <- tibble(x = c(1, NA, 3))
filter(df1, x > 1)

filter(df1, is.na(x) | x > 1)
```
[1] 336776
      year month   day dep_time sched_dep_time dep_delay arr_time
[1,] FALSE FALSE FALSE    FALSE          FALSE     FALSE    FALSE
[2,] FALSE FALSE FALSE    FALSE          FALSE     FALSE    FALSE
[3,] FALSE FALSE FALSE    FALSE          FALSE     FALSE    FALSE
[4,] FALSE FALSE FALSE    FALSE          FALSE     FALSE    FALSE
[5,] FALSE FALSE FALSE    FALSE          FALSE     FALSE    FALSE
[6,] FALSE FALSE FALSE    FALSE          FALSE     FALSE    FALSE
     sched_arr_time arr_delay carrier flight tailnum origin  dest air_time
[1,]          FALSE     FALSE   FALSE  FALSE   FALSE  FALSE FALSE    FALSE
[2,]          FALSE     FALSE   FALSE  FALSE   FALSE  FALSE FALSE    FALSE
[3,]          FALSE     FALSE   FALSE  FALSE   FALSE  FALSE FALSE    FALSE
[4,]          FALSE     FALSE   FALSE  FALSE   FALSE  FALSE FALSE    FALSE
[5,]          FALSE     FALSE   FALSE  FALSE   FALSE  FALSE FALSE    FALSE
[6,]          FALSE     FALSE   FALSE  FALSE   FALSE  FALSE FALSE    FALSE
     distance  hour minute time_hour
[1,]    FALSE FALSE  FALSE     FALSE
[2,]    FALSE FALSE  FALSE     FALSE
[3,]    FALSE FALSE  FALSE     FALSE
[4,]    FALSE FALSE  FALSE     FALSE
[5,]    FALSE FALSE  FALSE     FALSE
[6,]    FALSE FALSE  FALSE     FALSE
# A tibble: 1 × 1
      x
  <dbl>
1     3
# A tibble: 2 × 1
      x
  <dbl>
1    NA
2     3

6.2.4 %in%

a %in% c(b, c, d, e, f) checks whether a is any of the values b, c, d, e, or f. Works for text too!

```{r}
filter(flights, month == 9 | month == 10 | month == 11 | month == 12) # the same as below
filter(flights, month %in% c(9, 10, 11, 12)) 
```
# A tibble: 111,866 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    10     1      447            500       -13      614            648
 2  2013    10     1      522            517         5      735            757
 3  2013    10     1      536            545        -9      809            855
 4  2013    10     1      539            545        -6      801            827
 5  2013    10     1      539            545        -6      917            933
 6  2013    10     1      544            550        -6      912            932
 7  2013    10     1      549            600       -11      653            716
 8  2013    10     1      550            600       -10      648            700
 9  2013    10     1      550            600       -10      649            659
10  2013    10     1      551            600        -9      727            730
# ℹ 111,856 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 111,866 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013    10     1      447            500       -13      614            648
 2  2013    10     1      522            517         5      735            757
 3  2013    10     1      536            545        -9      809            855
 4  2013    10     1      539            545        -6      801            827
 5  2013    10     1      539            545        -6      917            933
 6  2013    10     1      544            550        -6      912            932
 7  2013    10     1      549            600       -11      653            716
 8  2013    10     1      550            600       -10      648            700
 9  2013    10     1      550            600       -10      649            659
10  2013    10     1      551            600        -9      727            730
# ℹ 111,856 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
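Since %in% also works for character values, a minimal sketch (majors is a hypothetical vector of carrier codes, chosen for illustration):

```{r}
# %in% with text: membership in a character vector.
majors <- c("AA", "UA", "DL")
"UA" %in% majors   # TRUE
"WN" %in% majors   # FALSE
```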

6.3 arrange(): arrange rows

Note
  • arrange() sorts the data. That’s it! Give it the column names and it will sort the data by those columns.
  • It’s often a good idea to sort your data before saving it (or looking at it) as it makes it easier to navigate
  • There are also some data manipulation tricks that rely on the position of the data
  • If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
```{r}
flights %>% 
  arrange(year, month, day)

flights %>% 
  arrange(desc(dep_delay)) # arrange rows by descending order of dep_delay
```
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     9      641            900      1301     1242           1530
 2  2013     6    15     1432           1935      1137     1607           2120
 3  2013     1    10     1121           1635      1126     1239           1810
 4  2013     9    20     1139           1845      1014     1457           2210
 5  2013     7    22      845           1600      1005     1044           1815
 6  2013     4    10     1100           1900       960     1342           2211
 7  2013     3    17     2321            810       911      135           1020
 8  2013     6    27      959           1900       899     1236           2226
 9  2013     7    22     2257            759       898      121           1026
10  2013    12     5      756           1700       896     1058           2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
```{r}
df2 <- tibble(x = c(5, 2, NA))
arrange(df2, x) # Missing values are always sorted at the end
arrange(df2, desc(x)) # Missing values are always sorted at the end
```
# A tibble: 3 × 1
      x
  <dbl>
1     2
2     5
3    NA
# A tibble: 3 × 1
      x
  <dbl>
1     5
2     2
3    NA
```{r}
cars <- as_tibble(mtcars)
arrange(cars, mpg)
arrange(cars, desc(mpg))
cars <- arrange(cars, cyl, -mpg) # sort by ascending order of cyl and then descending order of mpg
cars # confirm the operation
```
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4
 2  10.4     8  460    215  3     5.42  17.8     0     0     3     4
 3  13.3     8  350    245  3.73  3.84  15.4     0     0     3     4
 4  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 5  14.7     8  440    230  3.23  5.34  17.4     0     0     3     4
 6  15       8  301    335  3.54  3.57  14.6     0     1     5     8
 7  15.2     8  276.   180  3.07  3.78  18       0     0     3     3
 8  15.2     8  304    150  3.15  3.44  17.3     0     0     3     2
 9  15.5     8  318    150  2.76  3.52  16.9     0     0     3     2
10  15.8     8  351    264  4.22  3.17  14.5     0     1     5     4
# ℹ 22 more rows
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
 2  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
 3  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
 4  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
 5  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
 6  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
 7  24.4     4 147.     62  3.69  3.19  20       1     0     4     2
 8  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
 9  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
10  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
# ℹ 22 more rows
# A tibble: 32 × 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
 2  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
 3  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
 4  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
 5  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
 6  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
 7  24.4     4 147.     62  3.69  3.19  20       1     0     4     2
 8  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
 9  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
10  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
# ℹ 22 more rows

6.4 select(): select columns

  • starts_with()
  • ends_with()
  • contains()
  • any_of()
```{r}
select(cars, mpg, cyl, hp, wt)

flights %>% 
  select(year, month, day, dep_time) %>% 
  select(year:day)

select(flights, year:day)

select(flights, -c(year:day))

select(flights, starts_with("sched"))

select(flights, ends_with("time"))

select(flights, contains("time"))

rename(flights, tail_num = tailnum)

select(flights, time_hour, air_time, everything()) # the everything() helper is useful for moving selected variables to the front of the data frame

vars <- c("year", "month", "day", "dep_delay", "arr_delay")
# any_of(vars)
select(flights, any_of(vars))
```
# A tibble: 32 × 4
     mpg   cyl    hp    wt
   <dbl> <dbl> <dbl> <dbl>
 1  33.9     4    65  1.84
 2  32.4     4    66  2.2 
 3  30.4     4    52  1.62
 4  30.4     4   113  1.51
 5  27.3     4    66  1.94
 6  26       4    91  2.14
 7  24.4     4    62  3.19
 8  22.8     4    93  2.32
 9  22.8     4    95  3.15
10  21.5     4    97  2.46
# ℹ 22 more rows
# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ℹ 336,766 more rows
# A tibble: 336,776 × 3
    year month   day
   <int> <int> <int>
 1  2013     1     1
 2  2013     1     1
 3  2013     1     1
 4  2013     1     1
 5  2013     1     1
 6  2013     1     1
 7  2013     1     1
 8  2013     1     1
 9  2013     1     1
10  2013     1     1
# ℹ 336,766 more rows
# A tibble: 336,776 × 16
   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
 1      517            515         2      830            819        11 UA     
 2      533            529         4      850            830        20 UA     
 3      542            540         2      923            850        33 AA     
 4      544            545        -1     1004           1022       -18 B6     
 5      554            600        -6      812            837       -25 DL     
 6      554            558        -4      740            728        12 UA     
 7      555            600        -5      913            854        19 B6     
 8      557            600        -3      709            723       -14 EV     
 9      557            600        -3      838            846        -8 B6     
10      558            600        -2      753            745         8 AA     
# ℹ 336,766 more rows
# ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 336,776 × 2
   sched_dep_time sched_arr_time
            <int>          <int>
 1            515            819
 2            529            830
 3            540            850
 4            545           1022
 5            600            837
 6            558            728
 7            600            854
 8            600            723
 9            600            846
10            600            745
# ℹ 336,766 more rows
# A tibble: 336,776 × 5
   dep_time sched_dep_time arr_time sched_arr_time air_time
      <int>          <int>    <int>          <int>    <dbl>
 1      517            515      830            819      227
 2      533            529      850            830      227
 3      542            540      923            850      160
 4      544            545     1004           1022      183
 5      554            600      812            837      116
 6      554            558      740            728      150
 7      555            600      913            854      158
 8      557            600      709            723       53
 9      557            600      838            846      140
10      558            600      753            745      138
# ℹ 336,766 more rows
# A tibble: 336,776 × 6
   dep_time sched_dep_time arr_time sched_arr_time air_time time_hour          
      <int>          <int>    <int>          <int>    <dbl> <dttm>             
 1      517            515      830            819      227 2013-01-01 05:00:00
 2      533            529      850            830      227 2013-01-01 05:00:00
 3      542            540      923            850      160 2013-01-01 05:00:00
 4      544            545     1004           1022      183 2013-01-01 05:00:00
 5      554            600      812            837      116 2013-01-01 06:00:00
 6      554            558      740            728      150 2013-01-01 05:00:00
 7      555            600      913            854      158 2013-01-01 06:00:00
 8      557            600      709            723       53 2013-01-01 06:00:00
 9      557            600      838            846      140 2013-01-01 06:00:00
10      558            600      753            745      138 2013-01-01 06:00:00
# ℹ 336,766 more rows
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tail_num <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 336,776 × 19
   time_hour           air_time  year month   day dep_time sched_dep_time
   <dttm>                 <dbl> <int> <int> <int>    <int>          <int>
 1 2013-01-01 05:00:00      227  2013     1     1      517            515
 2 2013-01-01 05:00:00      227  2013     1     1      533            529
 3 2013-01-01 05:00:00      160  2013     1     1      542            540
 4 2013-01-01 05:00:00      183  2013     1     1      544            545
 5 2013-01-01 06:00:00      116  2013     1     1      554            600
 6 2013-01-01 05:00:00      150  2013     1     1      554            558
 7 2013-01-01 06:00:00      158  2013     1     1      555            600
 8 2013-01-01 06:00:00       53  2013     1     1      557            600
 9 2013-01-01 06:00:00      140  2013     1     1      557            600
10 2013-01-01 06:00:00      138  2013     1     1      558            600
# ℹ 336,766 more rows
# ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, sched_arr_time <int>,
#   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
#   dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>
# A tibble: 336,776 × 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# ℹ 336,766 more rows

6.5 mutate(): Add new variables

  • Creates new variables and adds them to the data set.
  • The new columns are not saved until the result is assigned to a name.
  • You can create multiple variables in a single mutate() call, separated by commas (,).
```{r}
mutate(cars, powerToWeight = hp/wt)

flights_sml <- select(flights, 
                      year:day, 
                      ends_with("delay"), 
                      distance, 
                      air_time
                     )
flights_sml

mutate(flights_sml,
        gain = dep_delay - arr_delay,
        speed = distance / air_time * 60 # speed in miles per hour
)

flights_sml <- flights_sml |> 
  mutate(gain = dep_delay - arr_delay,
         speed = distance / air_time * 60)

flights_sml
```
# A tibble: 32 × 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
 2  32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
 3  30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
 4  30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
 5  27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
 6  26       4 120.     91  4.43  2.14  16.7     0     1     5     2
 7  24.4     4 147.     62  3.69  3.19  20       1     0     4     2
 8  22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
 9  22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
10  21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
# ℹ 22 more rows
# ℹ 1 more variable: powerToWeight <dbl>
# A tibble: 336,776 × 7
    year month   day dep_delay arr_delay distance air_time
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
 1  2013     1     1         2        11     1400      227
 2  2013     1     1         4        20     1416      227
 3  2013     1     1         2        33     1089      160
 4  2013     1     1        -1       -18     1576      183
 5  2013     1     1        -6       -25      762      116
 6  2013     1     1        -4        12      719      150
 7  2013     1     1        -5        19     1065      158
 8  2013     1     1        -3       -14      229       53
 9  2013     1     1        -3        -8      944      140
10  2013     1     1        -2         8      733      138
# ℹ 336,766 more rows
# A tibble: 336,776 × 9
    year month   day dep_delay arr_delay distance air_time  gain speed
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
 1  2013     1     1         2        11     1400      227    -9  370.
 2  2013     1     1         4        20     1416      227   -16  374.
 3  2013     1     1         2        33     1089      160   -31  408.
 4  2013     1     1        -1       -18     1576      183    17  517.
 5  2013     1     1        -6       -25      762      116    19  394.
 6  2013     1     1        -4        12      719      150   -16  288.
 7  2013     1     1        -5        19     1065      158   -24  404.
 8  2013     1     1        -3       -14      229       53    11  259.
 9  2013     1     1        -3        -8      944      140     5  405.
10  2013     1     1        -2         8      733      138   -10  319.
# ℹ 336,766 more rows
# A tibble: 336,776 × 9
    year month   day dep_delay arr_delay distance air_time  gain speed
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
 1  2013     1     1         2        11     1400      227    -9  370.
 2  2013     1     1         4        20     1416      227   -16  374.
 3  2013     1     1         2        33     1089      160   -31  408.
 4  2013     1     1        -1       -18     1576      183    17  517.
 5  2013     1     1        -6       -25      762      116    19  394.
 6  2013     1     1        -4        12      719      150   -16  288.
 7  2013     1     1        -5        19     1065      158   -24  404.
 8  2013     1     1        -3       -14      229       53    11  259.
 9  2013     1     1        -3        -8      944      140     5  405.
10  2013     1     1        -2         8      733      138   -10  319.
# ℹ 336,766 more rows
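By default, mutate() appends new columns at the far right. In dplyr 1.0 and later, the .before and .after arguments control where they land; a small sketch using the same flights columns as above:

```{r}
# Place the new column right after `day` instead of at the end
# (requires dplyr >= 1.0).
flights |> 
  mutate(gain = dep_delay - arr_delay, .after = day)
```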

6.5.1 transmute()

  • Adds new variables but drops all existing variables.
```{r}
transmute(flights_sml,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
) -> flights_sml_cal

flights_sml_cal
```
# A tibble: 336,776 × 3
    gain hours gain_per_hour
   <dbl> <dbl>         <dbl>
 1    -9 3.78          -2.38
 2   -16 3.78          -4.23
 3   -31 2.67         -11.6 
 4    17 3.05           5.57
 5    19 1.93           9.83
 6   -16 2.5           -6.4 
 7   -24 2.63          -9.11
 8    11 0.883         12.5 
 9     5 2.33           2.14
10   -10 2.3           -4.35
# ℹ 336,766 more rows
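As of dplyr 1.1, transmute() is superseded; mutate(.keep = "none") does the same job. A sketch of the equivalent call:

```{r}
# Equivalent to the transmute() call above: keep only the new variables
# (grouping variables, if any, are also retained).
flights_sml |> 
  mutate(gain = dep_delay - arr_delay,
         hours = air_time / 60,
         gain_per_hour = gain / hours,
         .keep = "none")
```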

6.5.2 %/% vs. %%

  • %/% is the integer division operator.
    • Example: 5 %/% 3 equals 1, the quotient from the integer division of 5 by 3.
  • %% is the modulo operator, which gives the remainder after integer division.
    • In the same example, 5 %% 3 equals 2, the remainder.
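The arithmetic above can be checked directly at the console, including the HHMM split used in the next chunk:

```{r}
5 %/% 3      # 1: integer quotient
5 %% 3       # 2: remainder
517 %/% 100  # 5: the hour part of an HHMM departure time
517 %% 100   # 17: the minute part
```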
```{r}
transmute(flights,
  dep_time, # Actual departure times (format HHMM or HMM), local tz.
  hour = dep_time %/% 100, # quotient after division
  minute = dep_time %% 100) # remainder
```
# A tibble: 336,776 × 3
   dep_time  hour minute
      <int> <dbl>  <dbl>
 1      517     5     17
 2      533     5     33
 3      542     5     42
 4      544     5     44
 5      554     5     54
 6      554     5     54
 7      555     5     55
 8      557     5     57
 9      557     5     57
10      558     5     58
# ℹ 336,766 more rows

6.6 summarise(): grouped summaries

  • Collapses the data to one row per group rather than adding a variable to the data set.
  • Typically used with group_by().
```{r}
summarise(cars, avgMPG = mean(mpg)) #not that useful

by_cyl <- group_by(cars, cyl) #group by cylinders

summarise(by_cyl, avgMPG = mean(mpg)) #summarise function now gives output by cylinders

cars |> 
  group_by(cyl) |> 
  summarise(avgMPG = mean(mpg))
```
# A tibble: 1 × 1
  avgMPG
   <dbl>
1   20.1
# A tibble: 3 × 2
    cyl avgMPG
  <dbl>  <dbl>
1     4   26.7
2     6   19.7
3     8   15.1
# A tibble: 3 × 2
    cyl avgMPG
  <dbl>  <dbl>
1     4   26.7
2     6   19.7
3     8   15.1

7 group_by() and summarize() with %>%

  • group_by() turns the dataset into a grouped data set, split by each combination of the grouping variables

  • Calculations like mutate(), summarize(), or (if you want to get fancy) group_map() then process the data separately for each group

  • It will maintain this grouping structure until you re-group_by() it, ungroup() it, or summarize() it (which removes the last grouping variable)

  • How is this useful in preparing data? Remember, we want to look at where information is and think about how we can get it where we need it to be

  • group_by() helps us move information from one row to another in a key variable - otherwise a difficult move!

  • It can also let us change the observation level with summarize()

  • Tip: n() gives the number of rows in the group - handy! And row_number() gives each observation's row number within its group
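A minimal sketch of that persistence, using mtcars: group_vars() reports the active grouping, and ungroup() clears it.

```{r}
grouped <- as_tibble(mtcars) |> 
  group_by(cyl) |> 
  mutate(mpg_centered = mpg - mean(mpg)) # mean(mpg) is computed per cyl group

group_vars(grouped)                   # "cyl": the grouping survives mutate()
grouped |> ungroup() |> group_vars()  # character(0): grouping removed
```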

7.1 Example 1

```{r 4.6.1 video}
cars |> 
  group_by(cyl) |> 
  summarise(avgMPG = mean(mpg))

cars %>% 
  group_by(cyl) %>% 
  summarise(avgMPG = mean(mpg),
            n = n()) %>% 
  mutate(id = row_number()) %>%  # Add unique row number to a data frame
  relocate(id) # move a column to the first column in the data frame.
```
# A tibble: 3 × 2
    cyl avgMPG
  <dbl>  <dbl>
1     4   26.7
2     6   19.7
3     8   15.1
# A tibble: 3 × 4
     id   cyl avgMPG     n
  <int> <dbl>  <dbl> <int>
1     1     4   26.7    11
2     2     6   19.7     7
3     3     8   15.1    14

7.2 Example 2

```{r 4.6.1. Extra}
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

flights %>% 
  group_by(year, month, day) %>% 
  summarise(delay = mean(dep_delay, na.rm = T),
            n = n()
            ) %>% 
  mutate(row_id = row_number()) %>% 
  relocate(row_id)
```
# A tibble: 1 × 1
  delay
  <dbl>
1  12.6
# A tibble: 365 × 6
# Groups:   year, month [12]
   row_id  year month   day delay     n
    <int> <int> <int> <int> <dbl> <int>
 1      1  2013     1     1 11.5    842
 2      2  2013     1     2 13.9    943
 3      3  2013     1     3 11.0    914
 4      4  2013     1     4  8.95   915
 5      5  2013     1     5  5.73   720
 6      6  2013     1     6  7.15   832
 7      7  2013     1     7  5.42   933
 8      8  2013     1     8  2.55   899
 9      9  2013     1     9  2.28   902
10     10  2013     1    10  2.84   932
# ℹ 355 more rows

8 Grouped mutates and filters

  • Grouped mutates and filters (combine what we did before)

8.1 Example 1

```{r 4.7.1 video}
as_tibble(mtcars) %>%
  select(mpg, cyl, hp, wt) %>% #select only these cols
  mutate(ptw = hp/wt) %>%       #calculate power to weight ratio
  filter(mpg > 18, ptw > 40)            #filter by power to weight and mpg

# The same operations without piping
# This approach requires saving each intermediate result before the next step, which is more verbose.
narrow_cars <- select(as_tibble(mtcars), mpg, cyl, hp, wt)
narrow_cars <- mutate(narrow_cars, ptw=hp/wt)
narrow_cars <- filter(narrow_cars, mpg>18, ptw > 40)  
narrow_cars
```
# A tibble: 7 × 5
    mpg   cyl    hp    wt   ptw
  <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   110  2.62  42.0
2  22.8     4    93  2.32  40.1
3  18.7     8   175  3.44  50.9
4  19.2     8   175  3.84  45.5
5  26       4    91  2.14  42.5
6  30.4     4   113  1.51  74.7
7  19.7     6   175  2.77  63.2
# A tibble: 7 × 5
    mpg   cyl    hp    wt   ptw
  <dbl> <dbl> <dbl> <dbl> <dbl>
1  21       6   110  2.62  42.0
2  22.8     4    93  2.32  40.1
3  18.7     8   175  3.44  50.9
4  19.2     8   175  3.84  45.5
5  26       4    91  2.14  42.5
6  30.4     4   113  1.51  74.7
7  19.7     6   175  2.77  63.2

8.2 Example 2

```{r 4.7.1 Extra}
# Find the worst members of each group:
flights_sml %>% 
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)

# Find all groups bigger than a threshold:
popular_dests <- flights %>% 
  group_by(dest) %>% 
  filter(n() > 365) 

popular_dests

# Standardize to compute per group metrics:
popular_dests %>% 
  filter(arr_delay > 0) %>% 
  mutate(arr_delay_sum = sum(arr_delay),
         prop_delay = arr_delay / sum(arr_delay),
         n = n()) %>% 
  select(year:day, dest, arr_delay, arr_delay_sum, prop_delay, n)
```
# A tibble: 3,306 × 9
# Groups:   year, month, day [365]
    year month   day dep_delay arr_delay distance air_time  gain speed
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
 1  2013     1     1       853       851      184       41     2  269.
 2  2013     1     1       290       338     1134      213   -48  319.
 3  2013     1     1       260       263      266       46    -3  347.
 4  2013     1     1       157       174      213       60   -17  213 
 5  2013     1     1       216       222      708      121    -6  351.
 6  2013     1     1       255       250      589      115     5  307.
 7  2013     1     1       285       246     1085      146    39  446.
 8  2013     1     1       192       191      199       44     1  271.
 9  2013     1     1       379       456     1092      222   -77  295.
10  2013     1     2       224       207      550       94    17  351.
# ℹ 3,296 more rows
# A tibble: 332,577 × 19
# Groups:   dest [77]
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 332,567 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 131,106 × 8
# Groups:   dest [77]
    year month   day dest  arr_delay arr_delay_sum prop_delay     n
   <int> <int> <int> <chr>     <dbl>         <dbl>      <dbl> <int>
 1  2013     1     1 IAH          11         99391  0.000111   2883
 2  2013     1     1 IAH          20         99391  0.000201   2883
 3  2013     1     1 MIA          33        140424  0.000235   3855
 4  2013     1     1 ORD          12        283046  0.0000424  6198
 5  2013     1     1 FLL          19        202605  0.0000938  5212
 6  2013     1     1 ORD           8        283046  0.0000283  6198
 7  2013     1     1 LAX           7        203226  0.0000344  5967
 8  2013     1     1 DFW          31        110009  0.000282   2876
 9  2013     1     1 ATL          12        300299  0.0000400  7946
10  2013     1     1 DTW          16        138258  0.000116   3257
# ℹ 131,096 more rows
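How the grouped filter above selects the "worst" rows: desc() flips the sign, so rank(desc(x)) assigns rank 1 to the largest value, and rank(desc(arr_delay)) < 10 keeps roughly the nine most-delayed flights per day (ties can shift the count).

```{r}
rank(desc(c(10, 30, 20)))  # 3 1 2: 30 ranks first, 10 ranks last
```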

9 Demonstration of data wrangling and visualization with piping

9.1 Example 1

  • Wrangling and Visualization
```{r wrangling & Viz}
as_tibble(mtcars) %>%
  select(mpg, cyl, hp, wt) %>% #select only these cols
  mutate(ptw = hp/wt) %>%       #calculate power to weight ratio
  filter(mpg > 10, ptw > 35) %>%   #filter by power to weight and mpg
  group_by(cyl) %>% 
  summarise(ptw_mean = mean(ptw)) %>% # create mean of ptw by cylinder type
  ggplot(aes(x = factor(cyl), y = ptw_mean))+ 
  geom_col(aes(fill = factor(cyl)), show.legend = FALSE) # map color to cylinder
```

9.2 Example 2

```{r}
# data
flights

# wrangling data
flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = T),
    delay = mean(arr_delay, na.rm = T)
  ) %>% 
  filter(count > 20, dest != "HNL")
```
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 96 × 4
   dest  count  dist delay
   <chr> <int> <dbl> <dbl>
 1 ABQ     254 1826   4.38
 2 ACK     265  199   4.85
 3 ALB     439  143  14.4 
 4 ATL   17215  757. 11.3 
 5 AUS    2439 1514.  6.02
 6 AVL     275  584.  8.00
 7 BDL     443  116   7.05
 8 BGR     375  378   8.03
 9 BHM     297  866. 16.9 
10 BNA    6333  758. 11.8 
# ℹ 86 more rows

9.3 Example 3

  • Adding visualization to the wrangled data
```{r}
flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = T),
    delay = mean(arr_delay, na.rm = T)
  ) %>% 
  filter(count > 20, dest != "HNL") %>% 
  ggplot(aes(dist, delay, size = count)) +
  geom_point(alpha = 1/3) +
  geom_smooth(method = lm, se = FALSE, show.legend = FALSE) +
  scale_size_continuous(labels = scales::comma) +
  labs(x = "Distance in Miles Flown",
       y = "Airplane Arrival Delay in Minutes",
       title = "Airline Arrival Delay vs. Distance",
       subtitle = "Airplanes that fly longer distances tend to delay less")
```

Insights from the Data

The farther the distance, the less the delay.

10 Missing values

  • Why are the two results different?
```{r}
flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean_delay = mean(dep_delay),
            n = n())

flights %>% 
  group_by(year, month, day) %>% 
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
            n = n())
```
# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day mean_delay     n
   <int> <int> <int>      <dbl> <int>
 1  2013     1     1         NA   842
 2  2013     1     2         NA   943
 3  2013     1     3         NA   914
 4  2013     1     4         NA   915
 5  2013     1     5         NA   720
 6  2013     1     6         NA   832
 7  2013     1     7         NA   933
 8  2013     1     8         NA   899
 9  2013     1     9         NA   902
10  2013     1    10         NA   932
# ℹ 355 more rows
# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day mean_delay     n
   <int> <int> <int>      <dbl> <int>
 1  2013     1     1      11.5    842
 2  2013     1     2      13.9    943
 3  2013     1     3      11.0    914
 4  2013     1     4       8.95   915
 5  2013     1     5       5.73   720
 6  2013     1     6       7.15   832
 7  2013     1     7       5.42   933
 8  2013     1     8       2.55   899
 9  2013     1     9       2.28   902
10  2013     1    10       2.84   932
# ℹ 355 more rows
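The difference comes from how mean() handles missing values: a single NA makes the whole result NA unless na.rm = TRUE drops it first.

```{r}
mean(c(1, 2, NA))                # NA: the missing value propagates
mean(c(1, 2, NA), na.rm = TRUE)  # 1.5: NA removed before averaging
```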
  • Mean values after the NAs are removed.
```{r}
not_cancelled <- flights %>% 
  filter(!is.na(dep_delay) & !is.na(arr_delay)) 

flights %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(mean_delay = mean(dep_delay))
```
# A tibble: 327,346 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 327,336 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 365 × 4
# Groups:   year, month [12]
    year month   day mean_delay
   <int> <int> <int>      <dbl>
 1  2013     1     1      11.4 
 2  2013     1     2      13.7 
 3  2013     1     3      10.9 
 4  2013     1     4       8.97
 5  2013     1     5       5.73
 6  2013     1     6       7.15
 7  2013     1     7       5.42
 8  2013     1     8       2.56
 9  2013     1     9       2.30
10  2013     1    10       2.84
# ℹ 355 more rows

11 Counts

  • Whenever you do any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))).
```{r}
delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(delay = mean(arr_delay),
            count = n()
  )

delays |> 
  arrange(desc(delay))

# there are some planes that have an average delay of 5 hours (300 minutes)!
ggplot(data = delays, mapping = aes(x = delay)) + 
  geom_freqpoly(binwidth = 10) 

# Get n always
delays <- not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )
delays

## the variation decreases as the sample size increases.
ggplot(data = delays, mapping = aes(x = delay, y = n)) + 
  geom_point(alpha = 1/10) 

# Filtering out means with small sample size
delays %>% 
  filter(n > 25) %>% 
  ggplot(mapping = aes(x = delay, y = n)) + 
    geom_point(alpha = 1/10)
```
# A tibble: 4,037 × 3
   tailnum delay count
   <chr>   <dbl> <int>
 1 N844MH   320      1
 2 N911DA   294      1
 3 N922EV   276      1
 4 N587NW   264      1
 5 N851NW   219      1
 6 N928DN   201      1
 7 N7715E   188      1
 8 N654UA   185      1
 9 N665MQ   175.     6
10 N427SW   157      1
# ℹ 4,027 more rows
# A tibble: 4,037 × 3
   tailnum  delay     n
   <chr>    <dbl> <int>
 1 D942DN  31.5       4
 2 N0EGMQ   9.98    352
 3 N10156  12.7     145
 4 N102UW   2.94     48
 5 N103US  -6.93     46
 6 N104UW   1.80     46
 7 N10575  20.7     269
 8 N105UW  -0.267    45
 9 N107US  -5.73     41
10 N108UW  -1.25     60
# ℹ 4,027 more rows

12 Useful summary functions

12.1 logical subsetting

  • Median: a value where 50% of x is above it and 50% is below it.
  • Logical subsetting
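The subsetting idiom used below works on any vector: x[x > 0] keeps only the elements where the condition is TRUE.

```{r}
x <- c(-2, -1, 1, 3)
mean(x)         # 0.25: average of all four values
mean(x[x > 0])  # 2: average of the positive values only
```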
```{r}
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    avg_delay1 = mean(arr_delay),
    avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
           )
```
# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day avg_delay1 avg_delay2
   <int> <int> <int>      <dbl>      <dbl>
 1  2013     1     1     12.7         32.5
 2  2013     1     2     12.7         32.0
 3  2013     1     3      5.73        27.7
 4  2013     1     4     -1.93        28.3
 5  2013     1     5     -1.53        22.6
 6  2013     1     6      4.24        24.4
 7  2013     1     7     -4.95        27.8
 8  2013     1     8     -3.23        20.8
 9  2013     1     9     -0.264       25.6
10  2013     1    10     -5.90        27.3
# ℹ 355 more rows

12.2 Measures of spread: sd(x), IQR(x), mad(x)

  • Measures of spread: sd(x), IQR(x), mad(x).
```{r}
# Why is distance to some destinations more variable than to others?
not_cancelled %>% 
  group_by(dest) %>% 
  summarise(distance_sd = sd(distance)) %>% 
  arrange(desc(distance_sd))
```
# A tibble: 104 × 2
   dest  distance_sd
   <chr>       <dbl>
 1 EGE         10.5 
 2 SAN         10.4 
 3 SFO         10.2 
 4 HNL         10.0 
 5 SEA          9.98
 6 LAS          9.91
 7 PDX          9.87
 8 PHX          9.86
 9 LAX          9.66
10 IND          9.46
# ℹ 94 more rows
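A quick comparison on a toy vector: sd() is pulled strongly by an outlier, while IQR() and mad() are robust to it.

```{r}
x <- c(1, 2, 3, 4, 100)
sd(x)   # about 43.6, dominated by the outlier
IQR(x)  # 2: the interquartile range ignores the tails
mad(x)  # 1.4826: median absolute deviation, also robust
```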

12.3 Measures of rank: min(), quantile(), max()

  • Measures of rank: min(x), quantile(x, 0.25), max(x).
```{r}
# When do the first and last flights leave each day?
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first = min(dep_time),
    last = max(dep_time)
  )
```
# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day first  last
   <int> <int> <int> <int> <int>
 1  2013     1     1   517  2356
 2  2013     1     2    42  2354
 3  2013     1     3    32  2349
 4  2013     1     4    25  2358
 5  2013     1     5    14  2357
 6  2013     1     6    16  2355
 7  2013     1     7    49  2359
 8  2013     1     8   454  2351
 9  2013     1     9     2  2252
10  2013     1    10     3  2320
# ℹ 355 more rows
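quantile() generalizes min() and max(): quantile(x, 0.25) is the value below which 25% of x falls, and min and max are the 0 and 1 quantiles.

```{r}
quantile(1:10, 0.25)          # 3.25
quantile(1:10, c(0, 0.5, 1))  # 1, 5.5, 10: min, median, max
```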

12.4 Measures of position: first(), nth(), last()

  • Measures of position: first(x), nth(x, 2), last(x).
  • These work similarly to x[1], x[2], and x[length(x)] but let you set a default value for when that position does not exist (e.g., requesting the 3rd element from a group that has only two).
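A small illustration of the default argument on a plain vector:

```{r}
x <- c(517, 533)
first(x)                # 517
nth(x, 3)               # NA: there is no third element
nth(x, 3, default = 0)  # 0: the supplied default instead of NA
last(x)                 # 533
```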
```{r}
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(
    first_dep = first(dep_time), 
    last_dep = last(dep_time),
    `2nd_dep` = nth(dep_time, 2)
  )
```
# A tibble: 365 × 6
# Groups:   year, month [12]
    year month   day first_dep last_dep `2nd_dep`
   <int> <int> <int>     <int>    <int>     <int>
 1  2013     1     1       517     2356       533
 2  2013     1     2        42     2354       126
 3  2013     1     3        32     2349        50
 4  2013     1     4        25     2358       106
 5  2013     1     5        14     2357        37
 6  2013     1     6        16     2355       458
 7  2013     1     7        49     2359       454
 8  2013     1     8       454     2351       524
 9  2013     1     9         2     2252         8
10  2013     1    10         3     2320        16
# ℹ 355 more rows

12.5 sum(!is.na(x)) / n_distinct(x)

  • To count the number of non-missing values, use sum(!is.na(x)).
  • To count the number of distinct (unique) values, use n_distinct(x).
```{r}
# Which destinations have the most carriers?
not_cancelled %>% 
  group_by(dest) %>% 
  summarise(carriers = n_distinct(carrier)) %>% 
  arrange(desc(carriers))

glimpse(not_cancelled)
skimr::skim(not_cancelled)

# Get n always
not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = n()
  )

# n() is the same as sum(!is.na(x)) when there is no missing data for x, which is arr_delay here.
 not_cancelled %>% 
  group_by(tailnum) %>% 
  summarise(
    delay = mean(arr_delay, na.rm = TRUE),
    n = sum(!is.na(arr_delay))
  )
```
# A tibble: 104 × 2
   dest  carriers
   <chr>    <int>
 1 ATL          7
 2 BOS          7
 3 CLT          7
 4 ORD          7
 5 TPA          7
 6 AUS          6
 7 DCA          6
 8 DTW          6
 9 IAD          6
10 MSP          6
# ℹ 94 more rows
Rows: 327,346
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
Data summary
Name not_cancelled
Number of rows 327346
Number of columns 19
_______________________
Column type frequency:
character 4
numeric 14
POSIXct 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
carrier 0 1 2 2 0 16 0
tailnum 0 1 5 6 0 4037 0
origin 0 1 3 3 0 3 0
dest 0 1 3 3 0 104 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2013.00 0.00 2013 2013 2013 2013 2013 ▁▁▇▁▁
month 0 1 6.56 3.41 1 4 7 10 12 ▇▆▆▆▇
day 0 1 15.74 8.78 1 8 16 23 31 ▇▇▇▇▆
dep_time 0 1 1348.79 488.32 1 907 1400 1744 2400 ▁▇▆▇▃
sched_dep_time 0 1 1340.34 467.41 500 905 1355 1729 2359 ▇▆▇▇▃
dep_delay 0 1 12.56 40.07 -43 -5 -2 11 1301 ▇▁▁▁▁
arr_time 0 1 1501.91 532.89 1 1104 1535 1940 2400 ▁▃▇▇▇
sched_arr_time 0 1 1532.79 497.98 1 1122 1554 1944 2359 ▁▃▇▇▇
arr_delay 0 1 6.90 44.63 -86 -17 -5 14 1272 ▇▁▁▁▁
flight 0 1 1943.10 1621.52 1 544 1467 3412 8500 ▇▃▃▁▁
air_time 0 1 150.69 93.69 20 82 129 192 695 ▇▂▂▁▁
distance 0 1 1048.37 735.91 80 509 888 1389 4983 ▇▃▂▁▁
hour 0 1 13.14 4.66 5 9 13 17 23 ▇▆▆▇▃
minute 0 1 26.23 19.30 0 8 29 44 59 ▇▃▆▃▅

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
time_hour 0 1 2013-01-01 05:00:00 2013-12-31 23:00:00 2013-07-04 09:00:00 6922
# A tibble: 4,037 × 3
   tailnum  delay     n
   <chr>    <dbl> <int>
 1 D942DN  31.5       4
 2 N0EGMQ   9.98    352
 3 N10156  12.7     145
 4 N102UW   2.94     48
 5 N103US  -6.93     46
 6 N104UW   1.80     46
 7 N10575  20.7     269
 8 N105UW  -0.267    45
 9 N107US  -5.73     41
10 N108UW  -1.25     60
# ℹ 4,027 more rows
# A tibble: 4,037 × 3
   tailnum  delay     n
   <chr>    <dbl> <int>
 1 D942DN  31.5       4
 2 N0EGMQ   9.98    352
 3 N10156  12.7     145
 4 N102UW   2.94     48
 5 N103US  -6.93     46
 6 N104UW   1.80     46
 7 N10575  20.7     269
 8 N105UW  -0.267    45
 9 N107US  -5.73     41
10 N108UW  -1.25     60
# ℹ 4,027 more rows

12.6 count() = group_by() followed by summarize(n = n())

  • count() is shorthand for group_by() followed by summarize(n = n())
```{r}
not_cancelled %>% 
  count(dest)

# same as above
not_cancelled %>% 
  group_by(dest) %>% 
  summarize(n = n())

# add percentage
not_cancelled %>% 
  count(dest) |> 
  mutate(percent = n/sum(n))
```
# A tibble: 104 × 2
   dest      n
   <chr> <int>
 1 ABQ     254
 2 ACK     264
 3 ALB     418
 4 ANC       8
 5 ATL   16837
 6 AUS    2411
 7 AVL     261
 8 BDL     412
 9 BGR     358
10 BHM     269
# ℹ 94 more rows
# A tibble: 104 × 2
   dest      n
   <chr> <int>
 1 ABQ     254
 2 ACK     264
 3 ALB     418
 4 ANC       8
 5 ATL   16837
 6 AUS    2411
 7 AVL     261
 8 BDL     412
 9 BGR     358
10 BHM     269
# ℹ 94 more rows
# A tibble: 104 × 3
   dest      n   percent
   <chr> <int>     <dbl>
 1 ABQ     254 0.000776 
 2 ACK     264 0.000806 
 3 ALB     418 0.00128  
 4 ANC       8 0.0000244
 5 ATL   16837 0.0514   
 6 AUS    2411 0.00737  
 7 AVL     261 0.000797 
 8 BDL     412 0.00126  
 9 BGR     358 0.00109  
10 BHM     269 0.000822 
# ℹ 94 more rows
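count() also takes sort = TRUE to order the result by frequency, and a name argument to rename the n column; a sketch on the same data:

```{r}
not_cancelled %>% 
  count(dest, sort = TRUE, name = "flights") # busiest destinations first
```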

12.7 count() with “wt =” argument

  • You can optionally provide a weight variable.
```{r}
# For example, you could use this to "count" (sum) the total number of miles a plane flew:
not_cancelled %>% 
  count(tailnum, wt = distance)

not_cancelled %>% 
  count(tailnum)
```
# A tibble: 4,037 × 2
   tailnum      n
   <chr>    <dbl>
 1 D942DN    3418
 2 N0EGMQ  239143
 3 N10156  109664
 4 N102UW   25722
 5 N103US   24619
 6 N104UW   24616
 7 N10575  139903
 8 N105UW   23618
 9 N107US   21677
10 N108UW   32070
# ℹ 4,027 more rows
# A tibble: 4,037 × 2
   tailnum     n
   <chr>   <int>
 1 D942DN      4
 2 N0EGMQ    352
 3 N10156    145
 4 N102UW     48
 5 N103US     46
 6 N104UW     46
 7 N10575    269
 8 N105UW     45
 9 N107US     41
10 N108UW     60
# ℹ 4,027 more rows

12.8 sum(x) of TRUE / mean(x) of TRUE

  • sum(x) on a logical vector: the number of TRUEs in x
  • mean(x) on a logical vector: the proportion of TRUEs in x
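This works because R coerces logicals to numbers: TRUE counts as 1 and FALSE as 0.

```{r}
x <- c(TRUE, FALSE, TRUE, TRUE)
sum(x)   # 3: number of TRUEs
mean(x)  # 0.75: proportion of TRUEs
```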
```{r}
# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(n_early = sum(dep_time < 500),
            n = n())

# What proportion of flights are delayed by more than an hour?
not_cancelled %>% 
  group_by(year, month, day) %>% 
  summarise(hour_prop = mean(arr_delay > 60),
            n = n())
```
# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day n_early     n
   <int> <int> <int>   <int> <int>
 1  2013     1     1       0   831
 2  2013     1     2       3   928
 3  2013     1     3       4   900
 4  2013     1     4       3   908
 5  2013     1     5       3   717
 6  2013     1     6       2   829
 7  2013     1     7       2   930
 8  2013     1     8       1   892
 9  2013     1     9       3   893
10  2013     1    10       3   929
# ℹ 355 more rows
# A tibble: 365 × 5
# Groups:   year, month [12]
    year month   day hour_prop     n
   <int> <int> <int>     <dbl> <int>
 1  2013     1     1    0.0722   831
 2  2013     1     2    0.0851   928
 3  2013     1     3    0.0567   900
 4  2013     1     4    0.0396   908
 5  2013     1     5    0.0349   717
 6  2013     1     6    0.0470   829
 7  2013     1     7    0.0333   930
 8  2013     1     8    0.0213   892
 9  2013     1     9    0.0202   893
10  2013     1    10    0.0183   929
# ℹ 355 more rows
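The logical-vector arithmetic behind these summaries, in isolation (the numbers below are made up for illustration):

```{r}
x <- c(2, 7, 1, 9, 4)

# A comparison produces a logical vector; arithmetic treats TRUE as 1, FALSE as 0.
late <- x > 5    # FALSE TRUE FALSE TRUE FALSE
sum(late)        # number of TRUEs -> 2
mean(late)       # proportion of TRUEs -> 0.4
```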

12.9 Grouping by multiple variables

```{r}
daily <- group_by(flights, year, month, day)
(per_day   <- summarise(daily, flights = n()))

# sum of daily total flights over the month
(per_month <- summarise(per_day, flights = sum(flights))) #same as below

group_by(flights, year, month) %>% 
  summarise(n = n())

# sum of monthly total flights over the year.
(per_year  <- summarise(per_month, flights = sum(flights))) # same as below

group_by(flights, year) %>% 
  summarise(n = n())
```
# A tibble: 365 × 4
# Groups:   year, month [12]
    year month   day flights
   <int> <int> <int>   <int>
 1  2013     1     1     842
 2  2013     1     2     943
 3  2013     1     3     914
 4  2013     1     4     915
 5  2013     1     5     720
 6  2013     1     6     832
 7  2013     1     7     933
 8  2013     1     8     899
 9  2013     1     9     902
10  2013     1    10     932
# ℹ 355 more rows
# A tibble: 12 × 3
# Groups:   year [1]
    year month flights
   <int> <int>   <int>
 1  2013     1   27004
 2  2013     2   24951
 3  2013     3   28834
 4  2013     4   28330
 5  2013     5   28796
 6  2013     6   28243
 7  2013     7   29425
 8  2013     8   29327
 9  2013     9   27574
10  2013    10   28889
11  2013    11   27268
12  2013    12   28135
# A tibble: 12 × 3
# Groups:   year [1]
    year month     n
   <int> <int> <int>
 1  2013     1 27004
 2  2013     2 24951
 3  2013     3 28834
 4  2013     4 28330
 5  2013     5 28796
 6  2013     6 28243
 7  2013     7 29425
 8  2013     8 29327
 9  2013     9 27574
10  2013    10 28889
11  2013    11 27268
12  2013    12 28135
# A tibble: 1 × 2
   year flights
  <int>   <int>
1  2013  336776
# A tibble: 1 × 2
   year      n
  <int>  <int>
1  2013 336776
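The progressive roll-up above works because each summarise() peels off the last grouping variable, leaving the result grouped by the remaining ones. A minimal sketch with a hand-made tibble (the `sales` data and its values are invented for illustration):

```{r}
library(dplyr)

# toy data grouped by two variables
sales <- tibble(
  region  = c("E", "E", "W", "W"),
  quarter = c(1, 2, 1, 2),
  amount  = c(10, 20, 30, 40)
)

# summarise() drops the last grouping variable (quarter),
# so the result is still grouped by region ...
per_quarter <- sales %>%
  group_by(region, quarter) %>%
  summarise(amount = sum(amount))
group_vars(per_quarter)   # "region"

# ... which is why a second summarise() rolls up to the region level
per_quarter %>% summarise(amount = sum(amount))
# E -> 30, W -> 70
```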

12.9.1 Ungrouping

```{r}
daily %>% 
  ungroup() %>%             # no longer grouped by date
  summarise(flights = n())  # all flights
```
# A tibble: 1 × 1
  flights
    <int>
1  336776

12.10 case_when

  • A function that comes in handy when using mutate() to create a categorical variable is case_when(), which is like ifelse() except it cleanly handles more than one condition.

  • Provide case_when() with a series of condition ~ value pairs, separated by commas, and for each observation it evaluates the conditions one by one until it finds one that is TRUE.

  • As soon as a condition matches, it stops looking, so each later condition only applies to observations not already caught by an earlier one. You can make the last condition TRUE to supply a default value for anything not yet matched.

```{r}
state.x77
state <- as_tibble(state.x77)
skimr::skim(state)

state %>% 
  mutate(pop_bracket = case_when(
    Population >= quantile(Population, 3/4) ~ ">=75th percentile",
    Population >= quantile(Population, 2/4) ~ "50th - 75th percentile",
    Population >= quantile(Population, 1/4) ~ "25th - 50th percentile",
    TRUE                                    ~ "< 25th percentile")
    )
```
               Population Income Illiteracy Life Exp Murder HS Grad Frost
Alabama              3615   3624        2.1    69.05   15.1    41.3    20
Alaska                365   6315        1.5    69.31   11.3    66.7   152
Arizona              2212   4530        1.8    70.55    7.8    58.1    15
Arkansas             2110   3378        1.9    70.66   10.1    39.9    65
California          21198   5114        1.1    71.71   10.3    62.6    20
Colorado             2541   4884        0.7    72.06    6.8    63.9   166
Connecticut          3100   5348        1.1    72.48    3.1    56.0   139
Delaware              579   4809        0.9    70.06    6.2    54.6   103
Florida              8277   4815        1.3    70.66   10.7    52.6    11
Georgia              4931   4091        2.0    68.54   13.9    40.6    60
Hawaii                868   4963        1.9    73.60    6.2    61.9     0
Idaho                 813   4119        0.6    71.87    5.3    59.5   126
Illinois            11197   5107        0.9    70.14   10.3    52.6   127
Indiana              5313   4458        0.7    70.88    7.1    52.9   122
Iowa                 2861   4628        0.5    72.56    2.3    59.0   140
Kansas               2280   4669        0.6    72.58    4.5    59.9   114
Kentucky             3387   3712        1.6    70.10   10.6    38.5    95
Louisiana            3806   3545        2.8    68.76   13.2    42.2    12
Maine                1058   3694        0.7    70.39    2.7    54.7   161
Maryland             4122   5299        0.9    70.22    8.5    52.3   101
Massachusetts        5814   4755        1.1    71.83    3.3    58.5   103
Michigan             9111   4751        0.9    70.63   11.1    52.8   125
Minnesota            3921   4675        0.6    72.96    2.3    57.6   160
Mississippi          2341   3098        2.4    68.09   12.5    41.0    50
Missouri             4767   4254        0.8    70.69    9.3    48.8   108
Montana               746   4347        0.6    70.56    5.0    59.2   155
Nebraska             1544   4508        0.6    72.60    2.9    59.3   139
Nevada                590   5149        0.5    69.03   11.5    65.2   188
New Hampshire         812   4281        0.7    71.23    3.3    57.6   174
New Jersey           7333   5237        1.1    70.93    5.2    52.5   115
New Mexico           1144   3601        2.2    70.32    9.7    55.2   120
New York            18076   4903        1.4    70.55   10.9    52.7    82
North Carolina       5441   3875        1.8    69.21   11.1    38.5    80
North Dakota          637   5087        0.8    72.78    1.4    50.3   186
Ohio                10735   4561        0.8    70.82    7.4    53.2   124
Oklahoma             2715   3983        1.1    71.42    6.4    51.6    82
Oregon               2284   4660        0.6    72.13    4.2    60.0    44
Pennsylvania        11860   4449        1.0    70.43    6.1    50.2   126
Rhode Island          931   4558        1.3    71.90    2.4    46.4   127
South Carolina       2816   3635        2.3    67.96   11.6    37.8    65
South Dakota          681   4167        0.5    72.08    1.7    53.3   172
Tennessee            4173   3821        1.7    70.11   11.0    41.8    70
Texas               12237   4188        2.2    70.90   12.2    47.4    35
Utah                 1203   4022        0.6    72.90    4.5    67.3   137
Vermont               472   3907        0.6    71.64    5.5    57.1   168
Virginia             4981   4701        1.4    70.08    9.5    47.8    85
Washington           3559   4864        0.6    71.72    4.3    63.5    32
West Virginia        1799   3617        1.4    69.48    6.7    41.6   100
Wisconsin            4589   4468        0.7    72.48    3.0    54.5   149
Wyoming               376   4566        0.6    70.29    6.9    62.9   173
                 Area
Alabama         50708
Alaska         566432
Arizona        113417
Arkansas        51945
California     156361
Colorado       103766
Connecticut      4862
Delaware         1982
Florida         54090
Georgia         58073
Hawaii           6425
Idaho           82677
Illinois        55748
Indiana         36097
Iowa            55941
Kansas          81787
Kentucky        39650
Louisiana       44930
Maine           30920
Maryland         9891
Massachusetts    7826
Michigan        56817
Minnesota       79289
Mississippi     47296
Missouri        68995
Montana        145587
Nebraska        76483
Nevada         109889
New Hampshire    9027
New Jersey       7521
New Mexico     121412
New York        47831
North Carolina  48798
North Dakota    69273
Ohio            40975
Oklahoma        68782
Oregon          96184
Pennsylvania    44966
Rhode Island     1049
South Carolina  30225
South Dakota    75955
Tennessee       41328
Texas          262134
Utah            82096
Vermont          9267
Virginia        39780
Washington      66570
West Virginia   24070
Wisconsin       54464
Wyoming         97203
Data summary
Name state
Number of rows 50
Number of columns 8
_______________________
Column type frequency:
numeric 8
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Population 0 1 4246.42 4464.49 365.00 1079.50 2838.50 4968.50 21198.0 ▇▂▁▁▁
Income 0 1 4435.80 614.47 3098.00 3992.75 4519.00 4813.50 6315.0 ▃▅▇▂▁
Illiteracy 0 1 1.17 0.61 0.50 0.62 0.95 1.58 2.8 ▇▃▂▂▁
Life Exp 0 1 70.88 1.34 67.96 70.12 70.67 71.89 73.6 ▃▃▇▅▅
Murder 0 1 7.38 3.69 1.40 4.35 6.85 10.67 15.1 ▆▇▃▇▂
HS Grad 0 1 53.11 8.08 37.80 48.05 53.25 59.15 67.3 ▅▂▇▆▃
Frost 0 1 104.46 51.98 0.00 66.25 114.50 139.75 188.0 ▅▃▅▇▆
Area 0 1 70735.88 85327.30 1049.00 36985.25 54277.00 81162.50 566432.0 ▇▁▁▁▁
# A tibble: 50 × 9
   Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost   Area
        <dbl>  <dbl>      <dbl>      <dbl>  <dbl>     <dbl> <dbl>  <dbl>
 1       3615   3624        2.1       69.0   15.1      41.3    20  50708
 2        365   6315        1.5       69.3   11.3      66.7   152 566432
 3       2212   4530        1.8       70.6    7.8      58.1    15 113417
 4       2110   3378        1.9       70.7   10.1      39.9    65  51945
 5      21198   5114        1.1       71.7   10.3      62.6    20 156361
 6       2541   4884        0.7       72.1    6.8      63.9   166 103766
 7       3100   5348        1.1       72.5    3.1      56     139   4862
 8        579   4809        0.9       70.1    6.2      54.6   103   1982
 9       8277   4815        1.3       70.7   10.7      52.6    11  54090
10       4931   4091        2         68.5   13.9      40.6    60  58073
# ℹ 40 more rows
# ℹ 1 more variable: pop_bracket <chr>
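A stripped-down sketch of the first-match-wins behavior described above, on a hand-made vector (the values and grade labels are invented for illustration):

```{r}
library(dplyr)

x <- c(95, 72, 55, 30)

# Conditions are checked top to bottom; the first TRUE wins,
# so 72 only reaches the >= 70 rule after failing the >= 90 rule.
grade <- case_when(
  x >= 90 ~ "A",
  x >= 70 ~ "B",
  x >= 50 ~ "C",
  TRUE    ~ "F"   # catch-all for anything not matched above
)
grade
# [1] "A" "B" "C" "F"
```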

13 Applying Tidyverse way of DWV to M01-1 Example

In M01-1, there was a data visualization example built around the following question: does city mileage differ by the number of cylinders a vehicle has? If so, how?

Insights: the visualization showed that city mileage improves as the number of cylinders decreases.

13.1 Without using piping

```{r M01-Principle Data Viz}
mpg

g <- ggplot(mpg, aes(cty))
g + geom_density(aes(fill=factor(cyl)), alpha=0.8) +
  labs(title = "Density Plot",
       subtitle = "City Mileage Grouped by Number of Cylinders",
       caption = "Source: mpg dataset",
       x = "City Mileage",
       fill="# Cylinders") +
  theme_classic()
```
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

13.2 Using Piping

Prefer this approach from now on.

```{r tidyverse way}
mpg # use the mpg data set, not the mpg column of the mtcars dataset

mpg %>% 
  ggplot(aes(cty))+
  geom_density(aes(fill=factor(cyl)), alpha=0.8) +
  labs(title = "Density Plot",
       subtitle = "City Mileage Grouped by Number of Cylinders",
       caption = "Source: mpg dataset",
       x = "City Mileage",
       fill="# Cylinders"
       ) +
  theme_classic()
```
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# ℹ 224 more rows

Tip

Try to use piping all the time from now on. It is the tidyverse way of wrangling.

14 Missing data

Topics:

  • Dealing with missing data
  • Detecting missing data
  • Replacing missing data with the mean of the column, so-called mean replacement

```{r}
airQ.df <- airquality
class(airQ.df)
typeof(airQ.df)
summary(airQ.df)
skimr::skim(airQ.df)

colSums(is.na(airQ.df))

# mean replacement
OzoneMean <- mean(airQ.df$Ozone) # returns NA because of the NA's
OzoneMean <- mean(airQ.df$Ozone, na.rm = TRUE)

is.na(airQ.df["Ozone"])

airQ.df['Ozone'][is.na(airQ.df['Ozone'])] <- OzoneMean
airQ.df
skimr::skim(airQ.df)
```
[1] "data.frame"
[1] "list"
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
                               
Data summary
Name airQ.df
Number of rows 153
Number of columns 6
_______________________
Column type frequency:
numeric 6
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Ozone 37 0.76 42.13 32.99 1.0 18.00 31.5 63.25 168.0 ▇▃▂▁▁
Solar.R 7 0.95 185.93 90.06 7.0 115.75 205.0 258.75 334.0 ▅▃▅▇▅
Wind 0 1.00 9.96 3.52 1.7 7.40 9.7 11.50 20.7 ▂▇▇▃▁
Temp 0 1.00 77.88 9.47 56.0 72.00 79.0 85.00 97.0 ▂▃▇▇▃
Month 0 1.00 6.99 1.42 5.0 6.00 7.0 8.00 9.0 ▇▇▇▇▇
Day 0 1.00 15.80 8.86 1.0 8.00 16.0 23.00 31.0 ▇▇▇▇▆
  Ozone Solar.R    Wind    Temp   Month     Day 
     37       7       0       0       0       0 
       Ozone
  [1,] FALSE
  [2,] FALSE
  [3,] FALSE
  [4,] FALSE
  [5,]  TRUE
  [6,] FALSE
  [7,] FALSE
  [8,] FALSE
  [9,] FALSE
 [10,]  TRUE
 [11,] FALSE
 [12,] FALSE
 [13,] FALSE
 [14,] FALSE
 [15,] FALSE
 [16,] FALSE
 [17,] FALSE
 [18,] FALSE
 [19,] FALSE
 [20,] FALSE
 [21,] FALSE
 [22,] FALSE
 [23,] FALSE
 [24,] FALSE
 [25,]  TRUE
 [26,]  TRUE
 [27,]  TRUE
 [28,] FALSE
 [29,] FALSE
 [30,] FALSE
 [31,] FALSE
 [32,]  TRUE
 [33,]  TRUE
 [34,]  TRUE
 [35,]  TRUE
 [36,]  TRUE
 [37,]  TRUE
 [38,] FALSE
 [39,]  TRUE
 [40,] FALSE
 [41,] FALSE
 [42,]  TRUE
 [43,]  TRUE
 [44,] FALSE
 [45,]  TRUE
 [46,]  TRUE
 [47,] FALSE
 [48,] FALSE
 [49,] FALSE
 [50,] FALSE
 [51,] FALSE
 [52,]  TRUE
 [53,]  TRUE
 [54,]  TRUE
 [55,]  TRUE
 [56,]  TRUE
 [57,]  TRUE
 [58,]  TRUE
 [59,]  TRUE
 [60,]  TRUE
 [61,]  TRUE
 [62,] FALSE
 [63,] FALSE
 [64,] FALSE
 [65,]  TRUE
 [66,] FALSE
 [67,] FALSE
 [68,] FALSE
 [69,] FALSE
 [70,] FALSE
 [71,] FALSE
 [72,]  TRUE
 [73,] FALSE
 [74,] FALSE
 [75,]  TRUE
 [76,] FALSE
 [77,] FALSE
 [78,] FALSE
 [79,] FALSE
 [80,] FALSE
 [81,] FALSE
 [82,] FALSE
 [83,]  TRUE
 [84,]  TRUE
 [85,] FALSE
 [86,] FALSE
 [87,] FALSE
 [88,] FALSE
 [89,] FALSE
 [90,] FALSE
 [91,] FALSE
 [92,] FALSE
 [93,] FALSE
 [94,] FALSE
 [95,] FALSE
 [96,] FALSE
 [97,] FALSE
 [98,] FALSE
 [99,] FALSE
[100,] FALSE
[101,] FALSE
[102,]  TRUE
[103,]  TRUE
[104,] FALSE
[105,] FALSE
[106,] FALSE
[107,]  TRUE
[108,] FALSE
[109,] FALSE
[110,] FALSE
[111,] FALSE
[112,] FALSE
[113,] FALSE
[114,] FALSE
[115,]  TRUE
[116,] FALSE
[117,] FALSE
[118,] FALSE
[119,]  TRUE
[120,] FALSE
[121,] FALSE
[122,] FALSE
[123,] FALSE
[124,] FALSE
[125,] FALSE
[126,] FALSE
[127,] FALSE
[128,] FALSE
[129,] FALSE
[130,] FALSE
[131,] FALSE
[132,] FALSE
[133,] FALSE
[134,] FALSE
[135,] FALSE
[136,] FALSE
[137,] FALSE
[138,] FALSE
[139,] FALSE
[140,] FALSE
[141,] FALSE
[142,] FALSE
[143,] FALSE
[144,] FALSE
[145,] FALSE
[146,] FALSE
[147,] FALSE
[148,] FALSE
[149,] FALSE
[150,]  TRUE
[151,] FALSE
[152,] FALSE
[153,] FALSE
        Ozone Solar.R Wind Temp Month Day
1    41.00000     190  7.4   67     5   1
2    36.00000     118  8.0   72     5   2
3    12.00000     149 12.6   74     5   3
4    18.00000     313 11.5   62     5   4
5    42.12931      NA 14.3   56     5   5
6    28.00000      NA 14.9   66     5   6
7    23.00000     299  8.6   65     5   7
8    19.00000      99 13.8   59     5   8
9     8.00000      19 20.1   61     5   9
10   42.12931     194  8.6   69     5  10
11    7.00000      NA  6.9   74     5  11
12   16.00000     256  9.7   69     5  12
13   11.00000     290  9.2   66     5  13
14   14.00000     274 10.9   68     5  14
15   18.00000      65 13.2   58     5  15
16   14.00000     334 11.5   64     5  16
17   34.00000     307 12.0   66     5  17
18    6.00000      78 18.4   57     5  18
19   30.00000     322 11.5   68     5  19
20   11.00000      44  9.7   62     5  20
21    1.00000       8  9.7   59     5  21
22   11.00000     320 16.6   73     5  22
23    4.00000      25  9.7   61     5  23
24   32.00000      92 12.0   61     5  24
25   42.12931      66 16.6   57     5  25
26   42.12931     266 14.9   58     5  26
27   42.12931      NA  8.0   57     5  27
28   23.00000      13 12.0   67     5  28
29   45.00000     252 14.9   81     5  29
30  115.00000     223  5.7   79     5  30
31   37.00000     279  7.4   76     5  31
32   42.12931     286  8.6   78     6   1
33   42.12931     287  9.7   74     6   2
34   42.12931     242 16.1   67     6   3
35   42.12931     186  9.2   84     6   4
36   42.12931     220  8.6   85     6   5
37   42.12931     264 14.3   79     6   6
38   29.00000     127  9.7   82     6   7
39   42.12931     273  6.9   87     6   8
40   71.00000     291 13.8   90     6   9
41   39.00000     323 11.5   87     6  10
42   42.12931     259 10.9   93     6  11
43   42.12931     250  9.2   92     6  12
44   23.00000     148  8.0   82     6  13
45   42.12931     332 13.8   80     6  14
46   42.12931     322 11.5   79     6  15
47   21.00000     191 14.9   77     6  16
48   37.00000     284 20.7   72     6  17
49   20.00000      37  9.2   65     6  18
50   12.00000     120 11.5   73     6  19
51   13.00000     137 10.3   76     6  20
52   42.12931     150  6.3   77     6  21
53   42.12931      59  1.7   76     6  22
54   42.12931      91  4.6   76     6  23
55   42.12931     250  6.3   76     6  24
56   42.12931     135  8.0   75     6  25
57   42.12931     127  8.0   78     6  26
58   42.12931      47 10.3   73     6  27
59   42.12931      98 11.5   80     6  28
60   42.12931      31 14.9   77     6  29
61   42.12931     138  8.0   83     6  30
62  135.00000     269  4.1   84     7   1
63   49.00000     248  9.2   85     7   2
64   32.00000     236  9.2   81     7   3
65   42.12931     101 10.9   84     7   4
66   64.00000     175  4.6   83     7   5
67   40.00000     314 10.9   83     7   6
68   77.00000     276  5.1   88     7   7
69   97.00000     267  6.3   92     7   8
70   97.00000     272  5.7   92     7   9
71   85.00000     175  7.4   89     7  10
72   42.12931     139  8.6   82     7  11
73   10.00000     264 14.3   73     7  12
74   27.00000     175 14.9   81     7  13
75   42.12931     291 14.9   91     7  14
76    7.00000      48 14.3   80     7  15
77   48.00000     260  6.9   81     7  16
78   35.00000     274 10.3   82     7  17
79   61.00000     285  6.3   84     7  18
80   79.00000     187  5.1   87     7  19
81   63.00000     220 11.5   85     7  20
82   16.00000       7  6.9   74     7  21
83   42.12931     258  9.7   81     7  22
84   42.12931     295 11.5   82     7  23
85   80.00000     294  8.6   86     7  24
86  108.00000     223  8.0   85     7  25
87   20.00000      81  8.6   82     7  26
88   52.00000      82 12.0   86     7  27
89   82.00000     213  7.4   88     7  28
90   50.00000     275  7.4   86     7  29
91   64.00000     253  7.4   83     7  30
92   59.00000     254  9.2   81     7  31
93   39.00000      83  6.9   81     8   1
94    9.00000      24 13.8   81     8   2
95   16.00000      77  7.4   82     8   3
96   78.00000      NA  6.9   86     8   4
97   35.00000      NA  7.4   85     8   5
98   66.00000      NA  4.6   87     8   6
99  122.00000     255  4.0   89     8   7
100  89.00000     229 10.3   90     8   8
101 110.00000     207  8.0   90     8   9
102  42.12931     222  8.6   92     8  10
103  42.12931     137 11.5   86     8  11
104  44.00000     192 11.5   86     8  12
105  28.00000     273 11.5   82     8  13
106  65.00000     157  9.7   80     8  14
107  42.12931      64 11.5   79     8  15
108  22.00000      71 10.3   77     8  16
109  59.00000      51  6.3   79     8  17
110  23.00000     115  7.4   76     8  18
111  31.00000     244 10.9   78     8  19
112  44.00000     190 10.3   78     8  20
113  21.00000     259 15.5   77     8  21
114   9.00000      36 14.3   72     8  22
115  42.12931     255 12.6   75     8  23
116  45.00000     212  9.7   79     8  24
117 168.00000     238  3.4   81     8  25
118  73.00000     215  8.0   86     8  26
119  42.12931     153  5.7   88     8  27
120  76.00000     203  9.7   97     8  28
121 118.00000     225  2.3   94     8  29
122  84.00000     237  6.3   96     8  30
123  85.00000     188  6.3   94     8  31
124  96.00000     167  6.9   91     9   1
125  78.00000     197  5.1   92     9   2
126  73.00000     183  2.8   93     9   3
127  91.00000     189  4.6   93     9   4
128  47.00000      95  7.4   87     9   5
129  32.00000      92 15.5   84     9   6
130  20.00000     252 10.9   80     9   7
131  23.00000     220 10.3   78     9   8
132  21.00000     230 10.9   75     9   9
133  24.00000     259  9.7   73     9  10
134  44.00000     236 14.9   81     9  11
135  21.00000     259 15.5   76     9  12
136  28.00000     238  6.3   77     9  13
137   9.00000      24 10.9   71     9  14
138  13.00000     112 11.5   71     9  15
139  46.00000     237  6.9   78     9  16
140  18.00000     224 13.8   67     9  17
141  13.00000      27 10.3   76     9  18
142  24.00000     238 10.3   68     9  19
143  16.00000     201  8.0   82     9  20
144  13.00000     238 12.6   64     9  21
145  23.00000      14  9.2   71     9  22
146  36.00000     139 10.3   81     9  23
147   7.00000      49 10.3   69     9  24
148  14.00000      20 16.6   63     9  25
149  30.00000     193  6.9   70     9  26
150  42.12931     145 13.2   77     9  27
151  14.00000     191 14.3   75     9  28
152  18.00000     131  8.0   76     9  29
153  20.00000     223 11.5   68     9  30
Data summary
Name airQ.df
Number of rows 153
Number of columns 6
_______________________
Column type frequency:
numeric 6
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Ozone 0 1.00 42.13 28.69 1.0 21.00 42.13 46.00 168.0 ▇▇▂▁▁
Solar.R 7 0.95 185.93 90.06 7.0 115.75 205.00 258.75 334.0 ▅▃▅▇▅
Wind 0 1.00 9.96 3.52 1.7 7.40 9.70 11.50 20.7 ▂▇▇▃▁
Temp 0 1.00 77.88 9.47 56.0 72.00 79.00 85.00 97.0 ▂▃▇▇▃
Month 0 1.00 6.99 1.42 5.0 6.00 7.00 8.00 9.0 ▇▇▇▇▇
Day 0 1.00 15.80 8.86 1.0 8.00 16.00 23.00 31.0 ▇▇▇▇▆

14.1 Mean replacement

```{r mean replacement extra not covered in the video}
SolarMean <- mean(airQ.df$Solar.R, na.rm = TRUE)

airQ.df$Solar.R[is.na(airQ.df$Solar.R)] <- SolarMean # treat the column as a vector
airQ.df['Solar.R'][is.na(airQ.df['Solar.R'])] <- SolarMean # treat the column as a one-column data frame; same result

library(skimr)
skim(airQ.df)
```
Data summary
Name airQ.df
Number of rows 153
Number of columns 6
_______________________
Column type frequency:
numeric 6
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Ozone 0 1 42.13 28.69 1.0 21.0 42.13 46.0 168.0 ▇▇▂▁▁
Solar.R 0 1 185.93 87.96 7.0 120.0 194.00 256.0 334.0 ▅▃▆▇▅
Wind 0 1 9.96 3.52 1.7 7.4 9.70 11.5 20.7 ▂▇▇▃▁
Temp 0 1 77.88 9.47 56.0 72.0 79.00 85.0 97.0 ▂▃▇▇▃
Month 0 1 6.99 1.42 5.0 6.0 7.00 8.0 9.0 ▇▇▇▇▇
Day 0 1 15.80 8.86 1.0 8.0 16.00 23.0 31.0 ▇▇▇▇▆
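The base-R replacement above can also be written the tidyverse way. A sketch, assuming dplyr's across() and coalesce() (coalesce() fills each NA from its second argument, here the column mean), applied to a fresh copy of airquality:

```{r}
library(dplyr)

airQ.tib <- as_tibble(airquality)

# mean-replace the NA's in every numeric column in one mutate()
airQ.filled <- airQ.tib %>%
  mutate(across(
    where(is.numeric),
    ~ coalesce(.x, mean(.x, na.rm = TRUE))
  ))

colSums(is.na(airQ.filled))  # all zeros
```

This scales to any number of columns at once, whereas the base-R version handles one column per assignment.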

15 Resources