M05-1-Principles-Data Wrangling with Tidyverse in R
1 Overview
1.1 Learning Outcomes
After taking this workshop, participants should be able to do the following:
- Describe the concept of data wrangling.
- Describe how tibbles are different from data frames.
- Explain how to convert wide or long data to “tidy” data.
- Explain how to merge relational data sets using join functions. (Next module)
- Explain how to use grouped mutates and filters together.
- Be familiar with major dplyr functions for transforming data.
- Create a new variable with mutate() and case_when().
- Use the pipe operator to shape data in preparation for analysis and visualization.
1.2 The textbook chapters to cover:
- Ch3: Data Transformation
- Ch5: Data Tidying
- Ch13: Numbers
2 Introduction
2.1 Library and Packages
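This section is a placeholder in the notes; a minimal setup sketch is shown below. It assumes the tidyverse is installed, and that cars.tib (the name used in later chunks, whose defining chunk is not shown here) is a tibble copy of the built-in mtcars data.

```r
library(tidyverse) # loads dplyr, tidyr, tibble, readr, ggplot2, and friends

# assumption: cars.tib, used throughout this module, is a tibble
# version of the built-in mtcars data set
cars.tib <- as_tibble(mtcars)
```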
2.2 Tibbles vs. Data Frames
- Tibbles are the tidyverse's newer, more modern take on the data frame
- Columns can be lists in a tibble
- Column names can be non-syntactic (e.g., contain spaces or symbols)
2.3 Two main differences between tibbles and data frames
- printing and subsetting
2.3.1 Printing
```{r}
print(cars.tib) # tibbles print only the first 10 rows, and only the columns that fit on screen
# to see more rows or all columns, adjust n and width:
data() # list the data sets built into R and loaded packages
print(starwars)
print(starwars, n = 15, width = Inf)
```
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color
<chr> <int> <dbl> <chr> <chr> <chr>
1 Luke Skywalker 172 77 blond fair blue
2 C-3PO 167 75 <NA> gold yellow
3 R2-D2 96 32 <NA> white, blue red
4 Darth Vader 202 136 none white yellow
5 Leia Organa 150 49 brown light brown
6 Owen Lars 178 120 brown, grey light blue
7 Beru Whitesun Lars 165 75 brown light blue
8 R5-D4 97 32 <NA> white, red red
9 Biggs Darklighter 183 84 black light brown
10 Obi-Wan Kenobi 182 77 auburn, white fair blue-gray
11 Anakin Skywalker 188 84 blond fair blue
12 Wilhuff Tarkin 180 NA auburn, grey fair blue
13 Chewbacca 228 112 brown unknown blue
14 Han Solo 180 80 brown fair brown
15 Greedo 173 74 <NA> green black
birth_year sex gender homeworld species films vehicles starships
<dbl> <chr> <chr> <chr> <chr> <list> <list> <list>
1 19 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>
2 112 none masculine Tatooine Droid <chr [6]> <chr [0]> <chr [0]>
3 33 none masculine Naboo Droid <chr [7]> <chr [0]> <chr [0]>
4 41.9 male masculine Tatooine Human <chr [4]> <chr [0]> <chr [1]>
5 19 female feminine Alderaan Human <chr [5]> <chr [1]> <chr [0]>
6 52 male masculine Tatooine Human <chr [3]> <chr [0]> <chr [0]>
7 47 female feminine Tatooine Human <chr [3]> <chr [0]> <chr [0]>
8 NA none masculine Tatooine Droid <chr [1]> <chr [0]> <chr [0]>
9 24 male masculine Tatooine Human <chr [1]> <chr [0]> <chr [1]>
10 57 male masculine Stewjon Human <chr [6]> <chr [1]> <chr [5]>
11 41.9 male masculine Tatooine Human <chr [3]> <chr [2]> <chr [3]>
12 64 male masculine Eriadu Human <chr [2]> <chr [0]> <chr [0]>
13 200 male masculine Kashyyyk Wookiee <chr [5]> <chr [1]> <chr [2]>
14 29 male masculine Corellia Human <chr [4]> <chr [0]> <chr [2]>
15 44 male masculine Rodia Rodian <chr [1]> <chr [0]> <chr [0]>
# ℹ 72 more rows
2.4 Subsetting
```{r}
#column
cars.tib[1] #returns a tibble
cars.tib[[1]] #returns a vector
#row
cars.tib[1,] #returns a tibble
```
# A tibble: 32 × 1
mpg
<dbl>
1 21
2 21
3 22.8
4 21.4
5 18.7
6 18.1
7 14.3
8 24.4
9 22.8
10 19.2
# ℹ 22 more rows
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
# A tibble: 1 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
3 What is the Tidyverse?
- A meta-package: a single install that provides many packages essential to data wrangling and visualization.
- Created by Hadley Wickham, author of R for Data Science
- ggplot2 for visualization
- dplyr for data transformation
- tidyr for reshaping data to make it tidy
3.1 Initial Data Preparation and Exploration
3.2 Pipe operator: %>% or |>
- Much easier to read and understand when you have many arguments to pipe through
- https://r4ds.had.co.nz/pipes.html
```{r}
cars.tib %>%
  filter(mpg > 25) %>%
  select(drat)

# what percent of the cars in the data are manual, as opposed to automatic?
cars.tib %>%
  pull(am) %>% # same as cars.tib[['am']]
  mean(na.rm = TRUE) %>%
  scales::percent(accuracy = .1)
## pull() is a dplyr function that says "give me back this one variable instead of a data set"
round(mean(cars.tib[['am']], na.rm = TRUE) * 100, 1) # base R equivalent
```
# A tibble: 6 × 1
drat
<dbl>
1 4.08
2 4.93
3 4.22
4 4.08
5 4.43
6 3.77
[1] "40.6%"
[1] 40.6
4 Data Wrangling with dplyr
Data Transformation with dplyr :: CHEAT SHEET
4.1 Core of Data Wrangling
```{r}
#install.packages('vtable')
library(vtable)
vtable(cars.tib)
t(cars.tib) # transpose (returns a matrix)
t(t(cars.tib)) # transposing twice recovers the original orientation, as a matrix
```
| Name | Class | Values |
|---|---|---|
| mpg | numeric | Num: 10.4 to 33.9 |
| cyl | numeric | Num: 4 to 8 |
| disp | numeric | Num: 71.1 to 472 |
| hp | numeric | Num: 52 to 335 |
| drat | numeric | Num: 2.76 to 4.93 |
| wt | numeric | Num: 1.513 to 5.424 |
| qsec | numeric | Num: 14.5 to 22.9 |
| vs | numeric | Num: 0 to 1 |
| am | numeric | Num: 0 to 1 |
| gear | numeric | Num: 3 to 5 |
| carb | numeric | Num: 1 to 8 |
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
mpg 21.00 21.000 22.80 21.400 18.70 18.10 14.30 24.40 22.80 19.20
cyl 6.00 6.000 4.00 6.000 8.00 6.00 8.00 4.00 4.00 6.00
disp 160.00 160.000 108.00 258.000 360.00 225.00 360.00 146.70 140.80 167.60
hp 110.00 110.000 93.00 110.000 175.00 105.00 245.00 62.00 95.00 123.00
drat 3.90 3.900 3.85 3.080 3.15 2.76 3.21 3.69 3.92 3.92
wt 2.62 2.875 2.32 3.215 3.44 3.46 3.57 3.19 3.15 3.44
qsec 16.46 17.020 18.61 19.440 17.02 20.22 15.84 20.00 22.90 18.30
vs 0.00 0.000 1.00 1.000 0.00 1.00 0.00 1.00 1.00 1.00
am 1.00 1.000 1.00 0.000 0.00 0.00 0.00 0.00 0.00 0.00
gear 4.00 4.000 4.00 3.000 3.00 3.00 3.00 4.00 4.00 4.00
carb 4.00 4.000 1.00 1.000 2.00 1.00 4.00 2.00 2.00 4.00
[,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
mpg 17.80 16.40 17.30 15.20 10.40 10.400 14.700 32.40 30.400 33.900
cyl 6.00 8.00 8.00 8.00 8.00 8.000 8.000 4.00 4.000 4.000
disp 167.60 275.80 275.80 275.80 472.00 460.000 440.000 78.70 75.700 71.100
hp 123.00 180.00 180.00 180.00 205.00 215.000 230.000 66.00 52.000 65.000
drat 3.92 3.07 3.07 3.07 2.93 3.000 3.230 4.08 4.930 4.220
wt 3.44 4.07 3.73 3.78 5.25 5.424 5.345 2.20 1.615 1.835
qsec 18.90 17.40 17.60 18.00 17.98 17.820 17.420 19.47 18.520 19.900
vs 1.00 0.00 0.00 0.00 0.00 0.000 0.000 1.00 1.000 1.000
am 0.00 0.00 0.00 0.00 0.00 0.000 0.000 1.00 1.000 1.000
gear 4.00 3.00 3.00 3.00 3.00 3.000 3.000 4.00 4.000 4.000
carb 4.00 3.00 3.00 3.00 4.00 4.000 4.000 1.00 2.000 1.000
[,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30]
mpg 21.500 15.50 15.200 13.30 19.200 27.300 26.00 30.400 15.80 19.70
cyl 4.000 8.00 8.000 8.00 8.000 4.000 4.00 4.000 8.00 6.00
disp 120.100 318.00 304.000 350.00 400.000 79.000 120.30 95.100 351.00 145.00
hp 97.000 150.00 150.000 245.00 175.000 66.000 91.00 113.000 264.00 175.00
drat 3.700 2.76 3.150 3.73 3.080 4.080 4.43 3.770 4.22 3.62
wt 2.465 3.52 3.435 3.84 3.845 1.935 2.14 1.513 3.17 2.77
qsec 20.010 16.87 17.300 15.41 17.050 18.900 16.70 16.900 14.50 15.50
vs 1.000 0.00 0.000 0.00 0.000 1.000 0.00 1.000 0.00 0.00
am 0.000 0.00 0.000 0.00 0.000 1.000 1.00 1.000 1.00 1.00
gear 3.000 3.00 3.000 3.00 3.000 4.000 5.00 5.000 5.00 5.00
carb 1.000 2.00 2.000 4.00 2.000 1.000 2.00 2.000 4.00 6.00
[,31] [,32]
mpg 15.00 21.40
cyl 8.00 4.00
disp 301.00 121.00
hp 335.00 109.00
drat 3.54 4.11
wt 3.57 2.78
qsec 14.60 18.60
vs 0.00 1.00
am 1.00 1.00
gear 5.00 4.00
carb 8.00 2.00
mpg cyl disp hp drat wt qsec vs am gear carb
[1,] 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
[2,] 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
[3,] 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
[4,] 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
[5,] 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
[6,] 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
[7,] 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
[8,] 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
[9,] 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
[10,] 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
[11,] 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
[12,] 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
[13,] 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
[14,] 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
[15,] 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
[16,] 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
[17,] 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
[18,] 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
[19,] 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
[20,] 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
[21,] 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
[22,] 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
[23,] 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
[24,] 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
[25,] 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
[26,] 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
[27,] 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
[28,] 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
[29,] 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
[30,] 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
[31,] 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
[32,] 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
4.2 Stages of Data Wrangling
- From records to data
- From data to tidy data
- From tidy data to data for analysis.
5 From Data to Tidy data
5.1 Three conditions for Tidy Data
- Each variable must have its own unique column
- Each observation must have its own unique row
- Each value must have its own cell.
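As a made-up illustration (the data and names here are invented, not from the workshop files): the untidy table below stores two different variables in one "type" column, violating the first two conditions; pivot_wider(), covered later in this section, fixes it.

```r
library(tidyr)

# untidy: the "type" column mixes two variables (cases, population),
# so each observation is spread across two rows
untidy <- tibble::tibble(
  country = c("A", "A", "B", "B"),
  type    = c("cases", "population", "cases", "population"),
  count   = c(745, 19987071, 37737, 172006362)
)

# tidy: each variable gets its own column, each observation its own row
tidy <- pivot_wider(untidy, names_from = type, values_from = count)
tidy
```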
5.2 tidyr package
The tidyr package is the main package for tidying data.
- pivot_longer()
- pivot_wider()
- separate()
- unite()
5.3 Wide data to long data:
```{r}
# examples
# data() # to see all data available in R
relig_income # wide data
billboard # wide data
```
# A tibble: 18 × 11
religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agnostic 27 34 60 81 76 137 122
2 Atheist 12 27 37 52 35 70 73
3 Buddhist 27 21 30 34 33 58 62
4 Catholic 418 617 732 670 638 1116 949
5 Don’t k… 15 14 15 11 10 35 21
6 Evangel… 575 869 1064 982 881 1486 949
7 Hindu 1 9 7 9 11 34 47
8 Histori… 228 244 236 238 197 223 131
9 Jehovah… 20 27 24 24 21 30 15
10 Jewish 19 19 25 25 30 95 69
11 Mainlin… 289 495 619 655 651 1107 939
12 Mormon 29 40 48 51 56 112 85
13 Muslim 6 7 9 10 9 23 16
14 Orthodox 13 17 23 32 32 47 38
15 Other C… 9 7 11 13 13 14 18
16 Other F… 20 33 40 46 49 63 46
17 Other W… 5 2 3 4 2 7 3
18 Unaffil… 217 299 374 365 341 528 407
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
# `Don't know/refused` <dbl>
# A tibble: 317 × 79
artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
<chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
# ℹ 307 more rows
# ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
# wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
# wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
# wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
# wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
# wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
5.3.1 Syntax
pivot_longer(
data,
cols,
...,
cols_vary = "fastest",
names_to = "name",
names_prefix = NULL,
names_sep = NULL,
names_pattern = NULL,
names_ptypes = NULL,
names_transform = NULL,
names_repair = "check_unique",
values_to = "value",
values_drop_na = FALSE,
values_ptypes = NULL,
values_transform = NULL
)
5.3.2 pivot_longer() example 1
```{r pivot_longer example}
wide_pop <- read_csv("https://raw.githubusercontent.com/jsgriffin96/r_workshop_4/master/data/wide-population.csv")
wide_pop

## convert to a tidy data set
tidy_pop <- pivot_longer(wide_pop, cols = c('2018', '2019', '2020'), names_to = 'year', values_to = 'population')
tidy_pop
skimr::skim(tidy_pop)

# following the syntax, with a column range
pivot_longer(wide_pop, '2018':'2020', names_to = 'year', values_to = 'population')

# using the pipe operator
wide_pop %>%
  pivot_longer('2018':'2020', names_to = "year", values_to = "pop")
```
# A tibble: 51 × 4
State `2018` `2019` `2020`
<chr> <dbl> <dbl> <dbl>
1 Alabama 4874486 4887681 4903185
2 Alaska 739700 735139 731545
3 Arizona 7044008 7158024 7278717
4 Arkansas 3001345 3009733 3017804
5 California 39358497 39461588 39512223
6 Colorado 5611885 5691287 5758736
7 Connecticut 3573297 3571520 3565287
8 Delaware 956823 965479 973764
9 District of Columbia 694906 701547 705749
10 Florida 20963613 21244317 21477737
# ℹ 41 more rows
# A tibble: 153 × 3
State year population
<chr> <chr> <dbl>
1 Alabama 2018 4874486
2 Alabama 2019 4887681
3 Alabama 2020 4903185
4 Alaska 2018 739700
5 Alaska 2019 735139
6 Alaska 2020 731545
7 Arizona 2018 7044008
8 Arizona 2019 7158024
9 Arizona 2020 7278717
10 Arkansas 2018 3001345
# ℹ 143 more rows
| Name | tidy_pop |
| Number of rows | 153 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| State | 0 | 1 | 4 | 20 | 0 | 51 | 0 |
| year | 0 | 1 | 4 | 4 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| population | 0 | 1 | 6404657 | 7277281 | 577601 | 1787065 | 4461153 | 7423362 | 39512223 | ▇▂▁▁▁ |
# A tibble: 153 × 3
State year population
<chr> <chr> <dbl>
1 Alabama 2018 4874486
2 Alabama 2019 4887681
3 Alabama 2020 4903185
4 Alaska 2018 739700
5 Alaska 2019 735139
6 Alaska 2020 731545
7 Arizona 2018 7044008
8 Arizona 2019 7158024
9 Arizona 2020 7278717
10 Arkansas 2018 3001345
# ℹ 143 more rows
# A tibble: 153 × 3
State year pop
<chr> <chr> <dbl>
1 Alabama 2018 4874486
2 Alabama 2019 4887681
3 Alabama 2020 4903185
4 Alaska 2018 739700
5 Alaska 2019 735139
6 Alaska 2020 731545
7 Arizona 2018 7044008
8 Arizona 2019 7158024
9 Arizona 2020 7278717
10 Arkansas 2018 3001345
# ℹ 143 more rows
5.3.3 pivot_longer() example 2
```{r}
# Song ranking
billboard %>%
  pivot_longer(cols = starts_with('wk'),
               names_to = 'week',
               names_prefix = 'wk',
               values_to = 'chart_position',
               values_drop_na = TRUE)
```
# A tibble: 5,307 × 5
artist track date.entered week chart_position
<chr> <chr> <date> <chr> <dbl>
1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
# ℹ 5,297 more rows
5.4 Long data to wide data
5.4.1 Syntax
pivot_wider(
data,
...,
id_cols = NULL,
id_expand = FALSE,
names_from = name,
names_prefix = "",
names_sep = "_",
names_glue = NULL,
names_sort = FALSE,
names_vary = "fastest",
names_expand = FALSE,
names_repair = "check_unique",
values_from = value,
values_fill = NULL,
values_fn = NULL,
unused_fn = NULL
)
5.4.2 pivot_wider() Example
```{r pivot_wider Example}
long_states <- read_csv("https://raw.githubusercontent.com/jsgriffin96/r_workshop_4/master/data/long-population.csv")
long_states # why is this structure not tidy?
## convert to a tidy data
tidy_states <- pivot_wider(long_states, names_from = condition, values_from = measure)
tidy_states
```
# A tibble: 153 × 3
State condition measure
<chr> <chr> <dbl>
1 Alabama population 29782365
2 Alabama employment_rate 0.61
3 Alabama number_colleges 1898
4 Alaska population 7843489
5 Alaska employment_rate 0.89
6 Alaska number_colleges 1375
7 Arizona population 13881560
8 Arizona employment_rate 0.29
9 Arizona number_colleges 258
10 Arkansas population 6550014
# ℹ 143 more rows
# A tibble: 51 × 4
State population employment_rate number_colleges
<chr> <dbl> <dbl> <dbl>
1 Alabama 29782365 0.61 1898
2 Alaska 7843489 0.89 1375
3 Arizona 13881560 0.29 258
4 Arkansas 6550014 0.41 1580
5 California 26297974 0.89 130
6 Colorado 26857590 0.45 1205
7 Connecticut 2153453 0.62 1783
8 Delaware 12239097 0.27 964
9 District of Columbia 11872901 0.75 1856
10 Florida 8139044 0.33 625
# ℹ 41 more rows
```{r pivot_wider Alternatives}
# following syntax closely
pivot_wider(long_states, id_cols = State, names_from = condition, values_from = measure)
# using the piping operator
long_states %>%
  pivot_wider(id_cols = State, names_from = condition, values_from = measure)
```
# A tibble: 51 × 4
State population employment_rate number_colleges
<chr> <dbl> <dbl> <dbl>
1 Alabama 29782365 0.61 1898
2 Alaska 7843489 0.89 1375
3 Arizona 13881560 0.29 258
4 Arkansas 6550014 0.41 1580
5 California 26297974 0.89 130
6 Colorado 26857590 0.45 1205
7 Connecticut 2153453 0.62 1783
8 Delaware 12239097 0.27 964
9 District of Columbia 11872901 0.75 1856
10 Florida 8139044 0.33 625
# ℹ 41 more rows
# A tibble: 51 × 4
State population employment_rate number_colleges
<chr> <dbl> <dbl> <dbl>
1 Alabama 29782365 0.61 1898
2 Alaska 7843489 0.89 1375
3 Arizona 13881560 0.29 258
4 Arkansas 6550014 0.41 1580
5 California 26297974 0.89 130
6 Colorado 26857590 0.45 1205
7 Connecticut 2153453 0.62 1783
8 Delaware 12239097 0.27 964
9 District of Columbia 11872901 0.75 1856
10 Florida 8139044 0.33 625
# ℹ 41 more rows
5.5 Separating data
5.5.1 Syntax
separate(
data,
col,
into,
sep = "[^[:alnum:]]+",
remove = TRUE,
convert = FALSE,
extra = "warn",
fill = "warn",
...
)
5.5.2 separate() example
```{r separate() example}
state_rate <- read_csv("https://raw.githubusercontent.com/jsgriffin96/r_workshop_4/master/data/state_rate.csv")
state_rate
# the rate column has both deaths and population in one cell, which makes it untidy
tidy_rate <- separate(state_rate, col = rate, into = c('deaths', 'population'))
tidy_rate # note: the new columns are character; add convert = TRUE to parse them as integers
```
# A tibble: 51 × 2
state rate
<chr> <chr>
1 Alabama 21782/4903185
2 Alaska 31460/731545
3 Arizona 29814/7278717
4 Arkansas 10602/3017804
5 California 45850/39512223
6 Colorado 23366/5758736
7 Connecticut 34784/3565287
8 Delaware 16342/973764
9 District of Columbia 9276/705749
10 Florida 9680/21477737
# ℹ 41 more rows
# A tibble: 51 × 3
state deaths population
<chr> <chr> <chr>
1 Alabama 21782 4903185
2 Alaska 31460 731545
3 Arizona 29814 7278717
4 Arkansas 10602 3017804
5 California 45850 39512223
6 Colorado 23366 5758736
7 Connecticut 34784 3565287
8 Delaware 16342 973764
9 District of Columbia 9276 705749
10 Florida 9680 21477737
# ℹ 41 more rows
```{r separate() Alternatives}
separate(data = state_rate, rate, into = c('deaths', 'population'), sep = "/")
state_rate %>%
  separate(rate, into = c("deaths", "population"))
```
# A tibble: 51 × 3
state deaths population
<chr> <chr> <chr>
1 Alabama 21782 4903185
2 Alaska 31460 731545
3 Arizona 29814 7278717
4 Arkansas 10602 3017804
5 California 45850 39512223
6 Colorado 23366 5758736
7 Connecticut 34784 3565287
8 Delaware 16342 973764
9 District of Columbia 9276 705749
10 Florida 9680 21477737
# ℹ 41 more rows
# A tibble: 51 × 3
state deaths population
<chr> <chr> <chr>
1 Alabama 21782 4903185
2 Alaska 31460 731545
3 Arizona 29814 7278717
4 Arkansas 10602 3017804
5 California 45850 39512223
6 Colorado 23366 5758736
7 Connecticut 34784 3565287
8 Delaware 16342 973764
9 District of Columbia 9276 705749
10 Florida 9680 21477737
# ℹ 41 more rows
5.6 Uniting data
5.6.1 Syntax
- from tidyr
unite(data, col, ..., sep = "_", remove = TRUE, na.rm = FALSE)
5.6.2 unite() example
```{r unite() example}
untidy_rate <- unite(tidy_rate, col = rate, c(deaths, population), sep = '/')
untidy_rate # it works, but the result is untidy again
```
# A tibble: 51 × 2
state rate
<chr> <chr>
1 Alabama 21782/4903185
2 Alaska 31460/731545
3 Arizona 29814/7278717
4 Arkansas 10602/3017804
5 California 45850/39512223
6 Colorado 23366/5758736
7 Connecticut 34784/3565287
8 Delaware 16342/973764
9 District of Columbia 9276/705749
10 Florida 9680/21477737
# ℹ 41 more rows
```{r unite() Alternatives}
unite(data = tidy_rate, rate, deaths, population, sep='/')
tidy_rate %>%
unite(col = "rate", deaths, population, sep = "/")
```
# A tibble: 51 × 2
state rate
<chr> <chr>
1 Alabama 21782/4903185
2 Alaska 31460/731545
3 Arizona 29814/7278717
4 Arkansas 10602/3017804
5 California 45850/39512223
6 Colorado 23366/5758736
7 Connecticut 34784/3565287
8 Delaware 16342/973764
9 District of Columbia 9276/705749
10 Florida 9680/21477737
# ℹ 41 more rows
# A tibble: 51 × 2
state rate
<chr> <chr>
1 Alabama 21782/4903185
2 Alaska 31460/731545
3 Arizona 29814/7278717
4 Arkansas 10602/3017804
5 California 45850/39512223
6 Colorado 23366/5758736
7 Connecticut 34784/3565287
8 Delaware 16342/973764
9 District of Columbia 9276/705749
10 Florida 9680/21477737
# ℹ 41 more rows
6 Data transformation with dplyr (From Tidy Data to Your Analysis)
Data Transformation: Wickham’s book (R for Data Science), Ch. 5
- dplyr basics
- Filter rows with filter()
- Arrange rows with arrange()
- Select columns with select()
- Add new variables with mutate()
- Grouped summaries with summarise()
- Grouped mutates and filters
- pull()
- case_when(): create a new variable from multiple logical conditions (a vectorized if-else)
dplyr:: cheat sheet https://nyu-cdsc.github.io/learningr/assets/data-transformation.pdf
6.1 flights data
```{r}
#install.packages("nycflights13")
library(nycflights13)
# nycflights13
nycflights13::flights
#?flights
```
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
6.1.1 Variable Description
Data frame with columns:
- year, month, day: Date of departure.
- dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local tz.
- sched_dep_time, sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local tz.
- dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
- carrier: Two letter carrier abbreviation. See airlines to get name.
- flight: Flight number.
- tailnum: Plane tail number. See planes for additional metadata.
- origin, dest: Origin and destination. See airports for additional metadata.
- air_time: Amount of time spent in the air, in minutes.
- distance: Distance between airports, in miles.
- hour, minute: Time of scheduled departure broken into hour and minutes.
- time_hour: Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.

Source: RITA, Bureau of Transportation Statistics, https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236
6.2 filter(): filter rows
```{r filter()}
cars <- mtcars
summary(cars)
#?mtcars
# choose all the cases with cylinder size of 8 and automatic gear.
# [, 2] cyl Number of cylinders
# [, 9] am Transmission (0 = automatic, 1 = manual)
v8_automatic <- filter(cars, cyl == 8, am == 0)
v8_automatic
# R either prints out the results, or saves them to a variable. If you want to do both, you can wrap the assignment in parentheses:
(v8_automatic <- filter(cars, cyl == 8, am == 0))
```
 mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
mpg cyl disp hp drat wt qsec vs am gear carb
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
6.2.1 Comparisons
- To use filtering effectively, you have to know how to select the observations that you want using the comparison operators.
- R provides the standard suite: >, >=, <, <=, != (not equal), and == (equal).
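The output below appears to have lost its code chunk; a chunk along these lines would reproduce it (a reconstruction, not the original notes):

```r
library(nycflights13)
library(dplyr)

# filter to the 27,004 flights that departed in January
filter(flights, month == 1)

# tally January vs. non-January flights
flights %>% count(year, month == 1)
```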
# A tibble: 27,004 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 26,994 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 2 × 3
year `month == 1` n
<int> <lgl> <int>
1 2013 FALSE 309772
2 2013 TRUE 27004
6.2.1.1 Floating point numbers
Computers use finite precision arithmetic (they obviously can’t store an infinite number of digits!) so remember that every number you see is an approximation. Instead of relying on ==, use near():
```{r floating point numbers}
# a common problem you might encounter when using ==: floating point numbers
sqrt(2) ^ 2 == 2
near(sqrt(2) ^ 2, 2) # near() compares floating point numbers with a built-in tolerance, so it is safer than ==; it comes from dplyr
1 / 49 * 49 == 1
near(1 / 49 * 49, 1)
```
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
6.2.2 Logical operators
filter() is used often with logical conditions
- A lot of programming in general is based on writing logical conditions that check whether something is true
- In R, if the condition is true, it returns TRUE, which turns into 1 if you do a calculation with it.
- If false, it returns FALSE, which turns into 0.
- tip: ifelse() is rarely what you want, and ifelse(condition, TRUE, FALSE) is redundant
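The TRUE-becomes-1 point above is why taking mean() of a logical vector yields a proportion (as in the pipe example with am earlier); a quick base-R sketch:

```r
# logicals coerce to 1/0 in arithmetic
sum(c(TRUE, FALSE, TRUE))         # counts how many elements are TRUE
mean(c(TRUE, FALSE, TRUE, FALSE)) # the share of elements that are TRUE
```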
```{r}
filter(flights, month == 11 | month == 12) # the same as below
filter(flights, month %in% c(11, 12))
```
# A tibble: 55,403 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 11 1 5 2359 6 352 345
2 2013 11 1 35 2250 105 123 2356
3 2013 11 1 455 500 -5 641 651
4 2013 11 1 539 545 -6 856 827
5 2013 11 1 542 545 -3 831 855
6 2013 11 1 549 600 -11 912 923
7 2013 11 1 550 600 -10 705 659
8 2013 11 1 554 600 -6 659 701
9 2013 11 1 554 600 -6 826 827
10 2013 11 1 554 600 -6 749 751
# ℹ 55,393 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 55,403 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 11 1 5 2359 6 352 345
2 2013 11 1 35 2250 105 123 2356
3 2013 11 1 455 500 -5 641 651
4 2013 11 1 539 545 -6 856 827
5 2013 11 1 542 545 -3 831 855
6 2013 11 1 549 600 -11 912 923
7 2013 11 1 550 600 -10 705 659
8 2013 11 1 554 600 -6 659 701
9 2013 11 1 554 600 -6 826 827
10 2013 11 1 554 600 -6 749 751
# ℹ 55,393 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
De Morgan’s law:
- !(x & y) is the same as !x | !y, and
- !(x | y) is the same as !x & !y.
- For example, if you wanted to find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
```{r}
df <- flights
filter(df, !(dep_delay > 120 | arr_delay > 120)) # the same as the two versions below
filter(df, dep_delay <= 120 & arr_delay <= 120)
filter(df, dep_delay <= 120, arr_delay <= 120) # commas between filter() conditions act as &
```
# A tibble: 316,050 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 316,040 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 316,050 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 316,040 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 316,050 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 316,040 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
6.2.3 Missing values
- filter() only includes rows where the condition is TRUE;
- it excludes both FALSE and NA values.
- If you want to preserve missing values, ask for them explicitly:
```{r}
nrow(df)
head(is.na(df))
df1 <- tibble(x = c(1, NA, 3))
filter(df1, x > 1)
filter(df1, is.na(x) | x > 1)
```
[1] 336776
year month day dep_time sched_dep_time dep_delay arr_time
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
sched_arr_time arr_delay carrier flight tailnum origin dest air_time
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
distance hour minute time_hour
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE
# A tibble: 1 × 1
x
<dbl>
1 3
# A tibble: 2 × 1
x
<dbl>
1 NA
2 3
6.2.4 %in%
a %in% c(b, c, d, e, f) checks whether a is any of the values b, c, d, e, or f. It works for text, too!
```{r}
filter(flights, month == 9 | month == 10 | month == 11 | month == 12) # the same as below
filter(flights, month %in% c(9, 10, 11, 12))
```
# A tibble: 111,866 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 10 1 447 500 -13 614 648
2 2013 10 1 522 517 5 735 757
3 2013 10 1 536 545 -9 809 855
4 2013 10 1 539 545 -6 801 827
5 2013 10 1 539 545 -6 917 933
6 2013 10 1 544 550 -6 912 932
7 2013 10 1 549 600 -11 653 716
8 2013 10 1 550 600 -10 648 700
9 2013 10 1 550 600 -10 649 659
10 2013 10 1 551 600 -9 727 730
# ℹ 111,856 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 111,866 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 10 1 447 500 -13 614 648
2 2013 10 1 522 517 5 735 757
3 2013 10 1 536 545 -9 809 855
4 2013 10 1 539 545 -6 801 827
5 2013 10 1 539 545 -6 917 933
6 2013 10 1 544 550 -6 912 932
7 2013 10 1 549 600 -11 653 716
8 2013 10 1 550 600 -10 648 700
9 2013 10 1 550 600 -10 649 659
10 2013 10 1 551 600 -9 727 730
# ℹ 111,856 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
6.3 arrange(): arrange rows
- arrange() sorts the data. That’s it! Give it the column names and it will sort the data by those columns.
- It’s often a good idea to sort your data before saving it (or looking at it) as it makes it easier to navigate
- There are also some data manipulation tricks that rely on the position of the data
- If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.
```{r}
flights %>%
arrange(year, month, day)
flights %>%
arrange(desc(dep_delay)) # arrange rows by descending order of dep_delay
```
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 9 641 900 1301 1242 1530
2 2013 6 15 1432 1935 1137 1607 2120
3 2013 1 10 1121 1635 1126 1239 1810
4 2013 9 20 1139 1845 1014 1457 2210
5 2013 7 22 845 1600 1005 1044 1815
6 2013 4 10 1100 1900 960 1342 2211
7 2013 3 17 2321 810 911 135 1020
8 2013 6 27 959 1900 899 1236 2226
9 2013 7 22 2257 759 898 121 1026
10 2013 12 5 756 1700 896 1058 2020
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
```{r}
df2 <- tibble(x = c(5, 2, NA))
arrange(df2, x) # Missing values are always sorted at the end
arrange(df2, desc(x)) # Missing values are always sorted at the end
```
# A tibble: 3 × 1
x
<dbl>
1 2
2 5
3 NA
# A tibble: 3 × 1
x
<dbl>
1 5
2 2
3 NA
```{r}
cars <- as_tibble(mtcars)
arrange(cars, mpg)
arrange(cars, desc(mpg))
cars <- arrange(cars, cyl, -mpg) # sort by ascending order of cyl and then descending order of mpg
cars # confirm the operation
```
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
2 10.4 8 460 215 3 5.42 17.8 0 0 3 4
3 13.3 8 350 245 3.73 3.84 15.4 0 0 3 4
4 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
5 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
6 15 8 301 335 3.54 3.57 14.6 0 1 5 8
7 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
8 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
9 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 15.8 8 351 264 4.22 3.17 14.5 0 1 5 4
# ℹ 22 more rows
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
4 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
5 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
6 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
7 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
8 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
# ℹ 22 more rows
# A tibble: 32 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
4 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
5 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
6 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
7 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
8 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
# ℹ 22 more rows
6.4 select(): select columns
- starts_with()
- ends_with()
- contains()
- any_of()
```{r}
select(cars, mpg, cyl, hp, wt)
flights %>%
select(year, month, day, dep_time) %>%
select(year:day)
select(flights, year:day)
select(flights, -c(year:day))
select(flights, starts_with("sched"))
select(flights, ends_with("time"))
select(flights, contains("time"))
rename(flights, tail_num = tailnum)
select(flights, time_hour, air_time, everything()) #everything helper is good when we want to move some variable to the front of the data frame.
vars <- c("year", "month", "day", "dep_delay", "arr_delay")
# any_of(vars)
select(flights, any_of(vars))
```
# A tibble: 32 × 4
mpg cyl hp wt
<dbl> <dbl> <dbl> <dbl>
1 33.9 4 65 1.84
2 32.4 4 66 2.2
3 30.4 4 52 1.62
4 30.4 4 113 1.51
5 27.3 4 66 1.94
6 26 4 91 2.14
7 24.4 4 62 3.19
8 22.8 4 93 2.32
9 22.8 4 95 3.15
10 21.5 4 97 2.46
# ℹ 22 more rows
# A tibble: 336,776 × 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# ℹ 336,766 more rows
# A tibble: 336,776 × 3
year month day
<int> <int> <int>
1 2013 1 1
2 2013 1 1
3 2013 1 1
4 2013 1 1
5 2013 1 1
6 2013 1 1
7 2013 1 1
8 2013 1 1
9 2013 1 1
10 2013 1 1
# ℹ 336,766 more rows
# A tibble: 336,776 × 16
dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
<int> <int> <dbl> <int> <int> <dbl> <chr>
1 517 515 2 830 819 11 UA
2 533 529 4 850 830 20 UA
3 542 540 2 923 850 33 AA
4 544 545 -1 1004 1022 -18 B6
5 554 600 -6 812 837 -25 DL
6 554 558 -4 740 728 12 UA
7 555 600 -5 913 854 19 B6
8 557 600 -3 709 723 -14 EV
9 557 600 -3 838 846 -8 B6
10 558 600 -2 753 745 8 AA
# ℹ 336,766 more rows
# ℹ 9 more variables: flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 336,776 × 2
sched_dep_time sched_arr_time
<int> <int>
1 515 819
2 529 830
3 540 850
4 545 1022
5 600 837
6 558 728
7 600 854
8 600 723
9 600 846
10 600 745
# ℹ 336,766 more rows
# A tibble: 336,776 × 5
dep_time sched_dep_time arr_time sched_arr_time air_time
<int> <int> <int> <int> <dbl>
1 517 515 830 819 227
2 533 529 850 830 227
3 542 540 923 850 160
4 544 545 1004 1022 183
5 554 600 812 837 116
6 554 558 740 728 150
7 555 600 913 854 158
8 557 600 709 723 53
9 557 600 838 846 140
10 558 600 753 745 138
# ℹ 336,766 more rows
# A tibble: 336,776 × 6
dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
<int> <int> <int> <int> <dbl> <dttm>
1 517 515 830 819 227 2013-01-01 05:00:00
2 533 529 850 830 227 2013-01-01 05:00:00
3 542 540 923 850 160 2013-01-01 05:00:00
4 544 545 1004 1022 183 2013-01-01 05:00:00
5 554 600 812 837 116 2013-01-01 06:00:00
6 554 558 740 728 150 2013-01-01 05:00:00
7 555 600 913 854 158 2013-01-01 06:00:00
8 557 600 709 723 53 2013-01-01 06:00:00
9 557 600 838 846 140 2013-01-01 06:00:00
10 558 600 753 745 138 2013-01-01 06:00:00
# ℹ 336,766 more rows
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tail_num <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 336,776 × 19
time_hour air_time year month day dep_time sched_dep_time
<dttm> <dbl> <int> <int> <int> <int> <int>
1 2013-01-01 05:00:00 227 2013 1 1 517 515
2 2013-01-01 05:00:00 227 2013 1 1 533 529
3 2013-01-01 05:00:00 160 2013 1 1 542 540
4 2013-01-01 05:00:00 183 2013 1 1 544 545
5 2013-01-01 06:00:00 116 2013 1 1 554 600
6 2013-01-01 05:00:00 150 2013 1 1 554 558
7 2013-01-01 06:00:00 158 2013 1 1 555 600
8 2013-01-01 06:00:00 53 2013 1 1 557 600
9 2013-01-01 06:00:00 140 2013 1 1 557 600
10 2013-01-01 06:00:00 138 2013 1 1 558 600
# ℹ 336,766 more rows
# ℹ 12 more variables: dep_delay <dbl>, arr_time <int>, sched_arr_time <int>,
# arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, distance <dbl>, hour <dbl>, minute <dbl>
# A tibble: 336,776 × 5
year month day dep_delay arr_delay
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 2 11
2 2013 1 1 4 20
3 2013 1 1 2 33
4 2013 1 1 -1 -18
5 2013 1 1 -6 -25
6 2013 1 1 -4 12
7 2013 1 1 -5 19
8 2013 1 1 -3 -14
9 2013 1 1 -3 -8
10 2013 1 1 -2 8
# ℹ 336,766 more rows
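One detail worth knowing about the selection helpers: any_of() silently skips names that are absent from the data, while its stricter sibling all_of() raises an error for them. A minimal sketch (the tiny tibble and the name "z" are invented for illustration):

```r
library(dplyr)

df <- tibble(a = 1, b = 2)
vars <- c("a", "b", "z")     # "z" is not a column of df

select(df, any_of(vars))     # keeps a and b, silently skips "z"
# select(df, all_of(vars))   # would error: column "z" doesn't exist
```

Use all_of() when a missing column should be treated as a bug, and any_of() when the variable list may legitimately contain names the data lacks.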
6.5 mutate(): Add new variables
- Creates a new variable and adds it to the data set.
- The new column is not saved until the result is assigned to a name.
- You can create multiple variables in the same mutate() call, separated by commas.
```{r}
mutate(cars, powerToWeight = hp/wt)
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
flights_sml
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60 # speed expressed in the number of miles per an hour.
)
flights_sml <- flights_sml |>
mutate(gain = dep_delay - arr_delay,
speed = distance / air_time * 60)
flights_sml
```
# A tibble: 32 × 12
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
4 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
5 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
6 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
7 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
8 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
# ℹ 22 more rows
# ℹ 1 more variable: powerToWeight <dbl>
# A tibble: 336,776 × 7
year month day dep_delay arr_delay distance air_time
<int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227
2 2013 1 1 4 20 1416 227
3 2013 1 1 2 33 1089 160
4 2013 1 1 -1 -18 1576 183
5 2013 1 1 -6 -25 762 116
6 2013 1 1 -4 12 719 150
7 2013 1 1 -5 19 1065 158
8 2013 1 1 -3 -14 229 53
9 2013 1 1 -3 -8 944 140
10 2013 1 1 -2 8 733 138
# ℹ 336,766 more rows
# A tibble: 336,776 × 9
year month day dep_delay arr_delay distance air_time gain speed
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 -9 370.
2 2013 1 1 4 20 1416 227 -16 374.
3 2013 1 1 2 33 1089 160 -31 408.
4 2013 1 1 -1 -18 1576 183 17 517.
5 2013 1 1 -6 -25 762 116 19 394.
6 2013 1 1 -4 12 719 150 -16 288.
7 2013 1 1 -5 19 1065 158 -24 404.
8 2013 1 1 -3 -14 229 53 11 259.
9 2013 1 1 -3 -8 944 140 5 405.
10 2013 1 1 -2 8 733 138 -10 319.
# ℹ 336,766 more rows
# A tibble: 336,776 × 9
year month day dep_delay arr_delay distance air_time gain speed
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 2 11 1400 227 -9 370.
2 2013 1 1 4 20 1416 227 -16 374.
3 2013 1 1 2 33 1089 160 -31 408.
4 2013 1 1 -1 -18 1576 183 17 517.
5 2013 1 1 -6 -25 762 116 19 394.
6 2013 1 1 -4 12 719 150 -16 288.
7 2013 1 1 -5 19 1065 158 -24 404.
8 2013 1 1 -3 -14 229 53 11 259.
9 2013 1 1 -3 -8 944 140 5 405.
10 2013 1 1 -2 8 733 138 -10 319.
# ℹ 336,766 more rows
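The learning outcomes also mention case_when() alongside mutate(); a minimal sketch using the built-in mtcars data (the "efficiency" variable and its mpg cut-offs are invented for illustration):

```r
library(dplyr)

as_tibble(mtcars) |>
  mutate(efficiency = case_when(
    mpg >= 25 ~ "high",    # the first matching condition wins
    mpg >= 18 ~ "medium",
    TRUE      ~ "low"      # catch-all (dplyr >= 1.1.0 also offers .default =)
  )) |>
  count(efficiency)
```

Conditions are checked in order, so later conditions only apply to rows not already matched above.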
6.5.1 transmute()
- Adds new variables but drops all existing variables.
```{r}
transmute(flights_sml,
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours
) -> flights_sml_cal
flights_sml_cal
```
# A tibble: 336,776 × 3
gain hours gain_per_hour
<dbl> <dbl> <dbl>
1 -9 3.78 -2.38
2 -16 3.78 -4.23
3 -31 2.67 -11.6
4 17 3.05 5.57
5 19 1.93 9.83
6 -16 2.5 -6.4
7 -24 2.63 -9.11
8 11 0.883 12.5
9 5 2.33 2.14
10 -10 2.3 -4.35
# ℹ 336,766 more rows
6.5.2 %/% vs. %%
- %/% is the integer division operator.
- Example: 5 %/% 3 equals 1, the quotient from the integer division of 5 by 3.
- %% is the modulo operator, which gives the remainder after integer division.
- In the same example, 5 %% 3 equals 2, the remainder.
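In plain numbers, including the HMM-time trick used with dep_time:

```r
5 %/% 3      # integer division: quotient, 1
5 %% 3       # modulo: remainder, 2
517 %/% 100  # 5  -> the "hour" part of a time stored as HMM
517 %% 100   # 17 -> the "minute" part
```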
```{r}
transmute(flights,
dep_time, # Actual departure times (format HHMM or HMM), local tz.
hour = dep_time %/% 100, # quotient after division
minute = dep_time %% 100) # remainder
```
# A tibble: 336,776 × 3
dep_time hour minute
<int> <dbl> <dbl>
1 517 5 17
2 533 5 33
3 542 5 42
4 544 5 44
5 554 5 54
6 554 5 54
7 555 5 55
8 557 5 57
9 557 5 57
10 558 5 58
# ℹ 336,766 more rows
6.6 summarise(): grouped summaries
- Collapses each group down to a single summary row; it does not add a new variable to the data set.
- Typically used together with group_by().
```{r}
summarise(cars, avgMPG = mean(mpg)) #not that useful
by_cyl <- group_by(cars, cyl) #group by cylinders
summarise(by_cyl, avgMPG = mean(mpg)) #summarise function now gives output by cylinders
cars |>
group_by(cyl) |>
summarise(avgMPG = mean(mpg))
```
# A tibble: 1 × 1
avgMPG
<dbl>
1 20.1
# A tibble: 3 × 2
cyl avgMPG
<dbl> <dbl>
1 4 26.7
2 6 19.7
3 8 15.1
# A tibble: 3 × 2
cyl avgMPG
<dbl> <dbl>
1 4 26.7
2 6 19.7
3 8 15.1
7 group_by() and summarize() with %>%
- group_by() turns the dataset into a grouped data set, splitting it by each combination of the grouping variables
- Calculations like mutate() or (up next) summarize() or (if you want to get fancy) group_map() then process the data separately for each group
- The data maintains this grouping structure until you re-group_by() it, ungroup() it, or summarize() it (which removes the last grouping variable)
- How is this useful in preparing data? Remember, we want to look at where information is and think about how we can get it where we need it to be
- group_by() helps us move information from one row to another within a key variable - otherwise a difficult move!
- It can also let us change the observation level with summarize()
- Tip: n() gives the number of rows in the group - handy! And row_number() gives the row number of an observation within its group
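A minimal sketch of the "move information between rows" idea: a grouped mutate() attaches group-level values to every row, without collapsing the data the way summarize() does (the column names are my own):

```r
library(dplyr)

as_tibble(mtcars) |>
  group_by(cyl) |>
  mutate(
    grp_mean_mpg = mean(mpg),    # group mean, repeated on every row of the group
    row_in_group = row_number(), # this row's position within its cyl group
    n_in_group   = n()           # group size, repeated on every row
  ) |>
  ungroup()
```

All 32 rows survive; each row simply gains columns computed from its group.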
7.1 Example 1
```{r 4.6.1 video}
cars |>
group_by(cyl) |>
summarise(avgMPG = mean(mpg))
cars %>%
group_by(cyl) %>%
summarise(avgMPG = mean(mpg),
n = n()) %>%
mutate(id = row_number()) %>% # Add unique row number to a data frame
relocate(id) # move a column to the first column in the data frame.
```
# A tibble: 3 × 2
cyl avgMPG
<dbl> <dbl>
1 4 26.7
2 6 19.7
3 8 15.1
# A tibble: 3 × 4
id cyl avgMPG n
<int> <dbl> <dbl> <int>
1 1 4 26.7 11
2 2 6 19.7 7
3 3 8 15.1 14
7.2 Example 2
```{r 4.6.1. Extra}
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
flights %>%
group_by(year, month, day) %>%
summarise(delay = mean(dep_delay, na.rm = T),
n = n()
) %>%
mutate(row_id = row_number()) %>%
relocate(row_id)
```
# A tibble: 1 × 1
delay
<dbl>
1 12.6
# A tibble: 365 × 6
# Groups: year, month [12]
row_id year month day delay n
<int> <int> <int> <int> <dbl> <int>
1 1 2013 1 1 11.5 842
2 2 2013 1 2 13.9 943
3 3 2013 1 3 11.0 914
4 4 2013 1 4 8.95 915
5 5 2013 1 5 5.73 720
6 6 2013 1 6 7.15 832
7 7 2013 1 7 5.42 933
8 8 2013 1 8 2.55 899
9 9 2013 1 9 2.28 902
10 10 2013 1 10 2.84 932
# ℹ 355 more rows
8 Grouped mutates and filters
- Grouped mutates and filters (combine what we did before)
8.1 Example 1
```{r 4.7.1 video}
as_tibble(mtcars) %>%
select(mpg, cyl, hp, wt) %>% #select only these cols
mutate(ptw = hp/wt) %>% #calculate power to weight ratio
filter(mpg > 18, ptw > 40) #filter by power to weight and mpg
# The same operation without piping:
# each intermediate result must be saved before the next step, which is verbose.
narrow_cars <- select(as_tibble(mtcars), mpg, cyl, hp, wt)
narrow_cars <- mutate(narrow_cars, ptw=hp/wt)
narrow_cars <- filter(narrow_cars, mpg>18, ptw > 40)
narrow_cars
```
# A tibble: 7 × 5
mpg cyl hp wt ptw
<dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 110 2.62 42.0
2 22.8 4 93 2.32 40.1
3 18.7 8 175 3.44 50.9
4 19.2 8 175 3.84 45.5
5 26 4 91 2.14 42.5
6 30.4 4 113 1.51 74.7
7 19.7 6 175 2.77 63.2
# A tibble: 7 × 5
mpg cyl hp wt ptw
<dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 110 2.62 42.0
2 22.8 4 93 2.32 40.1
3 18.7 8 175 3.44 50.9
4 19.2 8 175 3.84 45.5
5 26 4 91 2.14 42.5
6 30.4 4 113 1.51 74.7
7 19.7 6 175 2.77 63.2
8.2 Example 2
```{r 4.7.1 Extra}
# Find the worst members of each group:
flights_sml %>%
group_by(year, month, day) %>%
filter(rank(desc(arr_delay)) < 10)
# Find all groups bigger than a threshold:
popular_dests <- flights %>%
group_by(dest) %>%
filter(n() > 365)
popular_dests
# Standardize to compute per group metrics:
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(arr_delay_sum = sum(arr_delay),
prop_delay = arr_delay / sum(arr_delay),
n = n()) %>%
select(year:day, dest, arr_delay, arr_delay_sum, prop_delay, n)
```
# A tibble: 3,306 × 9
# Groups: year, month, day [365]
year month day dep_delay arr_delay distance air_time gain speed
<int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013 1 1 853 851 184 41 2 269.
2 2013 1 1 290 338 1134 213 -48 319.
3 2013 1 1 260 263 266 46 -3 347.
4 2013 1 1 157 174 213 60 -17 213
5 2013 1 1 216 222 708 121 -6 351.
6 2013 1 1 255 250 589 115 5 307.
7 2013 1 1 285 246 1085 146 39 446.
8 2013 1 1 192 191 199 44 1 271.
9 2013 1 1 379 456 1092 222 -77 295.
10 2013 1 2 224 207 550 94 17 351.
# ℹ 3,296 more rows
# A tibble: 332,577 × 19
# Groups: dest [77]
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 332,567 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 131,106 × 8
# Groups: dest [77]
year month day dest arr_delay arr_delay_sum prop_delay n
<int> <int> <int> <chr> <dbl> <dbl> <dbl> <int>
1 2013 1 1 IAH 11 99391 0.000111 2883
2 2013 1 1 IAH 20 99391 0.000201 2883
3 2013 1 1 MIA 33 140424 0.000235 3855
4 2013 1 1 ORD 12 283046 0.0000424 6198
5 2013 1 1 FLL 19 202605 0.0000938 5212
6 2013 1 1 ORD 8 283046 0.0000283 6198
7 2013 1 1 LAX 7 203226 0.0000344 5967
8 2013 1 1 DFW 31 110009 0.000282 2876
9 2013 1 1 ATL 12 300299 0.0000400 7946
10 2013 1 1 DTW 16 138258 0.000116 3257
# ℹ 131,096 more rows
9 Demonstration of data wrangling and visualization (DWV) with piping
9.1 Example 1
- Wrangling and Visualization
```{r wrangling & Viz}
as_tibble(mtcars) %>%
select(mpg, cyl, hp, wt) %>% #select only these cols
mutate(ptw = hp/wt) %>% #calculate power to weight ratio
filter(mpg > 10, ptw > 35) %>% #filter by power to weight and mpg
group_by(cyl) %>%
summarise(ptw_mean = mean(ptw)) %>% # create mean of ptw by cylinder type
ggplot(aes(x = factor(cyl), y = ptw_mean))+
geom_col(aes(fill = factor(cyl)), show.legend = FALSE) # map color to cylinder
```
9.2 Example 2
```{r}
# data
flights
# wrangling data
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = T),
delay = mean(arr_delay, na.rm = T)
) %>%
filter(count > 20, dest != "HNL")
```
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 96 × 4
dest count dist delay
<chr> <int> <dbl> <dbl>
1 ABQ 254 1826 4.38
2 ACK 265 199 4.85
3 ALB 439 143 14.4
4 ATL 17215 757. 11.3
5 AUS 2439 1514. 6.02
6 AVL 275 584. 8.00
7 BDL 443 116 7.05
8 BGR 375 378 8.03
9 BHM 297 866. 16.9
10 BNA 6333 758. 11.8
# ℹ 86 more rows
9.3 Example 3
- Adding visualization to the wrangled data
```{r}
flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = T),
delay = mean(arr_delay, na.rm = T)
) %>%
filter(count > 20, dest != "HNL") %>%
ggplot(aes(dist, delay, size = count)) +
geom_point(alpha = 1/3) +
geom_smooth(method = lm, se = FALSE, show.legend = FALSE) +
scale_size_continuous(labels = scales::comma) +
labs(x = "Distance in Miles Flown",
y = "Airplane Arrival Delay in Minutes",
title = "Airline Arrival Delay vs. Distance",
       subtitle = "Airplanes that fly longer distances tend to delay less")
```
The farther the distance, the less the delay.
10 Missing values
- Why are the two results different?
```{r}
flights %>%
group_by(year, month, day) %>%
summarise(mean_delay = mean(dep_delay),
n = n())
flights %>%
group_by(year, month, day) %>%
summarise(mean_delay = mean(dep_delay, na.rm = TRUE),
n = n())
```
# A tibble: 365 × 5
# Groups: year, month [12]
year month day mean_delay n
<int> <int> <int> <dbl> <int>
1 2013 1 1 NA 842
2 2013 1 2 NA 943
3 2013 1 3 NA 914
4 2013 1 4 NA 915
5 2013 1 5 NA 720
6 2013 1 6 NA 832
7 2013 1 7 NA 933
8 2013 1 8 NA 899
9 2013 1 9 NA 902
10 2013 1 10 NA 932
# ℹ 355 more rows
# A tibble: 365 × 5
# Groups: year, month [12]
year month day mean_delay n
<int> <int> <int> <dbl> <int>
1 2013 1 1 11.5 842
2 2013 1 2 13.9 943
3 2013 1 3 11.0 914
4 2013 1 4 8.95 915
5 2013 1 5 5.73 720
6 2013 1 6 7.15 832
7 2013 1 7 5.42 933
8 2013 1 8 2.55 899
9 2013 1 9 2.28 902
10 2013 1 10 2.84 932
# ℹ 355 more rows
- Mean values after NAs are removed.
```{r}
not_cancelled <- flights %>%
filter(!is.na(dep_delay) & !is.na(arr_delay))
flights %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(year, month, day) %>%
summarise(mean_delay = mean(dep_delay))
```
# A tibble: 327,346 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 327,336 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
# A tibble: 365 × 4
# Groups: year, month [12]
year month day mean_delay
<int> <int> <int> <dbl>
1 2013 1 1 11.4
2 2013 1 2 13.7
3 2013 1 3 10.9
4 2013 1 4 8.97
5 2013 1 5 5.73
6 2013 1 6 7.15
7 2013 1 7 5.42
8 2013 1 8 2.56
9 2013 1 9 2.30
10 2013 1 10 2.84
# ℹ 355 more rows
11 Counts
- Whenever you do any aggregation, it’s always a good idea to include either a count (n()), or a count of non-missing values (sum(!is.na(x))).
```{r}
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(delay = mean(arr_delay),
count = n()
)
delays |>
arrange(desc(delay))
# there are some planes that have an average delay of 5 hours (300 minutes)!
ggplot(data = delays, mapping = aes(x = delay)) +
geom_freqpoly(binwidth = 10)
# Get n always
delays <- not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
delays
## the variation decreases as the sample size increases.
ggplot(data = delays, mapping = aes(x = delay, y = n)) +
geom_point(alpha = 1/10)
# Filtering out means with small sample size
delays %>%
filter(n > 25) %>%
ggplot(mapping = aes(x = delay, y = n)) +
geom_point(alpha = 1/10)
```
# A tibble: 4,037 × 3
tailnum delay count
<chr> <dbl> <int>
1 N844MH 320 1
2 N911DA 294 1
3 N922EV 276 1
4 N587NW 264 1
5 N851NW 219 1
6 N928DN 201 1
7 N7715E 188 1
8 N654UA 185 1
9 N665MQ 175. 6
10 N427SW 157 1
# ℹ 4,027 more rows
# A tibble: 4,037 × 3
tailnum delay n
<chr> <dbl> <int>
1 D942DN 31.5 4
2 N0EGMQ 9.98 352
3 N10156 12.7 145
4 N102UW 2.94 48
5 N103US -6.93 46
6 N104UW 1.80 46
7 N10575 20.7 269
8 N105UW -0.267 45
9 N107US -5.73 41
10 N108UW -1.25 60
# ℹ 4,027 more rows
12 Useful summary functions
12.1 logical subsetting
- Median: the median is a value where 50% of x is above it, and 50% is below it.
- Logical subsetting
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
avg_delay1 = mean(arr_delay),
avg_delay2 = mean(arr_delay[arr_delay > 0]) # the average positive delay
)
```
# A tibble: 365 × 5
# Groups: year, month [12]
year month day avg_delay1 avg_delay2
<int> <int> <int> <dbl> <dbl>
1 2013 1 1 12.7 32.5
2 2013 1 2 12.7 32.0
3 2013 1 3 5.73 27.7
4 2013 1 4 -1.93 28.3
5 2013 1 5 -1.53 22.6
6 2013 1 6 4.24 24.4
7 2013 1 7 -4.95 27.8
8 2013 1 8 -3.23 20.8
9 2013 1 9 -0.264 25.6
10 2013 1 10 -5.90 27.3
# ℹ 355 more rows
12.2 Measures of spread: sd(x), IQR(x), mad(x)
- Measures of spread: sd(x), IQR(x), mad(x).
```{r}
# Why is distance to some destinations more variable than to others?
not_cancelled %>%
group_by(dest) %>%
summarise(distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
```
# A tibble: 104 × 2
dest distance_sd
<chr> <dbl>
1 EGE 10.5
2 SAN 10.4
3 SFO 10.2
4 HNL 10.0
5 SEA 9.98
6 LAS 9.91
7 PDX 9.87
8 PHX 9.86
9 LAX 9.66
10 IND 9.46
# ℹ 94 more rows
12.3 Measures of rank: min(), quantile(), max()
- Measures of rank: min(x), quantile(x, 0.25), max(x).
```{r}
# When do the first and last flights leave each day?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first = min(dep_time),
last = max(dep_time)
)
```
# A tibble: 365 × 5
# Groups: year, month [12]
year month day first last
<int> <int> <int> <int> <int>
1 2013 1 1 517 2356
2 2013 1 2 42 2354
3 2013 1 3 32 2349
4 2013 1 4 25 2358
5 2013 1 5 14 2357
6 2013 1 6 16 2355
7 2013 1 7 49 2359
8 2013 1 8 454 2351
9 2013 1 9 2 2252
10 2013 1 10 3 2320
# ℹ 355 more rows
12.4 Measures of position: first(), nth(), last()
- Measures of position: first(x), nth(x, 2), last(x).
- These work similarly to x[1], x[2], and x[length(x)], but let you set a default value if that position does not exist (e.g., trying to get the 3rd element from a group that has only two elements).
```{r}
not_cancelled %>%
group_by(year, month, day) %>%
summarise(
first_dep = first(dep_time),
last_dep = last(dep_time),
`2nd_dep` = nth(dep_time, 2)
)
```
# A tibble: 365 × 6
# Groups: year, month [12]
year month day first_dep last_dep `2nd_dep`
<int> <int> <int> <int> <int> <int>
1 2013 1 1 517 2356 533
2 2013 1 2 42 2354 126
3 2013 1 3 32 2349 50
4 2013 1 4 25 2358 106
5 2013 1 5 14 2357 37
6 2013 1 6 16 2355 458
7 2013 1 7 49 2359 454
8 2013 1 8 454 2351 524
9 2013 1 9 2 2252 8
10 2013 1 10 3 2320 16
# ℹ 355 more rows
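The `default` fallback mentioned above can be shown on a plain vector (the values here are invented for illustration):

```r
library(dplyr)

x <- c(10, 20)
nth(x, 2)                # 20
nth(x, 3)                # NA: there is no 3rd element
nth(x, 3, default = -1)  # -1: the supplied default is used instead
```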
12.5 sum(!is.na(x)) / n_distinct(x)
- To count the number of non-missing values, use sum(!is.na(x)).
- To count the number of distinct (unique) values, use n_distinct(x).
```{r}
# Which destinations have the most carriers?
not_cancelled %>%
group_by(dest) %>%
summarise(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
glimpse(not_cancelled)
skimr::skim(not_cancelled)
# Get n always
not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = n()
)
# n() is the same as sum(!is.na(x)) when there is no missing data for x, which is arr_delay here.
not_cancelled %>%
group_by(tailnum) %>%
summarise(
delay = mean(arr_delay, na.rm = TRUE),
n = sum(!is.na(arr_delay))
)
```# A tibble: 104 × 2
dest carriers
<chr> <int>
1 ATL 7
2 BOS 7
3 CLT 7
4 ORD 7
5 TPA 7
6 AUS 6
7 DCA 6
8 DTW 6
9 IAD 6
10 MSP 6
# ℹ 94 more rows
Rows: 327,346
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
| Name | not_cancelled |
| Number of rows | 327346 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 14 |
| POSIXct | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| carrier | 0 | 1 | 2 | 2 | 0 | 16 | 0 |
| tailnum | 0 | 1 | 5 | 6 | 0 | 4037 | 0 |
| origin | 0 | 1 | 3 | 3 | 0 | 3 | 0 |
| dest | 0 | 1 | 3 | 3 | 0 | 104 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2013.00 | 0.00 | 2013 | 2013 | 2013 | 2013 | 2013 | ▁▁▇▁▁ |
| month | 0 | 1 | 6.56 | 3.41 | 1 | 4 | 7 | 10 | 12 | ▇▆▆▆▇ |
| day | 0 | 1 | 15.74 | 8.78 | 1 | 8 | 16 | 23 | 31 | ▇▇▇▇▆ |
| dep_time | 0 | 1 | 1348.79 | 488.32 | 1 | 907 | 1400 | 1744 | 2400 | ▁▇▆▇▃ |
| sched_dep_time | 0 | 1 | 1340.34 | 467.41 | 500 | 905 | 1355 | 1729 | 2359 | ▇▆▇▇▃ |
| dep_delay | 0 | 1 | 12.56 | 40.07 | -43 | -5 | -2 | 11 | 1301 | ▇▁▁▁▁ |
| arr_time | 0 | 1 | 1501.91 | 532.89 | 1 | 1104 | 1535 | 1940 | 2400 | ▁▃▇▇▇ |
| sched_arr_time | 0 | 1 | 1532.79 | 497.98 | 1 | 1122 | 1554 | 1944 | 2359 | ▁▃▇▇▇ |
| arr_delay | 0 | 1 | 6.90 | 44.63 | -86 | -17 | -5 | 14 | 1272 | ▇▁▁▁▁ |
| flight | 0 | 1 | 1943.10 | 1621.52 | 1 | 544 | 1467 | 3412 | 8500 | ▇▃▃▁▁ |
| air_time | 0 | 1 | 150.69 | 93.69 | 20 | 82 | 129 | 192 | 695 | ▇▂▂▁▁ |
| distance | 0 | 1 | 1048.37 | 735.91 | 80 | 509 | 888 | 1389 | 4983 | ▇▃▂▁▁ |
| hour | 0 | 1 | 13.14 | 4.66 | 5 | 9 | 13 | 17 | 23 | ▇▆▆▇▃ |
| minute | 0 | 1 | 26.23 | 19.30 | 0 | 8 | 29 | 44 | 59 | ▇▃▆▃▅ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| time_hour | 0 | 1 | 2013-01-01 05:00:00 | 2013-12-31 23:00:00 | 2013-07-04 09:00:00 | 6922 |
# A tibble: 4,037 × 3
tailnum delay n
<chr> <dbl> <int>
1 D942DN 31.5 4
2 N0EGMQ 9.98 352
3 N10156 12.7 145
4 N102UW 2.94 48
5 N103US -6.93 46
6 N104UW 1.80 46
7 N10575 20.7 269
8 N105UW -0.267 45
9 N107US -5.73 41
10 N108UW -1.25 60
# ℹ 4,027 more rows
# A tibble: 4,037 × 3
tailnum delay n
<chr> <dbl> <int>
1 D942DN 31.5 4
2 N0EGMQ 9.98 352
3 N10156 12.7 145
4 N102UW 2.94 48
5 N103US -6.93 46
6 N104UW 1.80 46
7 N10575 20.7 269
8 N105UW -0.267 45
9 N107US -5.73 41
10 N108UW -1.25 60
# ℹ 4,027 more rows
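The two counting helpers above can be sketched on a toy vector (values invented):

```{r}
library(dplyr)

x <- c(3, 7, 7, NA, 3)
sum(!is.na(x))              # 4 non-missing values
n_distinct(x)               # 3 distinct values (NA counts as one)
n_distinct(x, na.rm = TRUE) # 2 distinct non-missing values
```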
12.6 count() = group_by() followed by summarize(n = n())
- count() is shorthand for group_by() followed by summarize(n = n()).
```{r}
not_cancelled %>%
count(dest)
# same as above
not_cancelled %>%
group_by(dest) %>%
summarize(n = n())
# add percentage
not_cancelled %>%
count(dest) |>
mutate(percent = n/sum(n))
```# A tibble: 104 × 2
dest n
<chr> <int>
1 ABQ 254
2 ACK 264
3 ALB 418
4 ANC 8
5 ATL 16837
6 AUS 2411
7 AVL 261
8 BDL 412
9 BGR 358
10 BHM 269
# ℹ 94 more rows
# A tibble: 104 × 2
dest n
<chr> <int>
1 ABQ 254
2 ACK 264
3 ALB 418
4 ANC 8
5 ATL 16837
6 AUS 2411
7 AVL 261
8 BDL 412
9 BGR 358
10 BHM 269
# ℹ 94 more rows
# A tibble: 104 × 3
dest n percent
<chr> <int> <dbl>
1 ABQ 254 0.000776
2 ACK 264 0.000806
3 ALB 418 0.00128
4 ANC 8 0.0000244
5 ATL 16837 0.0514
6 AUS 2411 0.00737
7 AVL 261 0.000797
8 BDL 412 0.00126
9 BGR 358 0.00109
10 BHM 269 0.000822
# ℹ 94 more rows
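count() also accepts sort = TRUE, which saves the separate arrange(desc(n)) step; a sketch with a made-up data frame:

```{r}
library(dplyr)

toy <- tibble(dest = c("ATL", "ATL", "BOS", "ATL", "BOS", "ORD"))

toy %>% count(dest, sort = TRUE) # most frequent destination first
# same as
toy %>% count(dest) %>% arrange(desc(n))
```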
12.7 count() with “wt =” argument
- You can optionally provide a weight variable; count() then sums the weights instead of counting rows.
```{r}
# For example, you could use this to "count" (sum) the total number of miles a plane flew:
not_cancelled %>%
count(tailnum, wt = distance)
not_cancelled %>%
count(tailnum)
```# A tibble: 4,037 × 2
tailnum n
<chr> <dbl>
1 D942DN 3418
2 N0EGMQ 239143
3 N10156 109664
4 N102UW 25722
5 N103US 24619
6 N104UW 24616
7 N10575 139903
8 N105UW 23618
9 N107US 21677
10 N108UW 32070
# ℹ 4,027 more rows
# A tibble: 4,037 × 2
tailnum n
<chr> <int>
1 D942DN 4
2 N0EGMQ 352
3 N10156 145
4 N102UW 48
5 N103US 46
6 N104UW 46
7 N10575 269
8 N105UW 45
9 N107US 41
10 N108UW 60
# ℹ 4,027 more rows
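count(x, wt = w) is shorthand for grouping by x and summing w; a sketch with made-up distances (toy and the tailnum values are invented):

```{r}
library(dplyr)

toy <- tibble(
  tailnum  = c("N1", "N1", "N2"),
  distance = c(100, 250, 400)
)

toy %>% count(tailnum, wt = distance) # sums distance per plane
# same as
toy %>% group_by(tailnum) %>% summarise(n = sum(distance))
```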
12.8 sum(x) of TRUE / mean(x) of TRUE
- sum(x) of a logical vector: the number of TRUEs in x (TRUE is coerced to 1, FALSE to 0).
- mean(x) of a logical vector: the proportion of TRUEs in x.
```{r}
# How many flights left before 5am? (these usually indicate delayed
# flights from the previous day)
not_cancelled %>%
group_by(year, month, day) %>%
summarise(n_early = sum(dep_time < 500),
n = n())
# What proportion of flights are delayed by more than an hour?
not_cancelled %>%
group_by(year, month, day) %>%
summarise(hour_prop = mean(arr_delay > 60),
n = n())
```# A tibble: 365 × 5
# Groups: year, month [12]
year month day n_early n
<int> <int> <int> <int> <int>
1 2013 1 1 0 831
2 2013 1 2 3 928
3 2013 1 3 4 900
4 2013 1 4 3 908
5 2013 1 5 3 717
6 2013 1 6 2 829
7 2013 1 7 2 930
8 2013 1 8 1 892
9 2013 1 9 3 893
10 2013 1 10 3 929
# ℹ 355 more rows
# A tibble: 365 × 5
# Groups: year, month [12]
year month day hour_prop n
<int> <int> <int> <dbl> <int>
1 2013 1 1 0.0722 831
2 2013 1 2 0.0851 928
3 2013 1 3 0.0567 900
4 2013 1 4 0.0396 908
5 2013 1 5 0.0349 717
6 2013 1 6 0.0470 829
7 2013 1 7 0.0333 930
8 2013 1 8 0.0213 892
9 2013 1 9 0.0202 893
10 2013 1 10 0.0183 929
# ℹ 355 more rows
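The trick above relies on R coercing TRUE to 1 and FALSE to 0; a sketch with made-up delay values:

```{r}
delays <- c(75, -3, 120, 10, 61) # made-up arrival delays in minutes
delays > 60                      # TRUE, FALSE, TRUE, FALSE, TRUE
sum(delays > 60)                 # 3 flights delayed by more than an hour
mean(delays > 60)                # 0.6, the proportion of such flights
```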
12.9 Grouping by multiple variables
```{r}
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
# sum of daily total flights over the month
(per_month <- summarise(per_day, flights = sum(flights))) #same as below
group_by(flights, year, month) %>%
summarise(n = n())
# sum of monthly total flights over the year.
(per_year <- summarise(per_month, flights = sum(flights))) # same as below
group_by(flights, year) %>%
summarise(n = n())
```# A tibble: 365 × 4
# Groups: year, month [12]
year month day flights
<int> <int> <int> <int>
1 2013 1 1 842
2 2013 1 2 943
3 2013 1 3 914
4 2013 1 4 915
5 2013 1 5 720
6 2013 1 6 832
7 2013 1 7 933
8 2013 1 8 899
9 2013 1 9 902
10 2013 1 10 932
# ℹ 355 more rows
# A tibble: 12 × 3
# Groups: year [1]
year month flights
<int> <int> <int>
1 2013 1 27004
2 2013 2 24951
3 2013 3 28834
4 2013 4 28330
5 2013 5 28796
6 2013 6 28243
7 2013 7 29425
8 2013 8 29327
9 2013 9 27574
10 2013 10 28889
11 2013 11 27268
12 2013 12 28135
# A tibble: 12 × 3
# Groups: year [1]
year month n
<int> <int> <int>
1 2013 1 27004
2 2013 2 24951
3 2013 3 28834
4 2013 4 28330
5 2013 5 28796
6 2013 6 28243
7 2013 7 29425
8 2013 8 29327
9 2013 9 27574
10 2013 10 28889
11 2013 11 27268
12 2013 12 28135
# A tibble: 1 × 2
year flights
<int> <int>
1 2013 336776
# A tibble: 1 × 2
year n
<int> <int>
1 2013 336776
12.9.1 Ungrouping
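A grouped tibble stays grouped until the grouping is removed; ungroup() drops all grouping levels at once. A sketch with a made-up data frame:

```{r}
library(dplyr)

toy <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3))

grouped <- toy %>% group_by(g)
group_vars(grouped)                    # "g": still grouped
grouped %>% ungroup() %>% group_vars() # character(0): grouping removed

# Without ungrouping, later verbs keep operating within each group:
grouped %>% mutate(share = x / sum(x))               # sum(x) per group
grouped %>% ungroup() %>% mutate(share = x / sum(x)) # sum(x) over all rows
```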
12.10 case_when
A function that comes in handy a lot when using mutate() to create a categorical variable is case_when(), which is like ifelse() except it can cleanly handle more than one condition.
Provide case_when() with a series of condition ~ value pairs, separated by commas, and it will go through the conditions one by one for each observation until it finds one that fits. As soon as it finds a match, it stops looking, so any observation caught by an earlier condition is never re-tested by a later one. You can also make the last condition TRUE to supply a value for every observation not caught by the earlier conditions.
```{r}
state.x77
state <- as_tibble(state.x77)
skimr::skim(state)
state %>%
mutate(pop_bracket = case_when(
Population >= quantile(Population, 3/4) ~ ">=75th percentile",
Population >= quantile(Population, 2/4) ~ "50th - 75th percentile",
Population >= quantile(Population, 1/4) ~ "25th - 50th percentile",
TRUE ~ "< 25th percentile")
)
``` Population Income Illiteracy Life Exp Murder HS Grad Frost
Alabama 3615 3624 2.1 69.05 15.1 41.3 20
Alaska 365 6315 1.5 69.31 11.3 66.7 152
Arizona 2212 4530 1.8 70.55 7.8 58.1 15
Arkansas 2110 3378 1.9 70.66 10.1 39.9 65
California 21198 5114 1.1 71.71 10.3 62.6 20
Colorado 2541 4884 0.7 72.06 6.8 63.9 166
Connecticut 3100 5348 1.1 72.48 3.1 56.0 139
Delaware 579 4809 0.9 70.06 6.2 54.6 103
Florida 8277 4815 1.3 70.66 10.7 52.6 11
Georgia 4931 4091 2.0 68.54 13.9 40.6 60
Hawaii 868 4963 1.9 73.60 6.2 61.9 0
Idaho 813 4119 0.6 71.87 5.3 59.5 126
Illinois 11197 5107 0.9 70.14 10.3 52.6 127
Indiana 5313 4458 0.7 70.88 7.1 52.9 122
Iowa 2861 4628 0.5 72.56 2.3 59.0 140
Kansas 2280 4669 0.6 72.58 4.5 59.9 114
Kentucky 3387 3712 1.6 70.10 10.6 38.5 95
Louisiana 3806 3545 2.8 68.76 13.2 42.2 12
Maine 1058 3694 0.7 70.39 2.7 54.7 161
Maryland 4122 5299 0.9 70.22 8.5 52.3 101
Massachusetts 5814 4755 1.1 71.83 3.3 58.5 103
Michigan 9111 4751 0.9 70.63 11.1 52.8 125
Minnesota 3921 4675 0.6 72.96 2.3 57.6 160
Mississippi 2341 3098 2.4 68.09 12.5 41.0 50
Missouri 4767 4254 0.8 70.69 9.3 48.8 108
Montana 746 4347 0.6 70.56 5.0 59.2 155
Nebraska 1544 4508 0.6 72.60 2.9 59.3 139
Nevada 590 5149 0.5 69.03 11.5 65.2 188
New Hampshire 812 4281 0.7 71.23 3.3 57.6 174
New Jersey 7333 5237 1.1 70.93 5.2 52.5 115
New Mexico 1144 3601 2.2 70.32 9.7 55.2 120
New York 18076 4903 1.4 70.55 10.9 52.7 82
North Carolina 5441 3875 1.8 69.21 11.1 38.5 80
North Dakota 637 5087 0.8 72.78 1.4 50.3 186
Ohio 10735 4561 0.8 70.82 7.4 53.2 124
Oklahoma 2715 3983 1.1 71.42 6.4 51.6 82
Oregon 2284 4660 0.6 72.13 4.2 60.0 44
Pennsylvania 11860 4449 1.0 70.43 6.1 50.2 126
Rhode Island 931 4558 1.3 71.90 2.4 46.4 127
South Carolina 2816 3635 2.3 67.96 11.6 37.8 65
South Dakota 681 4167 0.5 72.08 1.7 53.3 172
Tennessee 4173 3821 1.7 70.11 11.0 41.8 70
Texas 12237 4188 2.2 70.90 12.2 47.4 35
Utah 1203 4022 0.6 72.90 4.5 67.3 137
Vermont 472 3907 0.6 71.64 5.5 57.1 168
Virginia 4981 4701 1.4 70.08 9.5 47.8 85
Washington 3559 4864 0.6 71.72 4.3 63.5 32
West Virginia 1799 3617 1.4 69.48 6.7 41.6 100
Wisconsin 4589 4468 0.7 72.48 3.0 54.5 149
Wyoming 376 4566 0.6 70.29 6.9 62.9 173
Area
Alabama 50708
Alaska 566432
Arizona 113417
Arkansas 51945
California 156361
Colorado 103766
Connecticut 4862
Delaware 1982
Florida 54090
Georgia 58073
Hawaii 6425
Idaho 82677
Illinois 55748
Indiana 36097
Iowa 55941
Kansas 81787
Kentucky 39650
Louisiana 44930
Maine 30920
Maryland 9891
Massachusetts 7826
Michigan 56817
Minnesota 79289
Mississippi 47296
Missouri 68995
Montana 145587
Nebraska 76483
Nevada 109889
New Hampshire 9027
New Jersey 7521
New Mexico 121412
New York 47831
North Carolina 48798
North Dakota 69273
Ohio 40975
Oklahoma 68782
Oregon 96184
Pennsylvania 44966
Rhode Island 1049
South Carolina 30225
South Dakota 75955
Tennessee 41328
Texas 262134
Utah 82096
Vermont 9267
Virginia 39780
Washington 66570
West Virginia 24070
Wisconsin 54464
Wyoming 97203
| Name | state |
| Number of rows | 50 |
| Number of columns | 8 |
| _______________________ | |
| Column type frequency: | |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Population | 0 | 1 | 4246.42 | 4464.49 | 365.00 | 1079.50 | 2838.50 | 4968.50 | 21198.0 | ▇▂▁▁▁ |
| Income | 0 | 1 | 4435.80 | 614.47 | 3098.00 | 3992.75 | 4519.00 | 4813.50 | 6315.0 | ▃▅▇▂▁ |
| Illiteracy | 0 | 1 | 1.17 | 0.61 | 0.50 | 0.62 | 0.95 | 1.58 | 2.8 | ▇▃▂▂▁ |
| Life Exp | 0 | 1 | 70.88 | 1.34 | 67.96 | 70.12 | 70.67 | 71.89 | 73.6 | ▃▃▇▅▅ |
| Murder | 0 | 1 | 7.38 | 3.69 | 1.40 | 4.35 | 6.85 | 10.67 | 15.1 | ▆▇▃▇▂ |
| HS Grad | 0 | 1 | 53.11 | 8.08 | 37.80 | 48.05 | 53.25 | 59.15 | 67.3 | ▅▂▇▆▃ |
| Frost | 0 | 1 | 104.46 | 51.98 | 0.00 | 66.25 | 114.50 | 139.75 | 188.0 | ▅▃▅▇▆ |
| Area | 0 | 1 | 70735.88 | 85327.30 | 1049.00 | 36985.25 | 54277.00 | 81162.50 | 566432.0 | ▇▁▁▁▁ |
# A tibble: 50 × 9
Population Income Illiteracy `Life Exp` Murder `HS Grad` Frost Area
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3615 3624 2.1 69.0 15.1 41.3 20 50708
2 365 6315 1.5 69.3 11.3 66.7 152 566432
3 2212 4530 1.8 70.6 7.8 58.1 15 113417
4 2110 3378 1.9 70.7 10.1 39.9 65 51945
5 21198 5114 1.1 71.7 10.3 62.6 20 156361
6 2541 4884 0.7 72.1 6.8 63.9 166 103766
7 3100 5348 1.1 72.5 3.1 56 139 4862
8 579 4809 0.9 70.1 6.2 54.6 103 1982
9 8277 4815 1.3 70.7 10.7 52.6 11 54090
10 4931 4091 2 68.5 13.9 40.6 60 58073
# ℹ 40 more rows
# ℹ 1 more variable: pop_bracket <chr>
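The first-match-wins order described above can be sketched with a toy vector (scores invented):

```{r}
library(dplyr)

x <- c(95, 72, 55, 10)
case_when(
  x >= 90 ~ "high",
  x >= 70 ~ "medium", # only checked for values not already matched
  TRUE    ~ "low"     # catch-all for everything else
)
```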
13 Applying Tidyverse way of DWV to M01-1 Example
In M01-1, there was a data visualization example built around the following question: does city mileage differ by the number of cylinders a vehicle has? If so, how?
Insights: the visualization revealed that city mileage improves as the number of cylinders decreases.
13.1 Without using piping
```{r M01-Principle Data Viz}
mpg
g <- ggplot(mpg, aes(cty))
g + geom_density(aes(fill=factor(cyl)), alpha=0.8) +
labs(title = "Density Plot",
subtitle = "City Mileage Grouped by Number of Cylinders",
caption = "Source: mpg dataset",
x = "City Mileage",
fill="# Cylinders") +
theme_classic()
```# A tibble: 234 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
13.2 Using Piping
Use this approach more often.
```{r tidyverse way}
mpg # use the mpg data set, not the mpg column of the mtcars dataset
mpg %>%
ggplot(aes(cty))+
geom_density(aes(fill=factor(cyl)), alpha=0.8) +
labs(title = "Density Plot",
subtitle = "City Mileage Grouped by Number of Cylinders",
caption = "Source: mpg dataset",
x = "City Mileage",
fill="# Cylinders"
) +
theme_classic()
```# A tibble: 234 × 11
manufacturer model displ year cyl trans drv cty hwy fl class
<chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
3 audi a4 2 2008 4 manu… f 20 31 p comp…
4 audi a4 2 2008 4 auto… f 21 30 p comp…
5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
# ℹ 224 more rows
Try to use piping all the time from now on. It is the tidyverse way of wrangling.
14 Missing data
Topics:
- Dealing with missing data
- Detecting missing data
- Replacing missing data with the mean of the column (so-called mean replacement)
```{r}
airQ.df <- airquality
class(airQ.df)
typeof(airQ.df)
summary(airQ.df)
skimr::skim(airQ.df)
colSums(is.na(airQ.df))
# mean replacement
OzoneMean <- mean(airQ.df$Ozone) # returns NA because of the NA's
OzoneMean <- mean(airQ.df$Ozone, na.rm = TRUE)
is.na(airQ.df["Ozone"])
airQ.df['Ozone'][is.na(airQ.df['Ozone'])] <- OzoneMean
airQ.df
skimr::skim(airQ.df)
```[1] "data.frame"
[1] "list"
Ozone Solar.R Wind Temp
Min. : 1.00 Min. : 7.0 Min. : 1.700 Min. :56.00
1st Qu.: 18.00 1st Qu.:115.8 1st Qu.: 7.400 1st Qu.:72.00
Median : 31.50 Median :205.0 Median : 9.700 Median :79.00
Mean : 42.13 Mean :185.9 Mean : 9.958 Mean :77.88
3rd Qu.: 63.25 3rd Qu.:258.8 3rd Qu.:11.500 3rd Qu.:85.00
Max. :168.00 Max. :334.0 Max. :20.700 Max. :97.00
NA's :37 NA's :7
Month Day
Min. :5.000 Min. : 1.0
1st Qu.:6.000 1st Qu.: 8.0
Median :7.000 Median :16.0
Mean :6.993 Mean :15.8
3rd Qu.:8.000 3rd Qu.:23.0
Max. :9.000 Max. :31.0
| Name | airQ.df |
| Number of rows | 153 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Ozone | 37 | 0.76 | 42.13 | 32.99 | 1.0 | 18.00 | 31.5 | 63.25 | 168.0 | ▇▃▂▁▁ |
| Solar.R | 7 | 0.95 | 185.93 | 90.06 | 7.0 | 115.75 | 205.0 | 258.75 | 334.0 | ▅▃▅▇▅ |
| Wind | 0 | 1.00 | 9.96 | 3.52 | 1.7 | 7.40 | 9.7 | 11.50 | 20.7 | ▂▇▇▃▁ |
| Temp | 0 | 1.00 | 77.88 | 9.47 | 56.0 | 72.00 | 79.0 | 85.00 | 97.0 | ▂▃▇▇▃ |
| Month | 0 | 1.00 | 6.99 | 1.42 | 5.0 | 6.00 | 7.0 | 8.00 | 9.0 | ▇▇▇▇▇ |
| Day | 0 | 1.00 | 15.80 | 8.86 | 1.0 | 8.00 | 16.0 | 23.00 | 31.0 | ▇▇▇▇▆ |
Ozone Solar.R Wind Temp Month Day
37 7 0 0 0 0
Ozone
[1,] FALSE
[2,] FALSE
[3,] FALSE
[4,] FALSE
[5,] TRUE
[6,] FALSE
[7,] FALSE
[8,] FALSE
[9,] FALSE
[10,] TRUE
[11,] FALSE
[12,] FALSE
[13,] FALSE
[14,] FALSE
[15,] FALSE
[16,] FALSE
[17,] FALSE
[18,] FALSE
[19,] FALSE
[20,] FALSE
[21,] FALSE
[22,] FALSE
[23,] FALSE
[24,] FALSE
[25,] TRUE
[26,] TRUE
[27,] TRUE
[28,] FALSE
[29,] FALSE
[30,] FALSE
[31,] FALSE
[32,] TRUE
[33,] TRUE
[34,] TRUE
[35,] TRUE
[36,] TRUE
[37,] TRUE
[38,] FALSE
[39,] TRUE
[40,] FALSE
[41,] FALSE
[42,] TRUE
[43,] TRUE
[44,] FALSE
[45,] TRUE
[46,] TRUE
[47,] FALSE
[48,] FALSE
[49,] FALSE
[50,] FALSE
[51,] FALSE
[52,] TRUE
[53,] TRUE
[54,] TRUE
[55,] TRUE
[56,] TRUE
[57,] TRUE
[58,] TRUE
[59,] TRUE
[60,] TRUE
[61,] TRUE
[62,] FALSE
[63,] FALSE
[64,] FALSE
[65,] TRUE
[66,] FALSE
[67,] FALSE
[68,] FALSE
[69,] FALSE
[70,] FALSE
[71,] FALSE
[72,] TRUE
[73,] FALSE
[74,] FALSE
[75,] TRUE
[76,] FALSE
[77,] FALSE
[78,] FALSE
[79,] FALSE
[80,] FALSE
[81,] FALSE
[82,] FALSE
[83,] TRUE
[84,] TRUE
[85,] FALSE
[86,] FALSE
[87,] FALSE
[88,] FALSE
[89,] FALSE
[90,] FALSE
[91,] FALSE
[92,] FALSE
[93,] FALSE
[94,] FALSE
[95,] FALSE
[96,] FALSE
[97,] FALSE
[98,] FALSE
[99,] FALSE
[100,] FALSE
[101,] FALSE
[102,] TRUE
[103,] TRUE
[104,] FALSE
[105,] FALSE
[106,] FALSE
[107,] TRUE
[108,] FALSE
[109,] FALSE
[110,] FALSE
[111,] FALSE
[112,] FALSE
[113,] FALSE
[114,] FALSE
[115,] TRUE
[116,] FALSE
[117,] FALSE
[118,] FALSE
[119,] TRUE
[120,] FALSE
[121,] FALSE
[122,] FALSE
[123,] FALSE
[124,] FALSE
[125,] FALSE
[126,] FALSE
[127,] FALSE
[128,] FALSE
[129,] FALSE
[130,] FALSE
[131,] FALSE
[132,] FALSE
[133,] FALSE
[134,] FALSE
[135,] FALSE
[136,] FALSE
[137,] FALSE
[138,] FALSE
[139,] FALSE
[140,] FALSE
[141,] FALSE
[142,] FALSE
[143,] FALSE
[144,] FALSE
[145,] FALSE
[146,] FALSE
[147,] FALSE
[148,] FALSE
[149,] FALSE
[150,] TRUE
[151,] FALSE
[152,] FALSE
[153,] FALSE
Ozone Solar.R Wind Temp Month Day
1 41.00000 190 7.4 67 5 1
2 36.00000 118 8.0 72 5 2
3 12.00000 149 12.6 74 5 3
4 18.00000 313 11.5 62 5 4
5 42.12931 NA 14.3 56 5 5
6 28.00000 NA 14.9 66 5 6
7 23.00000 299 8.6 65 5 7
8 19.00000 99 13.8 59 5 8
9 8.00000 19 20.1 61 5 9
10 42.12931 194 8.6 69 5 10
11 7.00000 NA 6.9 74 5 11
12 16.00000 256 9.7 69 5 12
13 11.00000 290 9.2 66 5 13
14 14.00000 274 10.9 68 5 14
15 18.00000 65 13.2 58 5 15
16 14.00000 334 11.5 64 5 16
17 34.00000 307 12.0 66 5 17
18 6.00000 78 18.4 57 5 18
19 30.00000 322 11.5 68 5 19
20 11.00000 44 9.7 62 5 20
21 1.00000 8 9.7 59 5 21
22 11.00000 320 16.6 73 5 22
23 4.00000 25 9.7 61 5 23
24 32.00000 92 12.0 61 5 24
25 42.12931 66 16.6 57 5 25
26 42.12931 266 14.9 58 5 26
27 42.12931 NA 8.0 57 5 27
28 23.00000 13 12.0 67 5 28
29 45.00000 252 14.9 81 5 29
30 115.00000 223 5.7 79 5 30
31 37.00000 279 7.4 76 5 31
32 42.12931 286 8.6 78 6 1
33 42.12931 287 9.7 74 6 2
34 42.12931 242 16.1 67 6 3
35 42.12931 186 9.2 84 6 4
36 42.12931 220 8.6 85 6 5
37 42.12931 264 14.3 79 6 6
38 29.00000 127 9.7 82 6 7
39 42.12931 273 6.9 87 6 8
40 71.00000 291 13.8 90 6 9
41 39.00000 323 11.5 87 6 10
42 42.12931 259 10.9 93 6 11
43 42.12931 250 9.2 92 6 12
44 23.00000 148 8.0 82 6 13
45 42.12931 332 13.8 80 6 14
46 42.12931 322 11.5 79 6 15
47 21.00000 191 14.9 77 6 16
48 37.00000 284 20.7 72 6 17
49 20.00000 37 9.2 65 6 18
50 12.00000 120 11.5 73 6 19
51 13.00000 137 10.3 76 6 20
52 42.12931 150 6.3 77 6 21
53 42.12931 59 1.7 76 6 22
54 42.12931 91 4.6 76 6 23
55 42.12931 250 6.3 76 6 24
56 42.12931 135 8.0 75 6 25
57 42.12931 127 8.0 78 6 26
58 42.12931 47 10.3 73 6 27
59 42.12931 98 11.5 80 6 28
60 42.12931 31 14.9 77 6 29
61 42.12931 138 8.0 83 6 30
62 135.00000 269 4.1 84 7 1
63 49.00000 248 9.2 85 7 2
64 32.00000 236 9.2 81 7 3
65 42.12931 101 10.9 84 7 4
66 64.00000 175 4.6 83 7 5
67 40.00000 314 10.9 83 7 6
68 77.00000 276 5.1 88 7 7
69 97.00000 267 6.3 92 7 8
70 97.00000 272 5.7 92 7 9
71 85.00000 175 7.4 89 7 10
72 42.12931 139 8.6 82 7 11
73 10.00000 264 14.3 73 7 12
74 27.00000 175 14.9 81 7 13
75 42.12931 291 14.9 91 7 14
76 7.00000 48 14.3 80 7 15
77 48.00000 260 6.9 81 7 16
78 35.00000 274 10.3 82 7 17
79 61.00000 285 6.3 84 7 18
80 79.00000 187 5.1 87 7 19
81 63.00000 220 11.5 85 7 20
82 16.00000 7 6.9 74 7 21
83 42.12931 258 9.7 81 7 22
84 42.12931 295 11.5 82 7 23
85 80.00000 294 8.6 86 7 24
86 108.00000 223 8.0 85 7 25
87 20.00000 81 8.6 82 7 26
88 52.00000 82 12.0 86 7 27
89 82.00000 213 7.4 88 7 28
90 50.00000 275 7.4 86 7 29
91 64.00000 253 7.4 83 7 30
92 59.00000 254 9.2 81 7 31
93 39.00000 83 6.9 81 8 1
94 9.00000 24 13.8 81 8 2
95 16.00000 77 7.4 82 8 3
96 78.00000 NA 6.9 86 8 4
97 35.00000 NA 7.4 85 8 5
98 66.00000 NA 4.6 87 8 6
99 122.00000 255 4.0 89 8 7
100 89.00000 229 10.3 90 8 8
101 110.00000 207 8.0 90 8 9
102 42.12931 222 8.6 92 8 10
103 42.12931 137 11.5 86 8 11
104 44.00000 192 11.5 86 8 12
105 28.00000 273 11.5 82 8 13
106 65.00000 157 9.7 80 8 14
107 42.12931 64 11.5 79 8 15
108 22.00000 71 10.3 77 8 16
109 59.00000 51 6.3 79 8 17
110 23.00000 115 7.4 76 8 18
111 31.00000 244 10.9 78 8 19
112 44.00000 190 10.3 78 8 20
113 21.00000 259 15.5 77 8 21
114 9.00000 36 14.3 72 8 22
115 42.12931 255 12.6 75 8 23
116 45.00000 212 9.7 79 8 24
117 168.00000 238 3.4 81 8 25
118 73.00000 215 8.0 86 8 26
119 42.12931 153 5.7 88 8 27
120 76.00000 203 9.7 97 8 28
121 118.00000 225 2.3 94 8 29
122 84.00000 237 6.3 96 8 30
123 85.00000 188 6.3 94 8 31
124 96.00000 167 6.9 91 9 1
125 78.00000 197 5.1 92 9 2
126 73.00000 183 2.8 93 9 3
127 91.00000 189 4.6 93 9 4
128 47.00000 95 7.4 87 9 5
129 32.00000 92 15.5 84 9 6
130 20.00000 252 10.9 80 9 7
131 23.00000 220 10.3 78 9 8
132 21.00000 230 10.9 75 9 9
133 24.00000 259 9.7 73 9 10
134 44.00000 236 14.9 81 9 11
135 21.00000 259 15.5 76 9 12
136 28.00000 238 6.3 77 9 13
137 9.00000 24 10.9 71 9 14
138 13.00000 112 11.5 71 9 15
139 46.00000 237 6.9 78 9 16
140 18.00000 224 13.8 67 9 17
141 13.00000 27 10.3 76 9 18
142 24.00000 238 10.3 68 9 19
143 16.00000 201 8.0 82 9 20
144 13.00000 238 12.6 64 9 21
145 23.00000 14 9.2 71 9 22
146 36.00000 139 10.3 81 9 23
147 7.00000 49 10.3 69 9 24
148 14.00000 20 16.6 63 9 25
149 30.00000 193 6.9 70 9 26
150 42.12931 145 13.2 77 9 27
151 14.00000 191 14.3 75 9 28
152 18.00000 131 8.0 76 9 29
153 20.00000 223 11.5 68 9 30
| Name | airQ.df |
| Number of rows | 153 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Ozone | 0 | 1.00 | 42.13 | 28.69 | 1.0 | 21.00 | 42.13 | 46.00 | 168.0 | ▇▇▂▁▁ |
| Solar.R | 7 | 0.95 | 185.93 | 90.06 | 7.0 | 115.75 | 205.00 | 258.75 | 334.0 | ▅▃▅▇▅ |
| Wind | 0 | 1.00 | 9.96 | 3.52 | 1.7 | 7.40 | 9.70 | 11.50 | 20.7 | ▂▇▇▃▁ |
| Temp | 0 | 1.00 | 77.88 | 9.47 | 56.0 | 72.00 | 79.00 | 85.00 | 97.0 | ▂▃▇▇▃ |
| Month | 0 | 1.00 | 6.99 | 1.42 | 5.0 | 6.00 | 7.00 | 8.00 | 9.0 | ▇▇▇▇▇ |
| Day | 0 | 1.00 | 15.80 | 8.86 | 1.0 | 8.00 | 16.00 | 23.00 | 31.0 | ▇▇▇▇▆ |
14.1 Mean replacement
```{r mean replacement extra not covered in the video}
SolarMean <- mean(airQ.df$Solar.R, na.rm = TRUE)
airQ.df$Solar.R [is.na(airQ.df$Solar.R)] <- SolarMean # treat the column as a vector
airQ.df['Solar.R'][is.na(airQ.df['Solar.R'])] <- SolarMean # treat the column as a data frame; same result
library(skimr)
skim(airQ.df)
```| Name | airQ.df |
| Number of rows | 153 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Ozone | 0 | 1 | 42.13 | 28.69 | 1.0 | 21.0 | 42.13 | 46.0 | 168.0 | ▇▇▂▁▁ |
| Solar.R | 0 | 1 | 185.93 | 87.96 | 7.0 | 120.0 | 194.00 | 256.0 | 334.0 | ▅▃▆▇▅ |
| Wind | 0 | 1 | 9.96 | 3.52 | 1.7 | 7.4 | 9.70 | 11.5 | 20.7 | ▂▇▇▃▁ |
| Temp | 0 | 1 | 77.88 | 9.47 | 56.0 | 72.0 | 79.00 | 85.0 | 97.0 | ▂▃▇▇▃ |
| Month | 0 | 1 | 6.99 | 1.42 | 5.0 | 6.0 | 7.00 | 8.0 | 9.0 | ▇▇▇▇▇ |
| Day | 0 | 1 | 15.80 | 8.86 | 1.0 | 8.0 | 16.00 | 23.0 | 31.0 | ▇▇▇▇▆ |
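The base-R replacement above can also be written the tidyverse way. One idiom (a sketch, not the only option) uses mutate() with across() and coalesce(), which returns the first non-missing value; as.numeric() avoids a type clash because the airquality columns are integers while the mean is a double:

```{r}
library(dplyr)

airquality %>%
  as_tibble() %>%
  mutate(across(
    c(Ozone, Solar.R),
    ~ coalesce(as.numeric(.x), mean(.x, na.rm = TRUE)) # mean replacement
  )) %>%
  summarise(across(everything(), ~ sum(is.na(.x)))) # all zeros: no NAs left
```

Note that mean replacement leaves the column mean unchanged while shrinking its standard deviation, as the two skim() outputs above show for Ozone.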