7 Data management and wrangling

When in RStudio, quickly jump to this page using r3::open_data_wrangling().

Session objectives:

Learn the difference between “messy” and “tidy” data and how to create tidy data to simplify your analysis.
Perform simple transformations and subsetting of datasets, such as:
- Subset specific columns and rows of a dataset (with filter()and select()).
- Sort rows of a dataset by a specific column (with arrange()).
- Create new or transform existing columns in a dataset (with mutate()).
- Calculate simple data summaries (with summarize()).
Learn about and apply the “split-apply-combine” method for doing analyses (with group_by() and summarize()).
Know the difference between “long” and “wide” data and how to convert between them by “pivotting” (with pivot_longer() and pivot_wider()).
Write “tidier” and more readable code by using the pipe (%>%) operator.

7.1 “Messy” vs. “tidy” data

Take 10 min to read through this “Messy” vs “tidy” data section and the Managing and working with data in R section.

The concept of “tidy” data was popularized in an article by Hadley Wickham and described in more detail in the Tidy Data chapter of the R for Data Science online book.

But before we continue with tidy data, we need to cover something that is related to the concept of “tidy” and that will come up often in this course: the tidyverse. The tidyverse is an ecosystem of R packages that are designed to work well together, that all follow a strong “design philosophy” and common style guide. This makes them much easier to use together. These packages also tend to have excellent, beginner-friendly documentation and tutorials on learning and using the packages. We teach the tidyverse because of these reasons.

Ok, back to “tidy data”. A tidy dataset is when:

Each variable has its own column (e.g. “Body Weight”)
Each observation has its own row (e.g. “Person”)
Each value has its own cell (e.g. “Body weight for a person at a specific date”)

Take a look at the example “tidy” and “messy” data frames (also called “tibbles” in the tidyverse) below. These datasets come built in with the tidyr package for teaching purpose. Think about why each is “tidy” or “messy”. What do you notice between the tidy versions and the messier versions?

# Datasets come from tidyr
# Tidy:
table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
# Partly tidy:
table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583
# Messier:
table3
#> # A tibble: 6 x 3
#>   country      year rate             
#> * <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583
# Messy:
table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#> * <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
# Messy:
table4b
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#> * <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

The “most” tidy version is table1 which has columns that describe their values (e.g. population is population size), each row is unique (e.g. first row is for values from Afghanistan from 1999), and each cell is an explicit value representative of its column and row. table2 is a “long” version of table1 so it is partly “tidy”, but it doesn’t satisfy the rule that each variable has a column, since count represents both cases and population size. On the other hand, table3 is messy because the rate column values are a composite of two other column values (cases and population), when it should be a single number (a percent). Both table4a and table4b have columns with ambiguous values inside; what does values in the 1999 column contain? You can’t tell from the data.

Tidy data has a few notable benefits:

Time spent preparing your data to be tidy from the beginning can save days of frustration in the long run.
“Tidy data” is a conceptual framework that allows you to easily build off and wrangle (i.e. “manipulate”, “clean up”, “manage”) data in simpler and easy to interpret ways, especially when used within the framework of the tidyverse.

The concept of tidy data also gives rise to “tidy code” for wrangling. By using “verbs” (R functions) and chaining them together in “sentences” (in a sequential pipeline), you can construct meaningful and readable code that describes in plainer English what you are doing to the data

7.2 Managing and working with data in R

Take 5 min and read through this section. When working with data, there are a few principles to follow:

Never edit raw data and save it in a separate location (could put in the data-raw/ folder)
- Note: Saving to data-raw/ depends on how you collected the data and how many collaborators are on your team. You may end up storing and processing the data in another folder as a project of its own.
Only work with your raw data using R code, don’t manually edit it. Manual editing doesn’t leave a history of what you’ve done to it, so you can’t go back and see what you’ve done. Always keep a history of any changes you’ve made to the data, preferably by using R code.
Save the edited data as another dataset and store it in the data/ folder.

When wrangling your data with R code make sure to:

Document and comment as best you can what you did to your data and why you did it to help you remember
Write the code itself to be as descriptive as you can and to readable enough to understand what is being done to the data. Keep the code simple: Don’t be clever, be clear. Clear code is easier to understand than some clever code.

In data wrangling, most tasks can be expressed by a few simple “verbs” (actions). Wrangling here is used in the sense of maneuvering, managing, controlling, and turning your data around to clean it up, to better understand it, and to prepare it for later analyses. The table below lists some common “verbs” from the dplyr and tidyr packages that come from the tidyverse:

Table 7.1: List of common data wrangling tasks, along with an example and the function used for the wrangling.
Task	Example	Function
Select columns	Remove data entry columns such as person’s name who entered the data.	`select()`
Rename columns	Changing a column name from ‘Q1’ to ‘ParticipantName’.	`rename()`
Transform or modify columns	Multiplying a column’s values or taking the log.	`mutate()`
Subset/filter rows	Keeping rows with glucose values above 4.	`filter()`
Sort rows	Show rows with the smallest value at the top.	`arrange()`
Convert data from wide to long	One row per participant to multiple participants per row (repeated measures).	`pivot_longer()`
Convert data from long to wide	Multiple rows per participant (repeated measures) to one participant per row.	`pivot_wider()`
Calculate summaries of the data	Calculating the maximum, median, and minimum age.	`summarise()`
Run an analysis by a group	Calculate means of age by males and females.	`group_by()` with `summarise()`

The functions above come from the packages dplyr and tidyr. These packages provide easy tools for most common data manipulation tasks. No other packages have developed such a “language” of wrangling, and is the reason we teach them. Plus both packages have excellent documentation and tutorials on how to use them. For dplyr, it is built to work directly with data frames (i.e. rectangular data like those found in spreadsheets) and has an additional feature to interact directly with data stored in an external database, such as in SQL. Working with databases is a powerful way to work with massive datasets (100s of GB), more than what your computer could normally handle. Working with massive data won’t be covered in this course, but see this resource from Data Carpentry) to learn more.

Tip: Sometimes you need to do some complicated wrangling to get your data in appropriate “shape” to use for later analyses. To help save some time, you could save the wrangled data as an “output” dataset in the data/ folder. That way, you can easily use it again later rather than having to run the wrangling code every time you want to work with the data.

7.3 Load the packages and dataset

We’re going to use the US NHANES dataset to demonstrate the wrangling functions. There is an NHANES package that contains a teaching version of the original dataset, so we’ll use that for this lesson. First, make sure the R Project you created previously is open. Then open the R/package-loading.R script, add the dataset package to the file, so it looks like:

library(tidyverse)
library(NHANES)

Then open the R/wrangling-session.R script to start typing out the next code. We’ll use this file to write the code for this session (but not for the exercises).

## # Load up the packages
## source(here::here("R/package-loading.R"))

# Briefly glimpse contents of dataset
glimpse(NHANES)
#> Rows: 10,000
#> Columns: 76
#> $ ID               <int> 51624, 51624, 51624, 51625, 51630, 51638, 51646, 516…
#> $ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10…
#> $ Gender           <fct> male, male, male, male, female, male, male, female, …
#> $ Age              <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, 54, 10,…
#> $ AgeDecade        <fct>  30-39,  30-39,  30-39,  0-9,  40-49,  0-9,  0-9,  4…
#> $ AgeMonths        <int> 409, 409, 409, 49, 596, 115, 101, 541, 541, 541, 795…
#> $ Race1            <fct> White, White, White, Other, White, White, White, Whi…
#> $ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Education        <fct> High School, High School, High School, NA, Some Coll…
#> $ MaritalStatus    <fct> Married, Married, Married, NA, LivePartner, NA, NA, …
#> $ HHIncome         <fct> 25000-34999, 25000-34999, 25000-34999, 20000-24999, …
#> $ HHIncomeMid      <int> 30000, 30000, 30000, 22500, 40000, 87500, 60000, 875…
#> $ Poverty          <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33, 5.00, 5.00…
#> $ HomeRooms        <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 10, 10, 4, 3…
#> $ HomeOwn          <fct> Own, Own, Own, Own, Rent, Rent, Own, Own, Own, Own, …
#> $ Work             <fct> NotWorking, NotWorking, NotWorking, NA, NotWorking, …
#> $ Weight           <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2, 75.7, 75.7…
#> $ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Height           <dbl> 164.7, 164.7, 164.7, 105.4, 168.4, 133.1, 130.6, 166…
#> $ BMI              <dbl> 32.22, 32.22, 32.22, 15.30, 30.57, 16.82, 20.64, 27.…
#> $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ BMI_WHO          <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.5, 30.0_plu…
#> $ Pulse            <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 60, 62, 76, …
#> $ BPSysAve         <int> 113, 113, 113, NA, 112, 86, 107, 118, 118, 118, 111,…
#> $ BPDiaAve         <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 63, 74, 85, …
#> $ BPSys1           <int> 114, 114, 114, NA, 118, 84, 114, 106, 106, 106, 124,…
#> $ BPDia1           <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 64, 76, 86, …
#> $ BPSys2           <int> 114, 114, 114, NA, 108, 84, 108, 118, 118, 118, 108,…
#> $ BPDia2           <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 62, 72, 88, …
#> $ BPSys3           <int> 112, 112, 112, NA, 116, 88, 106, 118, 118, 118, 114,…
#> $ BPDia3           <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 64, 76, 82, …
#> $ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ DirectChol       <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2.12, 2.12, …
#> $ TotChol          <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5.82, 5.82, …
#> $ UrineVol1        <int> 352, 352, 352, NA, 77, 123, 238, 106, 106, 106, 113,…
#> $ UrineFlow1       <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.116, 1.116, 1…
#> $ UrineVol2        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ UrineFlow2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ Diabetes         <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, …
#> $ DiabetesAge      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ HealthGen        <fct> Good, Good, Good, NA, Good, NA, NA, Vgood, Vgood, Vg…
#> $ DaysPhysHlthBad  <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4, NA, NA, 0…
#> $ DaysMentHlthBad  <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0, 0, NA, NA…
#> $ LittleInterest   <fct> Most, Most, Most, NA, Several, NA, NA, None, None, N…
#> $ Depressed        <fct> Several, Several, Several, NA, Several, NA, NA, None…
#> $ nPregnancies     <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, NA, NA, NA, …
#> $ nBabies          <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, N…
#> $ Age1stBaby       <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ SleepHrsNight    <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA, 5, 7, …
#> $ SleepTrouble     <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, No, …
#> $ PhysActive       <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes, Yes, Yes,…
#> $ PhysActiveDays   <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5, 1, NA, 2,…
#> $ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
#> $ TVHrsDayChild    <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, NA, NA, 4, …
#> $ CompHrsDayChild  <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, NA, NA, 3, …
#> $ Alcohol12PlusYr  <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, …
#> $ AlcoholDay       <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2, 6, NA, NA,…
#> $ AlcoholYear      <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100, 104, 364, …
#> $ SmokeNow         <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, No, NA, NA,…
#> $ Smoke100         <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, Yes, No,…
#> $ Smoke100n        <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, NA, Non-Smok…
#> $ SmokeAge         <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 13, NA, NA, …
#> $ Marijuana        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, NA, Y…
#> $ AgeFirstMarij    <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, NA, 19, 15, …
#> $ RegularMarij     <fct> No, No, No, NA, No, NA, NA, No, No, No, NA, Yes, Yes…
#> $ AgeRegMarij      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20, 15, …
#> $ HardDrugs        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, Yes,…
#> $ SexEver          <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, …
#> $ SexAge           <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 17, 22, 12, …
#> $ SexNumPartnLife  <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, 7, 100, NA,…
#> $ SexNumPartYear   <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1, NA, NA, 1…
#> $ SameSex          <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Yes, No, No, …
#> $ SexOrientation   <fct> Heterosexual, Heterosexual, Heterosexual, NA, Hetero…
#> $ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

7.4 Exercise: Become familiar with the dataset

Time: 5 min

Take the time to get familiar with the NHANES dataset.

Create a new R script by typing in the RStudio Console usethis::use_r("exercises-wrangling").
Copy the code below and paste it into the new exercise file.
Replace the ___ with the NHANES dataset.
Run each line of code by typing Ctrl-Enter.

# Load the packages
library(tidyverse)
library(NHANES)

# Check column names
colnames(___)

# Look at contents
str(___)
glimpse(___)

# See summary
summary(___)

# Look over the dataset documentation
?___

Click for the solution

# Load the packages
library(tidyverse)
library(NHANES)

# Check column names
colnames(NHANES)

# Look at contents
str(NHANES)
glimpse(NHANES)

# See summary
summary(NHANES)

# Look over the dataset documentation
?NHANES

7.5 Select specific columns in a dataset

Selecting columns of a dataset is a very common data wrangling task. The function for this task is appropriately called select(). For the input arguments, it takes the dataset as the first argument, which is the first input position right after the opening bracket (, and then takes the names of the columns you want to select. Because the argument after the data argument is ..., it means that you can add as many columns as you want, separated by a ,.

# Select one column by its name, without quotes
select(NHANES, Age)
#> # A tibble: 10,000 x 1
#>      Age
#>    <int>
#>  1    34
#>  2    34
#>  3    34
#>  4     4
#>  5    49
#>  6     9
#>  7     8
#>  8    45
#>  9    45
#> 10    45
#> # … with 9,990 more rows

# Select two or more columns by name, without quotes
select(NHANES, Age, Weight, BMI)
#> # A tibble: 10,000 x 3
#>      Age Weight   BMI
#>    <int>  <dbl> <dbl>
#>  1    34   87.4  32.2
#>  2    34   87.4  32.2
#>  3    34   87.4  32.2
#>  4     4   17    15.3
#>  5    49   86.7  30.6
#>  6     9   29.8  16.8
#>  7     8   35.2  20.6
#>  8    45   75.7  27.2
#>  9    45   75.7  27.2
#> 10    45   75.7  27.2
#> # … with 9,990 more rows

# To *exclude* a column, use minus (-)
select(NHANES, -HeadCirc)
#> # A tibble: 10,000 x 75
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>     
#>  5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some Col…
#>  6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>     
#>  7 51646 2009_10  male       8 " 0-9"          101 White <NA>  <NA>     
#>  8 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  9 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> 10 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> # … with 9,990 more rows, and 66 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, Height <dbl>,
#> #   BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>, Pulse <int>,
#> #   BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>, BPSys2 <int>,
#> #   BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>

If some of your columns have similar patterns at the beginning, within, or end, you can use the helper functions to choose these columns. Use ?select_helpers (choose the “Select helpers” option in the menu that pops up) to read more about these functions and to get help on them. Some more useful helpers are:

starts_with(): Select columns that begin with a pattern.
ends_with(): Select columns that end with a pattern.
contains(): Select columns that contain a pattern.

# All columns starting with letters "BP" (blood pressure)
select(NHANES, starts_with("BP"))
#> # A tibble: 10,000 x 8
#>    BPSysAve BPDiaAve BPSys1 BPDia1 BPSys2 BPDia2 BPSys3 BPDia3
#>       <int>    <int>  <int>  <int>  <int>  <int>  <int>  <int>
#>  1      113       85    114     88    114     88    112     82
#>  2      113       85    114     88    114     88    112     82
#>  3      113       85    114     88    114     88    112     82
#>  4       NA       NA     NA     NA     NA     NA     NA     NA
#>  5      112       75    118     82    108     74    116     76
#>  6       86       47     84     50     84     50     88     44
#>  7      107       37    114     46    108     36    106     38
#>  8      118       64    106     62    118     68    118     60
#>  9      118       64    106     62    118     68    118     60
#> 10      118       64    106     62    118     68    118     60
#> # … with 9,990 more rows
# All columns ending in letters "Day"
select(NHANES, ends_with("Day"))
#> # A tibble: 10,000 x 3
#>    TVHrsDay CompHrsDay AlcoholDay
#>    <fct>    <fct>           <int>
#>  1 <NA>     <NA>               NA
#>  2 <NA>     <NA>               NA
#>  3 <NA>     <NA>               NA
#>  4 <NA>     <NA>               NA
#>  5 <NA>     <NA>                2
#>  6 <NA>     <NA>               NA
#>  7 <NA>     <NA>               NA
#>  8 <NA>     <NA>                3
#>  9 <NA>     <NA>                3
#> 10 <NA>     <NA>                3
#> # … with 9,990 more rows
# All columns containing letters "Age"
select(NHANES, contains("Age"))
#> # A tibble: 10,000 x 9
#>      Age AgeDecade AgeMonths DiabetesAge Age1stBaby SmokeAge AgeFirstMarij
#>    <int> <fct>         <int>       <int>      <int>    <int>         <int>
#>  1    34 " 30-39"        409          NA         NA       18            17
#>  2    34 " 30-39"        409          NA         NA       18            17
#>  3    34 " 30-39"        409          NA         NA       18            17
#>  4     4 " 0-9"           49          NA         NA       NA            NA
#>  5    49 " 40-49"        596          NA         27       38            18
#>  6     9 " 0-9"          115          NA         NA       NA            NA
#>  7     8 " 0-9"          101          NA         NA       NA            NA
#>  8    45 " 40-49"        541          NA         NA       NA            13
#>  9    45 " 40-49"        541          NA         NA       NA            13
#> 10    45 " 40-49"        541          NA         NA       NA            13
#> # … with 9,990 more rows, and 2 more variables: AgeRegMarij <int>, SexAge <int>

For more information on using the pattern functions such as starts_with(), check ?select_helpers.

You’ll notice that running these functions doesn’t actually change the data itself. When you run a function without assigning it using <-, the only action the function does is to send the output to your screen. But if you want to create a new dataset with only the columns you selected, you’ll need to assign it to a new object.

# Recall the style guide for naming objects
nhanes_blood_pressure <- select(NHANES, starts_with("BP"))
nhanes_blood_pressure
#> # A tibble: 10,000 x 8
#>    BPSysAve BPDiaAve BPSys1 BPDia1 BPSys2 BPDia2 BPSys3 BPDia3
#>       <int>    <int>  <int>  <int>  <int>  <int>  <int>  <int>
#>  1      113       85    114     88    114     88    112     82
#>  2      113       85    114     88    114     88    112     82
#>  3      113       85    114     88    114     88    112     82
#>  4       NA       NA     NA     NA     NA     NA     NA     NA
#>  5      112       75    118     82    108     74    116     76
#>  6       86       47     84     50     84     50     88     44
#>  7      107       37    114     46    108     36    106     38
#>  8      118       64    106     62    118     68    118     60
#>  9      118       64    106     62    118     68    118     60
#> 10      118       64    106     62    118     68    118     60
#> # … with 9,990 more rows

7.6 Rename specific columns

Depending on how your data was collected, it may have column names that aren’t very descriptive. So you’ll probably want to rename them to something more explanatory. As with select(), the rename you use the function called rename(). Like select(), rename() takes the dataset as the first argument (first position) and then takes as many renaming arguments as you want (because the second argument position is ...). Renaming takes the form of newname = oldname.

rename(NHANES, NumberBabies = nBabies)
#> # A tibble: 10,000 x 76
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>     
#>  5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some Col…
#>  6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>     
#>  7 51646 2009_10  male       8 " 0-9"          101 White <NA>  <NA>     
#>  8 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  9 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> 10 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> # … with 9,990 more rows, and 67 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
#> #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
#> #   Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
#> #   BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>,
#> #   NumberBabies <int>, Age1stBaby <int>, SleepHrsNight <int>,
#> #   SleepTrouble <fct>, PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>,
#> #   CompHrsDay <fct>, TVHrsDayChild <int>, CompHrsDayChild <int>,
#> #   Alcohol12PlusYr <fct>, AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>,
#> #   Smoke100 <fct>, Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>,
#> #   AgeFirstMarij <int>, RegularMarij <fct>, AgeRegMarij <int>,
#> #   HardDrugs <fct>, SexEver <fct>, SexAge <int>, SexNumPartnLife <int>,
#> #   SexNumPartYear <int>, SameSex <fct>, SexOrientation <fct>,
#> #   PregnantNow <fct>

You can’t really see what was changed. Let’s make a new object of only nBabies.

nhanes_number_babies <- select(NHANES, nBabies)
rename(nhanes_number_babies, NumberBabies = nBabies)
#> # A tibble: 10,000 x 1
#>    NumberBabies
#>           <int>
#>  1           NA
#>  2           NA
#>  3           NA
#>  4           NA
#>  5            2
#>  6           NA
#>  7           NA
#>  8           NA
#>  9           NA
#> 10           NA
#> # … with 9,990 more rows

You can now see how it changes the name. What if you want to select some columns and then rename some of them, do you have to create a new data object every time? No! We can make use of a very powerful tool called piping (with the %>% function).

7.7 Chaining functions with the pipe

Take 5 minutes and read this section before we continue.

A key component of the tidy data and tidy code concept is making use of the %>% operator. This operator allows you to “pipe” the output from one function to the input of another function, like a plumbing pipe would do for water. This allows you to easily chain functions together into “sentences”. Let’s use an example based on English words for some action. This is the English sentence:

We need some eggs. Drive to the grocery store and buy up some eggs before coming home from work.

There are basically two actions here (“drive” and “buy”) and four “inputs” (“work”, “grocery store”, “eggs”, “home”), that are all based on the previous action. Since an action in R is a function, the functions would be drive() and buy(). In regular R, if we wanted to chain these functions together, we would have to nest them like this:

drive(buy(drive(at_work, "grocery store"), "eggs"), "home")

This is difficult to read. We could also create temporary objects:

at_grocery_store <- drive(at_work, "grocery store")
got_eggs <- buy(at_grocery_store, "eggs")
at_home <- drive(got_eggs, "home")

But this still isn’t too “readable”. The pipe %>% operator can really simplify this:

at_work %>% 
    drive("grocery store") %>% 
    buy("eggs") %>% 
    drive("home")

Do you find this more readable and understandable? We read it like how it would actually be done, in order of the steps taken.

Instead of nesting functions (reading from the inside to the outside), the idea of piping is to read the functions from left to right. This can help clarify and break down complex data processing workflows, and is the basis for all tidyverse and many other packages. This is a basic design philosophy of interacting with data when using the tidyverse.

The pipe %>% takes the output from the object or function on the left hand side and puts it into the function of the right hand side. All input goes into the first position argument of the function. So with tidyverse packages, the first position always takes the data, to make the function usable with the pipe.

Ok, let’s return back together and try this out.

# These two ways are the same
colnames(NHANES)
#>  [1] "ID"               "SurveyYr"         "Gender"           "Age"             
#>  [5] "AgeDecade"        "AgeMonths"        "Race1"            "Race3"           
#>  [9] "Education"        "MaritalStatus"    "HHIncome"         "HHIncomeMid"     
#> [13] "Poverty"          "HomeRooms"        "HomeOwn"          "Work"            
#> [17] "Weight"           "Length"           "HeadCirc"         "Height"          
#> [21] "BMI"              "BMICatUnder20yrs" "BMI_WHO"          "Pulse"           
#> [25] "BPSysAve"         "BPDiaAve"         "BPSys1"           "BPDia1"          
#> [29] "BPSys2"           "BPDia2"           "BPSys3"           "BPDia3"          
#> [33] "Testosterone"     "DirectChol"       "TotChol"          "UrineVol1"       
#> [37] "UrineFlow1"       "UrineVol2"        "UrineFlow2"       "Diabetes"        
#> [41] "DiabetesAge"      "HealthGen"        "DaysPhysHlthBad"  "DaysMentHlthBad" 
#> [45] "LittleInterest"   "Depressed"        "nPregnancies"     "nBabies"         
#> [49] "Age1stBaby"       "SleepHrsNight"    "SleepTrouble"     "PhysActive"      
#> [53] "PhysActiveDays"   "TVHrsDay"         "CompHrsDay"       "TVHrsDayChild"   
#> [57] "CompHrsDayChild"  "Alcohol12PlusYr"  "AlcoholDay"       "AlcoholYear"     
#> [61] "SmokeNow"         "Smoke100"         "Smoke100n"        "SmokeAge"        
#> [65] "Marijuana"        "AgeFirstMarij"    "RegularMarij"     "AgeRegMarij"     
#> [69] "HardDrugs"        "SexEver"          "SexAge"           "SexNumPartnLife" 
#> [73] "SexNumPartYear"   "SameSex"          "SexOrientation"   "PregnantNow"
NHANES %>% 
    colnames()
#>  [1] "ID"               "SurveyYr"         "Gender"           "Age"             
#>  [5] "AgeDecade"        "AgeMonths"        "Race1"            "Race3"           
#>  [9] "Education"        "MaritalStatus"    "HHIncome"         "HHIncomeMid"     
#> [13] "Poverty"          "HomeRooms"        "HomeOwn"          "Work"            
#> [17] "Weight"           "Length"           "HeadCirc"         "Height"          
#> [21] "BMI"              "BMICatUnder20yrs" "BMI_WHO"          "Pulse"           
#> [25] "BPSysAve"         "BPDiaAve"         "BPSys1"           "BPDia1"          
#> [29] "BPSys2"           "BPDia2"           "BPSys3"           "BPDia3"          
#> [33] "Testosterone"     "DirectChol"       "TotChol"          "UrineVol1"       
#> [37] "UrineFlow1"       "UrineVol2"        "UrineFlow2"       "Diabetes"        
#> [41] "DiabetesAge"      "HealthGen"        "DaysPhysHlthBad"  "DaysMentHlthBad" 
#> [45] "LittleInterest"   "Depressed"        "nPregnancies"     "nBabies"         
#> [49] "Age1stBaby"       "SleepHrsNight"    "SleepTrouble"     "PhysActive"      
#> [53] "PhysActiveDays"   "TVHrsDay"         "CompHrsDay"       "TVHrsDayChild"   
#> [57] "CompHrsDayChild"  "Alcohol12PlusYr"  "AlcoholDay"       "AlcoholYear"     
#> [61] "SmokeNow"         "Smoke100"         "Smoke100n"        "SmokeAge"        
#> [65] "Marijuana"        "AgeFirstMarij"    "RegularMarij"     "AgeRegMarij"     
#> [69] "HardDrugs"        "SexEver"          "SexAge"           "SexNumPartnLife" 
#> [73] "SexNumPartYear"   "SameSex"          "SexOrientation"   "PregnantNow"

Because the pipe automatically takes NHANES and puts it into the first position, we don’t need to type out NHANES inside colnames() when piping.

Let’s try the pipe on the select() and rename() function from the previous section. Remember, both select() and rename() take a dataset as the first position, which makes them pipe-able.

NHANES %>% 
    select(nBabies) %>% 
    rename(NumberBabies = nBabies)
#> # A tibble: 10,000 x 1
#>    NumberBabies
#>           <int>
#>  1           NA
#>  2           NA
#>  3           NA
#>  4           NA
#>  5            2
#>  6           NA
#>  7           NA
#>  8           NA
#>  9           NA
#> 10           NA
#> # … with 9,990 more rows

We can now “read” these actions as:

Take the NHANES dataset and then select the nBabies column and then rename the nBabies column to NumberBabies.

7.8 Exercise: Practice what we’ve learned

Time: 8 min

In the exercise-wrangling.R file, complete these tasks:

Copy and paste the below code into the exercise file. In the select() function, type in the columns HomeOwn, TVHrsDay, and Diabetes where the blank space is.
```
NHANES %>% 
    select(___)
```
Copy and paste the below code and fill out the blanks. Rename DiabetesAge to be DiabetesAgeOfDiagnosis and Gender to be Sex (gender is the social construct, while sex is biological). Tip: Recall that renaming is in the form new = old.
```
NHANES %>% 
    rename(___ = ___, ___ = ____)
```
Re-write this bit of code to use the pipe:
```
select(NHANES, BMI, contains("Age"))
```
Read aloud (under your breath or in your head) the below code. How intuitive is it to read? Now re-write this code so you don’t need to create the temporary drug_use object by using the pipe, then re-read the revised version. Which do you feel is easier to “read”?
```
drug_use <- select(NHANES, Marijuana, AgeFirstMarij)
rename(drug_use, AgeOfFirstMarijuanaUse = AgeFirstMarij)
```

Click for the solution

# 1. Select specific columns
NHANES %>%
    select(HomeOwn, TVHrsDay, Diabetes)

# 2. Rename columns
NHANES %>%
    rename(DiabetesAgeOfDiagnosis = DiabetesAge, Sex = Gender)

# 3. Re-write with pipe
NHANES %>% 
    select(BMI, contains("Age"))

# 4. Re-write with pipe
NHANES %>% 
    select(Marijuana, AgeFirstMarij) %>% 
    rename(AgeOfFirstMarijuanaUse = AgeFirstMarij)

7.9 Filter the data by row

Filtering data by row is a very common activity in data analysis, for example, to get rid of outliers or to subset by a categorical group. As with the previous functions, the function to subset/filter is called filter(). The filter() function takes a logic condition (TRUE or FALSE). As with the other functions, the first position argument is the dataset, and all others are logic conditions to use. With filter(), when the logic conditions equals TRUE, that means it keeps the rows that equal TRUE and drop those that are FALSE.

A warning: Since filter() uses logical conditions, you need to be really careful when writing the logic. As you probably know, humans are really really bad at logic. So if your logical condition starts getting even a little complex, double and triple check that you know your logic code is doing what you think it is. It’s very easy to make mistakes at this stage, even for advanced R users.

The simplest kind of logic condition is to test for “equality”. In R, “equal to” is represented by ==. For example, if we want to keep only females in the dataset it would be:

NHANES %>%
    filter(Gender == "female")
#> # A tibble: 5,020 x 76
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some Col…
#>  2 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  3 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  4 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  5 51659 2009_10  female    10 " 10-19"        123 White <NA>  <NA>     
#>  6 51666 2009_10  female    58 " 50-59"        700 Mexi… <NA>  High Sch…
#>  7 51671 2009_10  female     9 " 0-9"          112 Black <NA>  <NA>     
#>  8 51685 2009_10  female    56 " 50-59"        677 White <NA>  College …
#>  9 51685 2009_10  female    56 " 50-59"        677 White <NA>  College …
#> 10 51691 2009_10  female    57 " 50-59"        694 White <NA>  High Sch…
#> # … with 5,010 more rows, and 67 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
#> #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
#> #   Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
#> #   BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>

We’d “read” this code as:

Take the NHANES dataset, and then filter so that only rows where Gender is equal to “female” are kept.

So, when a row in the Gender column has the value “female”, that row is kept. Otherwise, it is dropped. There are other logic comparisons to use. Use Table 7.2 as a reference for logical conditions in R.

Table 7.2: Logical operators in R.
Operator	Description
<	less than
<=	less than or equal to
>	greater than
>=	greater than or equal to
==	equal to
!=	not equal to
!x	Not x (if x is true or false)
x \| y	x OR y
x & y	x AND y

Let’s try out a few of these logic conditions with filter().

# When rows don't have female
NHANES %>%
    filter(Gender != "female")
#> # A tibble: 4,980 x 76
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>     
#>  5 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>     
#>  6 51646 2009_10  male       8 " 0-9"          101 White <NA>  <NA>     
#>  7 51654 2009_10  male      66 " 60-69"        795 White <NA>  Some Col…
#>  8 51656 2009_10  male      58 " 50-59"        707 White <NA>  College …
#>  9 51657 2009_10  male      54 " 50-59"        654 White <NA>  9 - 11th…
#> 10 51667 2009_10  male      50 " 50-59"        603 White <NA>  Some Col…
#> # … with 4,970 more rows, and 67 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
#> #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
#> #   Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
#> #   BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>

# when BMI is equal to 25
NHANES %>%
    filter(BMI == 25)
#> # A tibble: 35 x 76
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 57116 2009_10  male      24 " 20-29"        288 Other <NA>  8th Grade
#>  2 57116 2009_10  male      24 " 20-29"        288 Other <NA>  8th Grade
#>  3 62222 2011_12  male      32 " 30-39"         NA White White College …
#>  4 62222 2011_12  male      32 " 30-39"         NA White White College …
#>  5 62222 2011_12  male      32 " 30-39"         NA White White College …
#>  6 62222 2011_12  male      32 " 30-39"         NA White White College …
#>  7 62277 2011_12  female    55 " 50-59"         NA White White Some Col…
#>  8 62578 2011_12  female     6 " 0-9"           NA White White <NA>     
#>  9 62806 2011_12  male      53 " 50-59"         NA Black Black 9 - 11th…
#> 10 62881 2011_12  female    57 " 50-59"         NA White White College …
#> # … with 25 more rows, and 67 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
#> #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
#> #   Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
#> #   BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>

# when BMI is equal to or more than 25
NHANES %>%
    filter(BMI >= 25)
#> # A tibble: 5,422 x 76
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  4 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some Col…
#>  5 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  6 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  7 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  8 51657 2009_10  male      54 " 50-59"        654 White <NA>  9 - 11th…
#>  9 51666 2009_10  female    58 " 50-59"        700 Mexi… <NA>  High Sch…
#> 10 51667 2009_10  male      50 " 50-59"        603 White <NA>  Some Col…
#> # … with 5,412 more rows, and 67 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
#> #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
#> #   Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
#> #   BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>

We use the | (“or”) and & (“and”) when we want to combine conditions across columns. Be especially careful with these operators and whenever combining logic conditions, as they can sometimes work differently than our human brains interpret them (speaking from experience). For &, both sides must be TRUE in order for the combination to be TRUE. For |, only one side needs to be TRUE in order for the combination to be TRUE. To see how they work try these:

TRUE & TRUE
#> [1] TRUE
TRUE & FALSE
#> [1] FALSE
FALSE & FALSE
#> [1] FALSE
TRUE | TRUE
#> [1] TRUE
TRUE | FALSE
#> [1] TRUE
FALSE | FALSE
#> [1] FALSE

# when BMI is 25 AND Gender is female
NHANES %>%
    filter(BMI == 25 & Gender == "female")
#> # A tibble: 21 x 76
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 62277 2011_12  female    55 " 50-59"         NA White White Some Col…
#>  2 62578 2011_12  female     6 " 0-9"           NA White White <NA>     
#>  3 62881 2011_12  female    57 " 50-59"         NA White White College …
#>  4 62881 2011_12  female    57 " 50-59"         NA White White College …
#>  5 62881 2011_12  female    57 " 50-59"         NA White White College …
#>  6 65572 2011_12  female    79 " 70+"           NA Other Asian College …
#>  7 65688 2011_12  female    54 " 50-59"         NA White White Some Col…
#>  8 65688 2011_12  female    54 " 50-59"         NA White White Some Col…
#>  9 65688 2011_12  female    54 " 50-59"         NA White White Some Col…
#> 10 65688 2011_12  female    54 " 50-59"         NA White White Some Col…
#> # … with 11 more rows, and 67 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
#> #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
#> #   Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
#> #   BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>

# when BMI is 25 OR gender is female
NHANES %>%
    filter(BMI == 25 | Gender == "female")
#> # A tibble: 5,034 x 76
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some Col…
#>  2 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  3 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  4 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  5 51659 2009_10  female    10 " 10-19"        123 White <NA>  <NA>     
#>  6 51666 2009_10  female    58 " 50-59"        700 Mexi… <NA>  High Sch…
#>  7 51671 2009_10  female     9 " 0-9"          112 Black <NA>  <NA>     
#>  8 51685 2009_10  female    56 " 50-59"        677 White <NA>  College …
#>  9 51685 2009_10  female    56 " 50-59"        677 White <NA>  College …
#> 10 51691 2009_10  female    57 " 50-59"        694 White <NA>  High Sch…
#> # … with 5,024 more rows, and 67 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
#> #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
#> #   Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
#> #   BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>

7.10 (Re)Arranging the rows of your data by column

You may want to sort your rows by a specific column so that rows are arranged with bigger (or smaller) values on top. Arranging is done by using arrange(). Again, arrange() takes the dataset as the first argument and anything else it uses as the columns to order by. By default, arrange orders in ascending order.

# ascending order by age
NHANES %>%
    select(Age) %>% 
    arrange(Age)
#> # A tibble: 10,000 x 1
#>      Age
#>    <int>
#>  1     0
#>  2     0
#>  3     0
#>  4     0
#>  5     0
#>  6     0
#>  7     0
#>  8     0
#>  9     0
#> 10     0
#> # … with 9,990 more rows

We’re selecting age first so we can show what is happening. It also arranges characters alphabetically.

NHANES %>% 
    select(HealthGen) %>% 
    arrange(HealthGen)
#> # A tibble: 10,000 x 1
#>    HealthGen
#>    <fct>    
#>  1 Excellent
#>  2 Excellent
#>  3 Excellent
#>  4 Excellent
#>  5 Excellent
#>  6 Excellent
#>  7 Excellent
#>  8 Excellent
#>  9 Excellent
#> 10 Excellent
#> # … with 9,990 more rows

We can do this also in descending order with desc().

# descending order
NHANES %>%
    select(Age) %>% 
    arrange(desc(Age))
#> # A tibble: 10,000 x 1
#>      Age
#>    <int>
#>  1    80
#>  2    80
#>  3    80
#>  4    80
#>  5    80
#>  6    80
#>  7    80
#>  8    80
#>  9    80
#> 10    80
#> # … with 9,990 more rows

You can also arrange by multiple columns. For instance, first arrange by Gender and then by Age.

# ascending order by Gender and Age
NHANES %>%
    select(Gender, Age) %>% 
    arrange(Gender, Age)
#> # A tibble: 10,000 x 2
#>    Gender   Age
#>    <fct>  <int>
#>  1 female     0
#>  2 female     0
#>  3 female     0
#>  4 female     0
#>  5 female     0
#>  6 female     0
#>  7 female     0
#>  8 female     0
#>  9 female     0
#> 10 female     0
#> # … with 9,990 more rows

7.11 Transform or add columns

To “transform” (modify) an existing column or to add a new one, the function to use is called mutate(). Unfortunately, unlike the other functions, the name is not as obvious about what it does. The meaning of mutate though is to change or modify, so it kind of makes sense. Like the other functions, the first input is the data and the other arguments are columns to add or modify.

The form that mutate() uses is similar to normal R assignment: For instance, since the height’s values are in centimeters, maybe we’d rather want them in meters. So, in mutate() we’d type out:

Height = Height / 100

This form is similar to how math works. The action that happens on the right hand side is put into the variable of the left hand side. With using mutate() itself:

NHANES %>%
    mutate(Height = Height / 100)
#> # A tibble: 10,000 x 76
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>     
#>  5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some Col…
#>  6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>     
#>  7 51646 2009_10  male       8 " 0-9"          101 White <NA>  <NA>     
#>  8 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  9 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> 10 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> # … with 9,990 more rows, and 67 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
#> #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
#> #   Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
#> #   BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>

Or we can create a new column (maybe log transforming height):

NHANES %>% 
    mutate(LoggedHeight = log(Height))
#> # A tibble: 10,000 x 77
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>     
#>  5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some Col…
#>  6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>     
#>  7 51646 2009_10  male       8 " 0-9"          101 White <NA>  <NA>     
#>  8 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  9 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> 10 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> # … with 9,990 more rows, and 68 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
#> #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
#> #   Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
#> #   BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>, LoggedHeight <dbl>

We can also add multiple modifications or additions with mutate by separating with ,. So if we first wanted to have height as meters and then take the log, it would be:

NHANES %>% 
    mutate(Height = Height / 100,
           LoggedHeight = log(Height))
#> # A tibble: 10,000 x 77
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>     
#>  5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some Col…
#>  6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>     
#>  7 51646 2009_10  male       8 " 0-9"          101 White <NA>  <NA>     
#>  8 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  9 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> 10 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> # … with 9,990 more rows, and 68 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
#> #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
#> #   Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
#> #   BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>, LoggedHeight <dbl>

We can also have different values based on a logic conditions using if_else(). Use Table 7.2 to help with creating the logic condition.

NHANES %>%
    mutate(HighlyActive = if_else(PhysActiveDays >= 5, "yes", "no"))
#> # A tibble: 10,000 x 77
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  2 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  3 51624 2009_10  male      34 " 30-39"        409 White <NA>  High Sch…
#>  4 51625 2009_10  male       4 " 0-9"           49 Other <NA>  <NA>     
#>  5 51630 2009_10  female    49 " 40-49"        596 White <NA>  Some Col…
#>  6 51638 2009_10  male       9 " 0-9"          115 White <NA>  <NA>     
#>  7 51646 2009_10  male       8 " 0-9"          101 White <NA>  <NA>     
#>  8 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#>  9 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> 10 51647 2009_10  female    45 " 40-49"        541 White <NA>  College …
#> # … with 9,990 more rows, and 68 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
#> #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
#> #   Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
#> #   BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>, HighlyActive <chr>

Recall that the original dataset doesn’t change. If we want the added variable to be included we must assign it to something with <-. So putting it all together:

NHANES_update <- NHANES %>%
    mutate(Height = Height / 100,
           LoggedHeight = log(Height),
           HighlyActive = if_else(PhysActiveDays >= 5, "Yes", "No"))

7.12 Exercise: Piping, filtering, and mutating

Time: 15 min

Copy and paste the code below into the script exercises-wrangling.R. Then start replacing the ___ with the appropriate code to complete the tasks below. (Suggestion: Create a new “Section” in the R script for this exercise by using Ctrl-Shift-R).

Filter NHANES so only those with a BMI of more than or equal to 20 and less than or equal to 40 and keep those who have diabetes.
Create a new variable called UrineVolAverage by calculating the average urine volumne (from UrineVol1 and UrineVol2).
- Comment: After creating the UrineVolAverage column, check out values. What do you nothice? Compare with the original urine columns.
Create a new variable called YoungChild when age is less than 6 years.

# 1. BMI between 20 and 40 and who have diabetes
NHANES %>%
    # format: variable >= number or character
    filter(___ >= ___ & ___ <= ___ & ___ == ___)

# Pipe the data into mutate function and:
NHANES_modified <- ___ %>% # dataset
    mutate(
        # 2. Calculate average urine volume
        ___ = ___,
        # 3. Create YoungChild variable using a condition
        ___ = if_else(___, "Yes", "No")
    )
NHANES_modified

Click for a possible solution

# 1. BMI between 20 and 40 and who have diabetes
NHANES %>%
    # format: variable >= number
    filter(BMI >= 20 & BMI <= 40 & Diabetes == "Yes")
#> # A tibble: 616 x 76
#>       ID SurveyYr Gender   Age AgeDecade AgeMonths Race1 Race3 Education
#>    <int> <fct>    <fct>  <int> <fct>         <int> <fct> <fct> <fct>    
#>  1 51702 2009_10  male      44 " 40-49"        539 White <NA>  9 - 11th…
#>  2 51702 2009_10  male      44 " 40-49"        539 White <NA>  9 - 11th…
#>  3 51707 2009_10  female    64 " 60-69"        771 Other <NA>  8th Grade
#>  4 51711 2009_10  female    59 " 50-59"        718 Other <NA>  8th Grade
#>  5 51711 2009_10  female    59 " 50-59"        718 Other <NA>  8th Grade
#>  6 51748 2009_10  male      56 " 50-59"        682 Mexi… <NA>  8th Grade
#>  7 51752 2009_10  female    78 " 70+"          937 Mexi… <NA>  8th Grade
#>  8 51819 2009_10  male      80  <NA>            NA White <NA>  9 - 11th…
#>  9 51828 2009_10  female    66 " 60-69"        800 Black <NA>  9 - 11th…
#> 10 51828 2009_10  female    66 " 60-69"        800 Black <NA>  9 - 11th…
#> # … with 606 more rows, and 67 more variables: MaritalStatus <fct>,
#> #   HHIncome <fct>, HHIncomeMid <int>, Poverty <dbl>, HomeRooms <int>,
#> #   HomeOwn <fct>, Work <fct>, Weight <dbl>, Length <dbl>, HeadCirc <dbl>,
#> #   Height <dbl>, BMI <dbl>, BMICatUnder20yrs <fct>, BMI_WHO <fct>,
#> #   Pulse <int>, BPSysAve <int>, BPDiaAve <int>, BPSys1 <int>, BPDia1 <int>,
#> #   BPSys2 <int>, BPDia2 <int>, BPSys3 <int>, BPDia3 <int>, Testosterone <dbl>,
#> #   DirectChol <dbl>, TotChol <dbl>, UrineVol1 <int>, UrineFlow1 <dbl>,
#> #   UrineVol2 <int>, UrineFlow2 <dbl>, Diabetes <fct>, DiabetesAge <int>,
#> #   HealthGen <fct>, DaysPhysHlthBad <int>, DaysMentHlthBad <int>,
#> #   LittleInterest <fct>, Depressed <fct>, nPregnancies <int>, nBabies <int>,
#> #   Age1stBaby <int>, SleepHrsNight <int>, SleepTrouble <fct>,
#> #   PhysActive <fct>, PhysActiveDays <int>, TVHrsDay <fct>, CompHrsDay <fct>,
#> #   TVHrsDayChild <int>, CompHrsDayChild <int>, Alcohol12PlusYr <fct>,
#> #   AlcoholDay <int>, AlcoholYear <int>, SmokeNow <fct>, Smoke100 <fct>,
#> #   Smoke100n <fct>, SmokeAge <int>, Marijuana <fct>, AgeFirstMarij <int>,
#> #   RegularMarij <fct>, AgeRegMarij <int>, HardDrugs <fct>, SexEver <fct>,
#> #   SexAge <int>, SexNumPartnLife <int>, SexNumPartYear <int>, SameSex <fct>,
#> #   SexOrientation <fct>, PregnantNow <fct>

# Pipe the data into mutate function and:
NHANES_modified <- NHANES %>% # dataset
    mutate(
        # 2. Calculate average urine volume
        UrineVolAverage = (UrineVol1 + UrineVol2) / 2,
        # 3. Create YoungChild variable using a condition
        YoungChild = if_else(Age < 6, "Yes", "No")
    )

For the “UrineVolAverage” values, they are probably almost entirely NA (aka missing). That’s because NA values are infectious. See how “UrineVol2” has mostly NA values too? When you calculate something that has NA, you get another NA. This is something to be careful about. So always check your calculations!

7.13 Split-apply-combine: Summarizing data

Take 5 min to read through parts of this section, before we continue.

Summarizing or applying simple (or complex) statistics to data is (obviously) a key component of any analysis. Simple summaries or statistics can be done either on all the data or on groups of it. There are many data analysis tasks that can be approached using the split-apply-combine method: split the data into groups, apply some analysis to each group, and then combine the results together.

In dplyr, to summarize on all the data, you would use the function summarize(). If you want to do a split-apply-combine (e.g. find the max height of females and males) analysis, you would use the functions group_by() and then summarize(). Using group_by() splits the data up and summarise() then applies an analysis and immediately combines it back together.

The first position arguments to group_by() is, as usual, the dataset. The next arguments are the columns that contain the values you want to group by. These columns must contain categorical data (e.g. Sex). On its own, group_by() does nothing but instead works with other functions.

As with the other functions, summarise() takes the data as the first position argument. The next arguments work similar to mutate() with one difference: the output must create a single value (e.g. a max or a mean). Like mutate(), you can add multiple “summaries” by adding new columns separated by comma ,.

Simple statistics include: min(), max(), mean(), median(), sd().

Ok, let’s get back together and try this out. Let’s calculate the max age. Because NA values “propogate” (if there is one missing, then a max or mean will be missing), we need to tell max() to exclude NA using na.rm = TRUE.

NHANES %>%
    summarize(MaxAge = max(Age, na.rm = TRUE))
#> # A tibble: 1 x 1
#>   MaxAge
#>    <int>
#> 1     80

Add another summary column with ,.

NHANES %>%
    summarize(MaxAge = max(Age, na.rm = TRUE),
              MinAge = min(Age, na.rm = TRUE))
#> # A tibble: 1 x 2
#>   MaxAge MinAge
#>    <int>  <int>
#> 1     80      0

This is partly useful. But it really shines when combined with group_by(). Let’s find out the mean age and BMI between those with and without Diabetes.

NHANES %>%
    group_by(Diabetes) %>% 
    summarise(MeanAge = mean(Age, na.rm = TRUE),
              MeanBMI = mean(BMI, na.rm = TRUE))
#> # A tibble: 3 x 3
#>   Diabetes MeanAge MeanBMI
#>   <fct>      <dbl>   <dbl>
#> 1 No         35.4     26.2
#> 2 Yes        59.2     32.6
#> 3 <NA>        1.20    29.6

Quick note: If you are using dplyr version 1.0.0 (the latest), you’ll get a message informing you that it is regrouping output. This is simply a message and can be ignored. If you don’t want the message displayed, write options(dplyr.summarise.inform = FALSE) at the top of your script and run it.

We get a warning about there being missing values in Diabetes, so let’s first remove rows that have missing Diabetes status.

NHANES %>%
    # Recall ! is "NOT", so !is.na means "is not missing"
    filter(!is.na(Diabetes)) %>% 
    group_by(Diabetes) %>% 
    summarise(MeanAge = mean(Age, na.rm = TRUE),
              MeanBMI = mean(BMI, na.rm = TRUE))
#> # A tibble: 2 x 3
#>   Diabetes MeanAge MeanBMI
#>   <fct>      <dbl>   <dbl>
#> 1 No          35.4    26.2
#> 2 Yes         59.2    32.6

Cool! But we can add more columns to the grouping. Let’s compare mean age and BMI by sex and diabetes status. We should probably also rename gender to sex.

NHANES %>%
    rename(Sex = Gender) %>% 
    filter(!is.na(Diabetes)) %>% 
    group_by(Diabetes, Sex) %>% 
    summarise(MeanAge = mean(Age, na.rm = TRUE),
              MeanBMI = mean(BMI, na.rm = TRUE))
#> # A tibble: 4 x 4
#> # Groups:   Diabetes [2]
#>   Diabetes Sex    MeanAge MeanBMI
#>   <fct>    <fct>    <dbl>   <dbl>
#> 1 No       female    36.5    26.2
#> 2 No       male      34.3    26.1
#> 3 Yes      female    59.9    33.7
#> 4 Yes      male      58.6    31.5

7.14 Converting between wide and long data

We’ve covered the basic “verbs” (functions) of the dplyr package. Now we’ll get into the “pivot” functions from dplyr’s companion package, tidyr. There are many useful functions in the tidyr package, but the pivot functions (pivot_longer() and pivot_wider()) are key ones. Pivoting converts wide data into long data or vice versa. So what is wide or long date?

Wide data is data where values may repeat across columns. With repeated measurements, it is often easier to enter data in wide form or to use wide data to present in tables. But there are problems with wide data, especially when it comes to analysing it. For instance, wide data may look like:

Table 7.3: Example of a **wide** dataset that is useful for data entry.
PersonID	Glucose_0	Glucose_30	Glucose_60
1	5.6	7.8	4.5
2	4.7	9.5	5.3
3	5.1	10.2	4.2

However, this type of data is not tidy for a few reasons:

We don’t know precisely what the values represent in the glucose columns (though in this case we could guess, but this isn’t always the case). Often in the wide form you need to rely a lot more on either very descriptive names or have a detailed data dictionary to refer to.
The glucose columns all represent the same value type (glucose concentration) so there is some duplication of meaning between columns.
The column names include data in them as well (time of glucose measurement).

On the other hand, a long data form is usually better suited for almost any type of analysis and for visualizing, especially when doing split-apply-combine techniques. Long form data also tends to be more tidy compared to wide form.

Table 7.4: Example of a **long** dataset that is more usable for analyses and visualizing.
PersonID	MeasurementTime	GlucoseConcentration
1	0	5.6
1	30	7.8
1	60	4.5
2	0	4.7
2	30	9.5
2	60	5.3
3	0	5.1
3	30	10.2
3	60	4.2

7.14.1 Pivot from wide to long

How, when, and why to pivot your data can be conceptually challenging to grasp at first. Let’s try out some examples and use pivot_longer(). Like all the other functions, the first position argument to pivot_longer() is the data. The other necessary arguments (in order) are:

cols: The columns to use to convert to long form. The input is a vector made using c() that contains the column names, like you would use in select() (e.g. you can use the select_helpers like starts_with(), or - minus to exclude).
names_to: The name of the newly created column (as a quoted character) that contains the original column names.
values_to: The name of the newly created column (as a quoted character) that contains the original cells of the original columns.

As with everything, using an example would help clarify things. In the NHANES dataset, there are several columns that would be suitable for pivoting, because they are “messy”. These are the BP blood pressure columns. Let’s select the required ID and SurveyYr and the BP columns (we’ll exclude the Ave ones for now), then we’ll use pivot_longer(). Because we only want to pivot the BP, we need to exclude ID and SurveyYr from pivoting by using -.

NHANES %>%
    # Recall that - (minus) is used to exclude
    select(ID, SurveyYr, starts_with("BP"), -ends_with("Ave")) %>% 
    pivot_longer(c(-ID, -SurveyYr), names_to = "BPTypeAndNumber", values_to = "BloodPressure")
#> # A tibble: 60,000 x 4
#>       ID SurveyYr BPTypeAndNumber BloodPressure
#>    <int> <fct>    <chr>                   <int>
#>  1 51624 2009_10  BPSys1                    114
#>  2 51624 2009_10  BPDia1                     88
#>  3 51624 2009_10  BPSys2                    114
#>  4 51624 2009_10  BPDia2                     88
#>  5 51624 2009_10  BPSys3                    112
#>  6 51624 2009_10  BPDia3                     82
#>  7 51624 2009_10  BPSys1                    114
#>  8 51624 2009_10  BPDia1                     88
#>  9 51624 2009_10  BPSys2                    114
#> 10 51624 2009_10  BPDia2                     88
#> # … with 59,990 more rows

We use - here to tell pivot_longer() to not include (to exclude) the columns from being converted to long form. We could even use starts_with():

NHANES %>%
    # Recall that - (minus) is used to exclude
    select(ID, SurveyYr, starts_with("BP"), -ends_with("Ave")) %>% 
    pivot_longer(starts_with("BP"), names_to = "BPTypeAndNumber", values_to = "BloodPressure")
#> # A tibble: 60,000 x 4
#>       ID SurveyYr BPTypeAndNumber BloodPressure
#>    <int> <fct>    <chr>                   <int>
#>  1 51624 2009_10  BPSys1                    114
#>  2 51624 2009_10  BPDia1                     88
#>  3 51624 2009_10  BPSys2                    114
#>  4 51624 2009_10  BPDia2                     88
#>  5 51624 2009_10  BPSys3                    112
#>  6 51624 2009_10  BPDia3                     82
#>  7 51624 2009_10  BPSys1                    114
#>  8 51624 2009_10  BPDia1                     88
#>  9 51624 2009_10  BPSys2                    114
#> 10 51624 2009_10  BPDia2                     88
#> # … with 59,990 more rows

The reason that the arguments names_to and values_to require quoting "" is that these are the column names that we will create. Because they don’t exist yet as columns, we need to use the quotes.

7.14.2 Pivot from long to wide

You can also convert from long to wide, though this is less commonly done, as most analyses either work better or require the long form. It can also be much more tricky to convert over to. But sometimes you may need to have a wide form for your data. Here you can use pivot_wider() function. Like its opposite, the first position argument is the data and the other necessary arguments are:

id_cols: This is optional as it will default to all column names. This argument tells pivot_wider() to use the given columns as the identifiers for when converting. This is where the tricky part comes in, because
names_from: Similar to the pivot_longer(), this is the name of the column that will make up the new columns. Unlike in pivot_longer(), the column name given is unquoted since the column must already exist in the dataset.
values_from: As with above, this is the column name (that exists and must be given unquoted) for the values that will be in the new columns.

Unfortunately, NHANES as it is doesn’t have a structure that allows us to easily convert to wide form. So we’ll use the table2 example from the beginning of the wrangling section:

table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583

This data frame is in a very long format. So we could pivot type and count into the wide format:

table2 %>% 
    pivot_wider(names_from = type, values_from = count)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

If we wanted to make it even wider, we could include year and type in the pivoting by wrapping them with c() in the names_from argument:

table2 %>% 
    pivot_wider(names_from = c(year, type), values_from = count)
#> # A tibble: 3 x 5
#>   country     `1999_cases` `1999_population` `2000_cases` `2000_population`
#>   <chr>              <int>             <int>        <int>             <int>
#> 1 Afghanistan          745          19987071         2666          20595360
#> 2 Brazil             37737         172006362        80488         174504898
#> 3 China             212258        1272915272       213766        1280428583

The key to using pivot_wider() is that there are uniquely identifying rows that allow pivoting that maintains the integrity of the data. Since the NHANES dataset we use is for teaching purposes, there are some cases where the same person is recorded multiple times in one survey year, which doesn’t make sense and prevents us from adequately pivoting wider.

7.15 Pivot, then split-apply-combine

The real strength of pivoting is when you use it with the split-apply-combine method. For instance, if we wanted to find some simple statistics of all the columns by survey year and sex. In this case, we would use pivot_longer() to convert to a long form and then use group_by() and summarize() to find the simple statistics. So let’s find the mean values of some continuous variables.

nhanes_mean_values <- NHANES %>%
    rename(Sex = Gender) %>%
    select(SurveyYr, Sex, BMI, Age, starts_with("BP")) %>%
    pivot_longer(c(-SurveyYr, -Sex),
                 names_to = "Variables",
                 values_to = "Values") %>%
    group_by(SurveyYr, Sex, Variables) %>% 
    summarize(MeanValues = mean(Values, na.rm = TRUE))
nhanes_mean_values
#> # A tibble: 40 x 4
#> # Groups:   SurveyYr, Sex [4]
#>    SurveyYr Sex    Variables MeanValues
#>    <fct>    <fct>  <chr>          <dbl>
#>  1 2009_10  female Age             38.0
#>  2 2009_10  female BMI             27.0
#>  3 2009_10  female BPDia1          66.3
#>  4 2009_10  female BPDia2          65.6
#>  5 2009_10  female BPDia3          65.2
#>  6 2009_10  female BPDiaAve        65.5
#>  7 2009_10  female BPSys1         117. 
#>  8 2009_10  female BPSys2         116. 
#>  9 2009_10  female BPSys3         115. 
#> 10 2009_10  female BPSysAve       116. 
#> # … with 30 more rows

We could now use pivot_wider() since the structure allows for it:

nhanes_mean_values %>% 
    pivot_wider(names_from = Variables, values_from = MeanValues)
#> # A tibble: 4 x 12
#> # Groups:   SurveyYr, Sex [4]
#>   SurveyYr Sex     Age   BMI BPDia1 BPDia2 BPDia3 BPDiaAve BPSys1 BPSys2 BPSys3
#>   <fct>    <fct> <dbl> <dbl>  <dbl>  <dbl>  <dbl>    <dbl>  <dbl>  <dbl>  <dbl>
#> 1 2009_10  fema…  38.0  27.0   66.3   65.6   65.2     65.5   117.   116.   115.
#> 2 2009_10  male   35.5  26.7   69.0   68.4   67.7     67.9   121.   120.   120.
#> 3 2011_12  fema…  37.3  26.5   67.8   67.2   67.2     67.3   118.   117.   117.
#> 4 2011_12  male   36.2  26.4   70.0   69.5   69.0     69.3   121.   121.   120.
#> # … with 1 more variable: BPSysAve <dbl>

Which now gives us the mean values of the variables by sex and survey year!

7.16 Saving datasets as files

This will be a very short section. Sometimes you’ll need or want to save the dataset you’ve been working on, maybe because you’ve done a lot of cleaning to it, preparing it for later analyses, or because you’ve ran an analysis and want to save the results. Either way, a recommended way of saving your dataset is to use the usethis::use_data() function. Let’s do a very simple example:

nhanes_bmi_only <- NHANES %>% 
    select(BMI)
usethis::use_data(nhanes_bmi_only, overwrite = TRUE)

The usethis::use_data() function outputs some information, the last of which (“Document your data”) we won’t cover in this course. The function takes any number of datasets and saves each individually as a .rda R dataset file in the data/ folder. You load the dataset and use it again, but before we do, let’s restart the R session with either Ctrl-Shift-F10 or with the menu item “Session -> Restart R”. Now, type out and run this:

load(here::here("data/nhanes_bmi_only.rda"))

You should see this dataset in the Environment tab. This is how you save and load data.

7.17 Final exercise: Group work

Time: ~30 min

This exercise has two aims: to get working on and completing the group project, and to get you practicing using the dplyr and tidyr functions we covered today. First, take maximum 10 min to:

As a group, complete item 2 of the group assignment (to jump quickly to the assignment, run r3::open_assignment() in the RStudio Console).
Individually, open your group R project and complete item 3 of the group assignment. Follow the appropriate filenaming conventions as we learned in the Project Management session for the dataset (e.g. don’t use spaces, instead use either - or _).

With the remaining time:

As a group, complete item 4 of the group assignment. Explore the data and use all the functions we’ve covered in this session to better understand the data you’ll work with. You can (and probably should) divide exploratory tasks between group members, so that you have a good understanding as a group of the data.

Tip: Make use of TAB auto-completion when typing out the dplyr and tidyr functions for wrangling the data to speed up your coding and to get help if needed. For instance, type out sel, hit TAB, and see the list of possible functions. Choose the right item in the menu and hit TAB again to finish the function. If you want to see a list of other functions in dplyr, type out dplyr:: and then hit TAB. You’ll now have a list of functions inside the dplyr package.