2 Introductory Data Transformation with dplyr

This MVP covers

Learning the basics of assigning values to variables
Installing and loading packages from CRAN and GitHub
Accessing datasets and understanding the basic elements of a table using the dplyr filter() function, with expressions that filter a table to only the rows you need
Reordering rows with the dplyr arrange() function performing several column-selection operations with dplyr’s select() function
Creating new table columns (or modifying existing ones) with mutate() and carefully crafted expressions
Summarizing tabular data with the combination of group_by() and summarize()

We will start off by getting our bearings with assignment of variables in R. Later, we’ll certainly get the opportunity to learn more programming fundamentals, however assignment is one of those topics that should be addressed right away. Then, the installation of packages will be covered. This will give us the chance to install some of required packages: tidyverse, dspatterns, and devtools.

The rest of the section is devoted to learning a skill that is very important: transforming tabular data. Of all the skills you will learn, the value of this one cannot be understated. A bit more than the usual amount of time will and should be spent on this.

The dspatterns package will give us the dataset needed for the examples in this section: sw. This dataset is a modified version of the starwars dataset that is available in the dplyr package. Whenever we load in the dspatterns package using library(dspatterns), the sw table will be available for use. This particular dataset has 87 rows, one for each Star Wars character, and the following 8 variables (or columns):

name: the name of the character
height, mass: the character’s height and weight (in centimeters and kilograms)
hair_color: a description of the character’s hair color, where available
gender, homeworld, species: the character’s gender, homeworld (i.e., a planet name), and species

We will learn how to install packages like tidyverse, dspatterns, and devtools. Once you’ve gotten the hang of that (and those packages are installed), more information on the sw dataset can be obtained by executing ?sw in the RStudio console (a help page will appear).

2.1 Programming Basics and Package Installation

Before we get to the fun stuff, it’s best if we first get acquainted with two things: (1) the programming concept of assignment, and (2) installing packages. We will have a more in-depth look at R programming in later sections and just touch upon the basics for now. In R, installing and using packages is commonplace and thus a fundamental skill, so a few packages will be installed in this section.

2.1.1 The Basics of Assignment

Let’s create a new R script using the New File icon in the RStudio toolbar. Now we can type lines of code in the open R Script document in the Source pane, and, also execute individual statements or regions of code by selecting them (as you would select text in a word processor) and using the Control + Enter shortcut.

We assign values to variables (that we name ourselves) by using the leftward arrow <-. Many different types of data can be assigned to a variable. They can be single values, a vector of values (within the c() function, which combines individual values of the same type), or, more complex objects like tables (we’ll get to that in the next section). Let’s have a look at three examples.

We can assign the single number 5 to the variable called a_number. Executing the line with just the variable name prints the value 5 in the console (the assignment line itself will show nothing in the console).

Assigning a single number to a_number.

Notes on the Code

L.1 This is the assignment of the number 5 to the variable called a_number.
L.2 Executing just a_number by itself returns the values held by the variable.

a_number <- 5
a_number

[1] 5

Just below the code listing, the console output is shown ([1] 5 in this case).

Multiple values (of the same type, like numbers or character strings) can be put together (with the c() function) as a vector and assigned to a single variable. An example of this is shown in the following code listing.

Assigning a vector of three numbers to three_numbers.

Notes on the Code

L.1 Assignment of a vector of numeric values (using c(), to combine). The name three_numbers is just a name that was made up; it could easily have been called foo.

three_numbers <- c(1, 2, 3)
three_numbers

[1] 1 2 3

Shown again, just above, is the console output. We are getting back the values we put into three_numbers.

Character strings can be assigned to variables. In the next example, the single string "hello" is assigned to one_word.

Assigning a character string to one_word.

Notes on the Code

L.1 We need quotes to distinguish character strings from variables. While you can use single quotes, I recommend adopting double quotes as used here.

one_word <- "hello"
one_word

[1] "hello"

The console output shown above consists of all values in one_word: #> [1] "hello".

A good way to check that the variables were actually defined is to look at the Environment tab in the top-right pane of RStudio. The names of the variables will appear here as they are assigned, and, you also get an indication of the value attached to the variable.

The *Environment* tab in **RStudio** is a great way to quickly see which variables have been assigned and what their values are.

We can overwrite any variable currently assigned. This is done by assigning a value to any variable name that was previously assigned. In the next example, we’ll overwrite the one_word variable with "hi" and inspect the variable by printing it.

Replacing the one_word variable with a different value.

Notes on the Code

L.1 Assigning any value to an already assigned variable. Here, R will simply overwrite the previous value in one_word, and, R won't inform you that one_word was previously assigned.

one_word <- "hi"
one_word

[1] "hi"

The return value in the console is now "hi", or, #> [1] "hi". The Environment tab in RStudio will show us this change after executing the first statement in the previous code listing. The execution, or printing, of just the variable one_word further confirms this in the console.

2.2 Using dplyr to Transform Data

While visualizing your data is useful for generating insights, it’s not common that you’ll initially have the data in exactly the form you need. It’s far more likely that you first will have to create some new variables or generate some summaries before plotting the data. So, we’ll get to visualization in the next section and learn all transforming our data first. The process of data transformation is made easier in R by using the dplyr package. We’ll learn how to use only a few functions from dplyr to modify a dataset in several useful ways.

2.2.1 The Main dplyr Functions

There are five key dplyr functions that can help us solve the vast majority of data manipulation tasks:

filter() — Pick observations by their values
arrange() — Reorder the rows
select() — Pick variables by their names
mutate() — Create new variables with functions of existing variables
summarize() — Collapse many rows down to a single summary

These can be used in conjunction with the group_by() function, which tells dplyr that it should be operating on data group-by-group (i.e., with collections of rows). Taken together, these functions enable a cohesive language of data manipulation.

All these functions generally work the same, where

the first argument is the data
the subsequent arguments describe what to do with the data
the result is a new data table

Eventually we’ll see that we can chain together multiple simple steps to achieve a complex result but, first, let’s just see how each of these functions work in isolation.

Through the rest of this section, let’s work with dplyr in the context of an R Markdown document. In the following code listing, we’ll create a new R Markdown document and include this R code chunk just after the YAML header.

Loading the tidyverse and dspatterns packages.

Notes on the Code

L.1 Loading the tidyverse package like this actually auto-loads all core tidyverse packages (this is almost everything we'll need for most analyses!).
L.2 The dspatterns package is this book's namesake package. It has the datasets we need for all of the examples.

library(tidyverse)
library(dspatterns)

The following is the console output which, in an R Markdown document, appears just below the code chunk (as chunk output).

#> ── Attaching packages ──────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
#> ✔ tibble  1.4.2     ✔ dplyr   0.7.6
#> ✔ tidyr   0.8.1     ✔ stringr 1.3.1
#> ✔ readr   1.1.1     ✔ forcats 0.3.0
#> package 'dplyr' was built under R version 3.5.1── Conflicts
#> ───────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

After running the chunk (using Shift + Control + Enter or the Run Current Chunk button), you might see messages relating to conflicts: this is entirely normal and expected (so don’t worry about it). Now these packages are loaded, and all of their functions are available to use.

We’ll continue the practice of writing new R code chunks and running them interactively in an R Markdown document. The Environment pane in RStudio will fill up with variables and objects during this progression. Whenever in doubt about whether a variable has or hasn’t been assigned during this sort of ad-hoc data analysis, it is recommended to use the RStudio menu option Session >> Restart R and Clear Output. This provides a clean slate where you can then run the chunks from top to bottom.

2.2.2 The `sw` Dataset and the Elements of a Table

The dataset with the name sw is provided by the dspatterns package. That is, once dspatterns is loaded in the session with library(dspatterns), the sw dataset will be available. R considers datasets to be important components of packages since they are valuable for testing out a package’s functions. It may seem odd, but we can treat sw like a variable that doesn’t have to be explicitly assigned by the user. In the following code listing, we see that just by entering sw in the console, its output is provided.

The sw dataset can be printed to the console by using sw.

Notes on the Code

L.1 So long as we have the dspatterns package loaded in (with library(dspatterns)), we have access to the sw dataset. Executing sw (a tibble) prints it and shows us some of its values.

sw

# A tibble: 87 × 8
   name              height  mass hair_color birth_year gender homeworld species
   <chr>              <int> <dbl> <chr>           <dbl> <chr>  <chr>     <chr>  
 1 Luke Skywalker       172    77 blond            19   mascu… Tatooine  Human  
 2 C-3PO                167    75 <NA>            112   mascu… Tatooine  Droid  
 3 R2-D2                 96    32 <NA>             33   mascu… Naboo     Droid  
 4 Darth Vader          202   136 none             41.9 mascu… Tatooine  Human  
 5 Leia Organa          150    49 brown            19   femin… Alderaan  Human  
 6 Owen Lars            178   120 brown, gr…       52   mascu… Tatooine  Human  
 7 Beru Whitesun la…    165    75 brown            47   femin… Tatooine  Human  
 8 R5-D4                 97    32 <NA>             NA   mascu… Tatooine  Droid  
 9 Biggs Darklighter    183    84 black            24   mascu… Tatooine  Human  
10 Obi-Wan Kenobi       182    77 auburn, w…       57   mascu… Stewjon   Human  
# ℹ 77 more rows

The sw dataset is a tibble object, a special implementation of a data frame (a tibble prints its data differently, filling the entire width of the console). We’ll learn a bit more about tibbles in a later section, especially in how else they differ from data frames. For sw, there are more rows in the data than what is shown (we see 6 rows, but there are 81 more that aren’t displayed).

The returned console output for sw has a lot of information. This is a good opportunity to take some time to understand this type of output. The very first line describes the type of object returned, which is a tibble, and it also provides the dimensions of the table (87 rows and 8 columns, or, variables). The second line shows the table’s column names, and the line below that describes each column’s type:

<chr> is for character (or text) data
<int> is for integer (numbers without decimal places),
<dbl> is for double-precision values which are numeric, except they have a decimal component unlike integers

Depending on the width of your editing pane, the printing of the tibble might show more or less of the table data. Columns that don’t fit in the print area are mentioned at the bottom (e.g., ... with 3 more variables: ...) and, with smaller widths of the console or R Markdown editing pane, data in some columns will be more truncated (e.g., showing Obi-Wan Ken... instead of the full name of Obi-Wan Kenobi). While this is intentional, we may want to inspect just a little bit more of the data values. One quick fix is to manually widen the output area (by dragging the center, vertical pane in RStudio left or right) and print the output again. The extra width will provide more space for the individual columns and the output may even show you data in more columns.

Sometimes, you’ll discover a dataset in a package, look at it, and have no context for what the columns are supposed to be indicating. Datasets in R usually have internal help documents that explain the dataset in some detail. Try typing in help(sw) or ?sw in the console, the help document will appear in the Help section in the bottom-right pane in RStudio.

You may have noticed a few of the values in the printed dataset show as NA. Those indicate missing values, perhaps where no observation exists for a particular variable in a row. We can have NA values for all different types of columns (e.g., character, numeric, etc.).

2.2.3 filter: Picking Observations by their Values

The filter() function allows you to generate a subset of rows with a filtering expression. Let’s filter the sw dataset in the same R Markdown document that we started earlier and get back only those rows where the species is "Droid". The expression to use is species == "Droid". The first argument is the name of the data to transform: sw.

The filter() function, and filtering by a single value in the species column of the sw dataset.

Notes on the Code

L.1 The first value given to filter() is the data, which is sw; the filtering expression comes next.

filter(sw, species == "Droid")

# A tibble: 6 × 8
  name   height  mass hair_color birth_year gender    homeworld species
  <chr>   <int> <dbl> <chr>           <dbl> <chr>     <chr>     <chr>  
1 C-3PO     167    75 <NA>              112 masculine Tatooine  Droid  
2 R2-D2      96    32 <NA>               33 masculine Naboo     Droid  
3 R5-D4      97    32 <NA>               NA masculine Tatooine  Droid  
4 IG-88     200   140 none               15 masculine <NA>      Droid  
5 R4-P17     96    NA none               NA feminine  <NA>      Droid  
6 BB8        NA    NA none               NA masculine <NA>      Droid

The table prints out after this transformation (because it wasn’t assigned to a variable). We see from the output that returned table seems to have just those species that are of the "Droid" type.

Filtering expressions are often performed by using comparison operators. The main ones are:

== (equal),
!= (not equal),
>, >= (greater than and greater than or equal to),
<, <= (less than and less than or equal to),
%in% (part of a set)

Let’s look at an example using a greater than (>) comparison operator. To get a filtered table with all characters with a height more than 220 centimeters, we need to use height > 220.

Filtering to only keep rows where the height column has values greater than 220.

filter(sw, height > 220)

# A tibble: 5 × 8
  name         height  mass hair_color birth_year gender    homeworld species 
  <chr>         <int> <dbl> <chr>           <dbl> <chr>     <chr>     <chr>   
1 Chewbacca       228   112 brown             200 masculine Kashyyyk  Wookiee 
2 Roos Tarpals    224    82 none               NA masculine Naboo     Gungan  
3 Yarael Poof     264    NA none               NA masculine Quermia   Quermian
4 Lama Su         229    88 none               NA masculine Kamino    Kaminoan
5 Tarfful         234   136 brown              NA masculine Kashyyyk  Wookiee

Just by scanning the height column we can see that all values are greater than 220.

There are also logical operators and they can be used to link together filtering statements. These operators consist of the following set:

& (the AND operator)
| (the OR operator)

The following code filters all Star Wars characters in the sw table to those that either have height above 210 or a mass above 120.

Filtering characters by either height above 210, or, mass above 120.

filter(sw, height > 210 | mass > 120)

# A tibble: 10 × 8
   name              height  mass hair_color birth_year gender homeworld species
   <chr>              <int> <dbl> <chr>           <dbl> <chr>  <chr>     <chr>  
 1 Darth Vader          202   136 none             41.9 mascu… Tatooine  Human  
 2 Chewbacca            228   112 brown           200   mascu… Kashyyyk  Wookiee
 3 Jabba Desilijic …    175  1358 <NA>            600   mascu… Nal Hutta Hutt   
 4 IG-88                200   140 none             15   mascu… <NA>      Droid  
 5 Roos Tarpals         224    82 none             NA   mascu… Naboo     Gungan 
 6 Yarael Poof          264    NA none             NA   mascu… Quermia   Quermi…
 7 Lama Su              229    88 none             NA   mascu… Kamino    Kamino…
 8 Taun We              213    NA none             NA   femin… Kamino    Kamino…
 9 Grievous             216   159 none             NA   mascu… Kalee     Kaleesh
10 Tarfful              234   136 brown            NA   mascu… Kashyyyk  Wookiee

We get back 10 rows (out of 87) with the filtering expression used here. Note that NA values in the mass column are preserved. These rows are present because even though mass is clearly not greater than 120 (it’s missing!), height is still greater than 210.

The next code listing provides another common filtering scenario, where we get just the "Droid" and "Human" species. This subset table can be gotten with at least two different filter() statements. Here is an example that uses two expressions with an OR (|) operator.

Get all Droid and Human characters (method 1 with a [expr | expr] construction).

Notes on the Code

L.1 This is our second use of the OR operator (|), which can be confusing until you think of it in terms of 'include either this or that'.

filter(sw, species == "Droid" | species == "Human")

# A tibble: 41 × 8
   name              height  mass hair_color birth_year gender homeworld species
   <chr>              <int> <dbl> <chr>           <dbl> <chr>  <chr>     <chr>  
 1 Luke Skywalker       172    77 blond            19   mascu… Tatooine  Human  
 2 C-3PO                167    75 <NA>            112   mascu… Tatooine  Droid  
 3 R2-D2                 96    32 <NA>             33   mascu… Naboo     Droid  
 4 Darth Vader          202   136 none             41.9 mascu… Tatooine  Human  
 5 Leia Organa          150    49 brown            19   femin… Alderaan  Human  
 6 Owen Lars            178   120 brown, gr…       52   mascu… Tatooine  Human  
 7 Beru Whitesun la…    165    75 brown            47   femin… Tatooine  Human  
 8 R5-D4                 97    32 <NA>             NA   mascu… Tatooine  Droid  
 9 Biggs Darklighter    183    84 black            24   mascu… Tatooine  Human  
10 Obi-Wan Kenobi       182    77 auburn, w…       57   mascu… Stewjon   Human  
# ℹ 31 more rows

The tibble that’s printed shows that, at least in the first 6 rows, each character is either a Human or a Droid.

Next, we provide the better way to get the same result, which is to look for occurrences in a set (defined with the help of c()). So, instead of using species twice in our statement, we can use the %in% operator in a single expression.

Get all Droid and Human characters (method 2 with [colname |> c(...)] construction).

Notes on the Code

L.1 This statement gets us exactly the same output as before. It might be an easier statement to write out and to reason about. We can think of it as 'the column must contain any of this set of values to be kept'.

filter(sw, species %in% c("Droid", "Human"))

# A tibble: 41 × 8
   name              height  mass hair_color birth_year gender homeworld species
   <chr>              <int> <dbl> <chr>           <dbl> <chr>  <chr>     <chr>  
 1 Luke Skywalker       172    77 blond            19   mascu… Tatooine  Human  
 2 C-3PO                167    75 <NA>            112   mascu… Tatooine  Droid  
 3 R2-D2                 96    32 <NA>             33   mascu… Naboo     Droid  
 4 Darth Vader          202   136 none             41.9 mascu… Tatooine  Human  
 5 Leia Organa          150    49 brown            19   femin… Alderaan  Human  
 6 Owen Lars            178   120 brown, gr…       52   mascu… Tatooine  Human  
 7 Beru Whitesun la…    165    75 brown            47   femin… Tatooine  Human  
 8 R5-D4                 97    32 <NA>             NA   mascu… Tatooine  Droid  
 9 Biggs Darklighter    183    84 black            24   mascu… Tatooine  Human  
10 Obi-Wan Kenobi       182    77 auburn, w…       57   mascu… Stewjon   Human  
# ℹ 31 more rows

We get the same output as before. The %in% operator checks for membership in a set (here, "Droid" and "Human" in the species column). This use of %in% is preferred over multiple | operators because there’s less redundancy, it’s much more readable, and it is less error prone to write.

2.2.4 arrange: Reordering Rows

Sometimes, you’ll want to sort your data. The arrange() function makes this possible and it works just like filter() except that instead of selecting rows, it changes their order. It takes the table and a set of column names to order by. If more than one column name is given, each additional column will be used to break ties (if there are still remaining ties, then the original ordering is preserved). Let’s arrange the sw dataset by the height of its characters.

Arranging characters by increasing height.

Notes on the Code

L.1 Like the filter() function (and really, this applies to all of dplyr's main functions), the first value to give to arrange() is the data: sw.

arrange(sw, height)

# A tibble: 87 × 8
   name              height  mass hair_color birth_year gender homeworld species
   <chr>              <int> <dbl> <chr>           <dbl> <chr>  <chr>     <chr>  
 1 Yoda                  66    17 white             896 mascu… <NA>      Yoda's…
 2 Ratts Tyerell         79    15 none               NA mascu… Aleen Mi… Aleena 
 3 Wicket Systri Wa…     88    20 brown               8 mascu… Endor     Ewok   
 4 Dud Bolt              94    45 none               NA mascu… Vulpter   Vulpte…
 5 R2-D2                 96    32 <NA>               33 mascu… Naboo     Droid  
 6 R4-P17                96    NA none               NA femin… <NA>      Droid  
 7 R5-D4                 97    32 <NA>               NA mascu… Tatooine  Droid  
 8 Sebulba              112    40 none               NA mascu… Malastare Dug    
 9 Gasgano              122    NA none               NA mascu… Troiken   Xexto  
10 Watto                137    NA black              NA mascu… Toydaria  Toydar…
# ℹ 77 more rows

We can see from this snippet of data (6 of 77 rows) that height is increasing (ascending) from top to bottom.

Arranging rows by multiple columns can be a common task and the next example arranges by two different character-based (<chr>) columns: hair_color and then gender.

Arranging characters by hair_color and then by gender.

Notes on the Code

L.1 We can keep adding column names, separated by commas, to further specify the priority of ordering.

arrange(sw, hair_color, gender)

# A tibble: 87 × 8
   name              height  mass hair_color birth_year gender homeworld species
   <chr>              <int> <dbl> <chr>           <dbl> <chr>  <chr>     <chr>  
 1 Mon Mothma           150  NA   auburn           48   femin… Chandrila Human  
 2 Wilhuff Tarkin       180  NA   auburn, g…       64   mascu… Eriadu    Human  
 3 Obi-Wan Kenobi       182  77   auburn, w…       57   mascu… Stewjon   Human  
 4 Shmi Skywalker       163  NA   black            72   femin… Tatooine  Human  
 5 Luminara Unduli      170  56.2 black            58   femin… Mirial    Mirial…
 6 Barriss Offee        166  50   black            40   femin… Mirial    Mirial…
 7 Biggs Darklighter    183  84   black            24   mascu… Tatooine  Human  
 8 Boba Fett            183  78.2 black            31.5 mascu… Kamino    Human  
 9 Lando Calrissian     177  79   black            31   mascu… Socorro   Human  
10 Watto                137  NA   black            NA   mascu… Toydaria  Toydar…
# ℹ 77 more rows

The sorting is alphabetical and first ordering on hair_color and then ordering on gender for those groups of rows with the same hair_color value. Notice that, in the output, for all the rows where hair_color is "black", the sorting of rows is by "female" first then "male" after that.

We can change the order of any variables to descending with desc(). Let’s modify the previous example so that hair_color is sorted in reverse and gender is still ascending in order.

Arranging by descending hair_color, then by ascending gender.

Notes on the Code

L.1 Arranging a column in descending order requires a modifier, and we use the desc() helper function to indicate that those columns that should be sorted in that way.

arrange(sw, desc(hair_color), gender)

# A tibble: 87 × 8
   name           height  mass hair_color birth_year gender    homeworld species
   <chr>           <int> <dbl> <chr>           <dbl> <chr>     <chr>     <chr>  
 1 Jocasta Nu        167    NA white              NA feminine  Coruscant Human  
 2 Yoda               66    17 white             896 masculine <NA>      Yoda's…
 3 Ki-Adi-Mundi      198    82 white              92 masculine Cerea     Cerean 
 4 Dooku             193    80 white             102 masculine Serenno   Human  
 5 Captain Phasma     NA    NA unknown            NA <NA>      <NA>      <NA>   
 6 Ayla Secura       178    55 none               48 feminine  Ryloth    Twi'lek
 7 Adi Gallia        184    50 none               NA feminine  Coruscant Tholot…
 8 Taun We           213    NA none               NA feminine  Kamino    Kamino…
 9 R4-P17             96    NA none               NA feminine  <NA>      Droid  
10 Shaak Ti          178    57 none               NA feminine  Shili     Togruta
# ℹ 77 more rows

We immediately notice that the first few rows have a hair_color value of "white" (i.e., near the end of the alphabet) and also that gender is indeed sorted in alphabetical order (within the hair_color subgroups).

A final note about the arrange() function: NA values will always be sorted last, and, using desc() doesn’t affect that. Why? Because there is nothing to sort by. NA values are lumped together at the end (although their initial ordering is preserved). And while we are on the topic of NA values, in the above tibble printing, one of the hair_color values should be assigned as an NA value. Which one? The "unknown" one for Captain Phasma. The hair_color values of "none" are perfectly fine as they are because those observations were actually made (of no hair color because, no hair).

2.2.5 select: Picking Variables by their Names

It’s not unheard of to be working with datasets having dozens, hundreds, or even thousands of variables. If you’re exploring such a dataset, at some point you’ll likely need to focus on a few variables at a time. The select() function allows us to obtain a useful subset of variables, and, it’s easy to use: provide the names of the variables you want to keep.

Let’s run through a few examples so we can get the general idea. The first example shows how we can get just 3 columns (name, gender, species) from the 8 available.

Selecting the name, gender, and species columns.

Notes on the Code

L.1 With select() we first supply the data (sw), and then the columns we want to keep. We separate each name with a comma.

select(sw, name, gender, species)

# A tibble: 87 × 3
   name               gender    species
   <chr>              <chr>     <chr>  
 1 Luke Skywalker     masculine Human  
 2 C-3PO              masculine Droid  
 3 R2-D2              masculine Droid  
 4 Darth Vader        masculine Human  
 5 Leia Organa        feminine  Human  
 6 Owen Lars          masculine Human  
 7 Beru Whitesun lars feminine  Human  
 8 R5-D4              masculine Droid  
 9 Biggs Darklighter  masculine Human  
10 Obi-Wan Kenobi     masculine Human  
# ℹ 77 more rows

Indeed, we find columns remain in the transformed table. Also note that the order in which the variables are provided is preserved in the output.

Because the order of variables matters, we can create a slightly different output table with the same columns. In the next code listing we will see that the order of the variables provided will result in same ordering of variables in the output table: gender, species, and then name.

Selecting the gender, species, and names columns (in that order).

Notes on the Code

L.1 We are keeping the same columns, but they appear (left to right) in the order we specify: gender, species, name.

select(sw, gender, species, name)

# A tibble: 87 × 3
   gender    species name              
   <chr>     <chr>   <chr>             
 1 masculine Human   Luke Skywalker    
 2 masculine Droid   C-3PO             
 3 masculine Droid   R2-D2             
 4 masculine Human   Darth Vader       
 5 feminine  Human   Leia Organa       
 6 masculine Human   Owen Lars         
 7 feminine  Human   Beru Whitesun lars
 8 masculine Droid   R5-D4             
 9 masculine Human   Biggs Darklighter 
10 masculine Human   Obi-Wan Kenobi    
# ℹ 77 more rows

The output table shows that the order of columns is that which was specified in the select() call.

There are a handful of helper functions you can use within select():

starts_with(): matches column names that begin with certain characters,
ends_with(): matches column names that end with specific characters, and
contains(): matches column names that contain a sequence of characters.

There aren’t any variables in sw that share a common beginning but there are more than a few with common endings. Let’s get the name column first and then any columns that end with the letter "s".

Selecting the name column and any additional columns ending with "s".

Notes on the Code

L.1 We want to keep the name column (appearing first, at the left) and then all columns with those names ending in the letter "s" (in the order they originally appear).

select(sw, name, ends_with("s"))

# A tibble: 87 × 3
   name                mass species
   <chr>              <dbl> <chr>  
 1 Luke Skywalker        77 Human  
 2 C-3PO                 75 Droid  
 3 R2-D2                 32 Droid  
 4 Darth Vader          136 Human  
 5 Leia Organa           49 Human  
 6 Owen Lars            120 Human  
 7 Beru Whitesun lars    75 Human  
 8 R5-D4                 32 Droid  
 9 Biggs Darklighter     84 Human  
10 Obi-Wan Kenobi        77 Human  
# ℹ 77 more rows

As we expect, we get the column called name first then all of the columns ending with "s", in the order that those last two columns appeared in the original dataset.

Noting that there are a few columns in sw that contain the underscore ("_") character, let’s try using the contains() helper function for quickly selecting those column names (along with name).

Selecting the name column and any additional columns containing an underscore.

Notes on the Code

L.1 The name column comes first. Any column names containing an underscore character will follow (in their original order).

select(sw, name, contains("_"))

# A tibble: 87 × 3
   name               hair_color    birth_year
   <chr>              <chr>              <dbl>
 1 Luke Skywalker     blond               19  
 2 C-3PO              <NA>               112  
 3 R2-D2              <NA>                33  
 4 Darth Vader        none                41.9
 5 Leia Organa        brown               19  
 6 Owen Lars          brown, grey         52  
 7 Beru Whitesun lars brown               47  
 8 R5-D4              <NA>                NA  
 9 Biggs Darklighter  black               24  
10 Obi-Wan Kenobi     auburn, white       57  
# ℹ 77 more rows

Aside from the name column, we see the columns with underscores populating the rest of the tibble printout.

If ever we want to put a few columns near the start and keep other columns in the same order then the everything() select helper function is useful for this.

Selecting three specific columns and then everything() else after that.

Notes on the Code

L.1 Three columns are selected directly by their names, then, all remaining columns come after those in their original order (captured by everything()).

select(sw, name, homeworld, gender, everything())

# A tibble: 87 × 8
   name              homeworld gender height  mass hair_color birth_year species
   <chr>             <chr>     <chr>   <int> <dbl> <chr>           <dbl> <chr>  
 1 Luke Skywalker    Tatooine  mascu…    172    77 blond            19   Human  
 2 C-3PO             Tatooine  mascu…    167    75 <NA>            112   Droid  
 3 R2-D2             Naboo     mascu…     96    32 <NA>             33   Droid  
 4 Darth Vader       Tatooine  mascu…    202   136 none             41.9 Human  
 5 Leia Organa       Alderaan  femin…    150    49 brown            19   Human  
 6 Owen Lars         Tatooine  mascu…    178   120 brown, gr…       52   Human  
 7 Beru Whitesun la… Tatooine  femin…    165    75 brown            47   Human  
 8 R5-D4             Tatooine  mascu…     97    32 <NA>             NA   Droid  
 9 Biggs Darklighter Tatooine  mascu…    183    84 black            24   Human  
10 Obi-Wan Kenobi    Stewjon   mascu…    182    77 auburn, w…       57   Human  
# ℹ 77 more rows

Here we notice that we do get the name, homeworld, and gender columns first, and then every other remaining column (in their preserved order):

These select helper functions are very useful. They save us having to carefully type out every column name, which can be arduous and error-prone if we have a very large number of columns.

2.2.6 mutate: Creating New Variables with Expressions

We can create new columns of data with mutate(). It adds the new variables at the end of the table, so, let’s prepare a narrower version of sw by using select().

Create the sw_small table containing just three columns.

Notes on the Code

L.1 We select() three columns from sw and assign the resulting table to sw_small.
L.3 Executing the newly made variable (sw_small) displays that smaller table in the output.

sw_small <- select(sw, name, height, mass)

sw_small

# A tibble: 87 × 3
   name               height  mass
   <chr>               <int> <dbl>
 1 Luke Skywalker        172    77
 2 C-3PO                 167    75
 3 R2-D2                  96    32
 4 Darth Vader           202   136
 5 Leia Organa           150    49
 6 Owen Lars             178   120
 7 Beru Whitesun lars    165    75
 8 R5-D4                  97    32
 9 Biggs Darklighter     183    84
10 Obi-Wan Kenobi        182    77
# ℹ 77 more rows

Now, let’s take the new sw_small table and create a new column called bmi and calculate the body mass index for all characters in the next code listing. The mathematical expression uses column names to generate a row-wise calculation result. Here we use the right-hand side (RHS) expression: mass / (height / 100)^2. On the left-hand side (LHS) of the expression is the new column bmi.

Add the bmi column to sw_small using the mutate() function.

Notes on the Code

L.1 The mutate() function takes data first (sw_small) and then an expression that creates the new column bmi. The expression uses values from the mass and height columns and places the results in the bmi column.

mutate(sw_small, bmi = mass / (height / 100)^2)

# A tibble: 87 × 4
   name               height  mass   bmi
   <chr>               <int> <dbl> <dbl>
 1 Luke Skywalker        172    77  26.0
 2 C-3PO                 167    75  26.9
 3 R2-D2                  96    32  34.7
 4 Darth Vader           202   136  33.3
 5 Leia Organa           150    49  21.8
 6 Owen Lars             178   120  37.9
 7 Beru Whitesun lars    165    75  27.5
 8 R5-D4                  97    32  34.0
 9 Biggs Darklighter     183    84  25.1
10 Obi-Wan Kenobi        182    77  23.2
# ℹ 77 more rows

We can see that the column bmi is now available to the right of the original columns name, height, and mass. All new columns that are created by mutate() are placed at the end (which is to the right). The bmi column has values in every row (this is a double-type variable, since mass is of the type double).

It turns out that the mutate() function is very powerful. We can have multiple mutations in a single call of mutate(). What’s even more interesting (and useful) is that a variable created in an expression is immediately available for use in a following expression (in the same mutate() call).

Add the bmi column as before and then create bmi_rnd.

Notes on the Code

L.1 This mutate() statement is a bit longer than the last so let's split it over a few lines.
L.2 The data is given first.
L.3 This expression creates the bmi column and its values (just like in the last example).
L.4 The second expression creates the bmi_rnd column, a rounded version of bmi (with no decimal places).

mutate(
  sw_small,
  bmi = mass / (height / 100)^2,
  bmi_rnd = round(bmi, 0)
)

# A tibble: 87 × 5
   name               height  mass   bmi bmi_rnd
   <chr>               <int> <dbl> <dbl>   <dbl>
 1 Luke Skywalker        172    77  26.0      26
 2 C-3PO                 167    75  26.9      27
 3 R2-D2                  96    32  34.7      35
 4 Darth Vader           202   136  33.3      33
 5 Leia Organa           150    49  21.8      22
 6 Owen Lars             178   120  37.9      38
 7 Beru Whitesun lars    165    75  27.5      28
 8 R5-D4                  97    32  34.0      34
 9 Biggs Darklighter     183    84  25.1      25
10 Obi-Wan Kenobi        182    77  23.2      23
# ℹ 77 more rows

While bmi is calculated just as before, there is a second expression to create another column. In that second expression, bmi is used in a rounding statement (rounding to 0 decimal places) to create the bmi_rnd variable.

2.2.7 summarize: Collapsing Many Values Down to a Single Summary

To get summary statistics across a set of groups we can use group_by() in conjunction with summarize(). The summarize() function works a lot like mutate() in that we can provide a new column name and a statement for its calculation. We can use several different aggregation functions inside summarize(): mean(), median(), sd(), min(), max() are the common ones. The summarize() function is typically used right after the group_by() function, which creates grouped data. Several examples will be useful to show us how this works.

Let’s group by species and get the mean of the mass variable for each group. First, we’ll make grouped data in a separate object and then we can generate a summary of the average mass for each species.

Grouping by the species variable and summarizing to get the mean mass.

Notes on the Code

L.1 We are using our original dataset, sw, and grouping it by the species column. That's assigned to the by_species object.
L.3 The by_species object, now a grouped table, is the data for the summarize() statement. For each of group of species we get a row of the average mass per group.

by_species <- group_by(sw, species)

summarize(by_species, avg_mass = mean(mass))

# A tibble: 38 × 2
   species   avg_mass
   <chr>        <dbl>
 1 Aleena          15
 2 Besalisk       102
 3 Cerean          82
 4 Chagrian        NA
 5 Clawdite        55
 6 Droid           NA
 7 Dug             40
 8 Ewok            20
 9 Geonosian       80
10 Gungan          NA
# ℹ 28 more rows

Notice that we get two columns back: one is from the species group and one is from the column we defined in the summarize() statement (avg_mass).

The pattern will hold that we get all grouping columns and all summary columns back. Another observation here is that some values for avg_mass appear as NA. This occurs if any NA values are introduced to mean(), the result will invariably be NA (the behavior holds true for the other aggregation functions as well). To circumvent this, we can exclude NA values in the calculation of the mean by using the na.rm = TRUE option within mean(). Let’s now revise that summarize() statement.

A slight modification to the summary expression to deal with missing (NA) values.

Notes on the Code

L.1 By augmenting the mean() expression used inside summarize() some of the NA values are replaced with non-NA (i.e., numeric) values.

summarize(by_species, avg_mass = mean(mass, na.rm = TRUE))

# A tibble: 38 × 2
   species   avg_mass
   <chr>        <dbl>
 1 Aleena        15  
 2 Besalisk     102  
 3 Cerean        82  
 4 Chagrian     NaN  
 5 Clawdite      55  
 6 Droid         69.8
 7 Dug           40  
 8 Ewok          20  
 9 Geonosian     80  
10 Gungan        74  
# ℹ 28 more rows

Now we see that all values in the avg_mass column are non-NA, however, there is now at least one NaN (not a number) value. This is explainable. The cause of this stems from there not being any non-NA values for that species. The consequence is that a mean couldn’t be calculated, so R returned an NaN.

Let’s create a summary containing two grouping columns (species and gender) and two summary columns (avg_mass and avg_height) to further demonstrate what can be done with the group_by()/summarize() combination.

Grouping by two columns and obtaining a summary with two new columns.

Notes on the Code

L.1 Grouping by two columns: species and gender; each unique combination is a separate group.
L.3 This summarize() statement will give us four columns: two for the two different groups, and two for the summary columns (avg_mass and avg_height).

grouping <- group_by(sw, species, gender)

summarize(
  grouping,
  avg_mass = mean(mass, na.rm = TRUE), 
  avg_height = mean(height, na.rm = TRUE)
)

# A tibble: 42 × 4
# Groups:   species [38]
   species   gender    avg_mass avg_height
   <chr>     <chr>        <dbl>      <dbl>
 1 Aleena    masculine     15           79
 2 Besalisk  masculine    102          198
 3 Cerean    masculine     82          198
 4 Chagrian  masculine    NaN          196
 5 Clawdite  feminine      55          168
 6 Droid     feminine     NaN           96
 7 Droid     masculine     69.8        140
 8 Dug       masculine     40          112
 9 Ewok      masculine     20           88
10 Geonosian masculine     80          183
# ℹ 32 more rows

This yields four columns as we may have expected. The summarize() function has an interface consistent with mutate(), easily allowing for multiple expressions (separated by commas) where the names we provide on the LHS are the names of the new columns.

2.2.8 Bringing All of This Together with the Pipe (`|>`)

Imagine that we want to get a ranking of average BMI values for the human characters across the different home worlds. Using what you know about dplyr, you might write code like this

Bringing it all together with multiple dplyr functions and assigning each step to a different object.

Notes on the Code

L.1 A filter() statement, assigning the output table to humans (a sensible, short name).
L.3 A mutate() statement, assigning the output table to humans_bmi (not a bad name).
L.9 A group_by() statement, assigning the output table to humans_avg_bmi_by_homeworld (kind of a long name!).
L.11 A summarize() statement, assigning the output table to humans_avg_bmi_by_homeworld (an even longer name, but it makes sense).
L.18 An arrange() statement, assigning the output table to... an extremely long and redundant name. I'm getting some name fatigue here.
L.20 Let's type out this terrible name once more why don't we.

humans <- filter(sw, species == "Human")

humans_bmi <-
  mutate(
    humans,
    bmi = mass / (height / 100)^2
  )                  

humans_bmi_by_homeworld <- group_by(humans_bmi, homeworld)

humans_avg_bmi_by_homeworld <-
  summarize(
    humans_bmi_by_homeworld,
    avg_bmi = mean(bmi, na.rm = TRUE)
  )      

humans_avg_bmi_by_homeworld_sorted <-
  arrange(humans_avg_bmi_by_homeworld, desc(avg_bmi))

humans_avg_bmi_by_homeworld_sorted

# A tibble: 16 × 2
   homeworld    avg_bmi
   <chr>          <dbl>
 1 Bestine IV      34.0
 2 Tatooine        28.9
 3 Bespin          25.8
 4 Corellia        25.7
 5 Socorro         25.2
 6 <NA>            23.9
 7 Haruun Kal      23.8
 8 Concord Dawn    23.6
 9 Kamino          23.4
10 Stewjon         23.2
11 Naboo           22.4
12 Alderaan        22.1
13 Serenno         21.5
14 Chandrila      NaN  
15 Coruscant      NaN  
16 Eriadu         NaN

This output is fine but the code is more than a little frustrating to write because we have to give each intermediate table object a name. We’re just going to use it as input to the next statement, so, we don’t care much about it. As ever, naming things is hard and the process of thinking of meaningful and distinct names will invariably slow down our analysis.

There is another way! It involves using the pipe operator (|>). In the next code listing, we can take a similar set of statements as above and rewrite them in a much more succinct manner (and the final table is exactly the same as before).

Bringing it all together with multiple dplyr functions: this time using the pipe operator.

Notes on the Code

L.1 Just one name to think about for the assignment!
L.2 We begin the statement with the data (sw) and pass it into the next function with the pipe (|>).
L.3 The filter() statement is the first transformation step. We don't need to specify the data, filter() already has it thanks to the pipe.
L.4 The filtered data is given to mutate(), because we used |> just before.
L.5 The mutated table is now grouped by homeworld.
L.6 This is the summarization step. We are passing in a grouped table to summarize().
L.7 Our summary table (now with two columns) is now sorted by descending values of avg_bmi with the arrange() function.

bmi_sorted <-
  sw |>
  filter(species == "Human") |>
  mutate(bmi = mass / (height / 100)^2) |>
  group_by(homeworld) |>
  summarize(avg_bmi = mean(bmi, na.rm = TRUE)) |>
  arrange(desc(avg_bmi))

The |> operator, the pipe, makes the code easier to both write and read later on. The focus is now on the transformations. A good way to reason about the pipe operator is to think of |> as meaning “then”. Now we can imagine reading the above code as a series of imperative statements: filter, and then mutate, and then group, and then summarize, and then arrange. Let’s rewrite the above statement as pseudocode to make clear that we have a series of imperative statements that can be read in a left-to-right manner:

Using pseudocode to demonstrate the readability of the code in the previous listing.

Notes on the Code

L.5 This is really the previous code listing written out in prose. It's sometimes helpful to see it written out in this way so we can better understand what we did at a higher level.

# Get a table of average BMI values for humans across
# the different worlds in a single, piped expression;
# The following is written as pseudocode
# (don't run it, just read it)
<assigning to bmi_sorted> <-
  <start with the starwars table> %{then}%
  <keep only the rows where species is “Human”> %{then}%
  <create a new column called bmi using a calculation> %{then}%
  <create a group for each unique value in homeworld> %{then}%
  <for each group create a summary table (one col: avg_bmi)> %{then}%
  <order the rows by decreasing avg_bmi value>

<view bmi_sorted>

This pseudocode should roughly map to what we should be thinking when writing out dplyr statements. With a bit of practice, the process of transforming tabular data starts to become natural and the idea of a grammar of data manipulation begins to make more sense.

2.3 Getting Back to Diamonds and Combining Data Transformation with Plotting

Your data may not have exactly what you’d like to plot. It’s a reality that’s all too common but we can do something about since we learned the basics of dplyr. We will make some transformations to the dmd table to make a plot we otherwise couldn’t before. Before showing the actual transformation statements, let’s have a look at the plan and rationale for the work.

Suppose we would like to have a new measure that provides the value of a diamond by weight. This is a simple calculation that divides the price of a diamond by its number of carats (price / carats). The new cost per carat variable (cpc) can be easily added to the dmd table by using dplyr’s mutate() function by taking dmd and piping it to mutate(): dmd |> mutate(cpc = price / carats).

Now that we have the cpc variable, taken as a better measure of the worth of a diamond based on its qualities, we can divide the entire set of diamonds into two price classes: those with higher cpc than the median cpc value, and those that are lower. We won’t describe in detail how to get the median cpc value, so we’ll accept that it is around $3,460 per carat. Within our second mutate() statement, we will use ifelse() to get a new price_class variable. The next code listing takes the dmd dataset and applies both mutate() statements to dmd and assigns the result to dmd_mod.

Modifying dmd to obtain two new columns: cpc and price_class.

Notes on the Code

L.3 This mutate() statement creates the cpc column...
L.4 ...while this mutate() call makes the price_class column.

dmd_mod <- 
  dmd |>
  mutate(cpc = price / carats) |>
  mutate(price_class = ifelse(
    cpc >= 3460, "Above Median", "Below Median")
  )

We haven’t yet used mutate() with ifelse() so let’s examine this more closely. The ifelse() statement used here checks every row of the table for whether cpc is greater than or equal to 3460. For each row where that statement is true, the value in the new price_class column will be given "Above Median". If not true, then the value will be "Below Median".

For our third and final mutate(), we will suppose that diamonds with cut and clarity labeled as The Best should be high-quality diamonds, and thus fetch higher prices. Another ifelse() is to be used within the mutate() statement, creating a new variable called quality. The following listing augments the earlier code with a third mutate().

A third and final mutate() statement to add the quality column to our modified dataset (dmd_mod).

Notes on the Code

L.8 The third mutate() statement (that creates the quality column) is pretty long so it's broken across a few lines for better readability

dmd_mod <- 
  dmd |>
  mutate(cpc = price / carats) |>
  mutate(price_class = ifelse(
    cpc >= 3460, "Above Median", "Below Median"
    )
  ) |>
  mutate(quality = ifelse(
    cut == "The Best" & clarity == "The Best",
    "Top Drawer", "The Rest"
    )
  )

In the code listing, we are using ifelse() within the third mutate() statement to check for the dual condition of both cut and clarity being equal to "The Best". Those diamonds for which the statement is true will have a quality label that is "Top Drawer". Otherwise, all other diamonds will get a label of "The Rest".

Once we have modified dmd and assigned the results to a new object called dmd_mod, we can make a different type of plot that uses the newly-created variables of price_class and quality. Faceting will be done on the price vs. carats plot, using the new variables in the facet_wrap() statement to create a 2-by-2 plot that compares data points by two quality categories and two price categories.

Modifying dmd to add three new columns, and, plotting dmd_mod with the new variables.

Notes on the Code

L.1 This dplyr code adds three columns: one for each mutate() call.
L.14 This ggplot code uses the new variables when faceting (price_class and quality).

dmd_mod <-
  dmd |>
  mutate(cpc = price / carats) |>
  mutate(price_class = ifelse(
    cpc >= 3460, "Above Median", "Below Median"
    )
  ) |>
  mutate(quality = ifelse(
    cut == "The Best" & clarity == "The Best",
    "Top Drawer", "The Rest"
    )
  )

ggplot(dmd_mod, aes(x = carats, y = price)) +
  geom_point() +
  facet_wrap(
    facets = vars(price_class, quality),
    labeller = label_both
  ) +
  labs(x = "Carats", y = "Price")

Figure 2.1: Modifying data with **dplyr**; plotting it with **ggplot**.

Moving between data transformation activities and plotting with ggplot like this is often valuable for better expressing the data you have to an audience, or, for exploring the data and getting insightful views that were otherwise hidden. Getting to a stage where you can rapidly translate your analysis/visualization needs to raw dplyr & ggplot code is worth the effort and understandably takes some practice.

2.4 Summary

We learned how to assign values to variables using the <- operator
Transforming a dataset can be done with these five dplyr functions: filter(), arrange(), select(), mutate(), and summarize() (with group_by())
The pipe operator (|>) (which was used with the dplyr functions) improved the composition and readability of our R statements
Transforming data with dplyr and visualizing data with ggplot is often something that’s done hand in hand
We can use the tibble() and tribble() functions to create our own data tables

2.5 Exercises

When you are creating a new R Markdown document for analysis, what set of R statements is likely required before any analysis?
What’s the difference between = and ==? Where should these used (and not used)?
What do you expect to be returned with a filter statement like filter(sw, species == "kangaroo")? There are no kangaroos in Star Wars, so what might happen? (Try it out.)
When the tibble output reads # A tibble: 3 x 8, what do the 3 and 8 correspond to?
What is the difference between these two filter() statements: (1) filter(sw, species == "Human" & mass > 100) and (2) filter(sw, species == "Human" | mass > 100)? Which one would be expected to yield more rows?
You may have noticed that in the statement filter(sw, species %in% c("Droid", "Human")), the word species is not in quotes. Why do you think this is?
When ordering rows with the statement arrange(sw, gender, hair_color), what is the significance of having gender first and hair_color second? What would you expect with a statement like arrange(sw, desc(gender), hair_color)?
Selecting columns is done with select() and select helper functions can be used within that. What selection of columns would occur if you used the statement select(table, name, weight, height, ends_with("t")) on the table with columns name, height, ticket, weight, and city?
How do we drop columns from a table? Say we had a table with the columns name, height, ticket, weight, city and we don’t want the ticket column. How would you write the select() statement?
How do you drop multiple columns from a table? What select() statement could be used on a table with the columns name, height, ticket, weight, and city that would result in leaving in the columns name, height, and weight?
What does the everything() helper function do inside of a select() statement?
Recall that the sw_small table was a narrower version of sw table, with only the name, height, and mass variables. What would result from the statement mutate(sw_small, less_mass = mass - 10)?
Taking once again the sw_small table, do the following two statements result in the same values within the bmi column? Here are the statements: (1) mutate(sw_small, hgt2 = (height / 100)^2, bmi = mass / hgt2), (2) mutate(sw_small, bmi = mass / (height / 100)^2).
Write a mutate() statement with the sw_small table that creates the log_mass variable, which is a calculation of the log of the mass of each character.
What output table would you expect from the statement group_by(sw_small, species)?
What does the summarize() function do?
How many columns are available in the table generated by the statement summarize(group_by(sw, species), avg_mass = mean(mass))?
What’s one helpful thing that the pipe (|>) does?
How would you rewrite the statement in Q17 with the use of the |> operator?
When using the piped statements of sw |> filter(species == "Human") |> mutate(bmi = mass / (height / 100)^2) |> group_by(homeworld), what extra information do you get in the tibble printout?

2.6 Possible Answers

One or more library() statements should ideally come first to load in package functions. For this lesson, the statements could be library(tidyverse) and library(dspatterns) (to use the datasets and functions suitable for the book content).
The = symbol is an assignment operator, whereas == is a comparison operator. Like the name implies, = is used for assignment (e.g., inside function calls where you’d like to use the argument name and the bound value) and == is used for comparisons (e.g., in expressions that the filter() function uses as a filtering condition).
You get an empty tibble! Dimensions provided are 0 rows by 8 columns. No error: this provides usable output to pass on to other functions.
The 3 corresponds to rows and the 8 represents columns (or variables). It’s customary to represent table dimensions with rows first and then the columns.
The first filter() statement using & will return only those rows describing a "Human" with a mass greater than 100. This is more restrictive than the second filter() statement using |, which returns those rows describing a "Human" or any rows where mass is greater than 100 (not exclusive to Humans). The second statement returns more rows than the first statement.
The word species is not in enclosed in quotes because it represents a column. The dplyr convention is that column names are not surrounded by quotation marks.
The sequence of column names in an arrange() statement directs dplyr to prioritize sorting by the columns provided first. With the statement arrange(sw, desc(gender), hair_color) sorting is still prioritized by gender except that’s a descending sort and the secondary sorting by hair_color is an ascending sort.
The resulting selection of columns would be name, weight, height, and ticket.
Columns are dropped in a select statement by preceding the column name with a hyphen (e.g., select(<table>, -<column_name>)). For the table where we’d want to exclude the ticket column, a suitable statement would be select(<table>, -ticket). Alternatively, we can also write select(<table>, name, height, weight, city), however, this approach becomes less practical when the table has significantly more variables.
Multiple columns can be dropped by prefixing a hyphen to all columns we want to drop. To end up with the columns name, height, and weight from the table with the columns name, height, ticket, weight, and city we can write select(table, -ticket, -city). There is also another way. We can combine the columns we want to exclude in c() and prefix that with a hyphen, as in: select(table, -c(ticket, city)).
The everything() helper function, when inside of a select() statement, provides the remainder of columns that weren’t used explicitly. It can be thought of as “the rest of the columns in the table that I didn’t call out by name”.
The statement mutate(sw_small, less_mass = mass - 10) would add a new variable called less_mass where the value is 10 less than that found in mass (except when mass is NA, whereby less_mass also becomes NA).
The values within the bmi column for both tables will be the same. Notably, with the first statement, we’ll end up with a table that has a hgt_2 column and we probably won’t find that variable very valuable.
The mutate() statement could very well be: mutate(sw_small, log_mass = log(mass)).
With the statement group_by(sw_small, species) you get the same table as sw_small! This is because we just defined some grouping (on species) but we didn’t do anything yet with the groups.
The summarize() function helps us apply one or more aggregation functions (like sum(), mean(), min(), max(), etc.) on grouped data. We get back one summary row per group.
When using the statement summarize(group_by(sw, species), avg_mass = mean(mass)), we get back a summary table with two columns: species (the grouping column) and avg_mass (the column we define in the summarize() function call).
The pipe (|>) is helpful because: (1) it eliminates to need to create intermediate object names for functions that work well with the pipe, and (2) it can be more readable than nested function calls.
The Q17 statement can be rewritten with the pipe as: sw |> group_by(species) |> summarize(avg_mass = mean(mass)).
In the tibble printout, we get information on the groups we made from the group_by() statement. It provides the number of groups and lists the group names alongside that.

2.1 Programming Basics and Package Installation

2.1.1 The Basics of Assignment

2.2 Using dplyr to Transform Data

2.2.1 The Main dplyr Functions

2.2.2 The sw Dataset and the Elements of a Table

2.2.3 filter: Picking Observations by their Values

2.2.4 arrange: Reordering Rows

2.2.5 select: Picking Variables by their Names

2.2.6 mutate: Creating New Variables with Expressions

2.2.7 summarize: Collapsing Many Values Down to a Single Summary

2.2.8 Bringing All of This Together with the Pipe (|>)

2.3 Getting Back to Diamonds and Combining Data Transformation with Plotting

2.4 Summary

2.5 Exercises

2.6 Possible Answers

2.2.2 The `sw` Dataset and the Elements of a Table

2.2.8 Bringing All of This Together with the Pipe (`|>`)