1  Introduction to Data Frames and Tibbles

When you’re working with data, quite often it’ll be structured data in the form of tables. Tabular data is usually a joy to work with: there’s regularity in its shape (rectangular!), the rules for working with tables are widely known, and we can also take proactive steps to shape the data in a way that’s even more convenient for data analysis.

Tables come in all shapes and sizes. Some are very small, filling maybe a screen or two with data. Some are in between, requiring querying and summarizing to really make sense of the data. Finally, some are huge, and those require more efficient tools for transforming, querying, and summarizing the data. Here, we’ll start small and learn how to explore tables of the small variety. In R, tabular data is often stored in data structures known as data frames. The data frame is part of base R, so you don’t need any extra packages to use one.

We also have another variety of tables which are called tibbles. A tibble is a special implementation of a data frame that prints its data differently and behaves in a more consistent manner than data frames. This flavor of data frame is used quite a lot throughout this book and we’ll make the case that you should also make them yours.

In this chapter we’ll take a look at a number of great ways to work with tables (whether they are data frames or tibbles). It’s really hard to read all of the data in a large table so we can make use of a plethora of functions that provide different views of a table of interest. We will explore a handful of different packages that expressly deal with data frame exploration and summarization for better understanding.

1.1 Several Quick Ways to Initially Explore a Dataset

Let’s look at some data from the episodes dataset in the dspatterns package. It serves as a highly quantitative episode guide. Before we start, we’ll load in the necessary packages, namely the tidyverse and dspatterns packages (the latter will load the bakeoff package, and that’s where the dataset originates).

Loading the tidyverse and dspatterns packages.

Notes on the Code
  • Line 1: Loading the tidyverse package like this actually auto-loads all core tidyverse packages (this is almost everything we'll need for most analyses!).
  • Line 2: The dspatterns package is this book's namesake package. It has the datasets we need for all of the examples.
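
The loading step that these notes describe comes down to two library() calls. A minimal sketch, assuming both packages are already installed on your system:

```r
library(tidyverse)   # attaches the core tidyverse packages
library(dspatterns)  # provides the example datasets
```

If either call fails with a "there is no package called ..." error, that package needs to be installed first.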

Now let’s look at the episodes dataset by printing it out:

episodes
# A tibble: 94 × 10
   series episode bakers_appeared bakers_out bakers_remaining star_bakers
    <dbl>   <dbl>           <int>      <int>            <int>       <int>
 1      1       1              10          2                8           0
 2      1       2               8          2                6           0
 3      1       3               6          1                5           0
 4      1       4               5          1                4           0
 5      1       5               4          1                3           0
 6      1       6               3          0                3           0
 7      2       1              12          1               11           1
 8      2       2              11          1               10           1
 9      2       3              10          2                8           1
10      2       4               8          1                7           2
# ℹ 84 more rows
# ℹ 4 more variables: technical_winners <int>, sb_name <chr>,
#   winner_name <chr>, eliminated <chr>

It provides a lot of information, but it doesn’t overload the console. Because the tidyverse packages are loaded, the tibble version of the dataset is what’s provided during printing. At any rate, we can see some of the data (10 rows), the dimensions of the table (94 rows by 10 columns), and a footer noting what isn’t shown (in some versions of tibble this includes a useful tip to use print(n = ...) to see more rows). Let’s try that:

print(episodes, n = 20)
# A tibble: 94 × 10
   series episode bakers_appeared bakers_out bakers_remaining star_bakers
    <dbl>   <dbl>           <int>      <int>            <int>       <int>
 1      1       1              10          2                8           0
 2      1       2               8          2                6           0
 3      1       3               6          1                5           0
 4      1       4               5          1                4           0
 5      1       5               4          1                3           0
 6      1       6               3          0                3           0
 7      2       1              12          1               11           1
 8      2       2              11          1               10           1
 9      2       3              10          2                8           1
10      2       4               8          1                7           2
11      2       5               7          2                5           1
12      2       6               5          1                4           1
13      2       7               4          1                3           0
14      2       8               3          0                3           0
15      3       1              12          1               11           1
16      3       2              11          1               10           1
17      3       3              10          1                9           1
18      3       4               9          1                8           1
19      3       5               8          1                7           1
20      3       6               7          0                7           1
# ℹ 74 more rows
# ℹ 4 more variables: technical_winners <int>, sb_name <chr>,
#   winner_name <chr>, eliminated <chr>

Being able to print an exact number of rows with print() is sometimes useful if you have a generally small table and you need to see more of it.

Warning

When using print() to specify the number of rows displayed, we have to be sure that the table object is a tibble. This won’t work with plain data frames: a call such as print(mtcars, n = 5) will not limit the output to five rows.
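
One way around this limitation is to convert the data frame to a tibble before printing. The sketch below assumes the tibble package is installed (it is attached along with the tidyverse). Note that as_tibble() drops row names unless they are moved into a column through its rownames argument; the column name model is just an illustrative choice.

```r
library(tibble)

# mtcars is a plain data frame that ships with R.
# Convert it to a tibble, keeping the row names in a new column.
mtcars_tbl <- as_tibble(mtcars, rownames = "model")

# With a tibble, the n argument limits the number of printed rows.
print(mtcars_tbl, n = 5)
```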

If you wanted just a few rows, you could use the head() function like this:

head(episodes)
# A tibble: 6 × 10
  series episode bakers_appeared bakers_out bakers_remaining star_bakers
   <dbl>   <dbl>           <int>      <int>            <int>       <int>
1      1       1              10          2                8           0
2      1       2               8          2                6           0
3      1       3               6          1                5           0
4      1       4               5          1                4           0
5      1       5               4          1                3           0
6      1       6               3          0                3           0
# ℹ 4 more variables: technical_winners <int>, sb_name <chr>,
#   winner_name <chr>, eliminated <chr>

If you wanted smaller, more focused output on what’s in the table, the names() and dim() functions return a vector of column names and a vector with the dimensions of the table, respectively.

names(episodes)
 [1] "series"            "episode"           "bakers_appeared"  
 [4] "bakers_out"        "bakers_remaining"  "star_bakers"      
 [7] "technical_winners" "sb_name"           "winner_name"      
[10] "eliminated"       
dim(episodes)
[1] 94 10
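
The results of names() and dim() are ordinary vectors, so they can be indexed or compared like any other. A small sketch using the built-in mtcars data frame (chosen here only so that the example runs without extra packages):

```r
d <- dim(mtcars)
d      # a two-element integer vector
d[1]   # same value as nrow(mtcars)
d[2]   # same value as ncol(mtcars)

# For a data frame, colnames() returns the same vector as names().
identical(names(mtcars), colnames(mtcars))
```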

Take note that the convention for table dimensions in R is first the number of rows (94) and then the number of variables or columns (10). If you’re using the RStudio IDE, then the special View() function will put a table into a spreadsheet-like view:

View(episodes)

Please note that if you’re using Quarto or R Markdown, having View() in a chunk is generally not a good idea if you’re intending on rendering the document for distribution (since the effect of View() is to provide a secondary ‘view’ of your data). It’s best to use it in only an interactive context.
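
One possible way to keep a View() call from running outside an interactive session is to guard it with base R’s interactive(), which returns TRUE only when R is being used interactively. This is a sketch of one approach rather than a required pattern, and it uses the built-in mtcars data frame as a stand-in:

```r
# Only open the viewer when R is running interactively (e.g., at the console).
# In a non-interactive R process the call is skipped entirely.
if (interactive()) {
  View(mtcars)
}
```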

Sometimes you may want to see just a small portion of your input data. We can use gt_preview() from the gt package to get the first x rows of data and the last y rows of data (these parameters can be set by the top_n and bottom_n arguments of gt_preview()). Let’s try it with the bakers dataset.

gt_preview(bakers)
  | series | baker | star_baker | technical_winner | technical_top3 | technical_bottom | technical_highest | technical_lowest | technical_median | series_winner | series_runner_up | total_episodes_appeared | first_date_appeared | last_date_appeared | first_date_us | last_date_us | percent_episodes_appeared | percent_technical_top3 | baker_full | age | occupation | hometown | baker_last | baker_first
1 | 1 | Annetha | 0 | 0 | 1 | 1 | 2 | 7 | 4.5 | 0 | 0 | 2 | 2010-08-17 | 2010-08-24 | NA | NA | 33.33333 | 50.00000 | Annetha Mills | 30 | Midwife | Essex | Mills | Annetha
2 | 1 | David | 0 | 0 | 1 | 3 | 3 | 8 | 4.5 | 0 | 0 | 4 | 2010-08-17 | 2010-09-07 | NA | NA | 66.66667 | 25.00000 | David Chambers | 31 | Entrepreneur | Milton Keynes | Chambers | David
3 | 1 | Edd | 0 | 2 | 4 | 1 | 1 | 6 | 2.0 | 1 | 0 | 6 | 2010-08-17 | 2010-09-21 | NA | NA | 100.00000 | 66.66667 | Edward "Edd" Kimber | 24 | Debt collector for Yorkshire Bank | Bradford | Kimber | Edward
4 | 1 | Jasminder | 0 | 0 | 2 | 2 | 2 | 5 | 3.0 | 0 | 0 | 5 | 2010-08-17 | 2010-09-14 | NA | NA | 83.33333 | 40.00000 | Jasminder Randhawa | 45 | Assistant Credit Control Manager | Birmingham | Randhawa | Jasminder
5 | 1 | Jonathan | 0 | 1 | 1 | 2 | 1 | 9 | 6.0 | 0 | 0 | 3 | 2010-08-17 | 2010-08-31 | NA | NA | 50.00000 | 33.33333 | Jonathan Shepherd | 25 | Research Analyst | St Albans | Shepherd | Jonathan
6..119
120 | 10 | Steph | 0 | 1 | 6 | 4 | 1 | 10 | 3.0 | 0 | 0 | 10 | NA | NA | NA | NA | 100.00000 | 60.00000 | Steph Blackwell | 28 | Shop assistant | Chester | Blackwell | Steph

What you get by default is the first five rows and the last row of the bakers dataset. We can see that what’s not shown are rows 6 to 119 (it’s shown as 6..119 in the table stub).

If you wanted to show the first and last 10 rows of the bakers dataset, that’s not a problem. It can be accomplished with the top_n and bottom_n arguments, like this:

gt_preview(bakers, top_n = 10, bottom_n = 10)
  | series | baker | star_baker | technical_winner | technical_top3 | technical_bottom | technical_highest | technical_lowest | technical_median | series_winner | series_runner_up | total_episodes_appeared | first_date_appeared | last_date_appeared | first_date_us | last_date_us | percent_episodes_appeared | percent_technical_top3 | baker_full | age | occupation | hometown | baker_last | baker_first
1 | 1 | Annetha | 0 | 0 | 1 | 1 | 2 | 7 | 4.5 | 0 | 0 | 2 | 2010-08-17 | 2010-08-24 | NA | NA | 33.33333 | 50.00000 | Annetha Mills | 30 | Midwife | Essex | Mills | Annetha
2 | 1 | David | 0 | 0 | 1 | 3 | 3 | 8 | 4.5 | 0 | 0 | 4 | 2010-08-17 | 2010-09-07 | NA | NA | 66.66667 | 25.00000 | David Chambers | 31 | Entrepreneur | Milton Keynes | Chambers | David
3 | 1 | Edd | 0 | 2 | 4 | 1 | 1 | 6 | 2.0 | 1 | 0 | 6 | 2010-08-17 | 2010-09-21 | NA | NA | 100.00000 | 66.66667 | Edward "Edd" Kimber | 24 | Debt collector for Yorkshire Bank | Bradford | Kimber | Edward
4 | 1 | Jasminder | 0 | 0 | 2 | 2 | 2 | 5 | 3.0 | 0 | 0 | 5 | 2010-08-17 | 2010-09-14 | NA | NA | 83.33333 | 40.00000 | Jasminder Randhawa | 45 | Assistant Credit Control Manager | Birmingham | Randhawa | Jasminder
5 | 1 | Jonathan | 0 | 1 | 1 | 2 | 1 | 9 | 6.0 | 0 | 0 | 3 | 2010-08-17 | 2010-08-31 | NA | NA | 50.00000 | 33.33333 | Jonathan Shepherd | 25 | Research Analyst | St Albans | Shepherd | Jonathan
6 | 1 | Lea | 0 | 0 | 0 | 1 | 10 | 10 | 10.0 | 0 | 0 | 1 | 2010-08-17 | 2010-08-17 | NA | NA | 16.66667 | 0.00000 | Lea Harris | 51 | Retired | Midlothian, Scotland | Harris | Lea
7 | 1 | Louise | 0 | 0 | 0 | 1 | 4 | 4 | 4.0 | 0 | 0 | 2 | 2010-08-17 | 2010-08-24 | NA | NA | 33.33333 | 0.00000 | Louise Brimelow | 44 | Police Officer | Manchester | Brimelow | Louise
8 | 1 | Mark | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 0 | 1 | 2010-08-17 | 2010-08-17 | NA | NA | 16.66667 | 0.00000 | Mark Whithers | 48 | Bus Driver | South Wales | Whithers | Mark
9 | 1 | Miranda | 0 | 2 | 4 | 1 | 1 | 8 | 3.0 | 0 | 0 | 6 | 2010-08-17 | 2010-09-21 | NA | NA | 100.00000 | 66.66667 | Miranda Gore Browne | 37 | Food buyer for Marks & Spencer | Midhurst, West Sussex | Browne | Miranda
10 | 1 | Ruth | 0 | 0 | 2 | 2 | 2 | 5 | 3.5 | 0 | 0 | 6 | 2010-08-17 | 2010-09-21 | NA | NA | 100.00000 | 33.33333 | Ruth Clemens | 31 | Retail manager/Housewife | Poynton, Cheshire | Clemens | Ruth
11..110
111 | 10 | David | 0 | 2 | 8 | 2 | 1 | 10 | 2.0 | 1 | 0 | 10 | NA | NA | NA | NA | 100.00000 | 80.00000 | David Atherton | 36 | International health adviser | Whitby | Atherton | David
112 | 10 | Helena | 0 | 1 | 1 | 4 | 1 | 12 | 9.0 | 0 | 0 | 5 | NA | NA | NA | NA | 50.00000 | 20.00000 | Helena Garcia | 40 | Online project manager | Leeds | Garcia | Helena
113 | 10 | Henry | 0 | 2 | 5 | 3 | 1 | 6 | 3.0 | 0 | 0 | 8 | NA | NA | NA | NA | 80.00000 | 62.50000 | Henry Bird | 20 | Student | Durham | Bird | Henry
114 | 10 | Jamie | 0 | 0 | 0 | 2 | 11 | 13 | 12.0 | 0 | 0 | 2 | NA | NA | NA | NA | 20.00000 | 0.00000 | Jamie Finn | 20 | Part-time waiter | Surrey | Finn | Jamie
115 | 10 | Michael | 0 | 0 | 0 | 7 | 4 | 11 | 6.0 | 0 | 0 | 7 | NA | NA | NA | NA | 70.00000 | 0.00000 | Michael Chakraverty | 26 | Theatre manager/fitness instructor | Stratford-upon-Avon | Chakraverty | Michael
116 | 10 | Michelle | 0 | 0 | 0 | 5 | 5 | 8 | 6.0 | 0 | 0 | 5 | NA | NA | NA | NA | 50.00000 | 0.00000 | Michelle Evans-Fecci | 35 | Print shop administrator | Tenby, Wales | Evans-Fecci | Michelle
117 | 10 | Phil | 0 | 0 | 1 | 3 | 3 | 10 | 7.0 | 0 | 0 | 4 | NA | NA | NA | NA | 40.00000 | 25.00000 | Phil Thorne | 56 | HGV driver | Rainham | Thorne | Phil
118 | 10 | Priya | 0 | 0 | 1 | 5 | 2 | 10 | 7.0 | 0 | 0 | 6 | NA | NA | NA | NA | 60.00000 | 16.66667 | Priya O'Shea | 34 | Marketing consultant | Leicester | O'Shea | Priya
119 | 10 | Rosie | 0 | 2 | 4 | 5 | 1 | 9 | 4.0 | 0 | 0 | 9 | NA | NA | NA | NA | 90.00000 | 44.44444 | Rosie Brandreth-Poynter | 28 | Veterinary surgeon | Somerset | Brandreth-Poynter | Rosie
120 | 10 | Steph | 0 | 1 | 6 | 4 | 1 | 10 | 3.0 | 0 | 0 | 10 | NA | NA | NA | NA | 100.00000 | 60.00000 | Steph Blackwell | 28 | Shop assistant | Chester | Blackwell | Steph

The gt_preview() function is relatively simple, but it comes in handy if you want a nicer display of the head and tail of a dataset.
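
For a quick, console-only look at both ends of a table, a rough base R analogue is to bind together the results of head() and tail(). This is not a replacement for gt_preview() (there is no formatted output and no marker for the skipped rows); it is only a sketch, using the built-in mtcars data frame:

```r
# First five rows and the last row, mirroring the gt_preview() defaults.
top <- head(mtcars, n = 5)
bottom <- tail(mtcars, n = 1)

# Stack the two pieces into a single small data frame.
preview <- rbind(top, bottom)
preview
```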

1.2 Using glimpse() to Go Sideways

While inspecting rows of your raw data isn’t always the best approach, it can be useful for quickly understanding how the different variables fit together. The glimpse() function (accessible from the dplyr package) allows you to have a look at the first few records of a dataset. This is somewhat like the head() function seen earlier but turned sideways:

glimpse(episodes)
Rows: 94
Columns: 10
$ series            <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3…
$ episode           <dbl> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4…
$ bakers_appeared   <int> 10, 8, 6, 5, 4, 3, 12, 11, 10, 8, 7, 5, 4, 3, 12, 11…
$ bakers_out        <int> 2, 2, 1, 1, 1, 0, 1, 1, 2, 1, 2, 1, 1, 0, 1, 1, 1, 1…
$ bakers_remaining  <int> 8, 6, 5, 4, 3, 3, 11, 10, 8, 7, 5, 4, 3, 3, 11, 10, …
$ star_bakers       <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 1, 1, 0, 0, 1, 1, 1, 1…
$ technical_winners <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ sb_name           <chr> NA, NA, NA, NA, NA, NA, "Holly", "Jason", "Yasmin", …
$ winner_name       <chr> NA, NA, NA, NA, NA, "Edd", NA, NA, NA, NA, NA, NA, N…
$ eliminated        <chr> "Lea, Mark", "Annetha, Louise", "Jonathan", "David",…

Unlike the tibble view (with head() or not), you get to see all of the columns in the data table. The interesting thing about glimpse() is that it invisibly returns the data that’s given to it. Because of that, you can safely have one or several glimpse() calls in a data transformation pipeline, and each of those will print the state of the data at a different juncture.

episodes |>
  glimpse() |>
  select(series, episode, winner_name) |>
  filter(!is.na(winner_name)) |>
  glimpse()
Rows: 94
Columns: 10
$ series            <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3…
$ episode           <dbl> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4…
$ bakers_appeared   <int> 10, 8, 6, 5, 4, 3, 12, 11, 10, 8, 7, 5, 4, 3, 12, 11…
$ bakers_out        <int> 2, 2, 1, 1, 1, 0, 1, 1, 2, 1, 2, 1, 1, 0, 1, 1, 1, 1…
$ bakers_remaining  <int> 8, 6, 5, 4, 3, 3, 11, 10, 8, 7, 5, 4, 3, 3, 11, 10, …
$ star_bakers       <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 1, 1, 0, 0, 1, 1, 1, 1…
$ technical_winners <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ sb_name           <chr> NA, NA, NA, NA, NA, NA, "Holly", "Jason", "Yasmin", …
$ winner_name       <chr> NA, NA, NA, NA, NA, "Edd", NA, NA, NA, NA, NA, NA, N…
$ eliminated        <chr> "Lea, Mark", "Annetha, Louise", "Jonathan", "David",…
Rows: 10
Columns: 3
$ series      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ episode     <dbl> 6, 8, 10, 10, 10, 10, 10, 10, 10, 10
$ winner_name <chr> "Edd", "Joanne", "John", "Frances", "Nancy", "Nadiya", "Ca…

As can be seen above, the original dataset was printed with glimpse() and then passed on to select() and filter() just before a final glimpse() call (to see the transformed data). The result is two glimpse() outputs stacked atop each other.
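
The invisible return value can be checked directly. A small sketch, assuming dplyr is installed and using the built-in mtcars data frame as a stand-in for episodes:

```r
library(dplyr)

# glimpse() prints its overview as a side effect and returns its input
# invisibly, so the result can be assigned or piped onward.
returned <- glimpse(mtcars)

# The returned object is the same data frame that went in.
identical(returned, mtcars)
```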

1.3 Getting Data Summaries

Something else that’s very useful during the exploration phase of data work is the summary() function. It produces a separate summary for each column of the table.

summary(episodes)
     series          episode       bakers_appeared    bakers_out    
 Min.   : 1.000   Min.   : 1.000   Min.   : 3.000   Min.   :0.0000  
 1st Qu.: 3.250   1st Qu.: 3.000   1st Qu.: 5.000   1st Qu.:1.0000  
 Median : 6.000   Median : 5.000   Median : 7.000   Median :1.0000  
 Mean   : 5.766   Mean   : 5.287   Mean   : 7.553   Mean   :0.9468  
 3rd Qu.: 8.000   3rd Qu.: 8.000   3rd Qu.:10.000   3rd Qu.:1.0000  
 Max.   :10.000   Max.   :10.000   Max.   :13.000   Max.   :2.0000  
 bakers_remaining  star_bakers     technical_winners   sb_name         
 Min.   : 3.000   Min.   :0.0000   Min.   :0.0000    Length:94         
 1st Qu.: 4.000   1st Qu.:1.0000   1st Qu.:1.0000    Class :character  
 Median : 6.500   Median :1.0000   Median :1.0000    Mode  :character  
 Mean   : 6.606   Mean   :0.8404   Mean   :0.9894                      
 3rd Qu.: 9.000   3rd Qu.:1.0000   3rd Qu.:1.0000                      
 Max.   :12.000   Max.   :2.0000   Max.   :1.0000                      
 winner_name         eliminated       
 Length:94          Length:94         
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      

For columns that are numeric, the summary() function automatically calculates the following summary statistics for each column of the table:

  • Min: The minimum value
  • 1st Qu: The first quartile value (25th percentile)
  • Median: The median value
  • Mean: The arithmetic mean
  • 3rd Qu: The third quartile value (75th percentile)
  • Max: The maximum value

There are a few character columns in the episodes dataset (e.g., sb_name) and summary() doesn’t do much with those other than state that they are indeed of the character class; it doesn’t count their NA values either (sb_name has missing values, yet none are reported above). If a numeric column were to contain any NA values, summary() would report how many for that column.
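
To see how missing values show up for a numeric column, summary() can be called on a single vector. The sketch below uses the Ozone column of the built-in airquality data frame, which contains missing values (this dataset is not one of the chapter’s examples and is used only for illustration):

```r
ozone_summary <- summary(airquality$Ozone)
ozone_summary

# The NA count is reported as an extra element named "NA's".
ozone_summary[["NA's"]]

# It matches a direct count of the missing values.
sum(is.na(airquality$Ozone))
```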

For a more comprehensive look at a dataset, the skim() function from the skimr package offers a report that is broken down by variable type. Using skim() with the episodes dataset from bakeoff will give us an overall data summary, information for the character variables (sb_name, winner_name, and eliminated) such as n_missing and complete_rate, and summary statistics for the numeric variables.

skim(episodes)
── Data Summary ────────────────────────
                           Values  
Name                       episodes
Number of rows             94      
Number of columns          10      
_______________________            
Column type frequency:             
  character                3       
  numeric                  7       
________________________           
Group variables            None    

── Variable type: character ───────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 sb_name              16         0.830   3  12     0       47          0
2 winner_name          84         0.106   3   7     0       10          0
3 eliminated           13         0.862   3  16     0       76          0

── Variable type: numeric ─────────────────────────────────────────────────────────
  skim_variable     n_missing complete_rate  mean    sd p0  p25 p50 p75 p100 hist 
1 series                    0             1 5.77  2.77   1 3.25 6     8   10 ▆▇▇▇▇
2 episode                   0             1 5.29  2.83   1 3    5     8   10 ▇▇▇▇▆
3 bakers_appeared           0             1 7.55  2.97   3 5    7    10   13 ▇▅▅▅▃
4 bakers_out                0             1 0.947 0.472  0 1    1     1    2 ▂▁▇▁▁
5 bakers_remaining          0             1 6.61  2.80   3 4    6.5   9   12 ▇▅▅▅▃
6 star_bakers               0             1 0.840 0.396  0 1    1     1    2 ▂▁▇▁▁
7 technical_winners         0             1 0.989 0.103  0 1    1     1    1 ▁▁▁▁▇

The numeric variable types really get the deluxe treatment here with a statistical summary consisting of the mean, the standard deviation (sd), percentile values, and little histograms! It doesn’t take very long at all to get such a summary so it’s worth it every time for new and unfamiliar datasets.
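
The n_missing and complete_rate columns are straightforward to reproduce by hand, which can help when interpreting them: complete_rate is the fraction of values in a column that are not missing. A base R sketch on the built-in airquality data frame (used here only so the example runs without skimr):

```r
# Count of missing values per column.
n_missing <- colSums(is.na(airquality))

# Fraction of non-missing values per column.
complete_rate <- colMeans(!is.na(airquality))

# Show both side by side, one row per column of airquality.
data.frame(n_missing, complete_rate)
```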

1.4 Rolling Our Own Tabular Data

Creating your own tabular data can be really useful for sharing (especially when you need to create a particular one for debugging something) and for having a table of manageable size for learning purposes. To that end, we’ll learn how to make our own tibbles from scratch. Although we customarily get our data from other sources (e.g., CSV files, database tables, Excel files, etc.), there are a few good reasons for wanting to handcraft our own tibble objects:

  1. To have simple tables for experimentation with functions that operate on tabular data
  2. To reduce the need to use Excel or some other data entry systems (for small enough data)
  3. To create small tables that interface with larger tables (e.g., joining, combining, etc.)
  4. To gain a better understanding of how tibbles work under the hood

We can create tibbles in a few different ways, but let’s focus on tibble construction using either of two functions from the tibble package (which is attached along with the rest of the tidyverse): tibble() and the similarly named tribble().

1.4.1 Creating Tibbles with the tibble() Function

Let’s have a look at a few examples of tibble-making first with tibble(), which takes in named vectors as arguments. In the following example, we use two equal-length vectors (called a and b).

Using tibble() with equal-length vectors to make a tibble.

Notes on the Code
  • Line 2: This will become column a.
  • Line 3: This is to be column b.
tibble( 
  a = c(3, 5, 2, 6),
  b = c("a", "b", "g", "b")
)
# A tibble: 4 × 2
      a b    
  <dbl> <chr>
1     3 a    
2     5 b    
3     2 g    
4     6 b    

As can be seen, the type of each column is based on the type of the vector. The order of columns in the output table is based on the order of the names provided inside tibble().

Let’s make another tibble in a similar manner, but with a single value for a (the value 3 will be repeated down its column).

Using tibble() with two vectors: one of length 1 and the other of length 4.

Notes on the Code
  • Line 2: Only one value for a! That's okay, it will be repeated.
  • Line 3: This will become column b, a column of character-based values.
tibble(
  a = 3,
  b = c("a", "b", "g", "b")
)
# A tibble: 4 × 2
      a b    
  <dbl> <chr>
1     3 a    
2     3 b    
3     3 g    
4     3 b    

In the printed tibble the value 3 in column a is indeed repeated down.

The key is to provide vectors that are all of length n (n here signifies the total number of rows in the table) or some combination of n-length and length-1 vectors. The length-1 value will be repeated down. A vector of any other length (neither 1 nor n) will result in an error.
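
The length rule can be tried out safely by wrapping the failing call in tryCatch(). This is a sketch that assumes the tibble package is installed; the replacement message returned by the handler is made up for illustration and is not tibble’s own error text.

```r
library(tibble)

# Lengths 1 and 4 are compatible: the single value is repeated down.
ok <- tibble(a = 3, b = c("a", "b", "g", "b"))

# Lengths 2 and 4 are not compatible, so tibble() signals an error.
# tryCatch() captures the error so that the script keeps running.
result <- tryCatch(
  tibble(a = c(3, 5), b = c("a", "b", "g", "b")),
  error = function(e) "error: incompatible column lengths"
)
result
```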

We can also pass in NA (missing) values by including NAs in the appropriate vector. In the next example, we incorporate NA values in the two n-length vectors.

Using tibble() with two vectors that contain NA values.

Notes on the Code
  • Line 2: We intentionally placed an NA value among other values in column a.
  • Line 3: There is also an NA value in the b column.
tibble(
  a = c(3, 5, 2, NA),
  b = c("a", NA, "g", "b")
)
# A tibble: 4 × 2
      a b    
  <dbl> <chr>
1     3 a    
2     5 <NA> 
3     2 g    
4    NA b    

The resulting tibble here shows that those NA values in the numeric and character input vectors appear in the output tibble in their expected locations.

In the next code listing, an NA value is used in a vector of length 1. What will happen? Will the NA values be repeated down in the column? Let’s have a look.

Using a single-length vector with an NA value in tibble().

Notes on the Code
  • Line 2: Using a single NA (and nothing else) gives us a certain type of NA: a logical NA (yes, there are different types).
tibble(
  a = NA,
  b = c("a", "b", "g", "b")
)
# A tibble: 4 × 2
  a     b    
  <lgl> <chr>
1 NA    a    
2 NA    b    
3 NA    g    
4 NA    b    

Yes. The NA is repeated down the a column. We can see that column a’s type is <lgl>, or, logical.

Using just NA in a column does result in repeated NAs. However, the column is classified as a logical column (which is meant for TRUE or FALSE values), and that is likely not what was intended. If we want this column to be a character column, we should use a specific type of NA: NA_character_. (There are other missing value constants for other types: NA_real_, NA_integer_, and NA_complex_.) Let’s replace a = NA with a = NA_character_:

Using a single-length vector with an NA_character_ value in tibble().

Notes on the Code
  • Line 2: We are now being specific about the type of NAs we want (the character version).
tibble(
  a = NA_character_,
  b = c("a", "b", "g", "b")
)
# A tibble: 4 × 2
  a     b    
  <chr> <chr>
1 <NA>  a    
2 <NA>  b    
3 <NA>  g    
4 <NA>  b    

And we get a column type of <chr> for a, which is what we wanted.
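
The different missing value constants can be inspected in base R without building a tibble. A short sketch:

```r
# Each NA constant has a different underlying type.
class(NA)             # "logical"
class(NA_integer_)    # "integer"
class(NA_real_)       # "numeric"
class(NA_character_)  # "character"
class(NA_complex_)    # "complex"

# Regardless of type, all of them count as missing.
is.na(NA_character_)
is.na(NA_real_)
```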

1.4.2 Creating Tibbles a Different Way with the tribble() Function

We can use the tribble() function as an alternative constructor for tibble objects. This next example with tribble() reproduces a tibble generated in a previous code listing:

Creating a tibble using the tribble() function.

Notes on the Code
  • Line 1: As tribble() is very close in spelling to tibble(), be a little careful here.
  • Line 2: The column names are prepended by the tilde character, and we don't use quotes.
  • Line 6: The last (hanging) comma here is fine to keep. It won't result in an error.
tribble(
  ~a, ~b,
  3,  "a",
  5,  "b",
  2,  "g",
  6,  "b",
)
# A tibble: 4 × 2
      a b    
  <dbl> <chr>
1     3 a    
2     5 b    
3     2 g    
4     6 b    

The resulting tibble appears just as we laid out the values. As can be seen in the code listing, the table values aren’t provided as vectors but instead are laid out as column names and values in a manner that approximates the structure of a table. Importantly, the column names are preceded by a tilde (~) character, and commas separate all values. This way of building a simple tibble can be useful when having values side-by-side is important for minimizing the potential for error.
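
To check that the two constructors give the same table, both versions can be built and compared. The sketch below assumes the tibble package is installed; the object names by_column and by_row are illustrative, and the comparison is done on plain data frames to keep it independent of any tibble-specific attributes.

```r
library(tibble)

# Column-wise construction with tibble().
by_column <- tibble(
  a = c(3, 5, 2, 6),
  b = c("a", "b", "g", "b")
)

# Row-wise construction with tribble().
by_row <- tribble(
  ~a, ~b,
  3,  "a",
  5,  "b",
  2,  "g",
  6,  "b"
)

# Compare the contents of the two tables.
all.equal(as.data.frame(by_column), as.data.frame(by_row))
```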

1.5 Summary

  • There are many datasets kicking around in R packages; after loading the package (and discovering the datasets) you simply use the dataset name to print it
  • You can use a number of functions to look at a dataset that’s a table: print(), head(), dplyr’s glimpse(), View() in RStudio, gt’s gt_preview(), and skimr’s skim()
  • Get the column names of a table with names() or colnames(); get the dimensions with dim()
  • The base R function summary() gives you a nice summary of a table and it’s always there for you
  • You can easily make your own tibbles with either tibble() or tribble()