1 Introduction to Data Frames and Tibbles

This chapter covers

Printing a dataset obtained from a package with the object name and with print(), head(), and dplyr’s glimpse()
Using the names() and dim() functions to get the column names and the dimensions of a table
Looking at a full dataset with View() and the head and tail with gt’s gt_preview() function
Using summary() to get a basic summary of a table and, going further, skimr’s skim() for a more comprehensive summary
Creating our own tables/tibbles with tibble() and tribble()

When you’re working with data, quite often it’ll be structured data in the form of tables. Tabular data is usually a joy to work with: there’s regularity in its shape (rectangular!), the rules for working with tables are widely known, and we can also take proactive steps to shape the data in a way that’s even more convenient for data analysis.

Tables come in all shapes and sizes. Some are very small, filling maybe a screen or two with data. Some are in between, requiring querying and summarizing to really make sense of the data. Finally, some are huge and those require more efficient tools for transforming, querying, and summarizing the data. Here, we’ll start small and learn how to explore tables of the small variety. In R, tabular data is often stored in data structures known as data frames. They’re part of base R, so you don’t need any extra packages to use them.

We also have another variety of tables which are called tibbles. A tibble is a special implementation of a data frame that prints its data differently and behaves in a more consistent manner than data frames. This flavor of data frame is used quite a lot throughout this book and we’ll make the case that you should also make them yours.
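One concrete instance of this more consistent behavior can be sketched with a small made-up table (the names df and tb are purely illustrative). Extracting a single column with [ from a base data frame drops down to a plain vector, while the same operation on a tibble still returns a tibble.

```r
library(tibble)

# A small, hypothetical table built as a base R data frame
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

# The same data as a tibble
tb <- as_tibble(df)

# Single-column subsetting with `[`: the data frame drops to a vector...
class(df[, "a"])

# ...whereas the tibble keeps its tabular shape
class(tb[, "a"])
```

With a tibble, you always know that [ gives back a table, which makes code that operates on tables easier to reason about.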

In this chapter we’ll take a look at a number of great ways to work with tables (whether they are data frames or tibbles). It’s really hard to read all of the data in a large table so we can make use of a plethora of functions that provide different views of a table of interest. We will explore a handful of different packages that expressly deal with data frame exploration and summarization for better understanding.

1.1 Several Quick Ways to Initially Explore a Dataset

Let’s look at some data from the episodes dataset in the dspatterns package. It serves as a highly quantitative episode guide. Before we start, we’ll load in the necessary packages, namely the tidyverse and dspatterns packages (the latter will load the bakeoff package, and that’s where the dataset originates).

Loading the tidyverse and dspatterns packages.

Notes on the Code

L.1 Loading the tidyverse package like this actually auto-loads all core tidyverse packages (this is almost everything we'll need for most analyses!).
L.2 The dspatterns package is this book's namesake package. It has the datasets we need for all of the examples.

library(tidyverse)
library(dspatterns)

Now let’s look at the episodes dataset by printing it out:

episodes

# A tibble: 94 × 10
   series episode bakers_appeared bakers_out bakers_remaining star_bakers
    <dbl>   <dbl>           <int>      <int>            <int>       <int>
 1      1       1              10          2                8           0
 2      1       2               8          2                6           0
 3      1       3               6          1                5           0
 4      1       4               5          1                4           0
 5      1       5               4          1                3           0
 6      1       6               3          0                3           0
 7      2       1              12          1               11           1
 8      2       2              11          1               10           1
 9      2       3              10          2                8           1
10      2       4               8          1                7           2
# ℹ 84 more rows
# ℹ 4 more variables: technical_winners <int>, sb_name <chr>,
#   winner_name <chr>, eliminated <chr>

The printed output provides a lot of information without overloading the console. Because the tidyverse packages are loaded, the tibble version of the dataset is what’s provided during printing. We can see some of the data (10 rows), the dimensions of the table (94 rows by 10 columns), and, when printing in an interactive session, a useful tip to use print(n = ...) to see more rows. Let’s try that:

print(episodes, n = 20)

# A tibble: 94 × 10
   series episode bakers_appeared bakers_out bakers_remaining star_bakers
    <dbl>   <dbl>           <int>      <int>            <int>       <int>
 1      1       1              10          2                8           0
 2      1       2               8          2                6           0
 3      1       3               6          1                5           0
 4      1       4               5          1                4           0
 5      1       5               4          1                3           0
 6      1       6               3          0                3           0
 7      2       1              12          1               11           1
 8      2       2              11          1               10           1
 9      2       3              10          2                8           1
10      2       4               8          1                7           2
11      2       5               7          2                5           1
12      2       6               5          1                4           1
13      2       7               4          1                3           0
14      2       8               3          0                3           0
15      3       1              12          1               11           1
16      3       2              11          1               10           1
17      3       3              10          1                9           1
18      3       4               9          1                8           1
19      3       5               8          1                7           1
20      3       6               7          0                7           1
# ℹ 74 more rows
# ℹ 4 more variables: technical_winners <int>, sb_name <chr>,
#   winner_name <chr>, eliminated <chr>

Being able to print an exact number of rows with print() is sometimes useful if you have a generally small table and you need to see more of it.

Warning

When using print() to specify the number of rows displayed, we have to be sure that the table object is a tibble. This won’t work with plain data frames: a call such as print(mtcars, n = 5) will not give you a five-row printout.
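For a plain data frame, two workable alternatives are sketched below using the built-in mtcars dataset: ask head() for the desired number of rows, or convert the data frame to a tibble first with as_tibble() so that print(n = ...) is honored. Note that as_tibble() drops the row names of mtcars (the car models) unless you keep them through its rownames argument; the column name "model" and the object names used here are just illustrative choices.

```r
library(tibble)

# Option 1: base R's head() limits the rows of a data frame
first_five <- head(mtcars, n = 5)
first_five

# Option 2: convert to a tibble, then print() respects `n`
# (`rownames = "model"` keeps the car names in a new column)
mtcars_tbl <- as_tibble(mtcars, rownames = "model")
print(mtcars_tbl, n = 5)
```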

If you wanted just a few rows, you could use the head() function like this:

head(episodes)

# A tibble: 6 × 10
  series episode bakers_appeared bakers_out bakers_remaining star_bakers
   <dbl>   <dbl>           <int>      <int>            <int>       <int>
1      1       1              10          2                8           0
2      1       2               8          2                6           0
3      1       3               6          1                5           0
4      1       4               5          1                4           0
5      1       5               4          1                3           0
6      1       6               3          0                3           0
# ℹ 4 more variables: technical_winners <int>, sb_name <chr>,
#   winner_name <chr>, eliminated <chr>

If you want smaller, more focused output on what’s in the table, the names() and dim() functions yield a vector of column names and a vector of the table’s dimensions, respectively.

names(episodes)

 [1] "series"            "episode"           "bakers_appeared"  
 [4] "bakers_out"        "bakers_remaining"  "star_bakers"      
 [7] "technical_winners" "sb_name"           "winner_name"      
[10] "eliminated"

dim(episodes)

[1] 94 10

Take note that the convention for table dimensions in R is first the number of rows (94) and then the number of variables or columns (10). If you’re using the RStudio IDE, then the special View() function will put a table into a spreadsheet-like view:

View(episodes)

Please note that if you’re using Quarto or R Markdown, having View() in a chunk is generally not a good idea if you’re intending on rendering the document for distribution (since the effect of View() is to provide a secondary ‘view’ of your data). It’s best to use it in only an interactive context.
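One way to keep a View() call from interfering with rendering, offered here as a suggestion rather than a rule, is to guard it with base R's interactive() function so that it only runs in a live session. The built-in mtcars dataset stands in for your own table.

```r
# Open the spreadsheet-style viewer only when R is being used interactively;
# in a non-interactive R process (the usual case when a document is
# rendered) the call is skipped
if (interactive()) {
  View(mtcars)
}
```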

Sometimes you may want to see just a small portion of your input data. We can use gt_preview() from the gt package to get the first x rows of data and the last y rows of data (these parameters can be set by the top_n and bottom_n arguments of gt_preview()). Let’s try it with the bakers dataset.

gt_preview(bakers)

	series	baker	star_baker	technical_winner	technical_top3	technical_bottom	technical_highest	technical_lowest	technical_median	series_winner	series_runner_up	total_episodes_appeared	first_date_appeared	last_date_appeared	first_date_us	last_date_us	percent_episodes_appeared	percent_technical_top3	baker_full	age	occupation	hometown	baker_last	baker_first
1	1	Annetha	0	0	1	1	2	7	4.5	0	0	2	2010-08-17	2010-08-24	NA	NA	33.33333	50.00000	Annetha Mills	30	Midwife	Essex	Mills	Annetha
2	1	David	0	0	1	3	3	8	4.5	0	0	4	2010-08-17	2010-09-07	NA	NA	66.66667	25.00000	David Chambers	31	Entrepreneur	Milton Keynes	Chambers	David
3	1	Edd	0	2	4	1	1	6	2.0	1	0	6	2010-08-17	2010-09-21	NA	NA	100.00000	66.66667	Edward "Edd" Kimber	24	Debt collector for Yorkshire Bank	Bradford	Kimber	Edward
4	1	Jasminder	0	0	2	2	2	5	3.0	0	0	5	2010-08-17	2010-09-14	NA	NA	83.33333	40.00000	Jasminder Randhawa	45	Assistant Credit Control Manager	Birmingham	Randhawa	Jasminder
5	1	Jonathan	0	1	1	2	1	9	6.0	0	0	3	2010-08-17	2010-08-31	NA	NA	50.00000	33.33333	Jonathan Shepherd	25	Research Analyst	St Albans	Shepherd	Jonathan
6..119
120	10	Steph	0	1	6	4	1	10	3.0	0	0	10	NA	NA	NA	NA	100.00000	60.00000	Steph Blackwell	28	Shop assistant	Chester	Blackwell	Steph

What you get by default is the first five rows and the last row of the bakers dataset. We can see that what’s not shown are rows 6 to 119 (it’s shown as 6..119 in the table stub).

If you wanted to show the first and last 10 rows of the bakers dataset, that’s not a problem. It can be accomplished with the top_n and bottom_n arguments, like this:

gt_preview(bakers, top_n = 10, bottom_n = 10)

	series	baker	star_baker	technical_winner	technical_top3	technical_bottom	technical_highest	technical_lowest	technical_median	series_winner	series_runner_up	total_episodes_appeared	first_date_appeared	last_date_appeared	first_date_us	last_date_us	percent_episodes_appeared	percent_technical_top3	baker_full	age	occupation	hometown	baker_last	baker_first
1	1	Annetha	0	0	1	1	2	7	4.5	0	0	2	2010-08-17	2010-08-24	NA	NA	33.33333	50.00000	Annetha Mills	30	Midwife	Essex	Mills	Annetha
2	1	David	0	0	1	3	3	8	4.5	0	0	4	2010-08-17	2010-09-07	NA	NA	66.66667	25.00000	David Chambers	31	Entrepreneur	Milton Keynes	Chambers	David
3	1	Edd	0	2	4	1	1	6	2.0	1	0	6	2010-08-17	2010-09-21	NA	NA	100.00000	66.66667	Edward "Edd" Kimber	24	Debt collector for Yorkshire Bank	Bradford	Kimber	Edward
4	1	Jasminder	0	0	2	2	2	5	3.0	0	0	5	2010-08-17	2010-09-14	NA	NA	83.33333	40.00000	Jasminder Randhawa	45	Assistant Credit Control Manager	Birmingham	Randhawa	Jasminder
5	1	Jonathan	0	1	1	2	1	9	6.0	0	0	3	2010-08-17	2010-08-31	NA	NA	50.00000	33.33333	Jonathan Shepherd	25	Research Analyst	St Albans	Shepherd	Jonathan
6	1	Lea	0	0	0	1	10	10	10.0	0	0	1	2010-08-17	2010-08-17	NA	NA	16.66667	0.00000	Lea Harris	51	Retired	Midlothian, Scotland	Harris	Lea
7	1	Louise	0	0	0	1	4	4	4.0	0	0	2	2010-08-17	2010-08-24	NA	NA	33.33333	0.00000	Louise Brimelow	44	Police Officer	Manchester	Brimelow	Louise
8	1	Mark	0	0	0	0	NA	NA	NA	0	0	1	2010-08-17	2010-08-17	NA	NA	16.66667	0.00000	Mark Whithers	48	Bus Driver	South Wales	Whithers	Mark
9	1	Miranda	0	2	4	1	1	8	3.0	0	0	6	2010-08-17	2010-09-21	NA	NA	100.00000	66.66667	Miranda Gore Browne	37	Food buyer for Marks & Spencer	Midhurst, West Sussex	Browne	Miranda
10	1	Ruth	0	0	2	2	2	5	3.5	0	0	6	2010-08-17	2010-09-21	NA	NA	100.00000	33.33333	Ruth Clemens	31	Retail manager/Housewife	Poynton, Cheshire	Clemens	Ruth
11..110
111	10	David	0	2	8	2	1	10	2.0	1	0	10	NA	NA	NA	NA	100.00000	80.00000	David Atherton	36	International health adviser	Whitby	Atherton	David
112	10	Helena	0	1	1	4	1	12	9.0	0	0	5	NA	NA	NA	NA	50.00000	20.00000	Helena Garcia	40	Online project manager	Leeds	Garcia	Helena
113	10	Henry	0	2	5	3	1	6	3.0	0	0	8	NA	NA	NA	NA	80.00000	62.50000	Henry Bird	20	Student	Durham	Bird	Henry
114	10	Jamie	0	0	0	2	11	13	12.0	0	0	2	NA	NA	NA	NA	20.00000	0.00000	Jamie Finn	20	Part-time waiter	Surrey	Finn	Jamie
115	10	Michael	0	0	0	7	4	11	6.0	0	0	7	NA	NA	NA	NA	70.00000	0.00000	Michael Chakraverty	26	Theatre manager/fitness instructor	Stratford-upon-Avon	Chakraverty	Michael
116	10	Michelle	0	0	0	5	5	8	6.0	0	0	5	NA	NA	NA	NA	50.00000	0.00000	Michelle Evans-Fecci	35	Print shop administrator	Tenby, Wales	Evans-Fecci	Michelle
117	10	Phil	0	0	1	3	3	10	7.0	0	0	4	NA	NA	NA	NA	40.00000	25.00000	Phil Thorne	56	HGV driver	Rainham	Thorne	Phil
118	10	Priya	0	0	1	5	2	10	7.0	0	0	6	NA	NA	NA	NA	60.00000	16.66667	Priya O'Shea	34	Marketing consultant	Leicester	O'Shea	Priya
119	10	Rosie	0	2	4	5	1	9	4.0	0	0	9	NA	NA	NA	NA	90.00000	44.44444	Rosie Brandreth-Poynter	28	Veterinary surgeon	Somerset	Brandreth-Poynter	Rosie
120	10	Steph	0	1	6	4	1	10	3.0	0	0	10	NA	NA	NA	NA	100.00000	60.00000	Steph Blackwell	28	Shop assistant	Chester	Blackwell	Steph

The gt_preview() function is relatively simple, but it comes in handy if you want a nicer display of the head and tail of a dataset.
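If gt isn't available, a rough plain-text approximation of the same idea can be put together with base R's head() and tail(). This sketch uses the built-in mtcars dataset, and the object name preview is just illustrative. Unlike gt_preview(), it gives no marker for the omitted rows.

```r
# First five rows and the last row, stacked into one small data frame
preview <- rbind(
  head(mtcars, n = 5),
  tail(mtcars, n = 1)
)

preview
```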

1.2 Using glimpse() to Go Sideways

While inspecting rows of your raw data isn’t always the best approach, it can be useful for quickly understanding how the different variables fit together. The glimpse() function (accessible from the dplyr package) lets you look at the first few records of a dataset. This is somewhat like the head() function seen earlier but turned sideways:

glimpse(episodes)

Rows: 94
Columns: 10
$ series            <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3…
$ episode           <dbl> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4…
$ bakers_appeared   <int> 10, 8, 6, 5, 4, 3, 12, 11, 10, 8, 7, 5, 4, 3, 12, 11…
$ bakers_out        <int> 2, 2, 1, 1, 1, 0, 1, 1, 2, 1, 2, 1, 1, 0, 1, 1, 1, 1…
$ bakers_remaining  <int> 8, 6, 5, 4, 3, 3, 11, 10, 8, 7, 5, 4, 3, 3, 11, 10, …
$ star_bakers       <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 1, 1, 0, 0, 1, 1, 1, 1…
$ technical_winners <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ sb_name           <chr> NA, NA, NA, NA, NA, NA, "Holly", "Jason", "Yasmin", …
$ winner_name       <chr> NA, NA, NA, NA, NA, "Edd", NA, NA, NA, NA, NA, NA, N…
$ eliminated        <chr> "Lea, Mark", "Annetha, Louise", "Jonathan", "David",…

Unlike the tibble view (with head() or not), you get to see all of the columns in the data table. The interesting thing about glimpse() is that it invisibly returns the data that’s given to it. Because of that, you can safely have one or several glimpse() calls in a data transformation pipeline and each of those will print the state of the data at a different juncture.

episodes |>
  glimpse() |>
  select(series, episode, winner_name) |>
  filter(!is.na(winner_name)) |>
  glimpse()

Rows: 94
Columns: 10
$ series            <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3…
$ episode           <dbl> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4…
$ bakers_appeared   <int> 10, 8, 6, 5, 4, 3, 12, 11, 10, 8, 7, 5, 4, 3, 12, 11…
$ bakers_out        <int> 2, 2, 1, 1, 1, 0, 1, 1, 2, 1, 2, 1, 1, 0, 1, 1, 1, 1…
$ bakers_remaining  <int> 8, 6, 5, 4, 3, 3, 11, 10, 8, 7, 5, 4, 3, 3, 11, 10, …
$ star_bakers       <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 1, 1, 0, 0, 1, 1, 1, 1…
$ technical_winners <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ sb_name           <chr> NA, NA, NA, NA, NA, NA, "Holly", "Jason", "Yasmin", …
$ winner_name       <chr> NA, NA, NA, NA, NA, "Edd", NA, NA, NA, NA, NA, NA, N…
$ eliminated        <chr> "Lea, Mark", "Annetha, Louise", "Jonathan", "David",…
Rows: 10
Columns: 3
$ series      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ episode     <dbl> 6, 8, 10, 10, 10, 10, 10, 10, 10, 10
$ winner_name <chr> "Edd", "Joanne", "John", "Frances", "Nancy", "Nadiya", "Ca…

As can be seen above, the original dataset was printed with glimpse() and then passed to select() and filter() just before a final glimpse() call (to see the transformed data). The output is the two glimpse() outputs stacked atop each other.
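The invisible return value can be checked directly. In this small sketch (using the built-in mtcars dataset, with returned as an illustrative name), the result of glimpse() is captured and compared with the input.

```r
library(dplyr)

# glimpse() prints its sideways view and hands back its input unchanged
returned <- glimpse(mtcars)

# Compare the captured object with the original data
identical(returned, mtcars)
```

This pass-through behavior is what lets glimpse() sit in the middle of a pipeline without altering what the next step receives.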

1.3 Getting Data Summaries

Something else that’s very useful during the exploration phase of data work is the summary() function. It breaks the table down column by column, giving each column its own summary.

summary(episodes)

     series          episode       bakers_appeared    bakers_out    
 Min.   : 1.000   Min.   : 1.000   Min.   : 3.000   Min.   :0.0000  
 1st Qu.: 3.250   1st Qu.: 3.000   1st Qu.: 5.000   1st Qu.:1.0000  
 Median : 6.000   Median : 5.000   Median : 7.000   Median :1.0000  
 Mean   : 5.766   Mean   : 5.287   Mean   : 7.553   Mean   :0.9468  
 3rd Qu.: 8.000   3rd Qu.: 8.000   3rd Qu.:10.000   3rd Qu.:1.0000  
 Max.   :10.000   Max.   :10.000   Max.   :13.000   Max.   :2.0000  
 bakers_remaining  star_bakers     technical_winners   sb_name         
 Min.   : 3.000   Min.   :0.0000   Min.   :0.0000    Length:94         
 1st Qu.: 4.000   1st Qu.:1.0000   1st Qu.:1.0000    Class :character  
 Median : 6.500   Median :1.0000   Median :1.0000    Mode  :character  
 Mean   : 6.606   Mean   :0.8404   Mean   :0.9894                      
 3rd Qu.: 9.000   3rd Qu.:1.0000   3rd Qu.:1.0000                      
 Max.   :12.000   Max.   :2.0000   Max.   :1.0000                      
 winner_name         eliminated       
 Length:94          Length:94         
 Class :character   Class :character  
 Mode  :character   Mode  :character

For columns that are numeric, the summary() function automatically calculates the following summary statistics for each column of the table:

Min: The minimum value
1st Qu: The first quartile value (25th percentile)
Median: The median value
Mean: The mean (average) value
3rd Qu: The third quartile value (75th percentile)
Max: The maximum value

There are a few character columns in the episodes dataset (e.g., sb_name, etc.) and summary() doesn’t do much with those other than state that they are indeed of the character class. Note that missing values in character columns aren’t counted: sb_name contains NA values, yet nothing in its summary says so. For numeric columns that contain NA values, summary() does report how many on a column-by-column basis.
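To see how summary() handles missing values in practice, here is a small sketch with a made-up tibble (the object name scores and the column names score and label are hypothetical). The numeric column gains an NA's entry in its summary, while the character column gets the same Length/Class/Mode treatment as before.

```r
library(tibble)

# A tiny, hypothetical table with one NA in each column
scores <- tibble(
  score = c(4, NA, 7, 9),
  label = c("a", "b", NA, "d")
)

# The summary of `score` includes a count of NA values;
# the summary of `label` does not
summary(scores)

# Counting NA values per column directly works for any column type
colSums(is.na(scores))
```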

For a more comprehensive look at a dataset, the skim() function from the skimr package offers a report that is broken down by variable type. Using skim() with the episodes dataset from bakeoff will give us an overall data summary, information for the character variables (sb_name, winner_name, and eliminated) such as n_missing and complete_rate, and summary statistics for the numeric variables.

skim(episodes)

── Data Summary ────────────────────────
                           Values  
Name                       episodes
Number of rows             94      
Number of columns          10      
_______________________            
Column type frequency:             
  character                3       
  numeric                  7       
________________________           
Group variables            None    

── Variable type: character ───────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 sb_name              16         0.830   3  12     0       47          0
2 winner_name          84         0.106   3   7     0       10          0
3 eliminated           13         0.862   3  16     0       76          0

── Variable type: numeric ─────────────────────────────────────────────────────────
  skim_variable     n_missing complete_rate  mean    sd p0  p25 p50 p75 p100 hist 
1 series                    0             1 5.77  2.77   1 3.25 6     8   10 ▆▇▇▇▇
2 episode                   0             1 5.29  2.83   1 3    5     8   10 ▇▇▇▇▆
3 bakers_appeared           0             1 7.55  2.97   3 5    7    10   13 ▇▅▅▅▃
4 bakers_out                0             1 0.947 0.472  0 1    1     1    2 ▂▁▇▁▁
5 bakers_remaining          0             1 6.61  2.80   3 4    6.5   9   12 ▇▅▅▅▃
6 star_bakers               0             1 0.840 0.396  0 1    1     1    2 ▂▁▇▁▁
7 technical_winners         0             1 0.989 0.103  0 1    1     1    1 ▁▁▁▁▇

The numeric variable types really get the deluxe treatment here with a statistical summary consisting of the mean, the standard deviation (sd), percentile values, and little histograms! It doesn’t take very long at all to get such a summary so it’s worth it every time for new and unfamiliar datasets.

1.4 Rolling Our Own Tabular Data

Creating your own tabular data can be really useful for sharing (especially when you need to create a particular one for debugging something) and for having a table of manageable size for learning purposes. To that end, we’ll learn how to make our own tibbles from scratch. Although we customarily get our data from other sources (e.g., CSV files, database tables, Excel files, etc.), there are a few good reasons for wanting to handcraft our own tibble objects:

To have simple tables for experimentation with functions that operate on tabular data
To reduce the need to use Excel or some other data entry systems (for small enough data)
To create small tables that interface with larger tables (e.g., joining, combining, etc.)
To gain a better understanding of how tibbles work under the hood

We can create tibbles in a few different ways but let’s focus on tibble construction using either of two functions from the tibble package (which is loaded as part of the tidyverse): tibble() and the similarly-named tribble().

1.4.1 Creating Tibbles with the tibble() Function

Let’s have a look at a few examples of tibble-making first with tibble(), which takes in named vectors as arguments. In the following example, we use two equal-length vectors (called a and b).

Using tibble() with equal-length vectors to make a tibble.

Notes on the Code

L.2 This will become column a.
L.3 This is to be column b.

tibble( 
  a = c(3, 5, 2, 6),
  b = c("a", "b", "g", "b")
)

# A tibble: 4 × 2
      a b    
  <dbl> <chr>
1     3 a    
2     5 b    
3     2 g    
4     6 b

As can be seen, the type of each column is based on the type of the vector. The order of columns in the output table is based on the order of the names provided inside tibble().

Let’s make another tibble in a similar manner, but with a single value for a (the value 3 will be repeated down its column).

Using tibble() with two vectors: one of length 1 and the other of length 4.

Notes on the Code

L.2 Only one value for a! That's okay, it will be repeated.
L.3 This will become column b, a column of character-based values.

tibble(
  a = 3,
  b = c("a", "b", "g", "b")
)

# A tibble: 4 × 2
      a b    
  <dbl> <chr>
1     3 a    
2     3 b    
3     3 g    
4     3 b

In the printed tibble the value 3 in column a is indeed repeated down.

The key is to provide either vectors that are all of length n (n here signifies the total number of rows in the table) or some combination of n-length and length-1 vectors. Any length-1 value will be repeated down its column. A vector of any other length (say, a length-2 vector alongside a length-4 vector) will result in an error.
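The failing case can be sketched as follows. Here a has length 2 while b has length 4; tryCatch() is used only so that the error can be captured and inspected rather than halting the script (the name result is illustrative).

```r
library(tibble)

# `a` has length 2 and `b` has length 4: `a` is neither length 1
# nor the same length as `b`, so it cannot be recycled
result <- tryCatch(
  tibble(
    a = c(3, 5),
    b = c("a", "b", "g", "b")
  ),
  error = function(e) e
)

# `result` holds the captured error condition rather than a tibble
inherits(result, "error")
conditionMessage(result)
```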

We can also pass in NA (missing) values by including NAs in the appropriate vector. In the next example, we incorporate NA values in the two n-length vectors.

Using tibble() with two vectors that contain NA values.

Notes on the Code

L.2 We intentionally placed an NA value among other values in column a.
L.3 There is also an NA value in the b column.

tibble(
  a = c(3, 5, 2, NA),
  b = c("a", NA, "g", "b")
)

# A tibble: 4 × 2
      a b    
  <dbl> <chr>
1     3 a    
2     5 <NA> 
3     2 g    
4    NA b

The resulting tibble here shows that those NA values in the numeric and character input vectors appear in the output tibble in their expected locations.

In the next code listing, an NA value is used in a vector of length 1. What will happen? Will the NA values be repeated down in the column? Let’s have a look.

Using a single-length vector with an NA value in tibble().

Notes on the Code

L.2 Using a single NA (and nothing else) gives us a certain type of NA: a logical NA (yes, there are different types).

tibble(
  a = NA,
  b = c("a", "b", "g", "b")
)

# A tibble: 4 × 2
  a     b    
  <lgl> <chr>
1 NA    a    
2 NA    b    
3 NA    g    
4 NA    b

Yes. The NA is repeated down the a column. We can see that column a’s type is <lgl>, or, logical.

Using just NA in a column does result in repeated NAs. However, the column is classified as a logical column (which is meant for TRUE or FALSE values), and that’s likely not what was intended. If we want this column to be a character column, we should use a specific type of NA: NA_character_. (There are other missing value constants for other types: NA_real_, NA_integer_, and NA_complex_.) Let’s replace a = NA with a = NA_character_:

Using a single-length vector with an NA_character_ value in tibble().

Notes on the Code

L.2 We are now being specific about the type of NAs we want (the character version).

tibble(
  a = NA_character_,
  b = c("a", "b", "g", "b")
)

# A tibble: 4 × 2
  a     b    
  <chr> <chr>
1 <NA>  a    
2 <NA>  b    
3 <NA>  g    
4 <NA>  b

And we get a column type of <chr> for a, which is what we wanted.
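The types of these missing value constants can be inspected with base R's typeof(). This is a quick check rather than anything specific to tibbles, and it also shows why a bare NA placed among other values (as in an earlier listing) did not change the column type.

```r
# A bare NA is logical; the typed constants carry their own types
typeof(NA)             # "logical"
typeof(NA_character_)  # "character"
typeof(NA_real_)       # "double"
typeof(NA_integer_)    # "integer"
typeof(NA_complex_)    # "complex"

# When combined with other values, a bare NA is coerced to their type
typeof(c(3, 5, 2, NA)) # "double"
```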

1.4.2 Creating Tibbles a Different Way with the tribble() Function

We can use the tribble() function as an alternative constructor for tibble objects. This next example with tribble() reproduces a tibble generated in a previous code listing:

Creating a tibble using the tribble() function.

Notes on the Code

L.1 As tribble() is very close in spelling to tibble(), be a little careful here.
L.2 The column names are prepended by the tilde character, and we don't use quotes.
L.6 The last (hanging) comma here is fine to keep. It won't result in an error.

tribble(
  ~a, ~b,
  3,  "a",
  5,  "b",
  2,  "g",
  6,  "b",
)

# A tibble: 4 × 2
      a b    
  <dbl> <chr>
1     3 a    
2     5 b    
3     2 g    
4     6 b

The resulting tibble appears just as we laid out the values. As can be seen in the code listing, the table values aren’t provided as vectors but instead are laid out as column names and values in a manner that approximates the structure of a table. Importantly, the column names are preceded by a tilde (~) character and commas separate all values. This way of building a simple tibble can be useful when having values side-by-side is important for minimizing the potential for error.
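As a quick check that the two constructors agree, the sketch below builds the same table both ways and compares the results column by column (the names by_column and by_row are illustrative).

```r
library(tibble)

# Column-wise construction with tibble()
by_column <- tibble(
  a = c(3, 5, 2, 6),
  b = c("a", "b", "g", "b")
)

# Row-wise construction of the same values with tribble()
by_row <- tribble(
  ~a, ~b,
  3,  "a",
  5,  "b",
  2,  "g",
  6,  "b"
)

# Compare the dimensions and the contents of each column
identical(dim(by_column), dim(by_row))
identical(by_column$a, by_row$a)
identical(by_column$b, by_row$b)
```

Which constructor to reach for is mostly a matter of how the data is easiest to read: tibble() suits values you already think of as columns, while tribble() suits small tables you want to read row by row.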

1.5 Summary

There are many datasets kicking around in R packages; after loading the package (and discovering the datasets) you simply use the dataset name to print it
You can use a number of functions to look at a dataset that’s a table: print(), head(), dplyr’s glimpse(), View() in RStudio, gt’s gt_preview(), and skimr’s skim()
Get the column names of a table with names() or colnames(); get the dimensions with dim()
The base R function summary() gives you a nice summary of a table and it’s always there for you
You can easily make your own tibbles with either tibble() or tribble()
