1 Introduction to Data Frames and Tibbles
This MVP covers
- Printing a dataset obtained from a package with the object name and with print(), head(), and dplyr’s glimpse()
- Using the names() and dim() functions to get the column names and the dimensions of a table
- Looking at a full dataset with View() and the head and tail with gt’s gt_preview() function
- Using summary() to get a basic summary of a table and, going further, skimr’s skim() for a more comprehensive summary
- Creating our own tables/tibbles with tibble() and tribble()
When you’re working with data, quite often it’ll be structured data in the form of tables. Tabular data is usually a joy to work with: there’s regularity in its shape (rectangular!), the rules for working with tables are widely known, and we can also take proactive steps to shape the data in a way that’s even more convenient for data analysis.
Tables come in all shapes and sizes. Some are very small, filling maybe a screen or two with data. Some are in between, requiring querying and summarizing to really make sense of the data. Finally, some are huge and those require more efficient tools for transforming, querying, and summarizing the data. Here, we’ll start small and learn how to explore tables of the small variety. In R, tabular data is often stored in data structures known as data frames. Data frames are part of base R, so you don’t need any extra packages to use them.
We also have another variety of tables which are called tibbles. A tibble is a special implementation of a data frame that prints its data differently and behaves in a more consistent manner than data frames. This flavor of data frame is used quite a lot throughout this book and we’ll make the case that you should also make them yours.
In this chapter we’ll take a look at a number of great ways to work with tables (whether they are data frames or tibbles). It’s really hard to read all of the data in a large table so we can make use of a plethora of functions that provide different views of a table of interest. We will explore a handful of different packages that expressly deal with data frame exploration and summarization for better understanding.
1.1 Several Quick Ways to Initially Explore a Dataset
Let’s look at some data from the episodes dataset in the dspatterns package. It serves as a highly quantitative episode guide. Before we start, we’ll load in the necessary packages, namely the tidyverse and dspatterns packages (the latter will load the bakeoff package, and that’s where the dataset originates).
Loading the tidyverse and dspatterns packages.
Notes on the Code
L.1 Loading the tidyverse package like this actually auto-loads all core tidyverse packages (this is almost everything we'll need for most analyses!).
L.2 The dspatterns package is this book's namesake package. It has the datasets we need for all of the examples.
Now let’s look at the episodes dataset by printing it out:
episodes
# A tibble: 94 × 10
series episode bakers_appeared bakers_out bakers_remaining star_bakers
<dbl> <dbl> <int> <int> <int> <int>
1 1 1 10 2 8 0
2 1 2 8 2 6 0
3 1 3 6 1 5 0
4 1 4 5 1 4 0
5 1 5 4 1 3 0
6 1 6 3 0 3 0
7 2 1 12 1 11 1
8 2 2 11 1 10 1
9 2 3 10 2 8 1
10 2 4 8 1 7 2
# ℹ 84 more rows
# ℹ 4 more variables: technical_winners <int>, sb_name <chr>,
# winner_name <chr>, eliminated <chr>
It provides a lot of information but doesn’t overload the console. Because the tidyverse packages are loaded, the tibble version of the dataset is what’s provided during printing. At any rate, we can see some of the data (10 rows), the dimensions of the table (94 rows by 10 columns), and a useful tip to use print(n = ...) to see more rows. Let’s try that:
print(episodes, n = 20)
# A tibble: 94 × 10
series episode bakers_appeared bakers_out bakers_remaining star_bakers
<dbl> <dbl> <int> <int> <int> <int>
1 1 1 10 2 8 0
2 1 2 8 2 6 0
3 1 3 6 1 5 0
4 1 4 5 1 4 0
5 1 5 4 1 3 0
6 1 6 3 0 3 0
7 2 1 12 1 11 1
8 2 2 11 1 10 1
9 2 3 10 2 8 1
10 2 4 8 1 7 2
11 2 5 7 2 5 1
12 2 6 5 1 4 1
13 2 7 4 1 3 0
14 2 8 3 0 3 0
15 3 1 12 1 11 1
16 3 2 11 1 10 1
17 3 3 10 1 9 1
18 3 4 9 1 8 1
19 3 5 8 1 7 1
20 3 6 7 0 7 1
# ℹ 74 more rows
# ℹ 4 more variables: technical_winners <int>, sb_name <chr>,
# winner_name <chr>, eliminated <chr>
Being able to print an exact number of rows with print() is sometimes useful if you have a generally small table and you need to see more of it.
When using print() to specify the number of rows displayed, we have to be sure that the table object is a tibble. This won’t work with base data frames: the data frame print method doesn’t honor n, so print(mtcars, n = 5) still prints every row.
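A minimal sketch of one workaround, using the built-in mtcars data frame as a stand-in: converting a data frame with as_tibble() from the tibble package gives an object whose print method does honor n. The name cars_tbl is purely illustrative.

```r
library(tibble)

# mtcars ships with R as a plain data frame (32 rows).
# Converting it to a tibble lets print() honor the n argument.
cars_tbl <- as_tibble(mtcars)

# Shows only the first five rows, followed by a note about the remaining ones
print(cars_tbl, n = 5)
```

Note that as_tibble() drops the row names of mtcars by default; passing rownames = "model" (the column name is arbitrary) keeps them as a regular column.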
If you wanted just a few rows, you could use the head() function like this:
head(episodes)
# A tibble: 6 × 10
series episode bakers_appeared bakers_out bakers_remaining star_bakers
<dbl> <dbl> <int> <int> <int> <int>
1 1 1 10 2 8 0
2 1 2 8 2 6 0
3 1 3 6 1 5 0
4 1 4 5 1 4 0
5 1 5 4 1 3 0
6 1 6 3 0 3 0
# ℹ 4 more variables: technical_winners <int>, sb_name <chr>,
# winner_name <chr>, eliminated <chr>
If you wanted smaller, more focused output on what’s in the table then the names() and dim() functions will yield vectors of column names and the dimensions of the table.
names(episodes)
 [1] "series" "episode" "bakers_appeared"
 [4] "bakers_out" "bakers_remaining" "star_bakers"
 [7] "technical_winners" "sb_name" "winner_name"
[10] "eliminated"
dim(episodes)
[1] 94 10
Take note that the convention for table dimensions in R is first the number of rows (94) and then the number of variables or columns (10). If you’re using the RStudio IDE, then the special View() function will put a table into a spreadsheet-like view:
View(episodes)
Please note that if you’re using Quarto or R Markdown, having View() in a chunk is generally not a good idea if you intend to render the document for distribution (since the effect of View() is to provide a secondary ‘view’ of your data). It’s best to use it only in an interactive context.
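One possible way to keep View() out of rendered documents is to guard the call with base R’s interactive(). The helper below is a hypothetical convenience function (it is not part of any package), and mtcars stands in for whichever table you want to inspect.

```r
# Hypothetical helper: open the viewer only in an interactive session,
# and always return the data invisibly so the call can sit in a pipeline.
view_if_interactive <- function(data) {
  if (interactive()) {
    View(data)
  }
  invisible(data)
}

# In a rendered document this does nothing visible; in the console it opens the viewer
view_if_interactive(mtcars)
```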
Sometimes you may want to see just a small portion of your input data. We can use gt_preview() from the gt package to get the first x rows of data and the last y rows of data (these parameters can be set by the top_n and bottom_n arguments of gt_preview()). Let’s try it with the bakers dataset.
gt_preview(bakers)

| | series | baker | star_baker | technical_winner | technical_top3 | technical_bottom | technical_highest | technical_lowest | technical_median | series_winner | series_runner_up | total_episodes_appeared | first_date_appeared | last_date_appeared | first_date_us | last_date_us | percent_episodes_appeared | percent_technical_top3 | baker_full | age | occupation | hometown | baker_last | baker_first |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Annetha | 0 | 0 | 1 | 1 | 2 | 7 | 4.5 | 0 | 0 | 2 | 2010-08-17 | 2010-08-24 | NA | NA | 33.33333 | 50.00000 | Annetha Mills | 30 | Midwife | Essex | Mills | Annetha |
| 2 | 1 | David | 0 | 0 | 1 | 3 | 3 | 8 | 4.5 | 0 | 0 | 4 | 2010-08-17 | 2010-09-07 | NA | NA | 66.66667 | 25.00000 | David Chambers | 31 | Entrepreneur | Milton Keynes | Chambers | David |
| 3 | 1 | Edd | 0 | 2 | 4 | 1 | 1 | 6 | 2.0 | 1 | 0 | 6 | 2010-08-17 | 2010-09-21 | NA | NA | 100.00000 | 66.66667 | Edward "Edd" Kimber | 24 | Debt collector for Yorkshire Bank | Bradford | Kimber | Edward |
| 4 | 1 | Jasminder | 0 | 0 | 2 | 2 | 2 | 5 | 3.0 | 0 | 0 | 5 | 2010-08-17 | 2010-09-14 | NA | NA | 83.33333 | 40.00000 | Jasminder Randhawa | 45 | Assistant Credit Control Manager | Birmingham | Randhawa | Jasminder |
| 5 | 1 | Jonathan | 0 | 1 | 1 | 2 | 1 | 9 | 6.0 | 0 | 0 | 3 | 2010-08-17 | 2010-08-31 | NA | NA | 50.00000 | 33.33333 | Jonathan Shepherd | 25 | Research Analyst | St Albans | Shepherd | Jonathan |
| 6..119 | ||||||||||||||||||||||||
| 120 | 10 | Steph | 0 | 1 | 6 | 4 | 1 | 10 | 3.0 | 0 | 0 | 10 | NA | NA | NA | NA | 100.00000 | 60.00000 | Steph Blackwell | 28 | Shop assistant | Chester | Blackwell | Steph |
What you get by default is the first five rows and the last row of the bakers dataset. We can see that what’s not shown are rows 6 to 119 (it’s shown as 6..119 in the table stub).
If you wanted to show the first and last 10 rows of the bakers dataset, that’s not a problem. It can be accomplished with the top_n and bottom_n arguments, like this:
gt_preview(bakers, top_n = 10, bottom_n = 10)

| | series | baker | star_baker | technical_winner | technical_top3 | technical_bottom | technical_highest | technical_lowest | technical_median | series_winner | series_runner_up | total_episodes_appeared | first_date_appeared | last_date_appeared | first_date_us | last_date_us | percent_episodes_appeared | percent_technical_top3 | baker_full | age | occupation | hometown | baker_last | baker_first |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Annetha | 0 | 0 | 1 | 1 | 2 | 7 | 4.5 | 0 | 0 | 2 | 2010-08-17 | 2010-08-24 | NA | NA | 33.33333 | 50.00000 | Annetha Mills | 30 | Midwife | Essex | Mills | Annetha |
| 2 | 1 | David | 0 | 0 | 1 | 3 | 3 | 8 | 4.5 | 0 | 0 | 4 | 2010-08-17 | 2010-09-07 | NA | NA | 66.66667 | 25.00000 | David Chambers | 31 | Entrepreneur | Milton Keynes | Chambers | David |
| 3 | 1 | Edd | 0 | 2 | 4 | 1 | 1 | 6 | 2.0 | 1 | 0 | 6 | 2010-08-17 | 2010-09-21 | NA | NA | 100.00000 | 66.66667 | Edward "Edd" Kimber | 24 | Debt collector for Yorkshire Bank | Bradford | Kimber | Edward |
| 4 | 1 | Jasminder | 0 | 0 | 2 | 2 | 2 | 5 | 3.0 | 0 | 0 | 5 | 2010-08-17 | 2010-09-14 | NA | NA | 83.33333 | 40.00000 | Jasminder Randhawa | 45 | Assistant Credit Control Manager | Birmingham | Randhawa | Jasminder |
| 5 | 1 | Jonathan | 0 | 1 | 1 | 2 | 1 | 9 | 6.0 | 0 | 0 | 3 | 2010-08-17 | 2010-08-31 | NA | NA | 50.00000 | 33.33333 | Jonathan Shepherd | 25 | Research Analyst | St Albans | Shepherd | Jonathan |
| 6 | 1 | Lea | 0 | 0 | 0 | 1 | 10 | 10 | 10.0 | 0 | 0 | 1 | 2010-08-17 | 2010-08-17 | NA | NA | 16.66667 | 0.00000 | Lea Harris | 51 | Retired | Midlothian, Scotland | Harris | Lea |
| 7 | 1 | Louise | 0 | 0 | 0 | 1 | 4 | 4 | 4.0 | 0 | 0 | 2 | 2010-08-17 | 2010-08-24 | NA | NA | 33.33333 | 0.00000 | Louise Brimelow | 44 | Police Officer | Manchester | Brimelow | Louise |
| 8 | 1 | Mark | 0 | 0 | 0 | 0 | NA | NA | NA | 0 | 0 | 1 | 2010-08-17 | 2010-08-17 | NA | NA | 16.66667 | 0.00000 | Mark Whithers | 48 | Bus Driver | South Wales | Whithers | Mark |
| 9 | 1 | Miranda | 0 | 2 | 4 | 1 | 1 | 8 | 3.0 | 0 | 0 | 6 | 2010-08-17 | 2010-09-21 | NA | NA | 100.00000 | 66.66667 | Miranda Gore Browne | 37 | Food buyer for Marks & Spencer | Midhurst, West Sussex | Browne | Miranda |
| 10 | 1 | Ruth | 0 | 0 | 2 | 2 | 2 | 5 | 3.5 | 0 | 0 | 6 | 2010-08-17 | 2010-09-21 | NA | NA | 100.00000 | 33.33333 | Ruth Clemens | 31 | Retail manager/Housewife | Poynton, Cheshire | Clemens | Ruth |
| 11..110 | ||||||||||||||||||||||||
| 111 | 10 | David | 0 | 2 | 8 | 2 | 1 | 10 | 2.0 | 1 | 0 | 10 | NA | NA | NA | NA | 100.00000 | 80.00000 | David Atherton | 36 | International health adviser | Whitby | Atherton | David |
| 112 | 10 | Helena | 0 | 1 | 1 | 4 | 1 | 12 | 9.0 | 0 | 0 | 5 | NA | NA | NA | NA | 50.00000 | 20.00000 | Helena Garcia | 40 | Online project manager | Leeds | Garcia | Helena |
| 113 | 10 | Henry | 0 | 2 | 5 | 3 | 1 | 6 | 3.0 | 0 | 0 | 8 | NA | NA | NA | NA | 80.00000 | 62.50000 | Henry Bird | 20 | Student | Durham | Bird | Henry |
| 114 | 10 | Jamie | 0 | 0 | 0 | 2 | 11 | 13 | 12.0 | 0 | 0 | 2 | NA | NA | NA | NA | 20.00000 | 0.00000 | Jamie Finn | 20 | Part-time waiter | Surrey | Finn | Jamie |
| 115 | 10 | Michael | 0 | 0 | 0 | 7 | 4 | 11 | 6.0 | 0 | 0 | 7 | NA | NA | NA | NA | 70.00000 | 0.00000 | Michael Chakraverty | 26 | Theatre manager/fitness instructor | Stratford-upon-Avon | Chakraverty | Michael |
| 116 | 10 | Michelle | 0 | 0 | 0 | 5 | 5 | 8 | 6.0 | 0 | 0 | 5 | NA | NA | NA | NA | 50.00000 | 0.00000 | Michelle Evans-Fecci | 35 | Print shop administrator | Tenby, Wales | Evans-Fecci | Michelle |
| 117 | 10 | Phil | 0 | 0 | 1 | 3 | 3 | 10 | 7.0 | 0 | 0 | 4 | NA | NA | NA | NA | 40.00000 | 25.00000 | Phil Thorne | 56 | HGV driver | Rainham | Thorne | Phil |
| 118 | 10 | Priya | 0 | 0 | 1 | 5 | 2 | 10 | 7.0 | 0 | 0 | 6 | NA | NA | NA | NA | 60.00000 | 16.66667 | Priya O'Shea | 34 | Marketing consultant | Leicester | O'Shea | Priya |
| 119 | 10 | Rosie | 0 | 2 | 4 | 5 | 1 | 9 | 4.0 | 0 | 0 | 9 | NA | NA | NA | NA | 90.00000 | 44.44444 | Rosie Brandreth-Poynter | 28 | Veterinary surgeon | Somerset | Brandreth-Poynter | Rosie |
| 120 | 10 | Steph | 0 | 1 | 6 | 4 | 1 | 10 | 3.0 | 0 | 0 | 10 | NA | NA | NA | NA | 100.00000 | 60.00000 | Steph Blackwell | 28 | Shop assistant | Chester | Blackwell | Steph |
gt_preview() is a relatively simple function, but it comes in handy if you want a nicer display of the head and tail of a dataset.
1.2 Using glimpse() to Go Sideways
Inspecting rows of raw data isn’t always the best approach, but it can be useful for quickly understanding how the different variables fit together. The glimpse() function (accessible from the dplyr package) lets you look at the first few records of a dataset. This is somewhat like the head() function seen earlier but turned sideways:
glimpse(episodes)
Rows: 94
Columns: 10
$ series <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3…
$ episode <dbl> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4…
$ bakers_appeared <int> 10, 8, 6, 5, 4, 3, 12, 11, 10, 8, 7, 5, 4, 3, 12, 11…
$ bakers_out <int> 2, 2, 1, 1, 1, 0, 1, 1, 2, 1, 2, 1, 1, 0, 1, 1, 1, 1…
$ bakers_remaining <int> 8, 6, 5, 4, 3, 3, 11, 10, 8, 7, 5, 4, 3, 3, 11, 10, …
$ star_bakers <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 1, 1, 0, 0, 1, 1, 1, 1…
$ technical_winners <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ sb_name <chr> NA, NA, NA, NA, NA, NA, "Holly", "Jason", "Yasmin", …
$ winner_name <chr> NA, NA, NA, NA, NA, "Edd", NA, NA, NA, NA, NA, NA, N…
$ eliminated <chr> "Lea, Mark", "Annetha, Louise", "Jonathan", "David",…
Unlike the tibble view (with head() or not), you get to see all of the columns in the data table. The interesting thing about glimpse() is that it invisibly returns the data that’s given to it. Because of that, you can safely have one or several glimpse() calls in a data transformation pipeline and each of those will print the state of the data at different junctures.
episodes |>
  glimpse() |>
  select(series, episode, winner_name) |>
  filter(!is.na(winner_name)) |>
  glimpse()
Rows: 94
Columns: 10
$ series <dbl> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3…
$ episode <dbl> 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4…
$ bakers_appeared <int> 10, 8, 6, 5, 4, 3, 12, 11, 10, 8, 7, 5, 4, 3, 12, 11…
$ bakers_out <int> 2, 2, 1, 1, 1, 0, 1, 1, 2, 1, 2, 1, 1, 0, 1, 1, 1, 1…
$ bakers_remaining <int> 8, 6, 5, 4, 3, 3, 11, 10, 8, 7, 5, 4, 3, 3, 11, 10, …
$ star_bakers <int> 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 1, 1, 0, 0, 1, 1, 1, 1…
$ technical_winners <int> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ sb_name <chr> NA, NA, NA, NA, NA, NA, "Holly", "Jason", "Yasmin", …
$ winner_name <chr> NA, NA, NA, NA, NA, "Edd", NA, NA, NA, NA, NA, NA, N…
$ eliminated <chr> "Lea, Mark", "Annetha, Louise", "Jonathan", "David",…
Rows: 10
Columns: 3
$ series <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ episode <dbl> 6, 8, 10, 10, 10, 10, 10, 10, 10, 10
$ winner_name <chr> "Edd", "Joanne", "John", "Frances", "Nancy", "Nadiya", "Ca…
As can be seen above, the original dataset was printed with glimpse() and it was also passed to select() and filter() statements just before a final glimpse() call (to see the transformed data). The output is two glimpse() outputs stacked atop each other.
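The invisible return value is what makes this pattern work, and it can be checked directly. The short sketch below uses the built-in mtcars data frame as a stand-in and an illustrative variable name, returned.

```r
library(dplyr)

# glimpse() prints its sideways view as a side effect...
returned <- glimpse(mtcars)

# ...and hands back exactly the object it was given
identical(returned, mtcars)
```

Because the input comes back unchanged, dropping a glimpse() call into the middle of a pipeline should not alter the final result; it only adds printed output.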
1.3 Getting Data Summaries
Something else that’s very useful during the exploration phase of data work is the summary() function. It breaks each column of data down into its own summary.
summary(episodes)
series episode bakers_appeared bakers_out
Min. : 1.000 Min. : 1.000 Min. : 3.000 Min. :0.0000
1st Qu.: 3.250 1st Qu.: 3.000 1st Qu.: 5.000 1st Qu.:1.0000
Median : 6.000 Median : 5.000 Median : 7.000 Median :1.0000
Mean : 5.766 Mean : 5.287 Mean : 7.553 Mean :0.9468
3rd Qu.: 8.000 3rd Qu.: 8.000 3rd Qu.:10.000 3rd Qu.:1.0000
Max. :10.000 Max. :10.000 Max. :13.000 Max. :2.0000
bakers_remaining star_bakers technical_winners sb_name
Min. : 3.000 Min. :0.0000 Min. :0.0000 Length:94
1st Qu.: 4.000 1st Qu.:1.0000 1st Qu.:1.0000 Class :character
Median : 6.500 Median :1.0000 Median :1.0000 Mode :character
Mean : 6.606 Mean :0.8404 Mean :0.9894
3rd Qu.: 9.000 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :12.000 Max. :2.0000 Max. :1.0000
winner_name eliminated
Length:94 Length:94
Class :character Class :character
Mode :character Mode :character
For columns that are numeric, the summary() function automatically calculates the following summary statistics for each column of the table:
- Min: The minimum value
- 1st Qu: The first quartile value (25th percentile)
- Median: The median value
- Mean: The arithmetic mean
- 3rd Qu: The third quartile value (75th percentile)
- Max: The maximum value
There are a few character columns in the episodes dataset (sb_name, winner_name, and eliminated) and summary() doesn’t do much with those other than state that they are indeed of the character class. For numeric columns that contain NA values, summary() also reports how many there are on a column-by-column basis. It doesn’t do this for character columns, so any missing values in them go unreported here.
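To see how summary() treats missing values, here is a small sketch with made-up vectors (scores and tags are illustrative names, not columns from episodes). It also shows a direct way to count NA values that works for any column type.

```r
# A numeric vector with one missing value: the summary gains an "NA's" entry
scores <- c(3, 5, NA, 6)
num_summary <- summary(scores)
print(num_summary)

# A character vector with a missing value: no quartiles are computed for it
tags <- c("a", NA, "g", "b")
print(summary(tags))

# Counting missing values directly works regardless of the type
sum(is.na(tags))
```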
For a more comprehensive look at a dataset, the skim() function from the skimr package offers a report that is broken down by variable type. Using skim() with the episodes dataset from bakeoff gives us an overall data summary, information for the character variables (sb_name, winner_name, and eliminated) such as n_missing and complete_rate, and summary statistics for the numeric variables.
skim(episodes)
── Data Summary ────────────────────────
Values
Name episodes
Number of rows 94
Number of columns 10
_______________________
Column type frequency:
character 3
numeric 7
________________________
Group variables None
── Variable type: character ───────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 sb_name 16 0.830 3 12 0 47 0
2 winner_name 84 0.106 3 7 0 10 0
3 eliminated 13 0.862 3 16 0 76 0
── Variable type: numeric ─────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 series 0 1 5.77 2.77 1 3.25 6 8 10 ▆▇▇▇▇
2 episode 0 1 5.29 2.83 1 3 5 8 10 ▇▇▇▇▆
3 bakers_appeared 0 1 7.55 2.97 3 5 7 10 13 ▇▅▅▅▃
4 bakers_out 0 1 0.947 0.472 0 1 1 1 2 ▂▁▇▁▁
5 bakers_remaining 0 1 6.61 2.80 3 4 6.5 9 12 ▇▅▅▅▃
6 star_bakers 0 1 0.840 0.396 0 1 1 1 2 ▂▁▇▁▁
7 technical_winners 0 1 0.989 0.103 0 1 1 1 1 ▁▁▁▁▇
The numeric variable types really get the deluxe treatment here with a statistical summary consisting of the mean, the standard deviation (sd), percentile values, and little histograms! It doesn’t take very long at all to get such a summary so it’s worth it every time for new and unfamiliar datasets.
1.4 Rolling Our Own Tabular Data
Creating your own tabular data can be really useful for sharing (especially when you need to create a particular one for debugging something) and for having a table of manageable size for learning purposes. To that end, we’ll learn how to make our own tibbles from scratch. Although we customarily get our data from other sources (e.g., CSV files, database tables, Excel files, etc.), there are a few good reasons for wanting to handcraft our own tibble objects:
- To have simple tables for experimentation with functions that operate on tabular data
- To reduce the need to use Excel or some other data entry systems (for small enough data)
- To create small tables that interface with larger tables (e.g., joining, combining, etc.)
- To gain a better understanding of how tibbles work under the hood
We can create tibbles in a few different ways but let’s focus on tibble construction using either of two functions from the tibble package (both are also made available when dplyr is loaded): tibble() and the similarly-named tribble().
1.4.1 Creating Tibbles with the tibble() Function
Let’s have a look at a few examples of tibble-making first with tibble(), which takes in named vectors as arguments. In the following example, we use two equal-length vectors (called a and b).
Using tibble() with equal-length vectors to make a tibble.
Notes on the Code
L.2 This will become column a.
L.3 This is to be column b.
# A tibble: 4 × 2
a b
<dbl> <chr>
1 3 a
2 5 b
3 2 g
4 6 b
As can be seen, the type of each column is based on the type of the vector. The order of columns in the output table is based on the order of the names provided inside tibble().
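A minimal sketch of a tibble() call that yields the tibble printed above, with the values inferred from that output. The object name tbl_ab is only for illustration.

```r
library(tibble)

tbl_ab <- tibble(
  a = c(3, 5, 2, 6),           # numeric vector, becomes column a (<dbl>)
  b = c("a", "b", "g", "b")    # character vector, becomes column b (<chr>)
)

tbl_ab
```

Swapping the order of the two arguments would swap the order of the columns in the result.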
Let’s make another tibble in a similar manner, but with a single value for a (the value 3 will be repeated down its column).
Using tibble() with two vectors: one of length 1 and the other of length 4.
Notes on the Code
L.2 Only one value for a! That's okay, it will be repeated.
L.3 This will become column b, a column of character-based values.
# A tibble: 4 × 2
a b
<dbl> <chr>
1 3 a
2 3 b
3 3 g
4 3 b
In the printed tibble the value 3 in column a is indeed repeated down.
The key is to provide either n-length vectors (n here signifies the total number of rows in the table) or some combination of n-length and length-1 vectors. The length-1 value will be repeated down. A vector of any other length (for example, length 2 when n is 4) will result in an error.
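The recycling rule can be sketched as follows, with values matching the output shown earlier. The object names recycled and size_error are illustrative, and tryCatch() is used here only so that the deliberate error doesn’t halt the script.

```r
library(tibble)

# A length-1 value is recycled to match the length-4 vector
recycled <- tibble(
  a = 3,
  b = c("a", "b", "g", "b")
)
recycled

# A length-2 vector cannot be recycled to length 4, so tibble() raises an error;
# tryCatch() captures the condition instead of stopping the script
size_error <- tryCatch(
  tibble(a = c(3, 5), b = c("a", "b", "g", "b")),
  error = function(e) e
)

# TRUE when the call above failed as expected
inherits(size_error, "error")
```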
We can also pass in NA (missing) values by including NAs in the appropriate vector. In the next example, we incorporate NA values in the two n-length vectors.
Using tibble() with two vectors that contain NA values.
Notes on the Code
L.2 We intentionally placed an NA value among other values in column a.
L.3 There is also an NA value in the b column.
# A tibble: 4 × 2
a b
<dbl> <chr>
1 3 a
2 5 <NA>
3 2 g
4 NA b
The resulting tibble here shows that those NA values in the numeric and character input vectors appear in the output tibble in their expected locations.
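A sketch of a call consistent with the output above, with the NA positions read off the printed tibble. The name with_na is illustrative.

```r
library(tibble)

with_na <- tibble(
  a = c(3, 5, 2, NA),          # NA in the fourth position of the numeric vector
  b = c("a", NA, "g", "b")     # NA in the second position of the character vector
)

with_na
```

Inside c(), a bare NA takes on the type of its neighbors, which is why column a stays <dbl> and column b stays <chr>.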
In the next code listing, an NA value is used in a vector of length 1. What will happen? Will the NA values be repeated down in the column? Let’s have a look.
Using a single-length vector with an NA value in tibble().
Notes on the Code
L.2 Using a single NA (and nothing else) gives us a certain type of NA: a logical NA (yes, there are different types).
# A tibble: 4 × 2
a b
<lgl> <chr>
1 NA a
2 NA b
3 NA g
4 NA b
Yes. The NA is repeated down the a column. We can see that column a’s type is <lgl>, or, logical.
Using just NA in a column does result in repeated NAs; however, the column is classified as a logical column (which is meant for TRUE or FALSE values, and is likely not what was intended). If we want this column to be a character column, we should use a specific type of NA: NA_character_. (There are other missing value constants for other types: NA_real_, NA_integer_, and NA_complex_.) Let’s replace a = NA with a = NA_character_:
Using a single-length vector with an NA_character_ value in tibble().
Notes on the Code
L.2 We are now being specific about the type of NAs we want (the character version).
# A tibble: 4 × 2
a b
<chr> <chr>
1 <NA> a
2 <NA> b
3 <NA> g
4 <NA> b
And we get a column type of <chr> for a, which is what we wanted.
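The difference between the two kinds of NA can be sketched side by side. The calls are inferred from the two outputs above, and the names lgl_na and chr_na are illustrative.

```r
library(tibble)

# A bare NA is a logical missing value, so column a becomes <lgl>
lgl_na <- tibble(a = NA, b = c("a", "b", "g", "b"))

# NA_character_ is a character missing value, so column a becomes <chr>
chr_na <- tibble(a = NA_character_, b = c("a", "b", "g", "b"))

class(lgl_na$a)
class(chr_na$a)
```

The same idea should carry over to NA_real_ and NA_integer_ when a numeric or integer column of missing values is needed.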
1.4.2 Creating Tibbles a Different Way with the tribble() Function
We can use the tribble() function as an alternative constructor for tibble objects. This next example with tribble() reproduces a tibble generated in a previous code listing:
Creating a tibble using the tribble() function.
Notes on the Code
L.1 As tribble() is very close in spelling to tibble(), be a little careful here.
L.2 The column names are prepended by the tilde character, and we don't use quotes.
L.6 The last (hanging) comma here is fine to keep. It won't result in an error.
tribble(
  ~a, ~b,
  3, "a",
  5, "b",
  2, "g",
  6, "b",
)
# A tibble: 4 × 2
a b
<dbl> <chr>
1 3 a
2 5 b
3 2 g
4 6 b
The resulting tibble appears just as we laid out the values. As can be seen in the code listing, the table values aren’t provided as vectors but instead are laid out as column names and values in a manner that approximates the structure of a table. Importantly, the column names are preceded by a tilde (~) character, and commas separate all values. This way of building a simple tibble can be useful when having values side-by-side is important for minimizing the potential for error.
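As a quick check that the two constructors describe the same table, the sketch below builds it both ways and compares the results with base R’s all.equal(). The object names from_tribble and from_tibble are illustrative.

```r
library(tibble)

# Row-by-row layout
from_tribble <- tribble(
  ~a, ~b,
  3, "a",
  5, "b",
  2, "g",
  6, "b"
)

# Column-by-column layout
from_tibble <- tibble(
  a = c(3, 5, 2, 6),
  b = c("a", "b", "g", "b")
)

# Compares column names, types, and values
all.equal(from_tribble, from_tibble)
```

Which constructor to use is mostly a matter of readability: tribble() tends to suit small tables typed in by hand, while tibble() fits better when the columns already exist as vectors.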
1.5 Summary
- There are many datasets kicking around in R packages; after loading the package (and discovering the datasets) you simply use the dataset name to print it
- You can use a number of functions to look at a dataset that’s a table: print(), head(), dplyr’s glimpse(), View() in RStudio, gt’s gt_preview(), and skimr’s skim()
- Get the column names of a table with names() or colnames(); get the dimensions with dim()
- The base R function summary() gives you a nice summary of a table and it’s always there for you
- You can easily make your own tibbles with either tibble() or tribble()