10  Making Datasets Longer

Explore: The first steps with a new pattern (e.g., getting the info you need, taking stock, performing smell checks).

Understand: Learning anything involves asking questions (what kinds? how do we translate them to code? how do we interpret the outputs?); answering them may lead to more exploration.

Explain: How do we reconcile the answers with our investigations? That’s what this pattern covers: documenting findings for oneself and generating programmatic text for resilient reporting. In a way, this is really about talking to yourself.

Share: Communication and collaboration. We’ll show you ways you can make everything presentable and create outputs in a format that allows for easy consumption and feedback. An example: reordering plots or tables to make them easier to interpret.

Tidy (and long) data is needed for ggplot-ing. Long (and tidy) data is needed to make those ggplots. If your data happens to be wide, or not long enough, you can try to use ggplot but you might end up a little sad. Non-tidy data might be workable for some plots, but it will make many other types of plots seriously difficult. In this chapter, we’ll get the wide-to-long table reshaping pattern down so you won’t end up disappointed that you can’t make cool plots for your reporting work. You’ll instead know how to use pivot_longer() like a boss and look damn good whilst doing it!

As always, we need packages. Here are the ones we’ll be using for this chapter:
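Judging from the functions and datasets used throughout the chapter, the loading statements would look something like this (the tidyverse bundle supplies tidyr, dplyr, and ggplot2; the exact list in the original may differ):

```r
library(tidyverse)  # tidyr (pivot_longer(), unite()), dplyr, ggplot2
library(bakeoff)    # provides the spice_test_wide dataset
library(gt)         # provides the towny and illness datasets
```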

10.1 Explore

The bakeoff package has so many great datasets. One that’s plum perfect for this chapter is the spice_test_wide table. It even has ‘wide’ in the name so you know it’s topical. Let’s have a look at it by printing it out:

spice_test_wide
# A tibble: 4 × 7
  baker  guess_1  guess_2                  guess_3 correct_1 correct_2 correct_3
  <chr>  <chr>    <chr>                    <chr>       <dbl>     <dbl>     <dbl>
1 Emma   Cinnamon Cloves                   Nutmeg          1         0         1
2 Harry  Cinnamon Cardamom                 Nutmeg          1         1         1
3 Ruby   Cinnamon Cumin                    Nutmeg          1         0         1
4 Zainab Cardamom Couldn't remember the n… Cinnam…         0         0         0

This is a small one! Only 4 rows and 7 columns, which is perfect for our exploration into data lengthening. The tell-tale sign of tabular data being too wide is repetition in column names (e.g., guess_1, guess_2, etc.). What we want instead is a row for each observation. Here is a proposal for new columns (and what they would contain) that would probably make this table tidier:

  • baker: the baker name (same as the original, wide table)
  • guess_number: an index value for the guess (possible values: 1, 2, or 3)
  • guess_response: the response given by the baker when prompted
  • guess_correct: was the response the correct answer? (TRUE or FALSE)

The challenge here (and this is common in this pattern) is that some data is encoded in the column names. The numbers in the column names refer to the guess_number we want. Because of this, we have to use pattern matching with pivot_longer() to smartly break apart the column names into components.

Thankfully, the argument called names_pattern is available and ready for this type of operation. The pattern we need to use is "(.+)_(.+)". This matches some non-zero sequence of characters to the left of the "_", and a sequence of characters to the right of the underscore character. The parentheses represent capture groups, where the one on the left is first and the one on the right is second (it’s all very logical). This works in conjunction with the names_to argument. Here, we provide a two-element vector where the first element is the ".value" keyword and the second is "number". The ".value" keyword indicates to tidyr that the text fragment before the "_" should become a column name (this produces the guess and correct columns), while the second element names a new column ("number") that receives the text after the "_". It’s best to see this in action:

spice_test_long <-
  spice_test_wide |>
  pivot_longer(
    cols = -baker,
    names_to = c(".value", "number"),
    names_pattern = "(.+)_(.+)"
  )

spice_test_long
# A tibble: 12 × 4
   baker  number guess                      correct
   <chr>  <chr>  <chr>                        <dbl>
 1 Emma   1      Cinnamon                         1
 2 Emma   2      Cloves                           0
 3 Emma   3      Nutmeg                           1
 4 Harry  1      Cinnamon                         1
 5 Harry  2      Cardamom                         1
 6 Harry  3      Nutmeg                           1
 7 Ruby   1      Cinnamon                         1
 8 Ruby   2      Cumin                            0
 9 Ruby   3      Nutmeg                           1
10 Zainab 1      Cardamom                         0
11 Zainab 2      Couldn't remember the name       0
12 Zainab 3      Cinnamon                         0

That pivot_longer() statement got us really close to where we need to be! We’d rather have the column names be a bit different though (we planned to have guess_number, guess_response, and guess_correct). We can easily use dplyr’s rename() to take care of all that. Two more things:

  1. The correct column (renaming that to guess_correct) should have logical values
  2. The number column (renaming that to guess_number) should have integer values

Those last two touch-ups can be handled with dplyr’s mutate() function. Here’s what’s needed to finish off this long table.

spice_test_long <-
  spice_test_long |>
  rename(
    guess_number = number,
    guess_response = guess,
    guess_correct = correct
  ) |>
  mutate(
    guess_correct = as.logical(guess_correct),
    guess_number = as.integer(guess_number)
  )

spice_test_long
# A tibble: 12 × 4
   baker  guess_number guess_response             guess_correct
   <chr>         <int> <chr>                      <lgl>        
 1 Emma              1 Cinnamon                   TRUE         
 2 Emma              2 Cloves                     FALSE        
 3 Emma              3 Nutmeg                     TRUE         
 4 Harry             1 Cinnamon                   TRUE         
 5 Harry             2 Cardamom                   TRUE         
 6 Harry             3 Nutmeg                     TRUE         
 7 Ruby              1 Cinnamon                   TRUE         
 8 Ruby              2 Cumin                      FALSE        
 9 Ruby              3 Nutmeg                     TRUE         
10 Zainab            1 Cardamom                   FALSE        
11 Zainab            2 Couldn't remember the name FALSE        
12 Zainab            3 Cinnamon                   FALSE        

Now, this was a particularly challenging exercise in transforming a wide dataset to a long, tidy form. But this type of table might be encountered every so often (it’s somewhat common within public datasets) so the lessons learned here could prove to be valuable.

10.2 Understand

The gt package has quite a few datasets that are considered to be wide. This is because that package is concerned with generating summary tables, and the tables often end up a bit on the wide side before tabulation (this is natural). Let’s take advantage of the abundance of wide tables in that package and use the towny dataset for our next few examples. With that dataset, we’ll be able to better understand pivot_longer() and the process of lengthening datasets in general.

Let’s look at towny, by printing it out:

towny
# A tibble: 414 × 25
   name      website status csd_type census_div latitude longitude land_area_km2
   <chr>     <chr>   <chr>  <chr>    <chr>         <dbl>     <dbl>         <dbl>
 1 Addingto… https:… lower… township Lennox an…     45       -77.2        1294. 
 2 Adelaide… https:… lower… township Middlesex      43.0     -81.7         331. 
 3 Adjala-T… https:… lower… township Simcoe         44.1     -79.9         372. 
 4 Admaston… https:… lower… township Renfrew        45.5     -76.9         520. 
 5 Ajax      https:… lower… town     Durham         43.9     -79.0          66.6
 6 Alberton  https:… singl… township Rainy Riv…     48.6     -93.5         117. 
 7 Alfred a… https:… lower… township Prescott …     45.6     -74.9         392. 
 8 Algonqui… https:… lower… township Haliburton     45.4     -78.8        1000. 
 9 Alnwick/… https:… lower… township Northumbe…     44.1     -78.0         398. 
10 Amaranth  https:… lower… township Dufferin       44.0     -80.2         265. 
# ℹ 404 more rows
# ℹ 17 more variables: population_1996 <int>, population_2001 <int>,
#   population_2006 <int>, population_2011 <int>, population_2016 <int>,
#   population_2021 <int>, density_1996 <dbl>, density_2001 <dbl>,
#   density_2006 <dbl>, density_2011 <dbl>, density_2016 <dbl>,
#   density_2021 <dbl>, pop_change_1996_2001_pct <dbl>,
#   pop_change_2001_2006_pct <dbl>, pop_change_2006_2011_pct <dbl>, …

Alright, that tibble is packed full of information, and it’s totally fine to simplify the table so our use of pivot_longer() runs more smoothly. Let’s do that; here’s the vision for the long table:

  1. Take only the name, census_div, and population_* columns
  2. Incorporate a year column, a population column, and a density column

How do we do this? One way is to break the problem down and make two long tables, then join them together. It’s not a bad strategy so let’s go with it. Let’s start by making a long table that focuses on populations for the different measurement years.

towny_pops <-
  towny |>
  select(name, census_div, starts_with("population")) |>
  pivot_longer(
    cols = starts_with("population"),
    names_to = "year",
    names_prefix = "population_",
    values_to = "population"
  ) |>
  mutate(year = as.integer(year))

towny_pops
# A tibble: 2,484 × 4
   name                census_div            year population
   <chr>               <chr>                <int>      <int>
 1 Addington Highlands Lennox and Addington  1996       2429
 2 Addington Highlands Lennox and Addington  2001       2402
 3 Addington Highlands Lennox and Addington  2006       2512
 4 Addington Highlands Lennox and Addington  2011       2517
 5 Addington Highlands Lennox and Addington  2016       2318
 6 Addington Highlands Lennox and Addington  2021       2534
 7 Adelaide Metcalfe   Middlesex             1996       3128
 8 Adelaide Metcalfe   Middlesex             2001       3149
 9 Adelaide Metcalfe   Middlesex             2006       3135
10 Adelaide Metcalfe   Middlesex             2011       3028
# ℹ 2,474 more rows

Nice! That worked out quite well. Just looking at the first ten rows shows that the pivot_longer() expression used was correct. In the above, we used the names_prefix argument (giving it the value of "population_"). This snips out the "population_" from each value that goes in the year column. And because the values in year came from column names, a mutate() statement, with as.integer(), was used to convert the character-based years to integer values.
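As a small variation (a sketch, not what the chapter runs next), tidyr also offers a names_transform argument that can fold the as.integer() conversion into the pivot itself, making the trailing mutate() unnecessary:

```r
towny |>
  select(name, census_div, starts_with("population")) |>
  pivot_longer(
    cols = starts_with("population"),
    names_to = "year",
    names_prefix = "population_",
    names_transform = list(year = as.integer),  # convert "1996" -> 1996L during the pivot
    values_to = "population"
  )
```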

Let’s now make an analogous, long table for the density values. It’ll be pretty much the same except for the last column (using density instead of population). Not surprisingly, the statements to create the towny_density table are strikingly similar to the ones just above. It’s also reassuring to see that the row count for this table is the same as in the last one.

towny_density <-
  towny |>
  select(name, census_div, starts_with("density")) |>
  pivot_longer(
    cols = starts_with("density"),
    names_to = "year",
    names_prefix = "density_",
    values_to = "density"
  ) |>
  mutate(year = as.integer(year))

towny_density
# A tibble: 2,484 × 4
   name                census_div            year density
   <chr>               <chr>                <int>   <dbl>
 1 Addington Highlands Lennox and Addington  1996    1.88
 2 Addington Highlands Lennox and Addington  2001    1.86
 3 Addington Highlands Lennox and Addington  2006    1.94
 4 Addington Highlands Lennox and Addington  2011    1.95
 5 Addington Highlands Lennox and Addington  2016    1.79
 6 Addington Highlands Lennox and Addington  2021    1.96
 7 Adelaide Metcalfe   Middlesex             1996    9.45
 8 Adelaide Metcalfe   Middlesex             2001    9.51
 9 Adelaide Metcalfe   Middlesex             2006    9.47
10 Adelaide Metcalfe   Middlesex             2011    9.14
# ℹ 2,474 more rows

Now, it’s joining time! We have a fine selection of joins to choose from. This time, however, we’ll use the full join through use of the dplyr full_join() function. The by argument here is crucial to get correct. We can supply a vector of column names that are common to both tables. This will make full_join() perform a natural join. The columns that are common to both tables are "name", "census_div", and "year" so those’ll be used in the by argument.

towny_pops_density <-
  towny_pops |>
  full_join(towny_density, by = c("name", "census_div", "year"))

towny_pops_density
# A tibble: 2,484 × 5
   name                census_div            year population density
   <chr>               <chr>                <int>      <int>   <dbl>
 1 Addington Highlands Lennox and Addington  1996       2429    1.88
 2 Addington Highlands Lennox and Addington  2001       2402    1.86
 3 Addington Highlands Lennox and Addington  2006       2512    1.94
 4 Addington Highlands Lennox and Addington  2011       2517    1.95
 5 Addington Highlands Lennox and Addington  2016       2318    1.79
 6 Addington Highlands Lennox and Addington  2021       2534    1.96
 7 Adelaide Metcalfe   Middlesex             1996       3128    9.45
 8 Adelaide Metcalfe   Middlesex             2001       3149    9.51
 9 Adelaide Metcalfe   Middlesex             2006       3135    9.47
10 Adelaide Metcalfe   Middlesex             2011       3028    9.14
# ℹ 2,474 more rows

The result is a table with the same number of rows (2,484) as the previous two. We have all the columns we wanted, too. This is what success looks like.
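If you want extra assurance that the join behaved (a quick sketch of some checks, not part of the original pipeline), you can confirm the row count and look for unmatched keys:

```r
# a full join should add no rows here, since the key sets are identical
nrow(towny_pops_density) == nrow(towny_pops)  # expect TRUE

# rows in towny_pops with no density match (expect an empty tibble)
anti_join(towny_pops, towny_density, by = c("name", "census_div", "year"))
```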

If we wanted to, though, we could make the table even longer! Instead of having population and density columns we could instead have measure and value columns. The measure column would contain either "population" or "density" and value would have either a population value or a density value (corresponding to the label within measure).

To do this, the first step is to keep only the needed columns in towny. We’ll use select() to keep the name, census_div, population_*, and density_* columns. Then, we’ll pivot_longer() and that’s it! Here’s the necessary code and resultant table:

towny |>
  select(
    name, census_div,
    starts_with("population"),
    starts_with("density")
  ) |>
  pivot_longer(
    cols = c(starts_with("population"), starts_with("density")),
    names_to = "measure",
    names_pattern = "(.+)_",
    values_to = "value"
  )
# A tibble: 4,968 × 4
   name                census_div           measure      value
   <chr>               <chr>                <chr>        <dbl>
 1 Addington Highlands Lennox and Addington population 2429   
 2 Addington Highlands Lennox and Addington population 2402   
 3 Addington Highlands Lennox and Addington population 2512   
 4 Addington Highlands Lennox and Addington population 2517   
 5 Addington Highlands Lennox and Addington population 2318   
 6 Addington Highlands Lennox and Addington population 2534   
 7 Addington Highlands Lennox and Addington density       1.88
 8 Addington Highlands Lennox and Addington density       1.86
 9 Addington Highlands Lennox and Addington density       1.94
10 Addington Highlands Lennox and Addington density       1.95
# ℹ 4,958 more rows

This is twice as many rows as the previous long table (4,968 rows!). As we did before with the lengthening of the spice_test_wide dataset, we used the names_pattern argument (pattern used: "(.+)_").
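To convince yourself of what "(.+)_" captures, you can test the regex directly with stringr’s str_match() (stringr loads with the tidyverse; this demo is a sketch, not part of the original pipeline). The greedy (.+) grabs everything up to the last underscore:

```r
# column 2 of the str_match() matrix holds the first capture group
str_match(c("population_1996", "density_2021"), "(.+)_")[, 2]
#> [1] "population" "density"
```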

10.3 Explain

Now that we have explored small and wide data tables (changing them to long versions) and experimented with a fairly large (yet wide) table, let’s work with a wide dataset with measurement data. This one is called illness and it comes from the gt package. Let’s look at it:

illness
# A tibble: 39 × 11
   test     units  day_3  day_4  day_5  day_6  day_7  day_8  day_9 norm_l norm_u
   <chr>    <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Viral l… copi… 1.20e4 4.2 e3 1.6 e3 8.3 e2 7.6 e2 5.2 e2 250        NA   NA  
 2 WBC      x10^… 5.26e0 4.26e0 9.92e0 1.05e1 2.48e1 3.03e1  19.0       4   10  
 3 Neutrop… x10^… 4.87e0 4.72e0 7.92e0 1.82e1 2.21e1 2.72e1  16.6       2    8  
 4 RBC      x10^… 5.72e0 5.98e0 4.23e0 4.83e0 4.12e0 2.68e0   3.32      4    5.5
 5 Hb       g / L 1.53e2 1.35e2 1.26e2 1.15e2 7.5 e1 8.7 e1  95       120  160  
 6 PLT      x10^… 6.7 e1 3.86e1 2.74e1 2.62e1 7.41e1 3.62e1  25.6     100  300  
 7 ALT      U/L   1.28e4 1.26e4 6.43e3 4.26e3 1.62e3 6.73e2 512.        9   50  
 8 AST      U/L   2.37e4 2.14e4 1.47e4 8.69e3 2.19e3 1.14e3 782.       15   40  
 9 TBIL     umol… 1.17e2 1.44e2 1.37e2 1.58e2 1.27e2 1.05e2 163.        0   18.8
10 DBIL     umol… 7.14e1 1.05e2 9.46e1 1.44e2 1.18e2 8.36e1 126.        0    6.8
# ℹ 29 more rows

Let’s waste no time and make this long! Here’s the code:

illness_long <-
  illness |>
  select(-starts_with("norm")) |>
  pivot_longer(
    cols = starts_with("day"),
    names_to = "day",
    names_pattern = "_(.+)",
    values_to = "value"
  ) |>
  unite(col = "test", test, units, sep = ", ") |>
  mutate(day = as.integer(day))

illness_long
# A tibble: 273 × 3
   test                        day    value
   <chr>                     <int>    <dbl>
 1 Viral load, copies per mL     3 12000   
 2 Viral load, copies per mL     4  4200   
 3 Viral load, copies per mL     5  1600   
 4 Viral load, copies per mL     6   830   
 5 Viral load, copies per mL     7   760   
 6 Viral load, copies per mL     8   520   
 7 Viral load, copies per mL     9   250   
 8 WBC, x10^9 / L                3     5.26
 9 WBC, x10^9 / L                4     4.26
10 WBC, x10^9 / L                5     9.92
# ℹ 263 more rows

We adjusted the value for the names_pattern argument to "_(.+)" so that the number part of the "day_*" column names goes into the new day column. The tidyr package has a useful function called unite(). We’re using that above to join the test and units columns together. In that way, we get the name of the test and the units for that test in the same column (called test again).
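If unite() is new to you, here’s a minimal, standalone sketch of what it does (toy data, not from the chapter):

```r
# the two character columns collapse into one, joined by the separator
tibble(test = "Hb", units = "g / L") |>
  unite(col = "test", test, units, sep = ", ")
# the single remaining column, test, holds "Hb, g / L"
```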

10.4 Share

Now that we have long data in the illness_long object, let’s make a ggplot. We’ll first cut down the number of test results to be plotted (with filter()). Then, facet_wrap() will be used to facet the plots by test.

illness_long |>
  filter(grepl("PLT|ALT|AST|TBIL|DBIL|PT|APTT|PTA|DD", test)) |>
  ggplot(mapping = aes(x = day, y = value)) +
  geom_line() +
  facet_wrap(
    facets = vars(test),
    scales = "free_y",
    ncol = 3
  )

This doesn’t look too bad, so long as the number of plots included is reasonably limited. There are situations where many more facets could be used but, in this case, we have lengthy labels and different y-axis scales (two things that take up more space).

While the plot above can certainly be improved, the thing we really want to be aware of is that the illness_long table (being long and tidy) was prepared for this type of plot. The only extra data manipulation operation involved was using filter() to limit the number of plots shown. Attempting to make a similar plot from the illness dataset without reshaping is difficult with ggplot, so this pattern is essential if you want to plot data that comes to you in a wide form.