Explore: The first steps with a new dataset (e.g., getting the info you need, taking stock, performing smell tests).
Understand: Learning anything involves asking questions (What kinds? How do we translate them into code? How do we interpret the outputs?), and the answers may well lead to more exploration.
Explain: How do we reconcile the answers with our investigations? That’s covered in this pattern: how to document findings for oneself and write programmatic text for resilient reporting. This is really about talking to yourself, in a way.
Share: Communication and collaboration. We’ll show you ways you can make everything presentable and create outputs in a format that allows for easy consumption and feedback. An example: reordering plots or tables to make them easier to interpret.
Wide data… why would you want it? It’s awesome for summary tables: display tables simply present better when they have a bit of width to them. Long data is great for plotting, no question, but we won’t do any of that in this chapter. This chapter is dedicated to summary tables and all of their charms.
Here’s the collection of packages needed for this chapter:
tidyverse: for the use of the dplyr and tidyr packages
gt: for its datasets and table-making capabilities
bakeoff: for the ratings dataset
palmerpenguins: for the penguins dataset
janitor: for cleaning column names
11.1 Explore
Let’s get a little practice in making wide tables from long ones. The bakeoff package’s ratings dataset is really interesting, as it contains a wealth of information concerning viewership for every episode across the show’s ten series. Let’s have a look at that dataset once again by printing the tibble.
ratings
# A tibble: 94 × 11
   series episode uk_airdate viewers_7day viewers_28day network_rank
    <dbl>   <dbl> <date>            <dbl>         <dbl>        <dbl>
 1      1       1 2010-08-17         2.24             7           NA
 2      1       2 2010-08-24         3                3           NA
 3      1       3 2010-08-31         3                2           NA
 4      1       4 2010-09-07         2.6              4           NA
 5      1       5 2010-09-14         3.03             1           NA
 6      1       6 2010-09-21         2.75             1           NA
 7      2       1 2011-08-16         3.1              2           NA
 8      2       2 2011-08-23         3.53             2           NA
 9      2       3 2011-08-30         3.82             1           NA
10      2       4 2011-09-06         3.6              1           NA
# ℹ 84 more rows
# ℹ 5 more variables: channels_rank <dbl>, bbc_iplayer_requests <dbl>,
#   episode_count <dbl>, us_season <dbl>, us_airdate <chr>
Let’s make it so that we get the following wide-table structure:
Each row represents a series (from 1 to 10)
The 7-day viewership data (in millions of viewers) will appear in columns, one for each episode in a series
All other data (like airdate, rankings, etc.) will be omitted
We start by keeping only the data columns we need. In this case, that’s the series, episode, and viewers_7day columns (and we do this with dplyr’s select() function). Then, it’s a matter of using pivot_wider() with the names_from and values_from arguments.
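A minimal sketch of that pipeline (we assign the result to ratings_by_series, the name this table goes by later in the Share section):

ratings_by_series <-
  ratings |>
  select(series, episode, viewers_7day) |>   # keep only the columns we need
  pivot_wider(
    names_from = episode,        # new column names come from the episode numbers
    values_from = viewers_7day   # cell values are the 7-day viewership figures
  )

ratings_by_series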
It might be weird to have column names that are numbers. If you find that strange and would rather have more text-y column names, we can use the names_prefix argument to change all that. In this alternate bit of code, we do pretty much the same operations as before, except that names_prefix = "episode_" is employed to prefix each new column name with episode_.
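A sketch of that alternate code:

ratings |>
  select(series, episode, viewers_7day) |>
  pivot_wider(
    names_from = episode,
    values_from = viewers_7day,
    names_prefix = "episode_"   # columns become episode_1, episode_2, ...
  )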
The original dataset has viewership numbers represented in millions, so a value like 3.85 really means 3,850,000. If you wanted to expand these values to actual counts, you could mutate all of the columns with dplyr, multiplying each by 1e6 (one million). Or, you could do it right in the pivot_wider() statement! With the values_fn argument, you can pass in a function that takes x (any input value) and returns a transformed version of it. We’ll use function(x) x * 1e6 for values_fn, and what we’ll get is a wide table with all of the viewership numbers in the millions (i.e., big numbers).
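Here’s how that values_fn approach could look in a sketch:

ratings |>
  select(series, episode, viewers_7day) |>
  pivot_wider(
    names_from = episode,
    values_from = viewers_7day,
    values_fn = function(x) x * 1e6   # scale from millions to actual viewer counts
  )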
If we wanted to incorporate the air date along with the viewership figures, two columns could be specified in the values_from argument. We can no longer use the values_fn with a supplied function because we’re getting both numeric and Date columns this time around (and the function would try to multiply dates by one million). The new-column naming will mash together a values_from column name (either viewers_7day or uk_airdate) with the values from the episode column. We can have it so that some text gets inserted in between these text fragments, and we’ll do that with the names_sep argument.
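A sketch of that two-column pivot (the "_" separator is our choice here; whatever string is given to names_sep gets inserted between the fragments):

ratings |>
  select(series, episode, viewers_7day, uk_airdate) |>
  pivot_wider(
    names_from = episode,
    values_from = c(viewers_7day, uk_airdate),  # two sets of value columns
    names_sep = "_"   # text inserted between the column name and the episode number
  )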
This makes for a pretty wide table (21 columns!) but only 10 rows. We basically got what we asked for. Because we can’t use values_fn to multiply the viewers_7day_* columns by 1e6, the alternative is to use dplyr’s mutate() with the across() helper function (to target the right columns for the multiplying function). Though that last incantation is somewhat hard to remember, it gets the job done here.
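That mutate()/across() step, sketched (using starts_with() to target the viewers_7day_* columns):

ratings |>
  select(series, episode, viewers_7day, uk_airdate) |>
  pivot_wider(
    names_from = episode,
    values_from = c(viewers_7day, uk_airdate),
    names_sep = "_"
  ) |>
  mutate(across(starts_with("viewers_7day"), \(x) x * 1e6))  # millions -> counts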
11.2 Understand
You’ll sometimes run into the case where the dataset you want to widen has a lot of columns, many of them extraneous, in that you don’t want them in the pivoted table. We solved this in our first example with pivot_wider() by using an initial select() statement. Here it is again (sketched as before):
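ratings |>
  select(series, episode, viewers_7day) |>   # drop the extraneous columns up front
  pivot_wider(
    names_from = episode,
    values_from = viewers_7day
  )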
However, we could instead take advantage of id_cols and state which of the columns serves as an ‘ID’ for a row. Put another way, the combined information from those columns uniquely identifies each row. Any column not given to id_cols, names_from, or values_from will be dropped from the pivoted table. Here’s a variation of the earlier example (again a sketch) that uses id_cols in pivot_wider() and drops the select() statement:
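ratings |>
  pivot_wider(
    id_cols = series,        # series uniquely identifies each row
    names_from = episode,
    values_from = viewers_7day
  )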
It gives us the same output table, which is great to see. Importantly, any columns provided to id_cols must not also be given to the names_from or values_from arguments. The above example used only one column in id_cols, but conceivably you could have several columns that, only in combination, concretely and uniquely describe a row (a good example is a first name, a last name, and an address).
A very pro move is to combine pivot_wider() with group_by()/summarize(). The summarizing part happens before the pivoting; here’s an example where a summary of the penguins dataset feeds right into the pivoting operation.
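One way that could look (a sketch: penguins_average_mass_by_year is the name this table goes by in the Share section, and na.rm = TRUE is our addition, since body_mass_g contains a few NA values):

penguins_average_mass_by_year <-
  penguins |>
  group_by(island, year) |>
  summarize(avg_mass = mean(body_mass_g, na.rm = TRUE), .groups = "drop") |>
  pivot_wider(
    id_cols = island,    # one row per island
    names_from = year,   # columns: 2007, 2008, 2009
    values_from = avg_mass
  )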
Here’s a slightly different pivoting-after-summarizing example, where we transpose the columns provided to the id_cols and names_from arguments. This gives us the three different years as rows and the different islands as columns.
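And a sketch of that transposed variant:

penguins_average_mass_by_island <-
  penguins |>
  group_by(island, year) |>
  summarize(avg_mass = mean(body_mass_g, na.rm = TRUE), .groups = "drop") |>
  pivot_wider(
    id_cols = year,        # one row per year
    names_from = island,   # columns: Biscoe, Dream, Torgersen
    values_from = avg_mass
  )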
With some practice, using pivot_wider() in combination with a summary of a dataset will become second nature. It’s a very powerful pattern, as summarizing is undeniably useful (especially for making summary tables meant for publication) and pivoting that to a wide form of your choosing makes the summary more readable.
11.3 Explain
With the previous examples demonstrating a wide range of uses for pivot_wider(), we’ve really gotten to grips with how to use that great function effectively. There is a pattern, though, that may be encountered less commonly: converting a wide dataset to a different wide dataset. Put another way, the variables that make up the bulk of the table’s columns should be replaced with names drawn from another column. While this sort of scenario is difficult to articulate, we can explain the problem and work through the solution with the examples that follow.
The illness dataset in the gt package starts out as a wide dataset. Put another way, it is not tidy, and we can see that right away given that there are multiple columns for different observations (e.g., day_3, day_4, etc.).
In the above form, each row represents a different test performed on a patient. Now, the tests were performed every day (from day 3 to day 9), and the result of each type of test on each day is written in the cell where those two things intersect; it’s just tabulated that way. But what if you wanted the different days of measurement enumerated in a column, with each test occupying its own column? There’s a way to do that, and it involves two steps: (1) pivoting to long data with pivot_longer(), and (2) pivoting back to a wide form with pivot_wider().
Let’s do this one step at a time. Starting, as stated earlier, with a pivot_longer() statement. To simplify things, we’ll use only the test and the day_* columns.
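A sketch of that first step:

illness |>
  select(test, starts_with("day")) |>   # keep only test and the day_* columns
  pivot_longer(
    cols = starts_with("day"),
    names_to = "day",
    names_prefix = "day_",   # strip "day_" so the values are just 3, 4, ..., 9
    values_to = "value"
  )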
Now this is some tidy data. The columns we have now are: test, day, and value. And it’s certainly a lot longer than the original table, with 273 rows (compared to the 39 in the original illness dataset). We had to specially handle the day_* column names. To turn them into values in the new day column, the names_prefix = "day_" option needed to be specified.
Because the data is tidy and long, we can now make it wide, but in a different way. The pivot_wider() statement is relatively simple (just needing a column each for names_from and values_from). To clean up the new column names, we’ll employ the clean_names() function from the janitor package.
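Extending the same pipeline, a sketch of the second step:

illness |>
  select(test, starts_with("day")) |>
  pivot_longer(
    cols = starts_with("day"),
    names_to = "day",
    names_prefix = "day_",
    values_to = "value"
  ) |>
  pivot_wider(
    names_from = test,    # each test becomes its own column
    values_from = value
  ) |>
  janitor::clean_names()  # e.g., "Viral load" becomes viral_load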
This is not just a wide table, it’s super wide: 40 columns. That’s totally fine if using this data for further analysis, but maybe not what you want if you’re making a summary table for publication from the widened data. There are two things you can do to limit the number of columns in this wide output: (1) use the select() function to pick the columns to keep, or (2) filter() the test column in the data just before (or even just after) the pivot_longer() statement.
11.4 Share
Let’s take all of these nicely pivoted tables and transform them into shareable gt tables. Tables of this type are practically made for the gt treatment, and we can take advantage of numerous interesting features in that package to make the tables shine.
The ratings_by_series table shows the viewership for each episode. And the data is organized left-to-right by episode number. Each row represents a different series. After introducing the data to the gt() function (and putting the series column in the table stub with rowname_col = "series"), we can do things like:
add a table title (with tab_header())
include a source note (with tab_source_note())
format the viewership numbers (with fmt_number())
replace NA values with an empty string (through sub_missing())
put a spanner label over the episode numbers (with tab_spanner())
insert a label in the ‘stubhead’ cell (with tab_stubhead())
color-code the data cells (via data_color())
ratings_by_series |>
  gt(rowname_col = "series") |>
  tab_header(
    title = "Viewership of Bakeoff Episodes",
    subtitle = "Numbers represent millions of viewers"
  ) |>
  tab_source_note(
    source_note = "Data taken from the bakeoff package."
  ) |>
  fmt_number(columns = -series, decimals = 1) |>
  sub_missing(missing_text = "") |>
  tab_spanner(
    label = "Episode",
    columns = everything()
  ) |>
  tab_stubhead(label = "Series") |>
  data_color(
    columns = everything(),
    palette = c("white", "red"),
    na_color = "white"
  )
Viewership of Bakeoff Episodes
Numbers represent millions of viewers

                                 Episode
Series     1     2     3     4     5     6     7     8     9    10
   1     2.2   3.0   3.0   2.6   3.0   2.8
   2     3.1   3.5   3.8   3.6   3.8   4.2   4.4   5.1
   3     3.9   4.6   4.5   4.7   4.6   4.8   5.1   5.3   5.7   6.7
   4     6.6   6.6   7.2   6.8   7.0   7.3   7.8   7.4   7.4   9.4
   5     8.5   8.8   9.3  10.2   9.9  10.1  10.3   9.0  10.7  13.5
   6    11.6  11.6  12.0  12.4  12.4  12.0  12.3  11.1  12.7  15.1
   7    13.6  13.4  13.0  13.3  13.1  13.1  13.4  13.3  13.4  15.9
   8     9.5   9.2   8.7   8.6   8.6   8.6   9.0   9.0   9.0  10.0
   9     9.6   9.3   8.9   8.9   8.7   8.9   9.2   9.7   9.5  10.3
  10     9.6   9.4   8.9   9.0   9.3   8.7   9.0   9.2   9.3  10.1

Data taken from the bakeoff package.
The finalized table looks good enough to share with others! The title lets people quickly know what’s in the table. The color-coding of the data makes it apparent that Series 7 posted the best numbers (with Series 6 running a close second in terms of viewership). Little touches like carefully formatting numbers and handling NA values make all the difference when presenting data to others!
The ratings_by_series_num_expand table is a variation of ratings_by_series where the viewership values are expanded to full counts (rather than abbreviated millions). That table could be made in a similar way, and a good formatting option for it would be an fmt_number() statement along these lines:
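fmt_number(columns = -series, decimals = 1, suffixing = TRUE)

With suffixing = TRUE (our assumption for how this is achieved), gt condenses large values using number suffixes, which is how 10,200,000 can render as 10.2M.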
With this slightly different data and the fmt_number() statement above in place of the original one, 10.2 becomes 10.2M, and the subtitle text given in the tab_header() statement could be removed. It’s sometimes good to have alternatives when presenting data, and both tables would be acceptable in most contexts.
Let’s now turn the penguins-based tables into gt tables. We have two, penguins_average_mass_by_year and penguins_average_mass_by_island, and they are strikingly similar. Here we’ll see that we can make gt tables that look decent even when the input tables are super small.
penguins_average_mass_by_year |>
  gt() |>
  tab_header(title = "Average Mass of Penguins on Different Islands") |>
  tab_spanner(label = "Year", columns = -island) |>
  fmt_number(decimals = 1, pattern = "{x} g") |>
  cols_width(everything() ~ px(125)) |>
  cols_align(align = "left", columns = island) |>
  cols_label(island = "Island") |>
  opt_align_table_header(align = "left")
Average Mass of Penguins on Different Islands

                         Year
Island         2007        2008        2009
Biscoe      4,740.9 g   4,628.1 g   4,792.8 g
Dream       3,684.2 g   3,779.4 g   3,691.5 g
Torgersen   3,763.2 g   3,856.2 g   3,489.1 g
penguins_average_mass_by_island |>
  gt() |>
  tab_header(title = "Average Mass of Penguins in Different Years") |>
  tab_spanner(label = "Island", columns = -year) |>
  fmt_number(columns = -year, decimals = 1, pattern = "{x} g") |>
  cols_width(everything() ~ px(125)) |>
  cols_align(align = "left", columns = year) |>
  cols_label(year = "Year") |>
  opt_align_table_header(align = "left")
Average Mass of Penguins in Different Years

                    Island
Year      Biscoe      Dream       Torgersen
2007    4,740.9 g   3,684.2 g   3,763.2 g
2008    4,628.1 g   3,779.4 g   3,856.2 g
2009    4,792.8 g   3,691.5 g   3,489.1 g
A good thing to do with very small tables is to widen them just a bit with cols_width(). This gives the few columns we have a little space to breathe. In both of these gt tables, we used the same statements to effectively make similar tables. We had to handle some things a little differently (like not accidentally formatting the year column with decimal places) but they are more similar than they are different.
With the illness dataset treatment performed earlier, we used clean_names(). That’s nice for analysis, but the names are actually better left uncleaned when being tabulated; we want labels to look like labels! We can take this all a bit further and take advantage of gt’s ability to render units in column labels. To do this, the test name is pasted to the units info in a way that includes a line break (with "<br>"), and the ‘units’ part is enclosed in "{{" and "}}" (needed by gt to interpret this text as ‘units notation’). The act of actually interpreting the text in that way is handled by the cols_label() statement (the .process_units = TRUE part sets this in motion). Here is but one way that the re-widened illness data can be made into a shareable gt table:
illness |>
  dplyr::mutate(test = paste0(test, ",<br>{{", units, "}}")) |>
  dplyr::slice_head(n = 8) |>
  dplyr::select(-c(starts_with("norm"), units)) |>
  tidyr::pivot_longer(
    cols = starts_with("day"),
    names_to = "day",
    names_prefix = "day_",
    values_to = "value"
  ) |>
  tidyr::pivot_wider(
    names_from = test,
    values_from = value
  ) |>
  gt(rowname_col = "day") |>
  tab_header(title = "Six tests performed daily on patient") |>
  tab_source_note(source_note = "Data taken from the `illness` dataset.") |>
  tab_stubhead(label = "Day") |>
  cols_label(
    .fn = md,
    .process_units = TRUE
  ) |>
  cols_width(
    stub() ~ px(50),
    everything() ~ px(120)
  ) |>
  opt_align_table_header(align = "left")
Six tests performed daily on patient

      Viral load,      WBC,     Neutrophils,   RBC,      Hb,    PLT,      ALT,       AST,
Day   copies per mL   ×10⁹/L    ×10⁹/L        ×10¹²/L    g/L   ×10⁹/L     U/L        U/L
  3   12000            5.26      4.87           5.72     153    67.0     12835.0    23672.0
  4    4200            4.26      4.72           5.98     135    38.6     12632.0    21368.0
  5    1600            9.92      7.92           4.23     126    27.4      6426.7    14730.0
  6     830           10.49     18.21           4.83     115    26.2      4263.1     8691.0
  7     760           24.77     22.08           4.12      75    74.1      1623.7     2189.0
  8     520           30.26     27.17           2.68      87    36.2       672.6     1145.0
  9     250           19.03     16.59           3.32      95    25.6       512.4      782.5

Data taken from the `illness` dataset.
This presents very nicely! Having the units incorporated into the column labels is beneficial to the user (and you don’t have to explain what the numbers mean elsewhere in the table). The table could be further embellished with colorized data cells or explanations of the tests (in the form of footnotes, via tab_footnote()).