13 Batch Creation of Tables

Our chapter pattern

Explore: The first steps with a new pattern (e.g., get info you need, take stock, smell checks).

Understand: Learning anything can involve asking questions (what kinds? How to actually translate to code? Interpreting outputs personally, may lead to more exploration).

Explain: How do we take the answers and rationalize with the investigations? That’s covered in this pattern. How to document to oneself, programmatic text for resilient reporting. This is really about talking to yourself, in a way.

Share: Communication and collaboration. We’ll show you ways you can make everything presentable and create outputs in a format that allows for easy consumption and feedback. An example: reordering plots or tables to make them easier to interpret.

In this chapter we explore the pattern of taking changing data and dependably making great looking tables: in batch mode. We’ll first explore a dataset and try out some data manipulation that gets us close to what we need. That working code will be developed into a reusable function and some testing will be done to give us confidence it won’t fail. Finally, we’ll take a look at basic some ways that function could be stored and used for generating a graphic that could be used in a report or presentation. This sort of thing is often part of a larger data sharing workflow. Getting this pattern down can make data sharing in an organization go much more smoothly, giving you the confidence to scale up the approach to multiple streams of data coming in, and generating multiple tabular data products to different audiences.

We’ll need a few packages loaded in for this chapter:

tidyverse: for various functions across a collection of packages (e.g., dplyr, tidyr, etc.)
intendo: for the dataset we’ll be using (all_revenue)
pointblank: for slicing up tibbles and performing data checks
gt: for making wonderful-looking tables
glue: for putting strings together elegantly

13.1 Explore

We’re going to get some data from the intendo package. This data package contains synthetic datasets that deal with activity and revenue from an online game. The dataset we want for our examples is called all_revenue and that’s accessed by using the all_revenue() function.

all_revenue <- all_revenue()

It’s good to have a glimpse of the data, if only to see what’s there before doing anything substantial with the dataset. Let’s use glimpse() from the dplyr package:

dplyr::glimpse(all_revenue)

Rows: 195,917
Columns: 11
$ player_id        <chr> "YMKOHGVFZWLJ836", "SBPFOHCVMNQI568", "SBPFOHCVMNQI56…
$ session_id       <chr> "YMKOHGVFZWLJ836-lgf9n3qe", "SBPFOHCVMNQI568-xsjpfhz2…
$ session_start    <dttm> 2015-01-01 01:45:33, 2015-01-01 10:13:16, 2015-01-01…
$ time             <dttm> 2015-01-01 01:56:51, 2015-01-01 10:17:10, 2015-01-01…
$ item_type        <chr> "iap", "ad", "ad", "iap", "ad", "ad", "ad", "ad", "ad…
$ item_name        <chr> "gold4", "ad_5sec", "ad_5sec", "offer3", "ad_5sec", "…
$ item_revenue     <dbl> 13.993, 0.060, 0.048, 13.491, 0.042, 0.030, 0.260, 0.…
$ session_duration <dbl> 12.8, 34.9, 34.9, 34.9, 34.9, 34.9, 31.5, 31.5, 31.5,…
$ start_day        <date> 2015-01-01, 2015-01-01, 2015-01-01, 2015-01-01, 2015…
$ acquisition      <chr> "organic", "apple", "apple", "apple", "apple", "apple…
$ country          <chr> "United States", "Japan", "Japan", "Japan", "Japan", …

This dataset has revenue data encompassing an entire year (2015) but, for the purpose of an initial look at the data, we only want a slice of the data. We’re going to get a slice that corresponds to an arbitrary range of days.

Let’s use pointblank’s tt_time_slice() function to get slices of that all_revenue table. We’ll first define the initial date and the ending date for the range of data, and pass those date-time strings to separate invocations of tt_time_slice(). The end_m1_date represents the final full day of data, needed later for a friendly labeling of the date range.

initial_date <- "2015-01-10"
ending_date  <- "2015-01-15"
end_m1_date  <- "2015-01-14"

all_rev_jan_segment <- 
  all_revenue |>
  tt_time_slice(
    time_column = "session_start",
    slice_point = ending_date,
    keep = "left"
  ) |>
  tt_time_slice(
    time_column = "session_start",
    slice_point = initial_date,
    keep = "right"
  )
  
all_rev_jan_segment

# A tibble: 417 × 11
   player_id       session_id  session_start       time                item_type
   <chr>           <chr>       <dttm>              <dttm>              <chr>    
 1 LSNXTPAWEDBY179 LSNXTPAWED… 2015-01-10 00:46:00 2015-01-10 01:15:48 ad       
 2 LSNXTPAWEDBY179 LSNXTPAWED… 2015-01-10 00:46:00 2015-01-10 01:26:24 ad       
 3 IYOPSWCHJBRA492 IYOPSWCHJB… 2015-01-10 03:56:34 2015-01-10 03:58:10 iap      
 4 IYOPSWCHJBRA492 IYOPSWCHJB… 2015-01-10 03:56:34 2015-01-10 04:14:34 iap      
 5 IYOPSWCHJBRA492 IYOPSWCHJB… 2015-01-10 03:56:34 2015-01-10 04:21:52 ad       
 6 IYOPSWCHJBRA492 IYOPSWCHJB… 2015-01-10 03:56:34 2015-01-10 04:26:04 ad       
 7 IYOPSWCHJBRA492 IYOPSWCHJB… 2015-01-10 03:56:34 2015-01-10 04:27:46 ad       
 8 IYOPSWCHJBRA492 IYOPSWCHJB… 2015-01-10 03:56:34 2015-01-10 04:28:16 ad       
 9 TYUDEKBWSZNF763 TYUDEKBWSZ… 2015-01-10 04:34:31 2015-01-10 04:36:19 ad       
10 TYUDEKBWSZNF763 TYUDEKBWSZ… 2015-01-10 04:34:31 2015-01-10 04:43:55 ad       
# ℹ 407 more rows
# ℹ 6 more variables: item_name <chr>, item_revenue <dbl>,
#   session_duration <dbl>, start_day <date>, acquisition <chr>, country <chr>

A quick check for whether the slicing worked can involve using gt’s gt_preview(), which provides the first five rows and the last row in a small table:

gt_preview(all_rev_jan_segment)

	player_id	session_id	session_start	time	item_type	item_name	item_revenue	session_duration	start_day	acquisition	country
1	LSNXTPAWEDBY179	LSNXTPAWEDBY179-4nsqmc51	2015-01-10 00:46:00	2015-01-10 01:15:48	ad	ad_5sec	0.040	40.6	2015-01-09	other_campaign	United States
2	LSNXTPAWEDBY179	LSNXTPAWEDBY179-4nsqmc51	2015-01-10 00:46:00	2015-01-10 01:26:24	ad	ad_10sec	0.140	40.6	2015-01-09	other_campaign	United States
3	IYOPSWCHJBRA492	IYOPSWCHJBRA492-lgu9bo1s	2015-01-10 03:56:34	2015-01-10 03:58:10	iap	gems1	1.743	32.6	2015-01-04	google	South Africa
4	IYOPSWCHJBRA492	IYOPSWCHJBRA492-lgu9bo1s	2015-01-10 03:56:34	2015-01-10 04:14:34	iap	gold2	1.393	32.6	2015-01-04	google	South Africa
5	IYOPSWCHJBRA492	IYOPSWCHJBRA492-lgu9bo1s	2015-01-10 03:56:34	2015-01-10 04:21:52	ad	ad_survey	0.994	32.6	2015-01-04	google	South Africa
6..416
417	QNVGITZEKYPC816	QNVGITZEKYPC816-mbw87yxs	2015-01-14 23:17:37	2015-01-14 23:38:07	ad	ad_survey	0.780	22.9	2015-01-13	other_campaign	Germany

This looks good as we can see early morning values on "2015-01-10" and near-midnight values on "2015-01-15" (in the session_start column). We might want to do a few more data validation checks, just to be sure what we’re starting with is of high enough quality. We’ll use pointblank for that and introduce the all_rev_jan_segment to these three basic data quality checks:

Do all rows have session_start values within the expected time range?
Are all rows distinct from each other?
Are all rows complete (i.e., having no missing values anywhere)?

all_rev_jan_segment |>
  col_vals_between(
    columns = session_start,
    left = lubridate::ymd_hms("2015-01-10 00:00:00"),
    right = lubridate::ymd_hms("2015-01-15 11:59:59")
  ) |>
  rows_complete() |>
  rows_distinct()

# A tibble: 417 × 11
   player_id       session_id  session_start       time                item_type
   <chr>           <chr>       <dttm>              <dttm>              <chr>    
 1 LSNXTPAWEDBY179 LSNXTPAWED… 2015-01-10 00:46:00 2015-01-10 01:15:48 ad       
 2 LSNXTPAWEDBY179 LSNXTPAWED… 2015-01-10 00:46:00 2015-01-10 01:26:24 ad       
 3 IYOPSWCHJBRA492 IYOPSWCHJB… 2015-01-10 03:56:34 2015-01-10 03:58:10 iap      
 4 IYOPSWCHJBRA492 IYOPSWCHJB… 2015-01-10 03:56:34 2015-01-10 04:14:34 iap      
 5 IYOPSWCHJBRA492 IYOPSWCHJB… 2015-01-10 03:56:34 2015-01-10 04:21:52 ad       
 6 IYOPSWCHJBRA492 IYOPSWCHJB… 2015-01-10 03:56:34 2015-01-10 04:26:04 ad       
 7 IYOPSWCHJBRA492 IYOPSWCHJB… 2015-01-10 03:56:34 2015-01-10 04:27:46 ad       
 8 IYOPSWCHJBRA492 IYOPSWCHJB… 2015-01-10 03:56:34 2015-01-10 04:28:16 ad       
 9 TYUDEKBWSZNF763 TYUDEKBWSZ… 2015-01-10 04:34:31 2015-01-10 04:36:19 ad       
10 TYUDEKBWSZNF763 TYUDEKBWSZ… 2015-01-10 04:34:31 2015-01-10 04:43:55 ad       
# ℹ 407 more rows
# ℹ 6 more variables: item_name <chr>, item_revenue <dbl>,
#   session_duration <dbl>, start_day <date>, acquisition <chr>, country <chr>

Because the table was returned with no errors reported, all of the data validation checks passed with flying colors! There are of course many more data checks that could potentially be performed but even just a few basic checks are valuable and we could move forward, feeling a bit more confident about the input data.

Let’s generate a summary of the top-five selling items in each day of this abbreviated dataset using functions from dplyr.

daily_revenue_top5 <-
  all_rev_jan_segment |>
  filter(item_type == "iap") |>
  mutate(date = as.Date(time)) |>
  group_by(date, item_name) |>
  summarize(
    revenue = sum(item_revenue),
    .groups = "drop"
  ) |>
  arrange(desc(date), desc(revenue)) |>
  group_by(date) |>
  filter(row_number() %in% 1:5) |>
  mutate(rank = 1:5) |>
  ungroup()

daily_revenue_top5

# A tibble: 25 × 4
   date       item_name revenue  rank
   <date>     <chr>       <dbl> <int>
 1 2015-01-14 gems4       120.      1
 2 2015-01-14 gems5        91.0     2
 3 2015-01-14 gold7        60.0     3
 4 2015-01-14 offer2       42.0     4
 5 2015-01-14 offer5       34.8     5
 6 2015-01-13 gems5        91.0     1
 7 2015-01-13 offer4       26.0     2
 8 2015-01-13 offer3       24.0     3
 9 2015-01-13 gold4        22.0     4
10 2015-01-13 gold5        20.3     5
# ℹ 15 more rows

Making a gt table from this summary makes for a very presentable deliverable.

daily_revenue_top5 |>
  gt(rowname_col = "item_name", groupname_col = "date") |>
  fmt_currency(columns = revenue) |>
  cols_width(everything() ~ px(200)) |>
  fmt_roman(columns = rank) |>
  cols_merge(columns = c(item_name, rank), pattern = "{2}. {1}") |>
  tab_header(
    title = "Daily Top Five IAPs",
    subtitle = md(glue("Data from `{initial_date}` to `{end_m1_date}`."))
  ) |>
  tab_options(column_labels.hidden = TRUE) |>
  opt_all_caps() |>
  opt_vertical_padding(scale = 0.5) |>
  tab_style(
    style = cell_text(align = "center"),
    locations = cells_row_groups()
  )

2015-01-14
Daily Top Five IAPs
Data from `2015-01-10` to `2015-01-14`.
I. gems4	$119.98
II. gems5	$90.99
III. gold7	$60.00
IV. offer2	$41.96
V. offer5	$34.79
2015-01-13
I. gems5	$90.99
II. offer4	$25.99
III. offer3	$23.98
IV. gold4	$21.99
V. gold5	$20.29
2015-01-12
I. gems5	$90.99
II. gems3	$44.98
III. gems4	$41.99
IV. gold5	$40.59
V. gold6	$24.00
2015-01-11
I. gems4	$41.99
II. offer3	$31.48
III. offer2	$30.97
IV. offer4	$29.98
V. gold5	$20.29
2015-01-10
I. gold7	$83.99
II. offer4	$49.97
III. gold6	$41.99
IV. offer3	$20.99
V. gems4	$18.00

This looks pretty nice. And things are set up for even more flexibility. I mean, you might get asked for all sorts modifications of this summary table. Some examples:

Give me the top 10 items per day, going back 4 days from today
I’d like to see the best performing ad types
This would be much better with the company colors applied somewhere

To make this all possible we can generate a function. The ultimate function for getting a table of revenue figures really.

13.2 Understand

We have three basic components to making that reporting table:

Data acquisition
Data transformation and summarization
Data tabulation for presentation

If we are to make a single, dependable function, it would be acceptable (and easier for you) to get all these components into that function. Let’s start with a name and call this function get_sales_tbl().

Now let’s think of what’s needed in terms of options. We know we should allow for some variation and flexibilty. We don’t want too much of that because it could result in a lot of code that could get brittle and harder to maintain. The art in this sort of thing is finding a middle ground and extended later as necessity warrants. Let’s start by popping in all the R code we used before into this function, with zero options:

get_sales_tbl <- function() {
  
  initial_date <- "2015-01-10"
  ending_date  <- "2015-01-15"
  end_m1_date  <- "2015-01-14"

  all_rev_jan_segment <- 
    all_revenue |>
    tt_time_slice(
      time_column = "session_start",
      slice_point = ending_date,
      keep = "left"
    ) |>
    tt_time_slice(
      time_column = "session_start",
      slice_point = initial_date,
      keep = "right"
    )
  
  daily_revenue_top5 <-
    all_rev_jan_segment |>
    filter(item_type == "iap") |>
    mutate(date = as.Date(time)) |>
    group_by(date, item_name) |>
    summarize(
      revenue = sum(item_revenue),
      .groups = "drop"
    ) |>
    arrange(desc(date), desc(revenue)) |>
    group_by(date) |>
    filter(row_number() %in% 1:5) |>
    mutate(rank = 1:5) |>
    ungroup()
  
  daily_revenue_top5 |>
    gt(rowname_col = "item_name", groupname_col = "date") |>
    fmt_currency(columns = revenue) |>
    cols_width(everything() ~ px(200)) |>
    fmt_roman(columns = rank) |>
    cols_merge(columns = c(item_name, rank), pattern = "{2}. {1}") |>
    tab_header(
      title = "Daily Top Five IAPs",
      subtitle = md(glue("Data from `{initial_date}` to `{end_m1_date}`."))
    ) |>
    tab_options(column_labels.hidden = TRUE) |>
    opt_all_caps() |>
    opt_vertical_padding(scale = 0.5) |>
    tab_style(
      style = cell_text(align = "center"),
      locations = cells_row_groups()
    )
}

What was made was a function that’s not too functional. It has no arguments (y’know, options). It’ll give you the same table every single time. Obviously, not what anyone wants but it’s a good starting point because everything is here. We just need to make this usable for the real-world now.

In terms of design constraints, let’s make it this is always some sort of daily ranking of something from the source table. In terms of options, we want these three:

A way to modify the date range
Option to modify the number of items in the ranking.
Two styling options: basic and company-themed

Let’s put that into the function’s signature, along with some defaults:

get_sales_tbl <- function(
    data,
    final_day = lubridate::today() - lubridate::days(2),
    start_day = lubridate::today() - lubridate::days(6),
    days_back = NULL,
    n = 5,
    styling = c("basic", "company")
) {

...

}

The data argument is the dataset. This is the data frame or tibble that first gets processed by the dplyr code. The pairing of final_day and start_day allow for input of dates. The defaults make it so that the system’s clock time is involved, getting a start day and a final day two days back from the present day. As an additional option, the start_day input could be overridden by providing a days_back number (good if you don’t really want to do any calendar-based calculations for larger ranges of days). The n argument is for the number of ranked items to include in each day. Finally, styling allows for choosing a theme for the gt-based table output.

The new function, after incorporation of the proposed options, might look something like this:

get_sales_tbl <- function(
    data,
    final_day = lubridate::today() - lubridate::days(2),
    start_day = lubridate::today() - lubridate::days(6),
    days_back = NULL,
    n = 5,
    styling = c("company", "basic")
) {
  
  styling <- match.arg(styling)

  final_day <- as.Date(final_day)
  
  if (!is.null(days_back) && is.numeric(days_back)) {
    start_day <- final_day - lubridate::days(days_back)
  } else {
    start_day <- as.Date(start_day)
  }
  
  days <- seq(start_day, final_day, by = "days")

  filtered_data <- 
    data |>
    dplyr::mutate(sale_day = as.Date(time)) |>
    dplyr::filter(sale_day %in% days)
  
  .top_n_sequence <- seq(1, n)
  
  daily_revenue_ranked <-
    filtered_data |>
    filter(item_type == "iap") |>
    mutate(date = as.Date(time)) |>
    group_by(date, item_name) |>
    summarize(
      revenue = sum(item_revenue),
      .groups = "drop"
    ) |>
    arrange(desc(date), desc(revenue)) |>
    group_by(date) |>
    filter(row_number() %in% .top_n_sequence) |>
    mutate(rank = .top_n_sequence) |>
    ungroup()
  
  output_tbl <-
    daily_revenue_ranked |>
    gt(rowname_col = "item_name", groupname_col = "date") |>
    fmt_currency(columns = revenue) |>
    cols_width(everything() ~ px(200)) |>
    fmt_roman(columns = rank) |>
    cols_merge(columns = c(item_name, rank), pattern = "{2}. {1}") |>
    tab_header(
      title = "Daily Top Sales Items",
      subtitle = md(glue("Data from `{start_day}` to `{final_day}`."))
    ) |>
    tab_options(column_labels.hidden = TRUE) |>
    opt_all_caps() |>
    opt_vertical_padding(scale = 0.5) |>
    tab_style(
      style = cell_text(align = "center"),
      locations = cells_row_groups()
    )
  
  if (styling == "company") {
    
    output_tbl <-
      output_tbl |>
      tab_options(
        table.background.color = adjust_luminance("beige", steps = 1.5),
        table.border.bottom.width = px(3),
        table.border.top.width = px(3),
        table.border.bottom.color = "black",
        table.border.top.color = "black",
        row_group.background.color = "gray98",
        source_notes.background.color = "darkorange"
      ) |>
      tab_style(
        style = cell_fill(color = "palegreen"),
        locations = cells_body(columns = revenue)
      ) |>
      tab_source_note(
        source_note = md("**NOTE:** This document is for internal use only and should not to be distributed to external parties.") 
      )
  }
  
  output_tbl
}

This shiny new function is now flexible and it really works! Always test your custom-built functions with a lot of sensible input combinations. The more important the function is, the more robust and well-tested it should become. If we use get_sales_tbl() with the dataset above and the final_day = "2015-06-30" and days_back = 2 options, we do get a nice-looking table:

get_sales_tbl(
  all_revenue,
  final_day = "2015-06-30",
  days_back = 2
)

2015-06-30
Daily Top Sales Items
Data from `2015-06-28` to `2015-06-30`.
I. gems5	$610.95
II. gold7	$515.96
III. gold6	$461.92
IV. offer5	$220.32
V. offer4	$203.90
2015-06-29
I. gems5	$623.95
II. offer5	$304.39
III. gold6	$215.96
IV. gems4	$173.97
V. offer4	$147.93
2015-06-28
I. gems5	$1,013.92
II. gold6	$527.91
III. gold7	$419.96
IV. offer5	$379.77
V. gems4	$293.95
NOTE: This document is for internal use only and should not to be distributed to external parties.

However, if we use the default options (i.e., by invoking get_sales_tbl(all_revenue)) we get an error. Turns out that the dataset obtained from using intendo::all_revenue() only contains data from 2015. So, the defaults are optimized for data that continuously updating to the present day. This is great in practice for that common scenario but not so good with a historical data extract like we have here. There is always the possibility to further tune the function to work with data extracts that have hard time boundaries. Sure it would add some complexity to both the function body and to the interface but if that is valid use case then that’s something that should be done!

13.3 Explain

So we have this new function available. It’s generally very useful but if we want somebody else to use it, we really ought to describe what’s going on and what the options are. Let’s have a look at the function’s signature again:

get_sales_tbl <- function(
    data,
    final_day = lubridate::today() - lubridate::days(2),
    start_day = lubridate::today() - lubridate::days(6),
    days_back = NULL,
    n = 5,
    styling = c("basic", "company")
) {

...

}

What we generally have is the function name and six different arguments. That makes at least seven different things to describe: what the function does, and what each of the arguments do. If we are keeping this function around as a piece of code in an .R file or something similar (i.e., not an R package), we can choose to write some documentation above as a series of comments. Here is something serviceable:

# Get a table of top sales data from our revenue table for a span of time
#
# Arguments:
# data - daily revenue data that is extracted from our database revenue table
# final_day - the end of the time period, default is 2 days before today
# start_day - the start of the time period, default is 6 days before today
# days_back - an option to choose a span of time that starts n days before today
# n - the number of items to include in the ranking
# styling - either 'basic' styling or 'company' styling (presentations, reports)

get_sales_tbl <- function(
    data,
    final_day = lubridate::today() - lubridate::days(2),
    start_day = lubridate::today() - lubridate::days(6),
    days_back = NULL,
    n = 5,
    styling = c("basic", "company")
) {

...

}

Providing these basic bits of documentation shouldn’t take very long and the payoff is a lasting reference for yourself and for others you might want to share the code with.

13.4 Share

Given that you have a useful function that suits a business need, one could hypothetically use it whenever it’s necessary to generate a table of data. However, the people do need to actually see the table and it probably won’t often be on your personal machine. The good news is that tables produced by gt can be shared in HTML documents (i.e., in web pages, emails, even standalone HTML files), as PDF documents, or in presentations.

Let’s suppose you want to put the table in a presentation slide. There are a few ways to fit this into your existing presentation workflow. Perhaps you’re using Powerpoint, Google Slides, or Keynote and you really just need a graphic. If that’s the case, gt enables one to produce a PNG file with the gtsave() function. Just give it a filename that ends in .png, like this:

revenue_mid_aug <-
  get_sales_tbl(
    all_revenue,
    final_day = "2015-08-15",
    days_back = 2,
    styling = "company"
  )

gtsave(revenue_mid_aug, filename = "revenue_mid_aug.png")

This will indeed generate the "revenue_mid_aug.png" file which is a table graphic at high resolution (the text is crisp and so zooming in for presentation purposes will be totally acceptable). If you do the same thing with the insertion of plots, this won’t be much different from that practice.

If you want to publish a report, then R Markdown and Quarto are good choices for that. A table made by the get_sales_tbl() function should work just fine within those environments. Of course, you’ll need to have that function available before using it. If you’re keeping pieces of code for reporting lightweight, that probably meaning having a stable of R scripts at the ready. In such a scenario, you can source the R script containing the get_sales_tbl() function with the source() function. Just do that before calling the function; a good place to put source() calls is right below all library() statements, preferably in a setup chunk.

In a real-world setting, this sort of strategy for making useful functions that interact with your data, placing them in R scripts, and sourcing them into your reports is quite viable and will get you pretty far!