7  Importing Data from Common Formats

Explore: The first steps with a new dataset: getting the information you need, taking stock of what’s there, and running some basic ‘smell checks’.

Understand: Asking questions of the data: what kinds of questions to ask, how to translate them into code, and how to interpret the outputs (which may lead to further exploration).

Explain: How do we take the answers and reconcile them with our investigations? That’s covered in this pattern: documenting findings to oneself and using programmatic text for resilient reporting. In a way, this is really about talking to yourself.

Share: Communication and collaboration. We’ll show you ways to make everything presentable and to create outputs in a format that allows for easy consumption and feedback. An example: reordering plots or tables to make them easier to interpret.

Previously, we used datasets available in the dspatterns, bakeoff, intendo, and other packages to get the data we needed. This is all well and good for the purpose of learning to use data transformation and visualization functions. However, real-world data won’t often be found in R packages. You may encounter useful data in text files, Excel files, or on the web in myriad forms. While there are hundreds of ways data can be structured and stored, a few formats are quite common, so the focus of this section will be on methods for importing these common formats into R (as tibbles). These formats are CSV and Excel files.

Here’s a breakdown of the packages needed for this chapter:

  • dspatterns: provides the example datasets and the helper functions that write them to files
  • tidyverse: provides readr (for read_csv()) and dplyr (for rename() and select())
  • readxl: provides read_excel() for importing Excel files
  • pointblank: provides the data quality checks used in the Share section

7.1 Explore

To make things easy for us, the dspatterns package has a function that writes any dspatterns-based dataset to a CSV file: dataset_to_csv(). Remember that we need to load the dspatterns package first, and this can be done by executing library(dspatterns) in the RStudio console. The datasets you’ll end up importing will be varied and will serve different purposes. Occasionally, you’ll need a dataset that has data associated with your main subject of study. For the examples in this section, we will experiment with variations of a dataset called us_cities, which has populations for cities in the United States. Why would you need such a dataset in the real world? Well, suppose you are conducting a study that looks at sales by city. You might have a dataset with a listing of such sales but, sadly, not the populations of those cities (which may help explain the sales figures). By obtaining an associated dataset on city populations, we can potentially combine the city populations with a summary of total sales by city. Before joining such datasets together, though, we first need to fully understand how to read in datasets from files.

The us_cities dataset inside the dspatterns package can be written as a CSV file like this:

Writing us_cities.csv to the working directory with dataset_to_csv().

dataset_to_csv(dataset = "us_cities")

A message should appear in your R console, which goes something like this:

#> The CSV file `us_cities.csv` was written to the working directory
#> ...

Further information on the us_cities dataset can be found by executing help(us_cities) in the RStudio console.

In case we don’t fully believe the message given by dataset_to_csv(), we can have a look at the Files tab in RStudio (in the lower-right pane, first tab). The file should be there. If it isn’t, we can try two things. First, check that the path above the file view is the same as the one given in the message after executing dataset_to_csv(); if it’s different for whatever reason, navigate back to the project directory by clicking the icon to the right of the path (the blue, R-in-cube icon). One last thing to try is clicking the refresh button just above (on some file systems, a refresh of the directory listing is necessary).

Using readr to Import Data from CSV Files

Now that we have learned how to set up our projects, we can import data from CSV (comma-separated values) files. CSVs are quite commonplace as a method for storing tabular data in plain text, and many software programs that deal with tabular data can export to CSV. While the CSV ‘format’ isn’t rigorously specified (there really is a lot of variation), we commonly find the following (an example follows this list):

  1. each line is a separate record,
  2. the records are divided into fields (separated by delimiters such as the comma, but sometimes by spaces or tab characters),
  3. there is the same number of fields per record, and
  4. there is often a header line that defines the column names.
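
To make this concrete, the first few lines of the us_cities.csv file we just wrote should look something like this (one header line, then one comma-delimited record per line):

city,state_id,state_name,pop_urb,pop_mun
Anchorage,AK,Alaska,253421,294356
Auburn,AL,Alabama,89577,63973
Birmingham,AL,Alabama,744189,210710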

While such variations in CSV formatting can’t really be avoided, the Tidyverse readr package has many options to successfully deal with any such potential issues.

Reading in Our Sample CSV File with the read_csv() Function

Now that we have a sample CSV file on disk, let’s read that file into a tibble object. We can do this using the read_csv() function from the readr package. This package is part of the Tidyverse, so we first need to execute library(tidyverse) before using that function. As before, we will present our upcoming examples as R Markdown code chunks, so the next code listing will look familiar.

Loading the tidyverse and dspatterns packages.
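
library(tidyverse)
library(dspatterns)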

When using read_csv(), the file name must of course be specified but, since the file is in the project directory, we don’t need to provide any path information. Let’s assign the resulting table to the city_table variable.

Using the read_csv() function to read us_cities.csv into a tibble object.

Notes on the Code
• This use of read_csv() provides only the minimal amount of required information: the file to read in.
city_table <- read_csv(file = "us_cities.csv")
Rows: 765 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): city, state_id, state_name
dbl (2): pop_urb, pop_mun

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
city_table
# A tibble: 765 × 5
   city       state_id state_name pop_urb pop_mun
   <chr>      <chr>    <chr>        <dbl>   <dbl>
 1 Anchorage  AK       Alaska      253421  294356
 2 Auburn     AL       Alabama      89577   63973
 3 Birmingham AL       Alabama     744189  210710
 4 Decatur    AL       Alabama      68816   54405
 5 Dothan     AL       Alabama      71620   68202
 6 Hoover     AL       Alabama      84920   84920
 7 Huntsville AL       Alabama     309716  194585
 8 Mobile     AL       Alabama     318084  190265
 9 Montgomery AL       Alabama     255908  199518
10 Tuscaloosa AL       Alabama     154204  100287
# ℹ 755 more rows

Since CSV files don’t contain any information on column types, all CSV readers like read_csv() have to guess the column types upon reading. The message about parsing (in the console) appears during the function call. The printed table output looks to be reasonable, however.

So, the CSV was successfully read into the city_table variable as a tibble. We verified this by printing the tibble object. However, we did get a message during import that told us that the data was parsed with a particular column specification. In this case, readr successfully made a workable table. However, we might have wanted the population values in the pop_urb (urban population) and pop_mun (municipal population) columns to be of the integer type (instead of double, which is the default numerical type in R). We can also imagine a situation where we might have ID values as numbers in a CSV but we would want them to be in a table as character values (and not as integer values). The good news is that we can provide our desired column types with the col_types argument. It’s easiest to provide this as an encoded string of letters (one per column, in the order present in the CSV file). Here are the encodings:

  • c: character
  • d: double
  • i: integer
  • l: logical
  • f: factor
  • D: date
  • T: date-time
  • t: time
  • ?: guess the column type
  • _: skip the column entirely

For the first three character-based columns of us_cities.csv (city, state_id, state_name) we would use three c’s, and for the integer values in the last two columns, two i’s are needed. This makes the string for col_types become "cccii". The next code listing rewrites the code in the previous code listing, providing col_types to the read_csv() function call and getting us a tibble with our desired column types.

Using the read_csv() with us_cities.csv, specifying the exact column types in col_types.

Notes on the Code
• Using the compact representation of column types ("cccii") provides more information to read_csv() so that we get our desired column types.
city_table <- read_csv(file = "us_cities.csv", col_types = "cccii")

city_table
# A tibble: 765 × 5
   city       state_id state_name pop_urb pop_mun
   <chr>      <chr>    <chr>        <int>   <int>
 1 Anchorage  AK       Alaska      253421  294356
 2 Auburn     AL       Alabama      89577   63973
 3 Birmingham AL       Alabama     744189  210710
 4 Decatur    AL       Alabama      68816   54405
 5 Dothan     AL       Alabama      71620   68202
 6 Hoover     AL       Alabama      84920   84920
 7 Huntsville AL       Alabama     309716  194585
 8 Mobile     AL       Alabama     318084  190265
 9 Montgomery AL       Alabama     255908  199518
10 Tuscaloosa AL       Alabama     154204  100287
# ℹ 755 more rows

Through inspection of the column types in city_table, we can see that the last two columns are now of the integer type.

Additional read_csv() Options for Non-standard CSVs

There are many more options we can use with read_csv(), and these become important as we encounter CSV files that contain oddities. For example, what if we want proper NA values in our tibble to account for missing values, but the CSV file uses the word "missing" instead of empty strings or "NA" (both of which are parsed correctly as NA values)? In this case, we could use the na argument, and then read_csv() would interpret those instances as NAs.
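
Here is a minimal sketch (the file name is hypothetical). Note that supplying na replaces the default set of missing-value strings, so it’s safest to list the defaults alongside "missing":

sales_table <- read_csv(
  file = "sales_by_city.csv",  # hypothetical CSV using "missing" for NAs
  na = c("", "NA", "missing")
)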

Here’s an unusual one: some CSVs contain comment lines within them. This is nonstandard, but it does occur in useful CSV datasets you might find on the Web. To correctly parse these kinds of files, it’s useful to inspect the file as text and identify which characters signify the start of the comment lines. (We can conveniently do this in RStudio by clicking on the file in the Files pane and selecting the View File option.) Often, a comment line begins with a # sign. Knowing this, we could provide the comment = "#" option to read_csv(), and such lines would be completely ignored.
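
As a sketch (again with a hypothetical file name):

cities_annotated <- read_csv(
  file = "cities_annotated.csv",  # hypothetical CSV containing `#` comment lines
  comment = "#"
)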

Using readxl to Import Excel Data

Excel is a spreadsheet application that’s made by Microsoft. It’s quite popular and so there is a wealth of data stored in the .xls and .xlsx file formats within organizations and available as public data. It is widely used across many different fields because it’s an easily accessible tool for organizing, analyzing, and storing data.

The dspatterns package will again provide the Excel files we need for the examples, eliminating the need to download Excel files from the Internet. We will use the readxl package, available in the Tidyverse, to import data tables from Excel files in all of our examples. But before we do that, we need to consider the ways in which we can best prepare Excel files for import. The reason is that there are so many ways to add data to Excel files, and we want to ensure that the data is in the tidiest form possible before importing (remembering that tidy tools expect tidy data). We’ll have a look at the ideal representation of data inside Excel as it relates to importing that data into R and propose some strategies for cleaning up Excel files that are problematic.

Writing an example Excel file to the project directory

The dspatterns package has a few functions that produce example Excel files. As always, ensure that library(dspatterns) has been executed in the RStudio console if this is a new R session. We will use the create_excel_file() function to write an example Excel file to the working directory. The Excel document that the function creates has two sheets (sometimes called tabs). The first sheet (called us_cities) contains all of the data from the us_cities dataset in the cleanest manner possible. Several qualities of this sheet make the data easy to import into R:

  • There is only a single data table
  • The data has a single header row for column names in the first row of the sheet
  • All data rows follow the header (from row 2) and there are no gaps (i.e., empty rows)
  • All tabular data encompasses a rectangular region (i.e., there are no cells with entries elsewhere in the sheet)

The us_cities sheet in the example Excel file is a paragon of clean data as an input for the read_excel() function. This will make your life easier.

The second sheet (called us_cities_messy) also contains the us_cities data; however, there is additional data strewn about in different cells. We can look to the first sheet as the idealized example of data stored in an Excel file: it won’t present any data import problems, and it’s something to strive for when managing data in Excel files. The second is the more real-world example of an Excel sheet you’ll often encounter. Perhaps a colleague provided this file as a source of data for a project, perhaps it’s your own file; it really doesn’t matter. What does matter is that the data must be cleaned up or reorganized before importing it into R.

Ensuring that we are within an RStudio Project, let’s use create_excel_file() to generate the example Excel file.

Using the create_excel_file() function to write us_cities.xlsx to the project directory.
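
create_excel_file()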

This function doesn’t require any options. It will just write the Excel file that’s stored in the dspatterns package to the working directory. You’ll get a helpful message in the console that describes what happened:

#> The Excel file `us_cities.xlsx` was written to the working directory

7.2 Understand

We can further verify that the file was written by looking for it in RStudio’s Files tab (lower-right pane). Only one type of file is generated from the create_excel_file() function and that’s us_cities.xlsx.

Reading in Our Sample Excel File with the read_excel() Function

Now that we have a sample Excel file on disk, let’s read the first sheet of that document (the ideal, clean sheet) into a tibble object. We can do this using the read_excel() function from the readxl package. This package is part of the Tidyverse and will be installed on your system when installing the Tidyverse package (through install.packages("tidyverse")). However, the readxl functions won’t be attached when executing library(tidyverse); instead, we need to explicitly use library(readxl) beforehand. So, keep in mind that library(readxl) must accompany your suite of library() statements when importing Excel files.
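
In practice, the library() statements at the top of a script or R Markdown document that imports Excel data might look like this:

library(tidyverse)   # attaches readr, dplyr, and friends (but not readxl)
library(readxl)      # must be attached explicitly for read_excel()
library(dspatterns)  # for the example Excel file used here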

When using read_excel(), the file name needs to be specified just as in the case of read_csv(). As before, we don’t need to add directory info since the file is available in the project directory. Because Excel workbooks can have multiple sheets, there is the sheet argument for specifying which one to use for importing. We can either specify the sheet name or provide its index (starting from 1). The default is to read the first sheet in the specified Excel file, and that’s just what we’ll do first. Let’s assign the resulting table to the city_table_x variable.

Using the read_excel() function to read the first sheet from the us_cities.xlsx file into a tibble object.

Notes on the Code
• This use of read_excel() provides only the minimal amount of required information: the file to read in. By default, the first sheet of the workbook will be read.
city_table_x <- read_excel(path = "us_cities.xlsx")

city_table_x
# A tibble: 765 × 5
   city       state_id state_name pop_urb pop_mun
   <chr>      <chr>    <chr>        <dbl>   <dbl>
 1 Anchorage  AK       Alaska      253421  294356
 2 Auburn     AL       Alabama      89577   63973
 3 Birmingham AL       Alabama     744189  210710
 4 Decatur    AL       Alabama      68816   54405
 5 Dothan     AL       Alabama      71620   68202
 6 Hoover     AL       Alabama      84920   84920
 7 Huntsville AL       Alabama     309716  194585
 8 Mobile     AL       Alabama     318084  190265
 9 Montgomery AL       Alabama     255908  199518
10 Tuscaloosa AL       Alabama     154204  100287
# ℹ 755 more rows

Excel files contain cell format information and, as long as there is consistency, this use of read_excel() obtains the correct column types upon reading. This is confirmed by the console printing of the tibble object.

The imported table, city_table_x, is the same as the table in the us_cities dataset (using all.equal(city_table_x, us_cities) in the RStudio console will return TRUE). Thus, the importing of this table presented no difficulty to us, and this is because the Excel sheet was formatted to only contain the table data and nothing extraneous to it.

Unlike our prior use of read_csv() with an example CSV file, we did not get a message telling us about the parsing of columns. Whereas CSV files don’t contain any type information about their columns, Excel files have cell formats (which can be text, numeric, etc.). In this imported sheet, there was consistent cell formatting per each column so read_excel() was able to assign the correct column types in the output tibble.

While the col_types argument is present in read_excel(), it doesn’t work in the same way as the col_types argument in read_csv(). If we want control over the column types, we have to supply a vector of the following read_excel()-specific types/directives to col_types:

  • text: character
  • numeric: double
  • logical: logical
  • date: date
  • guess: guess the column type
  • skip: skips the column entirely

Since we have three character-based columns and two numeric columns, the read_excel() function call could be rewritten more explicitly as read_excel("us_cities.xlsx", col_types = c(rep("text", 3), rep("numeric", 2))) (where the rep() function creates a vector that repeats a value n times). There is no way to create integer columns, so if we wanted the pop_urb and pop_mun columns to be of the integer type, the only option is to mutate() with dplyr after importing.
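
That follow-up step might look like this (a small sketch, assuming dplyr is attached via library(tidyverse)):

city_table_x <- 
  city_table_x |>
  mutate(across(c(pop_urb, pop_mun), as.integer))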

So that was the first sheet of the Excel workbook. The second sheet contains pretty much the same data (based on the us_cities dataset) but it won’t fare as well during import. We only need to glance at the tibble output to see that there are some major problems.

Using the read_excel() function to read in the second sheet from the us_cities.xlsx Excel file.

Notes on the Code
• This read_excel() call specifies the sheet as 2 (it's the second sheet).
city_table_x_m <- read_excel("us_cities.xlsx", sheet = 2)
New names:
• `` -> `...2`
• `` -> `...3`
• `` -> `...4`
• `` -> `...5`
• `` -> `...6`
• `` -> `...7`
• `` -> `...8`

city_table_x_m
# A tibble: 767 × 8
   Data obtained from https://simple…¹ ...2  ...3  ...4  ...5  ...6  ...7  ...8 
   <lgl>                               <chr> <chr> <chr> <chr> <chr> <lgl> <chr>
 1 NA                                  iden… <NA>  <NA>  popu… <NA>  NA    <NA> 
 2 NA                                  city  stat… stat… urba… muni… NA    Rati…
 3 NA                                  Anch… AK    Alas… 2534… 2943… NA    1.16…
 4 NA                                  Aubu… AL    Alab… 89577 63973 NA    0.71…
 5 NA                                  Birm… AL    Alab… 7441… 2107… NA    0.28…
 6 NA                                  Deca… AL    Alab… 68816 54405 NA    0.79…
 7 NA                                  Doth… AL    Alab… 71620 68202 NA    0.95…
 8 NA                                  Hoov… AL    Alab… 84920 84920 NA    1    
 9 NA                                  Hunt… AL    Alab… 3097… 1945… NA    0.62…
10 NA                                  Mobi… AL    Alab… 3180… 1902… NA    0.59…
# ℹ 757 more rows
# ℹ abbreviated name:
#   ¹​`Data obtained from https://simplemaps.com/data/us-cities`

From this code, we get a message about new names. What’s happening is that the function can’t find column names on the top row, so, it’s creating unique names for us. The resulting tibble has some extra columns and rows that are not intended to be table records (at least at the top, which is all we can see in this view).

7.3 Explain

Upon looking at the Excel file (the image below provides an annotated guide to its problems), we can see why the import result is the way it is. This is all fixable, though, so long as we know what the particular problems are. To identify them, we really need to have a close look at the sheet in Excel proper.

The us_cities_messy sheet in the example Excel file has a number of issues that require a much more sophisticated use of the read_excel() function.

The first big problem is that the table (which consists of the header row and the records) begins on the third row. Everything above that is not part of the table and should be discarded. The read_excel() function has the skip argument (just as read_csv() does) to skip rows at the top of the sheet. We will use skip = 2, and this first problem will be solved.

The next issue is extra columns that we don’t want in the tibble. The read_excel() function will automatically skip over the first empty column (because we eliminated the first two rows, and no text remains in column A), but we have two extra columns at the far right that should be omitted. A good option here is to use the "skip" directive for those undesired columns in the col_types argument. This will end up being col_types = c(rep("text", 3), rep("numeric", 2), rep("skip", 2)). Then, those last two columns are gone.

A final thing to do is to define those "N/A" values seen in the Excel sheet as genuine NA values. For this, the na argument will be used, giving it the "N/A" string so that the function will recode those strings as NAs.

Given all of these considerations, here is the finalized function call:

Using the read_excel() function with the us_cities_messy sheet, take 2.

Notes on the Code
• skip = 2 skips the first two rows, which are clearly not tabular data.
• While defining the column types is good, choosing which columns to skip (i.e., get rid of) is also needed here.
• Some cell values are 'N/A'. They really should be blank, but that's okay: it's fixable with the na option.
city_table_x_m <- 
  read_excel(
    "us_cities.xlsx",
    sheet = 2, 
    skip = 2,
    col_types = c(rep("text", 3), rep("numeric", 2), rep("skip", 2)),
    na = "N/A"
  )

city_table_x_m
# A tibble: 765 × 5
   city       `state id` `state name` `urban population` `municipal population`
   <chr>      <chr>      <chr>                     <dbl>                  <dbl>
 1 Anchorage  AK         Alaska                   253421                 294356
 2 Auburn     AL         Alabama                   89577                  63973
 3 Birmingham AL         Alabama                  744189                 210710
 4 Decatur    AL         Alabama                   68816                  54405
 5 Dothan     AL         Alabama                   71620                  68202
 6 Hoover     AL         Alabama                   84920                  84920
 7 Huntsville AL         Alabama                  309716                 194585
 8 Mobile     AL         Alabama                  318084                 190265
 9 Montgomery AL         Alabama                  255908                 199518
10 Tuscaloosa AL         Alabama                  154204                 100287
# ℹ 755 more rows

Upon printing city_table_x_m to the console, we can see that we now have a usable table.

If the column names of the city_table_x_m table are not satisfactory, we can use dplyr’s rename() function to modify some of them.

Using dplyr's rename() function to change column names.

Notes on the Code
• The rename() function is from the dplyr package. The syntax has the replacement name first, set equal to the existing column name (none of these names are in quotes).
• We have to use backticks for all of the existing names because these names have spaces in them (R only allows such names if they are enclosed in backticks).
city_table_x_m <- 
  city_table_x_m |>
  rename(
    state_id = `state id`,
    state_name = `state name`,
    pop_urb = `urban population`,
    pop_mun = `municipal population`
  )

city_table_x_m
# A tibble: 765 × 5
   city       state_id state_name pop_urb pop_mun
   <chr>      <chr>    <chr>        <dbl>   <dbl>
 1 Anchorage  AK       Alaska      253421  294356
 2 Auburn     AL       Alabama      89577   63973
 3 Birmingham AL       Alabama     744189  210710
 4 Decatur    AL       Alabama      68816   54405
 5 Dothan     AL       Alabama      71620   68202
 6 Hoover     AL       Alabama      84920   84920
 7 Huntsville AL       Alabama     309716  194585
 8 Mobile     AL       Alabama     318084  190265
 9 Montgomery AL       Alabama     255908  199518
10 Tuscaloosa AL       Alabama     154204  100287
# ℹ 755 more rows

We can see that the column names are now the same as those in the us_cities dataset. If we wanted to remove some columns as well, that would only require dplyr’s select() function.
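
For instance, a one-line sketch that drops the pop_mun column:

city_table_x_m |> select(-pop_mun)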

Would it have been easier to hand-edit the messy Excel sheet before importing? Perhaps, but this is a choice that depends on the situation at hand. If you created the Excel file and its only purpose was to get data into R, then that’s a good case for correcting your own mistakes in that Excel sheet. If a collaborator generated the file and it’s treated as ‘raw data’, perhaps likely to be updated various times (and thus requiring periodic importing into R), then it’s wise not to hand-edit the file. That practice is error-prone and leads to problems with reproducibility. Often, it’s best to deal with these files as they are and make use of the helpful options in the readxl and dplyr packages to handle the importing.

We learned a lot about importing CSVs and Excel files. And managing our projects and their associated files is much easier now that we’ve learned how to create and manage RStudio Projects. CSV data? Import all that with the readr package’s excellent read_csv() function. Got Excel files? That’s a bit more work, since Excel files often contain a multitude of datasets. But once you understand what’s in such a file, you can use the read_excel() function (from readxl) to read in data from Excel sheets.

There will always be problems, and we encountered some common ones when importing both CSV and Excel files. The read_csv() and read_excel() functions do, however, come packed with options to ensure that data can be imported properly!

7.4 Share

You now know a ton about importing data from CSVs and even Excel files. What are you going to do with that knowledge? You’ve got to give something back and share. Suppose you spent 45 minutes to an hour dealing with importing a table from an Excel file, and further suppose that someone else you know will eventually need to get that same data. What we can do (to be nice) is create some shareable code and documentation showing exactly how that’s done.

Here is how that might look in an .R script:


#
# Obtaining clean data in R from the `us_cities.xlsx` file given to us
# from our data provider
#

# First ensure that the `us_cities.xlsx` file is in your working directory
# (i.e., in the same place as this `us_cities.R` script)

# Load in packages
library(readxl)      # For the `read_excel()` function
library(tidyverse)   # For dplyr's `rename()` function
library(pointblank)  # For various data quality checks

# 1. Reading in the data from the Excel file
# - the data we need is on sheet 2
# - we have to skip to row 3 (that's where the column names are)
# - the correct column types are provided (we have to skip the reading
#   of the last two columns because that's just some analysis our
#   data provider did)
# - missing values in this file are consistently written as `N/A`; the
#   `na` arg here is given `"N/A"` so we get proper NAs

city_table <- 
  read_excel(
    "us_cities.xlsx",
    sheet = 2, 
    skip = 2,
    col_types = c(rep("text", 3), rep("numeric", 2), rep("skip", 2)),
    na = "N/A"
  )

# 2. Renaming column labels to avoid spaces
# - the renaming here fits better with our conventions and crucially
#   removes the space characters
city_table <- 
  city_table |>
  rename(
    state_id = `state id`,
    state_name = `state name`,
    pop_urb = `urban population`,
    pop_mun = `municipal population`
  )

# 3. Data quality checks
# - these pointblank checks ensure that everything is fine
# - if you get an error when running this, there is yet another
#   problem with this data; we'll then have to have yet another
#   difficult conversation with our data provider

# All rows are distinct
city_table |> rows_distinct()

# Should be no NAs in the first three columns
city_table |> rows_complete(columns = vars(city, state_id, state_name))

# First three columns are character, last two are numeric
city_table |> col_is_character(columns = vars(city, state_id, state_name))
city_table |> col_is_numeric(columns = vars(pop_urb, pop_mun))

# State ID values are all two letters long (and in capitals)
city_table |> col_vals_regex(columns = state_id, regex = "^[A-Z]{2}$")

# Ensure there are no spuriously large population values
city_table |> col_vals_lt(columns = pop_urb, value = 25E6, na_pass = TRUE)
city_table |> col_vals_lt(columns = pop_mun, value = 15E6, na_pass = TRUE)

Even if, in this hypothetical scenario, nobody else really needs this information, the situation might yet arise where you yourself will need it again (to obtain that dataset or perhaps something close to it). What I’m saying is that knowledge like this is valuable, and findings like those shown here should be written down, either for yourself or for someone else.