Import Multiple Files/Sheets

Satellite

Data Import

Learn how to import multiple sheets from an Excel file, or several Excel files.

Objectives

Learn to efficiently import data from several files
Learn to efficiently import data from several sheets in the same file

Setup

Download and decompress the below folder containing data and an .Rproj file for this session.

Navigate into the new folder and create a new script called import_multiple.R. Add an appropriate header to the script and load {rio} and {here}. Create a path_data_raw object that points to the directory data/raw of the course folder.

Introduction

In the import data core session your learned how to use {rio} package to import many file formats, and in particular .xlsx and .csv files, which will probably cover 90% of your needs. In this session we will tackle the very common problem of scattered data.

You may have already encountered these situations:

You have one Excel file with data in several sheets (linelist data, lab data, bed occupancy) and you need to import all datasets in R
You have several Excel files (linelist data, lab data, bed occupancy) with data to import.
You have an Excel file with data in several sheets, but this time all the sheets are formatted the same way, and you wish to gather the data in the same dataset.
You have multiple Excel files, all formatted the same (one linelist per site, for example) and you want to import and gather in the same data frame

The two first cases can be solved with several calls to the different sheets or files, and storing the data in different data frames. If there are not too many files or sheets, it’s fine. The two last cases should however be automatized!

The following lesson will help with all of them anyway: the {rio} package has a import_list() function that can import several items in one go. These items can be files or sheets depending on what you provide as arguments.

Import Multiple Sheets

The import_list() function can import all the sheets from an Excel file and store them into a list. It avoids you having to write multiple import() commands for each file separately.

# PSEUDO CODE -- DO NOT RUN

# Returns a list of data frames:
import_list(here(path_data_raw, "example_linelist.xlsx"))

Now imagine someone gave you lab data with the positive, negative and inconclusive scattered in three different sheets. Let’s import this data.

Go to the sub-folder data/raw/satellite_multiple/multiple_sheets. There are two modified version of the lab data. Open them and look at their structure. Do you see how the second file differs form the first one?

We will tackle the first file, as it is less messy.

Use the import_list() function to import all the sheets of msf_laboratory_moissala_2023-09-24_first.xlsxin one go, and store it in an object called lab_data_list.

The object lab_data_list is a list that contains several data frames (one data frame per sheet). Maybe you are not sure what to do with it yet. For now, let’s just peek at it. You can inspect its structure with well known tools:

# Structure of the list:
str(lab_data_list)

List of 3
 $ inconclusive:'data.frame':   58 obs. of  5 variables:
  ..$ MSF Number ID    : num [1:58] 147 414 455 469 533 ...
  ..$ Laboratory id    : chr [1:58] "BOU-4" "MOI-5" "DAN-12" "DAN-13" ...
  ..$ Date of the test : POSIXct[1:58], format: "2022-10-12 18:40:49" "2022-12-09 05:55:56" ...
  ..$ CT value         : logi [1:58] NA NA NA NA NA NA ...
  ..$ Final Test Result: chr [1:58] "inconclusive" "inconclusive" "inconclusive" "inconclusive" ...
 $ negative    :'data.frame':   101 obs. of  5 variables:
  ..$ MSF Number ID    : num [1:101] 13 397 416 490 767 ...
  ..$ Laboratory id    : chr [1:101] "DAN-1" "DAN-9" "BOU-12" "BEK-12" ...
  ..$ Date of the test : POSIXct[1:101], format: "2022-09-10 21:25:04" "2022-11-21 20:02:44" ...
  ..$ CT value         : logi [1:101] NA NA NA NA NA NA ...
  ..$ Final Test Result: chr [1:101] "negative" "negative" "negative" "negative" ...
 $ positive    :'data.frame':   888 obs. of  5 variables:
  ..$ MSF Number ID    : num [1:888] 1 10 21 39 62 71 73 75 105 107 ...
  ..$ Laboratory id    : chr [1:888] "BED-1" "KOU-1" "DAN-2" "BEK-1" ...
  ..$ Date of the test : POSIXct[1:888], format: "2022-08-20 00:00:00" "2022-09-04 15:21:36" ...
  ..$ CT value         : num [1:888] 27 26.9 28.1 27.2 26.3 26 27.8 28.9 26.8 25.9 ...
  ..$ Final Test Result: chr [1:888] "positive" "positive" "positive" "positive" ...

# Nb of items in the list:
length(lab_data_list)

[1] 3

# Name of the items of the list:
names(lab_data_list)

[1] "inconclusive" "negative"     "positive"

You can also go and click on the list in the Environment panel to inspect its structure, and open it in the Viewer, as we did for data frames.

The View is a bit different. Click on the buttons you see and see what happens.

Hint: There are buttons that you can click on both on the left and right of the Viewer pane, but the ones on the right appear only when you hover your mouse over a line.

You can access elements of the list with these commands

# Access by name:
head(lab_data_list$positive)

  MSF Number ID Laboratory id    Date of the test CT value Final Test Result
1             1         BED-1 2022-08-20 00:00:00     27.0          positive
2            10         KOU-1 2022-09-04 15:21:36     26.9          positive
3            21         DAN-2 2022-09-12 00:54:38     28.1          positive
4            39         BEK-1 2022-09-15 02:31:22     27.2          positive
5            62         BOU-1 2022-09-30 17:51:01     26.3          positive
6            71         BEK-2 2022-10-02 15:03:51     26.0          positive

# Access the content of the first element:
head(lab_data_list[[1]])

  MSF Number ID Laboratory id    Date of the test CT value Final Test Result
1           147         BOU-4 2022-10-12 18:40:49       NA      inconclusive
2           414         MOI-5 2022-12-09 05:55:56       NA      inconclusive
3           455        DAN-12 2022-11-22 17:09:53       NA      inconclusive
4           469        DAN-13 2022-11-18 07:03:57       NA      inconclusive
5           533         MOI-9 2023-01-05 03:38:26       NA      inconclusive
6           882        DAN-21 2023-02-05 21:57:09       NA      inconclusive

# Access by position the first element of the list:
lab_data_list[1]

$inconclusive
   MSF Number ID  Laboratory id    Date of the test CT value Final Test Result
1            147          BOU-4 2022-10-12 18:40:49       NA      inconclusive
2            414          MOI-5 2022-12-09 05:55:56       NA      inconclusive
3            455         DAN-12 2022-11-22 17:09:53       NA      inconclusive
4            469         DAN-13 2022-11-18 07:03:57       NA      inconclusive
5            533          MOI-9 2023-01-05 03:38:26       NA      inconclusive
6            882         DAN-21 2023-02-05 21:57:09       NA      inconclusive
7           1070         BEK-20 2023-01-29 19:46:36       NA      inconclusive
8           1163         BEK-25 2023-02-08 01:12:21       NA      inconclusive
9           1326         BED-11 2023-02-23 22:48:47       NA      inconclusive
10          1600         DAN-31 2023-03-03 12:47:31       NA      inconclusive
11          1869         MOI-50 2023-03-18 00:41:43       NA      inconclusive
12          1948         MOI-56 2023-04-01 08:46:37       NA      inconclusive
13          1987         BOU-42 2023-03-16 15:04:54       NA      inconclusive
14          2242         MOI-73 2023-03-27 10:37:40       NA      inconclusive
15          2268         MOI-74 2023-04-07 13:24:03       NA      inconclusive
16          2449         BOU-48 2023-04-29 23:22:57       NA      inconclusive
17          2536         MOI-91 2023-04-07 11:09:42       NA      inconclusive
18          2928        MOI-124 2023-04-21 09:21:41       NA      inconclusive
19          3068        MOI-136 2023-04-25 03:29:41       NA      inconclusive
20          3580        MOI-166 2023-05-07 09:01:07       NA      inconclusive
21          3759        MOI-175 2023-06-02 18:53:39       NA      inconclusive
22          3820        MOI-177 2023-05-12 06:22:39       NA      inconclusive
23          3928         BED-61 2023-06-09 07:45:48       NA      inconclusive
24          4050         BOU-74 2023-06-18 12:26:59       NA      inconclusive
25          4316         BED-70 2023-06-02 23:11:48       NA      inconclusive
26          4331         BOU-80 2023-06-06 15:23:00       NA      inconclusive
27          4609        MOI-216 2023-07-14 06:03:25       NA      inconclusive
28          4680        MOI-225 2023-07-19 17:24:27       NA      inconclusive
29          4721        MOI-229 2023-07-22 12:43:41       NA      inconclusive
30          5025         BOU-86 2023-06-05 04:24:20       NA      inconclusive
31          5111        MOI-249 2023-08-04 17:51:58       NA      inconclusive
32          5283        MOI-261 2023-08-12 09:09:33       NA      inconclusive
33          5337         BED-91 2023-06-28 16:46:56       NA      inconclusive
34          5634        MOI-281 2023-08-27 08:31:02       NA      inconclusive
35          5844        MOI-297 2023-06-23 13:13:51       NA      inconclusive
36          6142        BOU-105 2023-09-06 17:26:24       NA      inconclusive
37          6591        MOI-346 2023-09-26 10:45:51       NA      inconclusive
38          6675        MOI-353 2023-08-02 07:21:04       NA      inconclusive
39          6788        BOU-113 2023-08-31 16:29:50       NA      inconclusive
40          6945        MOI-369 2023-07-13 08:48:11       NA      inconclusive
41          7236        BED-123 2023-07-31 09:45:47       NA      inconclusive
42          7314        MOI-395 2023-08-12 11:32:27       NA      inconclusive
43          7594        MOI-417 2023-08-15 09:00:03       NA      inconclusive
44          7743        MOI-432 2023-09-22 11:59:36       NA      inconclusive
45          7776        MOI-437 2023-10-21 17:40:38       NA      inconclusive
46          7876        MOI-442 2023-09-18 00:03:53       NA      inconclusive
47          7900        MOI-445 2023-08-06 18:07:45       NA      inconclusive
48          8076        MOI-454 2023-10-01 19:23:50       NA      inconclusive
49          8317        MOI-471 2023-08-09 17:52:31       NA      inconclusive
50          8332        MOI-473 2023-08-21 03:51:30       NA      inconclusive
51          8772        MOI-496 2023-09-25 00:47:21       NA      inconclusive
52          8834        MOI-502 2023-10-20 12:12:01       NA      inconclusive
53          9089        BED-156 2023-10-22 04:24:13       NA      inconclusive
54          9090        MOI-519 2023-10-15 19:57:32       NA      inconclusive
55          9564        MOI-550 2023-10-15 12:26:19       NA      inconclusive
56          9859        BED-166 2023-11-06 09:27:17       NA      inconclusive
57          9917        MOI-585 2023-11-12 22:42:52       NA      inconclusive
58          9917 MOI-585_retest 2023-11-14 22:42:52       NA      inconclusive

Bind Into a data frame

If you know that the data in all the sheets have the same structure, you can tell R to bind all of these tables into a single data frame (row wise) by setting the argument rbind to TRUE:

# PSEUDO CODE -- DO NOT RUN

# Returns a single data frame:
import_list(
  here(path_data_raw, "example_linelist.xlsx"), 
  rbind = TRUE,                    # Tells R to bind list elements together
  rbind_label = "Sheet_position"   # Optional: give the sheet name of the source of data
)

Use the import_list() function to import all the sheets of msf_laboratory_moissala_2023-09-24_first.xlsx in one go, and store it in an object called lab_data, which is a single data frame containing all the data.

You may use the rbind_label = argument and provide a string for the column name, or not and see what happens.

Slightly More Messy Sheets

Now, let’s focus on the second data frame provided, msf_laboratory_moissala_2023-09-24_second.xlsx. By now you should have realized that this data is more messy:

the CT value column exists only in the third sheet
the Laboratory id is not in the same position in the three sheets
The date of the test does not have the same name in all sheets

Let’s see how the import_list() function behaves in these conditions:

lab_data_messy <- import_list(
  here(path_data_raw, "multiple_sheets",
       "msf_laboratory_moissala_2023-09-24_second.xlsx"),
  rbind = TRUE,
  rbind_label = "Sheet_position")

Run the above code and use the tools that you know to inspect the structure and content of the lab_data_messy object.

What happened to the CT value columns? Was it imported? If yes, how was it filled?
What happened to the Laboratory id columns? Was it imported? If yes, how was it filled?
What happened to the dates columns? Were they ignored?

As you can see the function behaves reasonably:

columns are matched by name, so if a column exists in all sheets and has the same name, the data is gathered in the same column, independently of its position
the corollary is that columns with different names are treated as different columns, so check your names!
if a column exists in one sheet and not the others, the column will exist in the final data frame, but the values will be NA for the rows coming from sheets without the column (don’t remember what NA means?).

You are now ready to import files with multiple sheets! Now let’s tackle data scattered in several files.

Import Multiple files

Fortunately, the import_list() function can also accept a vector of file names. In this case, it will import each of these files and store them into a list, unless you can use rbind = TRUE.

# PSEUDO CODE -- DO NOT RUN

# Create a vector of (fictitious) files
vector_paths <- c(
  here(path_data_raw, "example_linelist_A.xlsx"),
  here(path_data_raw, "example_linelist_B.xlsx"),
  here(path_data_raw, "example_linelist_C.xlsx"),
  here(path_data_raw, "example_linelist_D.xlsx"),
  here(path_data_raw, "example_linelist_E.xlsx"),
  here(path_data_raw, "example_linelist_F.xlsx"),
  here(path_data_raw, "example_linelist_G.xlsx"),
  here(path_data_raw, "example_linelist_H.xlsx"),
  here(path_data_raw, "example_linelist_I.xlsx"),
  here(path_data_raw, "example_linelist_J.xlsx"),
  here(path_data_raw, "example_linelist_K.xlsx"),
  here(path_data_raw, "example_linelist_L.xlsx"),
  here(path_data_raw, "example_linelist_M.xlsx")
)

import_list(vector_paths) # returns a list

Go to the sub-folder data/raw/multiple_files/. Open the Excel files to check their content. Would it make sense to concatenate the data together at import in this case?

Import all the (English) Excel files in one go and store them in a data frame called linelist_data.

Inspect the data frame created: how many lines? How many columns? Was everything imported as you expected?

Get File Paths with `file.path()`

List All the Files

Manually listing the files that you want to import was fine for demonstrations purposes so that you understand what to pass to the import_list() function, but it is slow and error prone.

We will now use the function list.files() to automatically retrieve the file paths. Its main argument, path takes a file path, and the function returns a simple vector of file names.

path_mult <- here::here(path_data_raw, "multiple_files") # optional, to shorten the following code

list.files(path = path_mult) # returns a vector of file names

 [1] "a_document.txt"                     "a_non_data_file.txt"               
 [3] "also_not_data.doc"                  "moissala_linelist_EN_BEDAYA.xlsx"  
 [5] "moissala_linelist_EN_BEDJONDO.xlsx" "moissala_linelist_EN_BEKOUROU.xlsx"
 [7] "moissala_linelist_EN_BOUNA.xlsx"    "moissala_linelist_EN_GOUNDI.xlsx"  
 [9] "moissala_linelist_EN_KOUMRA.xlsx"   "moissala_linelist_EN_MOISSALA.xlsx"

We can ask the function to return the full paths of the files with the argument full.names = TRUE:

list.files(path = path_mult,
           full.names = TRUE)

 [1] "/tmp/RtmpRJBDYU/file78043d9d3c3c/data/session-specific/import-multiple/satellite_multiple_EN/data/raw/multiple_files/a_document.txt"                    
 [2] "/tmp/RtmpRJBDYU/file78043d9d3c3c/data/session-specific/import-multiple/satellite_multiple_EN/data/raw/multiple_files/a_non_data_file.txt"               
 [3] "/tmp/RtmpRJBDYU/file78043d9d3c3c/data/session-specific/import-multiple/satellite_multiple_EN/data/raw/multiple_files/also_not_data.doc"                 
 [4] "/tmp/RtmpRJBDYU/file78043d9d3c3c/data/session-specific/import-multiple/satellite_multiple_EN/data/raw/multiple_files/moissala_linelist_EN_BEDAYA.xlsx"  
 [5] "/tmp/RtmpRJBDYU/file78043d9d3c3c/data/session-specific/import-multiple/satellite_multiple_EN/data/raw/multiple_files/moissala_linelist_EN_BEDJONDO.xlsx"
 [6] "/tmp/RtmpRJBDYU/file78043d9d3c3c/data/session-specific/import-multiple/satellite_multiple_EN/data/raw/multiple_files/moissala_linelist_EN_BEKOUROU.xlsx"
 [7] "/tmp/RtmpRJBDYU/file78043d9d3c3c/data/session-specific/import-multiple/satellite_multiple_EN/data/raw/multiple_files/moissala_linelist_EN_BOUNA.xlsx"   
 [8] "/tmp/RtmpRJBDYU/file78043d9d3c3c/data/session-specific/import-multiple/satellite_multiple_EN/data/raw/multiple_files/moissala_linelist_EN_GOUNDI.xlsx"  
 [9] "/tmp/RtmpRJBDYU/file78043d9d3c3c/data/session-specific/import-multiple/satellite_multiple_EN/data/raw/multiple_files/moissala_linelist_EN_KOUMRA.xlsx"  
[10] "/tmp/RtmpRJBDYU/file78043d9d3c3c/data/session-specific/import-multiple/satellite_multiple_EN/data/raw/multiple_files/moissala_linelist_EN_MOISSALA.xlsx"

Note

Remember that thanks to {here} and relative paths within the course directory, these absolute paths will adapt to the file structure and work on your computer.

Keep Only the Excel files

Now, this is nice and all but there are those pesky files that are not Excel which we would like to avoid. An easy way is to use the pattern argument of list.files(). It takes a string of text that is matched to the names of files, to keep only the ones you want.

The pattern argument accepts regular expressions, or regex for short, which are way out of our scope here. Here we are only giving you pointers to create a simple pattern.

The function will return file names that match the pattern:

# All file names that contain "bedaya":
list.files(path = path_mult, pattern = "BEDAYA")

[1] "moissala_linelist_EN_BEDAYA.xlsx"

# All file names that contain "msf_linelist":
list.files(path = path_mult, pattern = "moissala_linelist")

[1] "moissala_linelist_EN_BEDAYA.xlsx"   "moissala_linelist_EN_BEDJONDO.xlsx"
[3] "moissala_linelist_EN_BEKOUROU.xlsx" "moissala_linelist_EN_BOUNA.xlsx"   
[5] "moissala_linelist_EN_GOUNDI.xlsx"   "moissala_linelist_EN_KOUMRA.xlsx"  
[7] "moissala_linelist_EN_MOISSALA.xlsx"

# All file names that contain "msf_ll":
list.files(path = path_mult, pattern = "msf_ll")  # Nothing matches

character(0)

# All the files that contain "doc":
list.files(path = path_mult, pattern = "doc")

[1] "a_document.txt"    "also_not_data.doc"

Did you notice at the end, two files contained “doc” in their path, one in its name, the other in its file extension? If you want to select only the Word document, you need a more precise pattern.

# Get this word document for sure
list.files(path = path_mult, pattern = "\\.doc$")

[1] "also_not_data.doc"

Here we modified the pattern to make it more restrictive:

the \\. at the start is the regex way of saying that there should be a “.” before “doc”.
the $ at the end of the pattern indicates that this pattern should be at the end of the file name.

So with this pattern we are looking for files that contain “.doc” at the end of their names.

Now, write a command using list.files() to automatically list all files that end with “.xlsx”. Then, update your previous code to import these files in one data frame, as we did before.

Tip

The argument recursive = TRUE allows you to list files present in sub-directories of your path target, which can be handy if you have nested folders containing data.

Done!

Congratulation, you can now import several files or sheets in one go! It will be very useful to some of the aggregated data files “by week” or “by month” that you probably are familiar with. And maybe this was your first contact with regex, which is a whole subject, so don’t worry if that part was complicated, this is something you can learn latern when you need it.

Going Further

An introduction to regular expressions

Objectives

Setup

Introduction

Import Multiple Sheets

Bind Into a data frame

Slightly More Messy Sheets

Import Multiple files

Get File Paths with file.path()

List All the Files

Keep Only the Excel files

Done!

Going Further

Get File Paths with `file.path()`