# PSEUDO CODE -- DO NOT RUN
# returns a list of data frames
import_list(here(path_data_raw, "example_linelist.xlsx")) Import Multiple Files/Sheets
Objectives
- Learn to efficiently import data from several files
- Learn to efficiently import data from several sheets in the same file
Setup
We expect you to be working in the course project. You can add this tutorial to the import_data.R script that you created in the Import data core session, or create a new script.
If you are working in a new script, add metadata about it at the top and load the {rio} and {here} packages. Then create a path_data_raw object that points to the directory data / raw of the course folder.
Introduction
In the import data core session your learned how to use {rio} package to import many file formats, and in particular .xlsx and .csv files, which will probably cover 90% of your needs. In this session we will tackle the very common problem of scattered data.
You may have already encountered these situations:
You have one Excel file with data in several sheets (linelist data, lab data, bed occupancy) and you need to import all datasets in R
You have several Excel files (linelist data, lab data, bed occupancy) with data to import.
You have an Excel file with data in several sheets, but this time all the sheets are formatted the same way, and you wish to gather the data in the same dataset.
You have multiple Excel files, all formatted the same (one linelist per site, for example) and you want to import and gather in the same data frame
The two first cases can be solved with several calls to the different sheets or files, and storing the data in different data frames. If there are not too many files or sheets, it’s fine. The two last cases should however be automatized!
The following lesson will help with all of them anyway: the {rio} package has a import_list() function that can import several items in one go. These items can be files or sheets depending on what you provide as arguments.
Import Multiple Sheets
The import_list() function can import all the sheets from an Excel file and store them into a list. It avoids you having to write multiple import() commands for each file separately.
Now imagine someone gave you lab data with the positive, negative and inconclusive scattered in three different sheets. Let’s import this data.
Go to the sub-folder data/raw/satellite_multiple/multiple_sheets. There are two modified version of the lab data. Open them and look at their structure. Do you see how the second file differs form the first one?
We will tackle the first file, as it is less messy.
Use the import_list() function to import all the sheets of msf_laboratory_moissala_2023-09-24_first.xlsxin one go, and store it in an object called lab_data_list.
The object lab_data_list is a list that contains several data frames (one data frame per sheet). Maybe you are not sure what to do with it yet. For now, let’s just peek at it. You can inspect its structure with well known tools:
str(lab_data_list) # Structure of the listList of 3
$ inconclusive:'data.frame': 58 obs. of 5 variables:
..$ MSF Number ID : num [1:58] 147 414 455 469 533 ...
..$ Laboratory id : chr [1:58] "BOU-4" "MOI-5" "DAN-12" "DAN-13" ...
..$ Date of the test : POSIXct[1:58], format: "2022-10-12 18:40:49" "2022-12-09 05:55:56" ...
..$ CT value : logi [1:58] NA NA NA NA NA NA ...
..$ Final Test Result: chr [1:58] "inconclusive" "inconclusive" "inconclusive" "inconclusive" ...
$ negative :'data.frame': 101 obs. of 5 variables:
..$ MSF Number ID : num [1:101] 13 397 416 490 767 ...
..$ Laboratory id : chr [1:101] "DAN-1" "DAN-9" "BOU-12" "BEK-12" ...
..$ Date of the test : POSIXct[1:101], format: "2022-09-10 21:25:04" "2022-11-21 20:02:44" ...
..$ CT value : logi [1:101] NA NA NA NA NA NA ...
..$ Final Test Result: chr [1:101] "negative" "negative" "negative" "negative" ...
$ positive :'data.frame': 888 obs. of 5 variables:
..$ MSF Number ID : num [1:888] 1 10 21 39 62 71 73 75 105 107 ...
..$ Laboratory id : chr [1:888] "BED-1" "KOU-1" "DAN-2" "BEK-1" ...
..$ Date of the test : POSIXct[1:888], format: "2022-08-20 00:00:00" "2022-09-04 15:21:36" ...
..$ CT value : num [1:888] 27 26.9 28.1 27.2 26.3 26 27.8 28.9 26.8 25.9 ...
..$ Final Test Result: chr [1:888] "positive" "positive" "positive" "positive" ...
length(lab_data_list) # Nb of items in the list[1] 3
names(lab_data_list) # Name of the items of the list[1] "inconclusive" "negative" "positive"
You can also go and click on the list in the Environment panel to inspect its structure, and open it in the Viewer, as we did for data frames.
The View is a bit different. Click on the buttons you see and see what happens.
Hint There are buttons that you can click on both on the left and right of the Viewer pane.
You can access elements of the list with these commands
head(lab_data_list$positive) # Access by name MSF Number ID Laboratory id Date of the test CT value Final Test Result
1 1 BED-1 2022-08-20 00:00:00 27.0 positive
2 10 KOU-1 2022-09-04 15:21:36 26.9 positive
3 21 DAN-2 2022-09-12 00:54:38 28.1 positive
4 39 BEK-1 2022-09-15 02:31:22 27.2 positive
5 62 BOU-1 2022-09-30 17:51:01 26.3 positive
6 71 BEK-2 2022-10-02 15:03:51 26.0 positive
head(lab_data_list[[1]]) # Access the content of the first element MSF Number ID Laboratory id Date of the test CT value Final Test Result
1 147 BOU-4 2022-10-12 18:40:49 NA inconclusive
2 414 MOI-5 2022-12-09 05:55:56 NA inconclusive
3 455 DAN-12 2022-11-22 17:09:53 NA inconclusive
4 469 DAN-13 2022-11-18 07:03:57 NA inconclusive
5 533 MOI-9 2023-01-05 03:38:26 NA inconclusive
6 882 DAN-21 2023-02-05 21:57:09 NA inconclusive
lab_data_list[1] # Access by position the first element of the list$inconclusive
MSF Number ID Laboratory id Date of the test CT value Final Test Result
1 147 BOU-4 2022-10-12 18:40:49 NA inconclusive
2 414 MOI-5 2022-12-09 05:55:56 NA inconclusive
3 455 DAN-12 2022-11-22 17:09:53 NA inconclusive
4 469 DAN-13 2022-11-18 07:03:57 NA inconclusive
5 533 MOI-9 2023-01-05 03:38:26 NA inconclusive
6 882 DAN-21 2023-02-05 21:57:09 NA inconclusive
7 1070 BEK-20 2023-01-29 19:46:36 NA inconclusive
8 1163 BEK-25 2023-02-08 01:12:21 NA inconclusive
9 1326 BED-11 2023-02-23 22:48:47 NA inconclusive
10 1600 DAN-31 2023-03-03 12:47:31 NA inconclusive
11 1869 MOI-50 2023-03-18 00:41:43 NA inconclusive
12 1948 MOI-56 2023-04-01 08:46:37 NA inconclusive
13 1987 BOU-42 2023-03-16 15:04:54 NA inconclusive
14 2242 MOI-73 2023-03-27 10:37:40 NA inconclusive
15 2268 MOI-74 2023-04-07 13:24:03 NA inconclusive
16 2449 BOU-48 2023-04-29 23:22:57 NA inconclusive
17 2536 MOI-91 2023-04-07 11:09:42 NA inconclusive
18 2928 MOI-124 2023-04-21 09:21:41 NA inconclusive
19 3068 MOI-136 2023-04-25 03:29:41 NA inconclusive
20 3580 MOI-166 2023-05-07 09:01:07 NA inconclusive
21 3759 MOI-175 2023-06-02 18:53:39 NA inconclusive
22 3820 MOI-177 2023-05-12 06:22:39 NA inconclusive
23 3928 BED-61 2023-06-09 07:45:48 NA inconclusive
24 4050 BOU-74 2023-06-18 12:26:59 NA inconclusive
25 4316 BED-70 2023-06-02 23:11:48 NA inconclusive
26 4331 BOU-80 2023-06-06 15:23:00 NA inconclusive
27 4609 MOI-216 2023-07-14 06:03:25 NA inconclusive
28 4680 MOI-225 2023-07-19 17:24:27 NA inconclusive
29 4721 MOI-229 2023-07-22 12:43:41 NA inconclusive
30 5025 BOU-86 2023-06-05 04:24:20 NA inconclusive
31 5111 MOI-249 2023-08-04 17:51:58 NA inconclusive
32 5283 MOI-261 2023-08-12 09:09:33 NA inconclusive
33 5337 BED-91 2023-06-28 16:46:56 NA inconclusive
34 5634 MOI-281 2023-08-27 08:31:02 NA inconclusive
35 5844 MOI-297 2023-06-23 13:13:51 NA inconclusive
36 6142 BOU-105 2023-09-06 17:26:24 NA inconclusive
37 6591 MOI-346 2023-09-26 10:45:51 NA inconclusive
38 6675 MOI-353 2023-08-02 07:21:04 NA inconclusive
39 6788 BOU-113 2023-08-31 16:29:50 NA inconclusive
40 6945 MOI-369 2023-07-13 08:48:11 NA inconclusive
41 7236 BED-123 2023-07-31 09:45:47 NA inconclusive
42 7314 MOI-395 2023-08-12 11:32:27 NA inconclusive
43 7594 MOI-417 2023-08-15 09:00:03 NA inconclusive
44 7743 MOI-432 2023-09-22 11:59:36 NA inconclusive
45 7776 MOI-437 2023-10-21 17:40:38 NA inconclusive
46 7876 MOI-442 2023-09-18 00:03:53 NA inconclusive
47 7900 MOI-445 2023-08-06 18:07:45 NA inconclusive
48 8076 MOI-454 2023-10-01 19:23:50 NA inconclusive
49 8317 MOI-471 2023-08-09 17:52:31 NA inconclusive
50 8332 MOI-473 2023-08-21 03:51:30 NA inconclusive
51 8772 MOI-496 2023-09-25 00:47:21 NA inconclusive
52 8834 MOI-502 2023-10-20 12:12:01 NA inconclusive
53 9089 BED-156 2023-10-22 04:24:13 NA inconclusive
54 9090 MOI-519 2023-10-15 19:57:32 NA inconclusive
55 9564 MOI-550 2023-10-15 12:26:19 NA inconclusive
56 9859 BED-166 2023-11-06 09:27:17 NA inconclusive
57 9917 MOI-585 2023-11-12 22:42:52 NA inconclusive
58 9917 MOI-585_retest 2023-11-14 22:42:52 NA inconclusive
Bind Into a data frame
If you know that the data in all the sheets have the same structure, you can tell R to bind all of these tables into a single data frame (row wise) by setting the argument rbind to TRUE:
# PSEUDO CODE -- DO NOT RUN
# Returns a single data frame:
import_list(
here(path_data_raw, "example_linelist.xlsx"),
rbind = TRUE, # Tells R to bind list elements together
rbind_label = "Sheet_position" # Optional: give the sheet name of the source of data
)Use the import_list() function to import all the sheets of msf_laboratory_moissala_2023-09-24_first.xlsxin one go, and store it in an object called lab_data, which is a single data frame containing all the data.
You may use the rbind_label = argument and provide a string for the column name, or not and see what happens.
Slightly More Messy Sheets
Now, let’s focus on the second data frame provided, msf_laboratory_moissala_2023-09-24_second.xlsx. By now you should have realized that this data is more messy:
- the
CT valuecolumn exists only in the third sheet - the
Laboratory idis not in the same position in the three sheets - The date of the test does not have the same name in all sheets
Let’s see how the import_list() function behaves in these conditions:
lab_data_messy <- import_list(
here(path_data_raw, "multiple_sheets",
"msf_laboratory_moissala_2023-09-24_second.xlsx"),
rbind = TRUE,
rbind_label = "Sheet_position")Run the above code and use the tools that you know to inspect the structure and content of the lab_data_messy object.
- What happened to the
CT valuecolumns? Was it imported? If yes, how was it filled? - What happened to the
Laboratory idcolumns? Was it imported? If yes, how was it filled? - What happened to the dates columns? Were they ignored?
As you can see the function behaves reasonably:
- columns are matched by name, so if a column exists in all sheets and has the same name, the data is gathered in the same column, independently of its position
- the corollary is that columns with different names are treated as different columns, so check your names!
- if a column exists in one sheet and not the others, the column will exist in the final data frame, but the values will be
NAfor the rows coming from sheets without the column (don’t remember whatNAmeans?).
You are now ready to import files with multiple sheets! Now let’s tackle data scattered in several files.
Import Multiple files
Fortunately, the import_list() function can also accept a vector of file names. In this case, it will import each of these files and store them into a list, unless you can use rbind = TRUE.
# PSEUDO CODE -- DO NOT RUN
# Create a vector of (fictitious) files
vector_paths <- c(
here(path_data_raw, "example_linelist_A.xlsx"),
here(path_data_raw, "example_linelist_B.xlsx"),
here(path_data_raw, "example_linelist_C.xlsx"),
here(path_data_raw, "example_linelist_D.xlsx"),
here(path_data_raw, "example_linelist_E.xlsx"),
here(path_data_raw, "example_linelist_F.xlsx"),
here(path_data_raw, "example_linelist_G.xlsx"),
here(path_data_raw, "example_linelist_H.xlsx"),
here(path_data_raw, "example_linelist_I.xlsx"),
here(path_data_raw, "example_linelist_J.xlsx"),
here(path_data_raw, "example_linelist_K.xlsx"),
here(path_data_raw, "example_linelist_L.xlsx"),
here(path_data_raw, "example_linelist_M.xlsx")
)
import_list(vector_paths) # returns a listGo to the sub-folder data / raw /satellite_multiple/multiple_files/EN. Open the files to check their content. Would it make sense to concatenate the data together at import in this case?
Import all the (English) Excel files in one go and store them in a data frame called linelist_data.
Inspect the data frame created: how many lines? How many columns? Was everything imported as you expected?
Get File Paths with file.path()
List All the Files
Manually listing the files that you want to import was fine for demonstrations purposes so that you understand what to pass to the import_list() function, but it is slow and error prone.
We will now use the function list.files() to automatically retrieve the file paths. Its main argument, path takes a file path, and the function returns a simple vector of file names.
path_mult <- here::here(path_data_raw, "multiple_files", "EN") # optional, to shorten the following code
list.files(path = path_mult) # returns a vector of file names [1] "a_document.txt" "a_non_data_file.txt"
[3] "also_not_data.doc" "moissala_linelist_EN_BEDAYA.xlsx"
[5] "moissala_linelist_EN_BEDJONDO.xlsx" "moissala_linelist_EN_BEKOUROU.xlsx"
[7] "moissala_linelist_EN_BOUNA.xlsx" "moissala_linelist_EN_GOUNDI.xlsx"
[9] "moissala_linelist_EN_KOUMRA.xlsx" "moissala_linelist_EN_MOISSALA.xlsx"
We can ask the function to return the full paths of the files with the argument full.names = TRUE:
list.files(path = path_mult,
full.names = TRUE) [1] "/tmp/RtmpRJBDYU/file780420143dab/data/raw/satellite_multiple/multiple_files/EN/a_document.txt"
[2] "/tmp/RtmpRJBDYU/file780420143dab/data/raw/satellite_multiple/multiple_files/EN/a_non_data_file.txt"
[3] "/tmp/RtmpRJBDYU/file780420143dab/data/raw/satellite_multiple/multiple_files/EN/also_not_data.doc"
[4] "/tmp/RtmpRJBDYU/file780420143dab/data/raw/satellite_multiple/multiple_files/EN/moissala_linelist_EN_BEDAYA.xlsx"
[5] "/tmp/RtmpRJBDYU/file780420143dab/data/raw/satellite_multiple/multiple_files/EN/moissala_linelist_EN_BEDJONDO.xlsx"
[6] "/tmp/RtmpRJBDYU/file780420143dab/data/raw/satellite_multiple/multiple_files/EN/moissala_linelist_EN_BEKOUROU.xlsx"
[7] "/tmp/RtmpRJBDYU/file780420143dab/data/raw/satellite_multiple/multiple_files/EN/moissala_linelist_EN_BOUNA.xlsx"
[8] "/tmp/RtmpRJBDYU/file780420143dab/data/raw/satellite_multiple/multiple_files/EN/moissala_linelist_EN_GOUNDI.xlsx"
[9] "/tmp/RtmpRJBDYU/file780420143dab/data/raw/satellite_multiple/multiple_files/EN/moissala_linelist_EN_KOUMRA.xlsx"
[10] "/tmp/RtmpRJBDYU/file780420143dab/data/raw/satellite_multiple/multiple_files/EN/moissala_linelist_EN_MOISSALA.xlsx"
Remember that thanks to {here} and relative paths within the course directory, these absolute paths will adapt to the file structure and work on your computer.
Keep Only the Excel files
Now, this is nice and all but there are those pesky files that are not Excel which we would like to avoid. An easy way is to use the pattern argument of list.files(). It takes a string of text that is matched to the names of files, to keep only the ones you want.
The pattern argument accepts regular expressions, or regex for short, which are way out of our scope here. Here we are only giving you pointers to create a simple pattern.
The function will return file names that match the pattern:
# All file names that contain "bedaya":
list.files(path = path_mult, pattern = "BEDAYA") [1] "moissala_linelist_EN_BEDAYA.xlsx"
# All file names that contain "msf_linelist":
list.files(path = path_mult, pattern = "moissala_linelist") [1] "moissala_linelist_EN_BEDAYA.xlsx" "moissala_linelist_EN_BEDJONDO.xlsx"
[3] "moissala_linelist_EN_BEKOUROU.xlsx" "moissala_linelist_EN_BOUNA.xlsx"
[5] "moissala_linelist_EN_GOUNDI.xlsx" "moissala_linelist_EN_KOUMRA.xlsx"
[7] "moissala_linelist_EN_MOISSALA.xlsx"
# All file names that contain "msf_ll":
list.files(path = path_mult, pattern = "msf_ll") # Nothing matchescharacter(0)
# All the files that contain "doc"
list.files(path = path_mult, pattern = "doc")[1] "a_document.txt" "also_not_data.doc"
Did you notice at the end, two files contained “doc” in their path, one in its name, the other in its file extension? If you want to select only the Word document, you need a more precise pattern.
# Get this word document for sure
list.files(path = path_mult, pattern = "\\.doc$") [1] "also_not_data.doc"
Here we modified the pattern to make it more restrictive:
- the
\\.at the start is the regex way of saying that there should be a “.” before “doc”. - the
$at the end of the pattern indicates that this pattern should be at the end of the file name.
So with this pattern we are looking for files that contain “.doc” at the end of their names.
Now, write a command using list.files() to automatically list all files that end with “.xlsx”. Then, update your previous code to import these files in one data frame, as we did before.
The argument recursive = TRUE allows you to list files present in sub-directories of your path target, which can be handy if you have nested folders containing data.
Done!
Congratulation, you can now import several files or sheets in one go! It will be very useful to some of the aggregated data files “by week” or “by month” that you probably are familiar with. And maybe this was your first contact with regex, which is a whole subject, so don’t worry if that part was complicated, this is something you can learn latern when you need it.
Going Further
An introduction to regular expressions