Data Manipulation, Basics
{dplyr}
Objectives
- Learn basic data verbs from
{dplyr}
to:- Select specific columns (
select()
) - Rename columns (
rename()
) - Add new columns and change existing ones (
mutate()
) - Remove duplicate observations
- Select specific columns (
- Understand the pipe operator
|>
Setup
Dependencies. This session assumes that you know how to use RStudio and that you are able to import data. If you need a refresher on either of these topics, we encourage you to review the first two sessions in the learning pathway.
This session will work with the raw Moissala linelist data, which can be downloaded here:
Make sure this dataset is saved into the appropriate subdirectory of your R project and create a new script called data_verbs_practice.R
in your R
directory. Add an appropriate header and load the following packages: {here}
, {rio}
, and {tidyverse}
.
Finally, add an import section where you use {here}
and {rio}
to load your data into an object called df_raw
.
Manipulating Data with {dplyr}
Now that we know how to set up a project and import data, we can finally start to play around with it. Going forward we will be using several packages from the “tidyverse” to help us manipulate, summarize, and visualize our data. Today’s session will focus on data manipulation using a package called {dplyr}
.
What is {dplyr}
Data manipulation is the foundation of working with data in R and as such is foundational to the work we do as epidemiologists. In particular, data manipulation skills will be critical when trying to clean our data.
In R, the package {dplyr}
provides a large number of functions to help us manipulate data frames and perform many of the tasks that we will need to use on a daily basis, for example:
- Subsetting our data to remove certain variables
- Renaming certain variables
- Adding or modifying a variable
- Removing duplicate entries
In {dplyr}
each of these actions can be done with a particular function, which typically have an intuitive verb for a name. For example, renaming columns will use the function rename()
.
In today’s session we will look at the “data manipulation verb”, ie the function, needed for each of the above tasks as well as how to chain them all together into an efficient data pipeline.
You may have noticed that we asked you to load a package called {tidyverse}
rather than {dplyr}
in the setup. Loading {tidyverse}
will load several of the most useful packages from the broader tidyverse, including {dplyr}
and a couple other packages that we will see later in the session.
Basic Data Verbs
Selecting Specific Columns
A lot of the time when we receive a dataset it will have extra columns that we don’t need, either because those columns contain sensitive data or because our analysis will only focus on a subset of the data. This is where a function like select()
comes in handy.
Here is the basic syntax, note that this is pseudo-code and isn’t something you are intended to run yourself.
# DO NOT RUN (PSEUDO-CODE)
select(df_raw, first_column_to_keep, second_column_to_keep)
Here we see that the first argument is our dataset and each subsequent argument is the name of a column that we would like to keep. In the tidyverse, variables (ie column names) don’t need to be put into quotation marks. So for example, if we want to select the columns id
, sex
, and age
we can use the following:
select(df_raw, id, sex, age)
Use select()
to select the following variables in your dataset: id
, sex
, age
, sub_prefecture
, date_onset
, and outcome
. The head of your output should look something like this:
id sex age date_onset outcome
1 1 femme 36 2022-08-13 recovered
2 2 f 5 2022-08-18 <NA>
3 3 f 156 2022-08-17 recovered
4 6 homme 8 2022-08-22 recovered
5 7 m 7 2022-08-30 recovered
6 10 m 4 2022-08-30 recovered
Take a look at this output and then at df_raw
. We can see that df_raw
still contains all of the columns, which is what we want. But can you tell why it didn’t change?
Often, we want to keep most of the variables in our dataset and only remove one or two. We can use the above syntax to do this, but it can become pretty tedious to write out every column name. In these cases, instead of telling select what to **keep**, we can use a subtraction sign (
-) to tell it what to **remove**. For example, if we wanted to remove the
village_commune` column from our dataframe we can use the following:
select(df_raw, -village_commune)
Way easier!
Use the -
syntax in select()
to select all of the columns in df_raw
except: full_name
and age_unit
from your dataset.
Renaming Columns
Another common issue when we get new datasets is that the variable names are inconvenient. In those cases, rename()
can work wonders. Here’s the basic syntax:
# DO NOT RUN (PSEUDO CODE)
rename(df,
new_column_name = old_column_name,
another_new_name = another_old_name)
As in the case of select()
, and indeed in essentially all {dplyr}
verbs, the first argument is our daframe. Then each subsequent argument is a statement of new_column_name = old_column_name
telling R which columns to rename and the new names that we want to use, with each pair given its own line to improve readability. If we wanted to change village_commune
to simply be village
, for example, we can write:
rename(df_raw,
village = village_commune)
Use rename()
on df_raw
to change the columns sub_prefecture
, village_commune
, and health_facility_name
to be prefecture
, village
, and facility
respectively.
In the above exercise it may have been difficult to check if the output looked correct because R would have printed out the full data frame. In these cases it can be helpful to create a temporary object just to check if everything looks alright. You can call this object whatever you want, but a common name is tmp
.
Repeat the last exercise but this time assign the output to an object called tmp
and use names()
to check that the column names changed as you expected. The output of names()
should give you something like this:
[1] "id" "full_name" "sex"
[4] "age" "age_unit" "region"
[7] "prefecture" "village" "date_onset"
[10] "date_consultation" "hospitalisation" "date_admission"
[13] "facility" "malaria_rdt" "fever"
[16] "rash" "cough" "red_eye"
[19] "pneumonia" "encephalitis" "muac"
[22] "vacc_status" "vacc_doses" "outcome"
[25] "date_outcome"
Temporary objects, like the tmp
data frame you just created are just that: temporary. They are usually used to test if something has worked and designed to be overwritten each time you need to test something else. As such, you should not use these temporary objects as the input for other parts of your code. If you want to make a data frame that will be reused, such as a clean version of df_raw
, this should be done using an object with a proper name like df
or df_clean
.
Changing and Adding Columns
So now we know how to select and rename columns, but how do we modify them? This is where mutate()
comes into play. This function can be used both to add new columns and to change existing ones.
Let’s start with the basic mutate()
syntax needed to add a new column:
# DO NOT RUN (PSEUDO-CODE)
mutate(df,
new_column = action(existing_column),
another_new_column = another_action(another_existing_column))
In the above code, we are creating a new column (new_column
) by performing some sort of action (action()
) on an existing column in the dataframe (existing_column
). This action could be anything, it could use a function or be a simple arithmetic operation and can use one or more columns. For example, if we wanted to create a new column expressing MUAC in cm we could use the following:
mutate(df_raw,
muac_cm = muac / 100)
Use mutate()
to create a new column called age_years
that expresses age
in years rather than months. The head of your new age_years
column should look like this:
age_years
1 3.0000000
2 0.4166667
3 13.0000000
4 0.6666667
5 0.5833333
6 0.3333333
Great! But what if instead of creating a new column we instead wanted to change an existing one? You just need to use the existing column name on the left side of the =
instead of giving a new column name. For example, in the above MUAC code we would write:
mutate(df_raw,
muac = muac / 100)
We might want to keep age in months as well as years, so we won’t reassign that column. But there are some other columns that could stand to be changed. There are a lot of reasons we might want to change a column, two of the most common ones are:
- The format of a string needs changing
- The data type of a column is incorrect
Our dataset has both of these problems. For example, while it isn’t per se a problem that region
and sub_prefecture
are in all capitals, it also isn’t particularly nice. To fix this, we can use another function from the {tidyverse}
, this time from a package called {stringr}
to make these columns title case:
mutate(df_raw,
region = str_to_title(region),
sub_prefecture = str_to_title(sub_prefecture))
Use mutate()
to update the format of malaria_rdt
and outcome
to use title case. The head of these two columns should now look something like this:
malaria_rdt outcome
1 Negative Recovered
2 Negative <NA>
3 Negative Recovered
4 Negative Recovered
5 Negative Recovered
6 Negative Recovered
Notice that we didn’t need to load {stringr}
to do the above exercise. That’s because, like {dplyr}
this package is loaded when we load the {tidyverse}
.
That’s nicer. Now let’s consider the second issue, having variables with the wrong type.
Take a look at the data type of your columns, do any of them look strange?
Hint. str()
may be useful here.
Most of the columns look ok, but it seems theres something strange with the dates. Some of them are character type and others are something called POSIXct. We would much rather have all of these columns use the simple Date
type.
To convert to dates, we are going to call on yet another package from the the tidyverse, {lubridate}
. In particular, we are going to use the function ymd()
. For example:
mutate(df_raw,
date_outcome = ymd(date_outcome))
Use mutate()
and ymd()
to modify date_onset
and date_admission
to be Date
type. Use a temporary data frame tmp
to check that your code is doing what you want it to.
Removing Duplicates
Ok great! We know how to select, rename, and modify our data. Another task we will often need to do is removing duplicate entries. Fortunately this one is easily done using the function distinct()
, which has the following basic syntax:
# DO NOT RUN (PSEUDO-CODE)
distinct(df)
Notice that distinct only needs one argument by default, the dataset itself. This will look for and remove any duplicate observations in the data frame. There are some fancier ways of using distinct()
that will look for duplication on certain variables only, but that’s outside of the scope of today’s session.
Use distinct()
to create a temporary data frame, tmp
, that contains all the unique observations in df_raw
. Compare the number of rows in tmp
to that of df_raw
. Did we have any duplicates?
The Pipe Operator
So it looks like we have actually done quite a bit of cleaning while learning the core {dplyr}
verbs. We should probably try to put some of the above steps together to start building a basic data cleaning pipeline. So far we haven’t been saving any of our changes, except perhaps to a temporary data frame. It would be nice to keep them in a new clean df
object.
For example, if we want to effect the column renaming we did above to a reusable object we might write something like this:
<- rename(df_raw,
df prefecture = sub_prefecture,
village = village_commune,
facility = health_facility_name)
In general, it’s good practice to keep a raw version of your dataset, here df_raw
, that remains unmodified in your code. This is so that you always have it available in your environment as a reference and is always available at the start of your cleaning pipeline to improve reproducibility.
Now we have a new object, df
that we can do more operations on. Brilliant. For example, if we now wanted to select everything except for full_name
we could update the above code like this:
# Step 1: Rename Variables
<- rename(df_raw,
df prefecture = sub_prefecture,
village = village_commune,
facility = health_facility_name)
# Step 2: Select Variables to Keep
<- select(df,
df -full_name)
Notice that in this second step we are using df
as the input of select()
rather than df_raw
because we want to continue working on our modified version of the data. Let’s say now we want to add a column of age in years:
# Step 1: Rename Variables
<- rename(df_raw,
df prefecture = sub_prefecture,
village = village_commune,
facility = health_facility_name)
# Step 2: Select Variables to Keep
<- select(df,
df -full_name)
# Step 3: Add Age in Years
<- mutate(df,
df age_years = age / 12)
Hm, ok well this is working but it is starting to get repetitive. With each step we are reusing the output of the last step and then updating the same data frame, df
. It would be better if these actions could be chained together in a simpler way.
This is exactly what the pipe operator, |>
is for! The pipe has the following basic syntax:
# DO NOT RUN (PSEUDO-CODE)
|> action() input
Here the input on the left side (input
) is “piped into” the action on the right side (action()
). So for example instead of writing:
select(df_raw, id, sex)
We could instead write:
|>
df_raw select(id, sex)
Try out the above code to see if it works.
This can be used to chain multiple actions together and you will often see tidyverse style code that uses pipes in the following way:
# DO NOT RUN (PSEUDO-CODE)
<- df_raw |>
df first_action() |>
second_action() |>
third_action()
Notice that here each action in the pipe is given its own line. This is considered good practice as it makes your code easier to read and understand.
So, if we wanted to chain the example actions we saw above into a single pipe, we might write something like this:
<- df_raw |>
df rename(prefecture = sub_prefecture,
village = village_commune,
facility = health_facility_name) |>
select(-full_name) |>
mutate(age_years = age / 12)
That’s a lot easier than reassigning df
after each step!
Let’s see if we can put together what we learned above into a single pipeline! Use the pipe operator |>
, select()
, rename()
, mutate()
, str_to_title()
, ymd()
, and distinct()
to perform the following actions on df_raw
and assign the output to a new data frame called df
:
- Remove the variables
full_name
andage_unit
- Rename the following variables:
-
age
becomesage_months
-
sub_prefecture
becomesprefecture
-
village_commune
becomesvillage
-
health_facility_name
becomesfacility
-
- Add a variable
age_years
with patient age in years - Update
region
andprefecture
to use title case - Update all date columns to use
Date
type - Remove any duplicate observations
The head of your final data should look something like this:
id sex age_months region prefecture village date_onset
1 1 femme 36 Mandoul Moissala Sangana Koïtan 2022-08-13
2 2 f 5 Mandoul Moissala Mousdan 1 2022-08-18
3 3 f 156 Mandoul Moissala Djaroua Ii 2022-08-17
4 6 homme 8 Mandoul Moissala Monakoumba 2022-08-22
5 7 m 7 Mandoul Moissala Tétindaya 2022-08-30
6 10 m 4 Mandoul Moissala Danamadja 2022-08-30
date_consultation hospitalisation date_admission
1 2022-08-14 yes 2022-08-14
2 2022-08-25 yes 2022-08-25
3 2022-08-20 <NA> <NA>
4 2022-08-25 no <NA>
5 2022-09-02 no <NA>
6 2022-09-02 yes 2022-09-02
facility malaria_rdt fever rash cough red_eye
1 Hôpital du District de Moissala negative No <NA> Yes No
2 Hôpital du District de Moissala negative No No Yes No
3 CS Silambi negative Yes <NA> No No
4 Hôpital du District de Moissala negative No No No <NA>
5 CS Silambi negative <NA> No Yes Yes
6 Moissala Est negative Yes No No <NA>
pneumonia encephalitis muac vacc_status vacc_doses outcome date_outcome
1 No No 244 <NA> <NA> recovered 2022-08-18
2 <NA> No 232 No <NA> <NA> 2022-08-28
3 No <NA> 123 Yes - oral <NA> recovered <NA>
4 No No 210 No <NA> recovered <NA>
5 No No 80 No <NA> recovered <NA>
6 No No 220 No <NA> recovered 2022-09-03
age_years
1 3.0000000
2 0.4166667
3 13.0000000
4 0.6666667
5 0.5833333
6 0.3333333
Hint. Be careful with your column names here! If you renamed something you will need to use the new names for any subsequent parts of the pipe.
Amazing! That looks like a great start at a data cleaning pipeline. Keep this code handy, you will use it in the next session where we will look at another common part of data cleaning: recoding.
Done!
Well done, you’ve learned the fundamentals of data manipulation and how to string multiple commands together into a data manipulation pipe. Moving forward, solution files will focus less on being “exercise by exercise” and rather provide an example of what a real script might look like in a real world context. In this case, the solutions will then focus only on the final pipe that is created at the end of the session.
Going Further
Extra Exercises
A a line to your
mutate()
to update thehospitalisation
variable so that its text would be in title case as well.Perhaps you would prefer to use lower case for the
region
column rather than the title case, update your code to do this instead. Hint: you might want to check outstr_to_lower()
from{stringr}
.Create a
delay_consultation
column, that contains the number of days between the onset of symptoms and the consultation.