Data Manipulation, Basics

Core

Data Manipulation

Data Cleaning

An introduction to data manipulation and cleaning using {dplyr}

Objectives

Learn basic data verbs from {dplyr} to:
- Select specific columns (select())
- Rename columns (rename())
- Add new columns and change existing ones (mutate())
- Remove duplicate observations
Understand the pipe operator |>

Setup

Dependencies. This session assumes that you know how to use RStudio and that you are able to import data. If you need a refresher on either of these topics, we encourage you to review the first two sessions in the learning pathway.

This session will work with the raw Moissala linelist data, which can be downloaded here:

Make sure this dataset is saved into the appropriate subdirectory of your R project and create a new script called data_verbs_practice.R in your R directory. Add an appropriate header and load the following packages: {here}, {rio}, and {tidyverse}.

Finally, add an import section where you use {here} and {rio} to load your data into an object called df_raw.

Manipulating Data with `{dplyr}`

Now that we know how to set up a project and import data, we can finally start to play around with it. Going forward we will be using several packages from the “tidyverse” to help us manipulate, summarize, and visualize our data. Today’s session will focus on data manipulation using a package called {dplyr}.

What is `{dplyr}`

Data manipulation is the foundation of working with data in R and as such is foundational to the work we do as epidemiologists. In particular, data manipulation skills will be critical when trying to clean our data.

In R, the package {dplyr} provides a large number of functions to help us manipulate data frames and perform many of the tasks that we will need to use on a daily basis, for example:

Subsetting our data to remove certain variables
Renaming certain variables
Adding or modifying a variable
Removing duplicate entries

In {dplyr} each of these actions can be done with a particular function, which typically have an intuitive verb for a name. For example, renaming columns will use the function rename().

In today’s session we will look at the “data manipulation verb”, ie the function, needed for each of the above tasks as well as how to chain them all together into an efficient data pipeline.

Note

You may have noticed that we asked you to load a package called {tidyverse} rather than {dplyr} in the setup. Loading {tidyverse} will load several of the most useful packages from the broader tidyverse, including {dplyr} and a couple other packages that we will see later in the session.

Basic Data Verbs

Selecting Specific Columns

A lot of the time when we receive a dataset it will have extra columns that we don’t need, either because those columns contain sensitive data or because our analysis will only focus on a subset of the data. This is where a function like select() comes in handy.

Here is the basic syntax, note that this is pseudo-code and isn’t something you are intended to run yourself.

# DO NOT RUN (PSEUDO-CODE)
select(df_raw, first_column_to_keep, second_column_to_keep)

Here we see that the first argument is our dataset and each subsequent argument is the name of a column that we would like to keep. In the tidyverse, variables (ie column names) don’t need to be put into quotation marks. So for example, if we want to select the columns id, sex, and age we can use the following:

select(df_raw, id, sex, age)

Use select() to select the following variables in your dataset: id, sex, age, sub_prefecture, date_onset, and outcome. The head of your output should look something like this:

  id   sex age date_onset   outcome
1  1 femme  36 2022-08-13 recovered
2  2     f   5 2022-08-18      <NA>
3  3     f 156 2022-08-17 recovered
4  6 homme   8 2022-08-22 recovered
5  7     m   7 2022-08-30 recovered
6 10     m   4 2022-08-30 recovered

Take a look at this output and then at df_raw. We can see that df_raw still contains all of the columns, which is what we want. But can you tell why it didn’t change?

Often, we want to keep most of the variables in our dataset and only remove one or two. We can use the above syntax to do this, but it can become pretty tedious to write out every column name. In these cases, instead of telling select what to **keep**, we can use a subtraction sign (-) to tell it what to **remove**. For example, if we wanted to remove thevillage_commune` column from our dataframe we can use the following:

select(df_raw, -village_commune)

Way easier!

Use the - syntax in select() to select all of the columns in df_raw except: full_name and age_unit from your dataset.

Renaming Columns

Another common issue when we get new datasets is that the variable names are inconvenient. In those cases, rename() can work wonders. Here’s the basic syntax:

# DO NOT RUN (PSEUDO CODE)
rename(df,
       new_column_name = old_column_name,
       another_new_name = another_old_name)

As in the case of select(), and indeed in essentially all {dplyr} verbs, the first argument is our daframe. Then each subsequent argument is a statement of new_column_name = old_column_name telling R which columns to rename and the new names that we want to use, with each pair given its own line to improve readability. If we wanted to change village_commune to simply be village, for example, we can write:

rename(df_raw,
       village = village_commune)

Use rename() on df_raw to change the columns sub_prefecture, village_commune, and health_facility_name to be prefecture, village, and facility respectively.

In the above exercise it may have been difficult to check if the output looked correct because R would have printed out the full data frame. In these cases it can be helpful to create a temporary object just to check if everything looks alright. You can call this object whatever you want, but a common name is tmp.

Repeat the last exercise but this time assign the output to an object called tmp and use names() to check that the column names changed as you expected. The output of names() should give you something like this:

 [1] "id"                "full_name"         "sex"              
 [4] "age"               "age_unit"          "region"           
 [7] "prefecture"        "village"           "date_onset"       
[10] "date_consultation" "hospitalisation"   "date_admission"   
[13] "facility"          "malaria_rdt"       "fever"            
[16] "rash"              "cough"             "red_eye"          
[19] "pneumonia"         "encephalitis"      "muac"             
[22] "vacc_status"       "vacc_doses"        "outcome"          
[25] "date_outcome"

Important

Temporary objects, like the tmp data frame you just created are just that: temporary. They are usually used to test if something has worked and designed to be overwritten each time you need to test something else. As such, you should not use these temporary objects as the input for other parts of your code. If you want to make a data frame that will be reused, such as a clean version of df_raw, this should be done using an object with a proper name like df or df_clean.

Changing and Adding Columns

So now we know how to select and rename columns, but how do we modify them? This is where mutate() comes into play. This function can be used both to add new columns and to change existing ones.

Let’s start with the basic mutate() syntax needed to add a new column:

# DO NOT RUN (PSEUDO-CODE)
mutate(df,
       new_column = action(existing_column),
       another_new_column = another_action(another_existing_column))

In the above code, we are creating a new column (new_column) by performing some sort of action (action()) on an existing column in the dataframe (existing_column). This action could be anything, it could use a function or be a simple arithmetic operation and can use one or more columns. For example, if we wanted to create a new column expressing MUAC in cm we could use the following:

mutate(df_raw,
       muac_cm = muac / 100)

Use mutate() to create a new column called age_years that expresses age in years rather than months. The head of your new age_years column should look like this:

   age_years
1  3.0000000
2  0.4166667
3 13.0000000
4  0.6666667
5  0.5833333
6  0.3333333

Great! But what if instead of creating a new column we instead wanted to change an existing one? You just need to use the existing column name on the left side of the = instead of giving a new column name. For example, in the above MUAC code we would write:

mutate(df_raw,
       muac = muac / 100)

We might want to keep age in months as well as years, so we won’t reassign that column. But there are some other columns that could stand to be changed. There are a lot of reasons we might want to change a column, two of the most common ones are:

The format of a string needs changing
The data type of a column is incorrect

Our dataset has both of these problems. For example, while it isn’t per se a problem that region and sub_prefecture are in all capitals, it also isn’t particularly nice. To fix this, we can use another function from the {tidyverse}, this time from a package called {stringr} to make these columns title case:

mutate(df_raw,
       region = str_to_title(region),
       sub_prefecture = str_to_title(sub_prefecture))

Use mutate() to update the format of malaria_rdt and outcome to use title case. The head of these two columns should now look something like this:

  malaria_rdt   outcome
1    Negative Recovered
2    Negative      <NA>
3    Negative Recovered
4    Negative Recovered
5    Negative Recovered
6    Negative Recovered

Note

Notice that we didn’t need to load {stringr} to do the above exercise. That’s because, like {dplyr} this package is loaded when we load the {tidyverse}.

That’s nicer. Now let’s consider the second issue, having variables with the wrong type.

Take a look at the data type of your columns, do any of them look strange?

Hint. str() may be useful here.

Most of the columns look ok, but it seems theres something strange with the dates. Some of them are character type and others are something called POSIXct. We would much rather have all of these columns use the simple Date type.

To convert to dates, we are going to call on yet another package from the the tidyverse, {lubridate}. In particular, we are going to use the function ymd(). For example:

mutate(df_raw,
       date_outcome = ymd(date_outcome))

Use mutate() and ymd() to modify date_onset and date_admission to be Date type. Use a temporary data frame tmp to check that your code is doing what you want it to.

Removing Duplicates

Ok great! We know how to select, rename, and modify our data. Another task we will often need to do is removing duplicate entries. Fortunately this one is easily done using the function distinct(), which has the following basic syntax:

# DO NOT RUN (PSEUDO-CODE)
distinct(df)

Notice that distinct only needs one argument by default, the dataset itself. This will look for and remove any duplicate observations in the data frame. There are some fancier ways of using distinct() that will look for duplication on certain variables only, but that’s outside of the scope of today’s session.

Use distinct() to create a temporary data frame, tmp, that contains all the unique observations in df_raw. Compare the number of rows in tmp to that of df_raw. Did we have any duplicates?

The Pipe Operator

So it looks like we have actually done quite a bit of cleaning while learning the core {dplyr} verbs. We should probably try to put some of the above steps together to start building a basic data cleaning pipeline. So far we haven’t been saving any of our changes, except perhaps to a temporary data frame. It would be nice to keep them in a new clean df object.

For example, if we want to effect the column renaming we did above to a reusable object we might write something like this:

df <- rename(df_raw, 
             prefecture = sub_prefecture,
             village = village_commune,
             facility = health_facility_name)

Tip

In general, it’s good practice to keep a raw version of your dataset, here df_raw, that remains unmodified in your code. This is so that you always have it available in your environment as a reference and is always available at the start of your cleaning pipeline to improve reproducibility.

Now we have a new object, df that we can do more operations on. Brilliant. For example, if we now wanted to select everything except for full_name we could update the above code like this:

# Step 1: Rename Variables
df <- rename(df_raw, 
             prefecture = sub_prefecture,
             village = village_commune,
             facility = health_facility_name)

# Step 2: Select Variables to Keep
df <- select(df,
             -full_name)

Notice that in this second step we are using df as the input of select() rather than df_raw because we want to continue working on our modified version of the data. Let’s say now we want to add a column of age in years:

# Step 1: Rename Variables
df <- rename(df_raw, 
             prefecture = sub_prefecture,
             village = village_commune,
             facility = health_facility_name)

# Step 2: Select Variables to Keep
df <- select(df,
             -full_name)

# Step 3: Add Age in Years
df <- mutate(df,
             age_years = age / 12)

Hm, ok well this is working but it is starting to get repetitive. With each step we are reusing the output of the last step and then updating the same data frame, df. It would be better if these actions could be chained together in a simpler way.

This is exactly what the pipe operator, |> is for! The pipe has the following basic syntax:

# DO NOT RUN (PSEUDO-CODE)
input |> action()

Here the input on the left side (input) is “piped into” the action on the right side (action()). So for example instead of writing:

select(df_raw, id, sex)

We could instead write:

df_raw |>
  select(id, sex)

Try out the above code to see if it works.

This can be used to chain multiple actions together and you will often see tidyverse style code that uses pipes in the following way:

# DO NOT RUN (PSEUDO-CODE)
df <- df_raw |>
  first_action() |>
  second_action() |>
  third_action()

Tip

Notice that here each action in the pipe is given its own line. This is considered good practice as it makes your code easier to read and understand.

So, if we wanted to chain the example actions we saw above into a single pipe, we might write something like this:

df <- df_raw |>
  rename(prefecture = sub_prefecture,
         village = village_commune,
         facility = health_facility_name) |>
  select(-full_name) |>
  mutate(age_years = age / 12)

That’s a lot easier than reassigning df after each step!

Let’s see if we can put together what we learned above into a single pipeline! Use the pipe operator |>, select(), rename(), mutate(), str_to_title(), ymd(), and distinct() to perform the following actions on df_raw and assign the output to a new data frame called df:

Remove the variables full_name and age_unit
Rename the following variables:
- age becomes age_months
- sub_prefecture becomes prefecture
- village_commune becomes village
- health_facility_name becomes facility
Add a variable age_years with patient age in years
Update region and prefecture to use title case
Update all date columns to use Date type
Remove any duplicate observations

The head of your final data should look something like this:

  id   sex age_months  region prefecture        village date_onset
1  1 femme         36 Mandoul   Moissala Sangana Koïtan 2022-08-13
2  2     f          5 Mandoul   Moissala      Mousdan 1 2022-08-18
3  3     f        156 Mandoul   Moissala     Djaroua Ii 2022-08-17
4  6 homme          8 Mandoul   Moissala     Monakoumba 2022-08-22
5  7     m          7 Mandoul   Moissala      Tétindaya 2022-08-30
6 10     m          4 Mandoul   Moissala      Danamadja 2022-08-30
  date_consultation hospitalisation date_admission
1        2022-08-14             yes     2022-08-14
2        2022-08-25             yes     2022-08-25
3        2022-08-20            <NA>           <NA>
4        2022-08-25              no           <NA>
5        2022-09-02              no           <NA>
6        2022-09-02             yes     2022-09-02
                         facility malaria_rdt fever rash cough red_eye
1 Hôpital du District de Moissala    negative    No <NA>   Yes      No
2 Hôpital du District de Moissala    negative    No   No   Yes      No
3                      CS Silambi    negative   Yes <NA>    No      No
4 Hôpital du District de Moissala    negative    No   No    No    <NA>
5                      CS Silambi    negative  <NA>   No   Yes     Yes
6                    Moissala Est    negative   Yes   No    No    <NA>
  pneumonia encephalitis muac vacc_status vacc_doses   outcome date_outcome
1        No           No  244        <NA>       <NA> recovered   2022-08-18
2      <NA>           No  232          No       <NA>      <NA>   2022-08-28
3        No         <NA>  123  Yes - oral       <NA> recovered         <NA>
4        No           No  210          No       <NA> recovered         <NA>
5        No           No   80          No       <NA> recovered         <NA>
6        No           No  220          No       <NA> recovered   2022-09-03
   age_years
1  3.0000000
2  0.4166667
3 13.0000000
4  0.6666667
5  0.5833333
6  0.3333333

Hint. Be careful with your column names here! If you renamed something you will need to use the new names for any subsequent parts of the pipe.

Amazing! That looks like a great start at a data cleaning pipeline. Keep this code handy, you will use it in the next session where we will look at another common part of data cleaning: recoding.

Done!

Well done, you’ve learned the fundamentals of data manipulation and how to string multiple commands together into a data manipulation pipe. Moving forward, solution files will focus less on being “exercise by exercise” and rather provide an example of what a real script might look like in a real world context. In this case, the solutions will then focus only on the final pipe that is created at the end of the session.

Going Further

Extra Exercises

A a line to your mutate() to update the hospitalisation variable so that its text would be in title case as well.
Perhaps you would prefer to use lower case for the region column rather than the title case, update your code to do this instead. Hint: you might want to check out str_to_lower() from {stringr}.
Create a delay_consultation column, that contains the number of days between the onset of symptoms and the consultation.