Data Exploration
Objectives
- Perform quick exploration of an imported dataset
- Produce frequency tables for variables
Dependencies. This extra session assumes that you have completed the sessions introduction to R and R studio, and data importation.
Setup
For this session we will work with our raw Moissala measles linelist which can be downloaded here:
Make sure it is appropriately stored in data/raw
of your project. Then open a new script called data-exploration.R
, and make sure packages {here}
, {rio}
and {dplyr}
are loaded. Finally, import the data into R as an object called df_linelist
.
Data Exploration
Right after importing some data into R, we might want to take a look at it. When talking of data exploration we usually want to do a few things:
- Examine dimensions of the data (ie: how many rows and how many columns)
- Look at columns names
- Visualise the first or last few rows
- Determine the class of the variables
- Determine the range of values in continuous variables
- Observe the possible values in each categorical variable
This process is crucial and will allow us to familiarize ourselves with our data and identify issues that will be adressed during the data cleaning step.
Basic Exploration
The very first thing you want to know about your data is the dimensions, which refers to the number of rows and number of columns that make up your data. There are several ways to get this information in R:
- Look at your environment pane in RStudio and check for your data - the number next to it (
5230x25
) tells us it is a dataframe with5230
rows and25
columns. - Use
dim()
on your data to return a vector with both the number of rows and number of columns - Alternatively, use
ncol()
to get the number of columns andnrow()
for the number of rows
It’s good to remember these numbers so you can quickly spot if there are unexpected changes to your data during your analysis (ie: more/fewer rows or columns than expected).
Using the method of your choice, get the dimensions of your dataframe df_linelist
.
Variable Names
Because we are going to use the variable names very often during our analysis, we want to get familiar with them pretty early on. Also, we need to identify the ones that will need to be renamed during our data cleaning. The function names()
returns a vector of all the variable names in our dataframe:
names(df_linelist)
[1] "id" "full_name" "sex"
[4] "age" "age_unit" "region"
[7] "sub_prefecture" "village_commune" "date_onset"
[10] "date_consultation" "hospitalisation" "date_admission"
[13] "health_facility_name" "malaria_rdt" "fever"
[16] "rash" "cough" "red_eye"
[19] "pneumonia" "encephalitis" "muac"
[22] "vacc_status" "vacc_doses" "outcome"
[25] "date_outcome"
What do you think of the names in your dataset? Can you already spot some variables names you would like to rename?
Inspecting Your Data
It is also nice to inspect your data, it may be easier for you to spot some inconsistencies, variables with a lot of missing values, and it will allow you to see what values to expect in each of them. You can print your data in the console by:
- Running the
df_linelist
object alone (careful though, you may not want to do this if you have a large dataset) - Use the
head()
function to see the top 6 rows (you can increase this number using the argumentn
) - Use the
tail()
function to see the last 6 rows (again, you can increase this number using the argumentn
)
These methods will only print the first 40 rows of your data at most because that’s the limit of your console. Alternatively, you can use View()
to see your data in a tabular form. This will open a new window with your data displayed like like an Excel spreadsheet. Note, this command only displays the data, it doesn’t allow you to modify it.
Be very careful with View()
on large dataset as this may crash your RStudio session. To avoid this, you can print the output in the console.
Can you display the first 15 rows of your data? What happen when you change the width of your console pane and run the command again?
Variable Classes
We now want to check the class of the different variables. This is important as part of data cleaning involves making sure that numerical variables are class numeric
, dates Date
, and categorical variables are factor
or character
. You have already seen the class()
function, to check the class of a vector. In R, each variable of a dataframe is a vector. We can extract all the values of that vector using the $
sign, and pass it to the class function:
class(df_linelist$age)
Try extracting all the values from the sex
variable. What is the class of this variable?
You can also use str()
on your dataframe to check all the variables class at once:
str(df_linelist)
Use str()
to check the data type of each column. Does anything look odd? Remember that you can also use functions like is.character()
and is.numeric()
if you’d like to test the type of a particular column.
Exploring Continuous Variables
Now that you know how to extract the values from a variable, you may want to explore some of these values from the numeric
variables to check for inconsistencies. Let’s look for some summary statistics for these, and Base R provides many handy functions:
Function | Description | Example | Returns |
---|---|---|---|
min() | Minimum value | min(x) | Single minimum value |
max() | Maximum value | max(x) | Single maximum value |
mean() | Arithmetic average | mean(x) | Average value |
median() | Middle value | median(x) | Middle value |
range() | Min and max | range(x) | Vector of (min, max) |
IQR() | Interquartile range | IQR(x) | Q3 - Q1 |
quantile() | Specified quantiles | quantile(x, probs = c(0.25, 0.75)) | Requested quantiles |
sd() | Standard deviation | sd(x) | Standard deviation |
var() | Variance | var(x) | Variance |
sum() | Sum of values | sum(x) | Sum |
These functions require you to explicitly remove missing values (NA
) using the argument na.rm = TRUE
You can extract the values of a variables using $
, and pass them to any of those functions.
Use the $
syntax to get:
- The minimum value of
age
- The maximum of
muac
Any problems?
Exploring Categorical Variables
Finally, let’s look at the values in our categorical variables. For this we can use frequency tables. This is handy as:
- It allows us to quickly see the unique values in a categorical variable
- The number of observations for each of those categories
This is done using the function count()
from the package {dplyr}
, which accepts the a dataframe and the name of one (or more!) column(s) as arguments. It will then count the number of observations of each unique element in that column. For example, let’s see the possible values of the variable sex
:
count(df_linelist, sex)
The output is a new, smaller dataframe containing the number of patients observed stratified by sex
. It seems like this variable requires some recoding… We will do that in a later session.
Using your linelist data, look into the values for the outcome
variable. How does it look?
Now, try adding the argument sort = TRUE
to the count()
function. What did this argument do?
Done!
Well done taking a first look at your data!