Summary Tables
Objectives
- Create contingency tables with
count()
- Compute summary statistics by group using
summarize()
- Review how to subset rows using
filter()
and create/modify variables withmutate()
- Create ordered categorical variables
Setup
Dependencies. This session assumes that you know how to use RStudio that you are able to import data and that you know th basic data handling verbs that we have seen in the core sessions so far. If you need a refresher on either of these topics, we encourage you to review the core sessions in the learning pathway.
This session will use the clean version of the Moissala measles dataset. Download the data and store them in data/clean
in your practice project:
Open your RStudio Project and create a new script in the R folder called tables.R
with appropriate metadata and a “Packages” section that imports: {rio}
, {here}
and {tidyverse}
. Add an “Import Data” section that loads the cleaned version of the measles linelist.
Introduction: Data aggregation
OK so let’s recap, you have just performed one of the most important tasks of an epidemiologist: the data cleaning. Now that you have clean and standardized data, you can get into the real business and start analysing them. Analyses typically start with some tables and summaries that describe our data:
- Univariate frequency tables to count occurrences of different values
- Summary statistics of numerical variables (mean, median, standard deviation)
- Cross-tabulations to examine relationships between categorical variables
- Group-wise summaries to compare statistics across different subsets of the data
Counting Multiple Columns (Contingency Tables)
During the data exploration session, you have learned to create a frequency table for a single categorical variable using the count()
function. This is nice, but we often want to count the number observations based on two (or more!) variables.
These tables are called contingency tables. For example, knowing the number of patients by sub_prefecture
is great but we might want to stratify by both sub_prefecture
and age_group
to see if certain areas have unusually old patients. Doing this is easy, you just need to pass multiple column names to count()
:
|>
df_linelist count(sub_prefecture, age_group)
Create a new summary table counting the number of patients stratified by sub_prefecture
and hospitalisation
. What happens if you change the order of the arguments given to count()
?
Now, using count()
, answer the following questions:
- How many patients were female? What is the proportion?
- What are all the possible values of the
outcome
variable? - How many patients between 1 - 4 years have recovered?
Filtering out NA
s
When looking at the categories of outcome
, you should have spotted that some patients have missing values (NA
):
|>
df_linelist count(outcome) |>
mutate(prop = n / sum(n))
Observe the output of the code above. How can you also call the proportion of patients who died? Are you happy with this calculation?
The proportion of cases that died is also referred to as the Case Fatality Ratio (CFR). To precisely calculate the CFR we need to make sure that the denominator only includes patient for whom we are sure of their outcome (ie we need to remove all cases with NA
or left aginst medical advice
).
Remember that we can do this using filter()
. To filter for missing values (NA
) in a variable we can use the small function is.na(outcome)
. Adding a !
in front will do the opposite: removing missing values from outcome
:
|>
df_linelist filter(
!= "left against medical advice",
outcome !is.na(outcome)
|>
) count(outcome)
Which other conditionnal statement could you use in filter()
to obtain the same results
Now that we have removed the patients with unknown outcomes, we can add this before creating our frequency table to get the right CFR.
Using your filter, update your code to summarize the observed number of patients who survived and died as well as the CFR (proportion who died). Store this new dataframe into an object, cfr_df
.
Bonus. A useful “shortcut” function is drop_na()
from the package {tidyr}
that equates to filter(!is.na())
.
|>
df_linelist drop_na(outcome) |>
count(outcome)
drop_na()
is particularly useful as you can give it multiple column names to filter by. But be careful that doing so will remove all rows where one or more of those columns have a missing value.
Summary Table: Statistics by Sub Prefecture
Ok now that we have produced some simple frequency and contingency tables we may want to increase the complexity. A common task in epidemiology is to look at summary statistics within subsets of the data.
For example, we may be asked to produce patient statistics at the sub-prefecture level, ie: for each sub-prefecture in the data, we need to answer the following questions:
- How many patients consulted?
- What is their average age?
- What was the earliest date of admission?
- How many patients have been hospitalized?
- Among children under 6 months, how many have died?
This is exactly what the function summarize()
has been made for! It allows us to calculate summary statistics on a dataset, and the syntax is similar to that of mutate()
:
# DO NOT RUN (PSEUDO-CODE)
|>
df mutate(new_col = function_to_create(existing_col))
|>
df summarize(
.by = grouping_variable,
new_col = summary_function(existing_col)
)
Consider the following code, here we are summarizing the data to calculate the average age across all patients.
|>
df_linelist summarize(mean_age = mean(age))
# A tibble: 1 × 1
mean_age
<dbl>
1 6.82
Notice that this code yields a single value for average age. No grouping variable was provided so summarize()
returned one summary statistic for the whole dataset. To calculate the average age by a specific strata, we need to specify a grouping variable using the .by
argument:
|>
df_linelist summarize(
.by = sex, # Make the summary (here, the mean) by sex
mean_age = mean(age)
)
# A tibble: 2 × 2
sex mean_age
<chr> <dbl>
1 f 6.77
2 m 6.87
Take a look at the above results. How would you interpret them?
Now that we can use summarize()
we can use it to calculate some proper summary statistics by sub-prefecture. Let’s start by calling an empty summarize()
and grouping the data on sub_prefecture
.
Run the following code:
|>
df_linelist summarize(
.by = sub_prefecture
)
What happens when you run these lines?
Counts
We first want to look at the number of cases in each sub_prefecture
. This can be done using the helper function n()
:
|>
df_linelist summarize(
.by = sub_prefecture,
n_patients = n() # Count stuffs
)
Ok now let’s build a summary table for each sub_prefecture
. First start by replicating the above lines
Summarizing Continuous Variables
We can then use the mean()
, median()
, min()
, max()
functions (and other) to produce summaries for continuous variables. For example the average age:
|>
df_linelist summarize(
.by = sub_prefecture,
n_patients = n(),
mean_age = mean(age)
)
Add the minimum date of admission to your table for each of the sub_prefecture
? Are you happy with the results?
Remember that with the arithmetic functions such as mean()
, median()
, min()
, max()
, you need to explicitly tell R to remove NA
.
Counting with a Condition
We may also be interested in looking at the number of patients (rows) that fit a condition: the number of patients that were female. Counting by a logical condition can be done with the following syntax:
# DO NOT RUN (PSEUDO-CODE)
summarize(
sum_category = sum(LOGIC_TEST, na.rm = TRUE)
)
This sum allows us to count all the lines where our condition was met (returns TRUE
). For example:
|>
df_linelist summarize(
.by = sub_prefecture,
n_female = sum(sex == "f", na.rm = TRUE)
)
Add a variable to your table that counts the number of patients that have been hospitalized. (ie: rows that have yes
in variable hospitalisation
)
Other Statistics
Sometimes we want to produce a more complicated statistic, for example the mean age of all hospitalized patients. Here the syntax is a bit different:
# DO NOT RUN (PSEUDO-CODE)
|>
df summarize(
mean_category = mean(col_to_use[LOGIC_TEST], na.rm = TRUE)
)
Here, we have:
- Stated what summary statistic we want to use (
mean()
) - Indicated which column we want to calculate that statistic on (
col_to_use
) - Created a condition of which observations in that column to use in the calculation (
[LOGIC_TEST]
)
To give a concrete example, if we wanted to compute the mean of the age
variable but only for hospitalized patients (ie: in rows where hospitalisation == "yes"
) we would write:
|>
df_linelist summarize(
.by = sub_prefecture,
n_patients = n(),
mean_age_hosp = mean(age[hospitalisation == "yes"], na.rm = TRUE)
)
The use of a logical test in the example above is called logical indexing, where a condition is essentially being used to filter which observations you want to consider when performing a calculation. Logical indexing is very powerful but can also take some getting used to, so don’t worry too much if it isn’t perfectly clear at this stage.
Can you use this syntax to calculate the mean age of female patients in your table?
That is looking great! We are starting to get a pretty exhaustive grouped summary table with a lot of useful information by sub_prefecture
! An extra challenge for you:
CHALLENGE: Could you add a variable to your table that counts the number of patients that died among the ones that are < 6 months old.
Hint. You want to count rows (so use sum()
) that fill a specific condition for outcome (outcome == "dead"
), but only when age_group == "< 6 months"
Use the Output
Finally, remember that summarize()
returns a dataframe that we can then further manipulate (eg: with filter()
and mutate()
).
Add a mutate()
after producing your summary table to calculate:
- The proportion of hospitalized patients per sub-prefecture
- The proportion of female patients per sub-prefecture
The head of your final table should look like this:
# A tibble: 6 × 11
sub_prefecture n_patients mean_age min_admission n_female n_hosp mean_age_hosp
<chr> <int> <dbl> <date> <int> <int> <dbl>
1 Moissala 1808 6.84 2022-08-14 923 612 5.49
2 Bouna 1376 6.56 2023-01-11 669 412 5.67
3 Bedjondo 534 7.07 2023-06-09 251 184 5.21
4 Bekourou 496 6.84 2023-06-17 251 164 6.04
5 Bedaya 435 7.10 2023-07-04 209 147 6.16
6 Koumra 253 7.11 2023-08-14 138 84 6.26
# ℹ 4 more variables: mean_age_female <dbl>, n_death_u6m <int>,
# prop_female <dbl>, prop_hosp <dbl>
Done!
You should be proud of yourselves, making summary tables is an important skill to an epidemiologist, making it in R is very efficient! Don’t forget to save your code!
Going Further
Extra Exercises
-
Make a summary table that summarizes:
The number of patients
The proportion of male
The number of deaths
The CFR
-
The number of deaths among patients that had pneumonia
in all the different age groups !
Make a table that shows the proportion of patients by age with any measles vaccine (by oral recall or card) and those with 1 or 2 doses.
Make a table that compares the proportion of hospitalised and non-hospitalised patients with positive malaria RDT, fever, rash, cough, red eye, pneumonia, encephalitis, and “red” or “yellow” MUAC (less than 125 mm).
Calculate the mean days from first symptom onset to consultation by sub-prefecture.
Calculate the mean time spent in hospital (i.e. days from admission to outcome) by outcome (i.e. in those who recovered and those who died).
Additional Resources
- The EpiR Handbook chapter on grouping data
- Once you have tables, you can extensively customize them for display/publication using
{gt}
package: