Summary Tables

Core
Summary Tables
Create summary tables
Published

February 21, 2025

Objectives

  • Create contingency tables with count()
  • Compute summary statistics by group using summarize()
  • Review how to subset rows using filter() and create/modify variables with mutate()
  • Create ordered categorical variables

Setup

Dependencies. This session assumes that you know how to use RStudio that you are able to import data and that you know th basic data handling verbs that we have seen in the core sessions so far. If you need a refresher on either of these topics, we encourage you to review the core sessions in the learning pathway.

This session will use the clean version of the Moissala measles dataset. Download the data and store them in data/clean in your practice project:


Open your RStudio Project and create a new script in the R folder called tables.R with appropriate metadata and a “Packages” section that imports: {rio}, {here} and {tidyverse}. Add an “Import Data” section that loads the cleaned version of the measles linelist.

Introduction: Data aggregation

OK so let’s recap, you have just performed one of the most important tasks of an epidemiologist: the data cleaning. Now that you have clean and standardized data, you can get into the real business and start analysing them. Analyses typically start with some tables and summaries that describe our data:

  • Univariate frequency tables to count occurrences of different values
  • Summary statistics of numerical variables (mean, median, standard deviation)
  • Cross-tabulations to examine relationships between categorical variables
  • Group-wise summaries to compare statistics across different subsets of the data

Counting Multiple Columns (Contingency Tables)

During the data exploration session, you have learned to create a frequency table for a single categorical variable using the count() function. This is nice, but we often want to count the number observations based on two (or more!) variables.

These tables are called contingency tables. For example, knowing the number of patients by sub_prefecture is great but we might want to stratify by both sub_prefecture and age_group to see if certain areas have unusually old patients. Doing this is easy, you just need to pass multiple column names to count():

df_linelist |>
  count(sub_prefecture, age_group)

Create a new summary table counting the number of patients stratified by sub_prefecture and hospitalisation. What happens if you change the order of the arguments given to count()?
Now, using count(), answer the following questions:

  • How many patients were female? What is the proportion?
  • What are all the possible values of the outcome variable?
  • How many patients between 1 - 4 years have recovered?

Filtering out NAs

When looking at the categories of outcome, you should have spotted that some patients have missing values (NA):

df_linelist |>
  count(outcome) |>
  mutate(prop = n / sum(n))

Observe the output of the code above. How can you also call the proportion of patients who died? Are you happy with this calculation?

The proportion of cases that died is also referred to as the Case Fatality Ratio (CFR). To precisely calculate the CFR we need to make sure that the denominator only includes patient for whom we are sure of their outcome (ie we need to remove all cases with NA or left aginst medical advice).

Remember that we can do this using filter(). To filter for missing values (NA) in a variable we can use the small function is.na(outcome). Adding a ! in front will do the opposite: removing missing values from outcome:

df_linelist |>
  filter(
    outcome != "left against medical advice", 
    !is.na(outcome)
  ) |>
  count(outcome)

Which other conditionnal statement could you use in filter() to obtain the same results

Now that we have removed the patients with unknown outcomes, we can add this before creating our frequency table to get the right CFR.

Using your filter, update your code to summarize the observed number of patients who survived and died as well as the CFR (proportion who died). Store this new dataframe into an object, cfr_df.

Tip

Bonus. A useful “shortcut” function is drop_na() from the package {tidyr} that equates to filter(!is.na()).

df_linelist |>
  drop_na(outcome) |>
  count(outcome)

drop_na() is particularly useful as you can give it multiple column names to filter by. But be careful that doing so will remove all rows where one or more of those columns have a missing value.

Summary Table: Statistics by Sub Prefecture

Ok now that we have produced some simple frequency and contingency tables we may want to increase the complexity. A common task in epidemiology is to look at summary statistics within subsets of the data.

For example, we may be asked to produce patient statistics at the sub-prefecture level, ie: for each sub-prefecture in the data, we need to answer the following questions:

  • How many patients consulted?
  • What is their average age?
  • What was the earliest date of admission?
  • How many patients have been hospitalized?
  • Among children under 6 months, how many have died?

This is exactly what the function summarize() has been made for! It allows us to calculate summary statistics on a dataset, and the syntax is similar to that of mutate():

# DO NOT RUN (PSEUDO-CODE)
df |>
  mutate(new_col = function_to_create(existing_col))

df |>
  summarize(
    .by = grouping_variable,
    new_col = summary_function(existing_col)
  )

Consider the following code, here we are summarizing the data to calculate the average age across all patients.

df_linelist |>
  summarize(mean_age = mean(age))
# A tibble: 1 × 1
  mean_age
     <dbl>
1     6.82

Notice that this code yields a single value for average age. No grouping variable was provided so summarize() returned one summary statistic for the whole dataset. To calculate the average age by a specific strata, we need to specify a grouping variable using the .by argument:

df_linelist |>
  summarize(
    .by = sex,  # Make the summary (here, the mean) by sex
    mean_age = mean(age)
  )
# A tibble: 2 × 2
  sex   mean_age
  <chr>    <dbl>
1 f         6.77
2 m         6.87

Take a look at the above results. How would you interpret them?

Now that we can use summarize() we can use it to calculate some proper summary statistics by sub-prefecture. Let’s start by calling an empty summarize() and grouping the data on sub_prefecture.

Run the following code:

df_linelist |>
  summarize(
    .by = sub_prefecture
  )

What happens when you run these lines?

Counts

We first want to look at the number of cases in each sub_prefecture. This can be done using the helper function n():

df_linelist |>
  summarize(
    .by = sub_prefecture,
    n_patients = n() # Count stuffs
  )

Ok now let’s build a summary table for each sub_prefecture. First start by replicating the above lines

Summarizing Continuous Variables

We can then use the mean(), median(), min(), max() functions (and other) to produce summaries for continuous variables. For example the average age:

df_linelist |>
  summarize(
    .by = sub_prefecture,
    n_patients = n(),
    mean_age = mean(age)
  )

Add the minimum date of admission to your table for each of the sub_prefecture? Are you happy with the results?

Tip

Remember that with the arithmetic functions such as mean(), median(), min(), max(), you need to explicitly tell R to remove NA.

Counting with a Condition

We may also be interested in looking at the number of patients (rows) that fit a condition: the number of patients that were female. Counting by a logical condition can be done with the following syntax:

# DO NOT RUN (PSEUDO-CODE)
summarize(
  sum_category = sum(LOGIC_TEST, na.rm = TRUE)
  )

This sum allows us to count all the lines where our condition was met (returns TRUE). For example:

df_linelist |>
  summarize(
    .by = sub_prefecture,
    n_female = sum(sex == "f", na.rm = TRUE)
  )

Add a variable to your table that counts the number of patients that have been hospitalized. (ie: rows that have yes in variable hospitalisation)

Other Statistics

Sometimes we want to produce a more complicated statistic, for example the mean age of all hospitalized patients. Here the syntax is a bit different:

# DO NOT RUN (PSEUDO-CODE)
df |>
  summarize(
    mean_category = mean(col_to_use[LOGIC_TEST], na.rm = TRUE)
    )

Here, we have:

  • Stated what summary statistic we want to use (mean())
  • Indicated which column we want to calculate that statistic on (col_to_use)
  • Created a condition of which observations in that column to use in the calculation ([LOGIC_TEST])

To give a concrete example, if we wanted to compute the mean of the age variable but only for hospitalized patients (ie: in rows where hospitalisation == "yes") we would write:

df_linelist |>
  summarize(
    .by = sub_prefecture,
    n_patients = n(),
    mean_age_hosp = mean(age[hospitalisation == "yes"], na.rm = TRUE)
  )

The use of a logical test in the example above is called logical indexing, where a condition is essentially being used to filter which observations you want to consider when performing a calculation. Logical indexing is very powerful but can also take some getting used to, so don’t worry too much if it isn’t perfectly clear at this stage.

Can you use this syntax to calculate the mean age of female patients in your table?

That is looking great! We are starting to get a pretty exhaustive grouped summary table with a lot of useful information by sub_prefecture! An extra challenge for you:

CHALLENGE: Could you add a variable to your table that counts the number of patients that died among the ones that are < 6 months old.


Hint. You want to count rows (so use sum()) that fill a specific condition for outcome (outcome == "dead"), but only when age_group == "< 6 months"

Use the Output

Finally, remember that summarize() returns a dataframe that we can then further manipulate (eg: with filter() and mutate()).

Add a mutate() after producing your summary table to calculate:

  • The proportion of hospitalized patients per sub-prefecture
  • The proportion of female patients per sub-prefecture

The head of your final table should look like this:

# A tibble: 6 × 11
  sub_prefecture n_patients mean_age min_admission n_female n_hosp mean_age_hosp
  <chr>               <int>    <dbl> <date>           <int>  <int>         <dbl>
1 Moissala             1808     6.84 2022-08-14         923    612          5.49
2 Bouna                1376     6.56 2023-01-11         669    412          5.67
3 Bedjondo              534     7.07 2023-06-09         251    184          5.21
4 Bekourou              496     6.84 2023-06-17         251    164          6.04
5 Bedaya                435     7.10 2023-07-04         209    147          6.16
6 Koumra                253     7.11 2023-08-14         138     84          6.26
# ℹ 4 more variables: mean_age_female <dbl>, n_death_u6m <int>,
#   prop_female <dbl>, prop_hosp <dbl>

Done!

You should be proud of yourselves, making summary tables is an important skill to an epidemiologist, making it in R is very efficient! Don’t forget to save your code!

Going Further

Extra Exercises

  1. Make a summary table that summarizes:

    • The number of patients

    • The proportion of male

    • The number of deaths

    • The CFR

    • The number of deaths among patients that had pneumonia

      in all the different age groups !

  2. Make a table that shows the proportion of patients by age with any measles vaccine (by oral recall or card) and those with 1 or 2 doses.

  3. Make a table that compares the proportion of hospitalised and non-hospitalised patients with positive malaria RDT, fever, rash, cough, red eye, pneumonia, encephalitis, and “red” or “yellow” MUAC (less than 125 mm).

  4. Calculate the mean days from first symptom onset to consultation by sub-prefecture.

  5. Calculate the mean time spent in hospital (i.e. days from admission to outcome) by outcome (i.e. in those who recovered and those who died).

Additional Resources