Chapter 3 Pseudonymization

3.1 Objectives

Learn to:

assess the variables in a dataset for disclosure risk and utility
implement pseudonymization procedures, where necessary, to limit disclosure risk
ensure that a final dataset meets all data-sharing requirements

3.2 Background

3.2.1 Disclosure

Disclosure occurs when a person or organization recognizes or learns, through released data, something that they did not already know about another identifiable person or organization. This might occur through:

spontaneous recognition (i.e. someone with knowledge of the sampled population recognizes a unique or particular combination of data values)
record matching/linkage with other existing datasets (e.g. population registers, electoral rolls, data from specialized firms)
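
To make record linkage concrete, the sketch below joins a toy "released" dataset to a toy external register on the indirect identifiers they share. All data and column names are invented for illustration; a real linkage attempt would rarely be this clean.

```r
library(dplyr)

# Toy released dataset (hypothetical): no names, but three indirect identifiers
released <- tibble(
  sex       = c("Female", "Male", "Male", "Female"),
  age       = c(34, 71, 34, 34),
  village   = c("A", "A", "B", "B"),
  diagnosis = c("malaria", "cholera", "measles", "malaria")
)

# Toy external register (hypothetical): names alongside the same identifiers
register <- tibble(
  name    = c("Person 1", "Person 2", "Person 3"),
  sex     = c("Male", "Female", "Female"),
  age     = c(71, 34, 34),
  village = c("A", "A", "B")
)

# Link the two sources on the shared indirect identifiers
linked <- inner_join(register, released, by = c("sex", "age", "village"))

# Each register entry that matches exactly one released record is re-identified,
# and the sensitive variable (diagnosis) is now attached to a name
linked
```

Because every combination of sex, age and village in the toy register is unique in the released data, each named person is linked to a single record.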

3.2.2 Identifiers

Direct identifiers: variables that unambiguously reveal a person’s identity (e.g. name, passport number, phone number, physical address, email address)

Indirect identifiers: variables that contain information that, when combined with other variables, could lead to re-identification (e.g. sex, age, marital status, occupation). Note potential for elevated identifiability risk from extreme values of continuous variables (height, income, number of children, land area).
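
As a rough illustration of the point about extreme values, the sketch below tabulates a toy count variable, picks out values held by a single record, and then aggregates the upper tail into one category. The data and the cut-off of 3 are assumptions chosen only for the example.

```r
# Toy numbers of children per household (hypothetical); one household is unusually large
n_children <- c(0, 1, 2, 2, 3, 1, 0, 2, 4, 3, 1, 2, 14)

# Count how many records share each value
value_counts <- table(n_children)

# Values held by exactly one record stand out and may aid re-identification
unique_values <- as.numeric(names(value_counts)[value_counts == 1])
unique_values

# Aggregate the upper tail: every value of 3 or more becomes a single "3+" level
n_children_agg <- ifelse(n_children >= 3, "3+", as.character(n_children))
table(n_children_agg)
```

After aggregation no level is held by a single record, at the cost of losing detail in the upper tail.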

3.2.3 k-anonymity

A measure of re-identification risk for discrete variables. k is the number of records in a dataset that share a certain combination of indirect identifiers (e.g. how many records have sex = “male” and age_group = “40-49 years”?). A higher value of k means lower re-identification risk, because more records in the dataset share the same combination of indirect identifiers and any one of them is harder to single out.
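
A minimal sketch of the calculation, using dplyr and a toy dataset with invented values. The column name `k` and the threshold of 3 are choices made for this example only.

```r
library(dplyr)

# Toy dataset (hypothetical) with two indirect identifiers
toy <- tibble(
  sex       = c("Male", "Male", "Male", "Female", "Female", "Female", "Female"),
  age_group = c("40-49", "40-49", "40-49", "40-49", "40-49", "50-59", "50-59")
)

# k for each combination = number of records sharing that combination
k_counts <- toy %>%
  count(sex, age_group, name = "k")

# Combinations below a chosen threshold carry a higher re-identification risk
k_threshold <- 3
risky <- k_counts %>%
  filter(k < k_threshold)

# The smallest k over all combinations summarises the dataset as a whole
min(k_counts$k)
```

Here the three male records aged 40-49 meet the threshold, while both female combinations contain only two records each and would need further treatment.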

3.2.4 Pseudonymization

Methods used to transform a dataset to achieve an “acceptable level” of re-identification / disclosure risk. Two types of methods:

Non-perturbative: suppression (remove entire variables, or specific records or values) or aggregation (aggregate levels of a variable to reduce uniqueness)

Perturbative: shuffle values or add noise to a variable while preserving desired statistical properties
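
The sketch below applies one simple variant of each method to a toy data frame. The variables, the grouping of occupations, and the noise standard deviation of 50 are all assumptions for illustration; in practice these choices depend on the analysis the data must still support.

```r
set.seed(1)  # make the random perturbation reproducible

# Toy data (hypothetical): occupation is an indirect identifier, income is continuous
toy <- data.frame(
  occupation = c("nurse", "midwife", "farmer", "herder", "farmer", "nurse"),
  income     = c(1200, 1100, 400, 350, 420, 1250)
)

# Non-perturbative, aggregation: collapse detailed levels into broader groups
toy$occupation_agg <- ifelse(
  toy$occupation %in% c("nurse", "midwife"), "health worker", "agriculture"
)

# Non-perturbative, suppression: blank out the original detailed variable
toy$occupation <- NA_character_

# Perturbative, noise addition: zero-mean noise roughly preserves the mean
toy$income_noisy <- toy$income + rnorm(nrow(toy), mean = 0, sd = 50)

# Perturbative, shuffling: permute values across records, which keeps the
# marginal distribution exactly but breaks the link to individual records
toy$income_shuffled <- sample(toy$income)

toy
```

Note that shuffling a variable independently of the others also destroys its relationship with them, so it only suits variables whose marginal distribution is what matters.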

3.3 Typical workflow

1. Select a threshold value of k-anonymity that will be the minimum acceptable value for combinations of indirect identifiers within the released dataset (e.g. k = 5).
2. Assess re-identification risk of each variable (e.g. direct identifier, indirect identifier, non-identifying).
3. Assess utility of each variable for analysis (e.g. high, low, uncertain).
4. Withhold variables classified as direct identifiers (e.g. name, phone number, address). Consider withholding other variables with low utility and non-zero re-identification risk.
5. Merge groups of related indirect identifiers, where possible. For example, if the dataset contains two age-related variables age_in_years and age_in_months, merge these two variables into a new derived variable age, and withhold the original variables.
6. Review all unique values of ‘free-text’ type variables to ensure they do not contain identifying details. Aggregate or withhold as necessary.
7. Discretise any indirect identifiers that are continuous variables (e.g. height in cm -> discrete height categories).
8. Assess re-identification risk criterion (i.e. k-anonymity) using all indirect identifiers.
9. Pseudonymize indirect identifiers to limit re-identification risk (e.g. aggregate, withhold).
10. Repeat steps 8 and 9 until the given risk criterion is met.
11. Ensure that the final pseudonymized dataset and dictionary meet all data-sharing requirements.
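
As an illustration of the discretisation step (height in cm to height categories), here is a small sketch using base R's `cut()`. The heights, break points and labels are invented for the example.

```r
# Toy heights in cm (hypothetical values)
height_cm <- c(152, 168.5, 171, 183, 190, 149, 175)

# cut() assigns each value to an interval; right = FALSE makes the intervals
# closed on the left, so [160, 170) contains 160 but not 170
height_cat <- cut(
  height_cm,
  breaks = c(-Inf, 160, 170, 180, Inf),
  labels = c("<160", "160-<170", "170-<180", ">=180"),
  right  = FALSE
)

# Check how many records fall in each category
table(height_cat)
```

Wider intervals give larger groups and therefore higher k, but remove more detail, so the break points are a trade-off between risk and utility.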

3.4 Exercise

This repository includes an example dataset based on a mortality survey (see section 1.1 Setup for data download links). Load the dataset and pre-prepared data dictionary using the example code below, and use them to work through the pseudonymization workflow described above.

library(rio)
library(here)

# import dataset and prepared data dictionary
dat <- rio::import(here("data/mortality_survey_simple_data.xlsx"), setclass = "tbl")
dict <- rio::import(here("data/mortality_survey_simple_dict_pre_pseudonym.xlsx"), setclass = "tbl")

As you make your way through the pseudonymization workflow, answer the following questions:

(1) Which variables, if any, did you assess as either direct or indirect identifiers?

(2) Despite not being completely familiar with the original study, were you able to assess any variables as being of low utility for analysis? If so, what actions did you take?

(3) Did you find any groups of related variables that you decided to merge into a new derived variable?

(4) In your assessment of free-text variables, did you notice any values that were potentially identifying? If so, what actions did you take?

(5) In your initial application of Step 8 of the pseudonymization workflow, were there any combinations of indirect identifiers yielding values of k below your pre-selected threshold? If so, what action did you take?

See proposed workflow

Load required packages and read data/dictionary

# ensures the package "pacman" is installed
if (!require("pacman")) install.packages("pacman")

# load/install packages from CRAN
pacman::p_load(
  
  # project and file management
  here,     # file paths relative to R project root folder
  rio,      # import/export of many types of data
  
  # general data management
  dplyr    # data wrangling
)

# Load/install packages from GitHub
pacman::p_load_gh(
  # data dictionary
  "epicentre-msf/datadict"  # create/validate data dictionary
)

# read/prep dictionary
odk_survey <- rio::import(here("data/mortality_survey_simple_kobo.xlsx"), sheet = "survey", setclass = "tbl")
odk_choices <- rio::import(here("data/mortality_survey_simple_kobo.xlsx"), sheet = "choices", setclass = "tbl")

dict <- datadict::dict_from_odk(odk_survey, odk_choices)

# read dataset (and reclass columns according to dictionary)
dat <- rio::import(here("data/mortality_survey_simple_data.xlsx"), setclass = "tbl") %>% 
  datadict::reclass_data(dict)

3.4.0.1 Select a threshold value of k-anonymity

k = 5

3.4.0.2 Assess re-identification risk of each variable (e.g. direct identifier, indirect identifier, non-identifying).

Indirect identifiers:

location
source_water
sex
age_under_one
age_months
age_years
arrived
date_arrived
departed
date_departed
born
date_born

3.4.0.3 Assess utility of each variable for analysis (e.g. high, low, uncertain).

Most variables of ‘high’ utility for analysis

3.4.0.4 Withhold variables classified as direct identifiers. Consider withholding other variables with low utility and non-zero re-identification risk.

No direct identifiers

3.4.0.5 Merge groups of related indirect identifiers, where possible.

The most obvious group of related variables to merge is the set of age-related variables: age_under_one, age_months, and age_years. We’ll also merge all of the date variables (except date_died) and related indicators into a single new derived variable exposure, the number of days that a given person was at risk of mortality.

## merge age variables ---------------------------------------------------------
age_group_levels <- c("0-2", "3-14", "15-29", "30-44", "45+")

dat <- dat %>% 
  mutate(
    age_group = case_when(
      age_under_one == "Yes" ~ "0-2",
      age_years <= 2 ~ "0-2",
      age_years >= 3 & age_years <= 14 ~ "3-14",
      age_years >= 15 & age_years <= 29 ~ "15-29",
      age_years >= 30 & age_years <= 44 ~ "30-44",
      age_years >= 45 ~ "45+"
    ),
    age_group = factor(age_group, levels = age_group_levels),
    .after = age_years
  )

# withhold old age variables
dat$age_under_one <- NA_character_
dat$age_months <- NA_real_
dat$age_years <- NA_real_

vars_withhold <- c(
  "age_under_one",
  "age_months",
  "age_years"
)

dict$status[dict$variable_name %in% vars_withhold] <- "withheld"

# add new age var to dictionary
age_group_choices <- datadict::generate_coded_options(age_group_levels)

dict <- dict %>% 
  add_row(
    variable_name = "age_group",
    short_label = "Age group (downscaled from age variables `age_months` and `age_years`)",
    type = "Coded list",
    choices = age_group_choices,
    origin = "derived",
    status = "shared",
    .after = which(.$variable_name == "age_years")
  )

## merge date variables --------------------------------------------------------
dat <- dat %>% 
  mutate(
    date_min = as.Date("2020-07-01")
  ) %>% 
  mutate(
    date_start = as.Date(apply(select(., date_min, date_arrived, date_born), 1, max, na.rm = TRUE)),
    # the at-risk period ends at the earliest applicable end date
    date_end = as.Date(apply(select(., date, date_departed), 1, min, na.rm = TRUE)),
    exposure = as.integer(date_end - date_start),
    .after = date_died
  ) %>% 
  select(-date_min, -date_start, -date_end)

# withhold old date variables (and related indicators)
dat$born <- NA_character_
dat$date_born <- as.Date(NA_character_)
dat$arrived <- NA_character_
dat$date_arrived <- as.Date(NA_character_)
dat$departed <- NA_character_
dat$date_departed <- as.Date(NA_character_)

vars_withhold <- c(
  "born",
  "date_born",
  "arrived",
  "date_arrived",
  "departed",
  "date_departed"
)

dict$status[dict$variable_name %in% vars_withhold] <- "withheld"

# add new exposure var to dictionary
dict <- dict %>% 
  add_row(
    variable_name = "exposure",
    short_label = "Exposure period in days (derived from study start date '2020-07-01' and variables `date_born`, `date_arrived`, and `date_departed`)",
    type = "Numeric",
    choices = NA_character_,
    origin = "derived",
    status = "shared",
    .after = which(.$variable_name == "date_died")
  )
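
The exposure calculation can also be sketched with base R's vectorised `pmax()` and `pmin()`, which work row-wise and keep the Date class. The toy data below are invented. The sketch assumes that `date` is the interview date and that the at-risk period runs from the latest of study start, arrival and birth to the earliest of interview and departure.

```r
# Toy records (hypothetical dates); NA means the event did not occur
toy <- data.frame(
  date_arrived  = as.Date(c(NA, "2020-08-15", NA)),
  date_born     = as.Date(c(NA, NA, "2020-09-01")),
  date_departed = as.Date(c(NA, NA, "2020-10-01")),
  date          = as.Date(c("2020-11-01", "2020-11-01", "2020-11-01"))
)

# Study start date, repeated once per record
date_min <- rep(as.Date("2020-07-01"), nrow(toy))

# Row-wise latest start and earliest end, ignoring missing dates
date_start <- pmax(date_min, toy$date_arrived, toy$date_born, na.rm = TRUE)
date_end   <- pmin(toy$date, toy$date_departed, na.rm = TRUE)

# Exposure in whole days
toy$exposure <- as.integer(date_end - date_start)
toy$exposure
```

The first person is present for the whole recall period, the second arrives part-way through, and the third is born and then departs within it.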

3.4.0.6 Review all unique values of ‘free-text’ type variables to ensure they do not contain identifying details. Aggregate or withhold as necessary.

dict %>% 
  filter(type %in% "Free text") %>% 
  select(1:3)

## # A tibble: 3 × 3
##   variable_name      short_label                              type     
##   <chr>              <chr>                                    <chr>    
## 1 source_water_other Please, specify other source of water    Free text
## 2 ilness_other       If other illness, specify, please        Free text
## 3 cause_death_other  If other cause of death, specify, please Free text

dat %>% 
  count(source_water_other)

## # A tibble: 3 × 2
##   source_water_other        n
##   <chr>                 <int>
## 1 Buying purified water     2
## 2 Untreated water wells     2
## 3 <NA>                    996

dat %>% 
  count(ilness_other)

## # A tibble: 3 × 2
##   ilness_other                 n
##   <chr>                    <int>
## 1 Diarrhea, fever and rash     1
## 2 died before                  1
## 3 <NA>                       998

dat %>% 
  count(cause_death_other)

## # A tibble: 4 × 2
##   cause_death_other     n
##   <chr>             <int>
## 1 Cancer                1
## 2 Gunshot               1
## 3 Stroke                1
## 4 <NA>                997

The value “Untreated water wells” of variable source_water_other is potentially identifying. We’ll treat source_water_other as an indirect identifier along with variable source_water.
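
When a dataset has many free-text variables, the review above can be done in one pass instead of one `count()` call per variable. The sketch below uses small stand-in objects with invented values. It assumes only that the dictionary has `variable_name` and `type` columns, as shown above.

```r
# Toy dataset and dictionary (hypothetical values mirroring the structure above)
toy_dat <- data.frame(
  source_water_other = c(NA, "Buying purified water", NA, "Untreated water wells"),
  cause_death_other  = c(NA, NA, "Stroke", NA)
)
toy_dict <- data.frame(
  variable_name = c("source_water_other", "cause_death_other"),
  type          = c("Free text", "Free text")
)

# Names of all free-text variables according to the dictionary
vars_free_text <- toy_dict$variable_name[toy_dict$type %in% "Free text"]

# Tabulate the non-missing values of each free-text variable
free_text_values <- lapply(
  toy_dat[vars_free_text],
  function(x) sort(table(x), decreasing = TRUE)
)

free_text_values
```

The values still need to be read by a person; the code only gathers them in one place.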

3.4.0.7 Discretise any indirect identifiers that are continuous variables.

None remaining (already discretised the age variables when we merged them)

3.4.0.8 Assess re-identification risk criterion (i.e. k-anonymity) using all indirect identifiers.

vars_indirect <- c(
  "location",
  "source_water",
  "source_water_other",
  "sex",
  "age_group"
)

datadict::k_anonymity_counts(dat, vars_indirect, threshold = 5)

## # A tibble: 25 × 6
##    location source_water                source_water_other sex   age_group     k
##    <chr>    <chr>                       <chr>              <chr> <fct>     <int>
##  1 Town A   Direct from canal           <NA>               Fema… 0-2           1
##  2 Town A   Direct from canal           <NA>               Male  3-14          1
##  3 Town A   Other (specify)             Buying purified w… Male  3-14          1
##  4 Town A   Other (specify)             Buying purified w… Male  15-29         1
##  5 Town A   Tank filled by a truck tra… <NA>               Fema… 3-14          1
##  6 Town A   Tank filled by a truck tra… <NA>               Male  0-2           1
##  7 Town A   Tank filled by a truck tra… <NA>               Male  30-44         1
##  8 Town B   Direct from canal           <NA>               Male  3-14          1
##  9 Town B   Direct from canal           <NA>               Male  30-44         1
## 10 Town B   Other (specify)             Untreated water w… Male  3-14          1
## # ℹ 15 more rows

3.4.0.9 Pseudonymize indirect identifiers to limit re-identification risk (e.g. aggregate, withhold).

With the five indirect identifiers noted above, we are far from our k-anonymity threshold: there are 25 combinations with k < 5. We’re particularly interested in keeping variables sex and age_group, so let’s see what would happen if we withheld different combinations of the other identifiers.

datadict::k_anonymity_counts(dat, c("sex", "age_group"), threshold = 5)

## # A tibble: 0 × 3
## # ℹ 3 variables: sex <chr>, age_group <fct>, k <int>

datadict::k_anonymity_counts(dat, c("sex", "age_group", "location"), threshold = 5)

## # A tibble: 0 × 4
## # ℹ 4 variables: sex <chr>, age_group <fct>, location <chr>, k <int>

datadict::k_anonymity_counts(dat, c("sex", "age_group", "source_water"), threshold = 5)

## # A tibble: 12 × 4
##    sex    age_group source_water                                            k
##    <chr>  <fct>     <chr>                                               <int>
##  1 Female 0-2       Direct from canal                                       1
##  2 Male   30-44     Direct from canal                                       1
##  3 Female 3-14      Direct from canal                                       2
##  4 Female 45+       Tank filled by a truck transporting untreated water     2
##  5 Male   0-2       Tank filled by a truck transporting untreated water     2
##  6 Male   3-14      Direct from canal                                       2
##  7 Male   3-14      Other (specify)                                         2
##  8 Male   15-29     Other (specify)                                         2
##  9 Male   30-44     Tank filled by a truck transporting untreated water     2
## 10 Male   45+       Tank filled by a truck transporting untreated water     3
## 11 Female 3-14      Tank filled by a truck transporting untreated water     4
## 12 Male   15-29     Tank filled by a truck transporting untreated water     4
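
The same comparison can be run for every subset of the optional identifiers at once. The base-R sketch below uses a toy dataset with invented values. The helper `min_k()` is hypothetical and is not part of datadict.

```r
# Toy dataset (hypothetical): sex and age_group are kept, the others are optional
toy <- data.frame(
  sex       = rep(c("Female", "Male"), each = 6),
  age_group = rep(c("0-14", "15+"), times = 6),
  location  = rep(c("Town A", "Town B"), each = 3, times = 2),
  water     = c(rep("Piped", 11), "Canal")
)

# Hypothetical helper: smallest group size for a given set of identifier columns
min_k <- function(data, vars) {
  min(table(interaction(data[vars], drop = TRUE)))
}

vars_keep     <- c("sex", "age_group")
vars_optional <- c("location", "water")

# Enumerate every subset of the optional identifiers, including the empty set
subsets <- c(
  list(character(0)),
  unlist(
    lapply(seq_along(vars_optional), function(m) combn(vars_optional, m, simplify = FALSE)),
    recursive = FALSE
  )
)

# Minimum k obtained when each subset is added to the kept variables
results <- data.frame(
  added = vapply(subsets, paste, character(1), collapse = " + "),
  min_k = vapply(subsets, function(s) min_k(toy, c(vars_keep, s)), numeric(1))
)

results
```

Subsets whose minimum k falls below the threshold are candidates for aggregation or withholding. The number of subsets doubles with each optional variable, so this is only practical for a handful of identifiers.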

The variables source_water and source_water_other could potentially be aggregated into categories like “Treated water” and “Untreated water”.

dat %>% 
  count(source_water, source_water_other)

## # A tibble: 5 × 3
##   source_water                                        source_water_other       n
##   <chr>                                               <chr>                <int>
## 1 City water network piped to household               <NA>                   942
## 2 Direct from canal                                   <NA>                     6
## 3 Other (specify)                                     Buying purified wat…     2
## 4 Other (specify)                                     Untreated water wel…     2
## 5 Tank filled by a truck transporting untreated water <NA>                    48

dat <- dat %>% 
  mutate(
    water_source_agg = case_when(
      source_water == "City water network piped to household" ~ "Source treated",
      source_water_other == "Buying purified water" ~ "Source treated",
      !is.na(source_water) ~ "Source untreated",
      TRUE ~ NA_character_
    )
  )

datadict::k_anonymity_counts(dat, c("sex", "age_group", "water_source_agg"), threshold = 5)

## # A tibble: 4 × 4
##   sex    age_group water_source_agg     k
##   <chr>  <fct>     <chr>            <int>
## 1 Female 45+       Source untreated     2
## 2 Male   0-2       Source untreated     2
## 3 Male   30-44     Source untreated     3
## 4 Male   45+       Source untreated     3

3.4.0.10 Repeat steps 8 and 9 until the given risk criterion is met.

Even after aggregating the water source variables, we still do not meet our k-anonymity threshold if we also include sex and age_group. We therefore elect to withhold the water source variables.

dat$water_source_agg <- NULL # we can remove this one outright
dat$source_water <- NA_character_
dat$source_water_other <- NA_character_

vars_withhold <- c(
  "source_water",
  "source_water_other"
)

dict$status[dict$variable_name %in% vars_withhold] <- "withheld"
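
The withholding pattern used throughout this exercise (blank the column, then flag it in the dictionary) could be wrapped in a small helper. `withhold_vars()` below is a hypothetical function written for this sketch, not part of datadict. It assumes the dictionary has `variable_name` and `status` columns, as above.

```r
# Hypothetical helper: blank a set of columns and flag them in the dictionary
withhold_vars <- function(dat, dict, vars) {
  # fail early if a variable is missing from the data or the dictionary
  stopifnot(all(vars %in% names(dat)), all(vars %in% dict$variable_name))
  for (v in vars) {
    dat[[v]][] <- NA  # blank every value but keep the column and its class
  }
  dict$status[dict$variable_name %in% vars] <- "withheld"
  list(dat = dat, dict = dict)
}

# Toy inputs (hypothetical values)
toy_dat <- data.frame(
  id           = 1:3,
  source_water = c("Piped", "Canal", "Piped"),
  date_arrived = as.Date(c("2020-08-01", NA, "2020-09-15"))
)
toy_dict <- data.frame(
  variable_name = c("id", "source_water", "date_arrived"),
  status        = "shared"
)

out <- withhold_vars(toy_dat, toy_dict, c("source_water", "date_arrived"))
out$dat
out$dict
```

Handling the data and the dictionary in one call reduces the chance that the two drift out of step.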

3.4.0.11 Ensure that the final pseudonymized dataset and dictionary meet all data-sharing requirements.

# check dictionary valid
datadict::valid_dict(dict)

## [1] TRUE

# check dataset corresponds with dictionary
datadict::valid_data(dat, dict)

## Warning: - Columns present in `data` but not defined in `dict`: "id"

## [1] FALSE

The variable id seems to be missing from the dictionary, so we’ll have to create a manual entry.

dict <- dict %>% 
  add_row(
    variable_name = "id",
    short_label = "Participant ID",
    type = "Free text",
    choices = NA_character_,
    origin = "original",
    status = "shared",
    .before = 1
  )

We’ll run the checks one final time, including a final check of our k-anonymity threshold.

# check dictionary valid
datadict::valid_dict(dict)

## [1] TRUE

# check dataset corresponds with dictionary
datadict::valid_data(dat, dict)

## [1] TRUE

# check k-anonymity
datadict::k_anonymity_counts(dat, c("sex", "age_group", "location"), threshold = 5)

## # A tibble: 0 × 4
## # ℹ 4 variables: sex <chr>, age_group <fct>, location <chr>, k <int>
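
As an additional safeguard alongside the checks above, one might confirm that every variable marked as withheld in the dictionary really is empty in the dataset. The sketch below uses small stand-in objects with invented values and assumes the same `variable_name` and `status` columns.

```r
# Toy stand-ins for the final dataset and dictionary (hypothetical values)
toy_dat <- data.frame(
  id        = c("a", "b", "c"),
  age_years = NA_real_,
  age_group = c("0-2", "3-14", "45+")
)
toy_dict <- data.frame(
  variable_name = c("id", "age_years", "age_group"),
  status        = c("shared", "withheld", "shared")
)

# Variables the dictionary says are withheld
vars_withheld <- toy_dict$variable_name[toy_dict$status == "withheld"]

# TRUE for each withheld column that contains only missing values
is_blank <- vapply(toy_dat[vars_withheld], function(x) all(is.na(x)), logical(1))

# Halt if any withheld column still holds data
stopifnot(all(is_blank))
```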

Finally, we’ll write the final, pseudonymized dataset and data dictionary for sharing.

if (!dir.exists(here("output"))) dir.create(here("output"))
rio::export(dat, file = here("output/data_share.xlsx"))
rio::export(dict, file = here("output/dict_share.xlsx"))