Chapter 3 Pseudonymization
3.1 Objectives
Learn to:
- assess the variables in a dataset for disclosure risk and utility
- implement pseudonymization procedures, where necessary, to limit disclosure risk
- ensure that a final dataset meets all data-sharing requirements
3.2 Background
3.2.1 Disclosure
Disclosure occurs when a person or organization recognizes or learns something that they did not already know about another identifiable person or organization through released data. This might occur through:
- spontaneous recognition (i.e. someone with knowledge of the sampled population recognizes a unique or particular combination of data values)
- record matching/linkage with other existing datasets (e.g. population registers, electoral rolls, data from specialized firms)
3.2.2 Identifiers
Direct identifiers: variables that unambiguously reveal a person’s identity (e.g. name, passport number, phone number, physical address, email address)
Indirect identifiers: variables that contain information that, when combined with other variables, could lead to re-identification (e.g. sex, age, marital status, occupation). Note potential for elevated identifiability risk from extreme values of continuous variables (height, income, number of children, land area).
3.2.3 k-anonymity
A measure of re-identification risk for discrete variables. k = the number of records in a dataset containing a certain combination of indirect identifiers (e.g. how many records with sex = “male” and age_group = “40-49 years”?).
A higher value of k means lower re-identification risk, because a higher k means more records in the dataset with the same combination of indirect identifiers.
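To make the definition concrete, here is a minimal sketch of computing k with dplyr. The toy data frame and its values are invented for illustration and are not part of the exercise dataset.

```r
library(dplyr)

# Hypothetical toy data (not the exercise dataset)
toy <- data.frame(
  sex       = c("male", "male", "male", "female", "female", "male"),
  age_group = c("40-49", "40-49", "40-49", "40-49", "20-29", "20-29")
)

# k for each combination of indirect identifiers = number of records sharing it
k_counts <- toy %>%
  count(sex, age_group, name = "k")

# the smallest k across all combinations summarises the worst-case risk
min_k <- min(k_counts$k)
```

In this toy example the combination sex = "male" and age_group = "40-49" has k = 3, while every other combination is unique (k = 1), so the dataset as a whole only achieves k = 1.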
3.2.4 Pseudonymization
Methods used to transform a dataset to achieve an “acceptable level” of re-identification / disclosure risk. Two types of methods:
Non-perturbative: suppression (remove entire variables, or specific records or values) or aggregation (aggregate levels of a variable to reduce uniqueness)
Perturbative: shuffle values or add noise to a variable while preserving desired statistical properties
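The sketch below shows one possible form of each method on invented data. The variable names, cut-offs and noise level are assumptions chosen for illustration only; in practice they would depend on the dataset and the analysis it needs to support.

```r
library(dplyr)

set.seed(1) # make the random noise reproducible

# Hypothetical toy data (not the exercise dataset)
toy <- data.frame(
  name      = c("A. Smith", "B. Jones", "C. Lee", "D. Khan"),
  age       = c(23, 37, 41, 68),
  income    = c(1200, 1500, 900, 30000),
  height_cm = c(158, 171, 165, 180)
)

pseudo <- toy %>%
  mutate(
    # aggregation (non-perturbative): replace exact age with broad bands
    age_group = cut(age, breaks = c(0, 29, 49, Inf), labels = c("0-29", "30-49", "50+")),
    # suppression of a value (non-perturbative): blank out an extreme income
    income = if_else(income > 10000, NA_real_, income),
    # noise addition (perturbative): zero-mean noise leaves the mean roughly unchanged
    height_cm = height_cm + rnorm(n(), mean = 0, sd = 2)
  ) %>%
  # suppression of variables (non-perturbative): drop the direct identifier and exact age
  select(-name, -age)
```

Non-perturbative methods keep the remaining values truthful but coarser or missing, whereas perturbative methods keep the level of detail but alter individual values.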
3.3 Typical workflow
1. Select a threshold value of k-anonymity that will be the minimum acceptable value for combinations of indirect identifiers within the released dataset (e.g. k = 5).
2. Assess re-identification risk of each variable (e.g. direct identifier, indirect identifier, non-identifying).
3. Assess utility of each variable for analysis (e.g. high, low, uncertain).
4. Withhold variables classified as direct identifiers (e.g. name, phone number, address). Consider withholding other variables with low utility and non-zero re-identification risk.
5. Merge groups of related indirect identifiers, where possible. E.g. if the dataset contains two age-related variables age_in_years and age_in_months, merge these two variables into a new derived variable age, and withhold the original variables.
6. Review all unique values of ‘free-text’ type variables to ensure they do not contain identifying details. Aggregate or withhold as necessary.
7. Discretise any indirect identifiers that are continuous variables (e.g. height in cm -> discrete height categories).
8. Assess re-identification risk criterion (i.e. k-anonymity) using all indirect identifiers.
9. Pseudonymize indirect identifiers to limit re-identification risk (e.g. aggregate, withhold).
10. Repeat steps 8 and 9 until the given risk criterion is met.
11. Ensure that the final pseudonymized dataset and dictionary meet all data-sharing requirements.
3.4 Exercise
This repository includes an example dataset based on a mortality survey (see section 1.1 Setup for data download links). Load the dataset and pre-prepared data dictionary using the example code below, and use them to work through the pseudonymization workflow described above.
library(rio)
library(here)
# import dataset and prepared data dictionary
dat <- rio::import(here("data/mortality_survey_simple_data.xlsx"), setclass = "tbl")
dict <- rio::import(here("data/mortality_survey_simple_dict_pre_pseudonym.xlsx"), setclass = "tbl")
As you make your way through the pseudonymization workflow, answer the following questions:
(1) Which variables, if any, did you assess as either direct or indirect identifiers?
(2) Despite not being completely familiar with the original study, were you able to assess any variables as being of low utility for analysis? If so, what actions did you take?
(3) Did you find any groups of related variables that you decided to merge into a new derived variable?
(4) In your assessment of free-text variables, did you notice any values that were potentially identifying? If so, what actions did you take?
(5) In your initial application of Step 8 of the pseudonymization workflow, were there any combinations of indirect identifiers yielding values of k below your pre-selected threshold? If so, what action did you take?
See proposed workflow
Load required packages and read data/dictionary
# ensures the package "pacman" is installed
if (!require("pacman")) install.packages("pacman")
# load/install packages from CRAN
pacman::p_load(
  # project and file management
  here, # file paths relative to R project root folder
  rio, # import/export of many types of data
  # general data management
  dplyr # data wrangling
)
# Load/install packages from GitHub
pacman::p_load_gh(
  # data dictionary
  "epicentre-msf/datadict" # create/validate data dictionary
)
# read/prep dictionary
odk_survey <- rio::import(here("data/mortality_survey_simple_kobo.xlsx"), sheet = "survey", setclass = "tbl")
odk_choices <- rio::import(here("data/mortality_survey_simple_kobo.xlsx"), sheet = "choices", setclass = "tbl")
dict <- datadict::dict_from_odk(odk_survey, odk_choices)
# read dataset (and reclass columns according to dictionary)
dat <- rio::import(here("data/mortality_survey_simple_data.xlsx"), setclass = "tbl") %>%
  datadict::reclass_data(dict)
3.4.0.2 Assess re-identification risk of each variable (e.g. direct identifier, indirect identifier, non-identifying).
Indirect identifiers:
location
source_water
sex
age_under_one
age_months
age_years
arrived
date_arrived
departed
date_departed
born
date_born
3.4.0.3 Assess utility of each variable for analysis (e.g. high, low, uncertain).
Most variables of ‘high’ utility for analysis
3.4.0.4 Withhold variables classified as direct identifiers. Consider withholding other variables with low utility and non-zero re-identification risk.
No direct identifiers
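The merge step (step 5 of the workflow) is not shown in this walkthrough, although later steps use a derived age_group variable with levels 0-2, 3-14, 15-29, 30-44 and 45+. As a rough sketch only, the three age variables listed above might be combined along the following lines. The coding of age_under_one ("Yes"/"No"), the toy values and the exact band boundaries are all assumptions, not taken from the original dataset.

```r
library(dplyr)

# Hypothetical toy data mimicking the three age variables (coding is assumed)
toy <- data.frame(
  age_under_one = c("Yes", "No", "No", "No", "No"),
  age_months    = c(7, NA, NA, NA, NA),
  age_years     = c(NA, 2, 14, 30, 61)
)

toy <- toy %>%
  mutate(
    # express every age in completed years
    age = if_else(age_under_one == "Yes", floor(age_months / 12), age_years),
    # discretise into bands; each interval is closed on the right, e.g. (2, 14]
    age_group = cut(
      age,
      breaks = c(-Inf, 2, 14, 29, 44, Inf),
      labels = c("0-2", "3-14", "15-29", "30-44", "45+")
    )
  ) %>%
  # withhold the original variables and the intermediate exact age
  select(-age_under_one, -age_months, -age_years, -age)
```

Merging in this way replaces three related indirect identifiers with a single, coarser one, which reduces the number of identifier combinations that need to be checked for k-anonymity.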
3.4.0.6 Review all unique values of ‘free-text’ type variables to ensure they do not contain identifying details. Aggregate or withhold as necessary.
dict %>%
  filter(type %in% "Free text") %>%
  select(1:3)
## # A tibble: 3 × 3
## variable_name short_label type
## <chr> <chr> <chr>
## 1 source_water_other Please, specify other source of water Free text
## 2 ilness_other If other illness, specify, please Free text
## 3 cause_death_other If other cause of death, specify, please Free text
dat %>%
  count(source_water_other)
## # A tibble: 3 × 2
## source_water_other n
## <chr> <int>
## 1 Buying purified water 2
## 2 Untreated water wells 2
## 3 <NA> 996
dat %>%
  count(ilness_other)
## # A tibble: 3 × 2
## ilness_other n
## <chr> <int>
## 1 Diarrhea, fever and rash 1
## 2 died before 1
## 3 <NA> 998
dat %>%
  count(cause_death_other)
## # A tibble: 4 × 2
## cause_death_other n
## <chr> <int>
## 1 Cancer 1
## 2 Gunshot 1
## 3 Stroke 1
## 4 <NA> 997
Value “Untreated water wells” of variable source_water_other is potentially identifying. We’ll treat it as an indirect identifier along with variable source_water.
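With more than a handful of free-text variables, the same review can be done in a loop rather than one count() call per variable. The following is a minimal sketch on invented data; the toy data frame and the object names free_text_vars and free_text_values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with two free-text columns
toy <- data.frame(
  source_water_other = c("Untreated water wells", NA, NA),
  cause_death_other  = c(NA, "Gunshot", NA)
)
free_text_vars <- c("source_water_other", "cause_death_other")

# one frequency table per free-text variable, ignoring missing values
free_text_values <- lapply(free_text_vars, function(v) {
  toy %>%
    filter(!is.na(.data[[v]])) %>%
    count(value = .data[[v]], name = "n")
})
names(free_text_values) <- free_text_vars
```

In the walkthrough itself, the vector of variable names could presumably be taken from the dictionary, e.g. dict$variable_name[dict$type %in% "Free text"], given the columns shown in the dictionary printout above.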
3.4.0.7 Discretise any indirect identifiers that are continuous variables.
None remaining (already discretised the age variables when we merged them)
3.4.0.8 Assess re-identification risk criterion (i.e. k-anonymity) using all indirect identifiers.
vars_indirect <- c(
  "location",
  "source_water",
  "source_water_other",
  "sex",
  "age_group"
)
datadict::k_anonymity_counts(dat, vars_indirect, threshold = 5)
## # A tibble: 25 × 6
## location source_water source_water_other sex age_group k
## <chr> <chr> <chr> <chr> <fct> <int>
## 1 Town A Direct from canal <NA> Fema… 0-2 1
## 2 Town A Direct from canal <NA> Male 3-14 1
## 3 Town A Other (specify) Buying purified w… Male 3-14 1
## 4 Town A Other (specify) Buying purified w… Male 15-29 1
## 5 Town A Tank filled by a truck tra… <NA> Fema… 3-14 1
## 6 Town A Tank filled by a truck tra… <NA> Male 0-2 1
## 7 Town A Tank filled by a truck tra… <NA> Male 30-44 1
## 8 Town B Direct from canal <NA> Male 3-14 1
## 9 Town B Direct from canal <NA> Male 30-44 1
## 10 Town B Other (specify) Untreated water w… Male 3-14 1
## # ℹ 15 more rows
3.4.0.9 Pseudonymize indirect identifiers to limit re-identification risk (e.g. aggregate, withhold).
With the five indirect identifiers noted above, we are far from our k-anonymity threshold: there are 25 combinations with k < 5. We’re particularly interested in keeping variables sex and age_group, so let’s see what would happen if we withheld different combinations of the other identifiers.
datadict::k_anonymity_counts(dat, c("sex", "age_group"), threshold = 5)
## # A tibble: 0 × 3
## # ℹ 3 variables: sex <chr>, age_group <fct>, k <int>
datadict::k_anonymity_counts(dat, c("sex", "age_group", "location"), threshold = 5)
## # A tibble: 0 × 4
## # ℹ 4 variables: sex <chr>, age_group <fct>, location <chr>, k <int>
datadict::k_anonymity_counts(dat, c("sex", "age_group", "source_water"), threshold = 5)
## # A tibble: 12 × 4
## sex age_group source_water k
## <chr> <fct> <chr> <int>
## 1 Female 0-2 Direct from canal 1
## 2 Male 30-44 Direct from canal 1
## 3 Female 3-14 Direct from canal 2
## 4 Female 45+ Tank filled by a truck transporting untreated water 2
## 5 Male 0-2 Tank filled by a truck transporting untreated water 2
## 6 Male 3-14 Direct from canal 2
## 7 Male 3-14 Other (specify) 2
## 8 Male 15-29 Other (specify) 2
## 9 Male 30-44 Tank filled by a truck transporting untreated water 2
## 10 Male 45+ Tank filled by a truck transporting untreated water 3
## 11 Female 3-14 Tank filled by a truck transporting untreated water 4
## 12 Male 15-29 Tank filled by a truck transporting untreated water 4
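The comparison across candidate sets of identifiers can also be scripted. The sketch below uses plain dplyr on invented data to count, for each candidate set, how many combinations fall below a threshold. It is an illustration of the idea, not a description of how datadict::k_anonymity_counts() works internally, and the toy data, threshold and object names are assumptions.

```r
library(dplyr)

# Hypothetical toy data (not the exercise dataset)
toy <- data.frame(
  sex       = c("Male", "Male", "Female", "Female", "Male", "Female"),
  age_group = c("0-2", "0-2", "0-2", "0-2", "0-2", "0-2"),
  location  = c("Town A", "Town A", "Town A", "Town B", "Town B", "Town B")
)

# candidate sets of indirect identifiers to compare
candidate_sets <- list(
  c("sex", "age_group"),
  c("sex", "age_group", "location")
)

# number of identifier combinations with k below the threshold (here 3) per set
n_below <- sapply(candidate_sets, function(vars) {
  toy %>%
    count(across(all_of(vars)), name = "k") %>%
    filter(k < 3) %>%
    nrow()
})
```

Here the first set yields no combinations below the threshold, while adding location produces four, which mirrors the pattern seen above when source_water is added to sex and age_group.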
The variables source_water and source_water_other could potentially be aggregated into categories like “Treated water” and “Untreated water”.
dat %>%
  count(source_water, source_water_other)
## # A tibble: 5 × 3
## source_water source_water_other n
## <chr> <chr> <int>
## 1 City water network piped to household <NA> 942
## 2 Direct from canal <NA> 6
## 3 Other (specify) Buying purified wat… 2
## 4 Other (specify) Untreated water wel… 2
## 5 Tank filled by a truck transporting untreated water <NA> 48
dat <- dat %>%
  mutate(
    water_source_agg = case_when(
      source_water == "City water network piped to household" ~ "Source treated",
      source_water_other == "Buying purified water" ~ "Source treated",
      !is.na(source_water) ~ "Source untreated",
      TRUE ~ NA_character_
    )
  )
datadict::k_anonymity_counts(dat, c("sex", "age_group", "water_source_agg"), threshold = 5)
## # A tibble: 4 × 4
## sex age_group water_source_agg k
## <chr> <fct> <chr> <int>
## 1 Female 45+ Source untreated 2
## 2 Male 0-2 Source untreated 2
## 3 Male 30-44 Source untreated 3
## 4 Male 45+ Source untreated 3
3.4.0.10 Repeat steps 8 and 9 until the given risk criterion is met.
Even after aggregating the water source variables, we still do not meet our k-anonymity threshold if we also include sex and age_group. We therefore elect to withhold the water source variables.
dat$water_source_agg <- NULL # we can remove this one outright
dat$source_water <- NA_character_
dat$source_water_other <- NA_character_
vars_withhold <- c(
  "source_water",
  "source_water_other"
)
dict$status[dict$variable_name %in% vars_withhold] <- "withheld"
3.4.0.11 Ensure that the final pseudonymized dataset and dictionary meet all data-sharing requirements.
# check dictionary valid
datadict::valid_dict(dict)
## [1] TRUE
# check dataset corresponds with dictionary
datadict::valid_data(dat, dict)
## Warning: - Columns present in `data` but not defined in `dict`: "id"
## [1] FALSE
The variable id seems to be missing from the dictionary, so we’ll have to create a manual entry.
dict <- dict %>%
  add_row(
    variable_name = "id",
    short_label = "Participant ID",
    type = "Free text",
    choices = NA_character_,
    origin = "original",
    status = "shared",
    .before = 1
  )
We’ll run the checks one final time, including a final check of our k-anonymity threshold.
# check dictionary valid
datadict::valid_dict(dict)
## [1] TRUE
# check dataset corresponds with dictionary
datadict::valid_data(dat, dict)
## [1] TRUE
# check k-anonymity
datadict::k_anonymity_counts(dat, c("sex", "age_group", "location"), threshold = 5)
## # A tibble: 0 × 4
## # ℹ 4 variables: sex <chr>, age_group <fct>, location <chr>, k <int>
Finally, we’ll write the final, pseudonymized dataset and data dictionary for sharing.
if (!dir.exists(here("output"))) dir.create(here("output"))
rio::export(dat, file = here("output/data_share.xlsx"))
rio::export(dict, file = here("output/dict_share.xlsx"))