Count the number of observations across unique combinations of indirect identifiers within a dataset

Given a dataset and a set of one or more variables that may be indirect identifiers, the function returns a table giving the number of observations for each unique combination of those variables (i.e. k), optionally filtered to the combinations that do not meet a user-specified threshold of k-anonymity.

Usage

k_anonymity_counts(x, vars, threshold = NULL)

Arguments

x: A data frame
vars: A character vector containing the name(s) of the variable(s) in x to be included in the k-anonymity calculation
threshold: Integer giving the minimum acceptable value of k. Combinations with k below the threshold are flagged and returned; a result with 0 rows indicates that no combination has k below the threshold. Defaults to NULL, in which case the counts are not filtered.

Value

A tibble-style data frame containing the count (column k) for each unique combination of the variables specified in argument vars. If argument threshold is specified, only combinations with counts lower than the threshold are returned, i.e. combinations that do not meet the specified level of k-anonymity.
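
The behaviour described above can be approximated with dplyr, which may help when checking what the returned counts mean. This is a minimal sketch and not the package's actual source: the helper name k_counts_sketch and the toy data are hypothetical, and it assumes that missing values are counted as their own group, as the NA rows in the example output below suggest.

```r
library(dplyr)

# Hypothetical re-implementation for illustration only;
# not the source of k_anonymity_counts().
k_counts_sketch <- function(x, vars, threshold = NULL) {
  # Count observations per unique combination of `vars`;
  # dplyr keeps NA as its own group.
  counts <- count(x, across(all_of(vars)), name = "k")

  # Optionally keep only combinations with k below the threshold
  if (!is.null(threshold)) {
    counts <- filter(counts, .data$k < threshold)
  }
  counts
}

# Toy data (made up for this sketch)
toy <- data.frame(
  gender  = c("f", "f", "m", "m", "m", NA),
  age_cat = c("0-17", "0-17", "0-17", "18-49", "18-49", "18-49")
)

# All combinations with their counts
all_counts <- k_counts_sketch(toy, vars = c("gender", "age_cat"))

# Only combinations observed fewer than 2 times
flagged <- k_counts_sketch(toy, vars = c("gender", "age_cat"), threshold = 2)
```

In this toy data, all_counts has four rows and flagged keeps the two combinations observed only once. A result with 0 rows would mean every combination meets the threshold.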

Examples

# read example dataset
path_data <- system.file("extdata", package = "datadict")
dat <- readxl::read_xlsx(file.path(path_data, "linelist_cleaned.xlsx"))

# display combinations of gender and age_cat with k < 5
k_anonymity_counts(dat, vars = c("gender", "age_cat"), threshold = 5)
#> # A tibble: 3 × 3
#>   gender age_cat     k
#>   <chr>  <chr>   <int>
#> 1 NA     70+         1
#> 2 f      50-69       2
#> 3 NA     50-69       2