Count the number of observations across unique combinations of indirect identifiers within a dataset
Source: R/k_anonymity_counts.R
Given a dataset and a set of one or more variables that may be indirect identifiers, the function returns a table of counts of the number of observations corresponding to each unique combination of those variables (i.e. k), optionally filtered to those combinations that do not meet the user-specified threshold of k-anonymity.
Arguments
- x
A data frame
- vars
A character vector containing the name(s) of the variable(s) in x to be included in the k-anonymity calculation
- threshold
Integer threshold indicating the minimum acceptable value of k. Combinations with values of k below the threshold will be flagged and returned. A return with 0 rows indicates that no combinations have values of k below the threshold.
Value
A tibble-style data frame containing counts of unique
combinations of the variables specified in argument vars. If argument
threshold is specified, only the combinations with counts lower than the
threshold are returned, if any (i.e. combinations that do not meet the
specified value of k-anonymity).
Examples
# read example dataset
path_data <- system.file("extdata", package = "datadict")
dat <- readxl::read_xlsx(file.path(path_data, "linelist_cleaned.xlsx"))
# display combinations of gender and age_cat with k < 5
k_anonymity_counts(dat, vars = c("gender", "age_cat"), threshold = 5)
#> # A tibble: 3 × 3
#>   gender age_cat     k
#>   <chr>  <chr>   <int>
#> 1 NA     70+         1
#> 2 f      50-69       2
#> 3 NA     50-69       2
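
The counting logic described above can be sketched in base R. This is only a rough illustration under stated assumptions, not the package's implementation: the helper count_k(), the toy data frame, and the NULL default for threshold are all hypothetical. Missing values are treated as their own group, as in the example output above.

```r
# Toy data (hypothetical; not the package's linelist_cleaned.xlsx)
toy <- data.frame(
  gender  = c("f", "f", "m", "m", "m", NA),
  age_cat = c("0-4", "0-4", "0-4", "5-17", "5-17", "70+"),
  stringsAsFactors = FALSE
)

# Hypothetical helper illustrating the idea; not the real k_anonymity_counts()
count_k <- function(x, vars, threshold = NULL) {
  # One key per row. paste() converts NA to the string "NA", so missing
  # values form their own group (a literal "NA" string would collide with it).
  key <- do.call(paste, c(unname(as.list(x[vars])), sep = "\u001f"))

  # Keep the first row of each unique combination
  first <- !duplicated(key)
  out <- x[first, vars, drop = FALSE]

  # k = number of observations sharing each combination
  out$k <- as.integer(table(key)[key[first]])

  # Optionally keep only combinations with k below the threshold
  if (!is.null(threshold)) {
    out <- out[out$k < threshold, , drop = FALSE]
  }
  rownames(out) <- NULL
  out
}

# Combinations of gender and age_cat observed fewer than 2 times
count_k(toy, vars = c("gender", "age_cat"), threshold = 2)
```

In this sketch rows are returned in order of first appearance; the ordering used by the actual function may differ.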