Check that a dataset is consistent with its corresponding data dictionary

Includes the following checks:

variables marked 'withheld' in dictionary contain only missing values
all variables in dataset defined in dictionary
all variables defined in dictionary present in dataset
variables of type 'Logical' are all TRUE/FALSE (or 1/0)
variables of type 'Numeric' are valid numbers
variables of type 'Date' are valid dates
variables of type 'Time' are valid times
variables of type 'Datetime' are valid date-times
variables of type 'Coded list' contain only allowed options
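
The checks listed above can be illustrated with a minimal base-R sketch. This is not the package's implementation: the toy data frame, the column names, the vector of dictionary variable names and the allowed options are all invented for illustration, and a real dictionary carries more information than a plain vector of names.

```r
# Toy dataset; all column names and values are invented for illustration
dat <- data.frame(
  id      = c("A1", "A2", "A3"),
  onset   = c("2021-01-05", "2021-02-30", NA),  # 30 February does not exist
  sex     = c("Male", "Female", "Unknwn"),      # third value is a typo
  village = c(NA, NA, NA),                      # treated as withheld here
  stringsAsFactors = FALSE
)

# Variable names that a hypothetical dictionary defines
vars_dict <- c("id", "onset", "sex", "village", "age")

# Variables in the dataset but not the dictionary, and the reverse
extra_in_data   <- setdiff(names(dat), vars_dict)
missing_in_data <- setdiff(vars_dict, names(dat))

# Withheld variable: every value must be missing
withheld_ok <- all(is.na(dat$village))

# Date variable: flag non-missing values that fail to parse
parsed   <- as.Date(dat$onset, format = "%Y-%m-%d")
bad_date <- !is.na(dat$onset) & is.na(parsed)

# Coded-list variable: flag non-missing values outside the allowed options
allowed_sex <- c("Male", "Female")
bad_coded   <- !is.na(dat$sex) & !dat$sex %in% allowed_sex

# Collapse the individual checks into a single TRUE/FALSE
all_ok <- length(extra_in_data) == 0 &&
  length(missing_in_data) == 0 &&
  withheld_ok && !any(bad_date) && !any(bad_coded)
```

Missing values are deliberately not counted as invalid in this sketch; whether valid_data treats them the same way is not stated here.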

Usage

valid_data(
  data,
  dict,
  format_date = "%Y-%m-%d",
  format_time = "%H:%M:%S",
  format_datetime = NULL,
  format_coded = "label",
  vals_withheld = NA,
  verbose = TRUE
)

Arguments

data: A data frame reflecting a dataset to be shared
dict: A data frame reflecting the corresponding data dictionary
format_date: Expected format for date variables. Defaults to "%Y-%m-%d".
format_time: Expected format for time variables. Defaults to "%H:%M:%S".
format_datetime: Expected format for date-time variables. Defaults to NULL, in which case the defaults of lubridate::as_datetime are used.
format_coded: Expected format for coded-list variables, either "value" or "label". Defaults to "label".
vals_withheld: Expected value(s) in columns that are withheld. Defaults to NA.
verbose: Logical indicating whether to give a warning describing the checks that have failed. Defaults to TRUE.
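
The defaults for format_date and format_time look like strptime-style format strings. Assuming they follow those conventions, the base-R sketch below shows why a dataset whose dates are written day/month/year would need a non-default format_date. The example values are invented, and the final call is left as a comment because it needs the package and a matching dictionary.

```r
# Dates written day/month/year (invented values)
x <- c("05/01/2021", "17/03/2021")

# Under the default "%Y-%m-%d" nothing parses, so every value looks invalid
default_parse <- as.Date(x, format = "%Y-%m-%d")

# With a matching format the same strings parse cleanly
custom_parse <- as.Date(x, format = "%d/%m/%Y")

# Times can be tested the same way; an hour of 25 fails to parse
time_parse <- strptime(c("08:30:00", "25:00:00"),
                       format = "%H:%M:%S", tz = "UTC")
is.na(time_parse)

# Corresponding call, using the documented argument:
# valid_data(dat, dict, format_date = "%d/%m/%Y")
```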

Value

TRUE if all checks pass, FALSE if any checks fail

Examples

# read example dataset
path_data <- system.file("extdata", package = "datadict")
dat <- readxl::read_xlsx(file.path(path_data, "linelist_cleaned.xlsx"))

# generate data dictionary template from dataset
dict <- dict_from_data(dat, factor_values = "string")

# dictionary column 'indirect_identifier' must be manually specified (yes/no)
dict$indirect_identifier <- "no"

# check for validity
valid_data(dat, dict)
#> [1] TRUE