Check that a dataset is consistent with its corresponding data dictionary
Source: R/valid_data.R
Includes the following checks:
variables marked 'withheld' in dictionary contain only missing values
all variables in the dataset are defined in the dictionary
all variables defined in the dictionary are present in the dataset
variables of type 'Logical' are all TRUE/FALSE (or 1/0)
variables of type 'Numeric' are valid numbers
variables of type 'Date' are valid dates
variables of type 'Time' are valid times
variables of type 'Datetime' are valid date-times
variables of type 'Coded list' contain only allowed options
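To give a feel for the kind of rule each check enforces, here is a minimal base R sketch of two of them (date validity and coded-list membership). The helpers `all_valid_dates()` and `all_allowed()` are hypothetical names made up for this illustration; they are not part of the package and do not necessarily reflect how valid_data() is implemented.

```r
# Hypothetical helper (not part of the package): TRUE if every non-missing
# value can be parsed as a date under the given format
all_valid_dates <- function(x, format = "%Y-%m-%d") {
  x <- as.character(x)
  parsed <- as.Date(x, format = format)
  # values that were already missing are ignored; parse failures become NA
  all(!is.na(parsed[!is.na(x)]))
}

# Hypothetical helper (not part of the package): TRUE if every non-missing
# value is one of the allowed options
all_allowed <- function(x, options) {
  all(x[!is.na(x)] %in% options)
}

all_valid_dates(c("2021-03-01", NA, "2021-03-15"))
#> [1] TRUE
all_valid_dates(c("2021-03-01", "15/03/2021"))
#> [1] FALSE
all_allowed(c("Yes", "No", NA), options = c("Yes", "No"))
#> [1] TRUE
all_allowed(c("Yes", "Maybe"), options = c("Yes", "No"))
#> [1] FALSE
```

In this sketch missing values never cause a failure. Whether valid_data() treats missing values the same way is an assumption.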
Usage
valid_data(
data,
dict,
format_date = "%Y-%m-%d",
format_time = "%H:%M:%S",
format_datetime = NULL,
format_coded = "label",
verbose = TRUE
)
Arguments
- data
A data frame reflecting a dataset to be shared
- dict
A data frame reflecting the corresponding data dictionary
- format_date
Expected format for date variables. Defaults to "%Y-%m-%d".
- format_time
Expected format for time variables. Defaults to "%H:%M:%S".
- format_datetime
Expected format for datetime variables. Defaults to NULL, which uses the defaults in lubridate::as_datetime.
- format_coded
Expected format for coded-list variables, either "value" or "label". Defaults to "label".
- verbose
Logical indicating whether to give warnings describing the checks that have failed. Defaults to TRUE.
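The default format strings above use strptime-style conversion specifications (see ?strptime). As a quick base R illustration of what those specifications mean, independent of the package, the snippet below formats a single made-up timestamp in several ways.

```r
# An arbitrary example timestamp, fixed to UTC so the output is reproducible
x <- as.POSIXct("2021-03-15 14:30:00", tz = "UTC")

format(x, "%Y-%m-%d")  # the default for format_date
#> [1] "2021-03-15"
format(x, "%H:%M:%S")  # the default for format_time
#> [1] "14:30:00"
format(x, "%d/%m/%Y")  # an alternative day/month/year date format
#> [1] "15/03/2021"
```

A dataset that stores dates as day/month/year text would presumably need format_date = "%d/%m/%Y" for the date check to pass.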
Examples
# read example dataset
path_data <- system.file("extdata", package = "datadict")
dat <- readxl::read_xlsx(file.path(path_data, "linelist_cleaned.xlsx"))
# generate data dictionary template from dataset
dict <- dict_from_data(dat, factor_values = "string")
# dictionary column 'indirect_identifier' must be manually specified (yes/no)
dict$indirect_identifier <- "no"
# check for validity
valid_data(dat, dict)
#> [1] TRUE