
Includes the following checks:

  • variables marked 'withheld' in dictionary contain only missing values

  • all variables in the dataset are defined in the dictionary

  • all variables defined in the dictionary are present in the dataset

  • variables of type 'Logical' are all TRUE/FALSE (or 1/0)

  • variables of type 'Numeric' are valid numbers

  • variables of type 'Date' are valid dates

  • variables of type 'Time' are valid times

  • variables of type 'Datetime' are valid date-times

  • variables of type 'Coded list' contain only allowed options
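
To make the checks above more concrete, here is a minimal base-R sketch of what a few of them amount to. It is not the package's implementation: the toy dataset, the dictionary column names (`variable_name`, `status`) and the `allowed_sex` vector are all hypothetical and chosen only for illustration.

```r
# Hypothetical toy dataset (not from the package)
dat <- data.frame(
  id    = c("A1", "A2", "A3"),
  sex   = c("Male", "Female", "Unknown"),
  name  = c(NA, NA, NA),
  extra = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical dictionary; the column names here are assumptions
dict <- data.frame(
  variable_name = c("id", "sex", "name", "age"),
  status        = c("shared", "shared", "withheld", "shared"),
  stringsAsFactors = FALSE
)

# Variables in the dataset that are not defined in the dictionary
not_in_dict <- setdiff(names(dat), dict$variable_name)

# Variables defined in the dictionary that are absent from the dataset
not_in_data <- setdiff(dict$variable_name, names(dat))

# Withheld variables should contain only missing values
withheld <- intersect(dict$variable_name[dict$status == "withheld"], names(dat))
withheld_ok <- vapply(withheld, function(v) all(is.na(dat[[v]])), logical(1))

# Coded-list variables should contain only allowed options
# (allowed_sex is a hypothetical set of options)
allowed_sex <- c("Male", "Female")
bad_sex <- setdiff(stats::na.omit(dat$sex), allowed_sex)
```

In this sketch `not_in_dict`, `not_in_data` and `bad_sex` each hold the offending names or values, so a non-empty result corresponds to a failed check.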

Usage

valid_data(
  data,
  dict,
  format_date = "%Y-%m-%d",
  format_time = "%H:%M:%S",
  format_datetime = NULL,
  format_coded = "label",
  verbose = TRUE
)

Arguments

data

A data frame reflecting a dataset to be shared

dict

A data frame reflecting the corresponding data dictionary

format_date

Expected format for date variables. Defaults to "%Y-%m-%d".

format_time

Expected format for time variables. Defaults to "%H:%M:%S".

format_datetime

Expected format for datetime variables. Defaults to NULL, which uses the defaults of lubridate::as_datetime.

format_coded

Expected format for coded-list variables, either "value" or "label". Defaults to "label".

verbose

Logical indicating whether to give a warning describing the checks that have failed. Defaults to TRUE.
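
The default values of format_date and format_time are strptime-style format strings. The following base-R sketch shows how such strings separate parsable from unparsable values. It only illustrates the format syntax, using made-up input values, and should not be read as the function's internal logic. Datetime parsing via lubridate is not covered here.

```r
# Default format strings from the Usage section
format_date <- "%Y-%m-%d"
format_time <- "%H:%M:%S"

# Made-up date values: one matching the format, one not, one missing
dates <- c("2021-03-15", "15/03/2021", NA)
parsed_dates <- as.Date(dates, format = format_date)

# Values that are present in the input but fail to parse
invalid_dates <- dates[!is.na(dates) & is.na(parsed_dates)]

# Made-up time values: hour 25 is not a valid time
times <- c("14:30:00", "25:00:00")
parsed_times <- strptime(times, format = format_time, tz = "UTC")
invalid_times <- times[!is.na(times) & is.na(parsed_times)]
```

Missing values are left out of `invalid_dates` on purpose, so that only entries that are present but malformed are reported.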

Value

TRUE if all checks pass, FALSE if any checks fail

Examples

# read example dataset
path_data <- system.file("extdata", package = "datadict")
dat <- readxl::read_xlsx(file.path(path_data, "linelist_cleaned.xlsx"))

# generate data dictionary template from dataset
dict <- dict_from_data(dat, factor_values = "string")

# dictionary column 'indirect_identifier' must be manually specified (yes/no)
dict$indirect_identifier <- "no"

# check for validity
valid_data(dat, dict)
#> [1] TRUE