Generate a data dictionary template from a dataset

The data type of each field is inferred from the class of the corresponding column in the input dataset:

Column class in data	Dictionary data type
Date	Date
POSIX	Datetime
logical	Logical
integer	Numeric
numeric	Numeric
factor	Coded list
character	Coded list or Free text (see argument `factor_threshold`)
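
The mapping above can be sketched in a few lines of R. This is only an illustration of the rule, not the package's actual implementation: the helper `infer_dict_type()` is hypothetical, and ignoring missing values when counting unique values is an assumption.

```r
# Hypothetical helper, not part of datadict: map one column to a dictionary type
infer_dict_type <- function(col, factor_threshold = 10) {
  if (inherits(col, "Date")) return("Date")
  if (inherits(col, "POSIXt")) return("Datetime")
  if (is.logical(col)) return("Logical")
  if (is.factor(col)) return("Coded list")
  if (is.numeric(col)) return("Numeric")  # covers both integer and numeric
  # Remaining (character) columns: few unique values -> Coded list, else Free text
  # Assumption: missing values are not counted as a unique value
  n_unique <- length(unique(col[!is.na(col)]))
  if (n_unique <= factor_threshold) "Coded list" else "Free text"
}

# Small made-up dataset for illustration
df <- data.frame(
  onset = as.Date(c("2024-01-01", "2024-01-02", "2024-01-03")),
  admitted = c(TRUE, FALSE, NA),
  age = c(34L, 51L, 8L),
  outcome = c("death", "recovered", "death"),
  stringsAsFactors = FALSE
)

# Apply the rule to every column; the result is a named character vector
types <- vapply(df, infer_dict_type, character(1), factor_threshold = 2)
print(types)
```

Here `outcome` has two unique values, so with `factor_threshold = 2` it is treated as a Coded list; lowering the threshold to 1 would make it Free text.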

Usage

dict_from_data(x, factor_threshold = 10, factor_values = c("int", "string"))

Arguments

x

A data frame reflecting the dataset from which to generate a data dictionary template

factor_threshold

An integer giving the maximum number of unique values a dataset column may contain and still be classified as a factor-type (i.e. Coded list) variable. Columns that are not recognized as another data type (such as Numeric, Date, etc.) and have more than `factor_threshold` unique values are classified as "Free text". Defaults to 10.

factor_values

Whether values of factor-type (i.e. Coded list) variables should be generated as integers ("int") or strings ("string"). Defaults to "int". For example:

 label    | int | string
-------------------------
 Yes      | 0   | yes
 No       | 1   | no
 Not sure | 2   | not_sure
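
The two coding schemes in the table can be reproduced with a short sketch. The helper `code_values()` is hypothetical and the string-cleaning rule (lower-case, non-alphanumeric runs replaced by an underscore) is an assumption inferred from the table; the package may derive its values differently.

```r
# Hypothetical helper, not part of datadict: derive coded values from labels
code_values <- function(labels, factor_values = c("int", "string")) {
  factor_values <- match.arg(factor_values)  # defaults to "int"
  if (factor_values == "int") {
    # Zero-based integer codes, in the order the labels are given
    return(seq_along(labels) - 1L)
  }
  # Assumed rule: lower-case, then replace runs of other characters with "_"
  gsub("[^a-z0-9]+", "_", tolower(labels))
}

labels <- c("Yes", "No", "Not sure")
int_codes <- code_values(labels)
string_codes <- code_values(labels, "string")
print(data.frame(label = labels, int = int_codes, string = string_codes))
```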

Value

A tibble-style data frame representing a data dictionary template formatted to the OCA data sharing standard

Examples

# read example dataset
path_data <- system.file("extdata", package = "datadict")
dat <- readxl::read_xlsx(file.path(path_data, "linelist_cleaned.xlsx"))

# generate data dictionary template from dataset
dict_from_data(dat, factor_values = "string")
#> # A tibble: 31 × 7
#>    variable_name     short_label type  choices origin status indirect_identifier
#>    <chr>             <chr>       <chr> <chr>   <chr>  <chr>  <chr>              
#>  1 case_id           NA          Free… NA      origi… shared NA                 
#>  2 generation        NA          Nume… NA      origi… shared NA                 
#>  3 cohort_fu         NA          Logi… NA      origi… shared NA                 
#>  4 date_infection    NA          Date… NA      origi… shared NA                 
#>  5 date_onset        NA          Date… NA      origi… shared NA                 
#>  6 date_hospitalisa… NA          Date… NA      origi… shared NA                 
#>  7 date_outcome      NA          Date… NA      origi… shared NA                 
#>  8 outcome           NA          Code… death,… origi… shared NA                 
#>  9 gender            NA          Code… f, f |… origi… shared NA                 
#> 10 age               NA          Nume… NA      origi… shared NA                 
#> # ℹ 21 more rows