Inferred data types for each field are based on the class of each column in the input dataset:
Column class in data | Dictionary data type |
Date | Date |
POSIX | Datetime |
logical | Logical |
integer | Numeric |
numeric | Numeric |
factor | Coded list |
character | Coded list or Free text (see argument factor_threshold) |
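The mapping in the table can be sketched in base R as follows. This is an illustration only, not the package's implementation: `infer_type()` is a hypothetical helper, and the order of the checks, the use of `inherits(col, "POSIXt")` to detect datetimes, and the exclusion of `NA` when counting unique values are all assumptions.

```r
# Hypothetical helper (not part of datadict): maps a column's class to a
# dictionary data type, following the table above.
infer_type <- function(col, factor_threshold = 10) {
  if (inherits(col, "Date")) return("Date")
  if (inherits(col, "POSIXt")) return("Datetime")  # covers POSIXct and POSIXlt
  if (is.logical(col)) return("Logical")
  if (is.factor(col)) return("Coded list")
  if (is.numeric(col)) return("Numeric")           # integer and double
  if (is.character(col)) {
    # Assumption: NA is not counted as a distinct value
    n_unique <- length(unique(col[!is.na(col)]))
    return(if (n_unique <= factor_threshold) "Coded list" else "Free text")
  }
  NA_character_
}

# Small made-up dataset with one column per class
df <- data.frame(
  id       = sprintf("P%02d", 1:12),   # 12 unique strings, above the threshold
  sex      = rep(c("f", "m"), 6),      # 2 unique strings, below the threshold
  onset    = as.Date("2024-01-01") + 0:11,
  admitted = as.POSIXct("2024-01-01 08:00:00", tz = "UTC") + 3600 * (0:11),
  age      = 20:31,
  died     = rep(c(TRUE, FALSE), 6),
  stringsAsFactors = FALSE
)

types <- vapply(df, infer_type, character(1))
print(types)
```

With the default threshold of 10, `id` is classified as Free text and `sex` as a Coded list, since only the number of unique values distinguishes the two character columns.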
Usage
dict_from_data(x, factor_threshold = 10, factor_values = c("int", "string"))
Arguments
- x
A data frame reflecting the dataset from which to generate a data dictionary template
- factor_threshold
An integer representing the maximum number of unique values within a dataset column for that column to be classified as a factor-type (i.e. coded list) variable. Columns that are not recognized as other data types (such as Numeric, Date, etc.) and have more than factor_threshold unique values will be specified as type "Free text". Defaults to 10.
- factor_values
Should values of factor-type (i.e. Coded list) variables be generated as integers ("int") or strings ("string")? Defaults to "int".
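The effect of factor_values can be pictured with a small base R sketch. `make_choices()` is a hypothetical helper, not a datadict function, and the `value, label | value, label` layout is an assumption inferred from the truncated `choices` column in the example output; the integer numbering and alphabetical ordering are likewise assumptions.

```r
# Hypothetical helper (not part of datadict): builds a "choices" string for a
# coded-list variable, using either integer or string values.
make_choices <- function(col, factor_values = c("int", "string")) {
  factor_values <- match.arg(factor_values)
  # Assumption: labels are the sorted unique non-missing values
  labels <- sort(unique(as.character(col[!is.na(col)])))
  values <- if (factor_values == "int") seq_along(labels) else labels
  paste(paste(values, labels, sep = ", "), collapse = " | ")
}

gender <- c("f", "m", "f", NA, "m")

choices_int <- make_choices(gender, "int")
choices_str <- make_choices(gender, "string")

print(choices_int)
print(choices_str)
```

Under these assumptions, "int" pairs each label with a sequential integer code, while "string" reuses the label itself as the value.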
Value
A tibble-style data frame representing a data dictionary template formatted to the OCA data sharing standard
Examples
# read example dataset
path_data <- system.file("extdata", package = "datadict")
dat <- readxl::read_xlsx(file.path(path_data, "linelist_cleaned.xlsx"))
# generate data dictionary template from dataset
dict_from_data(dat, factor_values = "string")
#> # A tibble: 31 × 7
#> variable_name short_label type choices origin status indirect_identifier
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 case_id NA Free… NA origi… shared NA
#> 2 generation NA Nume… NA origi… shared NA
#> 3 cohort_fu NA Logi… NA origi… shared NA
#> 4 date_infection NA Date… NA origi… shared NA
#> 5 date_onset NA Date… NA origi… shared NA
#> 6 date_hospitalisa… NA Date… NA origi… shared NA
#> 7 date_outcome NA Date… NA origi… shared NA
#> 8 outcome NA Code… death,… origi… shared NA
#> 9 gender NA Code… f, f |… origi… shared NA
#> 10 age NA Nume… NA origi… shared NA
#> # ℹ 21 more rows