Chapter 2 Data dictionaries
2.1 Objectives
Learn to:
- prepare an OCA-style data dictionary, starting from either a raw dataset or a ODK/Kobo dictionary
- compare datasets and corresponding data dictionaries to ensure completeness and validity
2.2 Background
Documenting the variables in a dataset is a crucial part of data management and
ensures that a dataset is interpretable by researchers who were not directly
involved in study design or data collection. Whereas data files will generally
contain codenames for variables (e.g. patagegrp
for “Patient’s age group”),
and sometimes also coded data values, the data dictionary describes each
variable and the set of possible values in plain language that is more broadly
interpretable.
Data collection platforms such as Kobo, REDCap, and OpenClinica have their own specialized data dictionary format. To facilitate dataset and data dictionary validation, the OCA Data Sharing Platform uses its own standardized dictionary format, defined in the next section.
2.3 Dictionary format
Every dataset shared or archived on the OCA platform must have an accompanying data dictionary that includes, at a minimum, the fields described below. Each variable in the dataset (i.e. each column) must have a dictionary entry for each of the required fields.
Required field | Description | Example entry |
---|---|---|
variable_name | Variable name (i.e. exact column name within the corresponding dataset) | “sample_type” |
short_label | Short phrase describing the variable in words | “Type of laboratory sample collected” |
type | Variable type (options: “Numeric”, “Date”, “Time”, “Datetime”, “Coded list”, or “Free text”) | “Coded list” |
choices | The list of options (pairs of codes and labels) corresponding to a variable of type “Coded list”. | “1, Blood | 2, Nasal swab | 3, Throat swab | 4, Other” |
origin | Was the variable a part of the original data collection instrument (option “original”), or was it later derived (option “derived”) | “original” |
status | Is the variable shared (option “shared”), or withheld to reduce disclosure risk (option “withheld”) ? | “shared” |
2.4 The datadict
package
The R package datadict
contains a
variety of functions to aid in the preparation of an OCA-style data dictionary:
dict_from_data()
: prepare a dictionary template from a raw datasetdict_from_odk()
: prepare a dictionary template from an ODK/Kobo dictionarydict_from_redcap()
: prepare a dictionary template from a REDCap dictionaryvalid_dict()
: verify that a dictionary is consistent with the OCA formatvalid_data()
: verify that a dataset corresponds to its associated data dictionary
Note that dictionary templates produces by the dict_from_
functions may still
require further processing by the user (e.g. with additional R scripts, or by
hand in Excel).
2.5 Exercises
This repository includes an example dataset based on a mortality survey, and
corresponding ODK data dictionary (see section 1.1 Setup for data
download links). Load the dataset and data dictionary using the example code
below (note you may need to modify the file path, depending on how you setup
your project), and work through the following exercises using functions from the
datadict
package where possible.
library(rio)
library(here)
# import dataset
<- rio::import(here("data/mortality_survey_simple_data.xlsx"), setclass = "tbl")
dat
# import ODK dictionary (note the main dictionary and multiple-choice options are in separate sheets)
<- rio::import(here("data/mortality_survey_simple_kobo.xlsx"), sheet = "survey", setclass = "tbl")
odk_survey <- rio::import(here("data/mortality_survey_simple_kobo.xlsx"), sheet = "choices", setclass = "tbl") odk_choices
2.5.1 Exercise 1
With the datadict
package we can prepare a dictionary template either from
the raw dataset (using function dict_from_data()
) or from the ODK dictionary
(using function dict_from_odk()
). Try both approaches separately and compare
the resulting dictionary templates. What are some differences? Can you guess why
these differences are occurring?
See proposed answer
<- datadict::dict_from_data(dat)
dict_dat <- datadict::dict_from_odk(odk_survey, odk_choices)
dict_odk
if (!dir.exists(here("output"))) dir.create(here("output"))
::export(dict_dat, here("output/ex1_dict_dat.xlsx"))
rio::export(dict_odk, here("output/ex1_dict_odk.xlsx")) rio
- Function
dict_from_data()
doesn’t populate columnshort_label
, whereasdict_from_odk()
does - Variable types in column
type
are different. Functiondict_from_data()
guesses only ‘Free text’ and ‘Coded list’, whereasdict_from_odk()
also returns ‘Date’ and ‘Numeric’
2.5.2 Exercise 2
When producing a dictionary template using dict_from_data()
, the variable type
is determined by the class of the original column (e.g. character, numeric,
Date). The column classes that are read in by e.g. rio::import()
might not
always correspond to the variable types that we have in mind (e.g. numbers,
dates, and times are sometimes read in as class “character”). Where necessary,
transform the columns of dat
using functions like as.numeric()
or
as.Date()
and then produce another dictionary template using
dict_from_data()
. What are the differences that remain between this dictionary
template and the template derived from the ODK dictionary?
See proposed answer
<- c(
vars_date "date",
"date_arrived",
"date_departed",
"date_born",
"date_died"
)
<- c(
vars_numeric "cluster",
"age_months",
"age_years",
"muac"
)
for (j in vars_date) {
<- as.Date(dat[[j]])
dat[[j]]
}
for (j in vars_numeric) {
<- as.numeric(dat[[j]])
dat[[j]]
}
<- datadict::dict_from_data(dat)
dict_dat ::export(dict_dat, here("output/ex2_dict_dat.xlsx")) rio
- Function
dict_from_data()
classifies some variables as “Coded list” thatdict_from_odk()
classifies as “Free text” (source_water_other
,ilness_other
,cause_death_other
) - Various differences in column
choices
(order varies, some entries missing withdict_from_data()
, labels same but values differ) - Function
dict_from_odk()
retains extra columns from ODK data dictionary (e.g.constrain
,relevant
, etc.)
2.5.3 Exercise 3
Examine the options for the ‘Coded list’ type variables in the dictionary
produced in exercise 2, and compare these to the corresponding options produced
by dict_from_odk()
. Why does the dictionary produced by dict_from_data()
have fewer ‘Coded list’ options for some variables? Does this matter in terms of
data sharing?
Hint: check out the function datadict::coded_options()
to extract a long-form
table of Coded list variables and corresponding options from a dictionary.
See proposed answer
<- datadict::coded_options(dict_dat)
opts_dat <- datadict::coded_options(dict_odk)
opts_odk
::export(opts_dat, here("output/ex3_opts_dat.xlsx"))
rio::export(opts_odk, here("output/ex3_opts_odk.xlsx")) rio
- Function
dict_from_data()
only includes options that appear in the dataset, whereasdict_from_odk()
includes all options from original ODK dictionary
2.5.4 Exercise 4
Use the function valid_dict()
to check that the dictionary you produced in
Exercise 2 complies with the OCA standard. Assuming it does, edit the dictionary
so that it fails at least two of the checks implemented by valid_dict()
.
See proposed answer
::valid_dict(dict_dat) datadict
## [1] TRUE
<- dict_dat
dict_dat_fail1 $variable_name[4] <- NA_character_
dict_dat_fail1
::valid_dict(dict_dat_fail1) datadict
## Warning: - Missing values in column(s): "variable_name"
## [1] FALSE
<- dict_dat
dict_dat_fail2 $type <- "Number"
dict_dat_fail2
::valid_dict(dict_dat_fail2) datadict
## Warning: - Column `type` has nonvalid value(s): "number"
## [1] FALSE
2.5.5 Exercise 5
Use the function valid_data()
to check for consistency between the dataset and
dictionary produced in Exercise 2. Assuming all checks pass, edit the dataset so
that it fails at least two of the checks implemented by valid_data()
.
See proposed answer
::valid_data(dat, dict_dat) datadict
## [1] TRUE
# add nonvalid date
$date <- as.character(dat$date)
dat$date[4] <- "July ?, 2021"
dat
# remove column present in dictionary
$oedema <- NULL
dat
::valid_data(dat, dict_dat) datadict
## Warning: - Columns defined in `dict` but not present in `data`: "oedema"
## - Variables of type 'Date' contain nonvalid values: "date"
## [1] FALSE