Chapter 2 Data dictionaries

2.1 Objectives

Learn to:

  • prepare an OCA-style data dictionary, starting from either a raw dataset or a ODK/Kobo dictionary
  • compare datasets and corresponding data dictionaries to ensure completeness and validity

2.2 Background

Documenting the variables in a dataset is a crucial part of data management and ensures that a dataset is interpretable by researchers who were not directly involved in study design or data collection. Whereas data files will generally contain codenames for variables (e.g. patagegrp for “Patient’s age group”), and sometimes also coded data values, the data dictionary describes each variable and the set of possible values in plain language that is more broadly interpretable.

Data collection platforms such as Kobo, REDCap, and OpenClinica have their own specialized data dictionary format. To facilitate dataset and data dictionary validation, the OCA Data Sharing Platform uses its own standardized dictionary format, defined in the next section.

2.3 Dictionary format

Every dataset shared or archived on the OCA platform must have an accompanying data dictionary that includes, at a minimum, the fields described below. Each variable in the dataset (i.e. each column) must have a dictionary entry for each of the required fields.

Required field Description Example entry
variable_name Variable name (i.e. exact column name within the corresponding dataset) “sample_type”
short_label Short phrase describing the variable in words “Type of laboratory sample collected”
type Variable type (options: “Numeric”, “Date”, “Time”, “Datetime”, “Coded list”, or “Free text”) “Coded list”
choices The list of options (pairs of codes and labels) corresponding to a variable of type “Coded list”. “1, Blood | 2, Nasal swab | 3, Throat swab | 4, Other”
origin Was the variable a part of the original data collection instrument (option “original”), or was it later derived (option “derived”) “original”
status Is the variable shared (option “shared”), or withheld to reduce disclosure risk (option “withheld”) ? “shared”

2.4 The datadict package

The R package datadict contains a variety of functions to aid in the preparation of an OCA-style data dictionary:

  • dict_from_data(): prepare a dictionary template from a raw dataset
  • dict_from_odk(): prepare a dictionary template from an ODK/Kobo dictionary
  • dict_from_redcap(): prepare a dictionary template from a REDCap dictionary
  • valid_dict(): verify that a dictionary is consistent with the OCA format
  • valid_data(): verify that a dataset corresponds to its associated data dictionary

Note that dictionary templates produces by the dict_from_ functions may still require further processing by the user (e.g. with additional R scripts, or by hand in Excel).

2.5 Exercises

This repository includes an example dataset based on a mortality survey, and corresponding ODK data dictionary (see section 1.1 Setup for data download links). Load the dataset and data dictionary using the example code below (note you may need to modify the file path, depending on how you setup your project), and work through the following exercises using functions from the datadict package where possible.

library(rio)
library(here)

# import dataset
dat <- rio::import(here("data/mortality_survey_simple_data.xlsx"), setclass = "tbl")

# import ODK dictionary (note the main dictionary and multiple-choice options are in separate sheets)
odk_survey <- rio::import(here("data/mortality_survey_simple_kobo.xlsx"), sheet = "survey", setclass = "tbl")
odk_choices <- rio::import(here("data/mortality_survey_simple_kobo.xlsx"), sheet = "choices", setclass = "tbl")

2.5.1 Exercise 1

With the datadict package we can prepare a dictionary template either from the raw dataset (using function dict_from_data()) or from the ODK dictionary (using function dict_from_odk()). Try both approaches separately and compare the resulting dictionary templates. What are some differences? Can you guess why these differences are occurring?

See proposed answer
dict_dat <- datadict::dict_from_data(dat)
dict_odk <- datadict::dict_from_odk(odk_survey, odk_choices)

if (!dir.exists(here("output"))) dir.create(here("output"))
rio::export(dict_dat, here("output/ex1_dict_dat.xlsx"))
rio::export(dict_odk, here("output/ex1_dict_odk.xlsx"))
  • Function dict_from_data() doesn’t populate column short_label, whereas dict_from_odk() does
  • Variable types in column type are different. Function dict_from_data() guesses only ‘Free text’ and ‘Coded list’, whereas dict_from_odk() also returns ‘Date’ and ‘Numeric’

2.5.2 Exercise 2

When producing a dictionary template using dict_from_data(), the variable type is determined by the class of the original column (e.g. character, numeric, Date). The column classes that are read in by e.g. rio::import() might not always correspond to the variable types that we have in mind (e.g. numbers, dates, and times are sometimes read in as class “character”). Where necessary, transform the columns of dat using functions like as.numeric() or as.Date() and then produce another dictionary template using dict_from_data(). What are the differences that remain between this dictionary template and the template derived from the ODK dictionary?

See proposed answer
vars_date <- c(
  "date",
  "date_arrived",
  "date_departed",
  "date_born",
  "date_died"
)

vars_numeric <- c(
  "cluster",
  "age_months",
  "age_years",
  "muac"
)

for (j in vars_date) {
  dat[[j]] <- as.Date(dat[[j]])
}

for (j in vars_numeric) {
  dat[[j]] <- as.numeric(dat[[j]])
}

dict_dat <- datadict::dict_from_data(dat)
rio::export(dict_dat, here("output/ex2_dict_dat.xlsx"))
  • Function dict_from_data() classifies some variables as “Coded list” that dict_from_odk() classifies as “Free text” (source_water_other, ilness_other, cause_death_other)
  • Various differences in column choices (order varies, some entries missing with dict_from_data(), labels same but values differ)
  • Function dict_from_odk() retains extra columns from ODK data dictionary (e.g. constrain, relevant, etc.)

2.5.3 Exercise 3

Examine the options for the ‘Coded list’ type variables in the dictionary produced in exercise 2, and compare these to the corresponding options produced by dict_from_odk(). Why does the dictionary produced by dict_from_data() have fewer ‘Coded list’ options for some variables? Does this matter in terms of data sharing?

Hint: check out the function datadict::coded_options() to extract a long-form table of Coded list variables and corresponding options from a dictionary.

See proposed answer
opts_dat <- datadict::coded_options(dict_dat)
opts_odk <- datadict::coded_options(dict_odk)

rio::export(opts_dat, here("output/ex3_opts_dat.xlsx"))
rio::export(opts_odk, here("output/ex3_opts_odk.xlsx"))
  • Function dict_from_data() only includes options that appear in the dataset, whereas dict_from_odk() includes all options from original ODK dictionary

2.5.4 Exercise 4

Use the function valid_dict() to check that the dictionary you produced in Exercise 2 complies with the OCA standard. Assuming it does, edit the dictionary so that it fails at least two of the checks implemented by valid_dict().

See proposed answer
datadict::valid_dict(dict_dat)
## [1] TRUE
dict_dat_fail1 <- dict_dat
dict_dat_fail1$variable_name[4] <- NA_character_

datadict::valid_dict(dict_dat_fail1)
## Warning: - Missing values in column(s): "variable_name"
## [1] FALSE
dict_dat_fail2 <- dict_dat
dict_dat_fail2$type <- "Number"

datadict::valid_dict(dict_dat_fail2)
## Warning: - Column `type` has nonvalid value(s): "number"
## [1] FALSE

2.5.5 Exercise 5

Use the function valid_data() to check for consistency between the dataset and dictionary produced in Exercise 2. Assuming all checks pass, edit the dataset so that it fails at least two of the checks implemented by valid_data().

See proposed answer
datadict::valid_data(dat, dict_dat)
## [1] TRUE
# add nonvalid date
dat$date <- as.character(dat$date)
dat$date[4] <- "July ?, 2021"

# remove column present in dictionary
dat$oedema <- NULL

datadict::valid_data(dat, dict_dat)
## Warning: - Columns defined in `dict` but not present in `data`: "oedema"
## - Variables of type 'Date' contain nonvalid values: "date"
## [1] FALSE