Match a data.frame with raw, potentially messy hierarchical data (e.g. province, county, township) against a reference dataset, using a dictionary of manually-specified matches.

Usage

hmatch_manual(
  raw,
  ref,
  man,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  code_col,
  type = "left",
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)

Arguments

raw

data frame containing hierarchical columns with raw data

ref

data frame containing hierarchical columns with reference data

man

data.frame of manually-specified matches, relating a given set of hierarchical values to the code within ref to which those values correspond

pattern

regex pattern to match the hierarchical columns in raw and man (see also specifying_columns)

pattern_ref

regex pattern to match the hierarchical columns in ref. Defaults to pattern, so it only needs to be specified if the hierarchical columns have different names in raw and ref.

by

vector giving the names of the hierarchical columns in raw and man

by_ref

vector giving the names of the hierarchical columns in ref. Defaults to by, so it only needs to be specified if the hierarchical columns have different names in raw and ref.

code_col

name of the code column containing codes for matching ref and man

type

type of join ("left", "inner", or "anti"). Defaults to "left". See join_types. Note that this function does not allow 'resolve joins', unlike most other hmatch_ functions.

ref_prefix

prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".

std_fn

function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.

...

additional arguments passed to std_fn()
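
To illustrate the kind of function that could be supplied as std_fn, here is a minimal base-R sketch. The name simple_std is hypothetical and not part of the package, and it is not claimed to reproduce what string_std does. It only assumes that std_fn takes a character vector and returns a standardized character vector of the same length.

```r
# Hypothetical standardization function (an assumption for illustration,
# not the package's string_std)
simple_std <- function(x) {
  x <- tolower(x)                   # ignore case
  x <- gsub("[^a-z0-9]+", "_", x)   # collapse runs of non-alphanumerics to "_"
  x <- gsub("^_+|_+$", "", x)       # trim leading and trailing underscores
  x
}

simple_std(c("Bergen, N.J.", " New  Jersey "))
#> [1] "bergen_n_j" "new_jersey"
```

Under that assumption, such a function would be passed as std_fn = simple_std, and std_fn = NULL would compare the strings exactly as given.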

Value

a data frame obtained by matching the hierarchical columns in raw and ref based on sets of matches specified in man, using the join type specified by argument type (see join_types for more details)
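
The idea behind a manual match can be sketched in base R as two joins: raw to man on the hierarchical columns, then the result to ref on the code column. This is a simplified illustration, not the package's implementation. The toy data frames below are hypothetical, string standardization is omitted, and the ref columns are already prefixed by hand. The reading of the three join types is an assumption based on their names; see join_types for the actual definitions.

```r
# Toy data (hypothetical; not the package's ne_raw / ne_ref datasets)
raw <- data.frame(
  id   = c("P1", "P2", "P3"),
  adm1 = c("New Jersey", NA, "Ontario"),
  adm2 = c("Bergen", "Bergen, N.J.", "Toronto"),
  stringsAsFactors = FALSE
)
ref <- data.frame(
  ref_adm1 = c("New Jersey", "Ontario"),
  ref_adm2 = c("Bergen", "Toronto"),
  hcode    = c("211", "301"),
  stringsAsFactors = FALSE
)
man <- data.frame(
  adm1  = NA_character_,
  adm2  = "Bergen, N.J.",
  hcode = "211",
  stringsAsFactors = FALSE
)

# Step 1: attach a code to each raw row whose hierarchical values appear in man
# (with several key columns, merge() treats NA keys as equal to each other)
raw_coded <- merge(raw, man, by = c("adm1", "adm2"), all.x = TRUE)

# Step 2: bring in the reference columns through the code column
left  <- merge(raw_coded, ref, by = "hcode", all.x = TRUE)  # all raw rows
inner <- left[!is.na(left$hcode), ]                         # manually matched rows only
anti  <- raw[!raw$id %in% inner$id, ]                       # raw rows without a manual match
```

Here only P2 has an entry in man, so inner holds that single row with the New Jersey / Bergen reference values, while anti holds P1 and P3.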

Examples

data(ne_raw)
data(ne_ref)

# create df mapping sets of raw hierarchical values to codes within ref
ne_man <- data.frame(
  adm0 = NA_character_,
  adm1 = NA_character_,
  adm2 = "Bergen, N.J.",
  hcode = "211",
  stringsAsFactors = FALSE
)

# find manual matches
hmatch_manual(ne_raw, ne_ref, ne_man, code_col = "hcode", type = "inner")
#>      id adm0 adm1         adm2 level ref_adm0   ref_adm1 ref_adm2 hcode
#> 1 PID10 <NA> <NA> Bergen, N.J.  adm2      USA New Jersey   Bergen   211