Match a data.frame with raw, potentially messy hierarchical data (e.g. province, county, township) against a reference dataset, using a dictionary of manually-specified matches.
Usage
hmatch_manual(
  raw,
  ref,
  man,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  code_col,
  type = "left",
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)
Arguments
- raw
data frame containing hierarchical columns with raw data
- ref
data frame containing hierarchical columns with reference data
- man
data.frame of manually-specified matches, relating a given set of hierarchical values to the code within ref to which those values correspond
- pattern
regex pattern to match the hierarchical columns in raw and man (see also specifying_columns)
- pattern_ref
regex pattern to match the hierarchical columns in ref. Defaults to pattern, so only need to specify if the hierarchical columns have different names in raw and ref.
- by
vector giving the names of the hierarchical columns in raw and man
- by_ref
vector giving the names of the hierarchical columns in ref. Defaults to by, so only need to specify if the hierarchical columns have different names in raw and ref.
- code_col
name of the code column containing codes for matching ref and man
- type
type of join ("left", "inner", or "anti"). Defaults to "left". See join_types. Note that this function does not allow 'resolve joins', unlike most other hmatch_ functions.
- ref_prefix
prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".
- std_fn
function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.
- ...
additional arguments passed to std_fn()
Value
a data frame obtained by matching the hierarchical columns in raw and ref based on sets of matches specified in man, using the join type specified by argument type (see join_types for more details)
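To illustrate the idea behind a manual match, the following base R sketch walks through the two joins involved: raw is first linked to man on the standardized hierarchical columns, and the resulting code is then used to pull in the corresponding row of ref. This is only a conceptual sketch and not the package's implementation. The toy data frames and the helpers std() and make_key() are hypothetical, and std() is a simplified stand-in for string_std.

```r
# Hypothetical toy data (not the package's ne_raw / ne_ref)
raw <- data.frame(
  id   = c("P1", "P2"),
  adm1 = c(NA, "New York"),
  adm2 = c("Bergen, N.J.", "Kings"),
  stringsAsFactors = FALSE
)
ref <- data.frame(
  adm1  = c("New Jersey", "New York"),
  adm2  = c("Bergen", "Kings"),
  hcode = c("211", "123"),
  stringsAsFactors = FALSE
)
man <- data.frame(
  adm1  = NA_character_,
  adm2  = "Bergen, N.J.",
  hcode = "211",
  stringsAsFactors = FALSE
)

# Simplified stand-in for a standardization function (assumption):
# lower-case and drop everything except letters and digits
std <- function(x) gsub("[^a-z0-9]", "", tolower(x))

# Hypothetical helper: one standardized key per row, built from the
# hierarchical columns (missing values become the string "NA")
make_key <- function(df, cols) {
  do.call(paste, c(lapply(df[cols], std), sep = "|"))
}

by <- c("adm1", "adm2")
raw$key <- make_key(raw, by)
man$key <- make_key(man, by)

# Step 1: inner join raw to man on the standardized key to obtain the code
matched <- merge(raw, man[c("key", "hcode")], by = "key")

# Step 2: prefix the hierarchical columns of ref, then join on the code
ref_p <- ref
names(ref_p) <- ifelse(
  names(ref_p) %in% by,
  paste0("ref_", names(ref_p)),
  names(ref_p)
)
out <- merge(matched, ref_p, by = "hcode")
out$key <- NULL

print(out)
```

Only the row of raw whose hierarchical values appear in man is retained here, which corresponds to an inner join; a left join would also keep the unmatched rows of raw.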
Examples
data(ne_raw)
data(ne_ref)
# create df mapping sets of raw hierarchical values to codes within ref
ne_man <- data.frame(
  adm0 = NA_character_,
  adm1 = NA_character_,
  adm2 = "Bergen, N.J.",
  hcode = "211",
  stringsAsFactors = FALSE
)
# find manual matches
hmatch_manual(ne_raw, ne_ref, ne_man, code_col = "hcode", type = "inner")
#>      id adm0 adm1         adm2 level ref_adm0   ref_adm1 ref_adm2 hcode
#> 1 PID10 <NA> <NA> Bergen, N.J.  adm2      USA New Jersey   Bergen   211