Implement a variety of hierarchical matching strategies in sequence
Source: R/hmatch_composite.R
Match a data frame with raw, potentially messy hierarchical data (e.g. province, county, township) against a reference dataset, using a variety of matching strategies implemented in sequence to identify the best-possible match (i.e. highest-resolution) for each row.
The sequence of matching strategies is:
1. (optional) manually-specified matching with hmatch_manual
2. complete matching with hmatch(..., allow_gaps = FALSE)
3. partial matching with hmatch(..., allow_gaps = TRUE)
4. fuzzy partial matching with hmatch(allow_gaps = TRUE, fuzzy = TRUE)
5. best-possible matching with hmatch_settle
Each approach is applied only to the rows of data for which a single match has not already been identified by the previous approaches.
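The "try each strategy in turn, only on rows still unmatched" logic can be illustrated with a minimal base-R sketch. This is not the package's implementation: it works on a single column rather than a hierarchy, and the names ref_codes, exact_match, std_match, fuzzy_match and match_in_sequence are hypothetical helpers invented for the illustration. The fuzzy step uses base R's adist() (generalized Levenshtein distance) as a stand-in for the stringdist-based matching described on this page.

```r
# Hypothetical single-level reference: names mapped to codes
ref_codes <- c("New York" = "220", "Pennsylvania" = "230", "Ontario" = "110")

# Strategy 1 (hypothetical): exact string match
exact_match <- function(x) unname(ref_codes[x])

# Strategy 2 (hypothetical): match after crude standardization
std_match <- function(x) {
  std <- function(s) tolower(gsub("[^[:alnum:]]+", "", s))
  unname(ref_codes[match(std(x), std(names(ref_codes)))])
}

# Strategy 3 (hypothetical): accept a unique reference within edit distance 1
fuzzy_match <- function(x) {
  vapply(x, function(s) {
    d <- utils::adist(tolower(s), tolower(names(ref_codes)))[1, ]
    hit <- which(d <= 1L)
    if (length(hit) == 1L) ref_codes[[hit]] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

# Apply strategies in order; each one only sees rows without a match so far
match_in_sequence <- function(x, strategies) {
  code <- rep(NA_character_, length(x))
  match_type <- rep(NA_character_, length(x))
  for (nm in names(strategies)) {
    todo <- is.na(code)
    if (!any(todo)) break
    found <- strategies[[nm]](x[todo])
    code[todo] <- found
    match_type[todo] <- ifelse(is.na(found), NA_character_, nm)
  }
  data.frame(raw = x, code = code, match_type = match_type)
}

out <- match_in_sequence(
  c("New York", "new_york", "Pensylvania", "Quebec"),
  list(exact = exact_match, standardized = std_match, fuzzy = fuzzy_match)
)
print(out)
```

Each row keeps the label of the first strategy that resolved it, which mirrors the role of the match_type column in the output of hmatch_composite.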
Usage
hmatch_composite(
  raw,
  ref,
  man,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  code_col,
  type = "resolve_left",
  allow_gaps = TRUE,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)
Arguments
- raw
data frame containing hierarchical columns with raw data
- ref
data frame containing hierarchical columns with reference data
- man
(optional) data frame of manually-specified matches, relating a given set of hierarchical values to the code within ref to which those values correspond
- pattern
regex pattern to match the hierarchical columns in raw (and man if given) (see also specifying_columns)
- pattern_ref
regex pattern to match the hierarchical columns in ref. Defaults to pattern, so it only needs to be specified if the hierarchical columns have different names in raw and ref.
- by
vector giving the names of the hierarchical columns in raw (and man if given)
- by_ref
vector giving the names of the hierarchical columns in ref. Defaults to by, so it only needs to be specified if the hierarchical columns have different names in raw and ref.
- code_col
name of the code column containing codes for matching ref and man (only required if argument man is given)
- type
type of join ("resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "resolve_left". See join_types.
- allow_gaps
logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE.
- fuzzy
logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.
- fuzzy_method
if fuzzy = TRUE, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".
- fuzzy_dist
if fuzzy = TRUE, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to fuzzy_dist will be considered matching). Defaults to 1L.
- dict
optional dictionary for recoding values within the hierarchical columns of raw (see dictionary_recoding)
- ref_prefix
prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".
- std_fn
function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.
- ...
additional arguments passed to std_fn()
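To make the fuzzy_dist threshold concrete, here is a small sketch of the "distance less than or equal to fuzzy_dist" rule. It assumes base R's adist() as a stand-in for the stringdist package, and within_dist is a hypothetical helper, not a function of this package. adist() computes generalized Levenshtein distance, which agrees with "osa" for insertions, deletions and substitutions but counts an adjacent transposition as 2 rather than 1.

```r
# Hypothetical helper: TRUE where a raw value is within `fuzzy_dist` edits
# of a reference value (rows = raw values, columns = reference values)
within_dist <- function(raw, ref, fuzzy_dist = 1L) {
  d <- utils::adist(tolower(raw), tolower(ref))
  d <= fuzzy_dist
}

m <- within_dist(
  raw = c("King", "Pensylvania", "Ithaca"),
  ref = c("Kings", "Pennsylvania")
)
dimnames(m) <- list(c("King", "Pensylvania", "Ithaca"),
                    c("Kings", "Pennsylvania"))
print(m)
```

With the default threshold of 1, "King" and "Pensylvania" each fall within one edit of a reference value, while "Ithaca" matches nothing.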
Value
a data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details)
Examples
data(ne_raw)
data(ne_ref)
hmatch_composite(ne_raw, ne_ref, fuzzy = TRUE)
#> id adm0 adm1 adm2 level ref_adm0 ref_adm1
#> 1 PID01 USA New York Suffolk adm2 USA New York
#> 2 PID02 can ontario <NA> adm1 CAN Ontario
#> 3 PID03 USA New York Kings County adm1 USA New York
#> 4 PID04 <NA> <NA> Philadelphia adm2 USA Pennsylvania
#> 5 PID05 USA <NA> York adm2 USA Pennsylvania
#> 6 PID06 USA new. york jefferson adm2 USA New York
#> 7 PID07 CAN Ontario Peel R.M. adm1 CAN Ontario
#> 8 PID08 USA Pensylvania Ithaca adm1 USA Pennsylvania
#> 9 PID09 USA New_York King adm2 USA New York
#> 10 PID10 <NA> <NA> Bergen, N.J. <NA> <NA> <NA>
#> 11 PID11 USA Philadelphia <NA> adm0 USA <NA>
#> 12 PID12 USA NJ <NA> adm0 USA <NA>
#> 13 PID13 <NA> <NA> Jeffersen adm0 USA <NA>
#> 14 PID14 <NA> <NA> york <NA> <NA> <NA>
#> 15 PID15 USA New York State New York adm0 USA <NA>
#> ref_adm2 hcode match_type
#> 1 Suffolk 227 complete
#> 2 <NA> 110 complete
#> 3 <NA> 220 settle
#> 4 Philadelphia 237 gaps
#> 5 York 238 gaps
#> 6 Jefferson 222 complete
#> 7 <NA> 110 settle
#> 8 <NA> 230 settle
#> 9 Kings 223 fuzzy
#> 10 <NA> <NA> <NA>
#> 11 <NA> 200 settle
#> 12 <NA> 200 settle
#> 13 <NA> 200 settle
#> 14 <NA> <NA> <NA>
#> 15 <NA> 200 settle
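One way to review a result like the one above is to tabulate how each row was resolved. The sketch below is self-contained: rather than calling the package, it re-enters the match_type values from the printed output by hand, so the vector match_type here is a transcription for illustration only.

```r
# match_type values transcribed from the example output above (rows 1 to 15)
match_type <- c(
  "complete", "complete", "settle", "gaps", "gaps",
  "complete", "settle", "settle", "fuzzy", NA,
  "settle", "settle", "settle", NA, "settle"
)

# Count rows per match type, keeping unmatched rows (NA) visible
counts <- table(match_type, useNA = "ifany")
print(counts)

# Rows that no strategy could resolve
unmatched <- which(is.na(match_type))
print(unmatched)
```

In practice the same call would be made on the match_type column of the data frame returned by hmatch_composite.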