Hierarchical matching of parents based on sets of common offspring
Source: R/hmatch_parents.R (hmatch_parents.Rd)
Match a hierarchical column (e.g. region, province, or county) within a raw, potentially messy dataset against a corresponding column within a reference dataset, by searching for similar sets of 'offspring' (i.e. values at the next hierarchical level).
For example, if the raw dataset uses admin1 level "NY" whereas the reference dataset uses "New York", it would be difficult to automatically match these values using only fuzzy-matching. However, we might nonetheless be able to match "NY" to "New York" if they share a common and unique set of 'offspring' (i.e. admin2 values) across both datasets (e.g. "Kings", "Queens", "New York", "Suffolk", "Bronx", etc.).
Unlike other hmatch_ functions, the data frame returned by hmatch_parents only includes unique hierarchical combinations and only relevant hierarchical levels (i.e. the parent level and above), along with additional columns giving the number of matching children and total number of children for a given parent.
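The idea of matching parents by shared offspring can be sketched in base R as follows. This is only an illustration of the concept on hypothetical toy data, not the package's actual implementation: the data frames, the exact-match comparison of children, and the helper objects kids_raw, kids_ref and pairs are all assumptions made for the example.

```r
# Hypothetical toy data: adm1 is the parent level, adm2 the child level
raw <- data.frame(
  adm1 = c("NY", "NY", "NY", "NJ", "NJ"),
  adm2 = c("Kings", "Queens", "Bronx", "Bergen", "Essex"),
  stringsAsFactors = FALSE
)
ref <- data.frame(
  adm1 = c("New York", "New York", "New York", "New Jersey", "New Jersey"),
  adm2 = c("Kings", "Queens", "Bronx", "Bergen", "Essex"),
  stringsAsFactors = FALSE
)

# Unique children belonging to each parent, in each dataset
kids_raw <- lapply(split(raw$adm2, raw$adm1), unique)
kids_ref <- lapply(split(ref$adm2, ref$adm1), unique)

# Every possible pairing of a raw parent with a reference parent
pairs <- expand.grid(
  adm1 = names(kids_raw),
  ref_adm1 = names(kids_ref),
  stringsAsFactors = FALSE
)

# Child counts per pairing (exact string comparison, an assumption here)
pairs$n_child_raw <- unname(lengths(kids_raw[pairs$adm1]))
pairs$n_child_ref <- unname(lengths(kids_ref[pairs$ref_adm1]))
pairs$n_child_match <- mapply(
  function(p, q) length(intersect(kids_raw[[p]], kids_ref[[q]])),
  pairs$adm1, pairs$ref_adm1,
  USE.NAMES = FALSE
)

# Keep pairings that share at least min_matches children
min_matches <- 2L
matched <- pairs[pairs$n_child_match >= min_matches, ]
rownames(matched) <- NULL
print(matched)
```

In this toy case "NY" pairs with "New York" and "NJ" with "New Jersey", because those are the only pairings that share at least two children.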
Usage
hmatch_parents(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  level,
  min_matches = 1L,
  type = "left",
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)
Arguments
- raw
data frame containing hierarchical columns with raw data
- ref
data frame containing hierarchical columns with reference data
- pattern
regex pattern to match the hierarchical columns in raw
Note: hierarchical column names can be matched using either the pattern or by arguments. Or, if neither pattern nor by is specified, the hierarchical columns are assumed to be all column names that are common to both raw and ref. See specifying_columns.
- pattern_ref
regex pattern to match the hierarchical columns in ref. Defaults to pattern, so only need to specify if the hierarchical columns have different names in raw and ref.
- by
vector giving the names of the hierarchical columns in raw
- by_ref
vector giving the names of the hierarchical columns in ref. Defaults to by, so only need to specify if the hierarchical columns have different names in raw and ref.
- level
name or integer index of the hierarchical level to match at (i.e. the 'parent' level). If a name, must correspond to a hierarchical column within raw, not including the very last hierarchical column (which has no hierarchical children). If an integer, must be between 1 and k-1, where k is the number of hierarchical columns.
- min_matches
minimum number of matching offspring required for parents to be considered a match. Defaults to 1.
- type
type of join ("left", "inner" or "anti") (defaults to "left")
- fuzzy
logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.
- fuzzy_method
if fuzzy = TRUE, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".
- fuzzy_dist
if fuzzy = TRUE, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to fuzzy_dist will be considered matching). Defaults to 1L.
- ref_prefix
prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".
- std_fn
function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.
- ...
additional arguments passed to std_fn()
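To make the fuzzy_dist threshold concrete, here is a small sketch of the "distance less than or equal to fuzzy_dist" rule. It uses base R's utils::adist (generalized Levenshtein distance) as a stand-in, which is an assumption made to keep the example dependency-free: the function itself relies on the stringdist package and defaults to the "osa" method, which additionally counts transpositions of adjacent characters as a single edit. The child names below are hypothetical.

```r
# Hypothetical child values; "Quens" is a misspelling of "Queens",
# and "Richmond" has no counterpart in the reference set
raw_children <- c("Kings", "Quens", "Bronx", "Richmond")
ref_children <- c("Kings", "Queens", "Bronx", "Suffolk")

fuzzy_dist <- 1L

# Pairwise edit distances (rows = raw, columns = ref).
# NOTE: Levenshtein via utils::adist, swapped in for stringdist's "osa"
d <- utils::adist(raw_children, ref_children)

# A raw child counts as matched if any reference child is within fuzzy_dist
is_match <- apply(d <= fuzzy_dist, 1, any)
names(is_match) <- raw_children
print(is_match)

# Number of matching children under this threshold
n_child_match <- sum(is_match)
print(n_child_match)
```

With fuzzy_dist set to 0 the same code reduces to exact matching, in which case "Quens" would no longer count as a match.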
Value
a data frame obtained by matching the hierarchical columns in raw and ref (at the parent level and above), using the join type specified by argument type (see join_types for more details). Note that unlike other hmatch_ functions, hmatch_parents returns only unique rows and relevant hierarchical columns (i.e. the parent level and above), along with additional columns describing the number of matching children and total number of children for a given parent.
- ...
hierarchical columns from raw, parent level and above
- ...
hierarchical columns from ref, parent level and above
- n_child_raw
total number of unique children belonging to the parent within raw
- n_child_ref
total number of unique children belonging to the parent within ref
- n_child_match
number of children in raw with a match in ref
Examples
# e.g. match abbreviated adm1 names to full names based on common offspring
raw <- ne_ref
raw$adm1[raw$adm1 == "Ontario"] <- "ON"
raw$adm1[raw$adm1 == "New York"] <- "NY"
raw$adm1[raw$adm1 == "New Jersey"] <- "NJ"
raw$adm1[raw$adm1 == "Pennsylvania"] <- "PA"
hmatch_parents(
  raw,
  ne_ref,
  pattern = "adm",
  level = "adm1",
  min_matches = 2,
  type = "left"
)
#> adm0 adm1 ref_adm0 ref_adm1 n_child_raw n_child_ref n_child_match
#> 1 CAN ON CAN Ontario 5 5 5
#> 2 USA NJ USA New Jersey 5 5 5
#> 3 USA NY USA New York 7 7 7
#> 4 USA PA USA Pennsylvania 8 8 8
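One possible follow-up, sketched below, is to use the count columns to check how complete each parent match is and then build a lookup for recoding the raw parent values. To keep the sketch self-contained, the result shown above is retyped by hand into a data frame named res; the names res, prop_match and lookup are hypothetical and not part of the package.

```r
# Result from the example above, retyped by hand so this runs standalone
res <- data.frame(
  adm0 = c("CAN", "USA", "USA", "USA"),
  adm1 = c("ON", "NJ", "NY", "PA"),
  ref_adm0 = c("CAN", "USA", "USA", "USA"),
  ref_adm1 = c("Ontario", "New Jersey", "New York", "Pennsylvania"),
  n_child_raw = c(5L, 5L, 7L, 8L),
  n_child_ref = c(5L, 5L, 7L, 8L),
  n_child_match = c(5L, 5L, 7L, 8L),
  stringsAsFactors = FALSE
)

# Share of each raw parent's children that found a match in ref
res$prop_match <- res$n_child_match / res$n_child_raw

# Named vector mapping raw parent values to reference parent values
lookup <- setNames(res$ref_adm1, res$adm1)
print(lookup[["NY"]])
```

A low prop_match would suggest reviewing the pairing by hand before recoding, since a parent matched on only a few shared children is weaker evidence than one matched on all of them.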