Sequential hierarchical matching at each hierarchical level, settling for the highest resolution match that is possible for each row
Source:R/hmatch_settle.R
hmatch_settle.Rd
Match sets of hierarchical values (e.g. province / county / township) in a
raw, messy dataset to corresponding values within a reference dataset,
sequentially over each hierarchical level. Specifically, implements
hmatch
at each successive hierarchical level, starting with
only the first level (lowest resolution), then first and second, first second
and third, etc.
After the initial matching over all levels, users can optionally use a resolve join to 'settle' for the highest match possible for each row of raw data, even if that match is below the highest-resolution level specified.
Usage
hmatch_settle(
raw,
ref,
pattern,
pattern_ref = pattern,
by,
by_ref = by,
type = "left",
allow_gaps = TRUE,
fuzzy = FALSE,
fuzzy_method = "osa",
fuzzy_dist = 1L,
dict = NULL,
ref_prefix = "ref_",
std_fn = string_std,
...
)
Arguments
- raw
data frame containing hierarchical columns with raw data
- ref
data frame containing hierarchical columns with reference data
- pattern
regex pattern to match the hierarchical columns in
raw
Note: hierarchical column names can be matched using either the
pattern
orby
arguments. Or, if neitherpattern
orby
are specified, the hierarchical columns are assumed to be all column names that are common to bothraw
andref
. See specifying_columns.- pattern_ref
regex pattern to match the hierarchical columns in
ref
. Defaults topattern
, so only need to specify if the hierarchical columns have different names inraw
andref
.- by
vector giving the names of the hierarchical columns in
raw
- by_ref
vector giving the names of the hierarchical columns in
ref
. Defaults toby
, so only need to specify if the hierarchical columns have different names inraw
andref
.- type
type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types.
- allow_gaps
logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of
raw
. Defaults toTRUE
.- fuzzy
logical indicating whether to use fuzzy-matching (based on the
stringdist
package). Defaults to FALSE.- fuzzy_method
if
fuzzy = TRUE
, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".- fuzzy_dist
if
fuzzy = TRUE
, the maximum string distance to use to classify matches (i.e. a string distance less than or equal tofuzzy_dist
will be considered matching). Defaults to1L
.- dict
optional dictionary for recoding values within the hierarchical columns of
raw
(see dictionary_recoding)- ref_prefix
prefix to add to names of returned columns from
ref
if they are otherwise identical to names withinraw
. Defaults to "ref_".- std_fn
function to standardize strings during matching. Defaults to
string_std
. Set toNULL
to omit standardization. See also string_standardization.- ...
additional arguments passed to
std_fn()
Value
a data frame obtained by matching the hierarchical columns in raw
and ref
, using the join type specified by argument type
(see
join_types for more details)
Resolve joins
In a resolve type join with hmatch_settle
, rows of raw
with multiple
matches to ref
are resolved to the highest hierarchical level that is
non-conflicting among all matches (or no match if there is a conflict at the
very first level). E.g.
raw
: 1. | United States | <NA> | Jefferson |
Relevant rows from ref
: 1. | United States | <NA> | <NA> |
2. | United States | New York | Jefferson |
3. | United States | Pennsylvania | Jefferson |
In a regular join, the single row from raw
(above) will match all three
rows from ref
. However, in a resolve join the multiple matches will be
resolved to the first row from ref
, because only the first hierarchical
level ("United States") is non-conflicting among all possible matches.
Note that there's a distinction between "common" values at a given hierarchical level (i.e. a single unique value in each row) and "non-conflicting" values (i.e. a single unique value or a missing value). E.g.
raw
: 1. | United States | New York | New York |
Relevant rows from ref
: 1. | United States | <NA> | <NA> |
2. | United States | New York | <NA> |
3. | United States | New York | New York |
In the example above, only the 1st hierarchical level ("United States") is
"common" to all matches, but all hierarchical levels are "non-conflicting"
(i.e. because row 2 is a hierarchical child of row 1, and row 3 a child of
row 2), and so a resolve-type match will be made to the 3rd row in ref
.
Examples
data(ne_raw)
data(ne_ref)
# return matches at all levels
hmatch_settle(ne_raw, ne_ref, pattern = "^adm", type = "inner")
#> id adm0 adm1 adm2 level ref_adm0 ref_adm1
#> 1 PID01 USA New York Suffolk adm0 USA <NA>
#> 2 PID01 USA New York Suffolk adm1 USA New York
#> 3 PID01 USA New York Suffolk adm2 USA New York
#> 4 PID02 can ontario <NA> adm0 CAN <NA>
#> 5 PID02 can ontario <NA> adm1 CAN Ontario
#> 6 PID03 USA New York Kings County adm0 USA <NA>
#> 7 PID03 USA New York Kings County adm1 USA New York
#> 8 PID04 <NA> <NA> Philadelphia adm2 USA Pennsylvania
#> 9 PID05 USA <NA> York adm0 USA <NA>
#> 10 PID05 USA <NA> York adm2 USA Pennsylvania
#> 11 PID06 USA new. york jefferson adm0 USA <NA>
#> 12 PID06 USA new. york jefferson adm1 USA New York
#> 13 PID06 USA new. york jefferson adm2 USA New York
#> 14 PID07 CAN Ontario Peel R.M. adm0 CAN <NA>
#> 15 PID07 CAN Ontario Peel R.M. adm1 CAN Ontario
#> 16 PID08 USA Pensylvania Ithaca adm0 USA <NA>
#> 17 PID09 USA New_York King adm0 USA <NA>
#> 18 PID09 USA New_York King adm1 USA New York
#> 19 PID11 USA Philadelphia <NA> adm0 USA <NA>
#> 20 PID12 USA NJ <NA> adm0 USA <NA>
#> 21 PID14 <NA> <NA> york adm2 CAN Ontario
#> 22 PID14 <NA> <NA> york adm2 USA Pennsylvania
#> 23 PID15 USA New York State New York adm0 USA <NA>
#> ref_adm2 hcode
#> 1 <NA> 200
#> 2 <NA> 220
#> 3 Suffolk 227
#> 4 <NA> 100
#> 5 <NA> 110
#> 6 <NA> 200
#> 7 <NA> 220
#> 8 Philadelphia 237
#> 9 <NA> 200
#> 10 York 238
#> 11 <NA> 200
#> 12 <NA> 220
#> 13 Jefferson 222
#> 14 <NA> 100
#> 15 <NA> 110
#> 16 <NA> 200
#> 17 <NA> 200
#> 18 <NA> 220
#> 19 <NA> 200
#> 20 <NA> 200
#> 21 York 115
#> 22 York 238
#> 23 <NA> 200
# use a resolve join to settle for the best possible match for each row
hmatch_settle(ne_raw, ne_ref, pattern = "^adm", type = "resolve_inner")
#> id adm0 adm1 adm2 level ref_adm0 ref_adm1
#> 1 PID01 USA New York Suffolk adm2 USA New York
#> 2 PID02 can ontario <NA> adm1 CAN Ontario
#> 3 PID03 USA New York Kings County adm1 USA New York
#> 4 PID04 <NA> <NA> Philadelphia adm2 USA Pennsylvania
#> 5 PID05 USA <NA> York adm2 USA Pennsylvania
#> 6 PID06 USA new. york jefferson adm2 USA New York
#> 7 PID07 CAN Ontario Peel R.M. adm1 CAN Ontario
#> 8 PID08 USA Pensylvania Ithaca adm0 USA <NA>
#> 9 PID09 USA New_York King adm1 USA New York
#> 10 PID11 USA Philadelphia <NA> adm0 USA <NA>
#> 11 PID12 USA NJ <NA> adm0 USA <NA>
#> 12 PID15 USA New York State New York adm0 USA <NA>
#> ref_adm2 hcode
#> 1 Suffolk 227
#> 2 <NA> 110
#> 3 <NA> 220
#> 4 Philadelphia 237
#> 5 York 238
#> 6 Jefferson 222
#> 7 <NA> 110
#> 8 <NA> 200
#> 9 <NA> 220
#> 10 <NA> 200
#> 11 <NA> 200
#> 12 <NA> 200