Hierarchical matching with sequential column permutation to allow for values entered at the wrong hierarchical level
Source:R/hmatch_permute.R
hmatch_permute.Rd
Match a data frame with raw, potentially messy hierarchical data (e.g. province, county, township) against a reference dataset, using sequential permutation of the hierarchical columns to allow for values entered at the wrong hierarchical level.
The function calls hmatch
on each possible permutation of the
hierarchical columns, and then combines the results. Rows of raw
yielding
multiple matches to ref
can optionally be resolved using a resolve-type
join (see section Resolve joins below).
Usage
hmatch_permute(
raw,
ref,
pattern,
pattern_ref = pattern,
by,
by_ref = by,
type = "left",
allow_gaps = TRUE,
fuzzy = FALSE,
fuzzy_method = "osa",
fuzzy_dist = 1L,
dict = NULL,
ref_prefix = "ref_",
std_fn = string_std,
...
)
Arguments
- raw
data frame containing hierarchical columns with raw data
- ref
data frame containing hierarchical columns with reference data
- pattern
regex pattern to match the hierarchical columns in
raw
Note: hierarchical column names can be matched using either the
pattern
orby
arguments. Or, if neitherpattern
orby
are specified, the hierarchical columns are assumed to be all column names that are common to bothraw
andref
. See specifying_columns.- pattern_ref
regex pattern to match the hierarchical columns in
ref
. Defaults topattern
, so only need to specify if the hierarchical columns have different names inraw
andref
.- by
vector giving the names of the hierarchical columns in
raw
- by_ref
vector giving the names of the hierarchical columns in
ref
. Defaults toby
, so only need to specify if the hierarchical columns have different names inraw
andref
.- type
type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types.
- allow_gaps
logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of
raw
. Defaults toTRUE
.- fuzzy
logical indicating whether to use fuzzy-matching (based on the
stringdist
package). Defaults to FALSE.- fuzzy_method
if
fuzzy = TRUE
, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".- fuzzy_dist
if
fuzzy = TRUE
, the maximum string distance to use to classify matches (i.e. a string distance less than or equal tofuzzy_dist
will be considered matching). Defaults to1L
.- dict
optional dictionary for recoding values within the hierarchical columns of
raw
(see dictionary_recoding)- ref_prefix
prefix to add to names of returned columns from
ref
if they are otherwise identical to names withinraw
. Defaults to "ref_".- std_fn
function to standardize strings during matching. Defaults to
string_std
. Set toNULL
to omit standardization. See also string_standardization.- ...
additional arguments passed to
std_fn()
Value
a data frame obtained by matching the hierarchical columns in raw
and ref
, using the join type specified by argument type
(see
join_types for more details)
Resolve joins
In hmatch_permute
, if argument type
corresponds to a resolve join, rows
of raw
with multiple matches to ref
are resolved to the highest
hierarchical level that is common among all matches (or no match if there is
a conflict at the very first level). E.g.
raw
: 1. | United States | <NA> | New York |
Relevant rows from ref
: 1. | United States | New York | <NA> |
2. | United States | New York | New York |
In a regular join with hmatch_permute
, the single row from raw
(above)
will match both of the depicted rows from ref
. However, in a resolve join
the two matches will resolve to the first row from ref
, because it reflects
the highest hierarchical level that is common to all matches.
Examples
data(ne_raw)
data(ne_ref)
hmatch_permute(ne_raw, ne_ref, pattern = "^adm", type = "inner")
#> id adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2
#> 1 PID01 USA New York Suffolk adm2 USA New York Suffolk
#> 2 PID02 can ontario <NA> adm1 CAN Ontario <NA>
#> 3 PID04 <NA> <NA> Philadelphia adm2 USA Pennsylvania Philadelphia
#> 4 PID05 USA <NA> York adm2 USA Pennsylvania York
#> 5 PID06 USA new. york jefferson adm2 USA New York Jefferson
#> 6 PID11 USA Philadelphia <NA> adm2 USA Pennsylvania Philadelphia
#> 7 PID14 <NA> <NA> york adm2 CAN Ontario York
#> 8 PID14 <NA> <NA> york adm2 USA Pennsylvania York
#> hcode
#> 1 227
#> 2 110
#> 3 237
#> 4 238
#> 5 222
#> 6 237
#> 7 115
#> 8 238