Hierarchical matching with tokenization of multi-term values
Source: R/hmatch_tokens.R
hmatch_tokens.Rd
Match sets of hierarchical values (e.g. province / county / township) in a raw, messy dataset to corresponding values within a reference dataset, using tokenization to help match multi-term values that might otherwise be difficult to match (e.g. "New York City" vs. "New York").
Includes options for ignoring matches from frequently-occurring tokens (e.g. "North", "South", "City"), small tokens (e.g. "El", "San", "New"), or any other set of tokens specified by the user.
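The idea of token-based matching can be sketched in base R. This is only an illustration, not the package's implementation: the helpers `std_sketch()`, `tokenize()`, and `tokens_match()` are hypothetical, and `std_sketch()` is a simplified stand-in for `string_std` that assumes standardized strings are lowercase with terms separated by "_".

```r
# Hypothetical stand-in for string standardization (assumption: lowercase,
# runs of non-alphanumeric characters become "_", outer "_" trimmed)
std_sketch <- function(x) {
  x <- gsub("[^a-z0-9]+", "_", tolower(x))
  gsub("^_+|_+$", "", x)
}

# Hypothetical helper: split one standardized string into its tokens
tokenize <- function(x, token_split = "_") {
  strsplit(x, token_split)[[1]]
}

# Hypothetical helper: TRUE if two values share at least `token_min` tokens
tokens_match <- function(a, b, token_min = 1) {
  shared <- intersect(tokenize(std_sketch(a)), tokenize(std_sketch(b)))
  length(shared) >= token_min
}

tokens_match("New York City", "New York")   # shares "new" and "york"
tokens_match("New York", "North Dakota")    # no shared tokens
```

Note that this sketch applies no token exclusions, so short or common tokens such as "new" count toward a match here.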
Usage
hmatch_tokens(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  type = "left",
  allow_gaps = TRUE,
  always_tokenize = FALSE,
  token_split = "_",
  token_min = 1,
  exclude_freq = 3,
  exclude_nchar = 3,
  exclude_values = NULL,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)
Arguments
- raw
data frame containing hierarchical columns with raw data
- ref
data frame containing hierarchical columns with reference data
- pattern
regex pattern to match the hierarchical columns in raw
Note: hierarchical column names can be matched using either the pattern or by arguments. Or, if neither pattern nor by is specified, the hierarchical columns are assumed to be all column names that are common to both raw and ref. See specifying_columns.
- pattern_ref
regex pattern to match the hierarchical columns in ref. Defaults to pattern, so only need to specify if the hierarchical columns have different names in raw and ref.
- by
vector giving the names of the hierarchical columns in raw
- by_ref
vector giving the names of the hierarchical columns in ref. Defaults to by, so only need to specify if the hierarchical columns have different names in raw and ref.
- type
type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types.
- allow_gaps
logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE.
- always_tokenize
logical indicating whether to tokenize all values prior to matching (TRUE), or to first attempt non-tokenized matching with hmatch and only tokenize values within raw (and corresponding putative matches within ref) that don't have a non-tokenized match (FALSE). Defaults to FALSE.
- token_split
regex pattern to split strings into tokens. Currently tokenization is implemented after string-standardization with argument std_fn (this may change in a future version), so the regex pattern should split standardized strings rather than the original strings. Defaults to "_".
- token_min
minimum number of tokens that must match for a term to be considered matching overall. Defaults to 1.
- exclude_freq
exclude tokens from matching if they have a frequency greater than or equal to this value. Refers to the number of unique, string-standardized values at a given hierarchical level in which a given token occurs, as calculated by count_tokens (separately for raw and ref). Defaults to 3.
- exclude_nchar
exclude tokens from matching if they have nchar less than or equal to this value. Defaults to 3.
- exclude_values
character vector of additional tokens to exclude from matching. Subject to string-standardization with argument std_fn.
- fuzzy
logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.
- fuzzy_method
if fuzzy = TRUE, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".
- fuzzy_dist
if fuzzy = TRUE, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to fuzzy_dist will be considered matching). Defaults to 1L.
- dict
optional dictionary for recoding values within the hierarchical columns of raw (see dictionary_recoding)
- ref_prefix
prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".
- std_fn
function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.
- ...
additional arguments passed to std_fn()
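The exclusion rules described for exclude_freq and exclude_nchar can be sketched in base R as follows. This is a rough illustration under stated assumptions, not the package's code: the `values` vector is made-up data standing in for the standardized values at one hierarchical level, and the frequency count is computed by hand here instead of with count_tokens.

```r
# Hypothetical standardized values at a single hierarchical level
values <- c("north_dakota", "north_carolina", "north_bay",
            "new_york", "el_paso")

# Tokens of each unique value; unique() so a token counts once per value
tokens_by_value <- lapply(strsplit(unique(values), "_"), unique)

# Number of unique values in which each token occurs
token_freq <- table(unlist(tokens_by_value))
tokens <- names(token_freq)

exclude_freq  <- 3  # drop tokens occurring in >= 3 unique values
exclude_nchar <- 3  # drop tokens with <= 3 characters

too_frequent <- as.vector(token_freq) >= exclude_freq
too_short    <- nchar(tokens) <= exclude_nchar

excluded <- tokens[too_frequent | too_short]
kept     <- tokens[!(too_frequent | too_short)]

excluded  # "north" (frequent) plus "bay", "el", "new" (short)
kept      # "carolina", "dakota", "paso", "york"
```

With these defaults a value such as "new_york" can still match on "york" alone, while "north" no longer links otherwise unrelated values.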
Value
a data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details)
Resolve joins
Uses the same approach to resolve joins as hmatch.
Examples
data(ne_raw)
data(ne_ref)
# add tokens to some values within ref to illustrate tokenized matching
ne_ref$adm0[ne_ref$adm0 == "United States"] <- "United States of America"
ne_ref$adm1[ne_ref$adm1 == "New York"] <- "New York State"
hmatch_tokens(ne_raw, ne_ref, type = "inner", token_min = 1)
#> id adm0 adm1 adm2 level ref_adm0 ref_adm1
#> 1 PID01 USA New York Suffolk adm2 USA New York State
#> 2 PID02 can ontario <NA> adm1 CAN Ontario
#> 3 PID03 USA New York Kings County adm2 USA New York State
#> 4 PID04 <NA> <NA> Philadelphia adm2 USA Pennsylvania
#> 5 PID05 USA <NA> York adm2 USA New York State
#> 6 PID05 USA <NA> York adm2 USA Pennsylvania
#> 7 PID06 USA new. york jefferson adm2 USA New York State
#> 8 PID07 CAN Ontario Peel R.M. adm2 CAN Ontario
#> 9 PID10 <NA> <NA> Bergen, N.J. adm2 USA New Jersey
#> 10 PID14 <NA> <NA> york adm2 CAN Ontario
#> 11 PID14 <NA> <NA> york adm2 USA New York State
#> 12 PID14 <NA> <NA> york adm2 USA Pennsylvania
#> 13 PID15 USA New York State New York adm2 USA New York State
#> ref_adm2 hcode
#> 1 Suffolk 227
#> 2 <NA> 110
#> 3 Kings 223
#> 4 Philadelphia 237
#> 5 New York 225
#> 6 York 238
#> 7 Jefferson 222
#> 8 Peel 113
#> 9 Bergen 211
#> 10 York 115
#> 11 New York 225
#> 12 York 238
#> 13 New York 225