Hierarchical matching with tokenization of multi-term values
Source: R/hmatch_tokens.R
Match sets of hierarchical values (e.g. province / county / township) in a raw, messy dataset to corresponding values within a reference dataset, using tokenization to help match multi-term values that might otherwise be difficult to match (e.g. "New York City" vs. "New York").
Includes options for ignoring matches from frequently-occurring tokens (e.g. "North", "South", "City"), small tokens (e.g. "El", "San", "New"), or any other set of tokens specified by the user.
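The idea of matching on shared tokens can be sketched in base R. This is only an illustration, not the package's implementation: std_sketch(), tokens_sketch() and n_shared_tokens() are hypothetical helpers, and the standardization shown (lowercase, non-alphanumeric runs replaced by "_") is an assumed approximation of what std_fn does. Token exclusion is ignored here.

```r
# Hypothetical helper: rough stand-in for string standardization
# (lowercase, runs of non-alphanumeric characters collapsed to "_")
std_sketch <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)
}

# Hypothetical helper: split one standardized value into tokens
tokens_sketch <- function(x, token_split = "_") {
  strsplit(std_sketch(x), token_split)[[1]]
}

# Hypothetical helper: number of tokens shared by two values
n_shared_tokens <- function(a, b) {
  length(intersect(tokens_sketch(a), tokens_sketch(b)))
}

raw_value <- "New York City"
ref_value <- "New York"

# The whole standardized strings differ, so an exact match fails
exact_match <- std_sketch(raw_value) == std_sketch(ref_value)

# The two values share the tokens "new" and "york"
n_shared <- n_shared_tokens(raw_value, ref_value)

# With a threshold of one shared token, the values count as matching
token_match <- n_shared >= 1
```

In this sketch the threshold of one shared token plays the role that token_min plays in hmatch_tokens.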
Usage
hmatch_tokens(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  type = "left",
  allow_gaps = TRUE,
  always_tokenize = FALSE,
  token_split = "_",
  token_min = 1,
  exclude_freq = 3,
  exclude_nchar = 3,
  exclude_values = NULL,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)
Arguments
- raw
- data frame containing hierarchical columns with raw data 
- ref
- data frame containing hierarchical columns with reference data 
- pattern
- regex pattern to match the hierarchical columns in raw. Note: hierarchical column names can be matched using either the pattern or by arguments. If neither pattern nor by is specified, the hierarchical columns are assumed to be all column names that are common to both raw and ref. See specifying_columns.
- pattern_ref
- regex pattern to match the hierarchical columns in ref. Defaults to pattern, so it only needs to be specified if the hierarchical columns have different names in raw and ref.
- by
- vector giving the names of the hierarchical columns in raw
- by_ref
- vector giving the names of the hierarchical columns in ref. Defaults to by, so it only needs to be specified if the hierarchical columns have different names in raw and ref.
- type
- type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types. 
- allow_gaps
- logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE.
- always_tokenize
- logical indicating whether to tokenize all values prior to matching (TRUE), or to first attempt non-tokenized matching with hmatch and only tokenize values within raw (and corresponding putative matches within ref) that don't have a non-tokenized match (FALSE). Defaults to FALSE.
- token_split
- regex pattern to split strings into tokens. Currently tokenization is implemented after string-standardization with argument std_fn (this may change in a future version), so the regex pattern should split standardized strings rather than the original strings. Defaults to "_".
- token_min
- minimum number of tokens that must match for a term to be considered matching overall. Defaults to 1. 
- exclude_freq
- exclude tokens from matching if they have a frequency greater than or equal to this value. Refers to the number of unique, string-standardized values at a given hierarchical level in which a given token occurs, as calculated by count_tokens (separately for raw and ref). Defaults to 3.
- exclude_nchar
- exclude tokens from matching if their number of characters (nchar) is less than or equal to this value. Defaults to 3.
- exclude_values
- character vector of additional tokens to exclude from matching. Subject to string-standardization with argument std_fn.
- fuzzy
- logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.
- fuzzy_method
- if fuzzy = TRUE, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".
- fuzzy_dist
- if fuzzy = TRUE, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to fuzzy_dist will be considered matching). Defaults to 1L.
- dict
- optional dictionary for recoding values within the hierarchical columns of raw (see dictionary_recoding)
- ref_prefix
- prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".
- std_fn
- function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.
- ...
- additional arguments passed to std_fn()
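The exclusion rules described for exclude_freq and exclude_nchar can be illustrated with a small base R sketch. It is an assumption-laden approximation rather than the package's code: the values vector is made-up data, already in an assumed standardized form, and the frequency count only mimics the documented behaviour of count_tokens (the number of unique values in which each token occurs).

```r
# Hypothetical standardized values at one hierarchical level
values <- c("north_dakota", "north_carolina", "north_holland",
            "el_paso", "new_york")

# Split each unique value into tokens (mirrors token_split = "_")
token_list <- strsplit(unique(values), "_")

# Count the number of unique values in which each token occurs
token_freq <- table(unlist(lapply(token_list, unique)))

exclude_freq <- 3   # same as the documented default
exclude_nchar <- 3  # same as the documented default

all_tokens <- names(token_freq)

# A token is excluded if it is frequent (>= exclude_freq)
# or short (nchar <= exclude_nchar)
excluded <- all_tokens[
  as.vector(token_freq) >= exclude_freq | nchar(all_tokens) <= exclude_nchar
]

# Tokens that remain available for matching
kept <- setdiff(all_tokens, excluded)
```

Here "north" is dropped for its frequency, while "el" and "new" are dropped for their length, so a value such as "new_york" would be matched on "york" alone.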
Value
a data frame obtained by matching the hierarchical columns in raw
and ref, using the join type specified by argument type (see
join_types for more details)
Resolve joins
Uses the same approach to resolve joins as hmatch.
Examples
data(ne_raw)
data(ne_ref)
# add tokens to some values within ref to illustrate tokenized matching
ne_ref$adm0[ne_ref$adm0 == "United States"] <- "United States of America"
ne_ref$adm1[ne_ref$adm1 == "New York"] <- "New York State"
hmatch_tokens(ne_raw, ne_ref, type = "inner", token_min = 1)
#>       id adm0           adm1         adm2 level ref_adm0       ref_adm1
#> 1  PID01  USA       New York      Suffolk  adm2      USA New York State
#> 2  PID02  can        ontario         <NA>  adm1      CAN        Ontario
#> 3  PID03  USA       New York Kings County  adm2      USA New York State
#> 4  PID04 <NA>           <NA> Philadelphia  adm2      USA   Pennsylvania
#> 5  PID05  USA           <NA>         York  adm2      USA New York State
#> 6  PID05  USA           <NA>         York  adm2      USA   Pennsylvania
#> 7  PID06  USA      new. york    jefferson  adm2      USA New York State
#> 8  PID07  CAN        Ontario    Peel R.M.  adm2      CAN        Ontario
#> 9  PID10 <NA>           <NA> Bergen, N.J.  adm2      USA     New Jersey
#> 10 PID14 <NA>           <NA>         york  adm2      CAN        Ontario
#> 11 PID14 <NA>           <NA>         york  adm2      USA New York State
#> 12 PID14 <NA>           <NA>         york  adm2      USA   Pennsylvania
#> 13 PID15  USA New York State     New York  adm2      USA New York State
#>        ref_adm2 hcode
#> 1       Suffolk   227
#> 2          <NA>   110
#> 3         Kings   223
#> 4  Philadelphia   237
#> 5      New York   225
#> 6          York   238
#> 7     Jefferson   222
#> 8          Peel   113
#> 9        Bergen   211
#> 10         York   115
#> 11     New York   225
#> 12         York   238
#> 13     New York   225
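As a further usage sketch, an anti join can be used to list the rows of raw that still lack a match after tokenized matching. It uses only arguments documented above and assumes the function and the example datasets are provided by a package named hmatch (the package name is an assumption here, and the call is guarded in case it is not installed). The choice of "state" for exclude_values is purely illustrative, and no output is shown because it has not been verified.

```r
# Assumption: hmatch_tokens(), ne_raw and ne_ref come from a package "hmatch"
if (requireNamespace("hmatch", quietly = TRUE)) {
  data(ne_raw, package = "hmatch")
  data(ne_ref, package = "hmatch")

  # Same modification of ref as in the example above
  ne_ref$adm1[ne_ref$adm1 == "New York"] <- "New York State"

  # Rows of ne_raw without a match, ignoring the token "state"
  unmatched <- hmatch::hmatch_tokens(
    ne_raw,
    ne_ref,
    type = "anti",
    exclude_values = "state"
  )
  print(unmatched)
}
```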