Hierarchical matching with tokenization of multi-term values

Match sets of hierarchical values (e.g. province / county / township) in a raw, messy dataset to corresponding values within a reference dataset, using tokenization to help match multi-term values that might otherwise be difficult to match (e.g. "New York City" vs. "New York").

Includes options for ignoring matches from frequently-occurring tokens (e.g. "North", "South", "City"), small tokens (e.g. "El", "San", "New"), or any other set of tokens specified by the user.

Usage

hmatch_tokens(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  type = "left",
  allow_gaps = TRUE,
  always_tokenize = FALSE,
  token_split = "_",
  token_min = 1,
  exclude_freq = 3,
  exclude_nchar = 3,
  exclude_values = NULL,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)

Arguments

raw

data frame containing hierarchical columns with raw data

ref

data frame containing hierarchical columns with reference data

pattern

regex pattern to match the hierarchical columns in raw

Note: hierarchical column names can be matched using either the pattern or by arguments. If neither pattern nor by is specified, the hierarchical columns are assumed to be all column names that are common to both raw and ref. See specifying_columns.

pattern_ref

regex pattern to match the hierarchical columns in ref. Defaults to pattern, so it only needs to be specified if the hierarchical columns have different names in raw and ref.

by

vector giving the names of the hierarchical columns in raw

by_ref

vector giving the names of the hierarchical columns in ref. Defaults to by, so it only needs to be specified if the hierarchical columns have different names in raw and ref.

type

type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types.

allow_gaps

logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE.

always_tokenize

logical indicating whether to tokenize all values prior to matching (TRUE), or to first attempt non-tokenized matching with hmatch and only tokenize values within raw (and corresponding putative matches within ref) that don't have a non-tokenized match (FALSE). Defaults to FALSE.

token_split

regex pattern to split strings into tokens. Currently tokenization is implemented after string-standardization with argument std_fn (this may change in a future version), so the regex pattern should split standardized strings rather than the original strings. Defaults to "_".
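To illustrate why token_split should target standardized strings, here is a minimal base-R sketch. The helper std_sketch() is a hypothetical stand-in for a standardizer, not the package's string_std, whose exact rules may differ. It lowercases each string and collapses runs of non-alphanumeric characters to "_", after which the default split pattern "_" yields the tokens.

```r
# Hypothetical stand-in for a string standardizer (NOT hmatch's string_std):
# lowercase, trim non-alphanumerics at either end, collapse the rest to "_"
std_sketch <- function(x) {
  x <- tolower(x)
  x <- gsub("^[^a-z0-9]+|[^a-z0-9]+$", "", x)
  gsub("[^a-z0-9]+", "_", x)
}

x_std <- std_sketch(c("New York City", "new. york"))

# Split the standardized strings on the default token_split pattern
tokens <- strsplit(x_std, "_")
```

Under these assumptions, both values share the tokens "new" and "york", even though the original strings differ in case, punctuation, and length.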

token_min

minimum number of tokens that must match for a term to be considered matching overall. Defaults to 1.
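One way to picture the effect of token_min is as a threshold on the number of tokens two values have in common. The sketch below is illustrative only: n_shared_tokens() is a hypothetical helper, and the package may count matching tokens differently (for example, after applying the exclusion arguments).

```r
# Count tokens shared by two already-tokenized values (illustrative only;
# n_shared_tokens is a hypothetical helper, not part of hmatch)
n_shared_tokens <- function(a, b) length(intersect(a, b))

raw_tokens <- c("new", "york")
ref_tokens <- c("new", "york", "state")

n_shared <- n_shared_tokens(raw_tokens, ref_tokens)

# With a stricter threshold than the default of 1
token_min <- 2
is_match <- n_shared >= token_min
```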

exclude_freq

exclude tokens from matching if they have a frequency greater than or equal to this value. Refers to the number of unique, string-standardized values at a given hierarchical level in which a given token occurs, as calculated by count_tokens (separately for raw and ref). Defaults to 3.
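The frequency described here can be approximated in base R as follows. This is a sketch of the stated definition using made-up values, not the package's count_tokens implementation: each token is counted once per unique standardized value in which it occurs, and tokens at or above the threshold are flagged for exclusion.

```r
# Unique, already-standardized values at one hierarchical level (made-up data)
values <- unique(c("north_bay", "north_york", "north_perth", "york", "new_york"))

# Tokenize, de-duplicate tokens within each value, then count values per token
tokens_by_value <- lapply(strsplit(values, "_"), unique)
token_freq <- table(unlist(tokens_by_value))

# Tokens occurring in at least exclude_freq values would be ignored
exclude_freq <- 3
frequent <- names(token_freq)[token_freq >= exclude_freq]
```

In this toy data, "north" and "york" each occur in three unique values, so both would be excluded from matching at the default threshold.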

exclude_nchar

exclude tokens from matching if they have nchar less than or equal to this value. Defaults to 3.

exclude_values

character vector of additional tokens to exclude from matching. Subject to string-standardization with argument std_fn.
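The combined effect of exclude_nchar and exclude_values can be sketched as a simple filter over a token vector. The example assumes the tokens and the excluded values are already standardized (lowercase here), and it is an illustration of the documented rules rather than the package's internal code.

```r
# Already-standardized tokens (made-up data)
tokens <- c("el", "san", "new", "york", "city", "county")

exclude_nchar <- 3                      # drop tokens with nchar <= 3
exclude_values <- c("city", "county")   # drop these tokens regardless of length

keep <- nchar(tokens) > exclude_nchar & !(tokens %in% exclude_values)
tokens_kept <- tokens[keep]
```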

fuzzy

logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.

fuzzy_method

if fuzzy = TRUE, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".

fuzzy_dist

if fuzzy = TRUE, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to fuzzy_dist will be considered matching). Defaults to 1L.
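As a rough illustration of the distance threshold, the sketch below uses base R's utils::adist(), which computes a generalized Levenshtein distance. This is an assumption made to keep the example dependency-free: it is not the "osa" method from stringdist, and the two can differ (notably for transposed adjacent characters).

```r
# Edit distance from one raw token to two candidate reference tokens.
# adist() is generalized Levenshtein, used here only as a stand-in for "osa".
d <- as.vector(utils::adist("jeferson", c("jefferson", "jackson")))

# A candidate is treated as matching if its distance is <= fuzzy_dist
fuzzy_dist <- 1L
is_match <- d <= fuzzy_dist
```

Here only "jefferson" falls within a distance of 1 from "jeferson", so it alone would count as a fuzzy match.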

dict

optional dictionary for recoding values within the hierarchical columns of raw (see dictionary_recoding)

ref_prefix

prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".

std_fn

function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.

...

additional arguments passed to std_fn()

Value

a data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details)

Resolve joins

Uses the same approach to resolve joins as hmatch.

Examples

data(ne_raw)
data(ne_ref)

# add tokens to some values within ref to illustrate tokenized matching
ne_ref$adm0[ne_ref$adm0 == "United States"] <- "United States of America"
ne_ref$adm1[ne_ref$adm1 == "New York"] <- "New York State"

hmatch_tokens(ne_raw, ne_ref, type = "inner", token_min = 1)
#>       id adm0           adm1         adm2 level ref_adm0       ref_adm1
#> 1  PID01  USA       New York      Suffolk  adm2      USA New York State
#> 2  PID02  can        ontario         <NA>  adm1      CAN        Ontario
#> 3  PID03  USA       New York Kings County  adm2      USA New York State
#> 4  PID04 <NA>           <NA> Philadelphia  adm2      USA   Pennsylvania
#> 5  PID05  USA           <NA>         York  adm2      USA New York State
#> 6  PID05  USA           <NA>         York  adm2      USA   Pennsylvania
#> 7  PID06  USA      new. york    jefferson  adm2      USA New York State
#> 8  PID07  CAN        Ontario    Peel R.M.  adm2      CAN        Ontario
#> 9  PID10 <NA>           <NA> Bergen, N.J.  adm2      USA     New Jersey
#> 10 PID14 <NA>           <NA>         york  adm2      CAN        Ontario
#> 11 PID14 <NA>           <NA>         york  adm2      USA New York State
#> 12 PID14 <NA>           <NA>         york  adm2      USA   Pennsylvania
#> 13 PID15  USA New York State     New York  adm2      USA New York State
#>        ref_adm2 hcode
#> 1       Suffolk   227
#> 2          <NA>   110
#> 3         Kings   223
#> 4  Philadelphia   237
#> 5      New York   225
#> 6          York   238
#> 7     Jefferson   222
#> 8          Peel   113
#> 9        Bergen   211
#> 10         York   115
#> 11     New York   225
#> 12         York   238
#> 13     New York   225