Implement a variety of hierarchical matching strategies in sequence

Match a data frame with raw, potentially messy hierarchical data (e.g. province, county, township) against a reference dataset, using a variety of matching strategies implemented in sequence to identify the best-possible match (i.e. highest-resolution) for each row.

The sequence of matching strategies is:

(optional) manually-specified matching with hmatch_manual
complete matching with hmatch(..., allow_gaps = FALSE)
partial matching with hmatch(..., allow_gaps = TRUE)
fuzzy partial matching with hmatch(..., allow_gaps = TRUE, fuzzy = TRUE)
best-possible matching with hmatch_settle

Each approach is implemented only on the rows of data for which a single match has not already been identified by the previous approaches.
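The fall-through logic described above can be sketched in base R. This is only an illustration of the general idea, not the package's implementation: the toy data frames and the three matcher functions (exact_match, caseless_match, fuzzy_match) are hypothetical stand-ins for the strategies listed above, and utils::adist() (generalized Levenshtein distance) stands in for the stringdist-based fuzzy matching.

```r
# Toy data (hypothetical, for illustration only)
toy_ref <- data.frame(adm1 = c("New York", "Ontario"),
                      code = c("220", "110"),
                      stringsAsFactors = FALSE)
toy_raw <- data.frame(adm1 = c("New York", "ontario", "Ontarioo", "Quebec"),
                      stringsAsFactors = FALSE)

# Each hypothetical strategy maps a character vector to a code (or NA) per element
exact_match <- function(x) toy_ref$code[match(x, toy_ref$adm1)]
caseless_match <- function(x) toy_ref$code[match(tolower(x), tolower(toy_ref$adm1))]
fuzzy_match <- function(x) {
  vapply(x, function(v) {
    d <- utils::adist(tolower(v), tolower(toy_ref$adm1))
    hit <- which(d <= 1L)
    # accept only an unambiguous (single) candidate
    if (length(hit) == 1L) toy_ref$code[hit] else NA_character_
  }, character(1), USE.NAMES = FALSE)
}

# Strategies are tried in order, from strictest to most permissive
strategies <- list(exact = exact_match,
                   caseless = caseless_match,
                   fuzzy = fuzzy_match)

toy_raw$code <- NA_character_
toy_raw$match_type <- NA_character_

for (nm in names(strategies)) {
  idx <- which(is.na(toy_raw$code))      # rows not yet matched
  if (length(idx) == 0L) break
  found <- strategies[[nm]](toy_raw$adm1[idx])
  hit <- !is.na(found)
  toy_raw$code[idx[hit]] <- found[hit]   # record the match ...
  toy_raw$match_type[idx[hit]] <- nm     # ... and which strategy produced it
}

print(toy_raw)
```

Rows matched by an earlier strategy are never revisited, so match_type records the strictest strategy that succeeded for each row.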

Usage

hmatch_composite(
  raw,
  ref,
  man,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  code_col,
  type = "resolve_left",
  allow_gaps = TRUE,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)

Arguments

raw: data frame containing hierarchical columns with raw data
ref: data frame containing hierarchical columns with reference data
man: (optional) data frame of manually-specified matches, relating a given set of hierarchical values to the code within ref to which those values correspond
pattern: regex pattern to match the hierarchical columns in raw (and man if given) (see also specifying_columns)
pattern_ref: regex pattern to match the hierarchical columns in ref. Defaults to pattern, so it only needs to be specified if the hierarchical columns have different names in raw and ref.
by: vector giving the names of the hierarchical columns in raw (and man if given)
by_ref: vector giving the names of the hierarchical columns in ref. Defaults to by, so it only needs to be specified if the hierarchical columns have different names in raw and ref.
code_col: name of the code column containing codes for matching ref and man (only required if argument man is given)
type: type of join ("resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "resolve_left". See join_types.
allow_gaps: logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE.
fuzzy: logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.
fuzzy_method: if fuzzy = TRUE, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".
fuzzy_dist: if fuzzy = TRUE, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to fuzzy_dist will be considered matching). Defaults to 1L.
dict: optional dictionary for recoding values within the hierarchical columns of raw (see dictionary_recoding)
ref_prefix: prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".
std_fn: function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.
...: additional arguments passed to std_fn()
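To make the interaction between std_fn and fuzzy_dist concrete, here is a minimal base-R sketch. The functions std_sketch and is_fuzzy_match are hypothetical and are not part of the package. The standardization rules shown (lowercasing, collapsing runs of non-alphanumeric characters to "_") are an assumption about what a function like string_std might do, and utils::adist() computes generalized Levenshtein distance rather than "osa" (which additionally counts a swap of two adjacent characters as a single edit).

```r
# Hypothetical standardization function (assumed behaviour, not string_std itself)
std_sketch <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)   # collapse punctuation/whitespace runs
  gsub("^_+|_+$", "", x)            # trim leading/trailing underscores
}

# Two strings "match" if their distance after standardization is <= max_dist
is_fuzzy_match <- function(a, b, max_dist = 1L, std = std_sketch) {
  if (!is.null(std)) {              # std = NULL skips standardization
    a <- std(a)
    b <- std(b)
  }
  as.vector(utils::adist(a, b)) <= max_dist
}

is_fuzzy_match("King", "Kings")                                     # TRUE (distance 1)
is_fuzzy_match("new. york", "New York", max_dist = 0L)              # TRUE after standardization
is_fuzzy_match("new. york", "New York", max_dist = 0L, std = NULL)  # FALSE without it
```

Standardization absorbs differences in case and punctuation before any distance is computed, so fuzzy_dist only has to cover genuine spelling variation.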

Value

A data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details).

Examples

data(ne_raw)
data(ne_ref)

hmatch_composite(ne_raw, ne_ref, fuzzy = TRUE)
#>       id adm0           adm1         adm2 level ref_adm0     ref_adm1
#> 1  PID01  USA       New York      Suffolk  adm2      USA     New York
#> 2  PID02  can        ontario         <NA>  adm1      CAN      Ontario
#> 3  PID03  USA       New York Kings County  adm1      USA     New York
#> 4  PID04 <NA>           <NA> Philadelphia  adm2      USA Pennsylvania
#> 5  PID05  USA           <NA>         York  adm2      USA Pennsylvania
#> 6  PID06  USA      new. york    jefferson  adm2      USA     New York
#> 7  PID07  CAN        Ontario    Peel R.M.  adm1      CAN      Ontario
#> 8  PID08  USA    Pensylvania       Ithaca  adm1      USA Pennsylvania
#> 9  PID09  USA       New_York         King  adm2      USA     New York
#> 10 PID10 <NA>           <NA> Bergen, N.J.  <NA>     <NA>         <NA>
#> 11 PID11  USA   Philadelphia         <NA>  adm0      USA         <NA>
#> 12 PID12  USA             NJ         <NA>  adm0      USA         <NA>
#> 13 PID13 <NA>           <NA>    Jeffersen  adm0      USA         <NA>
#> 14 PID14 <NA>           <NA>         york  <NA>     <NA>         <NA>
#> 15 PID15  USA New York State     New York  adm0      USA         <NA>
#>        ref_adm2 hcode match_type
#> 1       Suffolk   227   complete
#> 2          <NA>   110   complete
#> 3          <NA>   220     settle
#> 4  Philadelphia   237       gaps
#> 5          York   238       gaps
#> 6     Jefferson   222   complete
#> 7          <NA>   110     settle
#> 8          <NA>   230     settle
#> 9         Kings   223      fuzzy
#> 10         <NA>  <NA>       <NA>
#> 11         <NA>   200     settle
#> 12         <NA>   200     settle
#> 13         <NA>   200     settle
#> 14         <NA>  <NA>       <NA>
#> 15         <NA>   200     settle
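
One way to review the result is to tabulate which strategy resolved each row. The sketch below uses a vector transcribed from the match_type column printed above, so it runs without the package. With a real result, the same call would presumably be applied to the match_type column of the returned data frame.

```r
# match_type values transcribed from the printed output above (rows 1 to 15)
match_type <- c("complete", "complete", "settle", "gaps", "gaps",
                "complete", "settle", "settle", "fuzzy", NA,
                "settle", "settle", "settle", NA, "settle")

# Count rows per strategy, keeping unmatched rows (NA) visible
counts <- table(match_type, useNA = "ifany")
print(counts)

# Rows for which no strategy found a match
n_unmatched <- sum(is.na(match_type))
```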