Hierarchical matching with sequential column permutation to allow for values entered at the wrong hierarchical level

Match a data frame with raw, potentially messy hierarchical data (e.g. province, county, township) against a reference dataset, using sequential permutation of the hierarchical columns to allow for values entered at the wrong hierarchical level.

The function calls hmatch on each possible permutation of the hierarchical columns, and then combines the results. Rows of raw yielding multiple matches to ref can optionally be resolved using a resolve-type join (see section Resolve joins below).

Usage

hmatch_permute(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  type = "left",
  allow_gaps = TRUE,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)

Arguments

raw

data frame containing hierarchical columns with raw data

ref

data frame containing hierarchical columns with reference data

pattern

regex pattern to match the hierarchical columns in raw

Note: hierarchical column names can be matched using either the pattern or by arguments. Or, if neither pattern or by are specified, the hierarchical columns are assumed to be all column names that are common to both raw and ref. See specifying_columns.

pattern_ref

regex pattern to match the hierarchical columns in ref. Defaults to pattern, so only need to specify if the hierarchical columns have different names in raw and ref.

by

vector giving the names of the hierarchical columns in raw

by_ref

vector giving the names of the hierarchical columns in ref. Defaults to by, so only need to specify if the hierarchical columns have different names in raw and ref.

type

type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types.

allow_gaps

logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE.

fuzzy

logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.

fuzzy_method

if fuzzy = TRUE, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".

fuzzy_dist

if fuzzy = TRUE, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to fuzzy_dist will be considered matching). Defaults to 1L.

dict

optional dictionary for recoding values within the hierarchical columns of raw (see dictionary_recoding)

ref_prefix

prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".

std_fn

function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.

...

additional arguments passed to std_fn()

Value

a data frame obtained by matching the hierarchical columns in raw

and ref, using the join type specified by argument type (see join_types for more details)

Resolve joins

In hmatch_permute, if argument type corresponds to a resolve join, rows of raw with multiple matches to ref are resolved to the highest hierarchical level that is common among all matches (or no match if there is a conflict at the very first level). E.g.

raw:
1. | United States | <NA> | New York |

In a regular join with hmatch_permute, the single row from raw (above) will match both of the depicted rows from ref. However, in a resolve join the two matches will resolve to the first row from ref, because it reflects the highest hierarchical level that is common to all matches.

Examples

data(ne_raw)
data(ne_ref)

hmatch_permute(ne_raw, ne_ref, pattern = "^adm", type = "inner")
#>      id adm0         adm1         adm2 level ref_adm0     ref_adm1     ref_adm2
#> 1 PID01  USA     New York      Suffolk  adm2      USA     New York      Suffolk
#> 2 PID02  can      ontario         <NA>  adm1      CAN      Ontario         <NA>
#> 3 PID04 <NA>         <NA> Philadelphia  adm2      USA Pennsylvania Philadelphia
#> 4 PID05  USA         <NA>         York  adm2      USA Pennsylvania         York
#> 5 PID06  USA    new. york    jefferson  adm2      USA     New York    Jefferson
#> 6 PID11  USA Philadelphia         <NA>  adm2      USA Pennsylvania Philadelphia
#> 7 PID14 <NA>         <NA>         york  adm2      CAN      Ontario         York
#> 8 PID14 <NA>         <NA>         york  adm2      USA Pennsylvania         York
#>   hcode
#> 1   227
#> 2   110
#> 3   237
#> 4   238
#> 5   222
#> 6   237
#> 7   115
#> 8   238