Match a data frame with raw, potentially messy hierarchical data (e.g. province, county, township) against a reference dataset, using a variety of matching strategies implemented in sequence to identify the best-possible match (i.e. highest-resolution) for each row.

The sequence of matching strategies is:

  1. (optional) manually-specified matching with hmatch_manual

  2. complete matching with hmatch(..., allow_gaps = FALSE)

  3. partial matching with hmatch(..., allow_gaps = TRUE)

  4. fuzzy partial matching with hmatch(..., allow_gaps = TRUE, fuzzy = TRUE)

  5. best-possible matching with hmatch_settle

Each approach is applied only to the rows of data for which a single match has not already been identified by the previous approaches.
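The "try each strategy only on rows that are still unmatched" logic described above can be sketched in base R. This is a toy illustration under stated assumptions, not the package's implementation: the data, the three strategy functions (`match_exact`, `match_caseless`, `match_fuzzy`), and the use of `utils::adist` (Levenshtein distance) are all hypothetical stand-ins chosen so the example runs without extra packages.

```r
# Toy data (hypothetical values, not the package's ne_raw / ne_ref)
ref <- c("new york", "ontario", "pennsylvania")
raw <- c("new york", "Ontario", "pensylvania", "quebec")

# Each strategy returns the index of the matching entry in ref, or NA if none
match_exact    <- function(x) match(x, ref)
match_caseless <- function(x) match(tolower(x), ref)
match_fuzzy    <- function(x) {
  vapply(x, function(v) {
    d <- utils::adist(tolower(v), ref)[1, ]
    hits <- which(d <= 1)
    # accept only an unambiguous (single) candidate
    if (length(hits) == 1L) hits else NA_integer_
  }, integer(1), USE.NAMES = FALSE)
}

# Strategies are tried in order, from strictest to most permissive
strategies <- list(
  exact    = match_exact,
  caseless = match_caseless,
  fuzzy    = match_fuzzy
)

result <- data.frame(
  raw = raw,
  ref = NA_character_,
  match_type = NA_character_,
  stringsAsFactors = FALSE
)

for (nm in names(strategies)) {
  # only rows without a match so far are passed to the next strategy
  todo <- which(is.na(result$ref))
  if (length(todo) == 0L) break
  idx <- strategies[[nm]](result$raw[todo])
  found <- !is.na(idx)
  result$ref[todo[found]] <- ref[idx[found]]
  result$match_type[todo[found]] <- nm
}

print(result)
```

Recording the name of the strategy that produced each match mirrors the match_type column in the output of hmatch_composite; rows that no strategy can resolve keep NA in both columns.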

Usage

hmatch_composite(
  raw,
  ref,
  man,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  code_col,
  type = "resolve_left",
  allow_gaps = TRUE,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)

Arguments

raw

data frame containing hierarchical columns with raw data

ref

data frame containing hierarchical columns with reference data

man

(optional) data frame of manually-specified matches, relating a given set of hierarchical values to the code within ref to which those values correspond

pattern

regex pattern to match the hierarchical columns in raw (and man if given) (see also specifying_columns)

pattern_ref

regex pattern to match the hierarchical columns in ref. Defaults to pattern, so only need to specify if the hierarchical columns have different names in raw and ref.

by

vector giving the names of the hierarchical columns in raw (and man if given)

by_ref

vector giving the names of the hierarchical columns in ref. Defaults to by, so only need to specify if the hierarchical columns have different names in raw and ref.

code_col

name of the code column containing codes for matching ref and man (only required if argument man is given)

type

type of join ("resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "resolve_left". See join_types.

allow_gaps

logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE.

fuzzy

logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.

fuzzy_method

if fuzzy = TRUE, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".

fuzzy_dist

if fuzzy = TRUE, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to fuzzy_dist will be considered matching). Defaults to 1L.

dict

optional dictionary for recoding values within the hierarchical columns of raw (see dictionary_recoding)

ref_prefix

prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".

std_fn

function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.

...

additional arguments passed to std_fn()
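To illustrate how std_fn, fuzzy_method, and fuzzy_dist interact, here is a minimal base R sketch. It rests on assumptions: `toy_std` is a hypothetical standardizer written for this example (it is not the package's string_std), and `utils::adist` computes Levenshtein distance, used here only as a dependency-free stand-in for the "osa" method from stringdist.

```r
# Hypothetical standardizer, for illustration only (not string_std)
toy_std <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)  # collapse runs of non-alphanumerics
  gsub("^_+|_+$", "", x)           # trim leading/trailing separators
}

raw_vals <- c("new. york", "New_York", "Pensylvania")
ref_vals <- c("New York", "Pennsylvania")

std_raw <- toy_std(raw_vals)
std_ref <- toy_std(ref_vals)

# Exact matching on standardized strings: index into ref_vals, NA if none
exact_idx <- match(std_raw, std_ref)

# Pairwise string distances (rows = raw values, columns = ref values)
d <- utils::adist(std_raw, std_ref)

# A pair counts as a fuzzy match when its distance is <= fuzzy_dist
fuzzy_dist <- 1L
is_match <- d <= fuzzy_dist

print(exact_idx)
print(is_match)
```

In this sketch the first two raw values match "New York" exactly once case and punctuation are standardized, while "Pensylvania" matches "Pennsylvania" only under the fuzzy threshold, because the two strings differ by a single inserted character.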

Value

a data frame obtained by matching the hierarchical columns in raw and ref, using the join type specified by argument type (see join_types for more details)

Examples

library(hmatch)

data(ne_raw)
data(ne_ref)

hmatch_composite(ne_raw, ne_ref, fuzzy = TRUE)
#>       id adm0           adm1         adm2 level ref_adm0     ref_adm1
#> 1  PID01  USA       New York      Suffolk  adm2      USA     New York
#> 2  PID02  can        ontario         <NA>  adm1      CAN      Ontario
#> 3  PID03  USA       New York Kings County  adm1      USA     New York
#> 4  PID04 <NA>           <NA> Philadelphia  adm2      USA Pennsylvania
#> 5  PID05  USA           <NA>         York  adm2      USA Pennsylvania
#> 6  PID06  USA      new. york    jefferson  adm2      USA     New York
#> 7  PID07  CAN        Ontario    Peel R.M.  adm1      CAN      Ontario
#> 8  PID08  USA    Pensylvania       Ithaca  adm1      USA Pennsylvania
#> 9  PID09  USA       New_York         King  adm2      USA     New York
#> 10 PID10 <NA>           <NA> Bergen, N.J.  <NA>     <NA>         <NA>
#> 11 PID11  USA   Philadelphia         <NA>  adm0      USA         <NA>
#> 12 PID12  USA             NJ         <NA>  adm0      USA         <NA>
#> 13 PID13 <NA>           <NA>    Jeffersen  adm0      USA         <NA>
#> 14 PID14 <NA>           <NA>         york  <NA>     <NA>         <NA>
#> 15 PID15  USA New York State     New York  adm0      USA         <NA>
#>        ref_adm2 hcode match_type
#> 1       Suffolk   227   complete
#> 2          <NA>   110   complete
#> 3          <NA>   220     settle
#> 4  Philadelphia   237       gaps
#> 5          York   238       gaps
#> 6     Jefferson   222   complete
#> 7          <NA>   110     settle
#> 8          <NA>   230     settle
#> 9         Kings   223      fuzzy
#> 10         <NA>  <NA>       <NA>
#> 11         <NA>   200     settle
#> 12         <NA>   200     settle
#> 13         <NA>   200     settle
#> 14         <NA>  <NA>       <NA>
#> 15         <NA>   200     settle
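A quick way to review a result like the one above is to tabulate match_type and list the rows left unmatched. The sketch below does not call the package: the `res` data frame is transcribed by hand from the id, level, and match_type columns of the printed output, so that the snippet runs on its own.

```r
# Columns transcribed from the printed output above
res <- data.frame(
  id = sprintf("PID%02d", 1:15),
  level = c("adm2", "adm1", "adm1", "adm2", "adm2", "adm2", "adm1", "adm1",
            "adm2", NA, "adm0", "adm0", "adm0", NA, "adm0"),
  match_type = c("complete", "complete", "settle", "gaps", "gaps", "complete",
                 "settle", "settle", "fuzzy", NA, "settle", "settle", "settle",
                 NA, "settle"),
  stringsAsFactors = FALSE
)

# How many rows were resolved by each strategy (NA = no match found)
counts <- table(res$match_type, useNA = "ifany")
print(counts)

# Rows for which no strategy found a match
unmatched <- res$id[is.na(res$match_type)]
print(unmatched)

# Rows that were matched only at a coarser level via hmatch_settle
settled <- res[which(res$match_type == "settle"), c("id", "level")]
print(settled)
```

Rows flagged as settle or left unmatched are natural candidates for manual review, for example by supplying corrections through the man argument on a subsequent run.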