Skip to contents

Match sets of hierarchical values (e.g. province / county / township) in a raw, messy dataset to corresponding values within a reference dataset, sequentially over each hierarchical level. Specifically, implements hmatch at each successive hierarchical level, starting with only the first level (lowest resolution), then first and second, first second and third, etc.

After the initial matching over all levels, users can optionally use a resolve join to 'settle' for the highest match possible for each row of raw data, even if that match is below the highest-resolution level specified.

Usage

hmatch_settle(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  type = "left",
  allow_gaps = TRUE,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)

Arguments

raw

data frame containing hierarchical columns with raw data

ref

data frame containing hierarchical columns with reference data

pattern

regex pattern to match the hierarchical columns in raw

Note: hierarchical column names can be matched using either the pattern or by arguments. Or, if neither pattern or by are specified, the hierarchical columns are assumed to be all column names that are common to both raw and ref. See specifying_columns.

pattern_ref

regex pattern to match the hierarchical columns in ref. Defaults to pattern, so only need to specify if the hierarchical columns have different names in raw and ref.

by

vector giving the names of the hierarchical columns in raw

by_ref

vector giving the names of the hierarchical columns in ref. Defaults to by, so only need to specify if the hierarchical columns have different names in raw and ref.

type

type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types.

allow_gaps

logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE.

fuzzy

logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.

fuzzy_method

if fuzzy = TRUE, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".

fuzzy_dist

if fuzzy = TRUE, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to fuzzy_dist will be considered matching). Defaults to 1L.

dict

optional dictionary for recoding values within the hierarchical columns of raw (see dictionary_recoding)

ref_prefix

prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".

std_fn

function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.

...

additional arguments passed to std_fn()

Value

a data frame obtained by matching the hierarchical columns in raw

and ref, using the join type specified by argument type (see join_types for more details)

Resolve joins

In a resolve type join with hmatch_settle, rows of raw with multiple matches to ref are resolved to the highest hierarchical level that is non-conflicting among all matches (or no match if there is a conflict at the very first level). E.g.

raw:
1. | United States | <NA> | Jefferson |

Relevant rows from ref:
1. | United States | <NA> | <NA> |
2. | United States | New York | Jefferson |
3. | United States | Pennsylvania | Jefferson |

In a regular join, the single row from raw (above) will match all three rows from ref. However, in a resolve join the multiple matches will be resolved to the first row from ref, because only the first hierarchical level ("United States") is non-conflicting among all possible matches.

Note that there's a distinction between "common" values at a given hierarchical level (i.e. a single unique value in each row) and "non-conflicting" values (i.e. a single unique value or a missing value). E.g.

raw:
1. | United States | New York | New York |

Relevant rows from ref:
1. | United States | <NA> | <NA> |
2. | United States | New York | <NA> |
3. | United States | New York | New York |

In the example above, only the 1st hierarchical level ("United States") is "common" to all matches, but all hierarchical levels are "non-conflicting" (i.e. because row 2 is a hierarchical child of row 1, and row 3 a child of row 2), and so a resolve-type match will be made to the 3rd row in ref.

Examples

data(ne_raw)
data(ne_ref)

# return matches at all levels
hmatch_settle(ne_raw, ne_ref, pattern = "^adm", type = "inner")
#>       id adm0           adm1         adm2 level ref_adm0     ref_adm1
#> 1  PID01  USA       New York      Suffolk  adm0      USA         <NA>
#> 2  PID01  USA       New York      Suffolk  adm1      USA     New York
#> 3  PID01  USA       New York      Suffolk  adm2      USA     New York
#> 4  PID02  can        ontario         <NA>  adm0      CAN         <NA>
#> 5  PID02  can        ontario         <NA>  adm1      CAN      Ontario
#> 6  PID03  USA       New York Kings County  adm0      USA         <NA>
#> 7  PID03  USA       New York Kings County  adm1      USA     New York
#> 8  PID04 <NA>           <NA> Philadelphia  adm2      USA Pennsylvania
#> 9  PID05  USA           <NA>         York  adm0      USA         <NA>
#> 10 PID05  USA           <NA>         York  adm2      USA Pennsylvania
#> 11 PID06  USA      new. york    jefferson  adm0      USA         <NA>
#> 12 PID06  USA      new. york    jefferson  adm1      USA     New York
#> 13 PID06  USA      new. york    jefferson  adm2      USA     New York
#> 14 PID07  CAN        Ontario    Peel R.M.  adm0      CAN         <NA>
#> 15 PID07  CAN        Ontario    Peel R.M.  adm1      CAN      Ontario
#> 16 PID08  USA    Pensylvania       Ithaca  adm0      USA         <NA>
#> 17 PID09  USA       New_York         King  adm0      USA         <NA>
#> 18 PID09  USA       New_York         King  adm1      USA     New York
#> 19 PID11  USA   Philadelphia         <NA>  adm0      USA         <NA>
#> 20 PID12  USA             NJ         <NA>  adm0      USA         <NA>
#> 21 PID14 <NA>           <NA>         york  adm2      CAN      Ontario
#> 22 PID14 <NA>           <NA>         york  adm2      USA Pennsylvania
#> 23 PID15  USA New York State     New York  adm0      USA         <NA>
#>        ref_adm2 hcode
#> 1          <NA>   200
#> 2          <NA>   220
#> 3       Suffolk   227
#> 4          <NA>   100
#> 5          <NA>   110
#> 6          <NA>   200
#> 7          <NA>   220
#> 8  Philadelphia   237
#> 9          <NA>   200
#> 10         York   238
#> 11         <NA>   200
#> 12         <NA>   220
#> 13    Jefferson   222
#> 14         <NA>   100
#> 15         <NA>   110
#> 16         <NA>   200
#> 17         <NA>   200
#> 18         <NA>   220
#> 19         <NA>   200
#> 20         <NA>   200
#> 21         York   115
#> 22         York   238
#> 23         <NA>   200

# use a resolve join to settle for the best possible match for each row
hmatch_settle(ne_raw, ne_ref, pattern = "^adm", type = "resolve_inner")
#>       id adm0           adm1         adm2 level ref_adm0     ref_adm1
#> 1  PID01  USA       New York      Suffolk  adm2      USA     New York
#> 2  PID02  can        ontario         <NA>  adm1      CAN      Ontario
#> 3  PID03  USA       New York Kings County  adm1      USA     New York
#> 4  PID04 <NA>           <NA> Philadelphia  adm2      USA Pennsylvania
#> 5  PID05  USA           <NA>         York  adm2      USA Pennsylvania
#> 6  PID06  USA      new. york    jefferson  adm2      USA     New York
#> 7  PID07  CAN        Ontario    Peel R.M.  adm1      CAN      Ontario
#> 8  PID08  USA    Pensylvania       Ithaca  adm0      USA         <NA>
#> 9  PID09  USA       New_York         King  adm1      USA     New York
#> 10 PID11  USA   Philadelphia         <NA>  adm0      USA         <NA>
#> 11 PID12  USA             NJ         <NA>  adm0      USA         <NA>
#> 12 PID15  USA New York State     New York  adm0      USA         <NA>
#>        ref_adm2 hcode
#> 1       Suffolk   227
#> 2          <NA>   110
#> 3          <NA>   220
#> 4  Philadelphia   237
#> 5          York   238
#> 6     Jefferson   222
#> 7          <NA>   110
#> 8          <NA>   200
#> 9          <NA>   220
#> 10         <NA>   200
#> 11         <NA>   200
#> 12         <NA>   200