Hierarchical matching, separately at each hierarchical level

Implements hierarchical matching, separately at each hierarchical level within the data. For a given level, the raw data that is matched includes every unique combination of values at and below the level of interest. E.g.

Level 1:
| Canada |
| United States |
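
The per-level splitting described above can be sketched in base R. This is only an illustration, not the package's implementation: the toy data frame and the helper `unique_at_level()` are hypothetical, and dropping rows that are missing at the level of interest is an assumption.

```r
# Toy data (hypothetical, not the package's ne_raw)
raw <- data.frame(
  adm0 = c("Canada", "Canada", "United States", "United States"),
  adm1 = c("Ontario", "Ontario", "New York", NA),
  adm2 = c("York", "Peel", "Suffolk", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: unique combinations of values in columns 1..level
unique_at_level <- function(dat, level) {
  cols <- dat[, seq_len(level), drop = FALSE]
  # Assumption: rows with a missing value at the level of interest are dropped
  cols <- cols[!is.na(cols[[level]]), , drop = FALSE]
  out <- unique(cols)
  rownames(out) <- NULL
  out
}

# One data frame of unique combinations per hierarchical level
splits <- lapply(seq_along(raw), function(i) unique_at_level(raw, i))
names(splits) <- names(raw)

splits$adm0  # two rows: Canada, United States
splits$adm1  # two rows: Canada/Ontario, United States/New York
```

Each element of `splits` would then be passed to the matching function separately.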

Usage

hmatch_split(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  fn = "hmatch",
  type = "left",
  allow_gaps = TRUE,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...,
  levels = NULL,
  always_list = FALSE,
  man,
  code_col,
  always_tokenize = FALSE,
  token_split = "_",
  exclude_freq = 3,
  exclude_nchar = 3,
  exclude_values = NULL
)

Arguments

raw

data frame containing hierarchical columns with raw data

ref

data frame containing hierarchical columns with reference data

pattern

regex pattern to match the hierarchical columns in raw

Note: hierarchical column names can be matched using either the pattern or by arguments. If neither pattern nor by is specified, the hierarchical columns are assumed to be all column names that are common to both raw and ref. See specifying_columns.

pattern_ref

regex pattern to match the hierarchical columns in ref. Defaults to pattern, so it only needs to be specified if the hierarchical columns have different names in raw and ref.

by

vector giving the names of the hierarchical columns in raw

by_ref

vector giving the names of the hierarchical columns in ref. Defaults to by, so it only needs to be specified if the hierarchical columns have different names in raw and ref.

fn

which function to use for matching. Current options are hmatch, hmatch_permute, hmatch_tokens, hmatch_settle, or hmatch_composite. Defaults to "hmatch".

Note that some subsequent arguments are only relevant for specific functions (e.g. the exclude_ arguments are only relevant if fn = "hmatch_tokens").

type

type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types.

Note that the details of resolve joins vary somewhat among hmatch functions (see documentation for the relevant function), and that function hmatch_composite only allows resolve joins.

allow_gaps

logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE.

fuzzy

logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.

fuzzy_method

if fuzzy = TRUE, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".

fuzzy_dist

if fuzzy = TRUE, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to fuzzy_dist will be considered matching). Defaults to 1L.
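
As a rough illustration of the fuzzy_dist threshold, the sketch below classifies pairs of strings as matching when their edit distance is at most fuzzy_dist. It uses base R's `utils::adist()` (Levenshtein distance) as a stand-in for the stringdist package's "osa" method. The two agree here because none of the pairs involves an adjacent transposition. The example strings are hypothetical standardized values.

```r
# Hypothetical standardized values from raw and ref
raw_val <- c("pensylvania", "jeffersen", "nj")
ref_val <- c("pennsylvania", "jefferson", "new_jersey")

fuzzy_dist <- 1L

# Pairwise edit distance; adist() is a base-R stand-in for the "osa" method
d <- mapply(
  function(a, b) as.vector(utils::adist(a, b)),
  raw_val, ref_val
)

# A pair is classified as matching if its distance is <= fuzzy_dist
is_match <- d <= fuzzy_dist
is_match
```

The first two pairs differ by a single insertion or substitution and so fall within the threshold, whereas the abbreviation "nj" does not.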

dict

optional dictionary for recoding values within the hierarchical columns of raw (see dictionary_recoding)

ref_prefix

prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".

std_fn

function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.

...

additional arguments passed to std_fn()

levels

a vector of names or integer indices corresponding to one or more of the hierarchical columns in raw at which to match. Defaults to NULL, in which case matches are made at each hierarchical level.

always_list

logical indicating whether to always return a list, even when argument levels specifies a single match level. Defaults to FALSE.

man

(optional) data frame of manually-specified matches, relating a given set of hierarchical values to the code within ref to which those values correspond

code_col

name of the code column containing codes for matching ref and man (only required if argument man is given)

always_tokenize

logical indicating whether to tokenize all values prior to matching (TRUE), or to first attempt non-tokenized matching with hmatch and only tokenize values within raw (and corresponding putative matches within ref) that don't have a non-tokenized match (FALSE). Defaults to FALSE.

token_split

regex pattern to split strings into tokens. Currently tokenization is implemented after string-standardization with argument std_fn (this may change in a future version), so the regex pattern should split standardized strings rather than the original strings. Defaults to "_".

exclude_freq

exclude tokens from matching if they have a frequency greater than or equal to this value. Refers to the number of unique, string-standardized values at a given hierarchical level in which a given token occurs, as calculated by count_tokens (separately for raw and ref). Defaults to 3.

exclude_nchar

exclude tokens from matching if they have nchar less than or equal to this value. Defaults to 3.

exclude_values

character vector of additional tokens to exclude from matching. Subject to string-standardization with argument std_fn.
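
The token-related arguments can be illustrated with a small base-R sketch. It is not the package's implementation (which the documentation says relies on count_tokens). The input values are hypothetical and are assumed to be already standardized to lowercase with "_" separators.

```r
# Hypothetical standardized values at one hierarchical level
vals <- c("kings_county", "suffolk_county", "bergen_county", "peel_rm")

token_split    <- "_"
exclude_freq   <- 3
exclude_nchar  <- 3
exclude_values <- NULL

# Split each unique value into tokens
tokens <- strsplit(unique(vals), token_split)

# Number of unique values in which each token occurs
freq <- table(unlist(lapply(tokens, unique)))
tok  <- names(freq)
n    <- as.integer(freq)

# Tokens excluded for being too frequent, too short, or explicitly listed
excluded <- union(
  tok[n >= exclude_freq | nchar(tok) <= exclude_nchar],
  exclude_values
)

# Tokens that remain available for matching, per value
kept <- lapply(tokens, setdiff, excluded)
names(kept) <- unique(vals)
kept
```

In this toy example "county" is dropped because it occurs in three values and "rm" because it is only two characters long, leaving one distinguishing token per value.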

Value

A list of data frames, each returned by a call to fn on the unique combination of hierarchical values at the given hierarchical level. The number of elements in the list corresponds to the number of hierarchical columns in raw, or, if specified, the number of elements in argument levels.

However, if always_list = FALSE and length(levels) == 1, a single data frame is returned (i.e. not wrapped in a list).

Examples

data(ne_raw)
data(ne_ref)

# by default calls fn `hmatch` separately for each hierarchical level
hmatch_split(ne_raw, ne_ref)
#> $adm0
#>   adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1  CAN <NA> <NA>  adm0      CAN     <NA>     <NA>   100
#> 2  USA <NA> <NA>  adm0      USA     <NA>     <NA>   200
#> 3  can <NA> <NA>  adm0      CAN     <NA>     <NA>   100
#> 
#> $adm1
#>   adm0           adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1  CAN        Ontario <NA>  adm1      CAN  Ontario     <NA>   110
#> 2  USA             NJ <NA>  <NA>     <NA>     <NA>     <NA>  <NA>
#> 3  USA       New York <NA>  adm1      USA New York     <NA>   220
#> 4  USA New York State <NA>  <NA>     <NA>     <NA>     <NA>  <NA>
#> 5  USA       New_York <NA>  adm1      USA New York     <NA>   220
#> 6  USA    Pensylvania <NA>  <NA>     <NA>     <NA>     <NA>  <NA>
#> 7  USA   Philadelphia <NA>  <NA>     <NA>     <NA>     <NA>  <NA>
#> 8  USA      new. york <NA>  adm1      USA New York     <NA>   220
#> 9  can        ontario <NA>  adm1      CAN  Ontario     <NA>   110
#> 
#> $adm2
#>    adm0           adm1         adm2 level ref_adm0     ref_adm1     ref_adm2
#> 1  <NA>           <NA> Bergen, N.J.  <NA>     <NA>         <NA>         <NA>
#> 2  <NA>           <NA>    Jeffersen  <NA>     <NA>         <NA>         <NA>
#> 3  <NA>           <NA> Philadelphia  adm2      USA Pennsylvania Philadelphia
#> 4  <NA>           <NA>         york  adm2      CAN      Ontario         York
#> 5  <NA>           <NA>         york  adm2      USA Pennsylvania         York
#> 6   CAN        Ontario    Peel R.M.  <NA>     <NA>         <NA>         <NA>
#> 7   USA           <NA>         York  adm2      USA Pennsylvania         York
#> 8   USA       New York Kings County  <NA>     <NA>         <NA>         <NA>
#> 9   USA       New York      Suffolk  adm2      USA     New York      Suffolk
#> 10  USA New York State     New York  <NA>     <NA>         <NA>         <NA>
#> 11  USA       New_York         King  <NA>     <NA>         <NA>         <NA>
#> 12  USA    Pensylvania       Ithaca  <NA>     <NA>         <NA>         <NA>
#> 13  USA      new. york    jefferson  adm2      USA     New York    Jefferson
#>    hcode
#> 1   <NA>
#> 2   <NA>
#> 3    237
#> 4    115
#> 5    238
#> 6   <NA>
#> 7    238
#> 8   <NA>
#> 9    227
#> 10  <NA>
#> 11  <NA>
#> 12  <NA>
#> 13   222
#> 

# can also specify other hmatch functions, and subsets of hierarchical levels
hmatch_split(ne_raw, ne_ref, fn = "hmatch_tokens", levels = 2:3)
#> $adm1
#>    adm0           adm1 adm2 level ref_adm0   ref_adm1 ref_adm2 hcode
#> 1   CAN        Ontario <NA>  adm1      CAN    Ontario     <NA>   110
#> 2   USA             NJ <NA>  <NA>     <NA>       <NA>     <NA>  <NA>
#> 3   USA       New York <NA>  adm1      USA New Jersey     <NA>   210
#> 4   USA       New York <NA>  adm1      USA   New York     <NA>   220
#> 5   USA New York State <NA>  adm1      USA New Jersey     <NA>   210
#> 6   USA New York State <NA>  adm1      USA   New York     <NA>   220
#> 7   USA       New_York <NA>  adm1      USA New Jersey     <NA>   210
#> 8   USA       New_York <NA>  adm1      USA   New York     <NA>   220
#> 9   USA    Pensylvania <NA>  <NA>     <NA>       <NA>     <NA>  <NA>
#> 10  USA   Philadelphia <NA>  <NA>     <NA>       <NA>     <NA>  <NA>
#> 11  USA      new. york <NA>  adm1      USA New Jersey     <NA>   210
#> 12  USA      new. york <NA>  adm1      USA   New York     <NA>   220
#> 13  can        ontario <NA>  adm1      CAN    Ontario     <NA>   110
#> 
#> $adm2
#>    adm0           adm1         adm2 level ref_adm0     ref_adm1     ref_adm2
#> 1  <NA>           <NA> Bergen, N.J.  adm2      USA   New Jersey       Bergen
#> 2  <NA>           <NA>    Jeffersen  <NA>     <NA>         <NA>         <NA>
#> 3  <NA>           <NA> Philadelphia  adm2      USA Pennsylvania Philadelphia
#> 4  <NA>           <NA>         york  adm2      CAN      Ontario         York
#> 5  <NA>           <NA>         york  adm2      USA     New York     New York
#> 6  <NA>           <NA>         york  adm2      USA Pennsylvania         York
#> 7   CAN        Ontario    Peel R.M.  adm2      CAN      Ontario         Peel
#> 8   USA           <NA>         York  adm2      USA     New York     New York
#> 9   USA           <NA>         York  adm2      USA Pennsylvania         York
#> 10  USA       New York Kings County  adm2      USA     New York        Kings
#> 11  USA       New York      Suffolk  adm2      USA     New York      Suffolk
#> 12  USA New York State     New York  adm2      USA     New York     New York
#> 13  USA       New_York         King  <NA>     <NA>         <NA>         <NA>
#> 14  USA    Pensylvania       Ithaca  <NA>     <NA>         <NA>         <NA>
#> 15  USA      new. york    jefferson  adm2      USA     New York    Jefferson
#>    hcode
#> 1    211
#> 2   <NA>
#> 3    237
#> 4    115
#> 5    225
#> 6    238
#> 7    113
#> 8    225
#> 9    238
#> 10   223
#> 11   227
#> 12   225
#> 13  <NA>
#> 14  <NA>
#> 15   222
#>