Hierarchical matching, separately at each hierarchical level
Source: R/hmatch_split.R
Implements hierarchical matching, separately at each hierarchical level within the data. For a given level, the raw data that is matched includes every unique combination of values at and below the level of interest. E.g.
Level 1:
| Canada |
| United States |

Level 2:
| Canada | Ontario |
| United States | New York |
| United States | Pennsylvania |

Level 3:
| Canada | Ontario | Ottawa |
| Canada | Ontario | Toronto |
| United States | New York | Bronx |
| United States | New York | New York |
| United States | Pennsylvania | Philadelphia |
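The level-by-level splitting described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the toy data frame, its column names, and the object `split_levels` are hypothetical, and the sketch simply drops the columns beyond the level of interest rather than keeping them as missing values.

```r
# Toy data frame with three hierarchical columns (illustrative names only)
raw <- data.frame(
  adm0 = c("Canada", "Canada", "United States", "United States", "United States"),
  adm1 = c("Ontario", "Ontario", "New York", "New York", "Pennsylvania"),
  adm2 = c("Ottawa", "Toronto", "Bronx", "New York", "Philadelphia"),
  stringsAsFactors = FALSE
)

# For each level k, keep columns 1..k and drop duplicated rows
split_levels <- lapply(seq_along(raw), function(k) {
  out <- unique(raw[, seq_len(k), drop = FALSE])
  rownames(out) <- NULL
  out
})
names(split_levels) <- names(raw)

# Number of unique combinations at each level
vapply(split_levels, nrow, integer(1))
#> adm0 adm1 adm2
#>    2    3    5
```

Each element of `split_levels` is the set of rows that would be matched at that level in this simplified picture.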
Usage
hmatch_split(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  fn = "hmatch",
  type = "left",
  allow_gaps = TRUE,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...,
  levels = NULL,
  always_list = FALSE,
  man,
  code_col,
  always_tokenize = FALSE,
  token_split = "_",
  exclude_freq = 3,
  exclude_nchar = 3,
  exclude_values = NULL
)
Arguments
- raw
data frame containing hierarchical columns with raw data
- ref
data frame containing hierarchical columns with reference data
- pattern
regex pattern to match the hierarchical columns in raw
Note: hierarchical column names can be matched using either the pattern or by arguments. Or, if neither pattern nor by is specified, the hierarchical columns are assumed to be all column names that are common to both raw and ref. See specifying_columns.
- pattern_ref
regex pattern to match the hierarchical columns in ref. Defaults to pattern, so it only needs to be specified if the hierarchical columns have different names in raw and ref.
- by
vector giving the names of the hierarchical columns in raw
- by_ref
vector giving the names of the hierarchical columns in ref. Defaults to by, so it only needs to be specified if the hierarchical columns have different names in raw and ref.
- fn
which function to use for matching. Current options are hmatch, hmatch_permute, hmatch_tokens, hmatch_settle, or hmatch_composite. Defaults to "hmatch".
Note that some subsequent arguments are only relevant for specific functions (e.g. the exclude_ arguments are only relevant if fn = "hmatch_tokens").
- type
type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types.
Note that the details of resolve joins vary somewhat among hmatch functions (see documentation for the relevant function), and that function hmatch_composite only allows resolve joins.
- allow_gaps
logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE.
- fuzzy
logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.
- fuzzy_method
if fuzzy = TRUE, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".
- fuzzy_dist
if fuzzy = TRUE, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to fuzzy_dist will be considered matching). Defaults to 1L.
- dict
optional dictionary for recoding values within the hierarchical columns of raw (see dictionary_recoding)
- ref_prefix
prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".
- std_fn
function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.
- ...
additional arguments passed to std_fn()
- levels
a vector of names or integer indices corresponding to one or more of the hierarchical columns in raw to match at. Defaults to NULL, in which case matches are made at each hierarchical level.
- always_list
logical indicating whether to always return a list, even when argument levels specifies a single match level. Defaults to FALSE.
- man
(optional) data frame of manually-specified matches, relating a given set of hierarchical values to the code within ref to which those values correspond
- code_col
name of the code column containing codes for matching ref and man (only required if argument man is given)
- always_tokenize
logical indicating whether to tokenize all values prior to matching (TRUE), or to first attempt non-tokenized matching with hmatch and only tokenize values within raw (and corresponding putative matches within ref) that don't have a non-tokenized match (FALSE). Defaults to FALSE.
- token_split
regex pattern to split strings into tokens. Currently tokenization is implemented after string-standardization with argument std_fn (this may change in a future version), so the regex pattern should split standardized strings rather than the original strings. Defaults to "_".
- exclude_freq
exclude tokens from matching if they have a frequency greater than or equal to this value. Refers to the number of unique, string-standardized values at a given hierarchical level in which a given token occurs, as calculated by count_tokens (separately for raw and ref). Defaults to 3.
- exclude_nchar
exclude tokens from matching if they have nchar less than or equal to this value. Defaults to 3.
- exclude_values
character vector of additional tokens to exclude from matching. Subject to string-standardization with argument std_fn.
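To make the token-exclusion arguments more concrete, here is a rough base R sketch of how the exclude_freq and exclude_nchar thresholds could be applied to a set of values. It is an assumption-laden illustration rather than the package's code: the input strings are made up and assumed to be already standardized (lowercase, underscore-separated), the variable names are hypothetical, and the package itself computes token frequencies with count_tokens, separately for raw and ref.

```r
# Hypothetical, already-standardized values at one hierarchical level
values <- unique(c("saint_john", "saint_paul", "saint_louis", "st_cloud"))

token_split   <- "_"  # regex used to split standardized strings
exclude_freq  <- 3    # drop tokens occurring in >= 3 unique values
exclude_nchar <- 3    # drop tokens with <= 3 characters

# Split each value into tokens, counting a token at most once per value
tokens_by_value <- lapply(strsplit(values, token_split), unique)
token_table <- table(unlist(tokens_by_value))

tokens <- names(token_table)
freq   <- as.integer(token_table)

# A token is excluded if it is too frequent or too short
excluded <- tokens[freq >= exclude_freq | nchar(tokens) <= exclude_nchar]
retained <- setdiff(tokens, excluded)

excluded  # "saint" (occurs in 3 values) and "st" (2 characters)
retained  # "cloud", "john", "louis", "paul"
```

Under these thresholds, only the retained tokens would remain available for matching in this sketch.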
Value
A list of data frames, each returned by a call to fn on the unique
combination of hierarchical values at the given hierarchical level. The
number of elements in the list corresponds to the number of hierarchical
columns in raw, or, if specified, the number of elements in argument
levels.
However, if always_list = FALSE and length(levels) == 1, a single data
frame is returned (i.e. not wrapped in a list).
Examples
data(ne_raw)
data(ne_ref)
# by default calls fn `hmatch` separately for each hierarchical level
hmatch_split(ne_raw, ne_ref)
#> $adm0
#> adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1 CAN <NA> <NA> adm0 CAN <NA> <NA> 100
#> 2 USA <NA> <NA> adm0 USA <NA> <NA> 200
#> 3 can <NA> <NA> adm0 CAN <NA> <NA> 100
#>
#> $adm1
#> adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1 CAN Ontario <NA> adm1 CAN Ontario <NA> 110
#> 2 USA NJ <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 USA New York <NA> adm1 USA New York <NA> 220
#> 4 USA New York State <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 USA New_York <NA> adm1 USA New York <NA> 220
#> 6 USA Pensylvania <NA> <NA> <NA> <NA> <NA> <NA>
#> 7 USA Philadelphia <NA> <NA> <NA> <NA> <NA> <NA>
#> 8 USA new. york <NA> adm1 USA New York <NA> 220
#> 9 can ontario <NA> adm1 CAN Ontario <NA> 110
#>
#> $adm2
#> adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2
#> 1 <NA> <NA> Bergen, N.J. <NA> <NA> <NA> <NA>
#> 2 <NA> <NA> Jeffersen <NA> <NA> <NA> <NA>
#> 3 <NA> <NA> Philadelphia adm2 USA Pennsylvania Philadelphia
#> 4 <NA> <NA> york adm2 CAN Ontario York
#> 5 <NA> <NA> york adm2 USA Pennsylvania York
#> 6 CAN Ontario Peel R.M. <NA> <NA> <NA> <NA>
#> 7 USA <NA> York adm2 USA Pennsylvania York
#> 8 USA New York Kings County <NA> <NA> <NA> <NA>
#> 9 USA New York Suffolk adm2 USA New York Suffolk
#> 10 USA New York State New York <NA> <NA> <NA> <NA>
#> 11 USA New_York King <NA> <NA> <NA> <NA>
#> 12 USA Pensylvania Ithaca <NA> <NA> <NA> <NA>
#> 13 USA new. york jefferson adm2 USA New York Jefferson
#> hcode
#> 1 <NA>
#> 2 <NA>
#> 3 237
#> 4 115
#> 5 238
#> 6 <NA>
#> 7 238
#> 8 <NA>
#> 9 227
#> 10 <NA>
#> 11 <NA>
#> 12 <NA>
#> 13 222
#>
# can also specify other hmatch functions, and subsets of hierarchical levels
hmatch_split(ne_raw, ne_ref, fn = "hmatch_tokens", levels = 2:3)
#> $adm1
#> adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1 CAN Ontario <NA> adm1 CAN Ontario <NA> 110
#> 2 USA NJ <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 USA New York <NA> adm1 USA New Jersey <NA> 210
#> 4 USA New York <NA> adm1 USA New York <NA> 220
#> 5 USA New York State <NA> adm1 USA New Jersey <NA> 210
#> 6 USA New York State <NA> adm1 USA New York <NA> 220
#> 7 USA New_York <NA> adm1 USA New Jersey <NA> 210
#> 8 USA New_York <NA> adm1 USA New York <NA> 220
#> 9 USA Pensylvania <NA> <NA> <NA> <NA> <NA> <NA>
#> 10 USA Philadelphia <NA> <NA> <NA> <NA> <NA> <NA>
#> 11 USA new. york <NA> adm1 USA New Jersey <NA> 210
#> 12 USA new. york <NA> adm1 USA New York <NA> 220
#> 13 can ontario <NA> adm1 CAN Ontario <NA> 110
#>
#> $adm2
#> adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2
#> 1 <NA> <NA> Bergen, N.J. adm2 USA New Jersey Bergen
#> 2 <NA> <NA> Jeffersen <NA> <NA> <NA> <NA>
#> 3 <NA> <NA> Philadelphia adm2 USA Pennsylvania Philadelphia
#> 4 <NA> <NA> york adm2 CAN Ontario York
#> 5 <NA> <NA> york adm2 USA New York New York
#> 6 <NA> <NA> york adm2 USA Pennsylvania York
#> 7 CAN Ontario Peel R.M. adm2 CAN Ontario Peel
#> 8 USA <NA> York adm2 USA New York New York
#> 9 USA <NA> York adm2 USA Pennsylvania York
#> 10 USA New York Kings County adm2 USA New York Kings
#> 11 USA New York Suffolk adm2 USA New York Suffolk
#> 12 USA New York State New York adm2 USA New York New York
#> 13 USA New_York King <NA> <NA> <NA> <NA>
#> 14 USA Pensylvania Ithaca <NA> <NA> <NA> <NA>
#> 15 USA new. york jefferson adm2 USA New York Jefferson
#> hcode
#> 1 211
#> 2 <NA>
#> 3 237
#> 4 115
#> 5 225
#> 6 238
#> 7 113
#> 8 225
#> 9 238
#> 10 223
#> 11 227
#> 12 225
#> 13 <NA>
#> 14 <NA>
#> 15 222
#>