Hierarchical matching, separately at each hierarchical level
Source: R/hmatch_split.R
Implements hierarchical matching separately at each hierarchical level within the data. For a given level, the raw data to be matched comprises every unique combination of values at and below the level of interest (that is, the column for that level together with the columns for all broader levels). For example:
Level 1: | Canada        |
         | United States |
Level 2: | Canada        | Ontario      |
         | United States | New York     |
         | United States | Pennsylvania |
Level 3: | Canada        | Ontario      | Ottawa       |
         | Canada        | Ontario      | Toronto      |
         | United States | New York     | Bronx        |
         | United States | New York     | New York     |
         | United States | Pennsylvania | Philadelphia |
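The per-level subsetting shown above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the data frame raw_example and the helper unique_by_level() are hypothetical names, the columns are assumed to be ordered from broadest to most specific, and missing values are not handled.

```r
# Hypothetical raw data with three hierarchical columns (broadest first)
raw_example <- data.frame(
  adm0 = c("Canada", "Canada", "United States", "United States", "United States"),
  adm1 = c("Ontario", "Ontario", "New York", "New York", "Pennsylvania"),
  adm2 = c("Ottawa", "Toronto", "Bronx", "New York", "Philadelphia"),
  stringsAsFactors = FALSE
)

# Illustrative helper (not part of hmatch): for each level, keep the columns
# up to and including that level and drop duplicated combinations
unique_by_level <- function(dat, cols = names(dat)) {
  out <- lapply(seq_along(cols), function(i) {
    sub <- unique(dat[, cols[seq_len(i)], drop = FALSE])
    rownames(sub) <- NULL
    sub
  })
  names(out) <- cols
  out
}

combos <- unique_by_level(raw_example)
combos$adm1  # the Level 2 combinations (country and state/province)
```

Each element of combos corresponds to one block of the table above; in this sketch a matching function would then be applied to each element in turn.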
Usage
hmatch_split(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  fn = "hmatch",
  type = "left",
  allow_gaps = TRUE,
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  dict = NULL,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...,
  levels = NULL,
  always_list = FALSE,
  man,
  code_col,
  always_tokenize = FALSE,
  token_split = "_",
  exclude_freq = 3,
  exclude_nchar = 3,
  exclude_values = NULL
)
Arguments
- raw
data frame containing hierarchical columns with raw data
- ref
data frame containing hierarchical columns with reference data
- pattern
regex pattern to match the hierarchical columns in raw
Note: hierarchical column names can be matched using either the pattern or by arguments. If neither pattern nor by is specified, the hierarchical columns are assumed to be all column names that are common to both raw and ref. See specifying_columns.
- pattern_ref
regex pattern to match the hierarchical columns in ref. Defaults to pattern, so it only needs to be specified if the hierarchical columns have different names in raw and ref.
- by
vector giving the names of the hierarchical columns in raw
- by_ref
vector giving the names of the hierarchical columns in ref. Defaults to by, so it only needs to be specified if the hierarchical columns have different names in raw and ref.
- fn
which function to use for matching. Current options are hmatch, hmatch_permute, hmatch_tokens, hmatch_settle, or hmatch_composite. Defaults to "hmatch".
Note that some subsequent arguments are only relevant for specific functions (e.g. the exclude_ arguments are only relevant if fn = "hmatch_tokens").
- type
type of join ("left", "inner", "anti", "resolve_left", "resolve_inner", or "resolve_anti"). Defaults to "left". See join_types.
Note that the details of resolve joins vary somewhat among hmatch functions (see documentation for the relevant function), and that function hmatch_composite only allows resolve joins.
- allow_gaps
logical indicating whether to allow missing values below the match level, where 'match level' is the highest level with a non-missing value within a given row of raw. Defaults to TRUE.
- fuzzy
logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.
- fuzzy_method
if fuzzy = TRUE, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".
- fuzzy_dist
if fuzzy = TRUE, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to fuzzy_dist will be considered matching). Defaults to 1L.
- dict
optional dictionary for recoding values within the hierarchical columns of raw (see dictionary_recoding)
- ref_prefix
prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".
- std_fn
function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.
- ...
additional arguments passed to std_fn()
- levels
a vector of names or integer indices corresponding to one or more of the hierarchical columns in raw to match at. Defaults to NULL, in which case matches are made at each hierarchical level.
- always_list
logical indicating whether to always return a list, even when argument levels specifies a single match level. Defaults to FALSE.
- man
(optional) data frame of manually-specified matches, relating a given set of hierarchical values to the code within ref to which those values correspond
- code_col
name of the code column containing codes for matching ref and man (only required if argument man is given)
- always_tokenize
logical indicating whether to tokenize all values prior to matching (TRUE), or to first attempt non-tokenized matching with hmatch and only tokenize values within raw (and corresponding putative matches within ref) that don't have a non-tokenized match (FALSE). Defaults to FALSE.
- token_split
regex pattern to split strings into tokens. Currently tokenization is implemented after string-standardization with argument std_fn (this may change in a future version), so the regex pattern should split standardized strings rather than the original strings. Defaults to "_".
- exclude_freq
exclude tokens from matching if they have a frequency greater than or equal to this value. Refers to the number of unique, string-standardized values at a given hierarchical level in which a given token occurs, as calculated by count_tokens (separately for raw and ref). Defaults to 3.
- exclude_nchar
exclude tokens from matching if they have nchar less than or equal to this value. Defaults to 3.
- exclude_values
character vector of additional tokens to exclude from matching. Subject to string-standardization with argument std_fn.
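As a rough illustration of how the token_split, exclude_freq, exclude_nchar and exclude_values arguments described above could interact, the following base-R sketch lists the tokens that would be excluded for a small set of values. The helper tokens_to_exclude() and the example values are hypothetical, and the counting only approximates what is described for count_tokens; it is not the package's code. The values are assumed to be already string-standardized (lowercase, words joined by "_").

```r
# Hypothetical, already-standardized values at one hierarchical level
values <- c("new_york", "new_jersey", "new_mexico", "city_of_york", "ontario")

# Illustrative helper (not part of hmatch): find the tokens that would be
# excluded under the documented thresholds
tokens_to_exclude <- function(x, token_split = "_", exclude_freq = 3,
                              exclude_nchar = 3, exclude_values = NULL) {
  # tokenize each unique value, counting a token at most once per value
  tokens_by_value <- lapply(strsplit(unique(x), token_split), unique)
  freq <- table(unlist(tokens_by_value))
  tokens <- names(freq)
  # tokens occurring in at least exclude_freq unique values
  too_frequent <- tokens[as.integer(freq) >= exclude_freq]
  # tokens with at most exclude_nchar characters
  too_short <- tokens[nchar(tokens) <= exclude_nchar]
  sort(unique(c(too_frequent, too_short, exclude_values)))
}

tokens_to_exclude(values)
tokens_to_exclude(values, exclude_values = "york")
```

With the default thresholds, "new" is flagged both for frequency (it occurs in three unique values) and for length, and "of" is flagged for length only; "york" occurs in just two values and is kept unless it is listed in exclude_values.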
Value
A list of data frames, each returned by a call to fn on the unique combinations of hierarchical values at the given hierarchical level. The number of elements in the list corresponds to the number of hierarchical columns in raw, or, if specified, the number of elements in argument levels.
However, if always_list = FALSE and length(levels) == 1, a single data frame is returned (i.e. not wrapped in a list).
Examples
data(ne_raw)
data(ne_ref)
# by default calls fn `hmatch` separately for each hierarchical level
hmatch_split(ne_raw, ne_ref)
#> $adm0
#> adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1 CAN <NA> <NA> adm0 CAN <NA> <NA> 100
#> 2 USA <NA> <NA> adm0 USA <NA> <NA> 200
#> 3 can <NA> <NA> adm0 CAN <NA> <NA> 100
#>
#> $adm1
#> adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1 CAN Ontario <NA> adm1 CAN Ontario <NA> 110
#> 2 USA NJ <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 USA New York <NA> adm1 USA New York <NA> 220
#> 4 USA New York State <NA> <NA> <NA> <NA> <NA> <NA>
#> 5 USA New_York <NA> adm1 USA New York <NA> 220
#> 6 USA Pensylvania <NA> <NA> <NA> <NA> <NA> <NA>
#> 7 USA Philadelphia <NA> <NA> <NA> <NA> <NA> <NA>
#> 8 USA new. york <NA> adm1 USA New York <NA> 220
#> 9 can ontario <NA> adm1 CAN Ontario <NA> 110
#>
#> $adm2
#> adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2
#> 1 <NA> <NA> Bergen, N.J. <NA> <NA> <NA> <NA>
#> 2 <NA> <NA> Jeffersen <NA> <NA> <NA> <NA>
#> 3 <NA> <NA> Philadelphia adm2 USA Pennsylvania Philadelphia
#> 4 <NA> <NA> york adm2 CAN Ontario York
#> 5 <NA> <NA> york adm2 USA Pennsylvania York
#> 6 CAN Ontario Peel R.M. <NA> <NA> <NA> <NA>
#> 7 USA <NA> York adm2 USA Pennsylvania York
#> 8 USA New York Kings County <NA> <NA> <NA> <NA>
#> 9 USA New York Suffolk adm2 USA New York Suffolk
#> 10 USA New York State New York <NA> <NA> <NA> <NA>
#> 11 USA New_York King <NA> <NA> <NA> <NA>
#> 12 USA Pensylvania Ithaca <NA> <NA> <NA> <NA>
#> 13 USA new. york jefferson adm2 USA New York Jefferson
#> hcode
#> 1 <NA>
#> 2 <NA>
#> 3 237
#> 4 115
#> 5 238
#> 6 <NA>
#> 7 238
#> 8 <NA>
#> 9 227
#> 10 <NA>
#> 11 <NA>
#> 12 <NA>
#> 13 222
#>
# can also specify other hmatch functions, and subsets of hierarchical levels
hmatch_split(ne_raw, ne_ref, fn = "hmatch_tokens", levels = 2:3)
#> $adm1
#> adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2 hcode
#> 1 CAN Ontario <NA> adm1 CAN Ontario <NA> 110
#> 2 USA NJ <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 USA New York <NA> adm1 USA New Jersey <NA> 210
#> 4 USA New York <NA> adm1 USA New York <NA> 220
#> 5 USA New York State <NA> adm1 USA New Jersey <NA> 210
#> 6 USA New York State <NA> adm1 USA New York <NA> 220
#> 7 USA New_York <NA> adm1 USA New Jersey <NA> 210
#> 8 USA New_York <NA> adm1 USA New York <NA> 220
#> 9 USA Pensylvania <NA> <NA> <NA> <NA> <NA> <NA>
#> 10 USA Philadelphia <NA> <NA> <NA> <NA> <NA> <NA>
#> 11 USA new. york <NA> adm1 USA New Jersey <NA> 210
#> 12 USA new. york <NA> adm1 USA New York <NA> 220
#> 13 can ontario <NA> adm1 CAN Ontario <NA> 110
#>
#> $adm2
#> adm0 adm1 adm2 level ref_adm0 ref_adm1 ref_adm2
#> 1 <NA> <NA> Bergen, N.J. adm2 USA New Jersey Bergen
#> 2 <NA> <NA> Jeffersen <NA> <NA> <NA> <NA>
#> 3 <NA> <NA> Philadelphia adm2 USA Pennsylvania Philadelphia
#> 4 <NA> <NA> york adm2 CAN Ontario York
#> 5 <NA> <NA> york adm2 USA New York New York
#> 6 <NA> <NA> york adm2 USA Pennsylvania York
#> 7 CAN Ontario Peel R.M. adm2 CAN Ontario Peel
#> 8 USA <NA> York adm2 USA New York New York
#> 9 USA <NA> York adm2 USA Pennsylvania York
#> 10 USA New York Kings County adm2 USA New York Kings
#> 11 USA New York Suffolk adm2 USA New York Suffolk
#> 12 USA New York State New York adm2 USA New York New York
#> 13 USA New_York King <NA> <NA> <NA> <NA>
#> 14 USA Pensylvania Ithaca <NA> <NA> <NA> <NA>
#> 15 USA new. york jefferson adm2 USA New York Jefferson
#> hcode
#> 1 211
#> 2 <NA>
#> 3 237
#> 4 115
#> 5 225
#> 6 238
#> 7 113
#> 8 225
#> 9 238
#> 10 223
#> 11 227
#> 12 225
#> 13 <NA>
#> 14 <NA>
#> 15 222
#>