Match a hierarchical column (e.g. region, province, or county) within a raw, potentially messy dataset against a corresponding column within a reference dataset, by searching for similar sets of 'offspring' (i.e. values at the next hierarchical level).

For example, if the raw dataset uses admin1 level "NY" whereas the reference dataset uses "New York", it would be difficult to automatically match these values using only fuzzy-matching. However, we might nonetheless be able to match "NY" to "New York" if they share a common and unique set of 'offspring' (i.e. admin2 values) across both datasets (e.g. "Kings", "Queens", "New York", "Suffolk", "Bronx", etc.).
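The idea of matching parents through shared offspring can be sketched in base R. The toy data frames below are hypothetical, and the code only illustrates the principle of counting children common to each raw/reference parent pair; it is not the package's internal implementation.

```r
# Toy data (hypothetical): raw uses abbreviations at adm1, ref uses full names
raw <- data.frame(
  adm1 = c("NY", "NY", "NY", "NJ", "NJ"),
  adm2 = c("Kings", "Queens", "Bronx", "Bergen", "Essex"),
  stringsAsFactors = FALSE
)
ref <- data.frame(
  adm1 = c("New York", "New York", "New York", "New Jersey", "New Jersey"),
  adm2 = c("Kings", "Queens", "Bronx", "Bergen", "Essex"),
  stringsAsFactors = FALSE
)

# Unique children of each parent, as named lists
children_raw <- lapply(split(raw$adm2, raw$adm1), unique)
children_ref <- lapply(split(ref$adm2, ref$adm1), unique)

# Number of shared children for every raw parent (rows) and ref parent (columns)
overlap <- sapply(children_ref, function(r) {
  sapply(children_raw, function(x) length(intersect(x, r)))
})

# For each raw parent, the ref parent sharing the most children
best <- apply(overlap, 1, function(x) names(x)[which.max(x)])
```

Here `overlap["NY", "New York"]` is 3 and `overlap["NY", "New Jersey"]` is 0, so "NY" pairs with "New York" even though the two strings are far apart.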

Unlike other hmatch functions, the data frame returned by hmatch_parents only includes unique hierarchical combinations and only relevant hierarchical levels (i.e. the parent level and above), along with additional columns giving the number of matching children and total number of children for a given parent.

Usage

hmatch_parents(
  raw,
  ref,
  pattern,
  pattern_ref = pattern,
  by,
  by_ref = by,
  level,
  min_matches = 1L,
  type = "left",
  fuzzy = FALSE,
  fuzzy_method = "osa",
  fuzzy_dist = 1L,
  ref_prefix = "ref_",
  std_fn = string_std,
  ...
)

Arguments

raw

data frame containing hierarchical columns with raw data

ref

data frame containing hierarchical columns with reference data

pattern

regex pattern to match the hierarchical columns in raw

Note: hierarchical column names can be matched using either the pattern or by arguments. If neither pattern nor by is specified, the hierarchical columns are assumed to be all column names that are common to both raw and ref. See specifying_columns.

pattern_ref

regex pattern to match the hierarchical columns in ref. Defaults to pattern, so it only needs to be specified if the hierarchical columns have different names in raw and ref.

by

vector giving the names of the hierarchical columns in raw

by_ref

vector giving the names of the hierarchical columns in ref. Defaults to by, so it only needs to be specified if the hierarchical columns have different names in raw and ref.

level

name or integer index of the hierarchical level to match at (i.e. the 'parent' level). If a name, must correspond to a hierarchical column within raw, not including the very last hierarchical column (which has no hierarchical children). If an integer, must be between 1 and k-1, where k is the number of hierarchical columns.

min_matches

minimum number of matching offspring required for parents to be considered a match. Defaults to 1.

type

type of join ("left", "inner", or "anti"). Defaults to "left".

fuzzy

logical indicating whether to use fuzzy-matching (based on the stringdist package). Defaults to FALSE.

fuzzy_method

if fuzzy = TRUE, the method to use for string distance calculation (see stringdist-metrics). Defaults to "osa".

fuzzy_dist

if fuzzy = TRUE, the maximum string distance to use to classify matches (i.e. a string distance less than or equal to fuzzy_dist will be considered matching). Defaults to 1L.

ref_prefix

prefix to add to names of returned columns from ref if they are otherwise identical to names within raw. Defaults to "ref_".

std_fn

function to standardize strings during matching. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.

...

additional arguments passed to std_fn()
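To illustrate how standardization and a fuzzy distance threshold interact when comparing child names, here is a minimal base R sketch. `std_simple()` is a hypothetical, simplified stand-in for a standardization function such as string_std, and base R's `adist()` (Levenshtein distance) is used in place of the stringdist package's "osa" method, so results may differ where transpositions are involved.

```r
# Hypothetical, simplified standardization: lowercase, and collapse any run
# of non-alphanumeric characters into a single underscore
std_simple <- function(x) {
  x <- tolower(x)
  gsub("[^a-z0-9]+", "_", x)
}

raw_child <- c("St. Lawrence", "Quens", "Bronx")
ref_child <- c("St Lawrence", "Queens", "Kings")

# Exact matching after standardization (analogous to fuzzy = FALSE)
exact <- std_simple(raw_child) %in% std_simple(ref_child)

# Fuzzy matching: a raw child matches if any ref child lies within fuzzy_dist
fuzzy_dist <- 1L
d <- adist(std_simple(raw_child), std_simple(ref_child))
fuzzy <- apply(d, 1, function(row) any(row <= fuzzy_dist))
```

In this sketch "St. Lawrence" matches exactly once punctuation is standardized, "Quens" matches "Queens" only under the fuzzy threshold, and "Bronx" matches under neither.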

Value

a data frame obtained by matching the hierarchical columns in raw and ref (at the parent level and above), using the join type specified by argument type (see join_types for more details). Note that unlike other hmatch_ functions, hmatch_parents returns only unique rows and relevant hierarchical columns (i.e. the parent level and above), along with additional columns describing the number of matching children and total number of children for a given parent.

...

hierarchical columns from raw, parent level and above

...

hierarchical columns from ref, parent level and above

n_child_raw

total number of unique children belonging to the parent within raw

n_child_ref

total number of unique children belonging to the parent within ref

n_child_match

number of children in raw with a match in ref
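As a rough illustration of how the three count columns relate to min_matches, the following base R sketch computes them for a single hypothetical parent pair. It reflects one reading of the column descriptions above (exact matching, no standardization) rather than the package's actual code.

```r
# Children of one parent as they might appear in raw and ref (hypothetical);
# raw may list the same child on several rows
raw_children <- c("Kings", "Queens", "Bronx", "Kings")
ref_children <- c("Kings", "Queens", "Bronx", "Suffolk")

n_child_raw   <- length(unique(raw_children))                 # 3
n_child_ref   <- length(unique(ref_children))                 # 4
n_child_match <- sum(unique(raw_children) %in% ref_children)  # 3

# The parents count as a match when enough children are shared
min_matches <- 2L
is_match <- n_child_match >= min_matches
```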

Examples

# e.g. match abbreviated adm1 names to full names based on common offspring
raw <- ne_ref
raw$adm1[raw$adm1 == "Ontario"] <- "ON"
raw$adm1[raw$adm1 == "New York"] <- "NY"
raw$adm1[raw$adm1 == "New Jersey"] <- "NJ"
raw$adm1[raw$adm1 == "Pennsylvania"] <- "PA"

hmatch_parents(
  raw,
  ne_ref,
  pattern = "adm",
  level = "adm1",
  min_matches = 2,
  type = "left"
)
#>   adm0 adm1 ref_adm0     ref_adm1 n_child_raw n_child_ref n_child_match
#> 1  CAN   ON      CAN      Ontario           5           5             5
#> 2  USA   NJ      USA   New Jersey           5           5             5
#> 3  USA   NY      USA     New York           7           7             7
#> 4  USA   PA      USA Pennsylvania           8           8             8
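A possible variation on the example above, shown as a sketch only: raising min_matches and switching the join type. It assumes the hmatch package is installed and that ne_ref is available as in the example; no output is shown because it has not been run here.

```r
library(hmatch)

# Same setup as above, with two abbreviated adm1 values
raw <- ne_ref
raw$adm1[raw$adm1 == "Ontario"] <- "ON"
raw$adm1[raw$adm1 == "New York"] <- "NY"

# Keep only parents that share at least 6 children with a ref parent
strict <- hmatch_parents(
  raw,
  ne_ref,
  pattern = "adm",
  level = "adm1",
  min_matches = 6L,
  type = "inner"
)

# Parents in raw without a qualifying match in ref
unmatched <- hmatch_parents(
  raw,
  ne_ref,
  pattern = "adm",
  level = "adm1",
  min_matches = 6L,
  type = "anti"
)
```

Given the child counts in the output above, a stricter threshold like this would be expected to drop parents with only five children, but that should be checked against the actual result.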