
Tokenized matching of hierarchical columns can yield false positives when the same token occurs in several distinct hierarchical values (e.g. "South", "North", "City").

This is a helper function to find such frequently-occurring tokens, which can then be passed to the exclude argument of hmatch_tokens. The frequency calculated is the number of unique, string-standardized values in which a given token is found.

Usage

count_tokens(
  x,
  split = "[-_[:space:]]+",
  min_freq = 2,
  min_nchar = 3,
  return_values = TRUE,
  std_fn = string_std,
  ...
)

Arguments

x

a character vector (generally a hierarchical column)

split

regex pattern used to split values into tokens. By default, splits on any sequence of one or more whitespace characters ("[:space:]"), dashes ("-"), and/or underscores ("_").

min_freq

minimum token frequency (i.e. number of unique values in which a given token occurs). Defaults to 2.

min_nchar

minimum token size in number of characters. Defaults to 3.

return_values

logical indicating whether to return the standardized values in which each token is found (TRUE), or only the count of the number of unique standardized values (FALSE). Defaults to TRUE.

std_fn

function to standardize strings, as performed within all hmatch_ functions. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.

...

additional arguments passed to std_fn()

Examples

french_departments <- c(
  "Alpes-de-Haute-Provence", "Hautes-Alpes", "Ardennes", "Bouches-du-Rhône",
  "Corse-du-Sud", "Haute-Corse", "Haute-Garonne", "Ille-et-Vilaine",
  "Haute-Loire", "Hautes-Pyrénées", "Pyrénées-Atlantiques", "Hauts-de-Seine"
)

count_tokens(french_departments)
#>    token_std               value_std
#> 1      haute alpes_de_haute_provence
#> 2      haute             haute_corse
#> 3      haute           haute_garonne
#> 4      haute             haute_loire
#> 5      alpes alpes_de_haute_provence
#> 6      alpes            hautes_alpes
#> 7      corse            corse_du_sud
#> 8      corse             haute_corse
#> 9     hautes            hautes_alpes
#> 10    hautes         hautes_pyrenees
#> 11  pyrenees         hautes_pyrenees
#> 12  pyrenees    pyrenees_atlantiques
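
The frequency described above (the number of unique, standardized values containing a given token) can be approximated in base R. This is a minimal sketch for illustration only, not the package's implementation: simple_std() is a hypothetical stand-in for string_std, and the input vector is made up for the example. The split pattern and thresholds mirror the documented defaults of split, min_freq and min_nchar.

```r
# Made-up example values (not from the package)
x <- c("North Province", "North-West Province", "South_Province",
       "Port City", "north province")

# Hypothetical stand-in for string_std(): lowercase, then collapse any run
# of characters other than a-z and 0-9 into a single underscore
simple_std <- function(x) gsub("[^a-z0-9]+", "_", tolower(x))

# Unique standardized values, so duplicates are counted only once
values <- unique(simple_std(x))

# Split each value into tokens, keeping each token once per value
tokens_by_value <- lapply(strsplit(values, "[-_[:space:]]+"), unique)

# Number of unique values in which each token occurs
tab <- table(unlist(tokens_by_value))
freq <- setNames(as.integer(tab), names(tab))

# Apply the thresholds corresponding to min_freq = 2 and min_nchar = 3
keep <- freq >= 2 & nchar(names(freq)) >= 3
token_freq <- sort(freq[keep], decreasing = TRUE)

print(token_freq)
```

Here "North Province" and "north province" standardize to the same value and are counted once, so "north" has a frequency of 2 and "province" a frequency of 3. Tokens that occur in only one value ("south", "west", "port", "city") are dropped.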