Find frequently occurring tokens within a hierarchical column
Source: R/count_tokens.R
Tokenized matching of hierarchical columns can yield false positives when there are tokens that occur frequently in multiple unique hierarchical values (e.g. "South", "North", "City", etc.).
This is a helper function to find such frequently-occurring tokens, which can then be passed to the exclude argument of hmatch_tokens. The frequency calculated is the number of unique, string-standardized values in which a given token is found.
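The frequency definition above can be sketched in base R. This is only an illustration of the counting logic, not the package's implementation: simple_std is a hypothetical stand-in for string_std, and the input values are made up for the example.

```r
# Hypothetical stand-in for a string standardizer (NOT hmatch's string_std):
# lowercase, then collapse runs of non-alphanumeric characters to "_"
simple_std <- function(x) gsub("[^a-z0-9]+", "_", tolower(x))

# Made-up input values for illustration
values <- c("North City", "north-city", "South City", "North Bay")

# Unique standardized values: "North City" and "north-city" collapse into one
values_std <- unique(simple_std(values))

# Split each standardized value into its unique tokens
tokens_by_value <- lapply(strsplit(values_std, "_", fixed = TRUE), unique)

# Frequency = number of unique standardized values containing each token
token_freq <- table(unlist(tokens_by_value))

# Keep tokens found in at least 2 values (mirrors the idea of min_freq = 2)
frequent <- names(token_freq)[token_freq >= 2]
```

Because duplicates are removed after standardization, "North City" and "north-city" count once, so "north" and "city" each have a frequency of 2 rather than 3.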
Usage
count_tokens(
  x,
  split = "[-_[:space:]]+",
  min_freq = 2,
  min_nchar = 3,
  return_values = TRUE,
  std_fn = string_std,
  ...
)
Arguments
- x
a character vector (generally a hierarchical column)
- split
regex pattern used to split values into tokens. By default splits on any sequence of one or more space characters ("[:space:]"), dashes ("-"), and/or underscores ("_").
- min_freq
minimum token frequency (i.e. number of unique values in which a given token occurs). Defaults to 2.
- min_nchar
minimum token size in number of characters. Defaults to 3.
- return_values
logical indicating whether to return the standardized values in which each token is found (TRUE), or only the count of the number of unique standardized values (FALSE). Defaults to TRUE.
- std_fn
function to standardize strings, as performed within all hmatch_ functions. Defaults to string_std. Set to NULL to omit standardization. See also string_standardization.
- ...
additional arguments passed to std_fn()
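To get a feel for the split and min_nchar arguments, the default pattern can be tried directly with base R's strsplit(). This sketch applies the pattern to raw, made-up strings purely for illustration; it assumes nothing about the order in which count_tokens itself standardizes and splits.

```r
# Default split pattern from the Usage section
split <- "[-_[:space:]]+"

# Made-up raw strings using dashes, repeated spaces, and underscores
raw <- c("Corse-du-Sud", "Ille et  Vilaine", "Hauts_de_Seine")

# Any run of one or more separators acts as a single split point
tokens <- strsplit(raw, split)

# Drop tokens shorter than 3 characters (mirrors the idea of min_nchar = 3)
long_enough <- lapply(tokens, function(tk) tk[nchar(tk) >= 3])
```

Here short linking words such as "du", "et", and "de" fall below the 3-character minimum and are dropped, which is why they would not be counted as candidate tokens.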
Examples
french_departments <- c(
  "Alpes-de-Haute-Provence", "Hautes-Alpes", "Ardennes", "Bouches-du-Rhône",
  "Corse-du-Sud", "Haute-Corse", "Haute-Garonne", "Ille-et-Vilaine",
  "Haute-Loire", "Hautes-Pyrénées", "Pyrénées-Atlantiques", "Hauts-de-Seine"
)
count_tokens(french_departments)
#>    token_std               value_std
#> 1      haute alpes_de_haute_provence
#> 2      haute             haute_corse
#> 3      haute           haute_garonne
#> 4      haute             haute_loire
#> 5      alpes alpes_de_haute_provence
#> 6      alpes            hautes_alpes
#> 7      corse            corse_du_sud
#> 8      corse             haute_corse
#> 9     hautes            hautes_alpes
#> 10    hautes         hautes_pyrenees
#> 11  pyrenees         hautes_pyrenees
#> 12  pyrenees    pyrenees_atlantiques