Splits tokens into multiple replacement tokens by breaking them apart at a separator pattern, with the option of retaining the separator. This function effectively reverses the operation of tokens_compound().

tokens_split(x, separator = " ", valuetype = c("fixed", "regex"),
  remove_separator = TRUE)

Arguments

x

a tokens object

separator

a single-character pattern by which tokens are separated

valuetype

the type of pattern matching: "fixed" for exact matching (the default) or "regex" for regular expressions. See valuetype for details.

remove_separator

if TRUE, remove the separator from the new tokens; if FALSE, keep it as a separate token

Examples

# undo tokens_compound()
ctoks <- tokens("pork barrel is an idiomatic multi-word expression")
tokens_compound(ctoks, phrase("pork barrel"))
#> tokens from 1 document.
#> text1 :
#> [1] "pork_barrel" "is"          "an"          "idiomatic"   "multi-word"
#> [6] "expression"
#>
tokens_compound(ctoks, phrase("pork barrel")) %>%
  tokens_split(separator = "_")
#> tokens from 1 document.
#> text1 :
#> [1] "pork"       "barrel"     "is"         "an"         "idiomatic"
#> [6] "multi-word" "expression"
#>
# similar to tokens(x, remove_hyphen = TRUE) but post-tokenization
toks <- tokens("UK-EU negotiation is not going anywhere as of 2018-12-24.")
tokens_split(toks, separator = "-", remove_separator = FALSE)
#> tokens from 1 document.
#> text1 :
#>  [1] "UK"          "-"           "EU"          "negotiation" "is"
#>  [6] "not"         "going"       "anywhere"    "as"          "of"
#> [11] "2018"        "-"           "12"          "-"           "24"
#> [16] "."
#>
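The separator need not be a fixed string: with valuetype = "regex" it is treated as a regular expression, so one pattern can split on several different characters at once. A minimal sketch extending the examples above (the input sentence here is hypothetical, and assumes the quanteda package is attached):

```r
library(quanteda)

# the character class [-/] matches either a hyphen or a slash,
# so both "2018/19" and "UK-EU" are split in a single pass
toks <- tokens("The 2018/19 UK-EU talks")
tokens_split(toks, separator = "[-/]", valuetype = "regex")
```

With the default remove_separator = TRUE, the matched hyphens and slashes are dropped rather than kept as standalone tokens.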