Replace multi-token sequences with a multi-word, or "compound", token. The resulting compound tokens represent a phrase or multi-word expression, with the component words joined by concatenator (by default, the "_" character) to form a single "token". This ensures that the sequences will subsequently be processed as single tokens, for instance when constructing a dfm.

tokens_compound(x, pattern, concatenator = "_", valuetype = c("glob",
  "regex", "fixed"), case_insensitive = TRUE, join = TRUE)

Arguments

x

an input tokens object

pattern

a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.

concatenator

the concatenation character that will connect the words making up the multi-word sequences. The default _ is recommended since it will not be removed during normal cleaning and tokenization (while nearly all other punctuation characters, at least those in the Unicode punctuation class [P], will be removed).
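As a minimal sketch (assuming the quanteda package is available), a non-default concatenator can be supplied; note that characters other than the default "_" may be stripped by later cleaning steps, as described above:

```r
# hypothetical sketch: joining a compound with "-" instead of the default "_"
library(quanteda)
toks <- tokens("The United Kingdom is leaving.", remove_punct = TRUE)
tokens_compound(toks, phrase("United Kingdom"), concatenator = "-")
# the matched sequence becomes the single token "United-Kingdom"
```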

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching. When pattern is a collocations object, case-sensitive operation is significantly faster than case-insensitive operation.

join

logical; if TRUE, join overlapping compounds into a single compound; otherwise, form these separately. See examples.

Value

A tokens object in which the token sequences matching pattern have been replaced by compound "tokens" joined by the concatenator.

Note

Patterns to be compounded (naturally) consist of multi-word sequences, and the form in which these must be supplied in pattern is very specific. If the elements to be compounded are space-delimited within the elements of a character vector, wrap the vector in phrase. If the elements to be compounded are separate elements of a character vector, supply it as a list in which each list element is the sequence of character elements.

See the examples below.

Examples

txt <- "The United Kingdom is leaving the European Union."
toks <- tokens(txt, remove_punct = TRUE)

# character vector - not compounded
tokens_compound(toks, c("United", "Kingdom", "European", "Union"))
#> tokens from 1 document.
#> text1 :
#> [1] "The"      "United"   "Kingdom"  "is"       "leaving"  "the"      "European"
#> [8] "Union"

# elements separated by spaces - not compounded
tokens_compound(toks, c("United Kingdom", "European Union"))
#> tokens from 1 document.
#> text1 :
#> [1] "The"      "United"   "Kingdom"  "is"       "leaving"  "the"      "European"
#> [8] "Union"

# list of characters - is compounded
tokens_compound(toks, list(c("United", "Kingdom"), c("European", "Union")))
#> tokens from 1 document.
#> text1 :
#> [1] "The"            "United_Kingdom" "is"             "leaving"
#> [5] "the"            "European_Union"

# elements separated by spaces, wrapped in phrase() - is compounded
tokens_compound(toks, phrase(c("United Kingdom", "European Union")))
#> tokens from 1 document.
#> text1 :
#> [1] "The"            "United_Kingdom" "is"             "leaving"
#> [5] "the"            "European_Union"

# supplied as values in a dictionary (same as list) - is compounded
# (keys do not matter)
tokens_compound(toks, dictionary(list(key1 = "United Kingdom",
                                      key2 = "European Union")))
#> tokens from 1 document.
#> text1 :
#> [1] "The"            "United_Kingdom" "is"             "leaving"
#> [5] "the"            "European_Union"

# pattern as dictionaries with glob matches
tokens_compound(toks, dictionary(list(key1 = c("U* K*"))), valuetype = "glob")
#> tokens from 1 document.
#> text1 :
#> [1] "The"            "United_Kingdom" "is"             "leaving"
#> [5] "the"            "European"       "Union"

# supplied as collocations - is compounded
colls <- tokens("The new European Union is not the old European Union.") %>%
    textstat_collocations(size = 2, min_count = 1, tolower = FALSE)
tokens_compound(toks, colls, case_insensitive = FALSE)
#> tokens from 1 document.
#> text1 :
#> [1] "The"            "United"         "Kingdom"        "is"
#> [5] "leaving"        "the"            "European_Union"

# note the differences caused by join = FALSE
compounds <- list(c("the", "European"), c("European", "Union"))
tokens_compound(toks, pattern = compounds, join = TRUE)
#> tokens from 1 document.
#> text1 :
#> [1] "The"                "United"             "Kingdom"
#> [4] "is"                 "leaving"            "the_European_Union"

tokens_compound(toks, pattern = compounds, join = FALSE)
#> tokens from 1 document.
#> text1 :
#> [1] "The"              "United"           "Kingdom"          "is"
#> [5] "leaving"          "the_European"     "European_Union"