These functions select or discard tokens from a tokens object, with tokens_remove(x, pattern) defined as a shortcut for tokens_select(x, pattern, selection = "remove"). The most common usage of tokens_remove is to eliminate stop words from a text or text-based object, while the most common use of tokens_select is to keep only those tokens with positive pattern matches, for example from a list of regular expressions or a dictionary.

tokens_select(x, pattern, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  padding = FALSE, verbose = quanteda_options("verbose"))

tokens_remove(x, pattern, valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, padding = FALSE,
  verbose = quanteda_options("verbose"))
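
Since tokens_remove(x, pattern) is defined as a shortcut for tokens_select(x, pattern, selection = "remove"), the two calls below should return identical results. (This sketch is illustrative, assumes quanteda is attached, and is not part of the package's documented examples.)

library("quanteda")

# tokens_remove() should behave identically to tokens_select() with
# selection = "remove"
toks <- tokens("The quick brown fox jumps over the lazy dog.",
               remove_punct = TRUE)
identical(tokens_remove(toks, stopwords("english")),
          tokens_select(toks, stopwords("english"), selection = "remove"))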

Arguments

x

tokens object whose token elements will be selected

pattern

a character vector, list of character vectors, dictionary, collocations, or dfm. See pattern for details.

selection

whether to "keep" or "remove" the tokens matching pattern

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

ignore case when matching, if TRUE

padding

if TRUE, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected tokens, for instance if a window of adjacency needs to be computed.

verbose

if TRUE, print a message reporting how many tokens were selected or removed

Value

a tokens object with tokens selected or removed based on their match to pattern

Examples

## tokens_select with simple examples
toks <- tokens(c("This is a sentence.", "This is a second sentence."),
               remove_punct = TRUE)
tokens_select(toks, c("is", "a", "this"), selection = "keep", padding = FALSE)
#> tokens from 2 documents.
#> text1 :
#> [1] "This" "is" "a"
#> 
#> text2 :
#> [1] "This" "is" "a"
#> 
tokens_select(toks, c("is", "a", "this"), selection = "keep", padding = TRUE)
#> tokens from 2 documents.
#> text1 :
#> [1] "This" "is" "a" ""
#> 
#> text2 :
#> [1] "This" "is" "a" "" ""
#> 
tokens_select(toks, c("is", "a", "this"), selection = "remove", padding = FALSE)
#> tokens from 2 documents.
#> text1 :
#> [1] "sentence"
#> 
#> text2 :
#> [1] "second" "sentence"
#> 
tokens_select(toks, c("is", "a", "this"), selection = "remove", padding = TRUE)
#> tokens from 2 documents.
#> text1 :
#> [1] "" "" "" "sentence"
#> 
#> text2 :
#> [1] "" "" "" "second" "sentence"
#> 
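
Because padding preserves token positions, it changes window-based computations such as feature co-occurrence counts. The following sketch (using fcm() on an invented sentence, not part of the original examples) illustrates this:

# with padding = TRUE, the removed stop words leave empty placeholders, so
# "sat" and "mat" stay two positions apart and do not co-occur within a
# window of 1; without padding they become adjacent and do co-occur
toks2 <- tokens("the cat sat on the mat")
fcm(tokens_remove(toks2, stopwords("english"), padding = TRUE),
    context = "window", window = 1)
fcm(tokens_remove(toks2, stopwords("english"), padding = FALSE),
    context = "window", window = 1)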
# how case_insensitive works
tokens_select(toks, c("is", "a", "this"), selection = "remove", case_insensitive = TRUE)
#> tokens from 2 documents.
#> text1 :
#> [1] "sentence"
#> 
#> text2 :
#> [1] "second" "sentence"
#> 
tokens_select(toks, c("is", "a", "this"), selection = "remove", case_insensitive = FALSE)
#> tokens from 2 documents.
#> text1 :
#> [1] "This" "sentence"
#> 
#> text2 :
#> [1] "This" "second" "sentence"
#> 
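
The valuetype argument determines how pattern is interpreted; the sketch below (not part of the original example set) contrasts the three matching modes on the same toks object:

# "glob" (the default) uses wildcard matching, "regex" uses regular
# expressions, and "fixed" requires an exact match
tokens_select(toks, "s*", valuetype = "glob")       # keeps "sentence", "second"
tokens_select(toks, "^s.+e$", valuetype = "regex")  # keeps "sentence" only
tokens_select(toks, "sentence", valuetype = "fixed")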
## tokens_remove example
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.",
         wash2 <- "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.")
tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))
#> tokens from 2 documents.
#> text1 :
#> [1] "Fellow" "citizens" "called" "upon" "voice"
#> [6] "country" "execute" "functions" "Chief" "Magistrate"
#> 
#> text2 :
#> [1] "occasion" "proper" "shall" "arrive"
#> [5] "shall" "endeavor" "express" "high"
#> [9] "sense" "entertain" "distinguished" "honor"
#> 
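
pattern may also be a dictionary, in which case the dictionary's values supply the selection patterns. A minimal sketch (the dictionary below is invented for illustration):

# keep only tokens matching the dictionary values
dict <- dictionary(list(people = c("fellow", "citizens"),
                        office = c("chief", "magistrate")))
tokens_select(tokens(txt, remove_punct = TRUE), dict, selection = "keep")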