These functions select or discard tokens from a tokens object. For convenience, the functions `tokens_remove` and `tokens_keep` are defined as shortcuts for `tokens_select(x, pattern, selection = "remove")` and `tokens_select(x, pattern, selection = "keep")`, respectively. The most common usage of `tokens_remove` will be to eliminate stop words from a text or text-based object, while the most common use of `tokens_select` will be to select only those tokens with positive pattern matches from a list of regular expressions, including a dictionary. `startpos` and `endpos` determine the positions of tokens searched for `pattern`, and the areas affected are expanded by `window`.
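For instance, a minimal sketch of restricting the search span and of window expansion (the tokens matched here are arbitrary choices, not part of the package's own examples):

```r
library("quanteda")

toks <- as.tokens(list(letters))
# search only from the 5th through the second-to-last token:
# "b" (position 2) falls outside the span and is not matched,
# while "y" (position 25) is
tokens_keep(toks, c("b", "y"), startpos = 5, endpos = -2)

# window = 1 also keeps the tokens adjacent to each match
tokens_keep(toks, "y", window = 1)
```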
```r
tokens_select(
  x,
  pattern,
  selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  padding = FALSE,
  window = 0,
  min_nchar = NULL,
  max_nchar = NULL,
  startpos = 1L,
  endpos = -1L,
  verbose = quanteda_options("verbose")
)

tokens_remove(x, ...)

tokens_keep(x, ...)
```
Argument | Description
---|---
`x` | tokens object whose token elements will be removed or kept
`pattern` | a character vector, list of character vectors, dictionary, or collocations object. See `pattern` for details.
`selection` | whether to `"keep"` or `"remove"` the tokens matching `pattern`
`valuetype` | the type of pattern matching: `"glob"` for "glob"-style wildcard expressions; `"regex"` for regular expressions; or `"fixed"` for exact matching. See `valuetype` for details.
`case_insensitive` | logical; if `TRUE`, ignore case when matching a `pattern` or dictionary values
`padding` | if `TRUE`, leave an empty string where the removed tokens previously existed. This is useful if a positional match is needed between the pre- and post-selected tokens, for instance if a window of adjacency needs to be computed.
`window` | integer of length 1 or 2; the size of the window of tokens adjacent to `pattern` that will be selected. The window is symmetric unless a vector of two elements is supplied, in which case the first element is the token window before `pattern`, and the second is the token window after. The default is 0, meaning that only the pattern-matched token(s) are selected, with no adjacent terms. Terms from overlapping windows are never double-counted, but simply returned in the pattern match. This is because `tokens_select()` never redefines the document units; for this, see `kwic()`.
`min_nchar, max_nchar` | optional numerics specifying the minimum and maximum length in characters for tokens to be removed or kept; defaults are `NULL` for no limits. These are applied after (and hence, in addition to) any selection based on pattern matches. See the sketch after this table.
`startpos, endpos` | integer; position of tokens in documents where pattern matching starts and ends, where 1 is the first token in a document. For negative indexes, counting starts at the ending token of the document, so that -1 denotes the last token in the document, -2 the second to last, etc. When the length of the vector is equal to `ndoc(x)`, tokens are selected at different positions in each document.
`verbose` | if `TRUE`, print messages reporting how many tokens were selected or removed
`...` | additional arguments passed by `tokens_remove` and `tokens_keep` to `tokens_select`. Cannot include `selection`.
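The examples below do not exercise `min_nchar` and `max_nchar`, so here is a minimal sketch of length-based filtering; it assumes the common idiom of matching every token with the glob `"*"` so that selection depends on length alone:

```r
library("quanteda")

toks <- tokens("a quick brown fox jumps over it")
# keep only tokens of 3 to 5 characters; the glob "*" matches every
# token, so only the length limits do any filtering here
tokens_select(toks, "*", min_nchar = 3, max_nchar = 5)
```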
Returns a tokens object with tokens selected or removed based on their match to `pattern`.
```r
library("quanteda")

## tokens_select with simple examples
toks <- as.tokens(list(letters, LETTERS))
tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = FALSE)
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "b" "e" "f"
#>
#> text2 :
#> [1] "B" "E" "F"

tokens_select(toks, c("b", "e", "f"), selection = "keep", padding = TRUE)
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] ""  "b" ""  ""  "e" "f" ""  ""  ""  ""  ""  ""
#> [ ... and 14 more ]
#>
#> text2 :
#>  [1] ""  "B" ""  ""  "E" "F" ""  ""  ""  ""  ""  ""
#> [ ... and 14 more ]

tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = FALSE)
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "a" "c" "d" "g" "h" "i" "j" "k" "l" "m" "n" "o"
#> [ ... and 11 more ]
#>
#> text2 :
#>  [1] "A" "C" "D" "G" "H" "I" "J" "K" "L" "M" "N" "O"
#> [ ... and 11 more ]

tokens_select(toks, c("b", "e", "f"), selection = "remove", padding = TRUE)
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "a" ""  "c" "d" ""  ""  "g" "h" "i" "j" "k" "l"
#> [ ... and 14 more ]
#>
#> text2 :
#>  [1] "A" ""  "C" "D" ""  ""  "G" "H" "I" "J" "K" "L"
#> [ ... and 14 more ]

# how case_insensitive works
tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = TRUE)
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "a" "c" "d" "g" "h" "i" "j" "k" "l" "m" "n" "o"
#> [ ... and 11 more ]
#>
#> text2 :
#>  [1] "A" "C" "D" "G" "H" "I" "J" "K" "L" "M" "N" "O"
#> [ ... and 11 more ]

tokens_select(toks, c("b", "e", "f"), selection = "remove", case_insensitive = FALSE)
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "a" "c" "d" "g" "h" "i" "j" "k" "l" "m" "n" "o"
#> [ ... and 11 more ]
#>
#> text2 :
#>  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L"
#> [ ... and 14 more ]

# how window works
tokens_select(toks, c("b", "f"), selection = "keep", window = 1)
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "a" "b" "c" "e" "f" "g"
#>
#> text2 :
#> [1] "A" "B" "C" "E" "F" "G"

tokens_select(toks, c("b", "f"), selection = "remove", window = 1)
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "d" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r"
#> [ ... and 8 more ]
#>
#> text2 :
#>  [1] "D" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R"
#> [ ... and 8 more ]

tokens_remove(toks, c("b", "f"), window = c(0, 1))
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "a" "d" "e" "h" "i" "j" "k" "l" "m" "n" "o" "p"
#> [ ... and 10 more ]
#>
#> text2 :
#>  [1] "A" "D" "E" "H" "I" "J" "K" "L" "M" "N" "O" "P"
#> [ ... and 10 more ]

tokens_select(toks, c("e", "g"), selection = "keep", window = c(1, 2))
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "d" "e" "f" "g" "h" "i"
#>
#> text2 :
#> [1] "D" "E" "F" "G" "H" "I"

# tokens_remove example: remove stopwords
txt <- c(wash1 <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate.",
         wash2 <- "When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor.")
tokens_remove(tokens(txt, remove_punct = TRUE), stopwords("english"))
#> Tokens consisting of 2 documents.
#> text1 :
#>  [1] "Fellow"     "citizens"   "called"     "upon"       "voice"
#>  [6] "country"    "execute"    "functions"  "Chief"      "Magistrate"
#>
#> text2 :
#>  [1] "occasion"      "proper"        "shall"         "arrive"
#>  [5] "shall"         "endeavor"      "express"       "high"
#>  [9] "sense"         "entertain"     "distinguished" "honor"

# tokens_keep example: keep only two-letter words
tokens_keep(tokens(txt, remove_punct = TRUE), "??")
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "am" "by" "of" "my" "to" "of"
#>
#> text2 :
#> [1] "it" "to" "of"
```
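The description notes that `pattern` may also be a dictionary, which none of the examples above demonstrates. A minimal sketch, using an invented one-key dictionary:

```r
library("quanteda")

# hypothetical dictionary for illustration; glob values match inflected forms
dict <- dictionary(list(praise = c("good*", "excellent")))
toks2 <- tokens("An excellent law is a good thing for good governance")
# keep only the tokens matching any dictionary value
tokens_keep(toks2, dict)
```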