This function retrieves the stopword list of the type specified in the kind
argument and returns it as a character vector. The default is
English.
stopwords(kind = quanteda_options("language_stopwords"))
kind | The pre-set kind of stopwords (as a character string). Allowed
values are |
---|
The English stopwords are taken from the SMART information retrieval system (obtained from Lewis, David D., et al. "RCV1: A new benchmark collection for text categorization research." Journal of Machine Learning Research 5 (April 2004): 361-397).
Additional stopword lists are taken from the Snowball stemmer project in different languages (see http://snowballstem.org/projects.html).
The Greek stopwords were supplied by Carsten Schwemmer (see GitHub issue #282).
The Chinese stopwords are taken from the Baidu stopword list.
a character vector of stopwords
The stopword list is an internal data object named data_char_stopwords,
which consists of English stopwords from the SMART information retrieval
system (obtained from Lewis et al. (2004)) and a set of stopword lists from
the Snowball stemmer project in different languages (see
http://snowballstem.org/projects.html). See data_char_stopwords for details.
Stop words are an arbitrary choice imposed by the user, and accessing a pre-defined list of words to ignore does not mean that it will perfectly fit your needs. You are strongly encouraged to inspect the list and to make sure it fits your particular requirements.
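One way to act on this advice is to treat the returned character vector as a starting point and adjust it with base R set operations. The sketch below is illustrative only: customize_stopwords() is a hypothetical helper, not part of quanteda, and base_list is a small stand-in vector so that the code runs without the package installed.

```r
# Hypothetical helper (not part of quanteda): adjust a stopword vector by
# dropping entries you want to keep in your text and adding extra ones.
customize_stopwords <- function(base, add = character(), drop = character()) {
  # setdiff() removes the unwanted entries (and any duplicates in base)
  kept <- setdiff(base, drop)
  # union() appends the additions without creating duplicates
  union(kept, add)
}

# Stand-in for a real list (an assumption made so the sketch is self-contained)
base_list <- c("i", "me", "my", "not", "no", "the")

# Keep negations in the text, but also ignore "will" and "mr"
custom <- customize_stopwords(base_list,
                              add = c("will", "mr"),
                              drop = c("not", "no"))
print(custom)
```

With quanteda loaded, the same helper could take stopwords("english") as its base argument, and the result could be passed to tokens_remove() in place of the built-in list.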
head(stopwords("english"))
#> [1] "i" "me" "my" "myself" "we" "our"
head(stopwords("italian"))
#> [1] "ad" "al" "allo" "ai" "agli" "all"
head(stopwords("arabic"))
#> [1] "فى" "في" "كل" "لم" "لن" "له"
head(stopwords("chinese"))
#> [1] "按" "按照" "俺" "俺" "们" "阿"
head(stopwords("SMART"))
#> [1] "a" "a's" "able" "about" "above" "according"

# adding to the built-in stopword list
toks <- tokens("The judge will sentence Mr. Adams to nine years in prison", remove_punct = TRUE)
tokens_remove(toks, c(stopwords("english"), "will", "mr", "nine"))
#> tokens from 1 document.
#> text1 :
#> [1] "judge" "sentence" "Adams" "years" "prison"
#>