Apply a stemmer to words. This is a wrapper around wordStem from the SnowballC package, designed so that stemming can be called without loading the entire SnowballC package. wordStem uses Martin Porter's stemming algorithm and the C libstemmer library generated by Snowball.

tokens_wordstem(x, language = quanteda_options("language_stemmer"))

char_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  check_whitespace = TRUE
)

dfm_wordstem(x, language = quanteda_options("language_stemmer"))

Arguments

x

a character, tokens, or dfm object whose words are to be stemmed. If tokenized texts, the tokenization must be word-based.

language

the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes)

check_whitespace

logical; if TRUE, stop with an error when trying to stem inputs containing whitespace

Value

tokens_wordstem returns a tokens object whose word types have been stemmed.

char_wordstem returns a character object whose word types have been stemmed.

dfm_wordstem returns a dfm object whose word types (features) have been stemmed, and recombined to consolidate features made equivalent because of stemming.

Examples

# example applied to tokens
txt <- c(one = "eating eater eaters eats ate",
         two = "taxing taxes taxed my tax return")
th <- tokens(txt)
tokens_wordstem(th)
#> Tokens consisting of 2 documents.
#> one :
#> [1] "eat"   "eater" "eater" "eat"   "ate"  
#> 
#> two :
#> [1] "tax"    "tax"    "tax"    "my"     "tax"    "return"
#> 

# simple example
char_wordstem(c("win", "winning", "wins", "won", "winner"))
#> [1] "win"    "win"    "win"    "won"    "winner"

# example applied to a dfm
(origdfm <- dfm(tokens(txt)))
#> Document-feature matrix of: 2 documents, 11 features (50.00% sparse) and 0 docvars.
#>      features
#> docs  eating eater eaters eats ate taxing taxes taxed my tax
#>   one      1     1      1    1   1      0     0     0  0   0
#>   two      0     0      0    0   0      1     1     1  1   1
#> [ reached max_nfeat ... 1 more feature ]
dfm_wordstem(origdfm)
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#>      features
#> docs  eat eater ate tax my return
#>   one   2     2   1   0  0      0
#>   two   0     0   0   4  1      1
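
The language and check_whitespace arguments are not exercised by the examples above. The following is a minimal sketch of how they might be used. It assumes that "en" is accepted as the ISO-639 code for "english" and that the whitespace check signals an error condition that tryCatch can intercept. The input strings are illustrative only, and no output is shown because it has not been verified here.

```r
library(quanteda)

words <- c("win", "winning", "wins", "won", "winner")

# A full language name and its ISO-639 code should select the same stemmer
# (assumption: "en" maps to "english")
by_name <- char_wordstem(words, language = "english")
by_code <- char_wordstem(words, language = "en")
identical(by_name, by_code)

# With the default check_whitespace = TRUE, an element containing
# whitespace is rejected; capture the condition message instead of halting
res <- tryCatch(
  char_wordstem("winning hands"),
  error = function(e) conditionMessage(e)
)
res

# Turning the check off hands the whole string to the stemmer as one unit,
# which is rarely what is wanted for multi-word input
char_wordstem("winning hands", check_whitespace = FALSE)
```

For multi-word text, tokenizing first with tokens() and then calling tokens_wordstem() is usually the better route, since each word is then stemmed on its own.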