Apply a stemmer to words. This is a wrapper around wordStem that lets stemming be called without loading the entire SnowballC package. wordStem uses Martin Porter's stemming algorithm and the C libstemmer library generated by Snowball.

tokens_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  verbose = quanteda_options("verbose")
)

char_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  check_whitespace = TRUE
)

dfm_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  verbose = quanteda_options("verbose")
)
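
In all three signatures the default for language is read from quanteda_options("language_stemmer"). The following is a minimal sketch of inspecting and temporarily changing that default. The German input words are illustrative only, and the sketch assumes that "german" is among the languages the stemmer recognizes.

```r
library(quanteda)

# Remember the current package-wide default stemmer language
old <- quanteda_options("language_stemmer")

# Switch the default; calls that omit `language` now pick up this value
quanteda_options(language_stemmer = "german")
stems_de <- char_wordstem(c("laufen", "kinder"))

# Restore the previous default so later calls are unaffected
quanteda_options(language_stemmer = old)
```

Passing language explicitly in a single call avoids changing global state and is usually the simpler choice.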

Arguments

x

a character, tokens, or dfm object whose words are to be stemmed. If tokenized texts, the tokenization must be word-based.

language

the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes)

verbose

if TRUE, print the number of tokens and documents before and after the function is applied. The number of tokens does not include padding.

check_whitespace

logical; if TRUE, stop with an error when trying to stem inputs containing whitespace
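
The arguments above can be sketched as follows. This is an illustration, not a definitive reference: it assumes that "en" is accepted as the ISO-639 code for English, and it catches either an error or a warning from the whitespace check, so it does not depend on which condition is raised.

```r
library(quanteda)

# Language names recognized by the underlying stemmer
langs <- SnowballC::getStemLanguages()

# A full language name and its ISO-639 code should select the same stemmer
by_name <- char_wordstem(c("running", "runs"), language = "english")
by_code <- char_wordstem(c("running", "runs"), language = "en")

# verbose = TRUE reports token and document counts before and after stemming
toks <- tokens("taxing taxes taxed my tax return")
stemmed <- tokens_wordstem(toks, verbose = TRUE)

# With check_whitespace = TRUE (the default), untokenized input is rejected;
# capture the condition message instead of halting the script
msg <- tryCatch(
  char_wordstem("taxing taxes"),
  error = function(e) conditionMessage(e),
  warning = function(w) conditionMessage(w)
)
```

Tokenizing the text before stemming, as in the tokens_wordstem() call, avoids the whitespace check altogether.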

Value

tokens_wordstem() returns a tokens object whose word types have been stemmed.

char_wordstem() returns a character object whose word types have been stemmed.

dfm_wordstem() returns a dfm object whose word types (features) have been stemmed, and recombined to consolidate features made equivalent because of stemming.
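
The consolidation of features described for dfm_wordstem() can be checked with a short sketch. It reuses the texts from the Examples section, and the object names (d, d_stem) are illustrative.

```r
library(quanteda)

toks <- tokens(c(one = "eating eater eaters eats ate",
                 two = "taxing taxes taxed my tax return"))
d <- dfm(toks)
d_stem <- dfm_wordstem(d)

# Fewer features remain once features with the same stem are merged
n_before <- nfeat(d)
n_after <- nfeat(d_stem)

# Counts are combined rather than dropped, so the grand totals agree
total_before <- sum(d)
total_after <- sum(d_stem)

# Stemming the tokens first and then building the dfm yields the same features
same_features <- setequal(featnames(d_stem),
                          featnames(dfm(tokens_wordstem(toks))))
```

Stemming a dfm is typically cheaper than stemming tokens, since it operates on the feature names rather than on every token, but it is only available once word order is no longer needed.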

References

https://snowballstem.org/

https://www.iso.org/iso-639-language-code for the ISO-639 language codes

See also

wordStem

Examples

# example applied to tokens
txt <- c(one = "eating eater eaters eats ate",
         two = "taxing taxes taxed my tax return")
th <- tokens(txt)
tokens_wordstem(th)
#> Tokens consisting of 2 documents.
#> one :
#> [1] "eat"   "eater" "eater" "eat"   "ate"  
#> 
#> two :
#> [1] "tax"    "tax"    "tax"    "my"     "tax"    "return"
#> 

# simple example
char_wordstem(c("win", "winning", "wins", "won", "winner"))
#> [1] "win"    "win"    "win"    "won"    "winner"

# example applied to a dfm
(origdfm <- dfm(tokens(txt)))
#> Document-feature matrix of: 2 documents, 11 features (50.00% sparse) and 0 docvars.
#>      features
#> docs  eating eater eaters eats ate taxing taxes taxed my tax
#>   one      1     1      1    1   1      0     0     0  0   0
#>   two      0     0      0    0   0      1     1     1  1   1
#> [ reached max_nfeat ... 1 more feature ]
dfm_wordstem(origdfm)
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#>      features
#> docs  eat eater ate tax my return
#>   one   2     2   1   0  0      0
#>   two   0     0   0   4  1      1