Apply a stemmer to words. This is a wrapper around wordStem that allows stemming without loading the entire SnowballC package. wordStem uses Martin Porter's stemming algorithm and the C libstemmer library generated by Snowball.
tokens_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  verbose = quanteda_options("verbose")
)
char_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  check_whitespace = TRUE
)
dfm_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  verbose = quanteda_options("verbose")
)
a character, tokens, or dfm object whose words are to be stemmed. If tokenized texts, the tokenization must be word-based.
the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes)
if TRUE, print the number of tokens and documents before and after the function is applied. The number of tokens does not include paddings.
logical; if TRUE, stop with a warning when trying to stem inputs containing whitespace
tokens_wordstem() returns a tokens object whose word types have been stemmed.
char_wordstem() returns a character object whose word types have been stemmed.
dfm_wordstem() returns a dfm object whose word types (features) have been stemmed, and recombined to consolidate features made equivalent because of stemming.
https://www.iso.org/iso-639-language-code for the ISO-639 language codes
# example applied to tokens
txt <- c(one = "eating eater eaters eats ate",
two = "taxing taxes taxed my tax return")
th <- tokens(txt)
tokens_wordstem(th)
#> Tokens consisting of 2 documents.
#> one :
#> [1] "eat" "eater" "eater" "eat" "ate"
#>
#> two :
#> [1] "tax" "tax" "tax" "my" "tax" "return"
#>
# simple example
char_wordstem(c("win", "winning", "wins", "won", "winner"))
#> [1] "win" "win" "win" "won" "winner"
# example applied to a dfm
(origdfm <- dfm(tokens(txt)))
#> Document-feature matrix of: 2 documents, 11 features (50.00% sparse) and 0 docvars.
#> features
#> docs eating eater eaters eats ate taxing taxes taxed my tax
#> one 1 1 1 1 1 0 0 0 0 0
#> two 0 0 0 0 0 1 1 1 1 1
#> [ reached max_nfeat ... 1 more feature ]
dfm_wordstem(origdfm)
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#> features
#> docs eat eater ate tax my return
#> one 2 2 1 0 0 0
#> two 0 0 0 4 1 1
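# a further sketch, not part of the original examples: naming the stemmer
# language explicitly and handling untokenized input. The object names
# (words, stems_en, msg) are illustrative only.
library("quanteda")
words <- c("running", "runs", "runner")
stems_en <- char_wordstem(words, language = "english")
# with check_whitespace = TRUE (the default), input containing whitespace is
# rejected; capture the condition message instead of halting the script
msg <- tryCatch(char_wordstem("not tokenized"),
                error = function(e) conditionMessage(e),
                warning = function(w) conditionMessage(w))
# assumption: with the check switched off, the string is handed to the
# stemmer as a single unit rather than being split into words
char_wordstem("not tokenized", check_whitespace = FALSE)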