Apply a stemmer to words. These functions wrap wordStem so that it can be called without loading the entire SnowballC package. wordStem uses Martin Porter's stemming algorithm and the C libstemmer library generated by Snowball.
tokens_wordstem(x, language = quanteda_options("language_stemmer"))
char_wordstem(
  x,
  language = quanteda_options("language_stemmer"),
  check_whitespace = TRUE
)
dfm_wordstem(x, language = quanteda_options("language_stemmer"))
x: a character, tokens, or dfm object whose words are to be stemmed. If tokenized texts, the tokenization must be word-based.
language: the name of a recognized language, as returned by getStemLanguages, or a two- or three-letter ISO-639 code corresponding to one of these languages (see references for the list of codes)
check_whitespace: logical; if TRUE, stop with a warning when trying to stem inputs containing whitespace
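As a rough illustration of the language argument, the sketch below lists the languages reported by SnowballC::getStemLanguages() and stems the same words once with a full language name and once with a two-letter code. It assumes the SnowballC package is installed alongside quanteda and that "en" is accepted as the ISO-639 code for English; the sample words are chosen only for illustration.

```r
library(quanteda)

# Language names recognised by the underlying stemmer (requires SnowballC)
langs <- SnowballC::getStemLanguages()
print(langs)

words <- c("win", "winning", "wins", "won", "winner")

# Select the same stemmer by name and by ISO-639 code
# ("en" is assumed here to be a valid code for English)
by_name <- char_wordstem(words, language = "english")
by_code <- char_wordstem(words, language = "en")

# Expected to be TRUE if both spellings resolve to the same stemmer
identical(by_name, by_code)
```

Passing the language explicitly, rather than relying on quanteda_options("language_stemmer"), keeps a script's behaviour independent of session-level options.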
tokens_wordstem returns a tokens object whose word types have been stemmed.
char_wordstem returns a character object whose word types have been stemmed.
dfm_wordstem returns a dfm object whose word types (features) have been stemmed, and recombined to consolidate features made equivalent because of stemming.
http://www.iso.org/iso/home/standards/language_codes.htm for the ISO-639 language codes
# example applied to tokens
txt <- c(one = "eating eater eaters eats ate",
         two = "taxing taxes taxed my tax return")
th <- tokens(txt)
tokens_wordstem(th)
#> Tokens consisting of 2 documents.
#> one :
#> [1] "eat" "eater" "eater" "eat" "ate"
#>
#> two :
#> [1] "tax" "tax" "tax" "my" "tax" "return"
#>
# simple example
char_wordstem(c("win", "winning", "wins", "won", "winner"))
#> [1] "win" "win" "win" "won" "winner"
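The check_whitespace argument can be explored with a small sketch like the one below. The input string is hypothetical, and because the exact type of condition raised may differ between versions, tryCatch() is set up to capture either an error or a warning instead of assuming one of them.

```r
library(quanteda)

# A string containing a space, so it is not a single word token
phrase <- "winning teams"

# With the default check_whitespace = TRUE the call is expected to be rejected;
# tryCatch() captures the condition message so the script keeps running
res <- tryCatch(
    char_wordstem(phrase),
    error = function(e) conditionMessage(e),
    warning = function(w) conditionMessage(w)
)
print(res)

# With the check disabled, the whole string is handed to the stemmer as one unit
unchecked <- char_wordstem(phrase, check_whitespace = FALSE)
print(unchecked)
```

In most cases it is safer to tokenize first and call tokens_wordstem(), leaving check_whitespace = FALSE for inputs where embedded whitespace is intentional.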
# example applied to a dfm
(origdfm <- dfm(tokens(txt)))
#> Document-feature matrix of: 2 documents, 11 features (50.00% sparse) and 0 docvars.
#> features
#> docs eating eater eaters eats ate taxing taxes taxed my tax
#> one 1 1 1 1 1 0 0 0 0 0
#> two 0 0 0 0 0 1 1 1 1 1
#> [ reached max_nfeat ... 1 more feature ]
dfm_wordstem(origdfm)
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#> features
#> docs eat eater ate tax my return
#> one 2 2 1 0 0 0
#> two 0 0 0 4 1 1
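To see the consolidation of features more directly, the following sketch compares the dfm before and after stemming. It reuses the txt example from above and relies on the nfeat(), featnames(), and ntoken() helpers from quanteda; the variable name stemdfm is introduced here only for illustration.

```r
library(quanteda)

txt <- c(one = "eating eater eaters eats ate",
         two = "taxing taxes taxed my tax return")
origdfm <- dfm(tokens(txt))
stemdfm <- dfm_wordstem(origdfm)

# Stemming merges features that share a stem, so the feature count drops
nfeat(origdfm)
nfeat(stemdfm)
featnames(stemdfm)

# Counts are recombined rather than discarded, so the number of tokens
# per document should be unchanged
ntoken(origdfm)
ntoken(stemdfm)
all(ntoken(origdfm) == ntoken(stemdfm))
```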