Construct a sparse document-feature matrix from a tokens object or another dfm object (passing a character or corpus object directly is deprecated).
dfm(
  x,
  tolower = TRUE,
  remove_padding = FALSE,
  verbose = quanteda_options("verbose"),
  ...
)
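Argument | Description |
---|---|
x | a tokens or dfm object |
tolower | logical; if TRUE, convert all features to lowercase |
remove_padding | logical; if TRUE, remove the "pads" left as empty tokens after calling tokens() or tokens_remove() with padding = TRUE |
verbose | logical; display messages if TRUE |
... | not used directly |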
Returns a dfm object.
In quanteda v3, many convenience functions formerly available in dfm() were deprecated. Formerly, dfm() could be called directly on a character or corpus object, but we now steer users to tokenise their inputs first using tokens(). Other convenience arguments to dfm() were also removed, such as select, dictionary, thesaurus, and groups. All of these functions are available elsewhere, e.g. through dfm_group().
See news(Version >= "2.9", package = "quanteda") for details.
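The tokens-first workflow described above might look roughly like the following sketch. The example texts, the removed words, the grouping vector, and the dictionary keys are all made up for illustration; only the quanteda functions themselves (tokens(), tokens_remove(), dfm(), dfm_group(), dictionary(), dfm_lookup()) come from the package.

```r
library(quanteda)

# Hypothetical example texts (not from the package)
txt <- c(d1 = "Taxes fell and growth rose.",
         d2 = "Growth slowed while taxes rose.",
         d3 = "The court upheld the law.")

# 1. Tokenise first, rather than passing character input to dfm()
toks <- tokens(txt, remove_punct = TRUE)

# 2. Select or remove features on the tokens object
#    (stands in for the old select/remove convenience arguments)
toks <- tokens_remove(toks, pattern = c("the", "and", "while"))

# 3. Build the dfm from the tokens
dfmat <- dfm(toks)

# 4. Group documents afterwards (stands in for the old groups argument);
#    the grouping labels here are an illustrative assumption
dfmat_grouped <- dfm_group(dfmat, groups = c("economy", "economy", "law"))

# 5. Apply a dictionary afterwards (stands in for the old dictionary argument);
#    the dictionary keys and values are illustrative
dict <- dictionary(list(economy = c("tax*", "growth"),
                        legal   = c("court", "law")))
dfmat_dict <- dfm_lookup(dfmat, dictionary = dict)
```

Keeping each step in its own function call makes it easier to inspect the intermediate tokens and dfm objects before moving on.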
## for a corpus
toks <- data_corpus_inaugural %>%
  corpus_subset(Year > 1980) %>%
  tokens()
dfm(toks)
#> Document-feature matrix of: 11 documents, 3,426 features (78.47% sparse) and 4 docvars.
#>               features
#> docs           senator hatfield   , mr   . chief justice president vice bush
#>   1981-Reagan        2        1 174  3 130     1       1         5    2    1
#>   1985-Reagan        4        0 177  0 124     1       1         3    1    1
#>   1989-Bush          2        0 166  6 142     1       2         6    1    0
#>   1993-Clinton       0        0 139  0  81     0       0         2    0    1
#>   1997-Clinton       0        0 131  0 108     0       1         1    0    0
#>   2001-Bush          0        0 110  0  96     0       3         3    1    0
#> [ reached max_ndoc ... 5 more documents, reached max_nfeat ... 3,416 more features ]

# removal options
toks <- tokens(c("a b c", "A B C D")) %>%
  tokens_remove("b", padding = TRUE)
toks
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "a" ""  "c"
#>
#> text2 :
#> [1] "A" ""  "C" "D"
#>
dfm(toks)
#> Document-feature matrix of: 2 documents, 4 features (12.50% sparse) and 0 docvars.
#>        features
#> docs      a c d
#>   text1 1 1 1 0
#>   text2 1 1 1 1
dfm(toks, remove = "") # remove "pads"
#> Warning: 'remove' is deprecated; use dfm_remove() instead
#> Document-feature matrix of: 2 documents, 3 features (16.67% sparse) and 0 docvars.
#>        features
#> docs    a c d
#>   text1 1 1 0
#>   text2 1 1 1

# preserving case
dfm(toks, tolower = FALSE)
#> Document-feature matrix of: 2 documents, 6 features (41.67% sparse) and 0 docvars.
#>        features
#> docs      a c A C D
#>   text1 1 1 1 0 0 0
#>   text2 1 0 0 1 1 1
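The pad-removal example above triggers a deprecation warning because it uses the old remove argument. As a minimal sketch, assuming the remove_padding argument behaves as described in the usage signature, the same pads can be dropped without the warning. The object name dfmat_nopad is illustrative.

```r
library(quanteda)

# Same toy tokens as in the example above, written without the pipe
toks <- tokens_remove(tokens(c("a b c", "A B C D")), "b", padding = TRUE)

# Drop the empty-string "pad" feature while constructing the dfm
dfmat_nopad <- dfm(toks, remove_padding = TRUE)

# Inspect the remaining features
featnames(dfmat_nopad)
```

This relies only on the documented argument, so it avoids the deprecated code path.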