Construct a sparse document-feature matrix from a tokens or dfm object.
dfm(
x,
tolower = TRUE,
remove_padding = FALSE,
verbose = quanteda_options("verbose"),
...
)
a tokens or dfm object.
convert all features to lowercase.
logical; if TRUE
, remove the "pads" left as empty tokens after
calling tokens()
or tokens_remove()
with padding = TRUE
.
display messages if TRUE
.
not used.
a dfm object
In quanteda v4, many convenience functions formerly available in
dfm()
were removed.
## for a corpus
toks <- data_corpus_inaugural |>
corpus_subset(Year > 1980) |>
tokens()
dfm(toks)
#> Document-feature matrix of: 11 documents, 3,426 features (78.46% sparse) and 4 docvars.
#> features
#> docs senator hatfield , mr . chief justice president vice bush
#> 1981-Reagan 2 1 174 3 130 1 1 5 2 1
#> 1985-Reagan 4 0 177 0 124 1 1 3 1 1
#> 1989-Bush 2 0 166 6 142 1 2 6 1 0
#> 1993-Clinton 0 0 139 0 81 0 0 2 0 1
#> 1997-Clinton 0 0 131 0 108 0 1 1 0 0
#> 2001-Bush 0 0 110 0 96 0 3 3 1 0
#> [ reached max_ndoc ... 5 more documents, reached max_nfeat ... 3,416 more features ]
# removal options
toks <- tokens(c("a b c", "A B C D")) |>
tokens_remove("b", padding = TRUE)
toks
#> Tokens consisting of 2 documents.
#> text1 :
#> [1] "a" "" "c"
#>
#> text2 :
#> [1] "A" "" "C" "D"
#>
dfm(toks)
#> Document-feature matrix of: 2 documents, 4 features (12.50% sparse) and 0 docvars.
#> features
#> docs a c d
#> text1 1 1 1 0
#> text2 1 1 1 1
dfm(toks) |>
dfm_remove(pattern = "") # remove "pads"
#> Document-feature matrix of: 2 documents, 3 features (16.67% sparse) and 0 docvars.
#> features
#> docs a c d
#> text1 1 1 0
#> text2 1 1 1
# preserving case
dfm(toks, tolower = FALSE)
#> Document-feature matrix of: 2 documents, 6 features (41.67% sparse) and 0 docvars.
#> features
#> docs a c A C D
#> text1 1 1 1 0 0 0
#> text2 1 0 0 1 1 1