Returns document subsets of a dfm that meet certain conditions,
including direct logical operations on docvars (document-level variables).
dfm_subset functions identically to subset.data.frame(), using non-standard
evaluation to evaluate conditions based on the docvars in the dfm.
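The non-standard evaluation described above can be illustrated with a small sketch. It reuses the toy corpus from the examples on this page; the variable name `cutoff` and the explicit-indexing comparison are illustrative additions, not part of the original documentation.

```r
library(quanteda)

corp <- corpus(c(d1 = "a b c d", d2 = "a a b e",
                 d3 = "b b c e", d4 = "e e f a b"),
               docvars = data.frame(grp = c(1, 1, 2, 3)))
dfmat <- dfm(tokens(corp))

# `grp` is looked up among the docvars of dfmat;
# `cutoff` (a hypothetical name) is an ordinary variable in the calling environment
cutoff <- 2
by_nse <- dfm_subset(dfmat, grp >= cutoff)

# the same selection written with explicit logical indexing on the docvar
by_index <- dfmat[docvars(dfmat, "grp") >= cutoff, ]

docnames(by_nse)
docnames(by_index)
```

Both calls should select the same documents; the dfm_subset() form simply avoids spelling out the docvars() lookup.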
dfm_subset(
  x,
  subset,
  min_ntoken = NULL,
  max_ntoken = NULL,
  drop_docid = TRUE,
  verbose = quanteda_options("verbose"),
  ...
)
x: dfm object to be subsetted.
subset: logical expression indicating the documents to keep: missing values are taken as false.
min_ntoken, max_ntoken: minimum and maximum lengths, in tokens, of the documents to extract.
drop_docid: if TRUE, docid for documents are removed as the result of subsetting.
verbose: if TRUE, print the number of tokens and documents before and after the function is applied. The number of tokens does not include paddings.
...: not used
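As a minimal sketch of the length arguments, the following filters documents by token count. It assumes an installed quanteda version whose dfm_subset() has the min_ntoken and max_ntoken arguments shown in the usage above, and that subset may be omitted when only a length filter is wanted. The corpus and object names are made up for illustration, and the thresholds are chosen so that no document sits exactly on a boundary.

```r
library(quanteda)

# hypothetical documents with 2, 4 and 7 tokens
corp_len <- corpus(c(short = "a b",
                     medium = "a b c d",
                     long = "a b c d e f g"))
dfmat_len <- dfm(tokens(corp_len))

# token counts per document, for reference
ntoken(dfmat_len)

# keep documents with at least 3 tokens
at_least_3 <- dfm_subset(dfmat_len, min_ntoken = 3)

# keep documents with at most 5 tokens
at_most_5 <- dfm_subset(dfmat_len, max_ntoken = 5)

docnames(at_least_3)
docnames(at_most_5)
```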
dfm object, with a subset of documents (and docvars) selected according to arguments
To select or subset features, see dfm_select() instead.
corp <- corpus(c(d1 = "a b c d", d2 = "a a b e",
                 d3 = "b b c e", d4 = "e e f a b"),
               docvars = data.frame(grp = c(1, 1, 2, 3)))
dfmat <- dfm(tokens(corp))
# selecting on a docvars condition
dfm_subset(dfmat, grp > 1)
#> Document-feature matrix of: 2 documents, 6 features (41.67% sparse) and 1 docvar.
#>     features
#> docs a b c d e f
#>   d3 0 2 1 0 1 0
#>   d4 1 1 0 0 2 1
# selecting on a supplied vector
dfm_subset(dfmat, c(TRUE, FALSE, TRUE, FALSE))
#> Document-feature matrix of: 2 documents, 6 features (41.67% sparse) and 1 docvar.
#>     features
#> docs a b c d e f
#>   d1 1 1 1 1 0 0
#>   d3 0 2 1 0 1 0
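A further sketch, not part of the original examples, shows how a missing docvar value is handled: as stated for the subset argument, missing values are taken as false, so a document whose condition evaluates to NA is dropped. The corpus below and the names `corp_na`, `dfmat_na` and `kept` are hypothetical.

```r
library(quanteda)

# d2 has a missing value for the docvar grp
corp_na <- corpus(c(d1 = "a b c", d2 = "a b", d3 = "c d e"),
                  docvars = data.frame(grp = c(1, NA, 2)))
dfmat_na <- dfm(tokens(corp_na))

# the condition evaluates to TRUE, NA, TRUE; the NA is treated as FALSE
kept <- dfm_subset(dfmat_na, grp >= 1)

docnames(kept)
```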