Returns document subsets of a tokens that meet certain conditions, including
direct logical operations on docvars (document-level variables).
tokens_subset()
functions identically to subset.data.frame()
, using
non-standard evaluation to evaluate conditions based on the docvars in the
tokens.
tokens_subset(
x,
subset,
min_ntoken = NULL,
max_ntoken = NULL,
drop_docid = TRUE,
verbose = quanteda_options("verbose"),
...
)
tokens object to be subsetted.
logical expression indicating the documents to keep: missing values are taken as false.
minimum and maximum lengths of the documents to extract.
if TRUE
, docid
for documents are removed as the result
of subsetting.
if TRUE
print the number of tokens and documents before and
after the function is applied. The number of tokens does not include paddings.
not used
tokens object, with a subset of documents (and docvars) selected according to arguments
corp <- corpus(c(d1 = "a b c d", d2 = "a a b e",
d3 = "b b c e", d4 = "e e f a b"),
docvars = data.frame(grp = c(1, 1, 2, 3)))
toks <- tokens(corp)
# selecting on a docvars condition
tokens_subset(toks, grp > 1)
#> Tokens consisting of 2 documents and 1 docvar.
#> d3 :
#> [1] "b" "b" "c" "e"
#>
#> d4 :
#> [1] "e" "e" "f" "a" "b"
#>
# selecting on a supplied vector
tokens_subset(toks, c(TRUE, FALSE, TRUE, FALSE))
#> Tokens consisting of 2 documents and 1 docvar.
#> d1 :
#> [1] "a" "b" "c" "d"
#>
#> d3 :
#> [1] "b" "b" "c" "e"
#>