Returns subsets of a corpus that meet certain conditions, including direct logical operations on docvars (document-level variables). corpus_subset functions identically to subset.data.frame(), using non-standard evaluation to evaluate conditions based on the docvars in the corpus.

corpus_subset(x, subset, drop_docid = TRUE, ...)

Arguments

x

corpus object to be subsetted

subset

logical expression indicating the documents to keep: missing values are taken as false

drop_docid

if TRUE, the docid values of documents removed by subsetting are dropped.

...

not used

Value

corpus object, with a subset of documents (and docvars) selected according to arguments

Examples

summary(corpus_subset(data_corpus_inaugural, Year > 1980))
#> Corpus consisting of 11 documents, showing 11 documents:
#>
#>         Text Types Tokens Sentences Year President FirstName      Party
#>  1981-Reagan   902   2780       129 1981    Reagan    Ronald Republican
#>  1985-Reagan   925   2909       123 1985    Reagan    Ronald Republican
#>    1989-Bush   795   2673       141 1989      Bush    George Republican
#> 1993-Clinton   642   1833        81 1993   Clinton      Bill Democratic
#> 1997-Clinton   773   2436       111 1997   Clinton      Bill Democratic
#>    2001-Bush   621   1806        97 2001      Bush George W. Republican
#>    2005-Bush   772   2312        99 2005      Bush George W. Republican
#>   2009-Obama   938   2689       110 2009     Obama    Barack Democratic
#>   2013-Obama   814   2317        88 2013     Obama    Barack Democratic
#>   2017-Trump   582   1660        88 2017     Trump Donald J. Republican
#>   2021-Biden   811   2766       216 2021     Biden Joseph R. Democratic
#>
summary(corpus_subset(data_corpus_inaugural, Year > 1930 & President == "Roosevelt"))
#> Corpus consisting of 4 documents, showing 4 documents:
#>
#>           Text Types Tokens Sentences Year President   FirstName      Party
#> 1933-Roosevelt   743   2057        85 1933 Roosevelt Franklin D. Democratic
#> 1937-Roosevelt   725   1989        96 1937 Roosevelt Franklin D. Democratic
#> 1941-Roosevelt   526   1519        68 1941 Roosevelt Franklin D. Democratic
#> 1945-Roosevelt   275    633        27 1945 Roosevelt Franklin D. Democratic
#>
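
The examples above use complete docvars. The sketch below illustrates two behaviours described in the arguments: a missing value in the condition is treated as false, and names that are not docvars are resolved in the calling environment, as with subset.data.frame(). The toy corpus, its `score` docvar, and the `cutoff` variable are invented for illustration and are not part of the package data.

```r
library(quanteda)

# Toy corpus: the texts and the `score` docvar are hypothetical.
corp <- corpus(
  c(d1 = "First text.", d2 = "Second text.", d3 = "Third text."),
  docvars = data.frame(score = c(1, NA, 3))
)

# d2 has a missing score, so the condition is NA for it and d2 is not kept.
kept <- corpus_subset(corp, score >= 1)
docnames(kept)

# `cutoff` is not a docvar, so it is looked up in the calling environment.
cutoff <- 2
high <- corpus_subset(corp, score > cutoff)
docnames(high)
```

Under these assumptions, the first call should keep d1 and d3, and the second only d3.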