Returns subsets of a corpus that meet certain conditions, including direct
logical operations on docvars (document-level variables). corpus_subset
functions identically to subset.data.frame(), using non-standard
evaluation to evaluate conditions based on the docvars in the corpus.
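The parallel with subset() on a data frame can be sketched as follows. The toy corpus and its docvars (`year`, `party`) are made up for illustration and are not part of the package data.

```r
library(quanteda)

# Hypothetical toy corpus with two docvars (illustration only)
corp <- corpus(
  c(d1 = "First text.", d2 = "Second text.", d3 = "Third text."),
  docvars = data.frame(year = c(1990, 2000, 2010),
                       party = c("A", "B", "A"))
)

# The condition refers to docvars by bare name (non-standard evaluation)
kept <- corpus_subset(corp, year > 1995 & party == "A")
docnames(kept)

# The same condition applied to the docvars data frame with base subset()
dv_kept <- subset(docvars(corp), year > 1995 & party == "A")
nrow(dv_kept)
```

Both calls should select the same single document, since the condition is evaluated against the same document-level variables.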
corpus_subset(x, subset, drop_docid = TRUE, ...)
x: corpus object to be subsetted.
subset: logical expression indicating the documents to keep: missing values are taken as false.
drop_docid: if TRUE, docid for documents are removed as the result of subsetting.
...: not used
A corpus object containing the subset of documents (and their docvars) selected according to the arguments.
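As a minimal sketch of how missing values in the subset condition are handled, the example below uses a made-up corpus with a hypothetical `score` docvar that is NA for one document. Because missing values are taken as false, that document should be dropped.

```r
library(quanteda)

# Hypothetical corpus; `score` is an illustrative docvar with one NA
corp_na <- corpus(
  c(a = "One.", b = "Two.", c = "Three."),
  docvars = data.frame(score = c(1, NA, 3))
)

# For document "b" the condition evaluates to NA, which is treated as FALSE
kept_na <- corpus_subset(corp_na, score > 0)
docnames(kept_na)
ndoc(kept_na)
```

To keep documents with a missing value, the condition would need to say so explicitly, for example `is.na(score) | score > 0`.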
summary(corpus_subset(data_corpus_inaugural, Year > 1980))
#> Corpus consisting of 11 documents, showing 11 documents:
#>
#>         Text Types Tokens Sentences Year President FirstName      Party
#>  1981-Reagan   902   2781       129 1981    Reagan    Ronald Republican
#>  1985-Reagan   925   2909       123 1985    Reagan    Ronald Republican
#>    1989-Bush   795   2674       141 1989      Bush    George Republican
#> 1993-Clinton   642   1833        81 1993   Clinton      Bill Democratic
#> 1997-Clinton   773   2436       111 1997   Clinton      Bill Democratic
#>    2001-Bush   621   1806        97 2001      Bush George W. Republican
#>    2005-Bush   772   2312        99 2005      Bush George W. Republican
#>   2009-Obama   938   2689       110 2009     Obama    Barack Democratic
#>   2013-Obama   814   2317        88 2013     Obama    Barack Democratic
#>   2017-Trump   582   1660        88 2017     Trump Donald J. Republican
#>   2021-Biden   812   2766       216 2021     Biden Joseph R. Democratic
#>
summary(corpus_subset(data_corpus_inaugural, Year > 1930 & President == "Roosevelt"))
#> Corpus consisting of 4 documents, showing 4 documents:
#>
#>           Text Types Tokens Sentences Year President   FirstName      Party
#> 1933-Roosevelt   743   2057        85 1933 Roosevelt Franklin D. Democratic
#> 1937-Roosevelt   725   1989        96 1937 Roosevelt Franklin D. Democratic
#> 1941-Roosevelt   526   1519        68 1941 Roosevelt Franklin D. Democratic
#> 1945-Roosevelt   275    633        27 1945 Roosevelt Franklin D. Democratic
#>