Takes a random sample of documents of the specified size from a corpus, with or without replacement. It works just as base R's sample does, applied to the documents and their associated document-level variables.

corpus_sample(x, size = ndoc(x), replace = FALSE, prob = NULL,
  by = NULL)

Arguments

x

a corpus object whose documents will be sampled

size

a positive number, the number of documents to select. When used with groups (see by), either a single number to select from each group, or a vector with one element per group giving the number to select from each group. By defining a size larger than the number of documents, it is possible to oversample groups.

replace

Should sampling be with replacement?

prob

a vector of probability weights for selecting the documents being sampled. Cannot be used together with by.

by

a grouping variable for sampling. Useful for resampling sub-document units such as sentences, for instance by specifying by = "document".

Value

A corpus object with the number of documents equal to size, drawn from the corpus x. The returned corpus object contains all of the metadata of the original corpus, and the same document variables for the documents selected.
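The way document variables follow the selected documents can be sketched in base R alone. The example below is an illustration, not quanteda's implementation: the data frame `docs` and the weight vector `w` are hypothetical stand-ins for a corpus and for the prob argument.

```r
# Toy stand-in for a corpus: one row per document, with a text and
# one document-level variable (all names here are hypothetical).
docs <- data.frame(
  text = c("alpha", "beta", "gamma", "delta"),
  year = c(1990, 2000, 2010, 2020),
  stringsAsFactors = FALSE
)
rownames(docs) <- paste0("doc", 1:4)

set.seed(42)

# Probability weights, one per document; a weight of zero means the
# document can never be drawn.
w <- c(0, 1, 1, 2)

# Sample document indices, then subset rows so that each selected
# document keeps its own document-level variables.
i <- sample(nrow(docs), size = 3, replace = FALSE, prob = w)
sampled <- docs[i, ]
print(sampled)
```

Because rows are subset by the sampled indices, each sampled text stays paired with its own `year` value, which mirrors the behaviour described for the returned corpus.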

Examples

set.seed(2000)
# sampling from a corpus
summary(corpus_sample(data_corpus_inaugural, 5))
#> Corpus consisting of 5 documents:
#>
#>            Text Types Tokens Sentences Year President   FirstName
#>      1869-Grant   485   1235        40 1869     Grant  Ulysses S.
#>  1945-Roosevelt   275    647        26 1945 Roosevelt Franklin D.
#>     1985-Reagan   925   2921       123 1985    Reagan      Ronald
#>  1905-Roosevelt   404   1079        33 1905 Roosevelt    Theodore
#>    1997-Clinton   773   2449       111 1997   Clinton        Bill
#>
#> Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
#> Created: Tue Jun 13 14:51:47 2017
#> Notes: http://www.presidency.ucsb.edu/inaugurals.php
summary(corpus_sample(data_corpus_inaugural, 10, replace = TRUE))
#> Corpus consisting of 10 documents:
#>
#>            Text Types Tokens Sentences Year President     FirstName
#>       1845-Polk  1334   5193       153 1845      Polk    James Knox
#>   1841-Harrison  1896   9144       210 1841  Harrison William Henry
#>     1845-Polk.1  1334   5193       153 1845      Polk    James Knox
#>      2009-Obama   938   2711       110 2009     Obama        Barack
#>  1805-Jefferson   804   2381        45 1805 Jefferson        Thomas
#>     1929-Hoover  1090   3865       158 1929    Hoover       Herbert
#>    1997-Clinton   773   2449       111 1997   Clinton          Bill
#>    2009-Obama.1   938   2711       110 2009     Obama        Barack
#>   1901-McKinley   854   2437       100 1901  McKinley       William
#>  1937-Roosevelt   725   1997        96 1937 Roosevelt   Franklin D.
#>
#> Source: Gerhard Peters and John T. Woolley. The American Presidency Project.
#> Created: Tue Jun 13 14:51:47 2017
#> Notes: http://www.presidency.ucsb.edu/inaugurals.php
# sampling sentences within document
corp <- corpus(c(one = "Sentence one. Sentence two. Third sentence.",
                 two = "First sentence, doc2. Second sentence, doc2."))
corpsent <- corpus_reshape(corp, to = "sentences")
texts(corpsent)
#>                    one.1                    one.2                    one.3
#>          "Sentence one."          "Sentence two."        "Third sentence."
#>                    two.1                    two.2
#>  "First sentence, doc2." "Second sentence, doc2."
texts(corpus_sample(corpsent, replace = TRUE, by = "document"))
#>                   one.2                   one.1                 one.1.1
#>         "Sentence two."         "Sentence one."         "Sentence one."
#>                   two.1                 two.1.1
#> "First sentence, doc2." "First sentence, doc2."
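
The grouped resampling in the last example can be approximated in base R to show how size and by interact. This is a minimal sketch under stated assumptions, not the code quanteda uses: `sample_by_group` is a hypothetical helper, and its default of drawing as many items as each group contains is an assumption chosen to match the output above.

```r
# Hypothetical helper (not part of quanteda): sample indices within groups.
# `size` may be NULL (draw as many as each group holds), a single number,
# or a vector with one element per group.
sample_by_group <- function(groups, size = NULL, replace = FALSE) {
  # Indices of the items belonging to each group, in sorted group order.
  idx <- split(seq_along(groups), groups)
  if (!is.null(size)) size <- rep_len(size, length(idx))
  picked <- lapply(seq_along(idx), function(i) {
    pool <- idx[[i]]
    n <- if (is.null(size)) length(pool) else size[i]
    # Sample positions within the pool, so that a one-element pool is
    # not misread by sample() as the range 1:pool.
    pool[sample.int(length(pool), n, replace = replace)]
  })
  unlist(picked, use.names = FALSE)
}

set.seed(1)
# Five sentences: three from document "one", two from document "two".
groups <- c("one", "one", "one", "two", "two")

# Resample with replacement within each document.
picked <- sample_by_group(groups, replace = TRUE)
print(groups[picked])

# Per-group sizes: two sentences from "one", one from "two".
picked_sized <- sample_by_group(groups, size = c(2, 1))
print(groups[picked_sized])
```

Sampling inside each group keeps the number of sentences per document fixed, which is why the resampled output above still has three sentences from document one and two from document two.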