Extensions of base R functions for corpus objects.
# S3 method for corpus +(c1, c2) # S3 method for corpus c(..., recursive = FALSE) # S3 method for corpus [(x, i, drop_docid = TRUE) # S3 method for summary.corpus print(x, ...)
corpus one to be added
corpus two to be added
logical used by
a corpus object
document names or indices for documents to extract.
Indexing a corpus works in three ways, as of v2.x.x:
[ returns a subsetted corpus
[[ returns the textual contents of a subsetted corpus (similar to
$ returns a vector containing the single named docvars
+ operator for a corpus object will combine two corpus
objects, resolving any non-matching
docvars() by making them
NA values for the corpus lacking that field. Corpus-level meta
data is concatenated, except for
notes, which are
stamped with information pertaining to the creation of the new joined
c() operator is also defined for corpus class objects, and provides
an easy way to combine multiple corpus objects.
There are some issues that need to be addressed in future revisions of
quanteda concerning the use of factors to store document variables and
meta-data. Currently most or all of these are not recorded as factors,
because we use
stringsAsFactors=FALSE in the
data.frame() calls that are used to create and store the
document-level information, because the texts should always be stored as
character vectors and never as factors.
# concatenate corpus objects corp1 <- corpus(data_char_ukimmig2010[1:2]) corp2 <- corpus(data_char_ukimmig2010[3:4]) corp3 <- corpus(data_char_ukimmig2010[5:6]) summary(c(corp1, corp2, corp3)) #> Corpus consisting of 6 documents, showing 6 documents: #> #> Text Types Tokens Sentences #> BNP 1125 3280 88 #> Coalition 142 260 4 #> Conservative 251 499 15 #> Greens 322 677 21 #> Labour 298 680 29 #> LibDem 251 483 14 #> # two ways to index corpus elements data_corpus_inaugural["1793-Washington"] #> Corpus consisting of 1 document and 4 docvars. #> 1793-Washington : #> "Fellow citizens, I am again called upon by the voice of my c..." #> data_corpus_inaugural #> Corpus consisting of 1 document and 4 docvars. #> 1793-Washington : #> "Fellow citizens, I am again called upon by the voice of my c..." #> # return the text itself data_corpus_inaugural[["1793-Washington"]] #>  "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.\n\nPrevious to the execution of any official act of the President the Constitution requires an oath of office. This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony.\n\n "