Extensions of base R functions for corpus objects.
# S3 method for corpus
c1 + c2
# S3 method for corpus
c(..., recursive = FALSE)
# S3 method for corpus
x[i, drop_docid = TRUE]
# S3 method for summary.corpus
print(x, ...)
c1: corpus one to be added
c2: corpus two to be added
recursive: logical used by the c() method, always set to FALSE
x: a corpus object
i: document names or indices for documents to extract
drop_docid: if TRUE, drop docid for documents removed as the result of extraction
The + operator and the c() method return a corpus() object.
Indexing a corpus works in three ways, as of v2.x.x:
[ returns a subsetted corpus
[[ returns the textual contents of a subsetted corpus (similar to as.character())
$ returns a vector containing the single named docvar
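The three indexing forms above can be sketched on a small corpus. This is a minimal illustration, assuming a recent quanteda (v2 or later) is installed; the corpus `corp`, its document names and the `year` docvar are made up for the example.

```r
library(quanteda)

# A toy corpus with one document variable (hypothetical data)
corp <- corpus(c(d1 = "Some text.", d2 = "More text."),
               docvars = data.frame(year = c(2001, 2002)))

sub <- corp["d2"]      # `[` : a corpus holding only the selected document
txt <- corp[["d2"]]    # `[[`: the text of the selected document
yrs <- corp$year       # `$` : the values of the docvar named "year"

ndoc(sub)              # number of documents left after subsetting
```

Numeric indices work in place of document names in the same positions, as the examples at the end of this page show.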
The + operator for a corpus object will combine two corpus objects, resolving any non-matching docvars() by making them into NA values for the corpus lacking that field. Corpus-level metadata is concatenated, except for source and notes, which are stamped with information pertaining to the creation of the new joined corpus.
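The handling of non-matching docvars can be illustrated with two small corpora whose document variables differ. This is a sketch under the behaviour described above, not output copied from the package; the corpora `corp_a` and `corp_b` and their docvars `year` and `party` are hypothetical.

```r
library(quanteda)

# Two toy corpora with different docvars and distinct document names
corp_a <- corpus(c(a1 = "First text.", a2 = "Second text."),
                 docvars = data.frame(year = c(2001, 2002)))
corp_b <- corpus(c(b1 = "Third text."),
                 docvars = data.frame(party = "Green"))

corp_ab <- corp_a + corp_b

ndoc(corp_ab)       # documents from both corpora
docvars(corp_ab)    # per the description above: year should be NA for b1,
                    # and party should be NA for a1 and a2
```

The document names are kept distinct here on purpose, so that the combined corpus has an unambiguous name for every document.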
The c() method is also defined for corpus class objects, and provides an easy way to combine multiple corpus objects.
Future revisions of quanteda need to address some issues concerning the use of factors to store document variables and metadata. Currently most or all of these are not recorded as factors, because we use stringsAsFactors = FALSE in the data.frame() calls that create and store the document-level information: the texts should always be stored as character vectors and never as factors.
# concatenate corpus objects
corp1 <- corpus(data_char_ukimmig2010[1:2])
corp2 <- corpus(data_char_ukimmig2010[3:4])
corp3 <- corpus(data_char_ukimmig2010[5:6])
summary(c(corp1, corp2, corp3))
#> Corpus consisting of 6 documents, showing 6 documents:
#>
#> Text Types Tokens Sentences
#> BNP 1125 3280 88
#> Coalition 142 260 4
#> Conservative 251 499 15
#> Greens 322 679 21
#> Labour 298 683 29
#> LibDem 251 483 14
#>
# two ways to index corpus elements
data_corpus_inaugural["1793-Washington"]
#> Corpus consisting of 1 document and 4 docvars.
#> 1793-Washington :
#> "Fellow citizens, I am again called upon by the voice of my c..."
#>
data_corpus_inaugural[2]
#> Corpus consisting of 1 document and 4 docvars.
#> 1793-Washington :
#> "Fellow citizens, I am again called upon by the voice of my c..."
#>
# return the text itself
data_corpus_inaugural[["1793-Washington"]]
#> [1] "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.\n\nPrevious to the execution of any official act of the President the Constitution requires an oath of office. This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony.\n\n "