corpus.Rd
Creates a corpus object from available sources. The currently available sources are:
a character vector, consisting of one document per element; if the elements are named, these names will be used as document names.
a data.frame (or a tibble tbl_df), whose default document id is a variable identified by docid_field; the text of the document is a variable identified by text_field; and other variables are imported as document-level meta-data. This matches the format of data.frames constructed by the readtext package.
a tm VCorpus or SimpleCorpus class object, with the fixed metadata fields imported as docvars and corpus-level metadata imported as metacorpus information.
a kwic object, as created by kwic().
a corpus object.
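As a minimal sketch of the character-vector source described above, the following builds one corpus from a named vector and one from an unnamed vector with explicit docnames and docvars. It assumes quanteda is installed; the texts, document names, and the `lang` variable are made up for illustration.

```r
library(quanteda)

# Named character vector: the element names become the document names.
txt <- c(doc_a = "The first short document.",
         doc_b = "A second, slightly longer document.")
corp_named <- corpus(txt)
docnames(corp_named)

# Unnamed vector: supply document names and document variables explicitly.
# (The names and the `lang` column are hypothetical.)
corp_explicit <- corpus(unname(txt),
                        docnames = c("first", "second"),
                        docvars = data.frame(lang = c("en", "en"),
                                             stringsAsFactors = FALSE))
docnames(corp_explicit)
docvars(corp_explicit)
```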
corpus(x, ...)

# S3 method for corpus
corpus(x, docnames = quanteda::docnames(x), docvars = quanteda::docvars(x),
  metacorpus = quanteda::metacorpus(x), compress = FALSE, ...)

# S3 method for character
corpus(x, docnames = NULL, docvars = NULL, metacorpus = NULL,
  compress = FALSE, ...)

# S3 method for data.frame
corpus(x, docid_field = "doc_id", text_field = "text", metacorpus = NULL,
  compress = FALSE, ...)

# S3 method for kwic
corpus(x, split_context = TRUE, extract_keyword = TRUE, ...)

# S3 method for Corpus
corpus(x, metacorpus = NULL, compress = FALSE, ...)
Argument | Description |
---|---|
x | a valid corpus source object |
... | not used directly |
docnames | Names to be assigned to the texts. Defaults to the names of the character vector (if any); |
docvars | a data.frame of document-level variables associated with each text |
metacorpus | a named list containing additional (character) information to be added to the corpus as corpus-level metadata. Special fields recognized in the |
compress | logical; if |
docid_field | optional column index of a document identifier; defaults to "doc_id", but if this is not found, then will use the rownames of the data.frame; if the rownames are not set, it will use the default sequence based on |
text_field | the character name or numeric index of the source |
split_context | logical; if |
extract_keyword | logical; if |
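The interplay of docid_field and text_field for data.frame input can be sketched as follows. This is an illustration only, assuming quanteda is installed; the data.frame `speeches` and its column names are hypothetical.

```r
library(quanteda)

# A hypothetical data.frame with row names but no "doc_id" column.
speeches <- data.frame(speaker = c("A", "B", "C"),
                       body = c("First speech.", "Second speech.", "Third speech."),
                       stringsAsFactors = FALSE,
                       row.names = c("s1", "s2", "s3"))

# No "doc_id" column is present, so per the argument description the
# row names are used as document names.
corp_rn <- corpus(speeches, text_field = "body")
docnames(corp_rn)

# Point docid_field at an explicit identifier column instead.
speeches$id <- c("sp-001", "sp-002", "sp-003")
corp_id <- corpus(speeches, docid_field = "id", text_field = "body")
docnames(corp_id)
names(docvars(corp_id))
```

Columns that are neither the text nor the document id, such as `speaker` here, are carried along as document variables.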
A corpus-class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus.
The texts and document variables of corpus objects can also be accessed using index notation. Indexing a corpus object as a vector will return its text, equivalent to texts(x). Note that this is not the same as subsetting the entire corpus; this should be done using the subset method for a corpus.
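To make the distinction concrete, here is a small sketch of subsetting a whole corpus. It assumes that corpus_subset() is the subsetting function meant here and that quanteda is installed; the texts and the `party` variable are invented for the example.

```r
library(quanteda)

corp <- corpus(c(a = "Tax cuts now.", b = "Invest in schools.", c = "Lower taxes."),
               docvars = data.frame(party = c("Con", "Lab", "Con"),
                                    stringsAsFactors = FALSE))

# Subsetting returns a corpus: texts and docvars stay together.
corp_con <- corpus_subset(corp, party == "Con")
ndoc(corp_con)
docnames(corp_con)
```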
Indexing a corpus using two indexes (integers or column names) will return the document variables, equivalent to docvars(x). It is also possible to access, create, or replace docvars using list notation, e.g. myCorpus[["newSerialDocvar"]] <- paste0("tag", 1:ndoc(myCorpus)).
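The same docvar can also be created through the docvars() replacement function rather than list notation. The sketch below assumes quanteda is installed and uses made-up texts; index notation itself may behave differently across quanteda releases, so only the function-based form is shown.

```r
library(quanteda)

myCorpus <- corpus(c(d1 = "One.", d2 = "Two.", d3 = "Three."))

# Function-based counterpart of the list-notation assignment.
docvars(myCorpus, "newSerialDocvar") <- paste0("tag", seq_len(ndoc(myCorpus)))

# Inspect all document variables.
docvars(myCorpus)
```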
For details, see corpus-class.
A corpus currently consists of an S3 specially classed list of elements, but you should not access these elements directly. Use the extractor and replacement functions instead, or else your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change (as it inevitably will as we continue to develop the package, including moving corpus objects to the S4 class system).
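A brief sketch of what working through extractor and replacement functions looks like, as opposed to reaching into the list elements. It assumes quanteda is installed; the texts and new document names are illustrative.

```r
library(quanteda)

corp <- corpus(c("Alpha beta.", "Gamma delta."))

# Extractors: query the object without touching its internals.
ndoc(corp)
docnames(corp)

# Replacement function: rename documents through the public interface.
docnames(corp) <- c("first", "second")
docnames(corp)
```

Code written this way depends only on the documented functions, so it should keep working if the internal representation of a corpus changes.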
# create a corpus from texts
corpus(data_char_ukimmig2010)
#> Corpus consisting of 9 documents and 0 docvars.

# create a corpus from texts and assign meta-data and document variables
summary(corpus(data_char_ukimmig2010,
               docvars = data.frame(party = names(data_char_ukimmig2010))), 5)
#> Corpus consisting of 9 documents, showing 5 documents:
#>
#>         Text Types Tokens Sentences        party
#>          BNP  1125   3280        88          BNP
#>    Coalition   142    260         4    Coalition
#> Conservative   251    499        15 Conservative
#>       Greens   322    679        21       Greens
#>       Labour   298    683        29       Labour
#>
#> Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/quanteda/docs/reference/* on x86_64 by kbenoit
#> Created: Sat Feb 2 14:11:46 2019
#> Notes:
#> Corpus consisting of 14 documents and 0 docvars.

# import a tm VCorpus
if (requireNamespace("tm", quietly = TRUE)) {
    data(crude, package = "tm")    # load in a tm example VCorpus
    mytmCorpus <- corpus(crude)
    summary(mytmCorpus, showmeta=TRUE)

    data(acq, package = "tm")
    summary(corpus(acq), 5, showmeta=TRUE)

    tmCorp <- tm::VCorpus(tm::VectorSource(data_char_ukimmig2010))
    quantCorp <- corpus(tmCorp)
    summary(quantCorp)
}
#> Corpus consisting of 9 documents:
#>
#>  Text Types Tokens Sentences       datetimestamp id language
#> text1  1125   3280        88 2019-02-02 03:11:46  1       en
#> text2   142    260         4 2019-02-02 03:11:46  2       en
#> text3   251    499        15 2019-02-02 03:11:46  3       en
#> text4   322    679        21 2019-02-02 03:11:46  4       en
#> text5   298    683        29 2019-02-02 03:11:46  5       en
#> text6   251    483        14 2019-02-02 03:11:46  6       en
#> text7    77    114         5 2019-02-02 03:11:46  7       en
#> text8    88    134         4 2019-02-02 03:11:46  8       en
#> text9   346    723        27 2019-02-02 03:11:46  9       en
#>
#> Source: Converted from tm Corpus 'tmCorp'
#> Created: Sat Feb 2 14:11:46 2019
#> Notes:

# construct a corpus from a data.frame
dat <- data.frame(letter_factor = factor(rep(letters[1:3], each = 2)),
                  some_ints = 1L:6L,
                  some_text = paste0("This is text number ", 1:6, "."),
                  stringsAsFactors = FALSE,
                  row.names = paste0("fromDf_", 1:6))
dat
#>          letter_factor some_ints              some_text
#> fromDf_1             a         1 This is text number 1.
#> fromDf_2             a         2 This is text number 2.
#> fromDf_3             b         3 This is text number 3.
#> fromDf_4             b         4 This is text number 4.
#> fromDf_5             c         5 This is text number 5.
#> fromDf_6             c         6 This is text number 6.
summary(corpus(dat, text_field = "some_text",
               metacorpus = list(source = "From a data.frame called mydf.")))
#> Corpus consisting of 6 documents:
#>
#>     Text Types Tokens Sentences letter_factor some_ints
#> fromDf_1     6      6         1             a         1
#> fromDf_2     6      6         1             a         2
#> fromDf_3     6      6         1             b         3
#> fromDf_4     6      6         1             b         4
#> fromDf_5     6      6         1             c         5
#> fromDf_6     6      6         1             c         6
#>
#> Source: From a data.frame called mydf.
#> Created: Sat Feb 2 14:11:46 2019
#> Notes:

# construct a corpus from a kwic object
mykwic <- kwic(data_corpus_inaugural, "southern")
summary(corpus(mykwic))
#> Corpus consisting of 28 documents:
#>
#>        Text Types Tokens Sentences         docname from   to  keyword context
#>   text1.pre     5      5         1      1797-Adams 1803 1803 southern     pre
#>   text2.pre     4      5         1      1825-Adams 2432 2432 southern     pre
#>   text3.pre     4      5         1    1861-Lincoln   96   96 Southern     pre
#>   text4.pre     5      5         1    1865-Lincoln  279  279 southern     pre
#>   text5.pre     5      5         1      1877-Hayes  376  376 Southern     pre
#>   text6.pre     5      5         1      1877-Hayes  948  948 Southern     pre
#>   text7.pre     5      5         1      1877-Hayes 1240 1240 Southern     pre
#>   text8.pre     5      5         1   1881-Garfield  991  991 Southern     pre
#>   text9.pre     4      5         1       1909-Taft 4027 4027 Southern     pre
#>  text10.pre     5      5         1       1909-Taft 4228 4228 Southern     pre
#>  text11.pre     5      5         1       1909-Taft 4348 4348 Southern     pre
#>  text12.pre     5      5         1       1909-Taft 4533 4533 Southern     pre
#>  text13.pre     5      5         1       1909-Taft 4593 4593 Southern     pre
#>  text14.pre     5      5         1 1953-Eisenhower 1226 1226 southern     pre
#>  text1.post     5      5         1      1797-Adams 1803 1803 southern    post
#>  text2.post     5      5         1      1825-Adams 2432 2432 southern    post
#>  text3.post     5      5         1    1861-Lincoln   96   96 Southern    post
#>  text4.post     5      5         2    1865-Lincoln  279  279 southern    post
#>  text5.post     5      5         2      1877-Hayes  376  376 Southern    post
#>  text6.post     5      5         1      1877-Hayes  948  948 Southern    post
#>  text7.post     5      5         1      1877-Hayes 1240 1240 Southern    post
#>  text8.post     5      5         2   1881-Garfield  991  991 Southern    post
#>  text9.post     5      5         2       1909-Taft 4027 4027 Southern    post
#> text10.post     5      5         1       1909-Taft 4228 4228 Southern    post
#> text11.post     5      5         1       1909-Taft 4348 4348 Southern    post
#> text12.post     5      5         1       1909-Taft 4533 4533 Southern    post
#> text13.post     5      5         1       1909-Taft 4593 4593 Southern    post
#> text14.post     5      5         1 1953-Eisenhower 1226 1226 southern    post
#>
#> Source: Corpus created from kwic(x, keywords = "")
#> Created: Sat Feb 2 14:11:46 2019
#> Notes:
#> Corpus consisting of 10 documents:
#>
#>       Text Types Tokens Sentences docname from  to keyword context
#>  text1.pre     5      5         1   text1  162 162 economy     pre
#>  text2.pre     5      5         1   text1  202 202 economy     pre
#>  text3.pre     5      5         1   text1  268 268 economy     pre
#>  text4.pre     4      5         1   text1  486 486 economy     pre
#>  text5.pre     5      5         1   text1  504 504 economy     pre
#> text1.post     5      5         1   text1  162 162 economy    post
#> text2.post     5      5         2   text1  202 202 economy    post
#> text3.post     5      5         1   text1  268 268 economy    post
#> text4.post     5      5         2   text1  486 486 economy    post
#> text5.post     5      5         1   text1  504 504 economy    post
#>
#> Source: Corpus created from kwic(x, keywords = "")
#> Created: Sat Feb 2 14:11:46 2019
#> Notes:
#> Corpus consisting of 5 documents:
#>
#>       Text Types Tokens Sentences keyword
#> text1.L162    10     11         1 economy
#> text1.L202    11     11         2 economy
#> text1.L268    10     11         1 economy
#> text1.L486    10     11         2 economy
#> text1.L504    10     11         1 economy
#>
#> Source: Corpus created from kwic(x, keywords = "")
#> Created: Sat Feb 2 14:11:46 2019
#> Notes:
#> text1.L162
#> "reefed out of the Irish economy in pursuit of a policy"
#> text1.L202
#> "it is decimating the domestic economy? As we are tired"
#> text1.L268
#> "key indicators in the domestic economy show the abject failure of"
#> text1.L486
#> "the banks massively dislocates the economy. Otherwise those funds would"
#> text1.L504
#> "and services in the domestic economy, stimulating demand and sustaining"