Creates a corpus object from available sources. The currently available sources are:

  • a character vector, consisting of one document per element; if the elements are named, these names will be used as document names.
  • a data.frame, whose default document id is a variable identified by docid_field; the text of the document is a variable identified by text_field; and other variables are imported as document-level meta-data. This matches the format of data.frames constructed by the readtext package.
  • a kwic object constructed by kwic.
  • a tm VCorpus or SimpleCorpus class object, with the fixed metadata fields imported as docvars and corpus-level metadata imported as metacorpus information.

corpus(x, ...)

# S3 method for character
corpus(x, docnames = NULL, docvars = NULL,
  metacorpus = NULL, compress = FALSE, ...)

# S3 method for data.frame
corpus(x, docid_field = "doc_id", text_field = "text",
  metacorpus = NULL, compress = FALSE, ...)

# S3 method for kwic
corpus(x, ...)

# S3 method for Corpus
corpus(x, metacorpus = NULL, compress = FALSE, ...)

Arguments

x
a valid corpus source object
...
not used directly
docnames
Names to be assigned to the texts. Defaults to the names of the character vector (if any); doc_id for a data.frame; the document names in a tm corpus; or a vector of user-supplied labels equal in length to the number of documents. If none of these are found, then "text1", "text2", etc. are assigned automatically.
docvars
A data frame of attributes that are associated with each text.
metacorpus
a named list containing additional (character) information to be added to the corpus as corpus-level metadata. Special fields recognized by summary.corpus are:
  • source a description of the source of the texts, used for referencing;
  • citation information on how to cite the corpus; and
  • notes any additional information about who created the text, warnings, to do lists, etc.
compress
logical; if TRUE, compress the texts in memory using gzip compression. This significantly reduces the size of the corpus in memory, but will slow down operations that require the texts to be extracted.
docid_field
name of the data.frame variable containing the document identifier; defaults to doc_id, but if this is not found, the row.names of the data.frame will be used if they are assigned
text_field
the character name or numeric index of the source data.frame indicating the variable to be read in as text, which must be a character vector. All other variables in the data.frame will be imported as docvars. This argument is only used for data.frame objects (including those created by readtext).
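As a minimal sketch of the data.frame import path (the data frame, its column names, and values here are invented for illustration; assumes quanteda is installed):

```r
library(quanteda)

# a hypothetical data.frame with a custom document-id column
# and one extra variable
df <- data.frame(my_id = c("doc_a", "doc_b"),
                 text  = c("First document.", "Second document."),
                 year  = c(2016L, 2017L),
                 stringsAsFactors = FALSE)

# docid_field supplies the document names; text_field selects the text
# column; "year" is imported automatically as a document-level variable
corp <- corpus(df, docid_field = "my_id", text_field = "text")
docnames(corp)
docvars(corp)
```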

Value

A corpus-class object containing the original texts, document-level variables, document-level metadata, corpus-level metadata, and default settings for subsequent processing of the corpus.

Details

The texts and document variables of corpus objects can also be accessed using index notation. Indexing a corpus object as a vector will return its text, equivalent to texts(x). Note that this is not the same as subsetting the entire corpus -- this should be done using the subset method for a corpus.

Indexing a corpus using two indexes (integers or column names) will return the document variables, equivalent to docvars(x). Because a corpus is also a list, it is also possible to access, create, or replace docvars using list notation, e.g.

myCorpus[["newSerialDocvar"]] <- paste0("tag", 1:ndoc(myCorpus))

For details, see corpus-class.
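The index notation described above can be sketched as follows (a hypothetical two-document corpus; assumes quanteda is loaded):

```r
library(quanteda)

corp <- corpus(c(d1 = "Spam and eggs.", d2 = "Ham and eggs."),
               docvars = data.frame(dish = c("spam", "ham")))

corp[1]          # single index: the text of the first document, like texts(corp)[1]
corp[, "dish"]   # two indexes: document variables, like docvars(corp)
corp[["dish"]]   # list notation: access a single docvar
corp[["serial"]] <- paste0("tag", 1:ndoc(corp))  # list notation: create a docvar
```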

A warning on accessing corpus elements

A corpus currently consists of a specially classed S3 list of elements, but you should not access these elements directly. Use the extractor and replacement functions instead; otherwise, your code is not only going to be uglier, but also likely to break should the internal structure of a corpus object change (as it inevitably will as we continue to develop the package, including moving corpus objects to the S4 class system).

See also

corpus-class, docvars, metadoc, metacorpus, settings, texts, ndoc, docnames

Examples

# create a corpus from texts
corpus(data_char_ukimmig2010)
#> Corpus consisting of 9 documents and 0 docvars.
# create a corpus from texts and assign meta-data and document variables
summary(corpus(data_char_ukimmig2010,
               docvars = data.frame(party = names(data_char_ukimmig2010))), 5)
#> Corpus consisting of 9 documents, showing 5 documents.
#> 
#>          Text Types Tokens Sentences        party
#>           BNP  1126   3330        88          BNP
#>     Coalition   144    268         4    Coalition
#>  Conservative   252    503        15 Conservative
#>        Greens   325    687        21       Greens
#>        Labour   296    703        29       Labour
#> 
#> Source: /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/docs/reference/* on x86_64 by kbenoit
#> Created: Tue May 16 20:59:43 2017
#> Notes: 
#> 
corpus(texts(data_corpus_irishbudget2010))
#> Corpus consisting of 14 documents and 0 docvars.
# import a tm VCorpus
if (requireNamespace("tm")) {
    data(crude, package = "tm")  # load in a tm example VCorpus
    mytmCorpus <- corpus(crude)
    summary(mytmCorpus, showmeta = TRUE)

    data(acq, package = "tm")
    summary(corpus(acq), 5, showmeta = TRUE)

    tmCorp <- tm::VCorpus(tm::VectorSource(data_char_ukimmig2010))
    quantCorp <- corpus(tmCorp)
    summary(quantCorp)
}
#> Corpus consisting of 20 documents.
#> 
#> Text Types Tokens Sentences author
#> reut-00001.xml 62 103 5 <NA>
#> reut-00002.xml 240 497 19 BY TED D'AFFLISIO, Reuters
#> reut-00004.xml 47 62 4 <NA>
#> reut-00005.xml 55 74 5 <NA>
#> reut-00006.xml 67 97 4 <NA>
#> reut-00007.xml 260 533 22 <NA>
#> reut-00008.xml 255 500 22 By Jeremy Clift, Reuters
#> reut-00009.xml 117 199 8 <NA>
#> reut-00010.xml 197 380 16 <NA>
#> reut-00011.xml 204 396 16 <NA>
#> reut-00012.xml 207 420 14 <NA>
#> reut-00013.xml 72 106 4 <NA>
#> reut-00014.xml 75 115 4 <NA>
#> reut-00015.xml 81 116 5 <NA>
#> reut-00016.xml 78 120 5 <NA>
#> reut-00018.xml 100 162 6 <NA>
#> reut-00019.xml 128 217 8 <NA>
#> reut-00021.xml 54 94 5 <NA>
#> reut-00022.xml 155 323 11 By BERNICE NAPACH, Reuters
#> reut-00023.xml 40 59 3 <NA>
#> datetimestamp description
#> 1987-02-26 17:00:56
#> 1987-02-26 17:34:11
#> 1987-02-26 18:18:00
#> 1987-02-26 18:21:01
#> 1987-02-26 19:00:57
#> 1987-03-01 03:25:46
#> 1987-03-01 03:39:14
#> 1987-03-01 05:27:27
#> 1987-03-01 08:22:30
#> 1987-03-01 18:31:44
#> 1987-03-02 01:05:49
#> 1987-03-02 07:39:23
#> 1987-03-02 07:43:22
#> 1987-03-02 07:43:41
#> 1987-03-02 08:25:42
#> 1987-03-02 11:20:05
#> 1987-03-02 11:28:26
#> 1987-03-02 12:13:46
#> 1987-03-02 14:38:34
#> 1987-03-02 14:49:06
#> heading id language
#> DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES 127 en
#> OPEC MAY HAVE TO MEET TO FIRM PRICES - ANALYSTS 144 en
#> TEXACO CANADA <TXC> LOWERS CRUDE POSTINGS 191 en
#> MARATHON PETROLEUM REDUCES CRUDE POSTINGS 194 en
#> HOUSTON OIL <HO> RESERVES STUDY COMPLETED 211 en
#> KUWAIT SAYS NO PLANS FOR EMERGENCY OPEC TALKS 236 en
#> INDONESIA SEEN AT CROSSROADS OVER ECONOMIC CHANGE 237 en
#> SAUDI RIYAL DEPOSIT RATES REMAIN FIRM 242 en
#> QATAR UNVEILS BUDGET FOR FISCAL 1987/88 246 en
#> SAUDI ARABIA REITERATES COMMITMENT TO OPEC PACT 248 en
#> SAUDI FEBRUARY CRUDE OUTPUT PUT AT 3.5 MLN BPD 273 en
#> GULF ARAB DEPUTY OIL MINISTERS TO MEET IN BAHRAIN 349 en
#> SAUDI ARABIA REITERATES COMMITMENT TO OPEC ACCORD 352 en
#> KUWAIT MINISTER SAYS NO EMERGENCY OPEC TALKS SET 353 en
#> PHILADELPHIA PORT CLOSED BY TANKER CRASH 368 en
#> STUDY GROUP URGES INCREASED U.S. OIL RESERVES 489 en
#> STUDY GROUP URGES INCREASED U.S. OIL RESERVES 502 en
#> UNOCAL <UCL> UNIT CUTS CRUDE OIL POSTED PRICES 543 en
#> NYMEX WILL EXPAND OFF-HOUR TRADING APRIL ONE 704 en
#> ARGENTINE OIL PRODUCTION DOWN IN JANUARY 1987 708 en
#> origin topics lewissplit cgisplit oldid
#> Reuters-21578 XML YES TRAIN TRAINING-SET 5670
#> Reuters-21578 XML YES TRAIN TRAINING-SET 5687
#> Reuters-21578 XML YES TRAIN TRAINING-SET 5734
#> Reuters-21578 XML YES TRAIN TRAINING-SET 5737
#> Reuters-21578 XML YES TRAIN TRAINING-SET 5754
#> Reuters-21578 XML YES TRAIN TRAINING-SET 8321
#> Reuters-21578 XML YES TRAIN TRAINING-SET 8322
#> Reuters-21578 XML YES TRAIN TRAINING-SET 8327
#> Reuters-21578 XML YES TRAIN TRAINING-SET 8331
#> Reuters-21578 XML YES TRAIN TRAINING-SET 8333
#> Reuters-21578 XML YES TRAIN TRAINING-SET 12456
#> Reuters-21578 XML YES TRAIN TRAINING-SET 12532
#> Reuters-21578 XML YES TRAIN TRAINING-SET 12535
#> Reuters-21578 XML YES TRAIN TRAINING-SET 12536
#> Reuters-21578 XML YES TRAIN TRAINING-SET 12550
#> Reuters-21578 XML YES TRAIN TRAINING-SET 12672
#> Reuters-21578 XML YES TRAIN TRAINING-SET 12685
#> Reuters-21578 XML YES TRAIN TRAINING-SET 12726
#> Reuters-21578 XML YES TRAIN TRAINING-SET 12887
#> Reuters-21578 XML YES TRAIN TRAINING-SET 12891
#> places people orgs
#> usa <NA> <NA>
#> usa <NA> opec
#> canada <NA> <NA>
#> usa <NA> <NA>
#> usa <NA> <NA>
#> c("kuwait", "ecuador") <NA> opec
#> c("indonesia", "usa") <NA> worldbank
#> c("bahrain", "saudi-arabia") <NA> opec
#> qatar <NA> <NA>
#> c("bahrain", "saudi-arabia") hisham-nazer opec
#> c("saudi-arabia", "uae") <NA> opec
#> c("uae", "bahrain", "saudi-arabia", "kuwait", "qatar") <NA> opec
#> c("saudi-arabia", "bahrain") hisham-nazer opec
#> kuwait <NA> opec
#> usa <NA> <NA>
#> usa <NA> <NA>
#> usa <NA> <NA>
#> usa <NA> <NA>
#> usa <NA> <NA>
#> argentina <NA> <NA>
#> exchanges
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> <NA>
#> nymex
#> <NA>
#> 
#> Source: Converted from tm VCorpus 'crude'
#> Created: Tue May 16 20:59:43 2017
#> Notes: 
#> 
#> Corpus consisting of 50 documents, showing 5 documents.
#> 
#> Text Types Tokens Sentences author
#> reut-00001.xml 120 233 9 <NA>
#> reut-00002.xml 89 146 6 <NA>
#> reut-00003.xml 62 86 6 <NA>
#> reut-00004.xml 232 431 22 By Cal Mankowski, Reuters
#> reut-00005.xml 42 59 3 <NA>
#> datetimestamp description
#> 1987-02-26 15:18:06
#> 1987-02-26 15:19:15
#> 1987-02-26 15:49:56
#> 1987-02-26 15:51:17
#> 1987-02-26 16:08:33
#> heading id language origin
#> COMPUTER TERMINAL SYSTEMS <CPML> COMPLETES SALE 10 en Reuters-21578 XML
#> OHIO MATTRESS <OMT> MAY HAVE LOWER 1ST QTR NET 12 en Reuters-21578 XML
#> MCLEAN'S <MII> U.S. LINES SETS ASSET TRANSFER 44 en Reuters-21578 XML
#> CHEMLAWN <CHEM> RISES ON HOPES FOR HIGHER BIDS 45 en Reuters-21578 XML
#> <COFAB INC> BUYS GULFEX FOR UNDISCLOSED AMOUNT 68 en Reuters-21578 XML
#> topics lewissplit cgisplit oldid places people orgs exchanges
#> YES TRAIN TRAINING-SET 5553 usa <NA> <NA> <NA>
#> YES TRAIN TRAINING-SET 5555 usa <NA> <NA> <NA>
#> YES TRAIN TRAINING-SET 5587 usa <NA> <NA> <NA>
#> YES TRAIN TRAINING-SET 5588 usa <NA> <NA> <NA>
#> YES TRAIN TRAINING-SET 5611 usa <NA> <NA> <NA>
#> 
#> Source: Converted from tm VCorpus 'acq'
#> Created: Tue May 16 20:59:43 2017
#> Notes: 
#> 
#> Corpus consisting of 9 documents.
#> 
#> Text Types Tokens Sentences author datetimestamp description
#> BNP 1126 3330 88 <NA> 2017-05-16 19:59:43 <NA>
#> Coalition 144 268 4 <NA> 2017-05-16 19:59:43 <NA>
#> Conservative 252 503 15 <NA> 2017-05-16 19:59:43 <NA>
#> Greens 325 687 21 <NA> 2017-05-16 19:59:43 <NA>
#> Labour 296 703 29 <NA> 2017-05-16 19:59:43 <NA>
#> LibDem 257 499 14 <NA> 2017-05-16 19:59:43 <NA>
#> PC 80 118 5 <NA> 2017-05-16 19:59:43 <NA>
#> SNP 90 136 4 <NA> 2017-05-16 19:59:43 <NA>
#> UKIP 346 739 27 <NA> 2017-05-16 19:59:43 <NA>
#> heading id language origin
#> <NA> 1 en <NA>
#> <NA> 2 en <NA>
#> <NA> 3 en <NA>
#> <NA> 4 en <NA>
#> <NA> 5 en <NA>
#> <NA> 6 en <NA>
#> <NA> 7 en <NA>
#> <NA> 8 en <NA>
#> <NA> 9 en <NA>
#> 
#> Source: Converted from tm VCorpus 'tmCorp'
#> Created: Tue May 16 20:59:43 2017
#> Notes: 
#> 
# construct a corpus from a data.frame
mydf <- data.frame(letter_factor = factor(rep(letters[1:3], each = 2)),
                   some_ints = 1L:6L,
                   some_text = paste0("This is text number ", 1:6, "."),
                   stringsAsFactors = FALSE,
                   row.names = paste0("fromDf_", 1:6))
mydf
#>          letter_factor some_ints              some_text
#> fromDf_1             a         1 This is text number 1.
#> fromDf_2             a         2 This is text number 2.
#> fromDf_3             b         3 This is text number 3.
#> fromDf_4             b         4 This is text number 4.
#> fromDf_5             c         5 This is text number 5.
#> fromDf_6             c         6 This is text number 6.
summary(corpus(mydf, text_field = "some_text",
               metacorpus = list(source = "From a data.frame called mydf.")))
#> Corpus consisting of 6 documents.
#> 
#>      Text Types Tokens Sentences letter_factor some_ints
#>  fromDf_1     6      6         1             a         1
#>  fromDf_2     6      6         1             a         2
#>  fromDf_3     6      6         1             b         3
#>  fromDf_4     6      6         1             b         4
#>  fromDf_5     6      6         1             c         5
#>  fromDf_6     6      6         1             c         6
#> 
#> Source: From a data.frame called mydf.
#> Created: Tue May 16 20:59:43 2017
#> Notes: 
#> 
# construct a corpus from a kwic object
mykwic <- kwic(data_corpus_inaugural, "southern")
summary(corpus(mykwic))
#> Corpus consisting of 28 documents.
#> 
#> Text Types Tokens Sentences docname from to keyword context
#> text1.pre 5 5 1 1797-Adams 1807 1807 southern pre
#> text2.pre 4 5 1 1825-Adams 2434 2434 southern pre
#> text3.pre 4 5 1 1861-Lincoln 98 98 Southern pre
#> text4.pre 5 5 1 1865-Lincoln 283 283 southern pre
#> text5.pre 5 5 1 1877-Hayes 378 378 Southern pre
#> text6.pre 5 5 1 1877-Hayes 956 956 Southern pre
#> text7.pre 5 5 1 1877-Hayes 1250 1250 Southern pre
#> text8.pre 5 5 1 1881-Garfield 1007 1007 Southern pre
#> text9.pre 4 5 1 1909-Taft 4029 4029 Southern pre
#> text10.pre 5 5 1 1909-Taft 4230 4230 Southern pre
#> text11.pre 5 5 1 1909-Taft 4350 4350 Southern pre
#> text12.pre 5 5 1 1909-Taft 4537 4537 Southern pre
#> text13.pre 5 5 1 1909-Taft 4597 4597 Southern pre
#> text14.pre 5 5 1 1953-Eisenhower 1226 1226 southern pre
#> text1.post 5 5 1 1797-Adams 1807 1807 southern post
#> text2.post 5 5 1 1825-Adams 2434 2434 southern post
#> text3.post 5 5 1 1861-Lincoln 98 98 Southern post
#> text4.post 5 5 2 1865-Lincoln 283 283 southern post
#> text5.post 5 5 2 1877-Hayes 378 378 Southern post
#> text6.post 5 5 1 1877-Hayes 956 956 Southern post
#> text7.post 5 5 1 1877-Hayes 1250 1250 Southern post
#> text8.post 5 5 2 1881-Garfield 1007 1007 Southern post
#> text9.post 5 5 2 1909-Taft 4029 4029 Southern post
#> text10.post 5 5 1 1909-Taft 4230 4230 Southern post
#> text11.post 5 5 1 1909-Taft 4350 4350 Southern post
#> text12.post 5 5 1 1909-Taft 4537 4537 Southern post
#> text13.post 5 5 1 1909-Taft 4597 4597 Southern post
#> text14.post 5 5 1 1953-Eisenhower 1226 1226 southern post
#> 
#> Source: Corpus created from kwic(x, keywords = "southern")
#> Created: Tue May 16 20:59:43 2017
#> Notes: 
#> 