This vignette provides a basic overview of quanteda’s features and capabilities. For additional vignettes, see the articles at quanteda.io.

Introduction

An R package for managing and analyzing text.

quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as meta-data for documents and for the collection as a whole. It includes tools to manipulate the texts in a corpus quickly, performing the most common natural language processing tasks such as tokenizing, stemming, or forming ngrams. Its functions for tokenizing texts and for forming multiple tokenized documents into a document-feature matrix are both fast and simple to use. quanteda can also segment texts by words, paragraphs, sentences, or even user-supplied delimiters and tags.

Built on the text processing functions in the stringi package, which is in turn built on a C++ implementation of the ICU libraries for Unicode text handling, quanteda pays special attention to fast and correct implementation of Unicode and the handling of text in any character set, following conversion internally to UTF-8.

quanteda is built for efficiency and speed, through its design around three infrastructures: the stringi package for text processing, the data.table package for indexing large documents efficiently, and the Matrix package for sparse matrix objects. If you can fit it into memory, quanteda will handle it quickly. (And eventually, we will make it possible to process objects even larger than available memory.)

quanteda is principally designed to give users a fast and convenient way to go from a corpus of texts to a selected matrix of documents by features, after defining what the documents and features are. The package makes it easy to redefine documents, for instance by splitting them into sentences or paragraphs, or by tags, as well as to group them into larger documents by document variables, or to subset them based on logical conditions or combinations of document variables. The package also implements common NLP feature selection functions, such as removing stopwords and stemming in numerous languages, selecting words found in dictionaries, treating words as equivalent based on a user-defined “thesaurus”, and trimming and weighting features based on document frequency, feature frequency, and related measures such as tf-idf.
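The corpus-to-matrix workflow described above can be sketched in a few lines. This is a minimal illustration rather than a recommended pipeline: the two toy texts and the object names (txts, corp, toks, dfmat) are made up for the example, and it assumes a quanteda release in which tokens_remove() and stopwords() are available.

```r
library(quanteda)

# Two toy documents standing in for a real corpus (hypothetical data)
txts <- c(doc1 = "The cat sat on the mat.",
          doc2 = "The dog chased the cat!")
corp <- corpus(txts)

# Tokenize, dropping punctuation, then remove English stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("english"))

# Form the document-feature matrix: documents in rows, features in columns
dfmat <- dfm(toks)
dim(dfmat)
featnames(dfmat)
```

Keeping the tokenizing and feature-selection steps separate from dfm() makes each choice (punctuation, stopwords) explicit and easy to change.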

quanteda Features

Corpus management tools

The tools for getting texts into a corpus object include:

  • loading texts from directories of individual files
  • loading texts “manually” by inserting them into a corpus using helper functions
  • managing text encodings and conversions from source files into corpus texts
  • attaching variables to each text that can be used for grouping, reorganizing a corpus, or simply recording additional information to supplement quantitative analyses with non-textual data
  • recording meta-data about the sources and creation details for the corpus.

The tools for working with a corpus include:

  • summarizing the corpus in terms of its language units
  • reshaping the corpus into smaller units or more aggregated units
  • adding to or extracting subsets of a corpus
  • resampling texts of the corpus, for example for use in non-parametric bootstrapping of the texts
  • extracting and saving key words in context (KWIC), as a new data frame or corpus.
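As a rough illustration of reshaping and subsetting, the sketch below builds a two-document corpus, splits it into sentence-level documents, and then keeps only the documents that match a condition on a document variable. The texts and the docvar name group are hypothetical, and the calls assume a quanteda release that provides corpus_reshape() and corpus_subset().

```r
library(quanteda)

# A toy corpus with one document-level variable (hypothetical data)
corp <- corpus(c(a = "First sentence here. Second sentence here.",
                 b = "Only one sentence."),
               docvars = data.frame(group = c("x", "y")))

# Split each document into sentence-level documents
sents <- corpus_reshape(corp, to = "sentences")
ndoc(sents)

# Keep only the documents whose docvar satisfies a logical condition
sub <- corpus_subset(corp, group == "x")
ndoc(sub)
```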

Natural-Language Processing tools

For extracting features from a corpus, quanteda provides the following tools:

  • extraction of word types
  • extraction of word n-grams
  • extraction of dictionary entries from user-defined dictionaries
  • feature selection through
    • stemming
    • random selection
    • document frequency
    • word frequency
  • and a variety of options for cleaning word types, such as capitalization and rules for handling punctuation.
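A few of these feature-extraction steps might look as follows. This is only a sketch on a single made-up sentence; it assumes the tokens_tolower(), tokens_ngrams(), and tokens_wordstem() functions of a reasonably recent quanteda release.

```r
library(quanteda)

toks <- tokens("Tokenizing texts and forming ngrams", remove_punct = TRUE)

# Normalize capitalization
toks <- tokens_tolower(toks)

# Word bigrams, joined with the default "_" concatenator
bigrams <- tokens_ngrams(toks, n = 2)
bigrams

# Stemmed word types (English stemmer by default)
stems <- tokens_wordstem(toks)
stems
```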

Document-Feature Matrix analysis tools

For analyzing the resulting document-feature matrix created when features are abstracted from a corpus, quanteda provides:

  • scaling methods, such as correspondence analysis, Wordfish, and Wordscores
  • topic models, such as LDA
  • classifiers, such as Naive Bayes or k-nearest neighbour
  • sentiment analysis, using dictionaries
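Dictionary-based sentiment analysis, for instance, amounts to counting matches against dictionary keys. The sketch below uses a tiny hand-made dictionary whose entries are invented purely for illustration; it assumes dictionary() and dfm_lookup() behave as in recent quanteda releases.

```r
library(quanteda)

# A tiny hand-made dictionary (hypothetical entries, for illustration only)
dict <- dictionary(list(positive = c("good", "great"),
                        negative = c("bad", "awful")))

txts <- c(d1 = "A good plan with great results but bad timing.",
          d2 = "An awful, bad idea.")
dfmat <- dfm(tokens(txts, remove_punct = TRUE))

# Count dictionary matches per document, one column per dictionary key
sent <- dfm_lookup(dfmat, dictionary = dict)
m <- as.matrix(sent)
m
```

A real analysis would substitute a validated sentiment dictionary for the toy one used here.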

Additional and planned features

Additional features of quanteda include:

  • the ability to explore texts using key-words-in-context;

  • fast computation of a variety of readability indexes;

  • fast computation of a variety of lexical diversity measures;

  • quick computation of word or document association measures, for clustering or to compute similarity scores for other purposes; and

  • a comprehensive suite of descriptive statistics on text such as the number of sentences, words, characters, or syllables per document.

Planned features coming soon to quanteda are:

  • bootstrapping methods for texts that make it easy to resample texts from pre-defined units, to facilitate computation of confidence intervals on textual statistics using techniques of non-parametric bootstrapping, but applied to the original texts as data.

  • expansion of the document-feature matrix structure through a standard interface called textmodel(). (As of version 0.8.0, textmodel works in a basic fashion only for the “Wordscores” and “wordfish” scaling models.)

Working with other text analysis packages

quanteda is hardly unique in providing facilities for working with text – the excellent tm package already provides many of the features we have described. quanteda is designed to complement such packages, as well as to simplify the implementation of the text-to-analysis workflow. quanteda corpus structures are simpler objects than those in tm, as are quanteda’s document-feature matrix objects compared to the sparse matrix implementation found in tm. However, there is no need to choose only one package, since quanteda provides translator functions from one matrix or corpus object to the other.

Once constructed, a quanteda “dfm” can be easily passed to other text-analysis packages for additional analysis, such as:

  • topic models (including converters for direct use with the topicmodels, LDA, and stm packages)

  • document scaling using quanteda’s own functions for the “wordfish” and “Wordscores” models, and a sparse method for correspondence analysis

  • document classification methods, using (for example) Naive Bayes, k-nearest neighbour, or Support Vector Machines

  • more sophisticated machine learning through a variety of other packages that take matrix or matrix-like inputs.

  • graphical analysis, including word clouds and strip plots for selected themes or words.
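For packages or functions that expect an ordinary matrix, one simple route is to convert the dfm with as.matrix(). The sketch below passes the result to base R’s dist() and hclust(); the three short texts are made up, and hierarchical clustering is used only as an example of a matrix-consuming method, not as a method the package itself prescribes.

```r
library(quanteda)

txts <- c(d1 = "tax cuts and spending",
          d2 = "spending cuts and tax rises",
          d3 = "immigration and border policy")
dfmat <- dfm(tokens(txts))

# Convert the sparse dfm to an ordinary dense matrix
m <- as.matrix(dfmat)

# Any function that accepts a numeric matrix can now be applied, e.g.
# Euclidean distances between documents followed by hierarchical clustering
hc <- hclust(dist(m))
hc$labels
```

Dense conversion is only practical for small matrices; for large corpora a sparse-aware converter or method is preferable.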

How to Install

Install the package from CRAN in the usual way or, for the GitHub version, see the installation instructions at https://github.com/kbenoit/quanteda.

Creating and Working with a Corpus

require(quanteda)

Currently available corpus sources

quanteda has a simple and powerful companion package for loading texts: readtext. The main function in this package, readtext(), takes a file or fileset from disk or a URL, and returns a type of data.frame that can be used directly with the corpus() constructor function, to create a quanteda corpus object.

readtext() works on:

  • text (.txt) files;
  • comma-separated-value (.csv) files;
  • XML formatted data;
  • data from the Facebook API, in JSON format;
  • data from the Twitter API, in JSON format; and
  • generic JSON data.

The corpus constructor command corpus() works directly on:

  • a vector of character objects, for instance that you have already loaded into the workspace using other tools;
  • a VCorpus corpus object from the tm package;
  • a data.frame containing a text column and any other document-level metadata.
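The data.frame case can be sketched as follows. The data frame and its column names (text, party) are hypothetical, and the text_field argument name is an assumption that holds for recent quanteda releases; older releases may spell it differently.

```r
library(quanteda)

# A small data.frame with a text column and one metadata column (hypothetical)
df <- data.frame(text = c("We will cut taxes.", "We will protect services."),
                 party = c("A", "B"),
                 stringsAsFactors = FALSE)

# The text column becomes the documents; remaining columns become docvars
corp <- corpus(df, text_field = "text")
ndoc(corp)
docvars(corp)
```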

Example: building a corpus from a character vector

The simplest case is to create a corpus from a vector of texts already in memory in R. This gives the advanced R user complete flexibility in the choice of text inputs, as there are almost endless ways to get a vector of texts into R.

If we already have the texts in this form, we can call the corpus constructor function directly. We can demonstrate this on the built-in character object of the texts about immigration policy extracted from the 2010 election manifestos of the UK political parties (called data_char_ukimmig2010).

myCorpus <- corpus(data_char_ukimmig2010)  # build a new corpus from the texts
summary(myCorpus)
## Corpus consisting of 9 documents.
## 
##          Text Types Tokens Sentences
##           BNP  1126   3330        88
##     Coalition   144    268         4
##  Conservative   252    503        15
##        Greens   325    687        21
##        Labour   296    703        29
##        LibDem   257    499        14
##            PC    80    118         5
##           SNP    90    136         4
##          UKIP   346    739        27
## 
## Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/docs/articles/* on x86_64 by kbenoit
## Created: Tue May 16 21:00:54 2017
## Notes:

If we wanted, we could add some document-level variables – what quanteda calls docvars – to this corpus.

We can do this using R’s names() function to get the names of the character vector data_char_ukimmig2010, and assigning these to a document variable (docvar).

docvars(myCorpus, "Party") <- names(data_char_ukimmig2010)
docvars(myCorpus, "Year") <- 2010
summary(myCorpus)
## Corpus consisting of 9 documents.
## 
##          Text Types Tokens Sentences        Party Year
##           BNP  1126   3330        88          BNP 2010
##     Coalition   144    268         4    Coalition 2010
##  Conservative   252    503        15 Conservative 2010
##        Greens   325    687        21       Greens 2010
##        Labour   296    703        29       Labour 2010
##        LibDem   257    499        14       LibDem 2010
##            PC    80    118         5           PC 2010
##           SNP    90    136         4          SNP 2010
##          UKIP   346    739        27         UKIP 2010
## 
## Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/docs/articles/* on x86_64 by kbenoit
## Created: Tue May 16 21:00:54 2017
## Notes:

If we wanted to tag each document with additional meta-data not considered a document variable of interest for analysis, but rather something that we need to know as an attribute of the document, we could also add those to our corpus.

metadoc(myCorpus, "language") <- "english"
metadoc(myCorpus, "docsource")  <- paste("data_char_ukimmig2010", 1:ndoc(myCorpus), sep = "_")
summary(myCorpus, showmeta = TRUE)
## Corpus consisting of 9 documents.
## 
##          Text Types Tokens Sentences        Party Year _language
##           BNP  1126   3330        88          BNP 2010   english
##     Coalition   144    268         4    Coalition 2010   english
##  Conservative   252    503        15 Conservative 2010   english
##        Greens   325    687        21       Greens 2010   english
##        Labour   296    703        29       Labour 2010   english
##        LibDem   257    499        14       LibDem 2010   english
##            PC    80    118         5           PC 2010   english
##           SNP    90    136         4          SNP 2010   english
##          UKIP   346    739        27         UKIP 2010   english
##               _docsource
##  data_char_ukimmig2010_1
##  data_char_ukimmig2010_2
##  data_char_ukimmig2010_3
##  data_char_ukimmig2010_4
##  data_char_ukimmig2010_5
##  data_char_ukimmig2010_6
##  data_char_ukimmig2010_7
##  data_char_ukimmig2010_8
##  data_char_ukimmig2010_9
## 
## Source:  /Users/kbenoit/Dropbox (Personal)/GitHub/quanteda/docs/articles/* on x86_64 by kbenoit
## Created: Tue May 16 21:00:54 2017
## Notes:

The last command, metadoc, allows you to define your own document meta-data fields. Note that in assigning just the single value of "english", R has recycled the value until it matches the number of documents in the corpus. In creating a simple tag for our custom metadoc field docsource, we used the quanteda function ndoc() to retrieve the number of documents in our corpus. This function is deliberately designed to work in a way similar to functions you may already use in R, such as nrow() and ncol().

Example: loading in files using the readtext package

require(readtext)

# Twitter json
mytf1 <- readtext("~/Dropbox/QUANTESS/social media/zombies/tweets.json")
myCorpusTwitter <- corpus(mytf1)
summary(myCorpusTwitter, 5)
# generic json - needs a textfield specifier
mytf2 <- readtext("~/Dropbox/QUANTESS/Manuscripts/collocations/Corpora/sotu/sotu.json",
                  textfield = "text")
summary(corpus(mytf2), 5)
# text file
mytf3 <- readtext("~/Dropbox/QUANTESS/corpora/project_gutenberg/pg2701.txt", cache = FALSE)
summary(corpus(mytf3), 5)
# multiple text files
mytf4 <- readtext("~/Dropbox/QUANTESS/corpora/inaugural/*.txt", cache = FALSE)
summary(corpus(mytf4), 5)
# multiple text files with docvars from filenames
mytf5 <- readtext("~/Dropbox/QUANTESS/corpora/inaugural/*.txt", 
                  docvarsfrom = "filenames", sep = "-", docvarnames = c("Year", "President"))
summary(corpus(mytf5), 5)
# XML data
mytf6 <- readtext("~/Dropbox/QUANTESS/quanteda_working_files/xmlData/plant_catalog.xml", 
                  textfield = "COMMON")
summary(corpus(mytf6), 5)
# csv file
write.csv(data.frame(inaugSpeech = texts(data_corpus_inaugural), 
                     docvars(data_corpus_inaugural)),
          file = "/tmp/inaug_texts.csv", row.names = FALSE)
mytf7 <- readtext("/tmp/inaug_texts.csv", textfield = "inaugSpeech")
summary(corpus(mytf7), 5)

How a quanteda corpus works

Corpus principles

A corpus is designed to be a “library” of original documents that have been converted to plain, UTF-8 encoded text, and stored along with meta-data at the corpus level and at the document-level. We have a special name for document-level meta-data: docvars. These are variables or features that describe attributes of each document.

A corpus is designed to be a more or less static container of texts with respect to processing and analysis. This means that the texts in a corpus are not designed to be changed internally through (for example) cleaning or pre-processing steps, such as stemming or removing punctuation. Rather, texts can be extracted from the corpus as part of processing and assigned to new objects, while the corpus remains an original reference copy so that other analyses – for instance those in which stems and punctuation are required, such as analyzing a reading ease index – can be performed on the same corpus.

To extract texts from a corpus, we use an extractor, called texts().

texts(data_corpus_inaugural)[2]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              1793-Washington 
## "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.\n\nPrevious to the execution of any official act of the President the Constitution requires an oath of office. This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony.\n\n "

To summarize the texts from a corpus, we can call a summary() method defined for a corpus.

summary(data_corpus_irishbudget2010)
## Corpus consisting of 14 documents.
## 
##                                   Text Types Tokens Sentences year debate
##        2010_BUDGET_01_Brian_Lenihan_FF  1949   8733       374 2010 BUDGET
##       2010_BUDGET_02_Richard_Bruton_FG  1042   4478       217 2010 BUDGET
##         2010_BUDGET_03_Joan_Burton_LAB  1621   6429       307 2010 BUDGET
##        2010_BUDGET_04_Arthur_Morgan_SF  1589   7185       343 2010 BUDGET
##          2010_BUDGET_05_Brian_Cowen_FF  1618   6697       250 2010 BUDGET
##           2010_BUDGET_06_Enda_Kenny_FG  1151   4254       153 2010 BUDGET
##      2010_BUDGET_07_Kieran_ODonnell_FG   681   2309       133 2010 BUDGET
##       2010_BUDGET_08_Eamon_Gilmore_LAB  1183   4217       201 2010 BUDGET
##     2010_BUDGET_09_Michael_Higgins_LAB   490   1288        44 2010 BUDGET
##        2010_BUDGET_10_Ruairi_Quinn_LAB   442   1290        59 2010 BUDGET
##      2010_BUDGET_11_John_Gormley_Green   404   1036        49 2010 BUDGET
##        2010_BUDGET_12_Eamon_Ryan_Green   512   1651        90 2010 BUDGET
##      2010_BUDGET_13_Ciaran_Cuffe_Green   444   1248        45 2010 BUDGET
##  2010_BUDGET_14_Caoimhghin_OCaolain_SF  1188   4094       176 2010 BUDGET
##  number      foren     name party
##      01      Brian  Lenihan    FF
##      02    Richard   Bruton    FG
##      03       Joan   Burton   LAB
##      04     Arthur   Morgan    SF
##      05      Brian    Cowen    FF
##      06       Enda    Kenny    FG
##      07     Kieran ODonnell    FG
##      08      Eamon  Gilmore   LAB
##      09    Michael  Higgins   LAB
##      10     Ruairi    Quinn   LAB
##      11       John  Gormley Green
##      12      Eamon     Ryan Green
##      13     Ciaran    Cuffe Green
##      14 Caoimhghin OCaolain    SF
## 
## Source:  /home/paul/Dropbox/code/quantedaData/* on x86_64 by paul
## Created: Tue Sep 16 15:58:21 2014
## Notes:

We can save the output from the summary command as a data frame, and plot some basic descriptive statistics with this information:

tokenInfo <- summary(data_corpus_inaugural)
## Corpus consisting of 58 documents.
## 
##             Text Types Tokens Sentences Year  President       FirstName
##  1789-Washington   626   1540        23 1789 Washington          George
##  1793-Washington    96    147         4 1793 Washington          George
##       1797-Adams   826   2584        37 1797      Adams            John
##   1801-Jefferson   716   1935        41 1801  Jefferson          Thomas
##   1805-Jefferson   804   2381        45 1805  Jefferson          Thomas
##     1809-Madison   536   1267        21 1809    Madison           James
##     1813-Madison   542   1304        33 1813    Madison           James
##      1817-Monroe  1040   3696       121 1817     Monroe           James
##      1821-Monroe  1262   4898       129 1821     Monroe           James
##       1825-Adams  1004   3154        74 1825      Adams     John Quincy
##     1829-Jackson   517   1210        25 1829    Jackson          Andrew
##     1833-Jackson   499   1271        29 1833    Jackson          Andrew
##    1837-VanBuren  1315   4175        95 1837  Van Buren          Martin
##    1841-Harrison  1893   9178       210 1841   Harrison   William Henry
##        1845-Polk  1330   5211       153 1845       Polk      James Knox
##      1849-Taylor   497   1185        22 1849     Taylor         Zachary
##      1853-Pierce  1166   3657       104 1853     Pierce        Franklin
##    1857-Buchanan   945   3106        89 1857   Buchanan           James
##     1861-Lincoln  1075   4016       135 1861    Lincoln         Abraham
##     1865-Lincoln   362    780        26 1865    Lincoln         Abraham
##       1869-Grant   486   1243        40 1869      Grant      Ulysses S.
##       1873-Grant   552   1479        43 1873      Grant      Ulysses S.
##       1877-Hayes   829   2730        59 1877      Hayes   Rutherford B.
##    1881-Garfield  1018   3240       111 1881   Garfield        James A.
##   1885-Cleveland   674   1828        44 1885  Cleveland          Grover
##    1889-Harrison  1355   4744       157 1889   Harrison        Benjamin
##   1893-Cleveland   823   2135        58 1893  Cleveland          Grover
##    1897-McKinley  1236   4383       130 1897   McKinley         William
##    1901-McKinley   857   2449       100 1901   McKinley         William
##   1905-Roosevelt   404   1089        33 1905  Roosevelt        Theodore
##        1909-Taft  1436   5844       159 1909       Taft  William Howard
##      1913-Wilson   661   1896        68 1913     Wilson         Woodrow
##      1917-Wilson   549   1656        59 1917     Wilson         Woodrow
##     1921-Harding  1172   3743       148 1921    Harding       Warren G.
##    1925-Coolidge  1221   4442       196 1925   Coolidge          Calvin
##      1929-Hoover  1086   3895       158 1929     Hoover         Herbert
##   1933-Roosevelt   744   2064        85 1933  Roosevelt     Franklin D.
##   1937-Roosevelt   729   2027        96 1937  Roosevelt     Franklin D.
##   1941-Roosevelt   527   1552        68 1941  Roosevelt     Franklin D.
##   1945-Roosevelt   276    651        26 1945  Roosevelt     Franklin D.
##      1949-Truman   781   2531       116 1949     Truman        Harry S.
##  1953-Eisenhower   903   2765       119 1953 Eisenhower       Dwight D.
##  1957-Eisenhower   621   1933        92 1957 Eisenhower       Dwight D.
##     1961-Kennedy   566   1568        52 1961    Kennedy         John F.
##     1965-Johnson   569   1725        93 1965    Johnson   Lyndon Baines
##       1969-Nixon   743   2437       103 1969      Nixon Richard Milhous
##       1973-Nixon   545   2018        68 1973      Nixon Richard Milhous
##      1977-Carter   528   1380        52 1977     Carter           Jimmy
##      1981-Reagan   904   2798       128 1981     Reagan          Ronald
##      1985-Reagan   925   2935       123 1985     Reagan          Ronald
##        1989-Bush   795   2683       141 1989       Bush          George
##     1993-Clinton   644   1837        81 1993    Clinton            Bill
##     1997-Clinton   773   2451       111 1997    Clinton            Bill
##        2001-Bush   622   1810        97 2001       Bush       George W.
##        2005-Bush   772   2325       100 2005       Bush       George W.
##       2009-Obama   939   2729       110 2009      Obama          Barack
##       2013-Obama   814   2335        88 2013      Obama          Barack
##       2017-Trump   582   1662        88 2017      Trump       Donald J.
## 
## Source:  /home/paul/Dropbox/code/quanteda/* on x86_64 by paul
## Created: Fri Sep 12 12:41:17 2014
## Notes:
if (require(ggplot2))
    ggplot(data = tokenInfo, aes(x = Year, y = Tokens, group = 1)) + geom_line() + geom_point() +
        scale_x_continuous(breaks = seq(1789, 2017, 12))


# Longest inaugural address: William Henry Harrison
tokenInfo[which.max(tokenInfo$Tokens),] 
##                        Text Types Tokens Sentences Year President
## 1841-Harrison 1841-Harrison  1893   9178       210 1841  Harrison
##                   FirstName
## 1841-Harrison William Henry

Tools for handling corpus objects

Adding two corpus objects together

The + operator provides a simple method for concatenating two corpus objects. If they contain different sets of document-level variables, these will be stitched together in a fashion that guarantees that no information is lost. Corpus-level metadata is also concatenated.

library(quanteda)
mycorpus1 <- corpus(data_corpus_inaugural[1:5], note = "First five inaug speeches.")
## Warning in corpus.character(data_corpus_inaugural[1:5], note = "First five
## inaug speeches."): Argument note not used.
mycorpus2 <- corpus(data_corpus_inaugural[53:58], note = "Last six inaug speeches.")
## Warning in corpus.character(data_corpus_inaugural[53:58], note = "Last six
## inaug speeches."): Argument note not used.
mycorpus3 <- mycorpus1 + mycorpus2
summary(mycorpus3)
## Corpus consisting of 11 documents.
## 
##             Text Types Tokens Sentences
##  1789-Washington   626   1540        23
##  1793-Washington    96    147         4
##       1797-Adams   826   2584        37
##   1801-Jefferson   716   1935        41
##   1805-Jefferson   804   2381        45
##     1997-Clinton   773   2451       111
##        2001-Bush   622   1810        97
##        2005-Bush   772   2325       100
##       2009-Obama   939   2729       110
##       2013-Obama   814   2335        88
##       2017-Trump   582   1662        88
## 
## Source:  Combination of corpuses mycorpus1 and mycorpus2
## Created: Tue May 16 21:00:54 2017
## Notes:

Subsetting corpus objects

The corpus_subset() function extracts a new corpus based on logical conditions applied to docvars:

summary(corpus_subset(data_corpus_inaugural, Year > 1990))
## Corpus consisting of 7 documents.
## 
##          Text Types Tokens Sentences Year President FirstName
##  1993-Clinton   644   1837        81 1993   Clinton      Bill
##  1997-Clinton   773   2451       111 1997   Clinton      Bill
##     2001-Bush   622   1810        97 2001      Bush George W.
##     2005-Bush   772   2325       100 2005      Bush George W.
##    2009-Obama   939   2729       110 2009     Obama    Barack
##    2013-Obama   814   2335        88 2013     Obama    Barack
##    2017-Trump   582   1662        88 2017     Trump Donald J.
## 
## Source:  /home/paul/Dropbox/code/quanteda/* on x86_64 by paul
## Created: Fri Sep 12 12:41:17 2014
## Notes:
summary(corpus_subset(data_corpus_inaugural, President == "Adams"))
## Corpus consisting of 2 documents.
## 
##        Text Types Tokens Sentences Year President   FirstName
##  1797-Adams   826   2584        37 1797     Adams        John
##  1825-Adams  1004   3154        74 1825     Adams John Quincy
## 
## Source:  /home/paul/Dropbox/code/quanteda/* on x86_64 by paul
## Created: Fri Sep 12 12:41:17 2014
## Notes:

Exploring corpus texts

The kwic function (KeyWord In Context) performs a search for a word and allows us to view the contexts in which it occurs:

options(width = 200)
kwic(data_corpus_inaugural, "terror")
##                                                                                                       
##     [1797-Adams, 1327]              fraud or violence, by | terror | , intrigue, or venality          
##  [1933-Roosevelt, 112] nameless, unreasoning, unjustified | terror | which paralyzes needed efforts to
##  [1941-Roosevelt, 289]      seemed frozen by a fatalistic | terror | , we proved that this            
##    [1961-Kennedy, 868]    alter that uncertain balance of | terror | that stays the hand of           
##     [1981-Reagan, 821]     freeing all Americans from the | terror | of runaway living costs.         
##   [1997-Clinton, 1055]        They fuel the fanaticism of | terror | . And they torment the           
##   [1997-Clinton, 1655]  maintain a strong defense against | terror | and destruction. Our children    
##     [2009-Obama, 1646]     advance their aims by inducing | terror | and slaughtering innocents, we
kwic(data_corpus_inaugural, "terror", valuetype = "regex")
##                                                                                                               
##     [1797-Adams, 1327]                   fraud or violence, by |  terror   | , intrigue, or venality          
##  [1933-Roosevelt, 112]      nameless, unreasoning, unjustified |  terror   | which paralyzes needed efforts to
##  [1941-Roosevelt, 289]           seemed frozen by a fatalistic |  terror   | , we proved that this            
##    [1961-Kennedy, 868]         alter that uncertain balance of |  terror   | that stays the hand of           
##    [1961-Kennedy, 992]               of science instead of its |  terrors  | . Together let us explore        
##     [1981-Reagan, 821]          freeing all Americans from the |  terror   | of runaway living costs.         
##    [1981-Reagan, 2204]        understood by those who practice | terrorism | and prey upon their neighbors    
##   [1997-Clinton, 1055]             They fuel the fanaticism of |  terror   | . And they torment the           
##   [1997-Clinton, 1655]       maintain a strong defense against |  terror   | and destruction. Our children    
##     [2009-Obama, 1646]          advance their aims by inducing |  terror   | and slaughtering innocents, we   
##     [2017-Trump, 1119] civilized world against radical Islamic | terrorism | , which we will eradicate
kwic(data_corpus_inaugural, "communist*")
##                                                                                              
##   [1949-Truman, 838] the actions resulting from the | Communist  | philosophy are a threat to
##  [1961-Kennedy, 519]             -- not because the | Communists | may be doing it,

In the corpus summaries shown earlier, Year and President are variables associated with each document. We can access such variables with the docvars() function.

# inspect the document-level variables
head(docvars(data_corpus_inaugural))
##                 Year  President FirstName
## 1789-Washington 1789 Washington    George
## 1793-Washington 1793 Washington    George
## 1797-Adams      1797      Adams      John
## 1801-Jefferson  1801  Jefferson    Thomas
## 1805-Jefferson  1805  Jefferson    Thomas
## 1809-Madison    1809    Madison     James

# inspect the corpus-level metadata
metacorpus(data_corpus_inaugural)
## $source
## [1] "/home/paul/Dropbox/code/quanteda/* on x86_64 by paul"
## 
## $created
## [1] "Fri Sep 12 12:41:17 2014"
## 
## $notes
## NULL
## 
## $citation
## NULL

More corpora are available from the quantedaData package.

Extracting Features from a Corpus

In order to perform statistical analysis such as document scaling, we must extract a matrix associating values for certain features with each document. In quanteda, we use the dfm function to produce such a matrix. “dfm” is short for document-feature matrix, and always refers to documents in rows and “features” as columns. We fix this dimensional orientation because it is standard in data analysis to have a unit of analysis as a row, and features or variables pertaining to each unit as columns. We call them “features” rather than terms, because features are more general than terms: they can be defined as raw terms, stemmed terms, the parts of speech of terms, terms after stopwords have been removed, or a dictionary class to which a term belongs. Features can be entirely general, such as ngrams or syntactic dependencies, and we leave this open-ended.
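The row and column orientation, and the idea that features need not be raw terms, can be seen in a small sketch. The two texts are invented for the example, and dfm_wordstem() is assumed to be available as in recent quanteda releases.

```r
library(quanteda)

txts <- c(d1 = "Taxes rise and taxes fall.",
          d2 = "Spending will rise.")
dfmat <- dfm(tokens(txts, remove_punct = TRUE))

docnames(dfmat)    # one row per document
featnames(dfmat)   # one column per feature (lowercased terms here)

# Count of the feature "taxes" in document d1
n_taxes <- as.matrix(dfmat)["d1", "taxes"]
n_taxes

# The same documents with stemmed terms as features
dfmat_stem <- dfm_wordstem(dfmat)
featnames(dfmat_stem)
```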

Tokenizing texts

To simply tokenize a text, quanteda provides a powerful command called tokens(). This produces an intermediate object, consisting of a list of tokens in the form of character vectors, where each element of the list corresponds to an input document.

tokens() is deliberately conservative, meaning that it does not remove anything from the text unless told to do so.

txt <- c(text1 = "This is $10 in 999 different ways,\n up and down; left and right!", 
         text2 = "@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokens(txt)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "$"         "10"        "in"        "999"       "different" "ways"      ","         "up"        "and"       "down"      ";"         "left"      "and"       "right"    
## [17] "!"        
## 
## text2 :
##  [1] "@kenbenoit"     "working"        ":"              "on"             "#quanteda"      "2day"           "4ever"          ","              "http"           ":"              "/"             
## [12] "/"              "textasdata.com" "?"              "page"           "="              "123"            "."
tokens(txt, remove_numbers = TRUE, remove_punct = TRUE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "in"        "different" "ways"      "up"        "and"       "down"      "left"      "and"       "right"    
## 
## text2 :
## [1] "@kenbenoit"     "working"        "on"             "#quanteda"      "2day"           "4ever"          "http"           "textasdata.com" "page"
tokens(txt, remove_numbers = FALSE, remove_punct = TRUE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "10"        "in"        "999"       "different" "ways"      "up"        "and"       "down"      "left"      "and"       "right"    
## 
## text2 :
##  [1] "@kenbenoit"     "working"        "on"             "#quanteda"      "2day"           "4ever"          "http"           "textasdata.com" "page"           "123"
tokens(txt, remove_numbers = TRUE, remove_punct = FALSE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "$"         "in"        "different" "ways"      ","         "up"        "and"       "down"      ";"         "left"      "and"       "right"     "!"        
## 
## text2 :
##  [1] "@kenbenoit"     "working"        ":"              "on"             "#quanteda"      "2day"           "4ever"          ","              "http"           ":"              "/"             
## [12] "/"              "textasdata.com" "?"              "page"           "="              "."
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "$"         "10"        "in"        "999"       "different" "ways"      ","         "up"        "and"       "down"      ";"         "left"      "and"       "right"    
## [17] "!"        
## 
## text2 :
##  [1] "@kenbenoit"     "working"        ":"              "on"             "#quanteda"      "2day"           "4ever"          ","              "http"           ":"              "/"             
## [12] "/"              "textasdata.com" "?"              "page"           "="              "123"            "."
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE, remove_separators = FALSE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      " "         "is"        " "         "$"         "10"        " "         "in"        " "         "999"       " "         "different" " "         "ways"      ","         "\n"       
## [17] " "         "up"        " "         "and"       " "         "down"      ";"         " "         "left"      " "         "and"       " "         "right"     "!"        
## 
## text2 :
##  [1] "@kenbenoit"     " "              "working"        ":"              " "              "on"             " "              "#quanteda"      " "              "2day"           "\t"            
## [12] "4ever"          ","              " "              "http"           ":"              "/"              "/"              "textasdata.com" "?"              "page"           "="             
## [23] "123"            "."

We also have the option to tokenize characters:

tokens("Great website: http://textasdata.com?page=123.", what = "character")
## tokens from 1 document.
## Component 1 :
##  [1] "G" "r" "e" "a" "t" "w" "e" "b" "s" "i" "t" "e" ":" "h" "t" "t" "p" ":" "/" "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p" "a" "g" "e" "=" "1" "2" "3" "."
tokens("Great website: http://textasdata.com?page=123.", what = "character",
       remove_separators = FALSE)
## tokens from 1 document.
## Component 1 :
##  [1] "G" "r" "e" "a" "t" " " "w" "e" "b" "s" "i" "t" "e" ":" " " "h" "t" "t" "p" ":" "/" "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p" "a" "g" "e" "=" "1" "2" "3" "."

and sentences:

# sentence level
tokens(c("Kurt Vonnegut said; only assholes use semi-colons.",
         "Today is Thursday in Canberra:  It is yesterday in London.",
         "En el caso de que no puedas ir con ellos, ¿quieres ir con nosotros?"),
       what = "sentence")
## tokens from 3 documents.
## Component 1 :
## [1] "Kurt Vonnegut said; only assholes use semi-colons."
## 
## Component 2 :
## [1] "Today is Thursday in Canberra:  It is yesterday in London."
## 
## Component 3 :
## [1] "En el caso de que no puedas ir con ellos, ¿quieres ir con nosotros?"

Constructing a document-feature matrix

Tokenizing texts is an intermediate step, and most users will want to skip straight to constructing a document-feature matrix. For this, we have a Swiss-army knife function, called dfm(), which performs tokenization and tabulates the extracted features into a matrix of documents by features. Unlike the conservative approach taken by tokens(), the dfm() function applies certain options by default, such as lower-casing the texts. All of the options to tokens() can also be passed to dfm(); for example, the “,” feature in the output below is dropped only when remove_punct = TRUE is supplied, as in the later examples.

myCorpus <- corpus_subset(data_corpus_inaugural, Year > 1990)

# make a dfm
myDfm <- dfm(myCorpus)
myDfm[, 1:5]
## Document-feature matrix of: 7 documents, 5 features (0% sparse).
## 7 x 5 sparse Matrix of class "dfmSparse"
##               features
## docs           my fellow citizens   , today
##   1993-Clinton  7      5        2 139    10
##   1997-Clinton  6      7        7 131     5
##   2001-Bush     3      1        9 110     2
##   2005-Bush     2      3        6 120     3
##   2009-Obama    2      1        1 130     6
##   2013-Obama    3      3        6  99     4
##   2017-Trump    1      1        4  96     4

Other options for dfm() include removing stopwords and stemming the tokens.

# make a dfm, removing stopwords and applying stemming
myStemMat <- dfm(myCorpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
myStemMat[, 1:5]
## Document-feature matrix of: 7 documents, 5 features (17.1% sparse).
## 7 x 5 sparse Matrix of class "dfmSparse"
##               features
## docs           fellow citizen today celebr mysteri
##   1993-Clinton      5       2    10      4       1
##   1997-Clinton      7       8     6      1       0
##   2001-Bush         1      10     2      0       0
##   2005-Bush         3       7     3      2       0
##   2009-Obama        1       1     6      2       0
##   2013-Obama        3       8     6      1       0
##   2017-Trump        1       4     5      3       1

The remove option takes a list of tokens to be ignored. Most users will supply a list of pre-defined “stop words”, which are available for numerous languages through the stopwords() function:

head(stopwords("english"), 20)
##  [1] "i"          "me"         "my"         "myself"     "we"         "our"        "ours"       "ourselves"  "you"        "your"       "yours"      "yourself"   "yourselves" "he"         "him"       
## [16] "his"        "himself"    "she"        "her"        "hers"
head(stopwords("russian"), 10)
##  [1] "и"   "в"   "во"  "не"  "что" "он"  "на"  "я"   "с"   "со"
head(stopwords("arabic"), 10)
##  [1] "فى"  "في"  "كل"  "لم"  "لن"  "له"  "من"  "هو"  "هي"  "قوة"
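
Conceptually, removing stopwords means dropping every token that appears in the stop list. A minimal base R sketch of that idea follows; the token vector and the four-word stop_list are hypothetical, and quanteda's own matching (case handling, pattern types) is not reproduced here.

```r
# Hypothetical tokens from a single document
doc_tokens <- c("this", "is", "a", "short", "example", "of", "stopword", "removal")

# A tiny made-up stop list; in practice this would come from stopwords()
stop_list <- c("this", "is", "a", "of")

# Keep only the tokens that are not in the stop list
kept <- doc_tokens[!(doc_tokens %in% stop_list)]

kept
```

Only the content-bearing tokens remain in kept, in their original order.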

Viewing the document-feature matrix

The dfm can be inspected in the Environment pane in RStudio, or by calling R’s View function. Calling plot on a dfm will display a wordcloud using the wordcloud package.

mydfm <- dfm(data_char_ukimmig2010, remove = c("will", stopwords("english")), 
             remove_punct = TRUE)
mydfm
## Document-feature matrix of: 9 documents, 1,547 features (83.8% sparse).

To access a list of the most frequently occurring features, we can use topfeatures():

topfeatures(mydfm, 20)  # 20 top words
## immigration     british      people      asylum     britain          uk      system  population     country         new  immigrants      ensure       shall citizenship      social    national 
##          66          37          35          29          28          27          27          21          20          19          17          17          17          16          14          14 
##         bnp     illegal        work     percent 
##          13          13          13          12
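
In essence, a list of top features ranks the features by their total count across all documents. The base R sketch below illustrates that idea on a hand-made count matrix; the matrix m and its values are invented for illustration, and this is not a description of how topfeatures() is implemented.

```r
# Hypothetical counts: 3 documents (rows) by 3 features (columns)
m <- matrix(c(3, 0, 1,
              2, 5, 0,
              4, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("asylum", "border", "work")))

# Total frequency of each feature over all documents
feature_totals <- colSums(m)

# The two most frequent features, highest first
top2 <- head(sort(feature_totals, decreasing = TRUE), 2)

top2
```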

To plot a word cloud from a dfm object, use textplot_wordcloud(). This function passes arguments through to wordcloud() from the wordcloud package, so the plot can be customized using the same arguments:

set.seed(100)
textplot_wordcloud(mydfm, min.freq = 6, random.order = FALSE,
                   rot.per = .25, 
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))

Grouping documents by document variable

Often, we are interested in analysing how texts differ according to substantive factors which may be encoded in the document variables, rather than simply by the boundaries of the document files. We can group documents which share the same value for a document variable when creating a dfm:

byPartyDfm <- dfm(data_corpus_irishbudget2010, groups = "party", remove = stopwords("english"), remove_punct = TRUE)

We can sort this dfm, and inspect it:

dfm_sort(byPartyDfm)[, 1:10]
## Document-feature matrix of: 5 documents, 10 features (0% sparse).
## 5 x 10 sparse Matrix of class "dfmSparse"
##        features
## docs    will people budget government public minister tax economy pay jobs
##   FF     212     23     44         47     65       11  60      37  41   41
##   FG      93     78     71         61     47       62  11      20  29   17
##   Green   59     15     26         19      4        4  11      16   4   15
##   LAB     89     69     66         36     32       54  47      37  24   20
##   SF     104     81     53         73     31       39  34      50  24   27

Note that the most frequently occurring feature is “will”, a word usually on English stop lists, but one that is not included in quanteda’s built-in English stopword list.
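
The grouping step in this section amounts to summing the rows of a document-feature matrix that share a value of the grouping variable. A small base R sketch of that idea, using rowsum(), is given here; the matrix, the speech names, and the party vector are hypothetical stand-ins rather than the Irish budget data.

```r
# Hypothetical counts: 3 speeches (rows) by 2 features (columns)
m <- matrix(c(3, 1,
              2, 0,
              1, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("speech1", "speech2", "speech3"),
                            c("tax", "jobs")))

# Made-up document variable: the party of each speaker
party <- c("FF", "FG", "FF")

# Sum the rows that belong to the same party
grouped <- rowsum(m, group = party)

grouped
```

The result has one row per party, so the two FF speeches are combined into a single “document”.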

Grouping words by dictionary or equivalence class

For some applications we have prior knowledge of sets of words that are indicative of traits we would like to measure from the text. For example, a general list of positive words might indicate positive sentiment in a movie review, or we might have a dictionary of political terms which are associated with a particular ideological stance. In these cases, it is sometimes useful to treat these groups of words as equivalent for the purposes of analysis, and sum their counts into classes.

For example, let’s look at how words associated with terrorism and words associated with the economy vary by President in the inaugural speeches corpus. From the original corpus, we select Presidents since Clinton:

recentCorpus <- corpus_subset(data_corpus_inaugural, Year > 1991)

Now we define a demonstration dictionary:

myDict <- dictionary(list(terror = c("terrorism", "terrorists", "threat"),
                          economy = c("jobs", "business", "grow", "work")))

We can use the dictionary when making the dfm:

byPresMat <- dfm(recentCorpus, dictionary = myDict)
byPresMat
## Document-feature matrix of: 7 documents, 2 features (14.3% sparse).
## 7 x 2 sparse Matrix of class "dfmSparse"
##               features
## docs           terror economy
##   1993-Clinton      0       8
##   1997-Clinton      1       8
##   2001-Bush         0       4
##   2005-Bush         1       6
##   2009-Obama        1      10
##   2013-Obama        1       6
##   2017-Trump        1       5
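
To make the idea of summing word counts into classes concrete, here is a base R sketch for a single document. The count vector is made up, the dictionary simply mirrors the demonstration keys above, and exact string matching is an assumption of this sketch rather than a statement about how quanteda matches dictionary entries.

```r
# Hypothetical word counts for one document
counts <- c(terrorism = 2, threat = 1, jobs = 4, work = 3, freedom = 5)

# Dictionary: each key maps to the words treated as equivalent
dict <- list(terror  = c("terrorism", "terrorists", "threat"),
             economy = c("jobs", "business", "grow", "work"))

# For each key, add up the counts of the words that belong to it
class_counts <- vapply(dict,
                       function(words) sum(counts[names(counts) %in% words]),
                       numeric(1))

class_counts
```

Words that match no key, such as “freedom” here, do not contribute to any class.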

The constructor function dictionary() also works with two common “foreign” dictionary formats: the LIWC and Provalis Research’s Wordstat format. For instance, we can load the LIWC dictionary and apply it to the Presidential inaugural speech corpus:

liwcdict <- dictionary(file = "~/Dropbox/QUANTESS/dictionaries/LIWC/LIWC2001_English.dic",
                       format = "LIWC")
liwcdfm <- dfm(data_corpus_inaugural[52:58], dictionary = liwcdict, verbose = FALSE)
liwcdfm[, 1:10]

Further Examples

Similarities between texts

presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year>1980), 
               remove = stopwords("english"),
               stem = TRUE, remove_punct = TRUE)
obamaSimil <- textstat_simil(presDfm, c("2009-Obama" , "2013-Obama"), 
                             margin = "documents", method = "cosine")
obamaSimil
##              2009-Obama 2013-Obama
## 2009-Obama    1.0000000  0.7144178
## 2013-Obama    0.7144178  1.0000000
## 1981-Reagan   0.6726373  0.6822342
## 1985-Reagan   0.6669281  0.6844686
## 1989-Bush     0.6687192  0.6257466
## 1993-Clinton  0.6288108  0.6278986
## 1997-Clinton  0.6954614  0.6826705
## 2001-Bush     0.6529447  0.6656213
## 2005-Bush     0.5766800  0.6292757
## 2017-Trump    0.5867729  0.5796322
# dotchart(as.list(obamaSimil)$"2009-Obama", xlab = "Cosine similarity")
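
For readers unfamiliar with the measure, cosine similarity between two documents is the dot product of their feature-count vectors divided by the product of their lengths. The following base R sketch computes it directly; the helper name cosine_sim and the two count vectors are hypothetical, and the sketch is independent of how textstat_simil() computes its result.

```r
# Cosine similarity between two numeric vectors of equal length
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Hypothetical feature counts for two documents over the same four features
doc_a <- c(2, 0, 1, 3)
doc_b <- c(1, 1, 0, 3)

cosine_sim(doc_a, doc_b)
```

A document compared with itself scores 1, and two documents with no features in common score 0.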

We can use these distances to plot a dendrogram, clustering presidents:

data(data_corpus_SOTU, package="quantedaData")
presDfm <- dfm(corpus_subset(data_corpus_SOTU, Date > as.Date("1980-01-01")), 
               verbose = FALSE, stem = TRUE, remove_punct = TRUE,
               remove = c("will", stopwords("english")))
presDfm <- dfm_trim(presDfm, min_count = 5, min_docfreq = 3)
# hierarchical clustering - get distances on normalized dfm
presDistMat <- textstat_dist(dfm_weight(presDfm, "relFreq"))
# hierarchical clustering of the distance object
presCluster <- hclust(presDistMat)
# label with document names
presCluster$labels <- docnames(presDfm)
# plot as a dendrogram
plot(presCluster, xlab = "", sub = "", main = "Euclidean Distance on Normalized Token Frequency")

(try it!)
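
If the quantedaData package is not installed, the same sequence of steps (normalize counts to relative frequencies, compute Euclidean distances, cluster hierarchically) can be tried on a toy matrix in base R. The matrix below and its document and feature names are invented for this sketch, and base R's dist() stands in for textstat_dist().

```r
# Hypothetical counts: 3 documents (rows) by 3 features (columns)
m <- matrix(c(4, 1, 0,
              2, 2, 1,
              0, 1, 5),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("docA", "docB", "docC"),
                            c("tax", "jobs", "peace")))

# Relative frequencies: divide each row by its total so rows sum to 1
rel <- m / rowSums(m)

# Euclidean distances between documents (the default method of dist())
d <- dist(rel)

# Hierarchical clustering on the distance object
cl <- hclust(d)

cl$labels
```

Calling plot(cl) on the result draws the corresponding dendrogram.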

We can also look at term similarities:

sim <- textstat_simil(presDfm, c("fair", "health", "terror"), method = "cosine", margin = "features")
as.list(sim)
## $fair
##       economi         begin     jefferson        author         faith          call       struggl          best         creat        courag           god         pledg       compass          much 
##     0.9080252     0.9075951     0.8981462     0.8944272     0.8866586     0.8608285     0.8451543     0.8366600     0.8347300     0.8326664     0.8321715     0.8292884     0.8280787     0.8235321 
##        social          alli        believ         order        danger       continu        failur          full         limit          well           tax        govern            us          side 
##     0.8215838     0.8215838     0.8194652     0.8164966     0.8164966     0.8150876     0.8082904     0.8082904     0.8082904     0.8064778     0.8007572     0.7975845     0.7958402     0.7877264 
##      opportun        beyond        travel         stand          vice        suffer         howev          size       chariti          hold        prayer          peac        econom       preserv 
##     0.7807201     0.7784989     0.7784989     0.7751702     0.7745967     0.7745967     0.7745967     0.7745967     0.7745967     0.7745967     0.7706746     0.7665050     0.7640574     0.7640574 
##          meet         bless         among        weapon          take         earth           yet         thoma        almost        republ          cost          sign        troubl        declin 
##     0.7640574     0.7603565     0.7537347     0.7532436     0.7527727     0.7460038     0.7458699     0.7453560     0.7453560     0.7453560     0.7453560     0.7453560     0.7453560     0.7453560 
##          rest        intend          agre          upon          must           now          mani           way          time         think         assur          fall           aim       deficit 
##     0.7453560     0.7453560     0.7453560     0.7447932     0.7431854     0.7404361     0.7404361     0.7395219     0.7380203     0.7378648     0.7378648     0.7378648     0.7378648     0.7348469 
##      threaten        growth          ever        system         carri       digniti        o'neil         occas        inflat      unemploy          pace          bear       concern        ethnic 
##     0.7333333     0.7333333     0.7324670     0.7319251     0.7315635     0.7313103     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967 
##       barrier          core        revers        genius        church      prioriti        unborn       arsenal        utmost          john    accomplish       servant     enterpris      virginia 
##     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967     0.7302967 
##         wrote          abus          lead         place          year       histori         feder       program       poverti       product        resolv          pass          work          fail 
##     0.7302967     0.7302967     0.7229569     0.7197275     0.7191465     0.7174922     0.7171372     0.7161149     0.7161149     0.7089176     0.7089176     0.7084919     0.7041796     0.7006490 
##        defens      interest       societi      individu        purpos         polit          race           pay          ride          slow      independ        worthi         trust      confront 
##     0.7006490     0.6982972     0.6956656     0.6956083     0.6956083     0.6940221     0.6928203     0.6928203     0.6928203     0.6928203     0.6928203     0.6928203     0.6900656     0.6900656 
##          show          fill          play        import         peopl        nation       respons        return          sick          said          find      american         reduc          valu 
##     0.6885304     0.6885304     0.6885304     0.6885304     0.6875214     0.6868957     0.6842257     0.6831301     0.6831301     0.6831301     0.6831301     0.6816212     0.6803091     0.6791622 
##         never        declar       citizen        demand         small          deep           act        number          told       support         storm           can      strength         power 
##     0.6781327     0.6761234     0.6743403     0.6713171     0.6712486     0.6708204     0.6708204     0.6708204     0.6708204     0.6694387     0.6694387     0.6690667     0.6675920     0.6658003 
##      sacrific         today          done      principl       present        restor          kill          away      progress         futur         spend        person        solemn         bound 
##     0.6582806     0.6568567     0.6555623     0.6546537     0.6531973     0.6531973     0.6531973     0.6515838     0.6507914     0.6506000     0.6459752     0.6454972     0.6454972     0.6454972 
##     strongest        histor         georg         spoke         liber          poor        safeti         doubt          step          earn           let          just        commit         simpl 
##     0.6454972     0.6454972     0.6454972     0.6454972     0.6454972     0.6454972     0.6454972     0.6454972     0.6445034     0.6445034     0.6421476     0.6390922     0.6390097     0.6390097 
##          road         world          will         endur           man           two         reach          bush         decis         month          line         short          fate        mutual 
##     0.6390097     0.6368885     0.6368304     0.6367145     0.6330542     0.6324555     0.6324555     0.6324555     0.6324555     0.6324555     0.6324555     0.6324555     0.6324555     0.6324555 
##          west            go         found          hero          live          know           law         chanc      children        remain          last        answer        direct     technolog 
##     0.6324555     0.6303152     0.6292532     0.6282809     0.6281206     0.6270544     0.6262243     0.6262243     0.6260475     0.6233023     0.6233023     0.6227992     0.6227992     0.6227992 
##         blood          life         capac         young        effort        presid          oath         state         group          goal         price         share        famili        togeth 
##     0.6227992     0.6225318     0.6210590     0.6199304     0.6196773     0.6142951     0.6110101     0.6090002     0.6085806     0.6085806     0.6085806     0.6076436     0.6048584     0.6030227 
##          look         decad        energi        strive           tri        common          hear        servic          keep          seek      reverend           goe          cast       collect 
##     0.6028606     0.6024641     0.6024641     0.6024641     0.6024641     0.6024641     0.6024641     0.6024641     0.6000000     0.5976143     0.5962848     0.5962848     0.5962848     0.5962848 
##        whatev         manag        capabl        around         grown     establish       unleash          self         aspir        negoti         yield          knew           cut          bond 
##     0.5962848     0.5962848     0.5962848     0.5962848     0.5962848     0.5962848     0.5962848     0.5962848     0.5962848     0.5962848     0.5962848     0.5962848     0.5962848     0.5962848 
##         begun        domest       conduct       conquer        uphold          rage           one       greater      determin          rais         shall         civil          help        tradit 
##     0.5962848     0.5962848     0.5962848     0.5962848     0.5962848     0.5962848     0.5958582     0.5933661     0.5923489     0.5923489     0.5923489     0.5901689     0.5884899     0.5855400 
##           men          feed          held         allow      destruct         renew           war        father          hope          past         whose         shown        remind        depend 
##     0.5855400     0.5855400     0.5855400     0.5855400     0.5855400     0.5846883     0.5843065     0.5843065     0.5806832     0.5797509     0.5797509     0.5773503     0.5773503     0.5773503 
##       inaugur          hill          kind         local    birthright         drive         music       heritag         spare         senat         great         labor        provid           end 
##     0.5773503     0.5773503     0.5773503     0.5773503     0.5773503     0.5773503     0.5773503     0.5773503     0.5773503     0.5753560     0.5731973     0.5728919     0.5728919     0.5720776 
##       victori        affirm         speak        public         still       forward           put         secur         stori         build          noth          born          gift        though 
##     0.5715476     0.5715476     0.5680376     0.5680376     0.5677923     0.5645345     0.5634362     0.5590654     0.5543479     0.5532833     0.5520524     0.5520524     0.5520524     0.5520524 
##     communiti          make         crisi         human        fellow          alon       increas       process          busi         worst        borrow          week      boundari        patrol 
##     0.5518193     0.5505978     0.5504819     0.5499644     0.5490214     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226 
##        except      recognit        reserv       command        heroic         quiet        balanc         decid    strengthen       forbear      shoulder         humil       dignifi         river 
##     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226 
##     arlington          paid       earlier           lie          fame        repres        bestow     horseback           raw             4        wherev          took         equip        dramat 
##     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226 
##        modern          tide     overwhelm        hungri       treasur          debt           due      unfortun        origin        awesom         waver      research        missil        nowher 
##     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226 
##         youth          snow        valley        affect        expect          lest     conscienc         grace      grandest         delay          wind       serious          soil       respect 
##     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226 
##         cynic       medicar        mistak         favor     substitut      stranger       respond          lend        search         basic       subject         swift          turn        moment 
##     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5477226     0.5471885     0.5449493 
##        follow       nuclear          need       journey        spirit         happi         thank           arm         birth       problem         everi         chang          long         equal 
##     0.5449493     0.5445101     0.5422177     0.5417527     0.5402955     0.5383819     0.5370862     0.5333333     0.5333333     0.5313689     0.5308530     0.5307228     0.5296409     0.5289437 
##         chief           els        realiz     adversari           met          hall        invest        school         ambit           bad         stake         women       generat       america 
##     0.5270463     0.5270463     0.5270463     0.5270463     0.5270463     0.5270463     0.5270463     0.5270463     0.5270463     0.5270463     0.5270463     0.5270463     0.5258738     0.5258685 
##          face          grow           say         might          care       destroy          back        achiev        requir       freedom        health      hatfield        mondal         baker 
##     0.5222330     0.5217492     0.5217492     0.5217492     0.5217492     0.5217492     0.5209237     0.5196152     0.5196152     0.5164553     0.5163978     0.5163978     0.5163978     0.5163978 
##        moomaw        occurr        routin         uniqu        realli  every-4-year        normal       transit         degre       bulwark       afflict       proport       longest       distort 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##         penal        thrift         crush   fixed-incom          alik       shatter           idl        indign        burden          kept          pile       mortgag     temporari       conveni 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##         trend       tremend        upheav        period misunderstand         sever       bastion         tempt       complex      self-rul          elit      superior        someon         equit 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##         singl       neglect       section          food          mine         teach    profession industrialist      shopkeep         clerk         cabbi     truckdriv         breed       healthi 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##         vigor     discrimin       runaway         reviv     inventori         check       consent        intent          curb      distinct       smother        foster         stifl        extent 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##         avail       coincid      parallel    proportion     intervent        intrus        result   unnecessari        excess          loom       creativ          gate       counter  entrepreneur 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##     voluntari           art       address        makeup    countrymen       suffici        theori     unequivoc        emphat     paraphras       winston      churchil      dissolut         ahead 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##        propos         remov     roadblock       various         level        measur          inch          feet          mile      reawaken         giant       lighten         punit           eve 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##            dr        joseph        warren  massachusett       despair      exemplar        beacon         match      benefici   sovereignti          sale       surrend misunderstood       misjudg 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##       prevail        formid       practic          prey           ten        deepli         vista          mall        shrine      monument revolutionari        infant    nationhood         eloqu 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##          pool        column       whoever       heroism       potomac         shore         slope      cemeteri           row         white        marker         david          tini      fraction 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##       belleau          wood        argonn         omaha         beach       salerno       halfway     guadalcan        tarawa          pork          chop        chosin     reservoir         hundr 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##          rice         paddi         jungl        barber          shop          1917         franc       rainbow       western     battalion         heavi     artilleri         diari       flyleaf 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##      therefor         cheer       treptow       perform          deed        mathia        burger       presenc        absent        stenni         gilli     louisiana        silent         adequ 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##          50th         stood        wilder            13            60            50          gone           cri          moon        stress         glori   present-day      backward        proper 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##        machin          1980         ultim          rate        employ       vibrant        robust         climb        restat         freed          grip        sincer       meaning        reduct 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##       develop          warm      sunlight          pois        golden          gain     two-parti    republican        boston        lawyer          adam       planter         rival          1800 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##         later        soften         anger        letter   reestablish          1826   anniversari           die        fourth          juli       exchang        sunset         beset       valuabl 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##           oar      harmless          rode   well-intent         error         futil         chase         bloat     prescript       reelect          1984        vindic            25      straight 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##        incent entrepreneuri      interfer      simplifi         least       emancip          tear      distress     literatur        poetri         dynam      unbroken       brought        reckon 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##         staff        submit         freez         desir   unconstitut       alreadi         handl     fundament        upgrad        infirm   disadvantag        instal       hearten   brotherhood 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##         hesit         abund         utter       fervent         scorn       buildup        offens       legitim       discuss        elimin        either        resort        retali         logic 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##       recours        approv        shield       militar     demilitar        render       obsolet           rid      fourfold     hemispher      worldwid self-determin       inalien    staunchest 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##       inflict        lightn     transcend        ribbon        unfurl        symbol         insid       general          knee          lone        darken        ponder         alamo      encourag 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##       settler          sing          song        unknow     big-heart        tender      knowledg          rare          gore       contest    slave-hold          went       fallibl         grand 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##    insignific         enact          halt          rock           sea          seed          root        inborn           225        hidden        onward          deal        forgiv        appear 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##      undermin        permit        tactic          chao        inspir       condemn        apathi       prevent         recov      momentum         invit          mass        horror         arrog 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##       aggress    compassion      unworthi          view         fault      prolifer      diminish        mentor        pastor      synagogu         mosqu         wound       jericho     scapegoat 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##        option         civic       uncount       unhonor       comfort       spectat          miss     statesman         angel     whirlwind       accumul         theme          tire        finish 
##     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978     0.5163978 
##           day           see      maintain          task           ago       million          love          made        member        deserv          safe          voic       revolut          rich 
##     0.5113674     0.5101128     0.5070926     0.5070926     0.5063697     0.5061711     0.5042195     0.5041008     0.5012804     0.5012804     0.5012804     0.5012804     0.5012804     0.5012804 
##       special        soviet         union      greatest           use         ideal          less      ceremoni           ill          heal          save         child         front        memori 
##     0.5009794     0.5009794     0.5003702     0.4969040     0.4969040     0.4949747     0.4902811     0.4898979     0.4898979     0.4898979     0.4898979     0.4898979     0.4898979     0.4898979 
##       countri        cultur    understood          dark        reward          sinc        afford        embrac          hard          duti         enemi          open        threat         creed 
##     0.4885428     0.4879500     0.4879500     0.4879500     0.4879500     0.4879500     0.4879500     0.4879500     0.4879500     0.4838867     0.4830459     0.4830459     0.4830459     0.4794633 
##          come          caus        toward          also         right         bring      challeng      neighbor       capitol         light         shape          unit         other          left 
##     0.4760952     0.4746929     0.4738791     0.4738791     0.4737378     0.4724995     0.4679096     0.4670994     0.4670994     0.4670994     0.4670994     0.4670994     0.4669240     0.4666667 
##         start         sound        cooper      guarante     administr         grant        reluct      conflict           led       lincoln       willing          serv          dare         elect 
##     0.4666667     0.4666667     0.4618802     0.4618802     0.4618802     0.4618802     0.4618802     0.4618802     0.4618802     0.4618802     0.4618802     0.4618802     0.4618802     0.4618802 
##          area       allianc       possibl          echo       possess       depress         wrong      generous          risk          idea        matter         readi         final          name 
##     0.4618802     0.4618802     0.4618802     0.4618802     0.4618802     0.4618802     0.4618802     0.4618802     0.4618802     0.4610625     0.4564355     0.4564355     0.4564355     0.4564355 
##      prejudic        decent           era      question          join        belong       sometim       prosper          mean     democraci     constitut           eye        rather        inevit 
##     0.4564355     0.4564355     0.4564355     0.4564355     0.4564355     0.4564355     0.4557327     0.4554200     0.4511788     0.4507489     0.4472136     0.4472136     0.4472136     0.4472136 
##         ensur          firm      magnific         flame          star        welcom           aid         enjoy        reborn    presidenti          send          bold          next        corner 
##     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136 
##        spoken       instead          cold          forg         dedic          wait         battl          easi        unfold       fascism          wage         remak       persist        contin 
##     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136 
##        temper          gave        replac          plan         claim          lose         petti   distinguish         guest        vulner        exampl          weak         pursu         dream 
##     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4472136     0.4459799 
##           ask        confid           may       charact      tomorrow       loyalti         forev          hour    throughout        better           old         taken         cross       reflect 
##     0.4428074     0.4383570     0.4369902     0.4364358     0.4303315     0.4303315     0.4303315     0.4296689     0.4296689     0.4289960     0.4276995     0.4260064     0.4216370     0.4216370 
##      democrat        lesson      stronger       ancient          seiz         brave 
##     0.4216370     0.4216370     0.4216370     0.4216370     0.4216370     0.4216370 
##  [ reached getOption("max.print") -- omitted 1289 entries ]
## 
## $health
##        shape      generat        wrong       common     knowledg       planet         task       demand          eye        defin         forc       danger        child        choos         fear 
##    0.9045340    0.8971180    0.8944272    0.8888889    0.8888889    0.8819171    0.8728716    0.8666667    0.8660254    0.8642416    0.8641586    0.8432740    0.8432740    0.8432740    0.8432740 
##       extend         true      without         long       advanc       servic      commerc        vital        power       deserv         less        everi         busi        endur       spirit 
##    0.8432740    0.8432740    0.8432740    0.8357109    0.8340577    0.8333333    0.8333333    0.8333333    0.8326837    0.8320503    0.8307472    0.8255091    0.8249579    0.8219949    0.8206099 
##       reform       school        ambit          bad        brave        humbl          can         face          law       travel          set         gift     interest         just        storm 
##    0.8198916    0.8164966    0.8164966    0.8164966    0.8164966    0.8164966    0.8097763    0.8090398    0.8084521    0.8040303    0.8040303    0.8017837    0.7888106    0.7875615    0.7856742 
##          end          see         play       measur        forth      respons         give         care      instead      fascism         wage        remak        break       temper        globe 
##    0.7795794    0.7782896    0.7777778    0.7777778    0.7777778    0.7773318    0.7715167    0.7698004    0.7698004    0.7698004    0.7698004    0.7698004    0.7698004    0.7698004    0.7698004 
##         role         plan       colleg         lose       narrow        petti         rage         weak        recal       surviv         evid         life         also         even        build 
##    0.7698004    0.7698004    0.7698004    0.7698004    0.7698004    0.7698004    0.7698004    0.7698004    0.7698004    0.7698004    0.7698004    0.7687422    0.7647191    0.7647191    0.7619048 
##         know         find    communism   understood       afford        choic       failur         serv         race      possess     timeless       winter       imagin     reaffirm        settl 
##    0.7619048    0.7559289    0.7559289    0.7559289    0.7559289    0.7537784    0.7453560    0.7453560    0.7453560    0.7453560    0.7453560    0.7453560    0.7453560    0.7453560    0.7453560 
##      qualiti       courag         last      america       nation         hour          ill         hatr       someth        toler         real          way          may        still        faith 
##    0.7453560    0.7442084    0.7427814    0.7420462    0.7404170    0.7396003    0.7378648    0.7378648    0.7378648    0.7378648    0.7378648    0.7377372    0.7377372    0.7330167    0.7325897 
##         oath        ideal         duti       confid        carri       purpos        futur      prosper         born       threat       gather        enemi        trust         meet         valu 
##    0.7324670    0.7302967    0.7288109    0.7276069    0.7222222    0.7184212    0.7139329    0.7139329    0.7126966    0.7126966    0.7126966    0.7126966    0.7126966    0.7123956    0.7123956 
##        crisi         work         must          era        quiet     question         join         road        depth       weaken      univers     profound      foundat          saw         sacr 
##    0.7106691    0.7087836    0.7074720    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068 
##         toil         asid    conscienc      convict     scriptur      smaller         roll         fuel    everywher        stain       legaci         farm        grace     grandest         wind 
##    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068 
##      serious         soil      respect        cynic        favor     stranger       search      subject        swift         mark      distant      violenc        woman        maker        emerg 
##    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068 
##       prefer        grudg      dissent       defeat      darkest       prepar         judg       consid         flow      inhabit     document      network          sap         lash      faction 
##    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068 
##      unmatch       surest       reject       precis    uncertain       answer       defens        blood      journey       promis          war        today         guid        truth        birth 
##    0.7071068    0.7071068    0.7071068    0.7071068    0.7071068    0.7035265    0.7035265    0.7035265    0.6993997    0.6985355    0.6963106    0.6889986    0.6885304    0.6885304    0.6885304 
##        pledg     determin        shall         come        small       remain          met     stronger        often        water        habit         seiz  citizenship         read     precious 
##    0.6882472    0.6882472    0.6882472    0.6869464    0.6808829    0.6808829    0.6804138    0.6804138    0.6804138    0.6804138    0.6804138    0.6804138    0.6804138    0.6804138    0.6804138 
##          run      abandon        sourc         test           us         well      charact     challeng         time       rather       vision         grow         mind       depend       effort 
##    0.6804138    0.6804138    0.6804138    0.6804138    0.6793359    0.6767530    0.6761234    0.6751356    0.6739176    0.6735753    0.6735753    0.6735753    0.6708204    0.6708204    0.6666667 
##      chariti         mall        ultim         warm        retir      alreadi         poor        doubt         path       parent         lost    difficult         sure         bind        sight 
##    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667 
##     ancestor   forty-four       amidst        cloud       simpli        midst    far-reach      consequ        greed    irrespons         shed        indic         data      statist          nag 
##    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667 
##        lower       easili         span     grievanc         fals     recrimin     worn-out        dogma      strangl     childish         nobl    god-given     shortcut  faint-heart       leisur 
##    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667 
##      pleasur     risk-tak         doer   things'som       obscur          rug         pack    sweatshop         whip         plow       fought      concord   gettysburg     normandi          khe 
##    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667 
##         sahn       sacrif         till   undiminish          pat      unpleas         pick         dust          lay       electr         grid        digit        wield          sun    transform 
##    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667 
##        scale      suggest       necess        shift      beneath        stale     argument       consum        appli         wise       expand         spin        gross         abil         rout 
##    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667 
##        peril        scarc        draft      charter       expedi       villag         tank       sturdi       entitl        pleas         eman    restraint       keeper         iraq    hard-earn 
##    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667 
##  afghanistan       former          foe     tireless       lessen      specter       apolog        induc    slaughter        innoc      outlast    patchwork    christian       muslim          jew 
##    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667 
##       hindus   non-believ      languag         tast       bitter        swill       segreg      someday        tribe      dissolv        usher          sow        blame        cling      corrupt 
##    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667 
##       deceit       silenc     unclench     alongsid        clean      nourish        starv       plenti     indiffer       outsid       regard     gratitud      far-off       desert      whisper 
##    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667 
##     guardian       embodi         leve     selfless    firefight     stairway        smoke       nurtur   instrument      honesti       curios         glad      satisfi        sixti      restaur 
##    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667 
##      remembr      coldest         band        huddl      campfir          ici       outcom        virtu        alarm      current       falter          fix        deliv     children       celebr 
##    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6666667    0.6585528    0.6575959 
##      citizen        stand          now      greater       fellow          let     american       father          new        women        polit          big         said    necessari        offic 
##    0.6573689    0.6543303    0.6500112    0.6481812    0.6471502    0.6405979    0.6391355    0.6382847    0.6353298    0.6350529    0.6324555    0.6324555    0.6299408    0.6299408    0.6299408 
##         feed         hard       embrac        bridg         pass         alon         home         noth       longer         call       across         seek        world          yet   understand 
##    0.6299408    0.6299408    0.6299408    0.6299408    0.6288281    0.6285394    0.6255432    0.6236096    0.6236096    0.6210344    0.6181151    0.6172134    0.6143020    0.6127634    0.6117753 
##       toward        peopl        light         fail     proclaim    communiti       public         lead        bless         live       cooper       achiev         full     conflict        grate 
##    0.6117753    0.6056471    0.6030227    0.6030227    0.6030227    0.6027963    0.6000000    0.6000000    0.5975054    0.5975054    0.5962848    0.5962848    0.5962848    0.5962848    0.5962848 
##      willing      possibl          har        skill       scienc        drawn         will         upon         make         move   strengthen          far          age       decent       resolv 
##    0.5962848    0.5962848    0.5962848    0.5962848    0.5962848    0.5962848    0.5956353    0.5952291    0.5923489    0.5896920    0.5892557    0.5892557    0.5892557    0.5892557    0.5883484 
##          men       person       declar         rule        found         cost      collect      societi        manag       inevit       declin        might         firm         knew          cut 
##    0.5879447    0.5833333    0.5819144    0.5819144    0.5802589    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503 
##   presidenti         bold         next       domest         forg         base      resourc      whether       unfold      chapter       uphold       global         cure      persist       contin 
##    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503 
##         gave      shutter       replac        avoid        claim  distinguish        guest       vulner       exampl        pursu     hardship          job      forward       famili        alway 
##    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5773503    0.5739640    0.5726371    0.5726371    0.5726371 
##        civil      liberti      economi        chanc         caus         peac     principl        anoth        creed         hope        happi         show          die        forev       market 
##    0.5714286    0.5669596    0.5659165    0.5659165    0.5656854    0.5654593    0.5634362    0.5570860    0.5570860    0.5561985    0.5560384    0.5555556    0.5555556    0.5555556    0.5555556 
##      founder      mission        equal        labor   throughout      revolut       leader        chang         word       moment         year          god          act         bush        month 
##    0.5555556    0.5555556    0.5548265    0.5547002    0.5547002    0.5547002    0.5547002    0.5533986    0.5530409    0.5527708    0.5512459    0.5496566    0.5452753    0.5443311    0.5443311 
##         line        short        whose        reach       mutual        relat         west      ancient        sworn        built        oblig        stake       belief         came        honor 
##    0.5443311    0.5443311    0.5443311    0.5443311    0.5443311    0.5443311    0.5443311    0.5443311    0.5443311    0.5443311    0.5443311    0.5443311    0.5443311    0.5388159    0.5388159 
##      histori         unit         mani         rais      success       though         need        secur       justic        order       worker        learn         sake         reli       differ 
##    0.5362664    0.5360202    0.5353034    0.5353034    0.5345225    0.5345225    0.5333333    0.5318160    0.5298129    0.5270463    0.5270463    0.5270463    0.5270463    0.5270463    0.5238095 
##     sacrific       requir         fair          old         take          ask       accept        watch         ever    democraci         land      countri        never       cultur     opportun 
##    0.5229764    0.5217492    0.5163978    0.5153471    0.5144958    0.5144958    0.5091751    0.5091751    0.5091751    0.5091751    0.5087056    0.5080687    0.5050763    0.5039526    0.5039526 
##       fortun         dark         wave       reward         anew          edg         rise         feel      destini       immigr         idea      patriot    technolog        began         lift 
##    0.5039526    0.5039526    0.5039526    0.5039526    0.5039526    0.5039526    0.5039526    0.5039526    0.5039526    0.5039526    0.5036554    0.5025189    0.5025189    0.5025189    0.5025189 
##       chosen        clear       solemn      transit         deni        bound        singl         size        shore       invent        given        liber       safeti        march         hold 
##    0.5025189    0.5025189    0.5000000    0.5000000    0.5000000    0.5000000    0.5000000    0.5000000    0.5000000    0.5000000    0.5000000    0.5000000    0.5000000    0.5000000    0.5000000 
##        along      freedom      digniti        right       defend       togeth        earth         free       author          say         deep        debat        creat       better        uniti 
##    0.5000000    0.4975679    0.4969040    0.4950990    0.4931970    0.4930493    0.4815434    0.4815144    0.4811252    0.4811252    0.4811252    0.4811252    0.4789475    0.4747127    0.4714045 
##         week         bear       patrol     recognit        decid       commit      forbear        humil      dignifi        river    arlington      earlier          lie         fame       bestow 
##    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045 
##          raw    enterpris         tide       hungri     prejudic        waver       missil         snow       ground       broken       reveal          sum          car      account      prudent 
##    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045 
##      discord         soon         fist       effect     reinvent        plagu     competit       urgent       harder       devast      fractur        crise        engin        sleep       expect 
##    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045 
##         dawn     privileg    yesterday       abroad      environ       shrink         lest    diplomaci       whenev       recogn       ennobl        weari          joy      slaveri        minor 
##    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045 
##    limitless        touch       taught      reclaim      pretens       crippl      succumb       region        broad         girl        natur     particip      patienc        delay         imag 
##    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045 
##   background      medicar       mistak       prison    substitut     hopeless      respond         lend       privat        basic        etern       durabl       victim       dollar     flourish 
##    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045    0.4714045 
##       fallen        refus      horizon          day        great        speak      sustain         voic     congress        bring        young     strength     greatest      tyranni        shown 
##    0.4714045    0.4714045    0.4714045    0.4685095    0.4673650    0.4666667    0.4622502    0.4622502    0.4576043    0.4574957    0.4573296    0.4536092    0.4490502    0.4490502    0.4472136 
##        divis     independ      inaugur          led      allianc      pursuit      depress        quest         vote       bright       decenc     generous     standard        treat        spare 
##    0.4472136    0.4472136    0.4472136    0.4472136    0.4472136    0.4472136    0.4472136    0.4472136    0.4472136    0.4472136    0.4472136    0.4472136    0.4472136    0.4472136    0.4472136 
##         risk       attack         done        cours      centuri       suffer       energi          tri         fill      succeed        middl       thrive      soldier          yes         good 
##    0.4472136    0.4472136    0.4454354    0.4454354    0.4444738    0.4444444    0.4444444    0.4444444    0.4444444    0.4444444    0.4444444    0.4444444    0.4444444    0.4409586    0.4385290 
##        other        human         mean         like          one        renew         keep       fulfil         soul       terror       restor        drift       poster         21st       affirm 
##    0.4383973    0.4369237    0.4368520    0.4339630    0.4335770    0.4313311    0.4303315    0.4303315    0.4264014    0.4216370    0.4216370    0.4216370    0.4216370    0.4216370    0.4216370 
##          wit     institut      million       rememb       wealth        thank        decis          els         fate    adversari     democrat       invest      deepest         king      certain 
##    0.4216370    0.4216370    0.4200840    0.4190790    0.4170288    0.4160251    0.4082483    0.4082483    0.4082483    0.4082483    0.4082483    0.4082483    0.4082483    0.4082483    0.4082483 
##       larger        ocean        becam     constant      clinton        pride      capitol       direct       follow       perman      triumph        class       presid       govern        place 
##    0.4082483    0.4082483    0.4082483    0.4082483    0.4082483    0.4082483    0.4020151    0.4020151    0.4020151    0.4020151    0.4020151    0.4020151    0.3965258    0.3943445    0.3931079 
##        state        taken         fire      complet    constitut         cast       street       around         sign      highest       intend     magnific        thoma        flame       welcom 
##    0.3931079    0.3928371    0.3928371    0.3922323    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002 
##        enjoy         bond        begun       corner       spoken      conduct         agre         cold        dedic       behalf       heaven        assum         walk      exercis        battl 
##    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002 
##         easi       forget         near       wisdom      realiti        ignor          use         help       enough       tradit       return         held        whole         sens      control 
##    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3849002    0.3798686    0.3785166    0.3779645    0.3779645    0.3779645    0.3779645    0.3779645    0.3779645 
##     destruct       divers         tell      continu         side         safe         earn      program      poverti        capit       friend       strong         best      protect        capac 
##    0.3779645    0.3779645    0.3779645    0.3713907    0.3698001    0.3698001    0.3698001    0.3698001    0.3698001    0.3698001    0.3682298    0.3611576    0.3600411    0.3585686    0.3563483 
##       border         seem       social       balanc        readi         alli        simpl        final         name        union        begin     threaten          arm      problem        heart 
##    0.3563483    0.3553345    0.3535534    0.3535534    0.3535534    0.3535534    0.3535534    0.3535534    0.3535534    0.3478328    0.3471704    0.3442652    0.3442652    0.3429972    0.3419184 
##         vice        decad         hear        teach    strongest       import        match       strive        spoke         bodi       wonder          rob     encourag       oldest  predecessor 
##    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333 
## half-centuri    steadfast       shadow      sunshin      unrival      inherit     stagnant        inequ         news       slowli         boat    broadcast    instantan    tobillion     communic 
##    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333 
##        mobil        magic   livelihood        shake          abl       compet     bankrupt         abid         erod       shaken      fearsom     restless       muster    construct       pillar 
##    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333 
##         envi     deadlock       season       massiv       wander        revit      intrigu       calcul       maneuv        posit        worri        sweat         pave        shout     advantag 
##    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333 
##     franklin    roosevelt   experiment        stabl      collaps       animos       engulf       intern         defi      persian         gulf      somalia    testament       rejoic     unmistak 
##    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333 
##     undertak    reconnect         torn         inde        reded       myriad       upward    disciplin      well-do        faint  mountaintop        guard         20th     prospect         18th 
##    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333 
##         19th      abolish           aw      turmoil       explod         onto        stage    mightiest        unriv        split         atom       explor       comput    microchip       deepen 
##    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333 
##      wellspr      african        circl        third        coast      conserv       inform      perfect      tragedi      exhilar 
##    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333    0.3333333 
##  [ reached getOption("max.print") -- omitted 1289 entries ]
## 
## $terror
##        potenti      adversari     commonplac         miracl         racial         bounti         martin          dream          polit       guarante           solv          grate           open 
##      0.9036961      0.9036961      0.8944272      0.8944272      0.8944272      0.8944272      0.8944272      0.8624394      0.8500000      0.8485281      0.8485281      0.8485281      0.8451543 
##          solut          whose         cultur       maintain           upon           educ           told        factori        product       industri          match           mall           land 
##      0.8432740      0.8391464      0.8366600      0.8280787      0.8253073      0.8215838      0.8215838      0.8164966      0.8062258      0.7980239      0.7905694      0.7905694      0.7874008 
##     strengthen            far          cross         realiz             go       children         answer        problem          decad        loyalti       strength            eye         street 
##      0.7826238      0.7826238      0.7745967      0.7745967      0.7719754      0.7667485      0.7627701      0.7592566      0.7378648      0.7378648      0.7315635      0.7302967      0.7302967 
##           grow        unleash            tie        highest       magnific          flame        shutter       greatest           will         enough         within         fortun     understood 
##      0.7302967      0.7302967      0.7302967      0.7302967      0.7302967      0.7302967      0.7302967      0.7302967      0.7202643      0.7181848      0.7171372      0.7171372      0.7171372 
##            yes          offic          bridg            let           full      administr         remind           kind        sustain          labor            end            big        special 
##      0.7171372      0.7171372      0.7171372      0.7149700      0.7071068      0.7071068      0.7071068      0.7071068      0.7016464      0.7016464      0.7006490      0.7000000      0.6902685 
##          chanc             us         govern           face            new       opportun           noth           born           gift           back          group          price       gracious 
##      0.6902685      0.6870845      0.6858647      0.6822423      0.6780677      0.6772962      0.6761234      0.6761234      0.6761234      0.6755280      0.6708204      0.6708204      0.6708204 
##         miseri           week         patrol       recognit            era         balanc          decid          readi        forbear          humil        dignifi          river      arlington 
##      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204 
##        earlier            lie           fame        convict     millennium         affair         center        smaller           hire           roll         behind           gang           fuel 
##      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204 
##      everywher          stain         legaci        airport           farm        benefit           armi          plain         nation          class           hero         wealth          pledg 
##      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6708204      0.6697505      0.6674238      0.6668859      0.6593805      0.6529286 
##          shall           come           live           last          small          assur           fate         welfar    citizenship           read       precious         fellow        centuri 
##      0.6529286      0.6516946      0.6478211      0.6459422      0.6459422      0.6454972      0.6454972      0.6454972      0.6454972      0.6454972      0.6454972      0.6431759      0.6397674 
##         around          never         promis          futur         achiev           give       american       hatfield         mondal          baker         moomaw         occurr         routin 
##      0.6390097      0.6388766      0.6381449      0.6374553      0.6363961      0.6343350      0.6326997      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##          uniqu         realli   every-4-year         normal        transit          degre        bulwark        afflict        proport        longest        distort          penal         thrift 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##          crush    fixed-incom           alik        shatter            idl         indign           deni         burden           kept           pile        mortgag      temporari        conveni 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##          trend        tremend         upheav         period  misunderstand          sever        bastion          tempt        complex       self-rul           elit       superior         someon 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##          equit        neglect        section           food           mine          teach     profession  industrialist       shopkeep          clerk          cabbi      truckdriv          breed 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##        healthi          vigor      discrimin        runaway          reviv           play      inventori          check        consent         intent           curb         demand       distinct 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##        smother         foster          stifl         energi         extent          avail        coincid       parallel     proportion      intervent         intrus         result    unnecessari 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##         excess           loom        creativ           gate        counter   entrepreneur      voluntari            art        address         makeup     countrymen        suffici         theori 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##      unequivoc         emphat      paraphras        winston       churchil       dissolut      strongest          ahead         propos          remov      roadblock        various          level 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##           inch           feet           mile       reawaken          giant        lighten          punit            eve             dr         joseph         warren   massachusett        despair 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##       exemplar         beacon       benefici    sovereignti           sale        surrend  misunderstood        misjudg        prevail         formid        practic           prey       thousand 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##         deepli          vista         shrine       monument  revolutionari         infant     nationhood          eloqu           pool         column        whoever        heroism        potomac 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##          shore          slope       cemeteri            row         marker          david           tini       fraction          spoke        belleau           wood         argonn          omaha 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##          beach        salerno        halfway      guadalcan         tarawa           pork           chop         chosin      reservoir          hundr           rice          paddi          jungl 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##         barber           shop           1917          franc        rainbow        western            tri      battalion          heavi      artilleri           bodi          diari        flyleaf 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##       therefor          cheer        treptow        perform           deed          sight           20th       prospect           18th           19th        abolish             aw        turmoil 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##         explod           onto          stage      mightiest          unriv          split           atom         explor         comput      microchip         deepen        wellspr        african 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##          circl          third          coast        conserv         inform        perfect        tragedi        exhilar      indispens        cleaner         destin           bend          safer 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##         record        flexibl       everyday        preemin           lock          divid           curs       contempt          cloak         religi        fanatic        torment         obsess 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##           hate         impuls           lurk        overcom         textur        godsend       approach         outlin       internet         mystic        provinc      physicist   encyclopedia 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
## schoolchildren      scientist          decod      blueprint         hostil           camp   dictatorship        surpass        bloodsh        resound         sought          prize          ignit 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##          spark            boy      classroom        librari        kitchen           tabl       laughter          shoot           sell         anymor        medicin       hardwork         chemic 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##         biolog           port          innov       grandpar  grandchildren        fortifi         majest         louder            din         regain    thirty-four        prophet         luther 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##      ceaseless         redeem         extrem   partisanship         deplor         repair         breach         cardin      bernardin           wast       acrimoni           wide          belov 
##      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555      0.6324555 
##         height         summit            job       individu           meet            one         spirit           ever            now           side     throughout          everi         worker 
##      0.6324555      0.6324555      0.6282809      0.6247580      0.6238503      0.6236252      0.6227992      0.6210590      0.6166548      0.6139406      0.6139406      0.6058305      0.6000000 
##            ill          parti           save         extend          order        present          front         spread           21st        destini         return        prosper           said 
##      0.6000000      0.6000000      0.6000000      0.6000000      0.6000000      0.6000000      0.6000000      0.6000000      0.6000000      0.5976143      0.5976143      0.5976143      0.5976143 
##           held           take           home         famili       confront          capac          cours       interest          build          today         across          power          stand 
##      0.5976143      0.5965588      0.5934424      0.5926378      0.5916080      0.5916080      0.5916080      0.5879747      0.5872801      0.5866013      0.5863955      0.5860943      0.5842374 
##          reach           away           even           long        million        economi          peopl           time            man          great         beyond        patriot       neighbor 
##      0.5809475      0.5803810      0.5803810      0.5766000      0.5756497      0.5752237      0.5745673      0.5731964      0.5730699      0.5727009      0.5720776      0.5720776      0.5720776 
##          shape           work          bless         cooper            pay          grant           slow         depend         worthi         reluct       conflict        inaugur            led 
##      0.5720776      0.5700877      0.5668434      0.5656854      0.5656854      0.5656854      0.5656854      0.5656854      0.5656854      0.5656854      0.5656854      0.5656854      0.5656854 
##        lincoln           hill        willing            har          quest         bright         decenc            can           less        greater          becom           join       prejudic 
##      0.5656854      0.5656854      0.5656854      0.5656854      0.5656854      0.5656854      0.5656854      0.5633623      0.5629400      0.5590170      0.5590170      0.5590170      0.5590170 
##          world        struggl            put        collect          manag      establish         inevit         declin          might           firm          impos         intend           citi 
##      0.5558806      0.5520524      0.5520524      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226 
##           issu     presidenti           next           forg           form          trade        chapter         forget          globe           role           near           plan         colleg 
##      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226 
##           lose         narrow          petti         wisdom        realiti         rather         vision            day         believ          crisi            may           just        america 
##      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5477226      0.5454824      0.5404193      0.5393599      0.5352015      0.5336761      0.5317196 
##          among         suffer       tomorrow         import         strive            ten           drug       knowledg          middl         thrive           much         provid           alon 
##      0.5275044      0.5270463      0.5270463      0.5270463      0.5270463      0.5270463      0.5270463      0.5270463      0.5270463      0.5270463      0.5262348      0.5262348      0.5217492 
##        preserv         togeth           bush          decis          month           line          short           fall         produc        reflect            get         mutual           west 
##      0.5198752      0.5169843      0.5163978      0.5163978      0.5163978      0.5163978      0.5163978      0.5163978      0.5163978      0.5163978      0.5163978      0.5163978      0.5163978 
##            win       stronger        certain          water          built          humbl          women          creat           look           mani     understand        success         longer 
##      0.5163978      0.5163978      0.5163978      0.5163978      0.5163978      0.5163978      0.5163978      0.5111657      0.5111657      0.5078334      0.5078334      0.5070926      0.5070926 
##         gather         effort            age         presid       ceremoni           heal         restor         memori          child          choos           fear           real            yet 
##      0.5070926      0.5059644      0.5031153      0.5015699      0.5000000      0.5000000      0.5000000      0.5000000      0.5000000      0.5000000      0.5000000      0.5000000      0.4982729 
##           life          share          divis         strong          earth          begin           well           keep           make            use         higher           sick           feed 
##      0.4972452      0.4961389      0.4949747      0.4949015      0.4949015      0.4940322      0.4938648      0.4898979      0.4870246      0.4868645      0.4780914      0.4780914      0.4780914 
##          whole           sens            edg         planet         system          faith         moment       transfer         action          began           lift            set          state 
##      0.4780914      0.4780914      0.4780914      0.4780914      0.4780914      0.4778095      0.4767313      0.4767313      0.4767313      0.4767313      0.4767313      0.4767313      0.4746445 
##          carri           size        chariti         histor         beauti          white          howev         invent            rob          retir        alreadi          close         parent 
##      0.4743416      0.4743416      0.4743416      0.4743416      0.4743416      0.4743416      0.4743416      0.4743416      0.4743416      0.4743416      0.4743416      0.4743416      0.4743416 
##           sure           bind        commerc          along            god       challeng          anoth           must           valu          first          happi           cost        instead 
##      0.4743416      0.4743416      0.4743416      0.4743416      0.4740455      0.4719399      0.4697762      0.4688974      0.4678877      0.4662524      0.4615663      0.4564355      0.4564355 
##        citizen            see            old            way          think            two           past         school         differ          taken        support         o'neil          occas 
##      0.4550849      0.4543695      0.4539797      0.4528628      0.4518481      0.4518481      0.4518481      0.4518481      0.4517540      0.4472136      0.4472136      0.4472136      0.4472136 
##        process           busi          worst         inflat          elder       unemploy           pace         borrow           bear        concern       boundari         ethnic         object 
##      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136 
##        barrier        bigotri           core         except         revers         reserv         genius          unwil        command         heroic         church       prioriti      compromis 
##      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136 
##         unborn        arsenal            fit       shoulder        abraham            add           paid           town         messag        written         utmost           aliv         remark 
##      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136 
##           tool         target        noblest         mighti            air          apart          await         bicker        connect         window         scourg          plagu        fractur 
##      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136 
##          sleep           dawn        environ        slaveri          minor      limitless          touch         taught        reclaim        pretens         crippl        succumb         region 
##      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136 
##          broad           girl          natur       particip        patienc         dollar       flourish         fallen          refus        horizon        forward          alway          speak 
##      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4472136      0.4444783      0.4444783      0.4427189 
##          heart         father            war           seek         deserv        revolut          capit           call       sacrific          young        whether          shown          limit 
##      0.4423259      0.4403855      0.4403855      0.4391550      0.4385290      0.4385290      0.4385290      0.4341216      0.4341216      0.4338609      0.4260064      0.4242641      0.4242641 
##         failur           race        possess       timeless          wrong      forgotten           done         health           vice           stop         measur         common        succeed 
##      0.4242641      0.4242641      0.4242641      0.4242641      0.4242641      0.4242641      0.4225771      0.4216370      0.4216370      0.4216370      0.4216370      0.4216370      0.4216370 
##          forth        founder        mission         econom          endur        protect           mean         declar           year            say           left           fair         growth 
##      0.4216370      0.4216370      0.4216370      0.4159002      0.4159002      0.4157609      0.4144342      0.4140393      0.4128614      0.4107919      0.4082483      0.4082483      0.4082483 
##          place           seem        respons        deficit         danger        victori          fight          learn           true       progress            men         advanc          thank 
##      0.4068381      0.4045199      0.4022409      0.4000000      0.4000000      0.4000000      0.4000000      0.4000000      0.4000000      0.3985267      0.3984095      0.3956283      0.3946761 
##        histori         courag       question          feder          chief           high            aim          relat       mountain           move          ocean          becam       constant 
##      0.3931227      0.3922323      0.3913119      0.3903600      0.3872983      0.3872983      0.3872983      0.3872983      0.3872983      0.3872983      0.3872983      0.3872983      0.3872983 
##        journey           know            tax         follow         defens          choic        triumph         public          union         better           oath          defin       congress 
##      0.3841367      0.3839909      0.3837613      0.3813850      0.3813850      0.3813850      0.3813850      0.3794733      0.3771236      0.3752933      0.3741657      0.3726780      0.3721042 
##        everyon           hand           like         weapon           hope       reverend      constitut         almost         republ            goe           cast         whatev         capabl 
##      0.3721042      0.3708099      0.3705241      0.3690125      0.3670652      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484 
##           sign          grown         troubl         number           self           rest          ensur          aspir         negoti          thoma           star          enjoy           bond 
##      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484 
##        destroy           cold         heaven          assum         global           cure         contin           gave         replac          avoid          claim         exampl          pursu 
##      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484      0.3651484 
##       hardship          equal         celebr         toward           word           free           help     friendship           find           four           dark         reward        control 
##      0.3651484      0.3643993      0.3639127      0.3627381      0.3607042      0.3606353      0.3603750      0.3585686      0.3585686      0.3585686      0.3585686      0.3585686      0.3585686 
##          crime           rise           tell         immigr         reform         remain           hour           rich           need      democraci          watch           made          renew 
##      0.3585686      0.3585686      0.3585686      0.3585686      0.3535534      0.3523321      0.3508232      0.3508232      0.3478505      0.3450328      0.3450328      0.3429972      0.3409972 
##           came         rememb          trust         threat         border           forc          human         matter          quiet          simpl           road            act           good 
##      0.3407771      0.3407771      0.3380617      0.3380617      0.3380617      0.3375700      0.3367830      0.3354102      0.3354102      0.3354102      0.3354102      0.3347193      0.3328201 
##        countri          found           part           guid           care           unit         solemn         person          bound           hear          georg     washington         wonder 
##      0.3324112      0.3302891      0.3299832      0.3265986      0.3195048      0.3178209      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278 
##          march          forev       ancestor     forty-four         amidst          cloud         simpli          midst      far-reach        consequ          greed      irrespons           shed 
##      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278 
##          indic           data        statist            nag          lower         easili           span       grievanc           fals       recrimin       worn-out          dogma        strangl 
##      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278 
##       childish           nobl      god-given       shortcut    faint-heart         leisur        pleasur       risk-tak           doer     things'som         obscur            rug           pack 
##      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278 
##      sweatshop           whip           plow         fought        concord     gettysburg       normandi            khe           sahn         sacrif           till     undiminish            pat 
##      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278 
##        unpleas           pick           dust            lay         electr           grid          digit          wield            sun      transform          scale        suggest         necess 
##      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278 
##          shift        beneath          stale       argument         consum          appli           wise         expand           spin          gross           abil           rout 
##      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278      0.3162278 
##  [ reached getOption("max.print") -- omitted 1289 entries ]

Scaling document positions

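We still have a lot of development work to do on the textmodel() function, but here is a demonstration of unsupervised document scaling using the “wordfish” model. The dir = c(2, 1) argument sets the direction of the scale so that document 2 receives a lower estimated position than document 1: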

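# create a document-feature matrix from the Irish budget speeches
ieDfm <- dfm(data_corpus_irishbudget2010)
# fit the wordfish scaling model, orienting the scale with documents 2 and 1
textmodel(ieDfm, model = "wordfish", dir = c(2, 1))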
## Fitted wordfish model:
## Call:
##  textmodel_wordfish(x = x, dir = ..1)
## 
## Estimated document positions:
## 
##                                Documents      theta         SE       lower       upper
## 1        2010_BUDGET_01_Brian_Lenihan_FF  1.8268036 0.02020113  1.78720939  1.86639781
## 2       2010_BUDGET_02_Richard_Bruton_FG -0.5855453 0.02767752 -0.63979326 -0.53129738
## 3         2010_BUDGET_03_Joan_Burton_LAB -1.0696093 0.01556654 -1.10011977 -1.03909892
## 4        2010_BUDGET_04_Arthur_Morgan_SF -0.1141058 0.02791284 -0.16881502 -0.05939668
## 5          2010_BUDGET_05_Brian_Cowen_FF  1.7742535 0.02354795  1.72809951  1.82040749
## 6           2010_BUDGET_06_Enda_Kenny_FG -0.7016675 0.02613599 -0.75289405 -0.65044099
## 7      2010_BUDGET_07_Kieran_ODonnell_FG -0.4967371 0.04081043 -0.57672554 -0.41674864
## 8       2010_BUDGET_08_Eamon_Gilmore_LAB -0.5438955 0.02931964 -0.60136202 -0.48642903
## 9     2010_BUDGET_09_Michael_Higgins_LAB -1.0104557 0.03669246 -1.08237288 -0.93853843
## 10       2010_BUDGET_10_Ruairi_Quinn_LAB -0.9879423 0.03741468 -1.06127505 -0.91460950
## 11     2010_BUDGET_11_John_Gormley_Green  1.1913710 0.07324285  1.04781505  1.33492703
## 12       2010_BUDGET_12_Eamon_Ryan_Green  0.1500193 0.06254512  0.02743087  0.27260773
## 13     2010_BUDGET_13_Ciaran_Cuffe_Green  0.7196468 0.07340590  0.57577127  0.86352238
## 14 2010_BUDGET_14_Caoimhghin_OCaolain_SF -0.1521357 0.03644625 -0.22357033 -0.08070101
## 
## Estimated feature scores: showing first 30 beta-hats for features
## 
##            when               i       presented             the   supplementary          budget              to            this           house            last           april               , 
##     -0.15058982      0.34726053      0.36922895      0.21628040      1.09099461      0.05829019      0.33579912      0.26839752      0.14935164      0.25353066     -0.13851438      0.30630139 
##            said              we           could            work             our             way         through          period              of          severe        economic        distress 
##     -0.77779200      0.43921107     -0.59703182      0.54219871      0.70824736      0.29607747      0.62274642      0.51835687      0.30095010      1.25529502      0.44375418      1.83470247 
##               .           today             can          report            that notwithstanding 
##      0.23398491      0.14174074      0.32480755      0.65459489      0.04184343      1.83470247
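
The lower and upper columns in the output above appear to be theta plus or minus 1.96 standard errors, that is, an approximate 95% interval. A minimal base R sketch, using three rows copied from the table, reproduces them and sorts the speakers along the estimated dimension. The data frame `pos` and its column names are our own illustration and are not objects returned by quanteda:

```r
# Three rows copied from the wordfish output above; the data frame is illustrative only.
pos <- data.frame(
  doc   = c("Brian_Lenihan_FF", "Richard_Bruton_FG", "Joan_Burton_LAB"),
  theta = c(1.8268036, -0.5855453, -1.0696093),
  se    = c(0.02020113, 0.02767752, 0.01556654),
  stringsAsFactors = FALSE
)

# Assumption: a normal-approximation 95% interval, theta +/- 1.96 * SE.
pos$lower <- pos$theta - 1.96 * pos$se
pos$upper <- pos$theta + 1.96 * pos$se

# Order documents from the lowest to the highest estimated position.
pos <- pos[order(pos$theta), ]
print(pos, row.names = FALSE)
```

Sorting by theta gives a quick view of which speakers the model places close together. The scale itself has no built-in meaning, so what the dimension represents has to be interpreted from the documents and feature scores.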

Topic models

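quanteda also makes it easy to fit topic models: convert the dfm to the input format of a topic modelling package such as topicmodels, then fit the model there, e.g.: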

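# remove English stopwords and the very frequent word "will"
quantdfm <- dfm(data_corpus_irishbudget2010,
                remove = c("will", stopwords("english")))

if (require(topicmodels)) {
    # fit a 20-topic LDA model on the converted dfm
    myLDAfit20 <- LDA(convert(quantdfm, to = "topicmodels"), k = 20)
    # top 5 terms for each topic
    get_terms(myLDAfit20, 5)
    # 3 most likely topics for each document
    topics(myLDAfit20, 3)
}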
##      2010_BUDGET_01_Brian_Lenihan_FF 2010_BUDGET_02_Richard_Bruton_FG 2010_BUDGET_03_Joan_Burton_LAB 2010_BUDGET_04_Arthur_Morgan_SF 2010_BUDGET_05_Brian_Cowen_FF 2010_BUDGET_06_Enda_Kenny_FG
## [1,]                               9                                2                             14                              15                            17                           11
## [2,]                               4                               16                              8                              10                             6                            6
## [3,]                               6                                3                             12                               6                            12                           16
##      2010_BUDGET_07_Kieran_ODonnell_FG 2010_BUDGET_08_Eamon_Gilmore_LAB 2010_BUDGET_09_Michael_Higgins_LAB 2010_BUDGET_10_Ruairi_Quinn_LAB 2010_BUDGET_11_John_Gormley_Green
## [1,]                                18                               19                                  5                               6                                16
## [2,]                                16                                7                                  7                              14                                 4
## [3,]                                 6                                3                                  3                               8                                 9
##      2010_BUDGET_12_Eamon_Ryan_Green 2010_BUDGET_13_Ciaran_Cuffe_Green 2010_BUDGET_14_Caoimhghin_OCaolain_SF
## [1,]                               1                                 3                                    13
## [2,]                               3                                16                                    20
## [3,]                               8                                 1                                     3
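
The matrix above has one column per document and one row per rank, so row [1,] holds each document's most likely topic. As a rough sketch of how such a ranking can be derived, the base R code below picks the top two topics per document from a small matrix of topic proportions. The matrix `gamma`, its values, and the document names are hypothetical and are not taken from the fitted model:

```r
# Hypothetical topic proportions: 3 documents (rows) x 4 topics (columns).
gamma <- matrix(
  c(0.10, 0.60, 0.25, 0.05,
    0.50, 0.05, 0.15, 0.30,
    0.05, 0.20, 0.15, 0.60),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc_a", "doc_b", "doc_c"), NULL)
)

# For each document, rank topics by proportion and keep the indices of the top two.
# apply() over rows returns one column per document, matching the layout above.
top_topics <- apply(gamma, 1, function(p) order(p, decreasing = TRUE)[1:2])
print(top_topics)
```

Reading down a column gives that document's topics in decreasing order of estimated weight. The topic numbers are arbitrary labels, so they only become meaningful alongside the top terms for each topic.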