In this vignette we show how the quanteda package can be used to replicate the text analysis part (Chapter 5.1) from Kosuke Imai’s book Quantitative Social Science: An Introduction (Princeton: Princeton University Press, 2017).

Download the Corpus

To get the textual data, you need to install and load the qss package first that comes with the book.

devtools::install_github("kosukeimai/qss-package", build_vignettes = TRUE)

Section 5.1.1: The Disputed Authorship of ‘The Federalist Papers’

First, we use the readtext package to import the Federalist Papers as a data frame and create a quanteda corpus.

library("readtext")
# use readtext package to import all documents as a dataframe
corpus_texts <- readtext(system.file("extdata/federalist/", package = "qss"))

# create docvar with number of paper
corpus_texts$paper_number <- paste("No.", seq_len(nrow(corpus_texts)), sep = " ")

# transform to a quanteda corpus object
corpus_raw <- corpus(corpus_texts, text_field = "text", docid_field = "paper_number")

# create docvar with authorship (used in Section  5.1.4)
docvars(corpus_raw, "paper_numeric") <- seq_len(ndoc(corpus_raw))

# create docvar with authorship (used in Section  5.1.4)
docvars(corpus_raw, "author") <- factor(NA, levels = c("madison", "hamilton"))
docvars(corpus_raw, "author")[c(1, 6:9, 11:13, 15:17, 21:36, 59:61, 65:85)] <- "hamilton"
docvars(corpus_raw, "author")[c(10, 14, 37:48, 58)] <- "madison"
## AMONG the numerous advantages promised by a well-constructed Union, none 
##         deserves to be more accurately developed than its tendency to break and 
##         control the violence of faction. The friend of popular governments never 
## 

Section 5.1.2: Document-Term Matrix

Next, we transform the corpus to a document-feature matrix. dfm_prep (used in sections 5.1.4 and 5.1.5) is a dfm in which numbers and punctuation have been removed, and in which terms have been converted to lowercase. In dfm_papers, the words have also been stemmed and a standard set of stopwords removed.

## Creating a dfm from a corpus input...
##    ... lowercasing
##    ... found 85 documents, 8,630 features
##    ... created a 85 x 8,630 sparse dfm
##    ... complete. 
## Elapsed time: 0.264 seconds.
## Document-feature matrix of: 85 documents, 4,859 features (89.1% sparse).
## Document-feature matrix of: 6 documents, 8 features (97.9% sparse).
## 6 x 8 sparse Matrix of class "dfm"
##        features
## docs    1st 2d 3d 4th 5th abandon abat abb
##   No. 1   0  0  0   0   0       0    0   0
##   No. 2   0  0  0   0   0       0    0   0
##   No. 3   0  0  0   0   0       0    0   0
##   No. 4   0  0  0   0   0       0    0   0
##   No. 5   1  0  0   0   0       0    0   0
##   No. 6   0  0  0   0   0       0    0   0

The tm package considers features such as “1st” to be numbers, whereas quanteda does not. We can remove these easily using a wildcard removal:

dfm_papers <- dfm_remove(dfm_papers, "[0-9]", valuetype = "regex", verbose = TRUE)
## removed 5 features
## Document-feature matrix of: 6 documents, 8 features (91.7% sparse).
## 6 x 8 sparse Matrix of class "dfm"
##        features
## docs    abandon abat abb abet abhorr abil abject abl
##   No. 1       0    0   0    0      0    0      0   1
##   No. 2       0    0   0    0      0    1      0   0
##   No. 3       0    0   0    0      0    0      0   2
##   No. 4       0    0   0    0      0    0      0   1
##   No. 5       0    0   0    0      0    0      0   0
##   No. 6       0    0   0    0      0    0      0   0

Section 5.1.3: Topic Discovery

We can use the textplot_wordcloud() function to plot word clouds of the most frequent words in Papers 12 and 24.

Since quanteda cannot do stem completion, we will skip that part.

Next, we identify clusters of similar essay based on term frequency-inverse document frequency (tf-idf) and apply the \(k\)-means algorithm to the weighted dfm.

##     revenu contraband     patrol      excis      coast      trade 
##   19.42088   19.22817   19.22817   19.12214   16.22817   15.01500 
##        per        tax       cent     gallon 
##   14.47329   13.20080   12.81878   12.81878
##   garrison  dock-yard settlement      spain       armi   frontier 
##  24.524777  19.228173  16.228173  13.637564  12.770999  12.262389 
##    arsenal    western       post     nearer 
##  10.818782  10.806108  10.228173   9.648857

We can match the clustering as follows:

## [1] 3
## CLUSTER 1 
## Top 10 words:
##     court     appel jurisdict    suprem      juri    tribun    cogniz 
##  69.68857  35.27513  25.46591  24.79126  22.16104  21.27125  19.12214 
##  inferior    appeal re-examin 
##  18.76875  16.21098  13.52348 
## 
## Federalist Papers classified: 
## [1] "No. 81" "No. 82"
## 
## CLUSTER 2 
## Top 10 words:
##       juri      trial      court     crimin  admiralti     equiti 
##  218.20102   84.74567   62.47940   42.06871   40.87463   38.24428 
##   chanceri common-law     probat      civil 
##   37.86574   27.04695   27.04695   26.77843 
## 
## Federalist Papers classified: 
## [1] "No. 83"
## 
## CLUSTER 3 
## Top 10 words:
##    senat     upon   presid     armi    court    claus  militia    offic 
## 3.912991 3.640413 3.100660 2.919085 2.776862 2.636133 2.428567 2.374450 
## governor  appoint 
## 2.294413 2.125990 
## 
## Federalist Papers classified: 
##  [1] "No. 1"  "No. 6"  "No. 7"  "No. 8"  "No. 9"  "No. 13" "No. 15"
##  [8] "No. 16" "No. 17" "No. 21" "No. 22" "No. 23" "No. 24" "No. 25"
## [15] "No. 26" "No. 27" "No. 28" "No. 29" "No. 30" "No. 31" "No. 32"
## [22] "No. 33" "No. 34" "No. 36" "No. 59" "No. 60" "No. 61" "No. 65"
## [29] "No. 66" "No. 67" "No. 68" "No. 69" "No. 70" "No. 71" "No. 72"
## [36] "No. 73" "No. 74" "No. 75" "No. 76" "No. 77" "No. 78" "No. 79"
## [43] "No. 80" "No. 84" "No. 85"
## 
## CLUSTER 4 
## Top 10 words:
##      trade   merchant manufactur     market    commerc      navig 
##  18.351669  18.010180  14.037686  13.637564  11.214252  10.816517 
##     revenu   landhold       navi       ship 
##   9.416185   8.545855   8.233234   8.040714 
## 
## Federalist Papers classified: 
## [1] "No. 11" "No. 12" "No. 35"

Section 5.1.4: Authorship Prediction

In a next step, we want to predict authorship for the Federalist Papers whose authorship is unknown. As the topics of the Papers differs remarkably, Imai focuses on 10 articles, prepositions and conjunctions to predicte authorship.

## Warning: 'as.data.frame.dfm' is deprecated.
## Use 'convert(x, to = "data.frame")' instead.
## See help("Deprecated")
## 
## Call:
## lm(formula = author_numeric ~ upon + there + consequently + whilst, 
##     data = author_data_known)
## 
## Coefficients:
##  (Intercept)          upon         there  consequently        whilst  
##      -0.2729        0.2240        0.1278       -0.5596       -0.8378
## [1] 0.7175527

Section 5.1.5: Cross-Validation

Finally, we assess how well the model fits the data by classifying each essay based on its fitted value.

## [1] 1
## [1] 1
## [1] 1
## [1] 1
## Warning: 'as.data.frame.dfm' is deprecated.
## Use 'convert(x, to = "data.frame")' instead.
## See help("Deprecated")
##  [1]  0.73682138 -0.13810334 -1.35072048 -0.03823549 -0.27285552
##  [6]  0.71537231  1.33178340  0.19414036  0.37516964 -0.20369832
## [11]  0.67584796  0.99478176  1.39348825 -1.18908379  1.20446180
## [16]  0.63999272  0.91390560 -0.38341254 -0.62471327 -0.12453888
## [21]  0.91552544  1.08097849  0.88528696  1.01192444  0.08226598
## [26]  0.72338300  1.08177034  0.71354396  1.47718825  1.53672871
## [31]  1.85572895  0.71920518  1.07664097  1.26163462  0.90800379
## [36]  1.06339810 -0.81826934 -0.56126822 -0.27285552 -0.10309170
## [41] -0.47343432 -0.17492007 -0.60654100 -0.86944061 -1.67765248
## [46] -0.91487748 -0.13268399 -0.13500254 -0.96784742 -0.06907196
## [51] -1.46790700 -0.27285552 -0.54229524 -0.54529610  0.04081584
## [56] -0.56418213 -1.18177840 -0.41902525  0.55200092  0.98676075
## [61]  0.59237415 -0.97795312 -0.21203157 -0.18295716  1.15640459
## [66]  1.12403237  0.78348571  0.11197467  1.12824159  0.70981781
## [71]  0.34751104  0.84257351  1.24182206  0.91612261  0.50276872
## [76]  1.05480578  1.05926761  0.90585578  0.54292995  0.64035410
## [81]  0.91773469  0.31189155  0.80869221  0.73649069  1.00895939

Finally, we plot the fitted values for each Federalist paper with the ggplot2 package.