Installing the package

Since quanteda is available on CRAN, you can install it using your GUI’s R package installer, or by executing:

install.packages("quanteda") 

See the instructions at https://github.com/quanteda/quanteda to install the GitHub version.
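
For example, the development version can usually be installed with the devtools package (a sketch; see the repository README for the current instructions):

devtools::install_github("quanteda/quanteda")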

Creating a Corpus

You load the package to access its functions and data.

library(quanteda)

Currently available corpus sources

quanteda has a simple and powerful companion package for loading texts: readtext. The main function in this package, readtext(), takes a file or fileset from disk or a URL, and returns a type of data.frame that can be used directly with the corpus() constructor function, to create a quanteda corpus object.

readtext() works on:

  • text (.txt) files;
  • comma-separated-value (.csv) files;
  • XML formatted data;
  • data from the Facebook API, in JSON format;
  • data from the Twitter API, in JSON format; and
  • generic JSON data.

The corpus constructor command corpus() works directly on:

  • a vector of character objects, for instance one that you have already loaded into the workspace using other tools;
  • a VCorpus corpus object from the tm package; and
  • a data.frame containing a text column and any other document-level metadata (a sketch of this case follows below).
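
As a sketch of the data.frame case (the data frame and column names here are invented; text_field tells corpus() which column contains the texts):

my_df <- data.frame(text = c("An example text.", "A second example."),
                    party = c("A", "B"),
                    stringsAsFactors = FALSE)
corpus(my_df, text_field = "text")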

Building a corpus from a character vector

The simplest case is to create a corpus from a vector of texts already in memory in R. This gives the advanced R user complete flexibility in the choice of text inputs, as there are almost endless ways to get a vector of texts into R.

If we already have the texts in this form, we can call the corpus constructor function directly. We can demonstrate this on the built-in character object of the texts about immigration policy extracted from the 2010 election manifestos of the UK political parties (called data_char_ukimmig2010).

my_corpus <- corpus(data_char_ukimmig2010)  # build a new corpus from the texts
summary(my_corpus)
## Corpus consisting of 9 documents, showing 9 documents:
## 
##          Text Types Tokens Sentences
##           BNP  1125   3280        88
##     Coalition   142    260         4
##  Conservative   251    499        15
##        Greens   322    679        21
##        Labour   298    683        29
##        LibDem   251    483        14
##            PC    77    114         5
##           SNP    88    134         4
##          UKIP   346    723        27

If we wanted, we could add some document-level variables – what quanteda calls docvars – to this corpus.

We can do this by using R’s names() function to get the names of the character vector data_char_ukimmig2010, and assigning these to a document variable (docvar).

docvars(my_corpus, "Party") <- names(data_char_ukimmig2010)
docvars(my_corpus, "Year") <- 2010
summary(my_corpus)
## Corpus consisting of 9 documents, showing 9 documents:
## 
##          Text Types Tokens Sentences        Party Year
##           BNP  1125   3280        88          BNP 2010
##     Coalition   142    260         4    Coalition 2010
##  Conservative   251    499        15 Conservative 2010
##        Greens   322    679        21       Greens 2010
##        Labour   298    683        29       Labour 2010
##        LibDem   251    483        14       LibDem 2010
##            PC    77    114         5           PC 2010
##           SNP    88    134         4          SNP 2010
##          UKIP   346    723        27         UKIP 2010

If we wanted to tag each document with additional meta-data that is not a document variable of interest for analysis, but rather an attribute we need to record about each document, we could also add that to our corpus.

metadoc(my_corpus, "language") <- "english"
## Warning: metadoc is deprecated
metadoc(my_corpus, "docsource")  <- paste("data_char_ukimmig2010", 1:ndoc(my_corpus), sep = "_")
## Warning: metadoc is deprecated
summary(my_corpus, showmeta = TRUE)
## Corpus consisting of 9 documents, showing 9 documents:
## 
##          Text Types Tokens Sentences        Party Year _language
##           BNP  1125   3280        88          BNP 2010   english
##     Coalition   142    260         4    Coalition 2010   english
##  Conservative   251    499        15 Conservative 2010   english
##        Greens   322    679        21       Greens 2010   english
##        Labour   298    683        29       Labour 2010   english
##        LibDem   251    483        14       LibDem 2010   english
##            PC    77    114         5           PC 2010   english
##           SNP    88    134         4          SNP 2010   english
##          UKIP   346    723        27         UKIP 2010   english
##               _docsource
##  data_char_ukimmig2010_1
##  data_char_ukimmig2010_2
##  data_char_ukimmig2010_3
##  data_char_ukimmig2010_4
##  data_char_ukimmig2010_5
##  data_char_ukimmig2010_6
##  data_char_ukimmig2010_7
##  data_char_ukimmig2010_8
##  data_char_ukimmig2010_9

The last command, metadoc, allows you to define your own document meta-data fields (the warnings indicate that it is deprecated in newer versions of quanteda, although the calls still work here). Note that in assigning just the single value of "english", R has recycled the value until it matches the number of documents in the corpus. In creating a simple tag for our custom metadoc field docsource, we used the quanteda function ndoc() to retrieve the number of documents in our corpus. This function is deliberately designed to work in a way similar to functions you may already use in R, such as nrow() and ncol().
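
For example, a quick check on the corpus built above (which contains nine documents):

ndoc(my_corpus)
## [1] 9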

Loading in files using the readtext package

These examples read files from the authors’ local folders; to run them, adjust the paths to point to comparable files on your own system.

require(readtext)

# Twitter json
mytf1 <- readtext("~/Dropbox/QUANTESS/social media/zombies/tweets.json")
my_corpusTwitter <- corpus(mytf1)
summary(my_corpusTwitter, 5)
# generic json - needs a textfield specifier
mytf2 <- readtext("~/Dropbox/QUANTESS/Manuscripts/collocations/Corpora/sotu/sotu.json",
                  textfield = "text")
summary(corpus(mytf2), 5)
# text file
mytf3 <- readtext("~/Dropbox/QUANTESS/corpora/project_gutenberg/pg2701.txt", cache = FALSE)
summary(corpus(mytf3), 5)
# multiple text files
mytf4 <- readtext("~/Dropbox/QUANTESS/corpora/inaugural/*.txt", cache = FALSE)
summary(corpus(mytf4), 5)
# multiple text files with docvars from filenames
mytf5 <- readtext("~/Dropbox/QUANTESS/corpora/inaugural/*.txt", 
                  docvarsfrom = "filenames", sep = "-", docvarnames = c("Year", "President"))
summary(corpus(mytf5), 5)
# XML data
mytf6 <- readtext("~/Dropbox/QUANTESS/quanteda_working_files/xmlData/plant_catalog.xml", 
                  textfield = "COMMON")
summary(corpus(mytf6), 5)
# csv file
write.csv(data.frame(inaugSpeech = texts(data_corpus_inaugural), 
                     docvars(data_corpus_inaugural)),
          file = "/tmp/inaug_texts.csv", row.names = FALSE)
mytf7 <- readtext("/tmp/inaug_texts.csv", textfield = "inaugSpeech")
summary(corpus(mytf7), 5)

How a quanteda corpus works

Corpus principles

A corpus is designed to be a “library” of original documents that have been converted to plain, UTF-8 encoded text, and stored along with meta-data at the corpus level and at the document level. We have a special name for document-level meta-data: docvars. These are variables or features that describe attributes of each document.

A corpus is designed to be a more or less static container of texts with respect to processing and analysis. This means that the texts in a corpus are not designed to be changed internally through (for example) cleaning or pre-processing steps, such as stemming or removing punctuation. Rather, texts can be extracted from the corpus as part of processing and assigned to new objects, but the idea is that the corpus will remain as an original reference copy so that other analyses – for instance those in which stems and punctuation are required, such as computing a reading-ease index – can be performed on the same corpus.

To extract texts from a corpus, we use an extractor, called texts().

texts(data_corpus_inaugural)[2]
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              1793-Washington 
## "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor, and of the confidence which has been reposed in me by the people of united America.\n\nPrevious to the execution of any official act of the President the Constitution requires an oath of office. This oath I am now about to take, and in your presence: That if it shall be found during my administration of the Government I have in any instance violated willingly or knowingly the injunctions thereof, I may (besides incurring constitutional punishment) be subject to the upbraidings of all who are now witnesses of the present solemn ceremony.\n\n "

To summarize the texts from a corpus, we can call a summary() method defined for a corpus.

summary(data_corpus_irishbudget2010)
## Corpus consisting of 14 documents, showing 14 documents:
## 
##                       Text Types Tokens Sentences year debate number      foren
##        Lenihan, Brian (FF)  1953   8641       374 2010 BUDGET     01      Brian
##       Bruton, Richard (FG)  1040   4446       217 2010 BUDGET     02    Richard
##         Burton, Joan (LAB)  1624   6393       307 2010 BUDGET     03       Joan
##        Morgan, Arthur (SF)  1595   7107       343 2010 BUDGET     04     Arthur
##          Cowen, Brian (FF)  1629   6599       250 2010 BUDGET     05      Brian
##           Kenny, Enda (FG)  1148   4232       153 2010 BUDGET     06       Enda
##      ODonnell, Kieran (FG)   678   2297       133 2010 BUDGET     07     Kieran
##       Gilmore, Eamon (LAB)  1181   4177       201 2010 BUDGET     08      Eamon
##     Higgins, Michael (LAB)   488   1286        44 2010 BUDGET     09    Michael
##        Quinn, Ruairi (LAB)   439   1284        59 2010 BUDGET     10     Ruairi
##      Gormley, John (Green)   401   1030        49 2010 BUDGET     11       John
##        Ryan, Eamon (Green)   510   1643        90 2010 BUDGET     12      Eamon
##      Cuffe, Ciaran (Green)   442   1240        45 2010 BUDGET     13     Ciaran
##  OCaolain, Caoimhghin (SF)  1188   4044       176 2010 BUDGET     14 Caoimhghin
##      name party
##   Lenihan    FF
##    Bruton    FG
##    Burton   LAB
##    Morgan    SF
##     Cowen    FF
##     Kenny    FG
##  ODonnell    FG
##   Gilmore   LAB
##   Higgins   LAB
##     Quinn   LAB
##   Gormley Green
##      Ryan Green
##     Cuffe Green
##  OCaolain    SF

We can save the output from the summary command as a data frame, and plot some basic descriptive statistics with this information:

tokenInfo <- summary(data_corpus_inaugural)
if (require(ggplot2))
    ggplot(data = tokenInfo, aes(x = Year, y = Tokens, group = 1)) +
        geom_line() + geom_point() +
        scale_x_continuous(labels = seq(1789, 2017, 12), breaks = seq(1789, 2017, 12)) +
        theme_bw()
## Loading required package: ggplot2

# Longest inaugural address: William Henry Harrison
tokenInfo[which.max(tokenInfo$Tokens), ] 
##             Text Types Tokens Sentences Year President     FirstName
## 14 1841-Harrison  1896   9144       210 1841  Harrison William Henry

Tools for handling corpus objects

Adding two corpus objects together

The + operator provides a simple method for concatenating two corpus objects. If they contain different sets of document-level variables, these will be stitched together so that no information is lost. Corpus-level metadata is also concatenated.

library(quanteda)
my_corpus1 <- corpus(data_corpus_inaugural[1:5])
my_corpus2 <- corpus(data_corpus_inaugural[53:58])
my_corpus3 <- my_corpus1 + my_corpus2
summary(my_corpus3)
## Corpus consisting of 11 documents, showing 11 documents:
## 
##    Text Types Tokens Sentences Year  President FirstName
##   text1   625   1538        23 1789 Washington    George
##   text2    96    147         4 1793 Washington    George
##   text3   826   2578        37 1797      Adams      John
##   text4   717   1927        41 1801  Jefferson    Thomas
##   text5   804   2381        45 1805  Jefferson    Thomas
##   text6   773   2449       111 1997    Clinton      Bill
##   text7   621   1808        97 2001       Bush George W.
##   text8   773   2319       100 2005       Bush George W.
##   text9   938   2711       110 2009      Obama    Barack
##  text10   814   2317        88 2013      Obama    Barack
##  text11   582   1660        88 2017      Trump Donald J.
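
The corpora above share the same docvars. To see how differing sets are stitched together, here is a minimal sketch (the documents and variables are invented):

corp_a <- corpus(c(d1 = "First text."), docvars = data.frame(source = "A"))
corp_b <- corpus(c(d2 = "Second text."), docvars = data.frame(topic = "x"))
docvars(corp_a + corp_b)  # both 'source' and 'topic' are kept; unmatched cells are NA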

Subsetting corpus objects

A method of the corpus_subset() function is defined for corpus objects, with which a new corpus can be extracted based on logical conditions applied to docvars:

summary(corpus_subset(data_corpus_inaugural, Year > 1990))
## Corpus consisting of 7 documents, showing 7 documents:
## 
##          Text Types Tokens Sentences Year President FirstName
##  1993-Clinton   642   1833        81 1993   Clinton      Bill
##  1997-Clinton   773   2449       111 1997   Clinton      Bill
##     2001-Bush   621   1808        97 2001      Bush George W.
##     2005-Bush   773   2319       100 2005      Bush George W.
##    2009-Obama   938   2711       110 2009     Obama    Barack
##    2013-Obama   814   2317        88 2013     Obama    Barack
##    2017-Trump   582   1660        88 2017     Trump Donald J.
summary(corpus_subset(data_corpus_inaugural, President == "Adams"))
## Corpus consisting of 2 documents, showing 2 documents:
## 
##        Text Types Tokens Sentences Year President   FirstName
##  1797-Adams   826   2578        37 1797     Adams        John
##  1825-Adams  1003   3152        74 1825     Adams John Quincy

Exploring corpus texts

The kwic function (keywords-in-context) performs a search for a word and allows us to view the contexts in which it occurs:

kwic(data_corpus_inaugural, "terror")
##                                                                     
##     [1797-Adams, 1325]              fraud or violence, by | terror |
##  [1933-Roosevelt, 112] nameless, unreasoning, unjustified | terror |
##  [1941-Roosevelt, 287]      seemed frozen by a fatalistic | terror |
##    [1961-Kennedy, 866]    alter that uncertain balance of | terror |
##     [1981-Reagan, 813]     freeing all Americans from the | terror |
##   [1997-Clinton, 1055]        They fuel the fanaticism of | terror |
##   [1997-Clinton, 1655]  maintain a strong defense against | terror |
##     [2009-Obama, 1632]     advance their aims by inducing | terror |
##                                   
##  , intrigue, or venality          
##  which paralyzes needed efforts to
##  , we proved that this            
##  that stays the hand of           
##  of runaway living costs.         
##  . And they torment the           
##  and destruction. Our children    
##  and slaughtering innocents, we
kwic(data_corpus_inaugural, "terror", valuetype = "regex")
##                                                                             
##     [1797-Adams, 1325]                   fraud or violence, by |  terror   |
##  [1933-Roosevelt, 112]      nameless, unreasoning, unjustified |  terror   |
##  [1941-Roosevelt, 287]           seemed frozen by a fatalistic |  terror   |
##    [1961-Kennedy, 866]         alter that uncertain balance of |  terror   |
##    [1961-Kennedy, 990]               of science instead of its |  terrors  |
##     [1981-Reagan, 813]          freeing all Americans from the |  terror   |
##    [1981-Reagan, 2196]        understood by those who practice | terrorism |
##   [1997-Clinton, 1055]             They fuel the fanaticism of |  terror   |
##   [1997-Clinton, 1655]       maintain a strong defense against |  terror   |
##     [2009-Obama, 1632]          advance their aims by inducing |  terror   |
##     [2017-Trump, 1117] civilized world against radical Islamic | terrorism |
##                                   
##  , intrigue, or venality          
##  which paralyzes needed efforts to
##  , we proved that this            
##  that stays the hand of           
##  . Together let us explore        
##  of runaway living costs.         
##  and prey upon their neighbors    
##  . And they torment the           
##  and destruction. Our children    
##  and slaughtering innocents, we   
##  , which we will eradicate
kwic(data_corpus_inaugural, "communist*")
##                                                                   
##   [1949-Truman, 834] the actions resulting from the | Communist  |
##  [1961-Kennedy, 519]             -- not because the | Communists |
##                            
##  philosophy are a threat to
##  may be doing it,

In the above summary, Year and President are variables associated with each document. We can access such variables with the docvars() function.

# inspect the document-level variables
head(docvars(data_corpus_inaugural))
##   Year  President FirstName
## 1 1789 Washington    George
## 2 1793 Washington    George
## 3 1797      Adams      John
## 4 1801  Jefferson    Thomas
## 5 1805  Jefferson    Thomas
## 6 1809    Madison     James
# inspect the corpus-level metadata
metacorpus(data_corpus_inaugural)
## $source
## [1] "Gerhard Peters and John T. Woolley. The American Presidency Project."
## 
## $notes
## [1] "http://www.presidency.ucsb.edu/inaugurals.php"

More corpora are available from the quanteda.corpora package.
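
quanteda.corpora is not on CRAN; a sketch of the usual GitHub installation (check that package’s README for the current instructions):

devtools::install_github("quanteda/quanteda.corpora")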

Extracting Features from a Corpus

In order to perform statistical analysis such as document scaling, we must extract a matrix associating values for certain features with each document. In quanteda, we use the dfm function to produce such a matrix. “dfm” is short for document-feature matrix, and always refers to documents in rows and “features” as columns. We fix this dimensional orientation because it is standard in data analysis to have a unit of analysis as a row, and features or variables pertaining to each unit as columns. We call them “features” rather than terms, because features are more general than terms: they can be defined as raw terms, stemmed terms, the parts of speech of terms, terms after stopwords have been removed, or a dictionary class to which a term belongs. Features can be entirely general, such as ngrams or syntactic dependencies, and we leave this open-ended.
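
As a small illustration that features need not be raw terms, here is a sketch of a dfm whose features are bigrams, formed with tokens_ngrams():

dfm(tokens_ngrams(tokens("the quick brown fox jumps"), n = 2))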

Tokenizing texts

To simply tokenize a text, quanteda provides a powerful command called tokens(). This produces an intermediate object, consisting of a list of tokens in the form of character vectors, where each element of the list corresponds to an input document.

tokens() is deliberately conservative, meaning that it does not remove anything from the text unless told to do so.

txt <- c(text1 = "This is $10 in 999 different ways,\n up and down; left and right!", 
         text2 = "@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokens(txt)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "$"         "10"        "in"        "999"      
##  [7] "different" "ways"      ","         "up"        "and"       "down"     
## [13] ";"         "left"      "and"       "right"     "!"        
## 
## text2 :
##  [1] "@kenbenoit"     "working"        ":"              "on"            
##  [5] "#quanteda"      "2day"           "4ever"          ","             
##  [9] "http"           ":"              "/"              "/"             
## [13] "textasdata.com" "?"              "page"           "="             
## [17] "123"            "."
tokens(txt, remove_numbers = TRUE,  remove_punct = TRUE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "in"        "different" "ways"      "up"       
##  [7] "and"       "down"      "left"      "and"       "right"    
## 
## text2 :
## [1] "@kenbenoit"     "working"        "on"             "#quanteda"     
## [5] "2day"           "4ever"          "http"           "textasdata.com"
## [9] "page"
tokens(txt, remove_numbers = FALSE, remove_punct = TRUE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "10"        "in"        "999"       "different"
##  [7] "ways"      "up"        "and"       "down"      "left"      "and"      
## [13] "right"    
## 
## text2 :
##  [1] "@kenbenoit"     "working"        "on"             "#quanteda"     
##  [5] "2day"           "4ever"          "http"           "textasdata.com"
##  [9] "page"           "123"
tokens(txt, remove_numbers = TRUE,  remove_punct = FALSE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "$"         "in"        "different" "ways"     
##  [7] ","         "up"        "and"       "down"      ";"         "left"     
## [13] "and"       "right"     "!"        
## 
## text2 :
##  [1] "@kenbenoit"     "working"        ":"              "on"            
##  [5] "#quanteda"      "2day"           "4ever"          ","             
##  [9] "http"           ":"              "/"              "/"             
## [13] "textasdata.com" "?"              "page"           "="             
## [17] "."
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      "is"        "$"         "10"        "in"        "999"      
##  [7] "different" "ways"      ","         "up"        "and"       "down"     
## [13] ";"         "left"      "and"       "right"     "!"        
## 
## text2 :
##  [1] "@kenbenoit"     "working"        ":"              "on"            
##  [5] "#quanteda"      "2day"           "4ever"          ","             
##  [9] "http"           ":"              "/"              "/"             
## [13] "textasdata.com" "?"              "page"           "="             
## [17] "123"            "."
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE, remove_separators = FALSE)
## tokens from 2 documents.
## text1 :
##  [1] "This"      " "         "is"        " "         "$"         "10"       
##  [7] " "         "in"        " "         "999"       " "         "different"
## [13] " "         "ways"      ","         "\n"        " "         "up"       
## [19] " "         "and"       " "         "down"      ";"         " "        
## [25] "left"      " "         "and"       " "         "right"     "!"        
## 
## text2 :
##  [1] "@kenbenoit"     " "              "working"        ":"             
##  [5] " "              "on"             " "              "#quanteda"     
##  [9] " "              "2day"           "\t"             "4ever"         
## [13] ","              " "              "http"           ":"             
## [17] "/"              "/"              "textasdata.com" "?"             
## [21] "page"           "="              "123"            "."

We also have the option to tokenize characters:

tokens("Great website: http://textasdata.com?page=123.", what = "character")
## tokens from 1 document.
## text1 :
##  [1] "G" "r" "e" "a" "t" "w" "e" "b" "s" "i" "t" "e" ":" "h" "t" "t" "p" ":" "/"
## [20] "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p" "a" "g"
## [39] "e" "=" "1" "2" "3" "."
tokens("Great website: http://textasdata.com?page=123.", what = "character", 
         remove_separators = FALSE)
## tokens from 1 document.
## text1 :
##  [1] "G" "r" "e" "a" "t" " " "w" "e" "b" "s" "i" "t" "e" ":" " " "h" "t" "t" "p"
## [20] ":" "/" "/" "t" "e" "x" "t" "a" "s" "d" "a" "t" "a" "." "c" "o" "m" "?" "p"
## [39] "a" "g" "e" "=" "1" "2" "3" "."

and sentences:

# sentence level         
tokens(c("Kurt Vongeut said; only assholes use semi-colons.", 
           "Today is Thursday in Canberra:  It is yesterday in London.", 
           "En el caso de que no puedas ir con ellos, ¿quieres ir con nosotros?"), 
          what = "sentence")
## tokens from 3 documents.
## text1 :
## [1] "Kurt Vongeut said; only assholes use semi-colons."
## 
## text2 :
## [1] "Today is Thursday in Canberra:  It is yesterday in London."
## 
## text3 :
## [1] "En el caso de que no puedas ir con ellos, ¿quieres ir con nosotros?"

Constructing a document-feature matrix

Tokenizing texts is an intermediate step, and most users will want to skip straight to constructing a document-feature matrix. For this, we have a Swiss-army knife function called dfm(), which performs tokenization and tabulates the extracted features into a matrix of documents by features. Unlike the conservative approach taken by tokens(), the dfm() function applies certain options by default, such as lower-casing the texts (via tolower(), a separate function) and removing punctuation. All of the options to tokens() can be passed to dfm(), however.

my_corpus <- corpus_subset(data_corpus_inaugural, Year > 1990)

# make a dfm
my_dfm <- dfm(my_corpus)
my_dfm[, 1:5]
## Document-feature matrix of: 7 documents, 5 features (0.0% sparse).
## 7 x 5 sparse Matrix of class "dfm"
##               features
## docs           my fellow citizens   , today
##   1993-Clinton  7      5        2 139    10
##   1997-Clinton  6      7        7 131     5
##   2001-Bush     3      1        9 110     2
##   2005-Bush     2      3        6 120     3
##   2009-Obama    2      1        1 130     6
##   2013-Obama    3      3        6  99     4
##   2017-Trump    1      1        4  96     4

Other options for a dfm() include removing stopwords and stemming the tokens.

# make a dfm, removing stopwords and applying stemming
myStemMat <- dfm(my_corpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
myStemMat[, 1:5]
## Document-feature matrix of: 7 documents, 5 features (17.1% sparse).
## 7 x 5 sparse Matrix of class "dfm"
##               features
## docs           fellow citizen today celebr mysteri
##   1993-Clinton      5       2    10      4       1
##   1997-Clinton      7       8     6      1       0
##   2001-Bush         1      10     2      0       0
##   2005-Bush         3       7     3      2       0
##   2009-Obama        1       1     6      2       0
##   2013-Obama        3       8     6      1       0
##   2017-Trump        1       4     5      3       1

The option remove provides a list of tokens to be ignored. Most users will supply a list of pre-defined “stop words”, defined for numerous languages, accessed through the stopwords() function:

head(stopwords("english"), 20)
##  [1] "i"          "me"         "my"         "myself"     "we"        
##  [6] "our"        "ours"       "ourselves"  "you"        "your"      
## [11] "yours"      "yourself"   "yourselves" "he"         "him"       
## [16] "his"        "himself"    "she"        "her"        "hers"
head(stopwords("russian"), 10)
##  [1] "и"   "в"   "во"  "не"  "что" "он"  "на"  "я"   "с"   "со"
head(stopwords("arabic"), 10)
## Warning: 'stopwords(language = "ar")' is deprecated.
## Use 'stopwords(language = "ar", source = "misc")' instead.
## See help("Deprecated")
##  [1] "فى"  "في"  "كل"  "لم"  "لن"  "له"  "من"  "هو"  "هي"  "قوة"

Viewing the document-feature matrix

The dfm can be inspected in the Environment pane in RStudio, or by calling R’s View() function. A word cloud of a dfm can be plotted with textplot_wordcloud(), as shown below.

my_dfm <- dfm(data_char_ukimmig2010, remove = stopwords("english"), remove_punct = TRUE)
my_dfm
## Document-feature matrix of: 9 documents, 1,547 features (83.8% sparse).

To access a list of the most frequently occurring features, we can use topfeatures():

topfeatures(my_dfm, 20)  # 20 top words
## immigration     british      people      asylum     britain          uk 
##          66          37          35          29          28          27 
##      system  population     country         new  immigrants      ensure 
##          27          21          20          19          17          17 
##       shall citizenship      social    national         bnp     illegal 
##          17          16          14          14          13          13 
##        work     percent 
##          13          12

Plotting a word cloud is done using textplot_wordcloud(), for a dfm class object. This function passes arguments through to wordcloud() from the wordcloud package, and can prettify the plot using the same arguments:

set.seed(100)
textplot_wordcloud(my_dfm, min_count = 6, random_order = FALSE,
                   rotation = .25, 
                   colors = RColorBrewer::brewer.pal(8,"Dark2"))