Convert a quanteda dfm object to a format usable by other text analysis packages. The general function convert provides easy conversion from a dfm to the document-term representations used by the other text analysis packages for which conversions are defined.

convert(x, to = c("lda", "tm", "stm", "austin", "topicmodels", "lsa",
  "matrix", "data.frame", "tripletlist"), docvars = NULL,
  omit_empty = TRUE)

Arguments

x

a dfm to be converted

to

target conversion format, usually the name of the package into whose document-term matrix representation the dfm will be converted:

"lda"

a list with components "documents" and "vocab" as needed by the function lda.collapsed.gibbs.sampler from the lda package

"tm"

a DocumentTermMatrix from the tm package

"stm"

the format for the stm package

"austin"

the wfm format from the austin package

"topicmodels"

the "dtm" format as used by the topicmodels package

"lsa"

the "textmatrix" format as used by the lsa package

"data.frame"

a data.frame where each feature is a variable

"tripletlist"

a named "triplet" format list consisting of document, feature, and frequency

docvars

optional data.frame of document variables used as the meta information in conversion to the stm package format. This helps select only the document variables that correspond to documents with non-zero counts. Only affects the "stm" format.

omit_empty

logical; if TRUE, omit empty documents and features from the converted dfm. This is required for some formats (such as STM) that do not accept empty documents. Only used when to = "lda" or to = "topicmodels". For to = "stm", omit_empty is always TRUE.
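
As a minimal sketch of the empty-document behaviour described for omit_empty, the following assumes a toy dfm built from three hypothetical texts, one of them empty, and a quanteda version in which dfm() accepts a tokens object. Converting to the "stm" format should then leave only the non-empty documents. The object names toks, dfmat and stm_out are illustrative only, and quanteda may warn about dropped documents, so the warning is suppressed here.

```r
library(quanteda)

# Hypothetical toy texts; d2 is deliberately empty
toks <- tokens(c(d1 = "a b b", d2 = "", d3 = "b c"))
dfmat <- dfm(toks)

# Tokens per document: d2 has none
ntoken(dfmat)

# Convert to the stm format; empty documents are omitted for this format
stm_out <- suppressWarnings(convert(dfmat, to = "stm"))

# Compare the number of documents before and after conversion
ndoc(dfmat)
length(stm_out$documents)
```

Comparing the two counts is a quick way to check whether any documents were dropped before passing the result on to a model.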

Value

A converted object determined by the value of to (see above). See conversion target package documentation for more detailed descriptions of the return formats.
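
To make two of the return formats more concrete, here is a small sketch that inspects the "tripletlist" and "data.frame" results for a hypothetical two-document toy dfm. The texts and the object names dfmat, trip and dfm_df are assumptions made for illustration, and the exact name of the document identifier column in the data.frame may differ between quanteda versions.

```r
library(quanteda)

# Hypothetical toy dfm with two short documents
dfmat <- dfm(tokens(c(d1 = "a b b", d2 = "b c")))

# Triplet list: parallel vectors of document, feature, and frequency
trip <- convert(dfmat, to = "tripletlist")
names(trip)

# The frequencies should add up to the total count in the dfm
sum(trip$frequency) == sum(rowSums(dfmat))

# data.frame: one row per document, one variable per feature
dfm_df <- convert(dfmat, to = "data.frame")
is.data.frame(dfm_df)
nrow(dfm_df) == ndoc(dfmat)
```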

Examples

corp <- corpus_subset(data_corpus_inaugural, Year > 1970)
dfmat1 <- dfm(corp)

# austin's wfm format
identical(dim(dfmat1), dim(convert(dfmat1, to = "austin")))
#> [1] TRUE

# stm package format
stmmat <- convert(dfmat1, to = "stm")
str(stmmat)
#> List of 3
#>  $ documents:List of 12
#>   ..$ 1973-Nixon  : int [1:2, 1:515] 1 34 2 96 3 1 4 5 6 3 ...
#>   ..$ 1977-Carter : int [1:2, 1:501] 1 18 2 65 3 7 4 4 7 52 ...
#>   ..$ 1981-Reagan : int [1:2, 1:850] 1 19 2 174 3 7 4 3 6 5 ...
#>   ..$ 1985-Reagan : int [1:2, 1:876] 1 24 2 177 3 13 4 7 6 3 ...
#>   ..$ 1989-Bush   : int [1:2, 1:756] 1 15 2 166 3 14 4 16 6 5 ...
#>   ..$ 1993-Clinton: int [1:2, 1:605] 2 139 3 6 4 5 7 81 9 4 ...
#>   ..$ 1997-Clinton: int [1:2, 1:726] 1 26 2 131 3 13 4 7 6 3 ...
#>   ..$ 2001-Bush   : int [1:2, 1:592] 1 4 2 110 3 4 4 7 6 1 ...
#>   ..$ 2005-Bush   : int [1:2, 1:735] 1 2 2 120 3 3 4 8 6 2 ...
#>   ..$ 2009-Obama  : int [1:2, 1:900] 1 44 2 130 3 22 4 4 5 1 ...
#>   ..$ 2013-Obama  : int [1:2, 1:786] 1 13 2 99 3 14 4 5 7 89 ...
#>   ..$ 2017-Trump  : int [1:2, 1:547] 1 11 2 96 3 9 4 8 7 88 ...
#>  $ vocab    : chr [1:3462] "-" "," ";" ":" ...
#>  $ meta     :'data.frame': 12 obs. of 3 variables:
#>   ..$ Year     : num [1:12] 1973 1977 1981 1985 1989 ...
#>   ..$ President: chr [1:12] "Nixon" "Carter" "Reagan" "Reagan" ...
#>   ..$ FirstName: chr [1:12] "Richard Milhous" "Jimmy" "Ronald" "Ronald" ...

# triplet
tripletmat <- convert(dfmat1, to = "tripletlist")
str(tripletmat)
#> List of 3
#>  $ document : chr [1:8389] "1973-Nixon" "1981-Reagan" "1989-Bush" "2005-Bush" ...
#>  $ feature  : chr [1:8389] "mr" "mr" "mr" "mr" ...
#>  $ frequency: num [1:8389] 3 3 6 1 1 69 52 130 124 142 ...

# illustrate what happens with zero-length documents
dfmat2 <- dfm(c(punctOnly = "!!!", corp[-1]))
rowSums(dfmat2)
#>    punctOnly  1977-Carter  1981-Reagan  1985-Reagan    1989-Bush 1993-Clinton
#>            3         1376         2790         2921         2681         1833
#> 1997-Clinton    2001-Bush    2005-Bush   2009-Obama   2013-Obama   2017-Trump
#>         2449         1808         2319         2711         2317         1660
str(convert(dfmat2, to = "stm", docvars = docvars(corp)))
#> List of 3
#>  $ documents:List of 12
#>   ..$ punctOnly   : int [1:2, 1] 5 3
#>   ..$ 1977-Carter : int [1:2, 1:501] 1 18 2 65 3 7 4 4 7 52 ...
#>   ..$ 1981-Reagan : int [1:2, 1:850] 1 19 2 174 3 7 4 3 6 5 ...
#>   ..$ 1985-Reagan : int [1:2, 1:876] 1 24 2 177 3 13 4 7 6 3 ...
#>   ..$ 1989-Bush   : int [1:2, 1:756] 1 15 2 166 3 14 4 16 6 5 ...
#>   ..$ 1993-Clinton: int [1:2, 1:605] 2 139 3 6 4 5 7 81 9 4 ...
#>   ..$ 1997-Clinton: int [1:2, 1:726] 1 26 2 131 3 13 4 7 6 3 ...
#>   ..$ 2001-Bush   : int [1:2, 1:592] 1 4 2 110 3 4 4 7 6 1 ...
#>   ..$ 2005-Bush   : int [1:2, 1:735] 1 2 2 120 3 3 4 8 6 2 ...
#>   ..$ 2009-Obama  : int [1:2, 1:900] 1 44 2 130 3 22 4 4 5 1 ...
#>   ..$ 2013-Obama  : int [1:2, 1:786] 1 13 2 99 3 14 4 5 7 89 ...
#>   ..$ 2017-Trump  : int [1:2, 1:547] 1 11 2 96 3 9 4 8 7 88 ...
#>  $ vocab    : chr [1:3376] "-" "," ";" ":" ...
#>  $ meta     :'data.frame': 12 obs. of 3 variables:
#>   ..$ Year     : num [1:12] 1973 1977 1981 1985 1989 ...
#>   ..$ President: chr [1:12] "Nixon" "Carter" "Reagan" "Reagan" ...
#>   ..$ FirstName: chr [1:12] "Richard Milhous" "Jimmy" "Ronald" "Ronald" ...

# NOT RUN {
# tm's DocumentTermMatrix format
tmdfm <- convert(dfmat1, to = "tm")
str(tmdfm)

# topicmodels package format
str(convert(dfmat1, to = "topicmodels"))

# lda package format
str(convert(dfmat1, to = "lda"))
# }