Construct a sparse document-feature matrix from a character, corpus, tokens, or even another dfm object.

dfm(x, tolower = TRUE, stem = FALSE, select = NULL, remove = NULL,
  thesaurus = NULL, dictionary = NULL, valuetype = c("glob", "regex",
  "fixed"), groups = NULL, verbose = quanteda_options("verbose"), ...)

Arguments

x

character, corpus, tokens, or dfm object

tolower

convert all tokens to lowercase

stem

if TRUE, stem words

select

a user-supplied pattern defining which features to keep, while excluding all others. This can be used in lieu of a dictionary if there are only specific features that a user wishes to keep. To extract only Twitter usernames, for example, set select = "@*" and make sure that remove_twitter = FALSE is passed as an additional argument to tokens. Note: select = "^@\\w+\\b" would be the regular expression version of this matching pattern. The pattern matching type is set by valuetype.

remove

a character vector of user-supplied features to ignore, such as "stop words". stopwords() provides one such list, but any list may be used. The pattern matching type is set by valuetype. For the behaviour of remove with ngrams > 1, see Details.

thesaurus

A list of character vector "thesaurus" entries, in a dictionary list format, which operates as a dictionary but without excluding values not matched from the dictionary. Thesaurus keys are converted to upper case to create a feature label in the dfm, as a reminder that this was not a type found in the text, but rather the label of a thesaurus key. For more fine-grained control over this and other aspects of converting features into dictionary/thesaurus keys from pattern matches to values, you can use dfm_lookup after creating the dfm.

dictionary

A list of character vector dictionary entries, including regular expressions (see examples)

valuetype

how to interpret keyword expressions: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

groups

character vector containing the names of document variables for aggregating documents; only applies when calling dfm on a corpus object. When x is a dfm object, groups provides a convenient and fast method of combining and refactoring the documents of the dfm according to the groups.

verbose

display messages if TRUE

...

additional arguments passed to tokens; these apply when x is a character or corpus object
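
The three valuetype settings described above can be pictured with a small base-R sketch. This is only an illustration of the matching idea, not how quanteda implements it: match_features and the features vector are hypothetical names introduced here, and the sketch ignores case handling.

```r
# Base-R sketch of the three valuetype interpretations (not quanteda internals).
# `features` and `match_features` are hypothetical names used for illustration.
features <- c("seamus", "jumps", "his", "fox", "s$")

match_features <- function(pattern, features,
                           valuetype = c("glob", "regex", "fixed")) {
  valuetype <- match.arg(valuetype)
  switch(valuetype,
         # glob: translate wildcards into an anchored regular expression
         glob  = features[grepl(utils::glob2rx(pattern), features)],
         # regex: use the pattern directly as a regular expression
         regex = features[grepl(pattern, features)],
         # fixed: keep only features identical to the pattern
         fixed = features[features == pattern])
}

glob_hits  <- match_features("*s", features, "glob")
regex_hits <- match_features("s$", features, "regex")
fixed_hits <- match_features("s$", features, "fixed")
```

Here the glob "*s" and the regular expression "s$" pick out the same features ending in "s", while the fixed pattern "s$" matches only the literal feature "s$".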

Value

a dfm-class object

Details

The default behavior for remove/select when constructing ngrams using dfm(x, ngrams > 1) is to remove/select any ngram constructed from a matching feature. If you wish to remove these features before constructing ngrams, you will need to first tokenize the texts, then remove the features to be ignored, then form the ngrams from the remaining tokens, and finally construct the dfm from this modified tokens object. See the code examples for an illustration.
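
The difference between the two orders of operation can be sketched in base R, without quanteda. This is a simplified illustration under stated assumptions: make_bigrams, tokens_vec, and the two-word stops list are hypothetical, and the "drop any bigram containing a stop word" step only approximates the default behaviour described above.

```r
# Hypothetical token vector and stop list, for illustration only.
tokens_vec <- c("the", "quick", "brown", "fox", "jumps",
                "over", "the", "lazy", "dog")
stops <- c("the", "over")

# Join each token to its successor to form bigrams.
make_bigrams <- function(x) {
  if (length(x) < 2) return(character(0))
  paste(head(x, -1), tail(x, -1), sep = "_")
}

# (a) form bigrams first, then drop any bigram containing a stop word
bigrams_all <- make_bigrams(tokens_vec)
has_stop <- vapply(strsplit(bigrams_all, "_", fixed = TRUE),
                   function(parts) any(parts %in% stops), logical(1))
removed_after <- bigrams_all[!has_stop]

# (b) drop stop words first, then form bigrams from what remains
removed_before <- make_bigrams(tokens_vec[!tokens_vec %in% stops])
```

Order (b) produces the extra bigram "jumps_lazy", because removing "over" and "the" first makes "jumps" and "lazy" adjacent; order (a) never creates it.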

See also

dfm_select, dfm-class

Examples

## for a corpus
corpus_post80inaug <- corpus_subset(data_corpus_inaugural, Year > 1980)
dfm(corpus_post80inaug)
#> Document-feature matrix of: 10 documents, 3,239 features (77.2% sparse).
dfm(corpus_post80inaug, tolower = FALSE)
#> Document-feature matrix of: 10 documents, 3,458 features (77.5% sparse).
# grouping documents by docvars in a corpus
dfm(corpus_post80inaug, groups = "President", verbose = TRUE)
#> Creating a dfm from a corpus ...
#> ... grouping texts by variable: President
#> ... tokenizing grouped texts
#> ... lowercasing
#> ... found 5 documents, 3,239 features
#> ... created a 5 x 3,239 sparse dfm
#> ... complete.
#> Elapsed time: 0 seconds.
#> Document-feature matrix of: 5 documents, 3,239 features (64% sparse).
# with English stopwords and stemming
dfm(corpus_post80inaug, remove = stopwords("english"), stem = TRUE, verbose = TRUE)
#> Creating a dfm from a corpus ...
#> ... tokenizing texts
#> ... lowercasing
#> ... found 10 documents, 3,239 features
#> ...
#> dfm_select removed 130 features and 0 documents, padding 0s for 0 features and 0 documents.
#> ... stemming features (English)
#> , trimmed 832 feature variants
#> ... created a 10 x 2,277 sparse dfm
#> ... complete.
#> Elapsed time: 0.02 seconds.
#> Document-feature matrix of: 10 documents, 2,277 features (74.8% sparse).
# works for both words in ngrams too
dfm("Banking industry", stem = TRUE, ngrams = 2, verbose = FALSE)
#> Document-feature matrix of: 1 document, 1 feature (0% sparse).
#> 1 x 1 sparse Matrix of class "dfmSparse"
#>        features
#> docs    bank_industri
#>   text1             1
# with dictionaries
corpus_post1900inaug <- corpus_subset(data_corpus_inaugural, Year > 1900)
mydict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxing = "taxing",
                          taxation = "taxation",
                          taxregex = "tax*",
                          country = "states"))
dfm(corpus_post1900inaug, dictionary = mydict)
#> Document-feature matrix of: 30 documents, 6 features (73.3% sparse).
# selecting stopwords (select keeps only the matching features)
testText <- "The quick brown fox named Seamus jumps over the lazy dog also named Seamus, with the newspaper from a boy named Seamus, in his mouth."
testCorpus <- corpus(testText)
# note: "also" is not in the default stopwords("english")
featnames(dfm(testCorpus, select = stopwords("english")))
#> [1] "the" "over" "with" "from" "a" "in" "his"
# for ngrams
featnames(dfm(testCorpus, ngrams = 2, select = stopwords("english"), remove_punct = TRUE))
#> character(0)
featnames(dfm(testCorpus, ngrams = 1:2, select = stopwords("english"), remove_punct = TRUE))
#> character(0)
# removing stopwords before constructing ngrams
tokensAll <- tokens(char_tolower(testText), remove_punct = TRUE)
tokensNoStopwords <- removeFeatures(tokensAll, stopwords("english"))
tokensNgramsNoStopwords <- tokens_ngrams(tokensNoStopwords, 2)
featnames(dfm(tokensNgramsNoStopwords, verbose = FALSE))
#>  [1] "quick_brown"      "brown_fox"        "fox_named"        "named_seamus"
#>  [5] "seamus_jumps"     "jumps_lazy"       "lazy_dog"         "dog_also"
#>  [9] "also_named"       "seamus_newspaper" "newspaper_boy"    "boy_named"
#> [13] "seamus_mouth"
# keep only certain words
dfm(testCorpus, select = "*s", verbose = FALSE)  # keep only words ending in "s"
#> Document-feature matrix of: 1 document, 3 features (0% sparse).
#> 1 x 3 sparse Matrix of class "dfmSparse"
#>        features
#> docs    seamus jumps his
#>   text1      3     1   1
dfm(testCorpus, select = "s$", valuetype = "regex", verbose = FALSE)
#> Document-feature matrix of: 1 document, 3 features (0% sparse).
#> 1 x 3 sparse Matrix of class "dfmSparse"
#>        features
#> docs    seamus jumps his
#>   text1      3     1   1
# testing Twitter functions
testTweets <- c("My homie @justinbieber #justinbieber shopping in #LA yesterday #beliebers",
                "2all the ha8ers including my bro #justinbieber #emabiggestfansjustinbieber",
                "Justin Bieber #justinbieber #belieber #fetusjustin #EMABiggestFansJustinBieber")
dfm(testTweets, select = "#*", remove_twitter = FALSE)  # keep only hashtags
#> Document-feature matrix of: 3 documents, 6 features (50% sparse).
#> 3 x 6 sparse Matrix of class "dfmSparse"
#>        features
#> docs    #justinbieber #la #beliebers #emabiggestfansjustinbieber #belieber
#>   text1             1   1          1                           0         0
#>   text2             1   0          0                           1         0
#>   text3             1   0          0                           1         1
#>        features
#> docs    #fetusjustin
#>   text1            0
#>   text2            0
#>   text3            1
dfm(testTweets, select = "^#.*$", valuetype = "regex", remove_twitter = FALSE)
#> Document-feature matrix of: 3 documents, 6 features (50% sparse).
#> 3 x 6 sparse Matrix of class "dfmSparse"
#>        features
#> docs    #justinbieber #la #beliebers #emabiggestfansjustinbieber #belieber
#>   text1             1   1          1                           0         0
#>   text2             1   0          0                           1         0
#>   text3             1   0          0                           1         1
#>        features
#> docs    #fetusjustin
#>   text1            0
#>   text2            0
#>   text3            1
# for a dfm
dfm1 <- dfm(data_corpus_irishbudget2010)
dfm2 <- dfm(dfm1,
            groups = ifelse(docvars(data_corpus_irishbudget2010, "party") %in% c("FF", "Green"),
                            "Govt", "Opposition"),
            tolower = FALSE, verbose = TRUE)
#> ... grouping texts
#> ... created a 2 x 5,058 sparse dfm
#> ... complete.
#> Elapsed time: 0.006 seconds.
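
Conceptually, grouping an existing dfm sums the feature counts of all documents that share a group label. The following base-R sketch shows that idea on a small dense matrix using rowsum; the matrix m, its feature names, and the group labels are made up for illustration, and this is not a description of how dfm performs the grouping internally.

```r
# Hypothetical 3-document count matrix, for illustration only.
m <- matrix(c(2, 0, 1,
              1, 3, 0,
              0, 1, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(docs = c("text1", "text2", "text3"),
                            features = c("tax", "budget", "jobs")))

# One group label per document, in document order.
groups <- c("Govt", "Opposition", "Govt")

# rowsum() adds up the rows that share a label: one output row per group.
grouped <- rowsum(m, group = groups)
```

The result has one row per distinct label, so three documents collapse into the two groups "Govt" and "Opposition", just as the 14 Irish budget speeches collapse into two rows in the example above.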